Date:   Sat, 1 Jun 2019 08:45:49 +1000
From:   Dave Chinner <david@...morbit.com>
To:     Theodore Ts'o <tytso@....edu>
Cc:     Amir Goldstein <amir73il@...il.com>, Jan Kara <jack@...e.cz>,
        "Darrick J . Wong" <darrick.wong@...cle.com>,
        Chris Mason <clm@...com>, Al Viro <viro@...iv.linux.org.uk>,
        linux-fsdevel <linux-fsdevel@...r.kernel.org>,
        linux-xfs <linux-xfs@...r.kernel.org>,
        Ext4 <linux-ext4@...r.kernel.org>,
        Linux Btrfs <linux-btrfs@...r.kernel.org>,
        Linux API <linux-api@...r.kernel.org>
Subject: Re: [RFC][PATCH] link.2: AT_ATOMIC_DATA and AT_ATOMIC_METADATA

On Fri, May 31, 2019 at 12:41:36PM -0400, Theodore Ts'o wrote:
> On Fri, May 31, 2019 at 06:21:45PM +0300, Amir Goldstein wrote:
> > What do you think of:
> > 
> > "AT_ATOMIC_DATA (since Linux 5.x)
> > A filesystem which accepts this flag will guarantee that if the linked file
> > name exists after a system crash, then all of the data written to the file
> > and all of the file's metadata at the time of the linkat(2) call will be
> > visible.
> 
> ".... will be visible after the the file system is remounted".  (Never
> hurts to be explicit.)
> 
> > The way to achieve this guarantee on old kernels is to call fsync(2)
> > before linking the file, but doing so will also result in flushing of
> > volatile disk caches.
> >
> > A filesystem which accepts this flag does NOT
> > guarantee that any of the file's hard links will exist after a system
> > crash, nor that the last observed value of st_nlink (see stat(2)) will
> > persist."
> > 
> 
> This is I think more precise:
> 
>     This guarantee can be achieved by calling fsync(2) before linking
>     the file, but there may be more performant ways to provide these
>     semantics.  In particular, note that the use of the AT_ATOMIC_DATA
>     flag does *not* guarantee that the new link created by linkat(2)
>     will be persisted after a crash.
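
For reference, the fsync-before-link pattern being described - make
the data stable first, then publish the name - looks like this on
existing kernels (a sketch; fd, tmppath, dirfd and name are
illustrative):

	fsync(fd);	/* flushes the volatile disk cache, too */
	linkat(AT_FDCWD, tmppath, dirfd, name, 0);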

So here's the *implementation* problem I see with this definition of
AT_ATOMIC_DATA. After linkat(dirfd, name, AT_ATOMIC_DATA), there is
no guarantee that the data is on disk or that the link is present.

However:

	/* link the file, then sync the directory */
	linkat(olddirfd, oldname, dirfd, name, AT_ATOMIC_DATA);
	fsync(dirfd);

Suddenly changes all that.

i.e. when we fsync(dirfd) we guarantee that "name" is present in the
directory and because we used AT_ATOMIC_DATA it implies that the
data pointed to by "name" must be present on disk. IOWs, what was
once a pure directory sync operation now *must* fsync all the child
inodes that have been linked with AT_ATOMIC_DATA since the last time
the directory has been made stable.

IOWs, the described AT_ATOMIC_DATA "we don't have to write the data
during linkat()" go-fast-get-out-of-gaol-free behaviour isn't worth
the pixels it is written on - it just moves all the complexity to
directory fsync, and that's /already/ a behavioural minefield.

IMO, the "work-around" of forcing filesystems to write back
destination inodes during a link() operation is just nasty and will
end up with really weird performance anomalies occurring in
production systems. That's not really a solution, either, especially
as it is far, far faster for applications to use AIO_FSYNC and then
on the completion callback run a normal linkat() operation...

Hence, if I've understood these correctly, then I'll be recommending
that XFS follow this:

> We should also document that a file system which does not implement
> this flag MUST return EINVAL if it is passed this flag to linkat(2).

and return -EINVAL for these flags because we do not have the change
tracking infrastructure to handle these directory fsync semantics.
I also suspect that, even if we could track this efficiently, we
can't do the flushing atomically because of locking order
constraints between directories, regular files, pages in the page
cache, etc.
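
Note that a hard -EINVAL also gives applications a clean fallback
path - probe the flag, and if the filesystem doesn't implement it,
do the ordering by hand (a sketch; AT_ATOMIC_DATA is the proposed
flag, not an existing kernel API):

	if (linkat(AT_FDCWD, tmppath, dirfd, name, AT_ATOMIC_DATA) < 0
	    && errno == EINVAL) {
		/* no flag support: fall back to fsync-before-link */
		fsync(fd);
		linkat(AT_FDCWD, tmppath, dirfd, name, 0);
	}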

Given that we can already use AIO to provide this sort of ordering,
and AIO is vastly faster than synchronous IO, I don't see any point
in adding complex barrier interfaces that can be /easily implemented
in userspace/ using existing AIO primitives. You should start
thinking about expanding libaio with stuff like
"link_after_fdatasync()" and suddenly the whole problem of
filesystem data vs metadata ordering goes away because the
application directly controls all ordering without blocking and
doesn't need to care what the filesystem under it does....
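
As a rough sketch of what such a helper could look like - this is an
illustration, not an existing libaio API; it assumes a kernel with
AIO fsync support and, for brevity, reaps the completion inline
instead of from an event loop (build with -laio):

	#include <fcntl.h>
	#include <unistd.h>
	#include <libaio.h>

	/* Hypothetical helper: async fdatasync the file, then use a
	 * plain linkat() to publish the name only once the data is
	 * stable. A real application would issue the linkat() from
	 * its completion handler instead of blocking here. */
	int link_after_fdatasync(const char *tmppath, int dirfd,
				 const char *name)
	{
		io_context_t ctx = 0;
		struct iocb cb, *cbs[1] = { &cb };
		struct io_event ev;
		int fd, ret = -1;

		fd = open(tmppath, O_WRONLY);
		if (fd < 0)
			return -1;
		if (io_setup(1, &ctx) < 0)
			goto out_close;

		/* queue the async fdatasync; the caller is free to
		 * do other work instead of blocking in fsync(2) */
		io_prep_fdsync(&cb, fd);
		if (io_submit(ctx, 1, cbs) != 1)
			goto out_destroy;

		/* data is durable once the fdsync completes; only
		 * then does the plain linkat() expose the name */
		if (io_getevents(ctx, 1, 1, &ev, NULL) == 1 &&
		    (long)ev.res == 0)
			ret = linkat(AT_FDCWD, tmppath, dirfd,
				     name, 0);
	out_destroy:
		io_destroy(ctx);
	out_close:
		close(fd);
		return ret;
	}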

Cheers,

Dave.
-- 
Dave Chinner
david@...morbit.com
