LinuxLists.cc - Question about clone_range() metadata stability

2019-11-27 18:42:26

Subject: Question about clone_range() metadata stability

Hi all

A quick question about clone_range() and guarantees around metadata
stability.

Are users required to call fsync/fsync_range() after calling
clone_range() in order to guarantee that the cloned range metadata is
persisted? I'm assuming that it is required in order to guarantee that
data is persisted.

I'm asking because knfsd currently just does a call to
vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
not call fsync()/fsync_range() on the destination file, and since the
NFSv4.2 protocol does not require you to perform any other operation in
order to persist data/metadata, I'm worried that we may be corrupting
the cloned file if the NFS server crashes at the wrong moment after the
client has been told the clone completed.

Cheers
Trond
--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2019-11-27 20:22:48

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> Hi all
>
> A quick question about clone_range() and guarantees around metadata
> stability.
>
> Are users required to call fsync/fsync_range() after calling
> clone_range() in order to guarantee that the cloned range metadata is
> persisted?

Yes.

> I'm assuming that it is required in order to guarantee that
> data is persisted.

Data and metadata. XFS and ocfs2's reflink implementations will flush
the page cache before starting the remap, but they both require fsync to
force the log/journal to disk.

(AFAICT the same reasoning applies to btrfs, but don't trust my word for
it.)

> I'm asking because knfsd currently just does a call to
> vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
> not call fsync()/fsync_range() on the destination file, and since the
> NFSv4.2 protocol does not require you to perform any other operation in
> order to persist data/metadata, I'm worried that we may be corrupting
> the cloned file if the NFS server crashes at the wrong moment after the
> client has been told the clone completed.

That analysis seems correct.

--D

> Cheers
> Trond
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2019-11-29 12:47:38

by Filipe Manana

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Wed, Nov 27, 2019 at 8:24 PM Darrick J. Wong <[email protected]> wrote:
>
> On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > Hi all
> >
> > A quick question about clone_range() and guarantees around metadata
> > stability.
> >
> > Are users required to call fsync/fsync_range() after calling
> > clone_range() in order to guarantee that the cloned range metadata is
> > persisted?
>
> Yes.
>
> > I'm assuming that it is required in order to guarantee that
> > data is persisted.
>
> Data and metadata. XFS and ocfs2's reflink implementations will flush
> the page cache before starting the remap, but they both require fsync to
> force the log/journal to disk.
>
> (AFAICT the same reasoning applies to btrfs, but don't trust my word for
> it.)

Yep, exactly the same for btrfs.

>
> > I'm asking because knfsd currently just does a call to
> > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
> > not call fsync()/fsync_range() on the destination file, and since the
> > NFSv4.2 protocol does not require you to perform any other operation in
> > order to persist data/metadata, I'm worried that we may be corrupting
> > the cloned file if the NFS server crashes at the wrong moment after the
> > client has been told the clone completed.
>
> That analysis seems correct.
>
> --D
>
> > Cheers
> > Trond
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >

--
Filipe David Manana,

“Whether you think you can, or you think you can't — you're right.”

2019-12-01 21:06:04

by Dave Chinner

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > Hi all
> >
> > A quick question about clone_range() and guarantees around metadata
> > stability.
> >
> > Are users required to call fsync/fsync_range() after calling
> > clone_range() in order to guarantee that the cloned range metadata is
> > persisted?
>
> Yes.
>
> > I'm assuming that it is required in order to guarantee that
> > data is persisted.
>
> Data and metadata. XFS and ocfs2's reflink implementations will flush
> the page cache before starting the remap, but they both require fsync to
> force the log/journal to disk.

So we need to call xfs_fs_nfs_commit_metadata() to get that done
post vfs_clone_file_range() completion on the server side, yes?

>
> (AFAICT the same reasoning applies to btrfs, but don't trust my word for
> it.)
>
> > I'm asking because knfsd currently just does a call to
> > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
> > not call fsync()/fsync_range() on the destination file, and since the
> > NFSv4.2 protocol does not require you to perform any other operation in
> > order to persist data/metadata, I'm worried that we may be corrupting
> > the cloned file if the NFS server crashes at the wrong moment after the
> > client has been told the clone completed.

Yup, that's exactly what server side calls to commit_metadata() are
supposed to address.

I suspect to be correct, this might require commit_metadata() to be
called on both the source and destination inodes, as both of them
may have modified metadata as a result of the clone operation. For
XFS one of them will be a no-op, but for other filesystems that
don't implement ->commit_metadata, we'll need to call
sync_inode_metadata() on both inodes...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2019-12-02 17:10:10

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Mon, Dec 02, 2019 at 08:05:19AM +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > Hi all
> > >
> > > A quick question about clone_range() and guarantees around metadata
> > > stability.
> > >
> > > Are users required to call fsync/fsync_range() after calling
> > > clone_range() in order to guarantee that the cloned range metadata is
> > > persisted?
> >
> > Yes.
> >
> > > I'm assuming that it is required in order to guarantee that
> > > data is persisted.
> >
> > Data and metadata. XFS and ocfs2's reflink implementations will flush
> > the page cache before starting the remap, but they both require fsync to
> > force the log/journal to disk.
>
> So we need to call xfs_fs_nfs_commit_metadata() to get that done
> post vfs_clone_file_range() completion on the server side, yes?

That sounds like a much better/less hastily researched answer! :)

>
> >
> > (AFAICT the same reasoning applies to btrfs, but don't trust my word for
> > it.)
> >
> > > I'm asking because knfsd currently just does a call to
> > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It does
> > > not call fsync()/fsync_range() on the destination file, and since the
> > > NFSv4.2 protocol does not require you to perform any other operation in
> > > order to persist data/metadata, I'm worried that we may be corrupting
> > > the cloned file if the NFS server crashes at the wrong moment after the
> > > client has been told the clone completed.
>
> Yup, that's exactly what server side calls to commit_metadata() are
> supposed to address.
>
> I suspect to be correct, this might require commit_metadata() to be
> called on both the source and destination inodes, as both of them
> may have modified metadata as a result of the clone operation. For
> XFS one of them will be a no-op,

Hmm. If xfs had to set its reflink flag on the source inode then we
want to ->commit_metadata the source inode to push the log forward far
enough to record the metadata change. That said, we set the reflink
flag on both inodes before we remap anything, so chances are that
->commit_metadata on the dest inode will be enough to push the log
forward.

I suspect that from NFS' point of view it probably ought to
->commit_metadata both inodes to insulate itself from fs-specific
behaviors and avoid weird crash dataloss bugs. Someday, someone will
design a filesystem with per-inode logs /and/ hook it up to NFS.

> but for other filesystems that
> don't implement ->commit_metadata, we'll need to call
> sync_inode_metadata() on both inodes...

<nod>

--D

> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]

2019-12-03 07:38:30

by Trond Myklebust

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > Hi all
> > >
> > > A quick question about clone_range() and guarantees around
> > > metadata
> > > stability.
> > >
> > > Are users required to call fsync/fsync_range() after calling
> > > clone_range() in order to guarantee that the cloned range
> > > metadata is
> > > persisted?
> >
> > Yes.
> >
> > > I'm assuming that it is required in order to guarantee that
> > > data is persisted.
> >
> > Data and metadata. XFS and ocfs2's reflink implementations will
> > flush
> > the page cache before starting the remap, but they both require
> > fsync to
> > force the log/journal to disk.
>
> So we need to call xfs_fs_nfs_commit_metadata() to get that done
> post vfs_clone_file_range() completion on the server side, yes?
>

I chose to implement this using a full call to vfs_fsync_range(), since
we really do want to ensure data stability as well. Consider, for
instance, the case where client A is running an application, and client
B runs vfs_clone_file_range() in order to create a point in time
snapshot of the file for disaster recovery purposes...

> > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > word for
> > it.)
> >
> > > I'm asking because knfsd currently just does a call to
> > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It
> > > does
> > > not call fsync()/fsync_range() on the destination file, and since
> > > the
> > > NFSv4.2 protocol does not require you to perform any other
> > > operation in
> > > order to persist data/metadata, I'm worried that we may be
> > > corrupting
> > > the cloned file if the NFS server crashes at the wrong moment
> > > after the
> > > client has been told the clone completed.
>
> Yup, that's exactly what server side calls to commit_metadata() are
> supposed to address.
>
> I suspect to be correct, this might require commit_metadata() to be
> called on both the source and destination inodes, as both of them
> may have modified metadata as a result of the clone operation. For
> XFS one of them will be a no-op, but for other filesystems that
> don't implement ->commit_metadata, we'll need to call
> sync_inode_metadata() on both inodes...
>

That's interesting. I hadn't considered that a clone might cause the
source metadata to change as well. What kind of change specifically are
we talking about? Is it just delayed block allocation, or is there
more?

Thanks
Trond

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2019-12-03 16:36:01

by Darrick J. Wong

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust wrote:
> > > > Hi all
> > > >
> > > > A quick question about clone_range() and guarantees around
> > > > metadata
> > > > stability.
> > > >
> > > > Are users required to call fsync/fsync_range() after calling
> > > > clone_range() in order to guarantee that the cloned range
> > > > metadata is
> > > > persisted?
> > >
> > > Yes.
> > >
> > > > I'm assuming that it is required in order to guarantee that
> > > > data is persisted.
> > >
> > > Data and metadata. XFS and ocfs2's reflink implementations will
> > > flush
> > > the page cache before starting the remap, but they both require
> > > fsync to
> > > force the log/journal to disk.
> >
> > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > post vfs_clone_file_range() completion on the server side, yes?
> >
>
> I chose to implement this using a full call to vfs_fsync_range(), since
> we really do want to ensure data stability as well. Consider, for
> instance, the case where client A is running an application, and client
> B runs vfs_clone_file_range() in order to create a point in time
> snapshot of the file for disaster recovery purposes...

Seems reasonable, since (alas) we didn't define the ->remap_range api to
guarantee that for you.

> > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > word for
> > > it.)
> > >
> > > > I'm asking because knfsd currently just does a call to
> > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE operation. It
> > > > does
> > > > not call fsync()/fsync_range() on the destination file, and since
> > > > the
> > > > NFSv4.2 protocol does not require you to perform any other
> > > > operation in
> > > > order to persist data/metadata, I'm worried that we may be
> > > > corrupting
> > > > the cloned file if the NFS server crashes at the wrong moment
> > > > after the
> > > > client has been told the clone completed.
> >
> > Yup, that's exactly what server side calls to commit_metadata() are
> > supposed to address.
> >
> > I suspect to be correct, this might require commit_metadata() to be
> > called on both the source and destination inodes, as both of them
> > may have modified metadata as a result of the clone operation. For
> > XFS one of them will be a no-op, but for other filesystems that
> > don't implement ->commit_metadata, we'll need to call
> > sync_inode_metadata() on both inodes...
> >
>
> That's interesting. I hadn't considered that a clone might cause the
> source metadata to change as well. What kind of change specifically are
> we talking about? Is it just delayed block allocation, or is there
> more?

In XFS' case, we added a per-inode flag to help us bypass the reference
count lookup during a write if the file has never shared any blocks, so
if you never share anything, you'll never pay any of the runtime costs
of the COW mechanism.

ocfs2's design has a reference count tree that is shared between groups
of files that have been reflinked from each other. So if you start with
unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
then A* will have their own refcount tree and B* will also have their
own refcount tree.

Either way, nfs has to assume that changes could have been made to the
source file.

--D

> Thanks
> Trond
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2019-12-03 23:01:41

by Trond Myklebust

[permalink] [raw]

Subject: Re: Question about clone_range() metadata stability

On Tue, 2019-12-03 at 08:35 -0800, Darrick J. Wong wrote:
> On Tue, Dec 03, 2019 at 07:36:29AM +0000, Trond Myklebust wrote:
> > On Mon, 2019-12-02 at 08:05 +1100, Dave Chinner wrote:
> > > On Wed, Nov 27, 2019 at 12:21:36PM -0800, Darrick J. Wong wrote:
> > > > On Wed, Nov 27, 2019 at 06:38:46PM +0000, Trond Myklebust
> > > > wrote:
> > > > > Hi all
> > > > >
> > > > > A quick question about clone_range() and guarantees around
> > > > > metadata
> > > > > stability.
> > > > >
> > > > > Are users required to call fsync/fsync_range() after calling
> > > > > clone_range() in order to guarantee that the cloned range
> > > > > metadata is
> > > > > persisted?
> > > >
> > > > Yes.
> > > >
> > > > > I'm assuming that it is required in order to guarantee that
> > > > > data is persisted.
> > > >
> > > > Data and metadata. XFS and ocfs2's reflink implementations
> > > > will
> > > > flush
> > > > the page cache before starting the remap, but they both require
> > > > fsync to
> > > > force the log/journal to disk.
> > >
> > > So we need to call xfs_fs_nfs_commit_metadata() to get that done
> > > post vfs_clone_file_range() completion on the server side, yes?
> > >
> >
> > I chose to implement this using a full call to vfs_fsync_range(),
> > since
> > we really do want to ensure data stability as well. Consider, for
> > instance, the case where client A is running an application, and
> > client
> > B runs vfs_clone_file_range() in order to create a point in time
> > snapshot of the file for disaster recovery purposes...
>
> Seems reasonable, since (alas) we didn't define the ->remap_range api
> to
> guarantee that for you.
>
> > > > (AFAICT the same reasoning applies to btrfs, but don't trust my
> > > > word for
> > > > it.)
> > > >
> > > > > I'm asking because knfsd currently just does a call to
> > > > > vfs_clone_file_range() when parsing a NFSv4.2 CLONE
> > > > > operation. It
> > > > > does
> > > > > not call fsync()/fsync_range() on the destination file, and
> > > > > since
> > > > > the
> > > > > NFSv4.2 protocol does not require you to perform any other
> > > > > operation in
> > > > > order to persist data/metadata, I'm worried that we may be
> > > > > corrupting
> > > > > the cloned file if the NFS server crashes at the wrong moment
> > > > > after the
> > > > > client has been told the clone completed.
> > >
> > > Yup, that's exactly what server side calls to commit_metadata()
> > > are
> > > supposed to address.
> > >
> > > I suspect to be correct, this might require commit_metadata() to
> > > be
> > > called on both the source and destination inodes, as both of them
> > > may have modified metadata as a result of the clone operation.
> > > For
> > > XFS one of them will be a no-op, but for other filesystems that
> > > don't implement ->commit_metadata, we'll need to call
> > > sync_inode_metadata() on both inodes...
> > >
> >
> > That's interesting. I hadn't considered that a clone might cause
> > the
> > source metadata to change as well. What kind of change specifically
> > are
> > we talking about? Is it just delayed block allocation, or is there
> > more?
>
> In XFS' case, we added a per-inode flag to help us bypass the
> reference
> count lookup during a write if the file has never shared any blocks,
> so
> if you never share anything, you'll never pay any of the runtime
> costs
> of the COW mechanism.
>
> ocfs2's design has a reference count tree that is shared between
> groups
> of files that have been reflinked from each other. So if you start
> with
> unshared files A and B and clone A to A1 and A2; and B to B1 and B2,
> then A* will have their own refcount tree and B* will also have their
> own refcount tree.
>
> Either way, nfs has to assume that changes could have been made to
> the
> source file.

Interesting. Thanks for the explanation! I'll try to send off an
amended patch to Bruce (hopefully before he merges).

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]

2019-12-06 01:31:47

by Dave Chinner

[permalink] [raw]