LinuxLists.cc - Locking issue with directory renames

2023-01-17 12:40:12

Subject: Locking issue with directory renames

Hello!

I've come across an interesting issue that was spotted by syzbot [1]. The
report is against UDF but AFAICS the problem exists for ext4 as well and
possibly other filesystems. The problem is the following: When we are
renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
'bar' but 'dir' is unlocked because the locking done by vfs_rename() is

	if (!is_dir || (flags & RENAME_EXCHANGE))
		lock_two_nondirectories(source, target);
	else if (target)
		inode_lock(target);

However, some filesystems (e.g. UDF, but ext4 as well; I suspect XFS may be
hurt by this too because it converts among multiple dir formats) need
to update the parent pointer in 'dir', and nothing protects this update against
a race with someone else modifying 'dir'. Now this is mostly harmless
because the parent pointer (the ".." directory entry) is at the beginning of
the directory and stable. However, if for example the directory is converted
from packed "in-inode" format to "expanded" format as a result of a
concurrent operation on 'dir', the filesystem gets corrupted (or crashes, as
in the case of UDF).
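
As an illustration only, the branch quoted above can be modelled in
userspace to see which case leaves a moved directory without any child
lock. This is a sketch, not kernel code: the enum, the function name and
TOY_RENAME_EXCHANGE are made up for the example (the flag only mirrors the
UAPI value of RENAME_EXCHANGE), and what the helpers do internally is not
modelled.

```c
#include <stdbool.h>

/* Mirrors the UAPI value of RENAME_EXCHANGE; the name is hypothetical. */
#define TOY_RENAME_EXCHANGE (1u << 1)

/* Which arm of the quoted if/else chain is taken (hypothetical names). */
enum toy_branch {
	TOY_LOCK_TWO_NONDIRECTORIES,	/* lock_two_nondirectories(source, target) */
	TOY_LOCK_TARGET_ONLY,		/* inode_lock(target) */
	TOY_NO_CHILD_LOCK,		/* neither arm runs */
};

/*
 * Userspace model of the branch in vfs_rename() as quoted in this mail.
 * The parents ('foo' and 'bar') are locked elsewhere and are not modelled.
 */
static enum toy_branch toy_vfs_rename_branch(bool is_dir, unsigned int flags,
					     bool has_target)
{
	if (!is_dir || (flags & TOY_RENAME_EXCHANGE))
		return TOY_LOCK_TWO_NONDIRECTORIES;
	else if (has_target)
		return TOY_LOCK_TARGET_ONLY;
	return TOY_NO_CHILD_LOCK;
}
```

For rename("foo/dir", "bar/") with no existing target this returns
TOY_NO_CHILD_LOCK, and with an existing target only the target is locked,
so in both cases 'dir' itself stays unlocked.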

So we'd need to lock 'source' if it is a directory. Ideally this would
happen in VFS, as otherwise I bet a lot of filesystems will get this wrong,
so could vfs_rename() lock 'source' if it is a dir as well? Essentially
this would amount to calling lock_two_nondirectories(source, target)
unconditionally, but that would become a serious misnomer ;). Al, any
thoughts?

Honza

[1] https://lore.kernel.org/all/[email protected]

--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-17 17:09:24

by Al Viro

Subject: Re: Locking issue with directory renames

2023-01-17 22:56:02

by Dave Chinner

Subject: Re: Locking issue with directory renames

On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> Hello!
>
> I've come across an interesting issue that was spotted by syzbot [1]. The
> report is against UDF but AFAICS the problem exists for ext4 as well and
> possibly other filesystems. The problem is the following: When we are
> renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
>
> if (!is_dir || (flags & RENAME_EXCHANGE))
> lock_two_nondirectories(source, target);
> else if (target)
> inode_lock(target);
>
> However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> hurt by this as well because it converts among multiple dir formats) need
> to update parent pointer in 'dir' and nothing protects this update against
> a race with someone else modifying 'dir'. Now this is mostly harmless
> because the parent pointer (".." directory entry) is at the beginning of
> the directory and stable however if for example the directory is converted
> from packed "in-inode" format to "expanded" format as a result of
> concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> in case of UDF).

No, xfs_rename() does not have this problem - we pass four inodes to
the function - the source directory, source inode, destination
directory and destination inode.

In the above case, "dir/" is passed to XFS as the source inode - the
src_dir is "foo/", the target dir is "bar/" and the target inode is
null. src_dir != target_dir, so we set the "new_parent" flag. The
source inode is a directory, so we set the src_is_directory flag,
too.

We lock all three inodes that are passed. We do various things, then
run:

	if (new_parent && src_is_directory) {
		/*
		 * Rewrite the ".." entry to point to the new
		 * directory.
		 */
		error = xfs_dir_replace(tp, src_ip, &xfs_name_dotdot,
					target_dp->i_ino, spaceres);
		ASSERT(error != -EEXIST);
		if (error)
			goto out_trans_cancel;
	}

which replaces the ".." entry in source inode atomically whilst it
is locked. Any directory format changes that occur during the
rename are done while the ILOCK is held, so they appear atomic to
outside observers that are trying to parse the directory structure
(e.g. readdir).

> So we'd need to lock 'source' if it is a directory.

Yup, and XFS goes further by always locking the source inode in a
rename, even if it is not a directory. This ensures the inode being
moved cannot have its metadata otherwise modified whilst the rename
is in progress, even if that modification would have no impact on
the rename. It's a pretty strict interpretation of "rename is an
atomic operation", but it avoids accidentally missing nasty corner
cases like the one described above...

> Ideally this would
> happen in VFS as otherwise I bet a lot of filesystems will get this wrong
> so could vfs_rename() lock 'source' if it is a dir as well? Essentially
> this would amount to calling lock_two_nondirectories(source, target)
> unconditionally but that would become a serious misnomer ;). Al, any
> thought?

XFS just has a function that allows for an arbitrary number of
inodes to be locked in the given order: xfs_lock_inodes(). For
rename, the lock order is determined by xfs_sort_for_rename().

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-01-18 09:43:26

by Jan Kara

Subject: Re: Locking issue with directory renames

Hello,

On Wed 18-01-23 08:44:57, Dave Chinner wrote:
> On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > I've come across an interesting issue that was spotted by syzbot [1]. The
> > report is against UDF but AFAICS the problem exists for ext4 as well and
> > possibly other filesystems. The problem is the following: When we are
> > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> >
> > if (!is_dir || (flags & RENAME_EXCHANGE))
> > lock_two_nondirectories(source, target);
> > else if (target)
> > inode_lock(target);
> >
> > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > hurt by this as well because it converts among multiple dir formats) need
> > to update parent pointer in 'dir' and nothing protects this update against
> > a race with someone else modifying 'dir'. Now this is mostly harmless
> > because the parent pointer (".." directory entry) is at the beginning of
> > the directory and stable however if for example the directory is converted
> > from packed "in-inode" format to "expanded" format as a result of
> > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > in case of UDF).
>
> No, xfs_rename() does not have this problem - we pass four inodes to
> the function - the source directory, source inode, destination
> directory and destination inode.
>
> In the above case, "dir/" is passed to XFS as the source inode - the
> src_dir is "foo/", the target dir is "bar/" and the target inode is
> null. src_dir != target_dir, so we set the "new_parent" flag. The
> source inode is a directory, so we set the src_is_directory flag,
> too.
>
> We lock all three inodes that are passed. We do various things, then
> run:
>
> if (new_parent && src_is_directory) {
> /*
> * Rewrite the ".." entry to point to the new
> * directory.
> */
> error = xfs_dir_replace(tp, src_ip, &xfs_name_dotdot,
> target_dp->i_ino, spaceres);
> ASSERT(error != -EEXIST);
> if (error)
> goto out_trans_cancel;
> }
>
> which replaces the ".." entry in source inode atomically whilst it
> is locked. Any directory format changes that occur during the
> rename are done while the ILOCK is held, so they appear atomic to
> outside observers that are trying to parse the directory structure
> (e.g. readdir).

Thanks for the explanation! I've missed the ILOCK locking in xfs_rename() when
I was glancing over the function...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-18 10:09:58

by Jan Kara

Subject: Re: Locking issue with directory renames

On Tue 17-01-23 16:57:55, Al Viro wrote:
> On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > Hello!
> >
> > I've come across an interesting issue that was spotted by syzbot [1]. The
> > report is against UDF but AFAICS the problem exists for ext4 as well and
> > possibly other filesystems. The problem is the following: When we are
> > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> >
> > if (!is_dir || (flags & RENAME_EXCHANGE))
> > lock_two_nondirectories(source, target);
> > else if (target)
> > inode_lock(target);
> >
> > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > hurt by this as well because it converts among multiple dir formats) need
> > to update parent pointer in 'dir' and nothing protects this update against
> > a race with someone else modifying 'dir'. Now this is mostly harmless
> > because the parent pointer (".." directory entry) is at the beginning of
> > the directory and stable however if for example the directory is converted
> > from packed "in-inode" format to "expanded" format as a result of
> > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > in case of UDF).
> >
> > So we'd need to lock 'source' if it is a directory. Ideally this would
> > happen in VFS as otherwise I bet a lot of filesystems will get this wrong
> > so could vfs_rename() lock 'source' if it is a dir as well? Essentially
> > this would amount to calling lock_two_nondirectories(source, target)
> > unconditionally but that would become a serious misnomer ;). Al, any
> > thought?
>
> FWIW, I suspect that majority of filesystems that do implement rename
> do not have that problem. Moreover, on cross-directory rename we already
> have
> * tree topology stabilized
> * source guaranteed not to be an ancestor of target or either of
> the parents
> so the method instance should be free to lock the source if it needs to
> do so.

Yes, we can lock the source inode in ->rename() if we need it. The snag is
that if 'target' exists, it is already locked so when locking 'source' we
are possibly not following the VFS lock ordering of i_rwsem by inode
address (I don't think it can cause any real deadlock but still it looks
suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
make lockdep happy but that's just a minor annoyance. Finally, we'll have
to check for RENAME_EXCHANGE because in that case, both source and target
will be already locked. Thus if we do the additional locking in the
filesystem, we will leak quite some details about rename locking into the
filesystem which seems undesirable to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-18 16:34:06

by Al Viro

Subject: Re: Locking issue with directory renames

On Wed, Jan 18, 2023 at 10:10:36AM +0100, Jan Kara wrote:

>
> Yes, we can lock the source inode in ->rename() if we need it. The snag is
> that if 'target' exists, it is already locked so when locking 'source' we
> are possibly not following the VFS lock ordering of i_rwsem by inode
> address (I don't think it can cause any real deadlock but still it looks
> suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
> make lockdep happy but that's just a minor annoyance. Finally, we'll have
> to check for RENAME_EXCHANGE because in that case, both source and target
> will be already locked. Thus if we do the additional locking in the
> filesystem, we will leak quite some details about rename locking into the
> filesystem which seems undesirable to me.

Rules for inode locks are simple:
* directories before non-directories
* ancestors before descendents
* for non-directories the ordering is by in-core inode address

So the instances that need that extra lock would do that when source is
a directory and RENAME_EXCHANGE is not given. Having the target already
locked is irrelevant - if it exists, it's already checked to be a directory
as well, and had it been a descendent of source, we would have already
found that and failed with -ELOOP.

If A and B are both directories, there's no ordering between them unless
one is an ancestor of another - such can be locked in any order.
However, one of the following must be true:
* C is locked and both A and B had been observed to be children of C
after the lock on C had been acquired, or
* ->s_vfs_rename_mutex is held for the filesystem containing both
A and B.

Note that ->s_vfs_rename_mutex is there to stabilize the tree topology and
make "is A an ancestor of B?" possible to check for more than "A is locked,
B is a child of A, so A will remain its ancestor until unlocked"...
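
For illustration, the ordering rules listed above can be written down as a
pairwise comparison over a toy tree. This is only a sketch of the rules as
stated in this mail: struct toy_inode and both function names are
hypothetical, and the extra conditions for locking two unrelated
directories (a common locked parent, or ->s_vfs_rename_mutex held) are not
modelled.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an in-core inode; not a kernel type. */
struct toy_inode {
	bool is_dir;
	const struct toy_inode *parent;	/* NULL for the root */
};

/* True if 'a' is a proper ancestor of 'b' in the toy tree. */
static bool toy_is_ancestor(const struct toy_inode *a,
			    const struct toy_inode *b)
{
	for (const struct toy_inode *p = b->parent; p != NULL; p = p->parent)
		if (p == a)
			return true;
	return false;
}

/*
 * Returns <0 if 'a' must be locked before 'b', >0 if 'b' must be locked
 * before 'a', and 0 if the rules impose no ordering between the two.
 */
static int toy_lock_order(const struct toy_inode *a,
			  const struct toy_inode *b)
{
	if (a == b)
		return 0;
	/* directories before non-directories */
	if (a->is_dir != b->is_dir)
		return a->is_dir ? -1 : 1;
	if (a->is_dir) {
		/* ancestors before descendants */
		if (toy_is_ancestor(a, b))
			return -1;
		if (toy_is_ancestor(b, a))
			return 1;
		return 0;	/* unrelated directories: no ordering */
	}
	/* non-directories: ordered by in-core address */
	return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
}
```

With 'dir' a child of 'foo', and 'bar' a sibling of 'foo', the model orders
'foo' before 'dir' but reports no ordering between 'dir' and 'bar', which is
the situation a filesystem is in when it locks the source directory while
the target is already locked.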

2023-01-18 18:48:40

by Jan Kara

Subject: Re: Locking issue with directory renames

On Wed 18-01-23 16:30:06, Al Viro wrote:
> On Wed, Jan 18, 2023 at 10:10:36AM +0100, Jan Kara wrote:
>
> >
> > Yes, we can lock the source inode in ->rename() if we need it. The snag is
> > that if 'target' exists, it is already locked so when locking 'source' we
> > are possibly not following the VFS lock ordering of i_rwsem by inode
> > address (I don't think it can cause any real deadlock but still it looks
> > suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
> > make lockdep happy but that's just a minor annoyance. Finally, we'll have
> > to check for RENAME_EXCHANGE because in that case, both source and target
> > will be already locked. Thus if we do the additional locking in the
> > filesystem, we will leak quite some details about rename locking into the
> > filesystem which seems undesirable to me.
>
> Rules for inode locks are simple:
> * directories before non-directories
> * ancestors before descendents
> * for non-directories the ordering is by in-core inode address
>
> So the instances that need that extra lock would do that when source is
> a directory and RENAME_EXCHANGE is not given. Having the target already
> locked is irrelevant - if it exists, it's already checked to be a directory
> as well, and had it been a descendent of source, we would have already
> found that and failed with -ELOOP.
>
> If A and B are both directories, there's no ordering between them unless
> one is an ancestor of another - such can be locked in any order.
> However, one of the following must be true:
> * C is locked and both A and B had been observed to be children of C
> after the lock on C had been acquired, or
> * ->s_vfs_rename_mutex is held for the filesystem containing both
> A and B.
>
> Note that ->s_vfs_rename_mutex is there to stabilize the tree topology and
> make "is A an ancestor of B?" possible to check for more than "A is locked,
> B is a child of A, so A will remain its ancestor until unlocked"...

OK, fair enough. I'll fix things inside UDF and ext4.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-02-24 00:26:38

by Kent Overstreet

Subject: Re: Locking issue with directory renames

On Wed, Jan 18, 2023 at 08:44:57AM +1100, Dave Chinner wrote:
> XFS just has a function that allows for an arbitrary number of
> inodes to be locked in the given order: xfs_lock_inodes(). For
> rename, the lock order is determined by xfs_sort_for_rename().

bcachefs does the same thing - we just sort and dedup the inodes being
locked, lock order is always pointer ordering.

Is there some reason we couldn't do the same for inode locks? Then
pointer order would be the only lock ordering, no child/descendent
stuff.
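
A minimal userspace sketch of the "sort and dedup, then lock in pointer
order" idea might look like the following. It is not bcachefs or XFS code:
struct toy_inode, its lock_count field and toy_lock_all() are hypothetical,
and "locking" is just a counter so the resulting order is easy to inspect.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for an in-core inode; not a kernel type. */
struct toy_inode {
	int lock_count;	/* how many times toy_lock_all() "locked" it */
};

/* qsort() comparator: ascending order of the pointer values themselves. */
static int cmp_by_address(const void *a, const void *b)
{
	uintptr_t x = (uintptr_t)*(struct toy_inode *const *)a;
	uintptr_t y = (uintptr_t)*(struct toy_inode *const *)b;

	return (x > y) - (x < y);
}

/*
 * Sort the array by address, drop NULLs and duplicates, and "lock" what
 * is left in ascending address order. The surviving pointers are packed
 * at the front of the array; the return value is how many there are.
 */
static size_t toy_lock_all(struct toy_inode **inodes, size_t n)
{
	size_t out = 0;

	qsort(inodes, n, sizeof(*inodes), cmp_by_address);
	for (size_t i = 0; i < n; i++) {
		if (inodes[i] == NULL)
			continue;	/* e.g. no existing target */
		if (out > 0 && inodes[out - 1] == inodes[i])
			continue;	/* same inode passed twice */
		inodes[out++] = inodes[i];
		inodes[out - 1]->lock_count++;
	}
	return out;
}
```

Two callers that pass the same set of inodes in different orders end up
taking the locks in the same sequence, which is what makes a single global
ordering attractive.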

On a related note, I also just sent Peter Zijlstra a lockdep patch so
that we can define an ordering within a class - soon we'll be able to
have lockdep check that we're taking locks in pointer order, or whatever
ordering we decide.

2023-02-25 03:47:01

by Darrick J. Wong

Subject: Re: Locking issue with directory renames

On Wed, Jan 18, 2023 at 08:44:57AM +1100, Dave Chinner wrote:
> On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > Hello!
> >
> > I've come across an interesting issue that was spotted by syzbot [1]. The
> > report is against UDF but AFAICS the problem exists for ext4 as well and
> > possibly other filesystems. The problem is the following: When we are
> > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> >
> > if (!is_dir || (flags & RENAME_EXCHANGE))
> > lock_two_nondirectories(source, target);
> > else if (target)
> > inode_lock(target);
> >
> > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > hurt by this as well because it converts among multiple dir formats) need
> > to update parent pointer in 'dir' and nothing protects this update against
> > a race with someone else modifying 'dir'. Now this is mostly harmless
> > because the parent pointer (".." directory entry) is at the beginning of
> > the directory and stable however if for example the directory is converted
> > from packed "in-inode" format to "expanded" format as a result of
> > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > in case of UDF).
>
> No, xfs_rename() does not have this problem - we pass four inodes to
> the function - the source directory, source inode, destination
> directory and destination inode.

Um, I think it does have this problem. xfs_readdir thinks it can parse
a shortform inode without taking the ILOCK:

	if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
		return xfs_dir2_sf_getdents(&args, ctx);

	lock_mode = xfs_ilock_data_map_shared(dp);
	error = xfs_dir2_isblock(&args, &isblock);

So xfs_dir2_sf_replace can rewrite the shortform structure (or even
convert it to block format!) while readdir is accessing it. Or am I
missing something?

--D

2023-02-28 01:58:18

by Dave Chinner

Subject: Re: Locking issue with directory renames

On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> On Wed, Jan 18, 2023 at 08:44:57AM +1100, Dave Chinner wrote:
> > On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > > Hello!
> > >
> > > I've come across an interesting issue that was spotted by syzbot [1]. The
> > > report is against UDF but AFAICS the problem exists for ext4 as well and
> > > possibly other filesystems. The problem is the following: When we are
> > > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> > >
> > > if (!is_dir || (flags & RENAME_EXCHANGE))
> > > lock_two_nondirectories(source, target);
> > > else if (target)
> > > inode_lock(target);
> > >
> > > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > > hurt by this as well because it converts among multiple dir formats) need
> > > to update parent pointer in 'dir' and nothing protects this update against
> > > a race with someone else modifying 'dir'. Now this is mostly harmless
> > > because the parent pointer (".." directory entry) is at the beginning of
> > > the directory and stable however if for example the directory is converted
> > > from packed "in-inode" format to "expanded" format as a result of
> > > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > > in case of UDF).
> >
> > No, xfs_rename() does not have this problem - we pass four inodes to
> > the function - the source directory, source inode, destination
> > directory and destination inode.
>
> Um, I think it does have this problem. xfs_readdir thinks it can parse
> a shortform inode without taking the ILOCK:
>
> if (dp->i_df.if_format == XFS_DINODE_FMT_LOCAL)
> return xfs_dir2_sf_getdents(&args, ctx);
>
> lock_mode = xfs_ilock_data_map_shared(dp);
> error = xfs_dir2_isblock(&args, &isblock);
>
> So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> convert it to block format!) while readdir is accessing it. Or am I
> missing something?

True, I missed that.

Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
problems with lock ordering [1], and that's why we removed the ILOCK
from the getdents path in the first place and instead relied on the
IOLOCK being held by the VFS across readdir for exclusion against
concurrent modification from the VFS.

Yup, the current code only holds the ILOCK for the extent lookup and
buffer read process, it drops it while it is walking the locked
buffer and calling the filldir callback. Which is why we don't hold
it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
i_rwsem in exclusive mode for any operation that modifies a
directory entry. We should only need the ILOCK for serialising the
extent tree loading, not for serialising access vs modification to
the directory.

So, yeah, I think you're right, Darrick. And the fix is that the VFS
needs to hold the i_rwsem correctly for all inodes that may be
modified during rename...
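
To spell out the invariant being relied on here, the following is a toy,
single-threaded model of a reader/writer lock standing in for i_rwsem.
Everything in it is hypothetical (struct toy_rwsem, the helper names, the
result enum), and it only checks lock state; it does not reproduce the real
race or any XFS code.

```c
#include <stdbool.h>

/* Toy reader/writer lock; not the kernel's rw_semaphore. */
struct toy_rwsem {
	int readers;	/* shared holders, e.g. readdir */
	bool writer;	/* exclusive holder, e.g. a directory modification */
};

static bool toy_read_trylock(struct toy_rwsem *sem)
{
	if (sem->writer)
		return false;
	sem->readers++;
	return true;
}

static void toy_read_unlock(struct toy_rwsem *sem)
{
	sem->readers--;
}

static bool toy_write_trylock(struct toy_rwsem *sem)
{
	if (sem->writer || sem->readers > 0)
		return false;
	sem->writer = true;
	return true;
}

static void toy_write_unlock(struct toy_rwsem *sem)
{
	sem->writer = false;
}

enum toy_result { TOY_SAFE, TOY_MUST_WAIT, TOY_RACED };

/*
 * Hypothetical ".." rewrite in the directory protected by 'i_rwsem'.
 * Without the lock it proceeds even while a reader is walking the
 * directory; with the lock it has to wait for readers to leave.
 */
static enum toy_result toy_rewrite_dotdot(struct toy_rwsem *i_rwsem,
					  bool take_lock)
{
	if (!take_lock)
		return i_rwsem->readers > 0 ? TOY_RACED : TOY_SAFE;
	if (!toy_write_trylock(i_rwsem))
		return TOY_MUST_WAIT;	/* a real rename would sleep here */
	/* the ".." entry would be rewritten here, readers excluded */
	toy_write_unlock(i_rwsem);
	return TOY_SAFE;
}
```

While a reader holds the lock shared, the unlocked rewrite reports
TOY_RACED and the locked one reports TOY_MUST_WAIT, which is the
difference between the current rename path and one that takes i_rwsem on
the directory being moved.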

-Dave.

[1]:

commit dbad7c993053d8f482a5f76270a93307537efd8e
Author: Dave Chinner <[email protected]>
Date: Wed Aug 19 10:33:00 2015 +1000

xfs: stop holding ILOCK over filldir callbacks

The recent change to the readdir locking made in 40194ec ("xfs:
reinstate the ilock in xfs_readdir") for CXFS directory sanity was
probably the wrong thing to do. Deep in the readdir code we
can take page faults in the filldir callback, and so taking a page
fault while holding an inode ilock creates a new set of locking
issues that lockdep warns all over the place about.

The locking order for regular inodes w.r.t. page faults is io_lock
-> pagefault -> mmap_sem -> ilock. The directory readdir code now
triggers ilock -> page fault -> mmap_sem. While we cannot deadlock
at this point, it inverts all the locking patterns that lockdep
normally sees on XFS inodes, and so triggers lockdep. We worked
around this with commit 93a8614 ("xfs: fix directory inode iolock
lockdep false positive"), but that then just moved the lockdep
warning to deeper in the page fault path and triggered on security
inode locks. Fixing the shmem issue there just moved the lockdep
reports somewhere else, and now we are getting false positives from
filesystem freezing annotations getting confused.

Further, if we enter memory reclaim in a readdir path, we now get
lockdep warning about potential deadlocks because the ilock is held
when we enter reclaim. This, again, is different to a regular file
in that we never allow memory reclaim to run while holding the ilock
for regular files. Hence lockdep now throws
ilock->kmalloc->reclaim->ilock warnings.

Basically, the problem is that the ilock is being used to protect
the directory data and the inode metadata, whereas for a regular
file the iolock protects the data and the ilock protects the
metadata. From the VFS perspective, the i_mutex serialises all
accesses to the directory data, and so not holding the ilock for
readdir doesn't matter. The issue is that CXFS doesn't access
directory data via the VFS, so it has no "data serialisaton"
mechanism. Hence we need to hold the IOLOCK in the correct places to
provide this low level directory data access serialisation.

The ilock can then be used just when the extent list needs to be
read, just like we do for regular files. The directory modification
code can take the iolock exclusive when the ilock is also taken,
and this then ensures that readdir is correct excluded while
modifications are in progress.

Signed-off-by: Dave Chinner <[email protected]>
Reviewed-by: Brian Foster <[email protected]>
Signed-off-by: Dave Chinner <[email protected]>

The referenced commit is this:

commit 40194ecc6d78327d98e66de3213db96ca0a31e6f
Author: Ben Myers <[email protected]>
Date: Fri Dec 6 12:30:11 2013 -0800

xfs: reinstate the ilock in xfs_readdir

Although it was removed in commit 051e7cd44ab8, ilock needs to be taken in
xfs_readdir because we might have to read the extent list in from disk. This
keeps other threads from reading from or writing to the extent list while it is
being read in and is still in a transitional state.

This has been associated with "Access to block zero" messages on directories
with large numbers of extents resulting from excessive filesytem fragmentation,
as well as extent list corruption. Unfortunately no test case at this point.

Signed-off-by: Ben Myers <[email protected]>
Reviewed-by: Dave Chinner <[email protected]>

Which references a commit made in 2007 (6 years prior) that
converted XFS to use the VFS filldir mechanism and that was the
original commit that removed the ILOCK from the readdir path and
instead relied on VFS directory locking for access serialisation.

--
Dave Chinner
[email protected]

2023-03-01 12:36:40

by Jan Kara

Subject: Re: Locking issue with directory renames

On Tue 28-02-23 12:58:07, Dave Chinner wrote:
> On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> > So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> > convert it to block format!) while readdir is accessing it. Or am I
> > missing something?
>
> True, I missed that.
>
> Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
> problems with lock ordering [1], and that's why we removed the ILOCK
> from the getdents path in the first place and instead relied on the
> IOLOCK being held by the VFS across readdir for exclusion against
> concurrent modification from the VFS.
>
> Yup, the current code only holds the ILOCK for the extent lookup and
> buffer read process, it drops it while it is walking the locked
> buffer and calling the filldir callback. Which is why we don't hold
> it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
> i_rwsem in exclusive mode for any operation that modifies a
> directory entry. We should only need the ILOCK for serialising the
> extent tree loading, not for serialising access vs modification to
> the directory.
>
> So, yeah, I think you're right, Darrick. And the fix is that the VFS
> needs to hold the i_rwsem correctly for all inodes that may be
> modified during rename...

But Al Viro didn't want to lock the inode in the VFS (as some filesystems
don't need the lock) so in ext4 we ended up grabbing the lock in
ext4_rename() like:

+	/*
+	 * We need to protect against old.inode directory getting
+	 * converted from inline directory format into a normal one.
+	 */
+	inode_lock_nested(old.inode, I_MUTEX_NONDIR2);

(Linus didn't merge the ext4 pull request so the change isn't upstream
yet).

Honza

--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-03-02 00:30:58

by Dave Chinner

Subject: Re: Locking issue with directory renames

On Wed, Mar 01, 2023 at 01:36:28PM +0100, Jan Kara wrote:
> On Tue 28-02-23 12:58:07, Dave Chinner wrote:
> > On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> > > So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> > > convert it to block format!) while readdir is accessing it. Or am I
> > > missing something?
> >
> > True, I missed that.
> >
> > Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
> > problems with lock ordering [1], and that's why we removed the ILOCK
> > from the getdents path in the first place and instead relied on the
> > IOLOCK being held by the VFS across readdir for exclusion against
> > concurrent modification from the VFS.
> >
> > Yup, the current code only holds the ILOCK for the extent lookup and
> > buffer read process, it drops it while it is walking the locked
> > buffer and calling the filldir callback. Which is why we don't hold
> > it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
> > i_rwsem in exclusive mode for any operation that modifies a
> > directory entry. We should only need the ILOCK for serialising the
> > extent tree loading, not for serialising access vs modification to
> > the directory.
> >
> > So, yeah, I think you're right, Darrick. And the fix is that the VFS
> > needs to hold the i_rwsem correctly for all inodes that may be
> > modified during rename...
>
> But Al Viro didn't want to lock the inode in the VFS (as some filesystems
> don't need the lock)

Was any reason given?

We know we have to modify the ".." entry of the child directory
being moved, so I'd really like to understand why the locking rule
of "directory i_rwsem must be held exclusively over modifications"
so that we can use shared access for read operations has been waived
for this specific case.

Apart from exposing multiple filesystems to modifications racing
with operations that hold the i_rwsem shared to *prevent concurrent
directory modifications*, what performance or scalability benefit is
seen as a result of eliding this inode lock from the VFS rename
setup?

This looks like a straightforward VFS level directory
locking violation, and now we are playing whack-a-mole to fix it in
each filesystem we discover that requires the child directory inode
to be locked...

> so in ext4 we ended up grabbing the lock in
> ext4_rename() like:
>
> + /*
> + * We need to protect against old.inode directory getting
> + * converted from inline directory format into a normal one.
> + */
> + inode_lock_nested(old.inode, I_MUTEX_NONDIR2);

Why are you using the I_MUTEX_NONDIR2 annotation when locking a
directory inode? That doesn't seem right.

Further, how do we guarantee correct i_rwsem lock ordering against
all the other inodes that the VFS has already locked and/or
other multi-inode i_rwsem locking primitives in the VFS?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-03-02 09:22:22

by Jan Kara

Subject: Re: Locking issue with directory renames

On Thu 02-03-23 11:30:50, Dave Chinner wrote:
> On Wed, Mar 01, 2023 at 01:36:28PM +0100, Jan Kara wrote:
> > On Tue 28-02-23 12:58:07, Dave Chinner wrote:
> > > On Fri, Feb 24, 2023 at 07:46:57PM -0800, Darrick J. Wong wrote:
> > > > So xfs_dir2_sf_replace can rewrite the shortform structure (or even
> > > > convert it to block format!) while readdir is accessing it. Or am I
> > > > missing something?
> > >
> > > True, I missed that.
> > >
> > > Hmmmm. ISTR that holding ILOCK over filldir callbacks causes
> > > problems with lock ordering[1], and that's why we removed the ILOCK
> > > from the getdents path in the first place and instead relied on the
> > > IOLOCK being held by the VFS across readdir for exclusion against
> > > concurrent modification from the VFS.
> > >
> > > Yup, the current code only holds the ILOCK for the extent lookup and
> > > buffer read process, it drops it while it is walking the locked
> > > buffer and calling the filldir callback. Which is why we don't hold
> > > it for xfs_dir2_sf_getdents() - the VFS is supposed to be holding
> > > i_rwsem in exclusive mode for any operation that modifies a
> > > directory entry. We should only need the ILOCK for serialising the
> > > extent tree loading, not for serialising access vs modification to
> > > the directory.
> > >
> > > So, yeah, I think you're right, Darrick. And the fix is that the VFS
> > > needs to hold the i_rwsem correctly for all inodes that may be
> > > modified during rename...
> >
> > But Al Viro didn't want to lock the inode in the VFS (as some filesystems
> > don't need the lock)
>
> Was any reason given?

Kind of what I wrote above. See:

https://lore.kernel.org/all/Y8bTk1CsH9AaAnLt@ZenIV

> We know we have to modify the ".." entry of the child directory
> being moved, so I'd really like to understand why the locking rule
> of "directory i_rwsem must be held exclusively over modifications"
> so that we can use shared access for read operations has been waived
> for this specific case.

Well, not every filesystem has a physical ".." directory entry but I share
your sentiment that avoiding grabbing the directory lock in this particular
case is not worth the maintenance burden of trying to track down all the
filesystems that may need it. So I'm still all for grabbing it in the VFS
and maybe Al is willing to reconsider given XFS was found to be prone to
the race as well. Al?

> Apart from exposing multiple filesystems to modifications racing
> with operations that hold the i_rwsem shared to *prevent concurrent
> directory modifications*, what performance or scalability benefit is
> seen as a result of eliding this inode lock from the VFS rename
> setup?
>
> This looks like a straightforward VFS level directory
> locking violation, and now we are playing whack-a-mole to fix it in
> each filesystem we discover that requires the child directory inode
> to be locked...
>
> > so in ext4 we ended up grabbing the lock in
> > ext4_rename() like:
> >
> > + /*
> > + * We need to protect against old.inode directory getting
> > + * converted from inline directory format into a normal one.
> > + */
> > + inode_lock_nested(old.inode, I_MUTEX_NONDIR2);
>
> Why are you using the I_MUTEX_NONDIR2 annotation when locking a
> directory inode? That doesn't seem right.

Because that's the only locking subclass left unused during rename and it
happens to have the right ordering for ext4 purposes wrt other i_rwsem
subclasses. In other words it is a hack to fix the race and silence lockdep
;). If we are going to lift this to the VFS, we should probably add an
I_MUTEX_MOVED_DIR subclass, possibly as an alias to I_MUTEX_NONDIR2.
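
To make the alias idea concrete, here is a minimal userspace sketch. The enum
copies the i_rwsem lockdep subclasses as I remember them from
include/linux/fs.h; I_MUTEX_MOVED_DIR and moved_dir_subclass() do not exist
anywhere and are purely hypothetical names for illustration.

```c
#include <assert.h>

/*
 * i_rwsem lockdep subclasses, copied from memory of include/linux/fs.h.
 * Treat the exact list as an assumption and check your tree.
 */
enum inode_i_mutex_lock_class {
	I_MUTEX_NORMAL,
	I_MUTEX_PARENT,
	I_MUTEX_CHILD,
	I_MUTEX_XATTR,
	I_MUTEX_NONDIR2,
	I_MUTEX_PARENT2,
};

/*
 * Hypothetical alias: reuse the NONDIR2 subclass for the directory being
 * moved, so call sites document their intent without adding a new
 * lockdep subclass.
 */
#define I_MUTEX_MOVED_DIR I_MUTEX_NONDIR2

/* Hypothetical helper: the subclass a rename would pass for the moved dir. */
static int moved_dir_subclass(void)
{
	return I_MUTEX_MOVED_DIR;
}
```

With such an alias the ext4 hunk would read
inode_lock_nested(old.inode, I_MUTEX_MOVED_DIR) and lockdep would see exactly
the same subclass as today.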

> Further, how do we guarantee correct i_rwsem lock ordering against
> all the other inodes that the VFS has already locked and/or
> other multi-inode i_rwsem locking primitives in the VFS?

Well, cross-directory renames are all serialized by sb->s_vfs_rename_mutex
so we don't have to be afraid of two renames racing against each other.
Also directories are locked in topological order by all operations so
grabbing the moved directory's lock last is the safe thing to do (because
rename is the only operation that can lock two topologically incomparable
directories).
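
The ordering argument can be modelled in userspace roughly as follows. This is
only a sketch with pthread mutexes, not kernel code: struct dir, rename_mutex,
is_ancestor() and move_dir() are made-up names, rename_mutex stands in for
sb->s_vfs_rename_mutex, each dir's mutex stands in for i_rwsem held
exclusively, and only the cross-directory case is handled.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

/* Userspace stand-in for a directory inode (hypothetical type). */
struct dir {
	pthread_mutex_t lock;	/* models i_rwsem taken exclusively */
	struct dir *parent;	/* models the ".." entry */
};

/* Models sb->s_vfs_rename_mutex: one cross-directory rename at a time. */
static pthread_mutex_t rename_mutex = PTHREAD_MUTEX_INITIALIZER;

static void dir_init(struct dir *d, struct dir *parent)
{
	pthread_mutex_init(&d->lock, NULL);
	d->parent = parent;
}

/* Returns 1 if 'a' is a proper ancestor of 'd'. Call under rename_mutex. */
static int is_ancestor(const struct dir *a, const struct dir *d)
{
	for (d = d->parent; d != NULL; d = d->parent)
		if (d == a)
			return 1;
	return 0;
}

/*
 * Move 'child' under 'new_parent'. Lock order: rename_mutex, then the two
 * parents (ancestor first), then the moved directory last. order[] records
 * the directories in the order they were locked. Returns 0 on success,
 * -1 if the move is refused.
 */
static int move_dir(struct dir *child, struct dir *new_parent,
		    struct dir *order[3])
{
	struct dir *old_parent, *first, *second;

	pthread_mutex_lock(&rename_mutex);
	old_parent = child->parent;

	/* Refuse same-directory moves (not modelled) and loops. */
	if (old_parent == NULL || old_parent == new_parent ||
	    new_parent == child || is_ancestor(child, new_parent)) {
		pthread_mutex_unlock(&rename_mutex);
		return -1;
	}

	if (is_ancestor(new_parent, old_parent)) {
		first = new_parent;
		second = old_parent;
	} else {
		/* old_parent is the ancestor, or the two are incomparable. */
		first = old_parent;
		second = new_parent;
	}

	pthread_mutex_lock(&first->lock);
	pthread_mutex_lock(&second->lock);
	pthread_mutex_lock(&child->lock);	/* moved directory goes last */

	child->parent = new_parent;		/* the ".." update */
	order[0] = first;
	order[1] = second;
	order[2] = child;

	pthread_mutex_unlock(&child->lock);
	pthread_mutex_unlock(&second->lock);
	pthread_mutex_unlock(&first->lock);
	pthread_mutex_unlock(&rename_mutex);
	return 0;
}
```

For rename("foo/dir", "bar/") with foo and bar both under the root, this locks
foo, bar and then dir. Anyone else modifying dir takes only dir's lock (or an
ancestor before it), so taking it last under the rename mutex cannot create an
ordering cycle in this model.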

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR