2023-01-17 12:40:12

by Jan Kara

Subject: Locking issue with directory renames

Hello!

I've come across an interesting issue that was spotted by syzbot [1]. The
report is against UDF but AFAICS the problem exists for ext4 as well and
possibly other filesystems. The problem is the following: When we are
renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
'bar' but 'dir' is unlocked because the locking done by vfs_rename() is

	if (!is_dir || (flags & RENAME_EXCHANGE))
		lock_two_nondirectories(source, target);
	else if (target)
		inode_lock(target);

However, some filesystems (e.g. UDF, but ext4 as well; I suspect XFS may be
hurt by this too because it converts among multiple dir formats) need
to update the parent pointer in 'dir', and nothing protects this update
against a race with someone else modifying 'dir'. Now this is mostly harmless
because the parent pointer (".." directory entry) is at the beginning of
the directory and stable. However, if for example the directory is converted
from packed "in-inode" format to "expanded" format as a result of a
concurrent operation on 'dir', the filesystem gets corrupted (or crashes, as
in the case of UDF).
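
For illustration, the plain-rename half of the branch quoted above can be
modelled in user space. This is only a sketch: model_plain_rename_locks()
and the MODEL_* constants are hypothetical names, the RENAME_EXCHANGE path
is deliberately not modelled, and the locks on the parent directories are
left out. It shows that this branch never locks a moved directory itself.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RENAME_EXCHANGE flag bit. */
#define MODEL_RENAME_EXCHANGE (1u << 1)

/* Hypothetical bitmask describing which child inodes get locked. */
enum {
	MODEL_LOCK_SOURCE = 1 << 0,
	MODEL_LOCK_TARGET = 1 << 1,
};

/*
 * Model of the quoted branch for a plain rename only. The exchange
 * case is out of scope for this sketch, hence the assertion.
 */
unsigned model_plain_rename_locks(bool source_is_dir, bool target_exists,
				  unsigned flags)
{
	unsigned locks = 0;

	assert(!(flags & MODEL_RENAME_EXCHANGE));

	if (!source_is_dir) {
		/* Non-directory: source and (if present) target are locked. */
		locks |= MODEL_LOCK_SOURCE;
		if (target_exists)
			locks |= MODEL_LOCK_TARGET;
	} else if (target_exists) {
		/* Directory: only an existing target is locked. */
		locks |= MODEL_LOCK_TARGET;
	}
	return locks;
}
```

For rename("foo/dir", "bar/") with no existing target, the model returns 0:
neither 'dir' nor a target is locked by this branch.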

So we'd need to lock 'source' if it is a directory. Ideally this would
happen in VFS as otherwise I bet a lot of filesystems will get this wrong
so could vfs_rename() lock 'source' if it is a dir as well? Essentially
this would amount to calling lock_two_nondirectories(source, target)
unconditionally but that would become a serious misnomer ;). Al, any
thought?

Honza

[1] https://lore.kernel.org/all/[email protected]

--
Jan Kara <[email protected]>
SUSE Labs, CR


2023-01-17 17:09:24

by Al Viro

Subject: Re: Locking issue with directory renames

On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> Hello!
>
> I've come across an interesting issue that was spotted by syzbot [1]. The
> report is against UDF but AFAICS the problem exists for ext4 as well and
> possibly other filesystems. The problem is the following: When we are
> renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
>
> 	if (!is_dir || (flags & RENAME_EXCHANGE))
> 		lock_two_nondirectories(source, target);
> 	else if (target)
> 		inode_lock(target);
>
> However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> hurt by this as well because it converts among multiple dir formats) need
> to update parent pointer in 'dir' and nothing protects this update against
> a race with someone else modifying 'dir'. Now this is mostly harmless
> because the parent pointer (".." directory entry) is at the beginning of
> the directory and stable however if for example the directory is converted
> from packed "in-inode" format to "expanded" format as a result of
> concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> in case of UDF).
>
> So we'd need to lock 'source' if it is a directory. Ideally this would
> happen in VFS as otherwise I bet a lot of filesystems will get this wrong
> so could vfs_rename() lock 'source' if it is a dir as well? Essentially
> this would amount to calling lock_two_nondirectories(source, target)
> unconditionally but that would become a serious misnomer ;). Al, any
> thought?

FWIW, I suspect that the majority of filesystems that do implement rename
do not have that problem. Moreover, on cross-directory rename we already
have
	* tree topology stabilized
	* source guaranteed not to be an ancestor of target or either of
	  the parents
so the method instance should be free to lock the source if it needs to
do so.

Not sure, I'll need to grep around and take a look at the instances...

2023-01-17 22:56:02

by Dave Chinner

Subject: Re: Locking issue with directory renames

On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> Hello!
>
> I've come across an interesting issue that was spotted by syzbot [1]. The
> report is against UDF but AFAICS the problem exists for ext4 as well and
> possibly other filesystems. The problem is the following: When we are
> renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
>
> 	if (!is_dir || (flags & RENAME_EXCHANGE))
> 		lock_two_nondirectories(source, target);
> 	else if (target)
> 		inode_lock(target);
>
> However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> hurt by this as well because it converts among multiple dir formats) need
> to update parent pointer in 'dir' and nothing protects this update against
> a race with someone else modifying 'dir'. Now this is mostly harmless
> because the parent pointer (".." directory entry) is at the beginning of
> the directory and stable however if for example the directory is converted
> from packed "in-inode" format to "expanded" format as a result of
> concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> in case of UDF).

No, xfs_rename() does not have this problem - we pass four inodes to
the function - the source directory, source inode, destination
directory and destination inode.

In the above case, "dir/" is passed to XFS as the source inode - the
src_dir is "foo/", the target dir is "bar/" and the target inode is
null. src_dir != target_dir, so we set the "new_parent" flag. The
source inode is a directory, so we set the src_is_directory flag,
too.

We lock all three inodes that are passed. We do various things, then
run:

	if (new_parent && src_is_directory) {
		/*
		 * Rewrite the ".." entry to point to the new
		 * directory.
		 */
		error = xfs_dir_replace(tp, src_ip, &xfs_name_dotdot,
					target_dp->i_ino, spaceres);
		ASSERT(error != -EEXIST);
		if (error)
			goto out_trans_cancel;
	}

which replaces the ".." entry in source inode atomically whilst it
is locked. Any directory format changes that occur during the
rename are done while the ILOCK is held, so they appear atomic to
outside observers that are trying to parse the directory structure
(e.g. readdir).

> So we'd need to lock 'source' if it is a directory.

Yup, and XFS goes further by always locking the source inode in a
rename, even if it is not a directory. This ensures the inode being
moved cannot have its metadata otherwise modified whilst the rename
is in progress, even if that modification would have no impact on
the rename. It's a pretty strict interpretation of "rename is an
atomic operation", but it avoids accidentally missing nasty corner
cases like the one described above...

> Ideally this would
> happen in VFS as otherwise I bet a lot of filesystems will get this wrong
> so could vfs_rename() lock 'source' if it is a dir as well? Essentially
> this would amount to calling lock_two_nondirectories(source, target)
> unconditionally but that would become a serious misnomer ;). Al, any
> thought?

XFS just has a function that allows for an arbitrary number of
inodes to be locked in the given order: xfs_lock_inodes(). For
rename, the lock order is determined by xfs_sort_for_rename().
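
The general idea of sorting a small set of inodes into one fixed order
before locking them can be sketched in user space. This is not XFS code:
struct model_inode and the model_*() helpers are hypothetical names, the
lock is a plain flag rather than a real lock, and ordering by ascending
inode number is an assumption made for the illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an in-core inode. */
struct model_inode {
	uint64_t ino;	/* assumed ordering key */
	bool locked;	/* flag standing in for a real lock */
};

/* Insertion sort by inode number; fine for a handful of entries. */
void model_sort_for_lock(struct model_inode **v, size_t n)
{
	for (size_t i = 1; i < n; i++) {
		struct model_inode *key = v[i];
		size_t j = i;

		while (j > 0 && v[j - 1]->ino > key->ino) {
			v[j] = v[j - 1];
			j--;
		}
		v[j] = key;
	}
}

/* Sort, then lock in that order. Duplicate entries are locked once. */
void model_lock_inodes(struct model_inode **v, size_t n)
{
	model_sort_for_lock(v, n);
	for (size_t i = 0; i < n; i++) {
		if (i > 0 && v[i] == v[i - 1])
			continue;	/* same inode passed twice */
		assert(!v[i]->locked);
		v[i]->locked = true;
	}
}

/* Unlock in reverse order; expects the array as left by the lock call. */
void model_unlock_inodes(struct model_inode **v, size_t n)
{
	for (size_t i = n; i-- > 0; ) {
		if (i > 0 && v[i] == v[i - 1])
			continue;	/* unlocked when we reach its first copy */
		assert(v[i]->locked);
		v[i]->locked = false;
	}
}
```

Because every caller sorts by the same key, two concurrent callers that
share inodes always acquire them in the same order, which is what rules
out lock-order inversions in this model.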

Cheers,

Dave.
--
Dave Chinner
[email protected]

2023-01-18 09:43:26

by Jan Kara

Subject: Re: Locking issue with directory renames

Hello,

On Wed 18-01-23 08:44:57, Dave Chinner wrote:
> On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > I've come across an interesting issue that was spotted by syzbot [1]. The
> > report is against UDF but AFAICS the problem exists for ext4 as well and
> > possibly other filesystems. The problem is the following: When we are
> > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> >
> > 	if (!is_dir || (flags & RENAME_EXCHANGE))
> > 		lock_two_nondirectories(source, target);
> > 	else if (target)
> > 		inode_lock(target);
> >
> > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > hurt by this as well because it converts among multiple dir formats) need
> > to update parent pointer in 'dir' and nothing protects this update against
> > a race with someone else modifying 'dir'. Now this is mostly harmless
> > because the parent pointer (".." directory entry) is at the beginning of
> > the directory and stable however if for example the directory is converted
> > from packed "in-inode" format to "expanded" format as a result of
> > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > in case of UDF).
>
> No, xfs_rename() does not have this problem - we pass four inodes to
> the function - the source directory, source inode, destination
> directory and destination inode.
>
> In the above case, "dir/" is passed to XFS as the source inode - the
> src_dir is "foo/", the target dir is "bar/" and the target inode is
> null. src_dir != target_dir, so we set the "new_parent" flag. The
> source inode is a directory, so we set the src_is_directory flag,
> too.
>
> We lock all three inodes that are passed. We do various things, then
> run:
>
> 	if (new_parent && src_is_directory) {
> 		/*
> 		 * Rewrite the ".." entry to point to the new
> 		 * directory.
> 		 */
> 		error = xfs_dir_replace(tp, src_ip, &xfs_name_dotdot,
> 					target_dp->i_ino, spaceres);
> 		ASSERT(error != -EEXIST);
> 		if (error)
> 			goto out_trans_cancel;
> 	}
>
> which replaces the ".." entry in source inode atomically whilst it
> is locked. Any directory format changes that occur during the
> rename are done while the ILOCK is held, so they appear atomic to
> outside observers that are trying to parse the directory structure
> (e.g. readdir).

Thanks for the explanation! I missed the ILOCK locking in xfs_rename() when
I was glancing over the function...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-18 10:09:58

by Jan Kara

Subject: Re: Locking issue with directory renames

On Tue 17-01-23 16:57:55, Al Viro wrote:
> On Tue, Jan 17, 2023 at 01:37:35PM +0100, Jan Kara wrote:
> > Hello!
> >
> > I've come across an interesting issue that was spotted by syzbot [1]. The
> > report is against UDF but AFAICS the problem exists for ext4 as well and
> > possibly other filesystems. The problem is the following: When we are
> > renaming directory 'dir' say rename("foo/dir", "bar/") we lock 'foo' and
> > 'bar' but 'dir' is unlocked because the locking done by vfs_rename() is
> >
> > 	if (!is_dir || (flags & RENAME_EXCHANGE))
> > 		lock_two_nondirectories(source, target);
> > 	else if (target)
> > 		inode_lock(target);
> >
> > However some filesystems (e.g. UDF but ext4 as well, I suspect XFS may be
> > hurt by this as well because it converts among multiple dir formats) need
> > to update parent pointer in 'dir' and nothing protects this update against
> > a race with someone else modifying 'dir'. Now this is mostly harmless
> > because the parent pointer (".." directory entry) is at the beginning of
> > the directory and stable however if for example the directory is converted
> > from packed "in-inode" format to "expanded" format as a result of
> > concurrent operation on 'dir', the filesystem gets corrupted (or crashes as
> > in case of UDF).
> >
> > So we'd need to lock 'source' if it is a directory. Ideally this would
> > happen in VFS as otherwise I bet a lot of filesystems will get this wrong
> > so could vfs_rename() lock 'source' if it is a dir as well? Essentially
> > this would amount to calling lock_two_nondirectories(source, target)
> > unconditionally but that would become a serious misnomer ;). Al, any
> > thought?
>
> FWIW, I suspect that the majority of filesystems that do implement rename
> do not have that problem. Moreover, on cross-directory rename we already
> have
> 	* tree topology stabilized
> 	* source guaranteed not to be an ancestor of target or either of
> 	  the parents
> so the method instance should be free to lock the source if it needs to
> do so.

Yes, we can lock the source inode in ->rename() if we need it. The snag is
that if 'target' exists, it is already locked so when locking 'source' we
are possibly not following the VFS lock ordering of i_rwsem by inode
address (I don't think it can cause any real deadlock but still it looks
suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
make lockdep happy but that's just a minor annoyance. Finally, we'll have
to check for RENAME_EXCHANGE because in that case, both source and target
will be already locked. Thus if we do the additional locking in the
filesystem, we will leak quite some details about rename locking into the
filesystem which seems undesirable to me.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2023-01-18 16:34:06

by Al Viro

Subject: Re: Locking issue with directory renames

On Wed, Jan 18, 2023 at 10:10:36AM +0100, Jan Kara wrote:

>
> Yes, we can lock the source inode in ->rename() if we need it. The snag is
> that if 'target' exists, it is already locked so when locking 'source' we
> are possibly not following the VFS lock ordering of i_rwsem by inode
> address (I don't think it can cause any real deadlock but still it looks
> suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
> make lockdep happy but that's just a minor annoyance. Finally, we'll have
> to check for RENAME_EXCHANGE because in that case, both source and target
> will be already locked. Thus if we do the additional locking in the
> filesystem, we will leak quite some details about rename locking into the
> filesystem which seems undesirable to me.

Rules for inode locks are simple:
	* directories before non-directories
	* ancestors before descendants
	* for non-directories the ordering is by in-core inode address

So the instances that need that extra lock would do that when source is
a directory and RENAME_EXCHANGE is not given. Having the target already
locked is irrelevant - if it exists, it's already checked to be a directory
as well, and had it been a descendant of source, we would have already
found that and failed with -ELOOP.
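
As a rough user-space illustration of that condition (not actual
filesystem code): struct model_inode, model_needs_source_lock() and
model_update_dotdot() are hypothetical names, MODEL_RENAME_EXCHANGE is a
stand-in for the real flag, and the lock is modelled as a plain flag.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the RENAME_EXCHANGE flag bit. */
#define MODEL_RENAME_EXCHANGE (1u << 1)

/* Hypothetical stand-in for an inode as seen by a ->rename() instance. */
struct model_inode {
	bool is_dir;
	bool locked;		/* flag standing in for the inode lock */
	unsigned long parent;	/* models the ".." entry */
};

/* The condition stated above: directory source, no RENAME_EXCHANGE. */
bool model_needs_source_lock(const struct model_inode *source,
			     unsigned flags)
{
	return source->is_dir && !(flags & MODEL_RENAME_EXCHANGE);
}

/* Update ".." of the moved inode, taking the extra lock when needed. */
void model_update_dotdot(struct model_inode *source, unsigned flags,
			 unsigned long new_parent)
{
	bool extra = model_needs_source_lock(source, flags);

	if (extra) {
		assert(!source->locked);
		source->locked = true;
	}

	source->parent = new_parent;	/* the update that must not race */

	if (extra)
		source->locked = false;
}
```

The sketch leaves out how a real instance would handle the exchange case;
it only encodes when the extra lock is taken.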

If A and B are both directories, there's no ordering between them unless
one is an ancestor of another - such can be locked in any order.
However, one of the following must be true:
	* C is locked and both A and B had been observed to be children of C
	  after the lock on C had been acquired, or
	* ->s_vfs_rename_mutex is held for the filesystem containing both
	  A and B.

Note that ->s_vfs_rename_mutex is there to stabilize the tree topology and
make "is A an ancestor of B?" possible to check for more than "A is locked,
B is a child of A, so A will remain its ancestor until unlocked"...
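
The three ordering rules listed above can be written down as a small
comparison function. This is a user-space sketch under stated assumptions,
not kernel code: struct model_node, model_is_ancestor() and
model_lock_order() are hypothetical names, and the tree is assumed stable
while the ancestor check runs (in the kernel that stability is what the
conditions above provide).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for an inode in a stable directory tree. */
struct model_node {
	bool is_dir;
	struct model_node *parent;	/* NULL for the root */
};

/* True if a is a proper ancestor of b. */
bool model_is_ancestor(const struct model_node *a,
		       const struct model_node *b)
{
	for (const struct model_node *p = b->parent; p != NULL; p = p->parent)
		if (p == a)
			return true;
	return false;
}

/*
 * Returns <0 if a must be locked before b, >0 if b must be locked
 * before a, and 0 if the rules impose no order between them.
 */
int model_lock_order(const struct model_node *a, const struct model_node *b)
{
	assert(a != b);

	/* Directories before non-directories. */
	if (a->is_dir != b->is_dir)
		return a->is_dir ? -1 : 1;

	if (a->is_dir) {
		/* Ancestors before descendants. */
		if (model_is_ancestor(a, b))
			return -1;
		if (model_is_ancestor(b, a))
			return 1;
		/* Unrelated directories: no ordering between them. */
		return 0;
	}

	/* Two non-directories: order by in-core address. */
	return (uintptr_t)a < (uintptr_t)b ? -1 : 1;
}
```

In the rename("foo/dir", "bar/") example, 'dir' and 'bar' are unrelated
directories, so the function returns 0 for that pair.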

2023-01-18 18:48:40

by Jan Kara

Subject: Re: Locking issue with directory renames

On Wed 18-01-23 16:30:06, Al Viro wrote:
> On Wed, Jan 18, 2023 at 10:10:36AM +0100, Jan Kara wrote:
>
> >
> > Yes, we can lock the source inode in ->rename() if we need it. The snag is
> > that if 'target' exists, it is already locked so when locking 'source' we
> > are possibly not following the VFS lock ordering of i_rwsem by inode
> > address (I don't think it can cause any real deadlock but still it looks
> > suspicious). Also we'll have to lock with I_MUTEX_NONDIR2 lockdep class to
> > make lockdep happy but that's just a minor annoyance. Finally, we'll have
> > to check for RENAME_EXCHANGE because in that case, both source and target
> > will be already locked. Thus if we do the additional locking in the
> > filesystem, we will leak quite some details about rename locking into the
> > filesystem which seems undesirable to me.
>
> Rules for inode locks are simple:
> 	* directories before non-directories
> 	* ancestors before descendants
> 	* for non-directories the ordering is by in-core inode address
>
> So the instances that need that extra lock would do that when source is
> a directory and RENAME_EXCHANGE is not given. Having the target already
> locked is irrelevant - if it exists, it's already checked to be a directory
> as well, and had it been a descendant of source, we would have already
> found that and failed with -ELOOP.
>
> If A and B are both directories, there's no ordering between them unless
> one is an ancestor of another - such can be locked in any order.
> However, one of the following must be true:
> 	* C is locked and both A and B had been observed to be children of C
> 	  after the lock on C had been acquired, or
> 	* ->s_vfs_rename_mutex is held for the filesystem containing both
> 	  A and B.
>
> Note that ->s_vfs_rename_mutex is there to stabilize the tree topology and
> make "is A an ancestor of B?" possible to check for more than "A is locked,
> B is a child of A, so A will remain its ancestor until unlocked"...

OK, fair enough. I'll fix things inside UDF and ext4.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR