Hello!
This is again about the problem with directory renames I've already
reported in [1]. To quickly sum it up some filesystems (so far we know at
least about xfs, ext4, udf, reiserfs) need to lock the directory when it is
being renamed into another directory. This is because we need to update the
parent pointer in the directory in that case and if that races with other
operation on the directory, bad things can happen.
So far we've done the locking in the filesystem code but recently Darrick
pointed out [2] that we've missed the RENAME_EXCHANGE case in our ext4 fix.
That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix
regular files and directories. Couple nasty arising cases:
1) We need to additionally lock two exchanged directories. Suppose a
situation like:
mkdir P; mkdir P/A; mkdir P/B; touch P/B/F
CPU1 CPU2
renameat2("P/A", "P/B", RENAME_EXCHANGE); renameat2("P/B/F", "P/A", 0);
Both operations need to lock A and B directories which are unrelated in the
tree. This means we must establish stable lock ordering on directory locks
even for the case when they are not in ancestor relationship.
2) We may need to lock a directory and a non-directory and they can be in
parent-child relationship when hardlinks are involved:
mkdir A; mkdir B; touch A/F; ln A/F B/F
renameat2("A/F", "B");
And this is really nasty because we don't have a way to find out whether
"A/F" and "B" are in any relationship - in particular whether B happens to
be another parent of A/F or not.
What I've decided to do is to make sure we always lock directory first in
this mixed case and that *should* avoid all the deadlocks but I'm spelling
this out here just in case people can think of some even more wicked case
before I'll send patches.
Also I wanted to ask (Miklos in particular as RENAME_EXCHANGE author): Why
do we lock non-directories in RENAME_EXCHANGE case? If we didn't have to do
that things would be somewhat simpler...
Honza
[1] https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3
[2] https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed, May 24, 2023 at 06:35:04PM +0200, Jan Kara wrote:
> Hello!
>
> This is again about the problem with directory renames I've already
> reported in [1]. To quickly sum it up some filesystems (so far we know at
> least about xfs, ext4, udf, reiserfs) need to lock the directory when it is
> being renamed into another directory. This is because we need to update the
> parent pointer in the directory in that case and if that races with other
> operation on the directory, bad things can happen.
>
> So far we've done the locking in the filesystem code but recently Darrick
> pointed out [2] that we've missed the RENAME_EXCHANGE case in our ext4 fix.
> That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix
> regular files and directories. Couple nasty arising cases:
>
> 1) We need to additionally lock two exchanged directories. Suppose a
> situation like:
>
> mkdir P; mkdir P/A; mkdir P/B; touch P/B/F
>
> CPU1 CPU2
> renameat2("P/A", "P/B", RENAME_EXCHANGE); renameat2("P/B/F", "P/A", 0);
>
> Both operations need to lock A and B directories which are unrelated in the
> tree. This means we must establish stable lock ordering on directory locks
> even for the case when they are not in ancestor relationship.
>
> 2) We may need to lock a directory and a non-directory and they can be in
> parent-child relationship when hardlinks are involved:
>
> mkdir A; mkdir B; touch A/F; ln A/F B/F
> renameat2("A/F", "B");
>
> And this is really nasty because we don't have a way to find out whether
> "A/F" and "B" are in any relationship - in particular whether B happens to
> be another parent of A/F or not.
>
> What I've decided to do is to make sure we always lock directory first in
> this mixed case and that *should* avoid all the deadlocks but I'm spelling
> this out here just in case people can think of some even more wicked case
> before I'll send patches.
I can think of a few things:
Not taking i_rwsem when updating the dotdot entry as part of moving
child directories means that readdir can race with a dotdot update to a
shortformat XFS directory. That currently doesn't get too ugly because
we reserve a full 8 bytes for the shortform dotdot entry's inode (which
means we are not at risk of changing the format). As Dave has
previously noted, xfs synchronizes on the ILOCK for metadata updates.
This also makes online directory fsck harder because any time we have to
cycle the directory's ILOCK means we have to revalidate the directory
contents afterwards or hook the directory update code so that we can
learn about changes that happened while the ILOCK was dropped.
(I'm sure the rest of you are thinking "that's some crazy xfs thing").
Things get weirder with the new xfs parent pointers feature -- rename
doesn't take i_rwsem of the children, which means that the only
synchronization point is with the ILOCK held. Parent pointers are
stored in a special xattr namespace. This makes me nervous about
collisions with a getxattr call, since those can be done with or without
i_rwsem held at all. I think the ILOCK saves us in this case, since
I've been testing it for a few months now without any corruption
problems.
Obviously, online fsck of parent pointers has to employ the same
revalidate/hook to deal with ILOCK cycling. But again, that's "some
crazy xfs thing".
I wouldn't mind the VFS taking i_rwsem on the children to simplify
online fsck, but I've got a reasonable enough workaround.
--D
> Also I wanted to ask (Miklos in particular as RENAME_EXCHANGE author): Why
> do we lock non-directories in RENAME_EXCHANGE case? If we didn't have to do
> that things would be somewhat simpler...
>
> Honza
>
> [1] https://lore.kernel.org/all/20230117123735.un7wbamlbdihninm@quack3
> [2] https://lore.kernel.org/all/20230517045836.GA11594@frogsfrogsfrogs
>
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
On Wed, 24 May 2023 at 18:35, Jan Kara <[email protected]> wrote:
>
> Hello!
>
> This is again about the problem with directory renames I've already
> reported in [1]. To quickly sum it up some filesystems (so far we know at
> least about xfs, ext4, udf, reiserfs) need to lock the directory when it is
> being renamed into another directory. This is because we need to update the
> parent pointer in the directory in that case and if that races with other
> operation on the directory, bad things can happen.
>
> So far we've done the locking in the filesystem code but recently Darrick
> pointed out [2] that we've missed the RENAME_EXCHANGE case in our ext4 fix.
> That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix
> regular files and directories. Couple nasty arising cases:
>
> 1) We need to additionally lock two exchanged directories. Suppose a
> situation like:
>
> mkdir P; mkdir P/A; mkdir P/B; touch P/B/F
>
> CPU1 CPU2
> renameat2("P/A", "P/B", RENAME_EXCHANGE); renameat2("P/B/F", "P/A", 0);
Not sure I get it.
CPU1 locks P then A then B
CPU2 locks P then B then A
Both start with P and after that ordering between A and B doesn't
matter as long as the topology stays the same, which is guaranteed.
Or did you mean renameat2("P/B/F", "P/A/F", 0);?
This indeed looks deadlocky.
>
> Both operations need to lock A and B directories which are unrelated in the
> tree. This means we must establish stable lock ordering on directory locks
> even for the case when they are not in ancestor relationship.
>
> 2) We may need to lock a directory and a non-directory and they can be in
> parent-child relationship when hardlinks are involved:
>
> mkdir A; mkdir B; touch A/F; ln A/F B/F
> renameat2("A/F", "B");
>
> And this is really nasty because we don't have a way to find out whether
> "A/F" and "B" are in any relationship - in particular whether B happens to
> be another parent of A/F or not.
>
> What I've decided to do is to make sure we always lock directory first in
> this mixed case and that *should* avoid all the deadlocks but I'm spelling
> this out here just in case people can think of some even more wicked case
> before I'll send patches.
Locking directories first has always been the case, described in
detail in Documentation/filesystems/directory-locking.rst
> Also I wanted to ask (Miklos in particular as RENAME_EXCHANGE author): Why
> do we lock non-directories in RENAME_EXCHANGE case? If we didn't have to do
> that things would be somewhat simpler...
I can't say I remember anything, but digging into
lock_two_nondirectories() this comes up quickly:
6cedba8962f4 ("vfs: take i_mutex on renamed file")
So apparently NFS is relying on i_mutex to prevent delegations from
being broken without its knowledge. Might be that is't NFS only, and
then the RENAME_EXCHANGE case doesn't need it (NFS doesn't support
RENAME_EXCHANGE), but I can't say for sure.
Also Al seems to have had some thoughts on this in d42b386834ee
("update D/f/directory-locking")
Thanks,
Miklos
On Wed 24-05-23 21:33:07, Miklos Szeredi wrote:
> On Wed, 24 May 2023 at 18:35, Jan Kara <[email protected]> wrote:
> >
> > Hello!
> >
> > This is again about the problem with directory renames I've already
> > reported in [1]. To quickly sum it up some filesystems (so far we know at
> > least about xfs, ext4, udf, reiserfs) need to lock the directory when it is
> > being renamed into another directory. This is because we need to update the
> > parent pointer in the directory in that case and if that races with other
> > operation on the directory, bad things can happen.
> >
> > So far we've done the locking in the filesystem code but recently Darrick
> > pointed out [2] that we've missed the RENAME_EXCHANGE case in our ext4 fix.
> > That one is particularly nasty because RENAME_EXCHANGE can arbitrarily mix
> > regular files and directories. Couple nasty arising cases:
> >
> > 1) We need to additionally lock two exchanged directories. Suppose a
> > situation like:
> >
> > mkdir P; mkdir P/A; mkdir P/B; touch P/B/F
> >
> > CPU1 CPU2
> > renameat2("P/A", "P/B", RENAME_EXCHANGE); renameat2("P/B/F", "P/A", 0);
>
> Not sure I get it.
>
> CPU1 locks P then A then B
> CPU2 locks P then B then A
>
> Both start with P and after that ordering between A and B doesn't
> matter as long as the topology stays the same, which is guaranteed.
>
> Or did you mean renameat2("P/B/F", "P/A/F", 0);?
>
> This indeed looks deadlocky.
Right, that is what I meant. Sorry for confusion.
> > Both operations need to lock A and B directories which are unrelated in the
> > tree. This means we must establish stable lock ordering on directory locks
> > even for the case when they are not in ancestor relationship.
> >
> > 2) We may need to lock a directory and a non-directory and they can be in
> > parent-child relationship when hardlinks are involved:
> >
> > mkdir A; mkdir B; touch A/F; ln A/F B/F
> > renameat2("A/F", "B");
> >
> > And this is really nasty because we don't have a way to find out whether
> > "A/F" and "B" are in any relationship - in particular whether B happens to
> > be another parent of A/F or not.
> >
> > What I've decided to do is to make sure we always lock directory first in
> > this mixed case and that *should* avoid all the deadlocks but I'm spelling
> > this out here just in case people can think of some even more wicked case
> > before I'll send patches.
>
> Locking directories first has always been the case, described in
> detail in Documentation/filesystems/directory-locking.rst
>
> > Also I wanted to ask (Miklos in particular as RENAME_EXCHANGE author): Why
> > do we lock non-directories in RENAME_EXCHANGE case? If we didn't have to do
> > that things would be somewhat simpler...
>
> I can't say I remember anything, but digging into
> lock_two_nondirectories() this comes up quickly:
>
> 6cedba8962f4 ("vfs: take i_mutex on renamed file")
>
> So apparently NFS is relying on i_mutex to prevent delegations from
> being broken without its knowledge. Might be that is't NFS only, and
> then the RENAME_EXCHANGE case doesn't need it (NFS doesn't support
> RENAME_EXCHANGE), but I can't say for sure.
>
> Also Al seems to have had some thoughts on this in d42b386834ee
> ("update D/f/directory-locking")
Thanks for the references. I've now updated the document
Documentation/filesystems/directory-locking.rst and I'm now more convinced
the scheme is correct. It is also kind of neat there are less special cases
:).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR