2022-02-14 21:08:49

by Xavier Roche

[permalink] [raw]
Subject: fs: race between vfs_rename and do_linkat (mv and link)

There has been a longstanding race condition between vfs_rename and do_linkat,
when those operations are done in parallel:

1. Moving a file to an existing target file (eg. mv file target)
2. Creating a link from the target file to a third file (eg. ln target link)

A typical example would be (1) a regular process putting a new version
of a database in place and (2) a regular process backuping the live
database by hardlinking it.

My understanding is that as the target file is never erased on client
side, but just replaced, the link should never fail.

The issue seem to lie inside vfs_link (fs/namei.c):
inode_lock(inode);
/* Make sure we don't allow creating hardlink to an unlinked file */
if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
error = -ENOENT;

The possible answer is that the inode refcount is zero because the
file has just been replaced concurrently, old file being erased, and
as such, the link operation is failing.

The race appears to have been introduced by aae8a97d3ec30, to fix
_another_ race between unlink and link (but I'm not sure to understand
what were the implications).

Reverting the inode->i_nlink == 0 section "fixes" the issue, but would
probably reintroduce this another issue.

At this point I don't know what would be the best way to fix this issue.

Trivial case that will lead to ENOENT: (reproduced on 5.16.5)
Note that the race _seems_ to last while some IO are pending (getting the
race on tmpfs is typically much harder)

========== Cut here ==========
#!/bin/bash
#

rm -f link file target
touch target

# Link target -> link in loop
while ln target link && rm link; do :; done &

# Overwrite file -> target in loop until we fail
while touch file && mv file target; do :; done &

wait
========== Cut here ==========

Kudos to Xavier Grand from Algolia for spotting the issue with a
reproducible case.

The issue was reported three years ago, but only on the fsdevel
mailing-list, where it might have been overlooked.
It was also reported at https://bugzilla.kernel.org/show_bug.cgi?id=204705


2022-02-15 11:31:03

by Miklos Szeredi

[permalink] [raw]
Subject: Re: race between vfs_rename and do_linkat (mv and link)

On Mon, 14 Feb 2022 at 22:07, Xavier Roche <[email protected]> wrote:
>
> There has been a longstanding race condition between vfs_rename and do_linkat,
> when those operations are done in parallel:
>
> 1. Moving a file to an existing target file (eg. mv file target)
> 2. Creating a link from the target file to a third file (eg. ln target link)
>
> A typical example would be (1) a regular process putting a new version
> of a database in place and (2) a regular process backuping the live
> database by hardlinking it.
>
> My understanding is that as the target file is never erased on client
> side, but just replaced, the link should never fail.
>
> The issue seem to lie inside vfs_link (fs/namei.c):
> inode_lock(inode);
> /* Make sure we don't allow creating hardlink to an unlinked file */
> if (inode->i_nlink == 0 && !(inode->i_state & I_LINKABLE))
> error = -ENOENT;
>
> The possible answer is that the inode refcount is zero because the
> file has just been replaced concurrently, old file being erased, and
> as such, the link operation is failing.
>
> The race appears to have been introduced by aae8a97d3ec30, to fix
> _another_ race between unlink and link (but I'm not sure to understand
> what were the implications).
>
> Reverting the inode->i_nlink == 0 section "fixes" the issue, but would
> probably reintroduce this another issue.
>
> At this point I don't know what would be the best way to fix this issue.

Doing "lock_rename() + lookup last components" would fix this race.
If this was only done on retry, then that would prevent possible
performance regressions, at the cost of extra complexity.

Thanks,
Miklos

2022-02-15 23:52:15

by Al Viro

[permalink] [raw]
Subject: Re: race between vfs_rename and do_linkat (mv and link)

On Tue, Feb 15, 2022 at 10:56:29AM +0100, Miklos Szeredi wrote:

> Doing "lock_rename() + lookup last components" would fix this race.

No go - thanks to the possibility of AT_SYMLINK_FOLLOW there.
Think of it - we'd need to
* lock parents (both at the same time)
* look up the last component of source
* if it turns a symlink - unlock parents and repeat the entire
thing for its body, except when asked not to.
* when we are done with the source, look the last component of
target up

... and then there is sodding -ESTALE handling, with all the elegance
that brings in.

> If this was only done on retry, then that would prevent possible
> performance regressions, at the cost of extra complexity.

Extra compared to the above, that is. How delightful...

2022-02-16 00:00:41

by Al Viro

[permalink] [raw]
Subject: Re: race between vfs_rename and do_linkat (mv and link)

On Tue, Feb 15, 2022 at 01:37:40PM +0000, Al Viro wrote:
> On Tue, Feb 15, 2022 at 10:56:29AM +0100, Miklos Szeredi wrote:
>
> > Doing "lock_rename() + lookup last components" would fix this race.
>
> No go - thanks to the possibility of AT_SYMLINK_FOLLOW there.
> Think of it - we'd need to
> * lock parents (both at the same time)
> * look up the last component of source
> * if it turns a symlink - unlock parents and repeat the entire
> thing for its body, except when asked not to.
> * when we are done with the source, look the last component of
> target up
>
> ... and then there is sodding -ESTALE handling, with all the elegance
> that brings in.
>
> > If this was only done on retry, then that would prevent possible
> > performance regressions, at the cost of extra complexity.
>
> Extra compared to the above, that is. How delightful...

Actually, it's even viler than that: lock_rename() relies upon the
directories being locked sitting on the same fs. Now, surely link(2)
would fail if source and target are on the different filesystem,
wouldn't it? Alas, with AT_SYMLINK_FOLLOW it's quite possible to have
the source resolving to a symlink that does lead to the same fs as the
target, while the symlink itself is on a different fs. So it's not
even straight lock_rename() - it has to be a special version that would
handle cross-fs invocations (somehow - e.g. ordering them on superblock
or mount in-core address in such case; ordering between dentries could
be arbitrary for cross-fs cases).

Worse, you need to deal with the corner cases. "/" or anything ending on
"." or ".." can be rejected (no links to directories) and thankfully we
do not allow AT_EMPTY for linkat(2), but... procfs symlinks are in the
game, since AT_SYMLINK_FOLLOW is there.

And _that_ is a real bitch - what "parent" would you lock for (followed)
/proc/self/fd/0? It can change right under you; one solution would
be to grab ->vfs_rename_mutex first, same parent or not, then do what
lock_rename() does, relying upon ->d_parent having been stabilized
by ->vfs_rename_mutex. But that would have to be conditional upon
running into that case - you don't want to serialize the shit out of
(same-directory) link(2) on given filesystem. Which makes the entire
thing even harder to follow and reason about.

And to make it even more fun, you'll need to either duplicate pick_link()
guts, or try and make it usable in this situation. Might or might not
be easy - I hadn't tried to go into that.

"Fucking ugly" is inadequate for the likely results of that approach.
It's guaranteed to be a source of headache for pretty much ever after.

Does POSIX actually make any promises in that area? That would affect
how high a cost we ought to pay for that - I agree that it would be nicer
to have atomicity from userland point of view, but there's a difference
between hard bug and QoI issue.

Again, what really makes it painful is AT_SYMLINK_FOLLOW support in
linkat(2). For plain link(2) it would be easier to deal with.

2022-02-16 06:58:34

by Al Viro

[permalink] [raw]
Subject: Re: race between vfs_rename and do_linkat (mv and link)

On Tue, Feb 15, 2022 at 04:06:06PM +0000, Al Viro wrote:

> Worse, you need to deal with the corner cases. "/" or anything ending on
> "." or ".." can be rejected (no links to directories) and thankfully we
> do not allow AT_EMPTY for linkat(2), but... procfs symlinks are in the

That'd be AT_EMPTY_PATH, obviously, and unfortunately we do allow that.
Which brings another fun case to deal with. Same problem with "what's the
parent of that thing and how do we make it stable?"...

Oh, and you need to cope with O_TMPFILE ones as well - both for that and
for procfs symlinks to them. Which is fine from the vfs_link() POV (see
I_LINKABLE check in there), but the locking is outside of that, so we
need to deal with that joy. And _there_ you have no parent at all - what
could it be, anyway? So we'd need to decide what to lock. *AND* we have
the possibility of another thread doing link(2) on what used to be
O_TMPFILE, which would give it a parent by the time we get to doing
the actual operation...