LinuxLists.cc - Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

2023-11-25 22:02:05

Subject: Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:

> ack. I'll base the other changes we discussed on top of your branch.

Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate

2023-11-26 04:52:40

by Al Viro

Subject: Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Sat, Nov 25, 2023 at 10:01:36PM +0000, Al Viro wrote:
> On Fri, Nov 24, 2023 at 10:22:49AM -0500, Gabriel Krisman Bertazi wrote:
>
> > ack. I'll base the other changes we discussed on top of your branch.
>
> Rebased to v6.7-rc1, fixed up (ceph calls fscrypt_d_revalidate() directly,
> and D/f/porting entry had been missing), pushed out as #no-rebase-d_revalidate

FWIW, ->d_revalidate() has an old unpleasant problem we might try to solve
now.

In non-RCU mode we treat 0 as "invalidate that sucker and do a fresh lookup".
Fine, except that there's a possibility of race here - we'd hit ->d_revalidate()
while another thread was renaming object in question. Or has just found it
by doing lookup in a place where it had been moved on server.

->d_revalidate() decides that it needs to be looked up on server and forms
a request before rename succeeds. So NFS (e.g.) request goes out with the
old parent and name. By the time server sees it, RENAME has been processed
and succeeded. There's no such file in the old place anymore.

So ->d_revalidate() returns 0... and we proceed to invalidate the dentry.
Which had been moved to *new* place by now. In that place it's perfectly
valid and does not deserve invalidation.

Scenario when rename had been done not from this client is even worse:

server:/srv/nfs/foo is mounted on /mnt/foo
we stat /mnt/foo/bar
/mnt/foo/bar is in dcache
somebody on server renames /srv/nfs/foo/bar to /srv/nfs/foo/barf
process A: stat /mnt/foo/bar/baz.
process B: mount something on /mnt/foo/barf/
process B: no /mnt/foo/barf in dcache, let's look it up
           found fhandle of /mnt/foo
           sent LOOKUP "barf" in it
           got an fhandle and found it matching the inode of /mnt/foo/bar
process A: has reached /mnt/foo/bar and decided to revalidate it.
           found fhandle of /mnt/foo
           sent a LOOKUP "bar" in that
           got "nothing with that name there"
           ->d_revalidate() returns 0
           loses CPU
process B: splice the dentry of /mnt/foo/bar to /mnt/foo/barf
           proceed to mount on top of it
process A: gets CPU back
           calls d_invalidate() on the dentry that now is /mnt/foo/barf
           dissolves the mount created by process B

Note that server:/srv/nfs/foo/barf has been there and perfectly valid
since before B has started doing anything. It has no idea that the
damn thing used to be in a different place and something on the same
client had seen it at the old place once upon a time. As far as it is
concerned, mount has succeeded and then quietly disappeared. The mountpoint
is still there - with freshly looked up dentry, since the old one had been
invalidated, but userland doesn't see that, so... WTF?

It's not easy to hit, but I'd expect it to be feasible on SMP KVM, where instead
of A losing CPU we might've had the virtual CPU losing the timeslice on host.

IMO we should only do d_invalidate() if
* ->d_revalidate() has returned 0
* dentry is still hashed, still has the same parent and still matches
the name from ->d_compare() POV.
If it doesn't, we should just leave it wherever it has been moved to and
act as if we hadn't seen it in the first place.

In other words, have
d_revalidate(dentry, parent, name, flags) doing the following:
	if no ->d_revalidate
		return 1
	ret = ->d_revalidate(...)
	if (unlikely(ret == 0) && !(flags & LOOKUP_RCU)) {
		spin_lock(&dentry->d_lock);
		if (!d_same_name(dentry, parent, name))
			spin_unlock(&dentry->d_lock);
		else
			d_invalidate_locked(dentry);
	}
	return ret

where d_invalidate_locked() would be d_invalidate() sans the initial
spin_lock(&dentry->d_lock);
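
For illustration only, the check-then-invalidate idea above can be modelled in plain userspace C. This is a sketch under stated assumptions, not kernel code: every toy_* name is hypothetical, a boolean stands in for ->d_lock, a string comparison stands in for ->d_compare(), and the filesystem's verdict is passed in as an argument rather than obtained from a real ->d_revalidate() instance.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Userspace toy model, not kernel code: all toy_* names are hypothetical. */
struct toy_dentry {
	struct toy_dentry *parent;
	const char *name;
	bool hashed;		/* still reachable by lookup */
	bool locked;		/* stands in for ->d_lock */
	bool invalidated;	/* set once the toy invalidation has run */
};

#define TOY_LOOKUP_RCU 1u

/* Stand-in for "still hashed, same parent, same name". */
static bool toy_same_place(const struct toy_dentry *d,
			   const struct toy_dentry *parent, const char *name)
{
	return d->hashed && d->parent == parent && strcmp(d->name, name) == 0;
}

/* Entered with the lock held and drops it before returning. */
static void toy_invalidate_locked(struct toy_dentry *d)
{
	assert(d->locked);
	d->hashed = false;
	d->invalidated = true;
	d->locked = false;
}

/* fs_verdict plays the role of the value returned by ->d_revalidate(). */
static int toy_d_revalidate(struct toy_dentry *d, struct toy_dentry *parent,
			    const char *name, unsigned int flags, int fs_verdict)
{
	if (fs_verdict == 0 && !(flags & TOY_LOOKUP_RCU)) {
		d->locked = true;
		if (!toy_same_place(d, parent, name))
			d->locked = false;	/* moved meanwhile: leave it alone */
		else
			toy_invalidate_locked(d);
	}
	return fs_verdict;
}
```

In this model a dentry that has been renamed or reparented between the lookup and the verdict keeps its hashed state, while one that is still where the caller found it gets invalidated; the caller sees the same return value either way.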

That would solve that problem, AFAICS. Objections, anyone? I'm too
sleepy to put together a patch at the moment, will post after I get
some sleep...

PS: as the matter of fact, it might be a good idea to pass the parent
as explicit argument to ->d_revalidate(), now that we are passing the
name as well. Look at the boilerplate in the instances; all that
parent = READ_ONCE(dentry->d_parent);
dir = d_inode_rcu(parent);
if (!dir)
return -ECHILD;
...
on the RCU side combined with
parent = dget_parent(dentry);
dir = d_inode(parent);
...
dput(parent);
stuff.

It's needed only because the caller had not told us which directory
is that thing supposed to be in; in non-RCU mode the parent is
explicitly pinned down, no need to play those games. All we need
is
dir = d_inode_rcu(parent);
if (!dir) // could happen only in RCU mode
return -ECHILD;
assuming we need the parent inode, that is.

So... how about
int (*d_revalidate)(struct dentry *dentry, struct dentry *parent,
const struct qstr *name, unsigned int flags);
since we are touching all instances anyway?
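
A rough userspace sketch of what an instance might look like once the parent is handed in explicitly. Everything here is a hypothetical stand-in (toy_dentry, toy_inode, TOY_LOOKUP_RCU, and a "directory" that holds exactly one name); the only point is that the dget_parent()/dput() dance disappears and the single remaining check is the NULL parent inode.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Userspace toy model, not kernel code: all toy_* names are hypothetical. */
struct toy_inode {
	const char *only_entry;	/* the one name this toy directory contains */
};

struct toy_dentry {
	struct toy_dentry *d_parent;
	struct toy_inode *d_inode;	/* NULL for a negative dentry */
};

#define TOY_LOOKUP_RCU 1u

/*
 * Same shape as the proposed signature: dentry, parent, name, flags.
 * Returns 1 if the name is still present in the parent, 0 if it is stale,
 * -ECHILD if the parent has no inode.
 */
static int toy_revalidate(struct toy_dentry *dentry, struct toy_dentry *parent,
			  const char *name, unsigned int flags)
{
	struct toy_inode *dir = parent->d_inode;

	(void)dentry;	/* a real instance would inspect the dentry as well */
	if (!dir) {
		/* per the text above, this can only happen in RCU mode */
		assert(flags & TOY_LOOKUP_RCU);
		return -ECHILD;
	}
	return strcmp(dir->only_entry, name) == 0;
}
```

Note that the instance never reads dentry->d_parent; the directory it checks against is whatever the caller said it was.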

2023-11-26 18:42:10

by Al Viro

Subject: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

[folks involved into d_invalidate()/submount eviction stuff Cc'd]
On Sun, Nov 26, 2023 at 04:52:19AM +0000, Al Viro wrote:
> PS: as the matter of fact, it might be a good idea to pass the parent
> as explicit argument to ->d_revalidate(), now that we are passing the
> name as well. Look at the boilerplate in the instances; all that
> parent = READ_ONCE(dentry->d_parent);
> dir = d_inode_rcu(parent);
> if (!dir)
> return -ECHILD;
> ...
> on the RCU side combined with
> parent = dget_parent(dentry);
> dir = d_inode(parent);
> ...
> dput(parent);
> stuff.
>
> It's needed only because the caller had not told us which directory
> is that thing supposed to be in; in non-RCU mode the parent is
> explicitly pinned down, no need to play those games. All we need
> is
> dir = d_inode_rcu(parent);
> if (!dir) // could happen only in RCU mode
> return -ECHILD;
> assuming we need the parent inode, that is.
>
> So... how about
> int (*d_revalidate)(struct dentry *dentry, struct dentry *parent,
> const struct qstr *name, unsigned int flags);
> since we are touching all instances anyway?

OK, it's definitely a good idea for simplifying ->d_revalidate() instances
and I think we should go for it on these grounds alone. I'll do that.

d_invalidate() situation is more subtle - we need to sort out its interplay
with d_splice_alias().

More concise variant of the scenario in question:
* we have /mnt/foo/bar and a lot of its descendents in dcache on client
* server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
* somebody on the client does a lookup of /mnt/foo/bar and gets told by
the server that there's no directory with that name anymore.
* that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
evicting its descendents
* We try to mount something on /mnt/foo/baz/blah. We look up baz, get
an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and
mount on top of it.
* d_invalidate() finishes shrink_dcache_parent() and starts hunting for
submounts to dissolve. And finds the mount we'd done. Which mount quietly
disappears.

Note that from the server POV the thing had been moved quite a while ago.
No server-side races involved - all it sees is a couple of LOOKUP in the
same directory, one for the old name, one for the new.

On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
which had been there on server at the time we started and which remains in
place after mount we'd created suddenly disappears.

For the thread that ended up calling d_invalidate(), they'd been doing e.g.
stat on a pathname that used to be there a while ago, but currently isn't.
They get -ENOENT and no indication that something odd might have happened.

From ->d_revalidate() point of view there's also nothing odd happening -
dentry is not a mountpoint, it stays in place until we return and there's
no directory entry with that name in its parent. It's as clear-cut
as it gets - dentry is stale.

The only overlap happening there is d_splice_alias() hitting in the middle
of already started d_invalidate().

For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
to do with it, but the same problem existed prior to that.

FWIW, I suspect that the right answer would be along the lines of
* if d_splice_alias() does move an existing (attached) alias in
place, it ought to dissolve all mountpoints in subtree being moved.
There might be subtleties, but in case when that __d_unalias() happens
due to rename on server this is definitely the right thing to do.
* d_invalidate() should *NOT* do anything with dentry that
got moved (including moved by d_splice_alias()) from the place we'd
found it in dcache. At least d_invalidate() done due to having
->d_revalidate() return 0.
* d_invalidate() should dissolve all mountpoints in the
subtree that existed when it got started (and found the victim
still unmoved, that is). It should (as it does) prevent any
new mountpoints added in that subtree, unless the mountpoint
to be had been moved (spliced) out. What it really shouldn't
do is touch the mountpoints that are currently outside of it
due to moves.
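
To make the overlap concrete, here is a deterministic userspace toy of the sequence "invalidation starts, splice moves the dentry, a mount lands on it, invalidation finishes". It is a sketch under assumptions, not kernel code: all toy_* names are hypothetical, a move counter stands in for "same parent and name as when we started", and a plain integer stands in for the mounts in the subtree.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace toy model, not kernel code: all toy_* names are hypothetical. */
struct toy_dentry {
	const char *name;
	bool hashed;
	int mounts;		/* mounts somewhere in the subtree */
	unsigned int moves;	/* bumped each time the dentry is spliced elsewhere */
};

struct toy_invalidation {
	struct toy_dentry *victim;
	unsigned int moves_at_start;
};

/* First half of the toy invalidation: unhash and remember where we started. */
static struct toy_invalidation toy_invalidate_begin(struct toy_dentry *d)
{
	struct toy_invalidation inv = { d, d->moves };

	assert(d->hashed);
	d->hashed = false;
	return inv;
}

/* Toy splice: move to a new name and rehash. */
static void toy_splice(struct toy_dentry *d, const char *new_name)
{
	d->name = new_name;
	d->hashed = true;
	d->moves++;
}

/* Mounting fails on an unhashed dentry in this model. */
static bool toy_mount(struct toy_dentry *d)
{
	if (!d->hashed)
		return false;
	d->mounts++;
	return true;
}

/* Second half: dissolve mounts; returns how many were dissolved. */
static int toy_invalidate_finish(struct toy_invalidation *inv, bool skip_if_moved)
{
	struct toy_dentry *d = inv->victim;
	int dissolved;

	if (skip_if_moved && d->moves != inv->moves_at_start)
		return 0;	/* it lives elsewhere now: hands off */
	dissolved = d->mounts;
	d->mounts = 0;
	return dissolved;
}
```

With skip_if_moved false the model reproduces the complaint: the mount made after the splice is dissolved by an invalidation that started before it. With skip_if_moved true the mount survives, which is the second rule in the list above.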

I'm going to look around and see if we have any weird cases where
d_splice_alias() is used for things like "correct the case of
dentry name on a case-mangled filesystem" - that would presumably
not want to dissolve any submounts. I seem to recall seeing
some shite of that sort, but that was a long time ago.

Eric, Miklos - it might be a good idea if you at least took a
look at whatever comes out of that (sub)thread; I'm trying to
reconstruct the picture, but the last round of serious reworking
of that area had been almost 10 years ago and your recollections
of the considerations back then might help. I realize that they
are probably rather fragmentary (mine definitely are) and any
analysis will need to be redone on the current tree, but...

2023-11-27 06:39:14

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:

> d_invalidate() situation is more subtle - we need to sort out its interplay
> with d_splice_alias().
>
> More concise variant of the scenario in question:
> * we have /mnt/foo/bar and a lot of its descendents in dcache on client
> * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
> * somebody on the client does a lookup of /mnt/foo/bar and gets told by
> the server that there's no directory with that name anymore.
> * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
> evicting its descendents
> * We try to mount something on /mnt/foo/baz/blah. We look up baz, get
> an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
> d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
> it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and
> mount on top of it.
> * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
> submounts to dissolve. And finds the mount we'd done. Which mount quietly
> disappears.
>
> Note that from the server POV the thing had been moved quite a while ago.
> No server-side races involved - all it sees is a couple of LOOKUP in the
> same directory, one for the old name, one for the new.
>
> On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
> which had been there on server at the time we started and which remains in
> place after mount we'd created suddenly disappears.
>
> For the thread that ended up calling d_invalidate(), they'd been doing e.g.
> stat on a pathname that used to be there a while ago, but currently isn't.
> They get -ENOENT and no indication that something odd might have happened.
>
> From ->d_revalidate() point of view there's also nothing odd happening -
> dentry is not a mountpoint, it stays in place until we return and there's
> no directory entry with that name in its parent. It's as clear-cut
> as it gets - dentry is stale.
>
> The only overlap happening there is d_splice_alias() hitting in the middle
> of already started d_invalidate().
>
> For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
> and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
> to do with it, but the same problem existed prior to that.
>
> FWIW, I suspect that the right answer would be along the lines of
> * if d_splice_alias() does move an existing (attached) alias in
> place, it ought to dissolve all mountpoints in subtree being moved.
> There might be subtleties, but in case when that __d_unalias() happens
> due to rename on server this is definitely the right thing to do.
> * d_invalidate() should *NOT* do anything with dentry that
> got moved (including moved by d_splice_alias()) from the place we'd
> found it in dcache. At least d_invalidate() done due to having
> ->d_revalidate() return 0.
> * d_invalidate() should dissolve all mountpoints in the
> subtree that existed when it got started (and found the victim
> still unmoved, that is). It should (as it does) prevent any
> new mountpoints added in that subtree, unless the mountpoint
> to be had been moved (spliced) out. What it really shouldn't
> do is touch the mountpoints that are currently outside of it
> due to moves.
>
> I'm going to look around and see if we have any weird cases where
> d_splice_alias() is used for things like "correct the case of
> dentry name on a case-mangled filesystem" - that would presumably
> not want to dissolve any submounts. I seem to recall seeing
> some shite of that sort, but that was a long time ago.
>
> Eric, Miklos - it might be a good idea if you at least took a
> look at whatever comes out of that (sub)thread; I'm trying to
> reconstruct the picture, but the last round of serious reworking
> of that area had been almost 10 years ago and your recollections
> of the considerations back then might help. I realize that they
> are probably rather fragmentary (mine definitely are) and any
> analysis will need to be redone on the current tree, but...

TBH, I wonder if we ought to have d_invalidate() variant that would
unhash the dentry in question, do a variant of shrink_dcache_parent()
that would report if there had been any mountpoints and if there
had been any, do namespace_lock() and go hunting for mounts in that
subtree, moving corresponding struct mountpoint to a private list
as we go (removing them from mountpoint hash chains, that is). Then
have them all evicted after we'd finished walking the subtree...

The tricky part will be lock ordering - right now we have the
mountpoint hash protected by mount_lock (same as mount hash, probably
worth splitting anyway) and that nests outside of ->d_lock.

Note that we don't do mountpoint hash lookups on mountpoint crossing
- it's nowhere near the really hot paths. What we have is
lookup_mountpoint() - plain hash lookup. Always
under namespace_lock() and mount_lock.
get_mountpoint() - there's an insertion into hash chain,
with dentry passed through the d_set_mounted(), which would
fail if we have d_invalidate() on the subtree.
Also always under namespace_lock() and mount_lock.
__put_mountpoint() - removal from the hash chain.
We remove from hash chain after having cleared DCACHE_MOUNTED.
_That_ can happen under mount_lock alone (taking out the stuck
submounts on final mntput()).

So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under
->d_lock. Introduce a new dentry flag (DCACHE_MOUNT_INVALIDATION?)
In d_walk() callback we would
	* do nothing if DCACHE_MOUNTED is not set or DCACHE_MOUNT_INVALIDATION
is.
	* otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the
mountpoint hash chain matching that dentry, find struct mountpoint in it,
remove it from the chain and insert into a separate "collector" chain, all
without messing with refcount.
In lookup_mountpoint() and get_mountpoint() take the bitlock on chain.
In __put_mountpoint(), once it has grabbed ->d_lock
	* check if it has DCACHE_MOUNT_INVALIDATION, use that to
decide which chain we are locking - the normal one or the collector
	* clear both DCACHE_MOUNTED and DCACHE_MOUNT_INVALIDATION
	* remove from chain
	* unlock the chain
	* drop ->d_lock.

Once we are finished walking the tree, go over the collector list
and do what __detach_mount() guts do. We are no longer under
any ->d_lock, so locking is not a problem. namespace_unlock() will
flush them all, same as it does for __detach_mount().
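
A userspace toy of the collect-then-dissolve shape described above, for illustration only. All toy_* names and TOY_* flags are hypothetical; the subtree is a flat array instead of a d_walk() traversal, the hash chains and bitlocks are not modelled at all, and "dissolving" is just setting a field. What the sketch does show is the two-phase structure: the walk callback only moves mountpoints onto a private list, and the real work happens after the walk, when no per-dentry lock would be held.

```c
#include <assert.h>
#include <stddef.h>

/* Userspace toy model, not kernel code: all toy_* names are hypothetical. */
#define TOY_MOUNTED            0x1u
#define TOY_MOUNT_INVALIDATION 0x2u

struct toy_mountpoint {
	struct toy_mountpoint *next;	/* linkage on the collector list */
	int dissolved;
};

struct toy_dentry {
	unsigned int flags;
	struct toy_mountpoint *mp;	/* non-NULL only if TOY_MOUNTED is set */
};

/* Walk callback: skip non-mountpoints and anything already collected. */
static void toy_collect(struct toy_dentry *d, struct toy_mountpoint **collector)
{
	if (!(d->flags & TOY_MOUNTED) || (d->flags & TOY_MOUNT_INVALIDATION))
		return;
	assert(d->mp != NULL);
	d->flags |= TOY_MOUNT_INVALIDATION;
	d->mp->next = *collector;	/* push onto the private list */
	*collector = d->mp;
}

/* After the walk: drain the collector; returns how many were dissolved. */
static int toy_dissolve_collected(struct toy_mountpoint **collector)
{
	int n = 0;

	while (*collector) {
		struct toy_mountpoint *mp = *collector;

		*collector = mp->next;
		mp->next = NULL;
		mp->dissolved = 1;
		n++;
	}
	return n;
}

/* Flat-array stand-in for walking a subtree and then evicting. */
static int toy_invalidate_subtree(struct toy_dentry *subtree, size_t count)
{
	struct toy_mountpoint *collector = NULL;
	size_t i;

	for (i = 0; i < count; i++)
		toy_collect(&subtree[i], &collector);
	return toy_dissolve_collected(&collector);
}
```

A second pass over the same array finds nothing to do, since the invalidation flag marks mountpoints that have already been taken off the normal chain.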

In the __d_unalias() case, apply that d_invalidate() analogue to the alias.
Yes, it might do final mntput() of other filesystems, while under
->i_rwsem on our parent. Not a problem, fs shutdown will go
either through task_work or schedule_delayed_work(); in any
case, it won't happen under ->i_rwsem. We obviously can't do
that under rename_lock, though, so we'll need to massage that
path in d_splice_alias() a bit.

So, something like d_invalidate_locked(victim) called with
victim->d_lock held. d_splice_alias() would use that (see above)
and places where we do d_invalidate() after ->d_revalidate() having
returned 0 would do this:
lock dentry
if it still has the same parent and name
d_invalidate_locked()
else
unlock dentry
probably folded into fs/namei.c:d_revalidate()... Not tonight,
though - I'd rather do that while properly awake ;-/

2023-11-27 16:02:04

by Eric W. Biederman

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

"Eric W. Biederman" <[email protected]> writes:

> I am confused what is going on with ext4 and f2fs. I think they
> are calling d_invalidate when all they need to call is d_drop.

ext4 and f2fs are buggy in how they call d_invalidate, if I am reading
the code correctly.

d_invalidate calls detach_mounts.

detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
prevent races with new mounts.

ext4 and f2fs (in their case insensitive code) are calling d_invalidate
before dont_mount has been called to set D_CANT_MOUNT.

Eric

2023-11-27 16:03:53

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:

> There is a lot going on there. I remember one of the relevant
> restrictions was marking dentries dont_mount, and inodes S_DEAD
> in unlink and rmdir.
>
> But even without that marking if d_invalidate is called
> from d_revalidate the inode and all of its dentries must be
> dead because the inode is stale and must go. There should
> be no resurrecting it at that point.
>
> I suspect the most fruitful way to think of the d_invalidate vs
> d_splice_alias races is an unlink vs rename race.
>
> I don't think the mechanism matters, but deeply and fundamentally
> if we detect a directory inode is dead we need to stick with
> that decision and not attempt to resurrect it with d_splice_alias.

Wrong. Deeply and fundamentally we detect a dentry that does not
match the directory contents according to the server.

For example, due to rename done on server. With object in question
perfectly alive there - fhandle still works, etc.

However, it's no longer where it used to be. And we would bloody better
not have lookups for the old name result in access to that object.
We also should never allow the access to *new* name lead to two live
dentries for the same directory inode.

Again, this is not about rmdir() or unlink() - invalidation can happen
for object that is still open, still accessed and still very much alive.
Does that all the time for any filesystem with ->d_revalidate().

2023-11-27 16:06:04

by Eric W. Biederman

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

Al Viro <[email protected]> writes:

> On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:
>
>> d_invalidate() situation is more subtle - we need to sort out its interplay
>> with d_splice_alias().
>>
>> More concise variant of the scenario in question:
>> * we have /mnt/foo/bar and a lot of its descendents in dcache on client
>> * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
>> * somebody on the client does a lookup of /mnt/foo/bar and gets told by
>> the server that there's no directory with that name anymore.
>> * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
>> evicting its descendents
>> * We try to mount something on /mnt/foo/baz/blah. We look up baz, get
>> an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
>> d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
>> it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and
>> mount on top of it.
>> * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
>> submounts to dissolve. And finds the mount we'd done. Which mount quietly
>> disappears.
>>
>> Note that from the server POV the thing had been moved quite a while ago.
>> No server-side races involved - all it sees is a couple of LOOKUP in the
>> same directory, one for the old name, one for the new.
>>
>> On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
>> which had been there on server at the time we started and which remains in
>> place after mount we'd created suddenly disappears.
>>
>> For the thread that ended up calling d_invalidate(), they'd been doing e.g.
>> stat on a pathname that used to be there a while ago, but currently isn't.
>> They get -ENOENT and no indication that something odd might have happened.
>>
>> From ->d_revalidate() point of view there's also nothing odd happening -
>> dentry is not a mountpoint, it stays in place until we return and there's
>> no directory entry with that name in its parent. It's as clear-cut
>> as it gets - dentry is stale.
>>
>> The only overlap happening there is d_splice_alias() hitting in the middle
>> of already started d_invalidate().
>>
>> For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
>> and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
>> to do with it, but the same problem existed prior to that.
>>
>> FWIW, I suspect that the right answer would be along the lines of
>> * if d_splice_alias() does move an existing (attached) alias in
>> place, it ought to dissolve all mountpoints in subtree being moved.
>> There might be subtleties, but in case when that __d_unalias() happens
>> due to rename on server this is definitely the right thing to do.
>> * d_invalidate() should *NOT* do anything with dentry that
>> got moved (including moved by d_splice_alias()) from the place we'd
>> found it in dcache. At least d_invalidate() done due to having
>> ->d_revalidate() return 0.
>> * d_invalidate() should dissolve all mountpoints in the
>> subtree that existed when it got started (and found the victim
>> still unmoved, that is). It should (as it does) prevent any
>> new mountpoints added in that subtree, unless the mountpoint
>> to be had been moved (spliced) out. What it really shouldn't
>> do is touch the mountpoints that are currently outside of it
>> due to moves.
>>
>> I'm going to look around and see if we have any weird cases where
>> d_splice_alias() is used for things like "correct the case of
>> dentry name on a case-mangled filesystem" - that would presumably
>> not want to dissolve any submounts. I seem to recall seeing
>> some shite of that sort, but that was a long time ago.
>>
>> Eric, Miklos - it might be a good idea if you at least took a
>> look at whatever comes out of that (sub)thread; I'm trying to
>> reconstruct the picture, but the last round of serious reworking
>> of that area had been almost 10 years ago and your recollections
>> of the considerations back then might help. I realize that they
>> are probably rather fragmentary (mine definitely are) and any
>> analysis will need to be redone on the current tree, but...

By subthread I assume you are referring to the work that generalized
check_submounts_and_drop into the current d_invalidate.

My memory is that there were deliberate restrictions on where
d_revalidate could be called so as not to mess with mounts.

I believe those restrictions either prevented or convinced us it
prevented nasty interactions between d_invalidate and d_splice_alias.

There is a lot going on there. I remember one of the relevant
restrictions was marking dentries dont_mount, and inodes S_DEAD
in unlink and rmdir.

But even without that marking if d_invalidate is called
from d_revalidate the inode and all of its dentries must be
dead because the inode is stale and must go. There should
be no resurrecting it at that point.

I suspect the most fruitful way to think of the d_invalidate vs
d_splice_alias races is an unlink vs rename race.

I don't think the mechanism matters, but deeply and fundamentally
if we detect a directory inode is dead we need to stick with
that decision and not attempt to resurrect it with d_splice_alias.

Looking at ext4 and f2fs it appears when case folding they are calling
d_invalidate before the generic code can, and before markings like
dont_mount happen. Is that the tie in with where the current
conversation comes in?

> TBH, I wonder if we ought to have d_invalidate() variant that would
> unhash the dentry in question,

You mean like the current d_invalidate does? It calls __d_drop which
unhashes the thing and prevents lookups. You even pointed to the change
that added that in the previous email. The only thing that does not
happen currently is marking the dentry as unhashed.

Looking at it, the rmdir code uses not only dont_mount but marks the
inode S_DEAD as well.

Right now we can't even get to d_splice_alias unless the original
dentry is unhashed.

So I suspect it isn't d_invalidate you are fighting.

> do a variant of shrink_dcache_parent()
> that would report if there had been any mountpoints and if there
> had been any, do namespace_lock() and go hunting for mounts in that
> subtree, moving corresponding struct mountpoint to a private list
> as we go (removing them from mountpoint hash chains, that is). Then
> have them all evicted after we'd finished walking the subtree...

>
> The tricky part will be lock ordering - right now we have the
> mountpoint hash protected by mount_lock (same as mount hash, probably
> worth splitting anyway) and that nests outside of ->d_lock.

I don't get it.

All we have to do is to prevent the inode lookup from succeeding
if we have decided the inode has been deleted. It may be a little
more subtle if the path of the inode we are connecting goes through
a dentry that is being invalidated.

But we either need to prevent it in the lookup that leads to d_alloc,
or prevent the new dentry from being attached.

I know d_splice_alias takes the rename_lock to prevent some of those
races.

I hope that helps on the recollection front.

I am confused what is going on with ext4 and f2fs. I think they
are calling d_invalidate when all they need to call is d_drop.

Eric

2023-11-27 16:14:54

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote:
> On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:
>
> > There is a lot going on there. I remember one of the relevant
> > restrictions was marking dentries dont_mount, and inodes S_DEAD
> > in unlink and rmdir.
> >
> > But even without that marking if d_invalidate is called
> > from d_revalidate the inode and all of its dentries must be
> > dead because the inode is stale and must go. There should
> > be no resurrecting it at that point.
> >
> > I suspect the most fruitful way to think of the d_invalidate vs
> > d_splice_alias races is an unlink vs rename race.
> >
> > I don't think the mechanism matters, but deeply and fundamentally
> > if we detect a directory inode is dead we need to stick with
> > that decision and not attempt to resurrect it with d_splice_alias.
>
> Wrong. Deeply and fundamentally we detect a dentry that does not
> match the directory contents according to the server.
>
> For example, due to rename done on server. With object in question
> perfectly alive there - fhandle still works, etc.
>
> However, it's no longer where it used to be. And we would bloody better
> not have lookups for the old name result in access to that object.
> We also should never allow the access to *new* name lead to two live
> dentries for the same directory inode.
>
> Again, this is not about rmdir() or unlink() - invalidation can happen
> for object that is still open, still accessed and still very much alive.
> Does that all the time for any filesystem with ->d_revalidate().

Put another way, there used to be very odd song and dance in ->d_revalidate()
instances along the lines of "we can't possibly tell the caller to invalidate
a mountpoint"; it was racy in the best case and during the rewrite of
d_invalidate() to teach it how to evict submounts those attempts had been
dropped - ->d_revalidate() returning 0 does end up with mounts dissolved
by d_invalidate() from caller.

It always had been racy, starting with the checks that used to be in
->d_revalidate() instances way before all those changes. So the switch
of d_invalidate() to dissolving submounts had been a step in the right
direction, but it's not being careful enough.

Again, it's about d_invalidate() caused by pathwalk running into a dentry that
doesn't match the reality vs. d_splice_alias() finding that it matches the
inode we had looked up elsewhere.

2023-11-27 16:34:01

by Christian Brauner

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 06:38:42AM +0000, Al Viro wrote:
> On Sun, Nov 26, 2023 at 06:41:41PM +0000, Al Viro wrote:
>
> > d_invalidate() situation is more subtle - we need to sort out its interplay
> > with d_splice_alias().
> >
> > More concise variant of the scenario in question:
> > * we have /mnt/foo/bar and a lot of its descendents in dcache on client
> > * server does a rename, after which what used to be /mnt/foo/bar is /mnt/foo/baz
> > * somebody on the client does a lookup of /mnt/foo/bar and gets told by
> > the server that there's no directory with that name anymore.
> > * that somebody hits d_invalidate(), unhashes /mnt/foo/bar and starts
> > evicting its descendents
> > * We try to mount something on /mnt/foo/baz/blah. We look up baz, get
> > an fhandle and notice that there's a directory inode for it (/mnt/foo/bar).
> > d_splice_alias() picks the bugger and moves it to /mnt/foo/baz, rehashing
> > it in process, as it ought to. Then we find /mnt/foo/baz/blah in dcache and
> > mount on top of it.
> > * d_invalidate() finishes shrink_dcache_parent() and starts hunting for
> > submounts to dissolve. And finds the mount we'd done. Which mount quietly
> > disappears.
> >
> > Note that from the server POV the thing had been moved quite a while ago.
> > No server-side races involved - all it sees is a couple of LOOKUP in the
> > same directory, one for the old name, one for the new.
> >
> > On the client on the mounter side we have an uneventful mount on /mnt/foo/baz,
> > which had been there on server at the time we started and which remains in
> > place after mount we'd created suddenly disappears.
> >
> > For the thread that ended up calling d_invalidate(), they'd been doing e.g.
> > stat on a pathname that used to be there a while ago, but currently isn't.
> > They get -ENOENT and no indication that something odd might have happened.
> >
> > From ->d_revalidate() point of view there's also nothing odd happening -
> > dentry is not a mountpoint, it stays in place until we return and there's
> > no directory entry with that name in its parent. It's as clear-cut
> > as it gets - dentry is stale.
> >
> > The only overlap happening there is d_splice_alias() hitting in the middle
> > of already started d_invalidate().
> >
> > For a while I thought that ff17fa561a04 "d_invalidate(): unhash immediately"
> > and 3a8e3611e0ba "d_walk(): kill 'finish' callback" might have something
> > to do with it, but the same problem existed prior to that.
> >
> > FWIW, I suspect that the right answer would be along the lines of
> > * if d_splice_alias() does move an existing (attached) alias in
> > place, it ought to dissolve all mountpoints in subtree being moved.
> > There might be subtleties, but in case when that __d_unalias() happens
> > due to rename on server this is definitely the right thing to do.
> > * d_invalidate() should *NOT* do anything with dentry that
> > got moved (including moved by d_splice_alias()) from the place we'd
> > found it in dcache. At least d_invalidate() done due to having
> > ->d_revalidate() return 0.
> > * d_invalidate() should dissolve all mountpoints in the
> > subtree that existed when it got started (and found the victim
> > still unmoved, that is). It should (as it does) prevent any
> > new mountpoints added in that subtree, unless the mountpoint
> > to be had been moved (spliced) out. What it really shouldn't
> > do is touch the mountpoints that are currently outside of it
> > due to moves.
> >
> > I'm going to look around and see if we have any weird cases where
> > d_splice_alias() is used for things like "correct the case of
> > dentry name on a case-mangled filesystem" - that would presumably
> > not want to dissolve any submounts. I seem to recall seeing
> > some shite of that sort, but that was a long time ago.
> >
> > Eric, Miklos - it might be a good idea if you at least took a
> > look at whatever comes out of that (sub)thread; I'm trying to
> > reconstruct the picture, but the last round of serious reworking
> > of that area had been almost 10 years ago and your recollections
> > of the considerations back then might help. I realize that they
> > are probably rather fragmentary (mine definitely are) and any
> > analysis will need to be redone on the current tree, but...
>
> TBH, I wonder if we ought to have d_invalidate() variant that would
> unhash the dentry in question, do a variant of shrink_dcache_parent()
> that would report if there had been any mountpoints and if there
> had been any, do namespace_lock() and go hunting for mounts in that
> subtree, moving corresponding struct mountpoint to a private list
> as we go (removing them from mountpoint hash chains, that is). Then
> have them all evicted after we'd finished walking the subtree...

That sounds reasonable.

>
> The tricky part will be lock ordering - right now we have the
> mountpoint hash protected by mount_lock (same as mount hash, probably
> worth splitting anyway) and that nests outside of ->d_lock.
>
> Note that we don't do mountpoint hash lookups on mountpoint crossing
> - it's nowhere near the really hot paths. What we have is
> lookup_mountpoint() - plain hash lookup. Always
> under namespace_lock() and mount_lock.
> get_mountpoint() - there's an insertion into hash chain,
> with dentry passed through the d_set_mounted(), which would
> fail if we have d_invalidate() on the subtree.
> Also always under namespace_lock() and mount_lock.
> __put_mountpoint() - removal from the hash chain.
> We remove from hash chain after having cleared DCACHE_MOUNTED.
> _That_ can happen under mount_lock alone (taking out the stuck
> submounts on final mntput()).
>
> So convert the mountpoint hash chains to hlist_bl, bitlocks nesting under
> ->d_lock. Introduce a new dentry flag (DCACHE_MOUNT_INVALIDATION?)
> In d_walk() callback we would
> * do nothing if DCACHE_MOUNTED is not set or DCACHE_MOUNT_INVALIDATION
> is.
> * otherwise set DCACHE_MOUNT_INVALIDATION, grab the bitlock on the
> mountpoint hash chain matching that dentry, find struct mountpoint in it,
> remove it from the chain and insert into a separate "collector" chain, all
> without messing with refcount.

Ok.

> In lookup_mountpoint() and get_mountpoint() take the bitlock on chain.
> In __put_mountpoint(), once it has grabbed ->d_lock
> * check if it has DCACHE_MOUNT_INVALIDATION, use that to
> decide which chain we are locking - the normal one or the collector
> * clear both DCACHE_MOUNTED and DCACHE_MOUNT_INVALIDATION
> * remove from chain
> * unlock the chain
> * drop ->d_lock.
>
> Once we are finished walking the tree, go over the collector list
> and do what __detach_mount() guts do. We are no longer under
> any ->d_lock, so locking is not a problem. namespace_unlock() will
> flush them all, same as it does for __detach_mount().

Ok.
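
The quoted proposal can be sketched as a small userspace model. This is only an illustration under stated assumptions: `toy_dentry`, `toy_mountpoint`, `toy_chain` and the `TOY_*` flags are hypothetical stand-ins for the kernel structures, the chains are plain singly linked lists rather than hlist_bl, and no locking is modelled at all. It shows just the flag protocol: the walk callback moves a mountpoint to the collector chain once, and the put side uses the flag to pick the chain it removes from.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-ins; none of these are the kernel's real definitions. */
#define TOY_MOUNTED       0x1u  /* models DCACHE_MOUNTED */
#define TOY_INVALIDATION  0x2u  /* models the proposed new flag */

struct toy_dentry;

struct toy_mountpoint {
    struct toy_mountpoint *next;  /* link in whichever chain holds us */
    struct toy_dentry *dentry;
};

struct toy_dentry {
    unsigned flags;
    struct toy_mountpoint *mp;    /* shortcut replacing the hash lookup */
};

struct toy_chain {
    struct toy_mountpoint *head;  /* a real hlist_bl keeps a bitlock here */
};

static void chain_add(struct toy_chain *c, struct toy_mountpoint *mp)
{
    mp->next = c->head;
    c->head = mp;
}

static void chain_del(struct toy_chain *c, struct toy_mountpoint *mp)
{
    struct toy_mountpoint **pp = &c->head;

    while (*pp && *pp != mp)
        pp = &(*pp)->next;
    if (*pp) {
        *pp = mp->next;
        mp->next = NULL;
    }
}

/* Per-dentry step of the walk; returns 1 if a mountpoint was collected. */
static int collect_mountpoint(struct toy_dentry *d, struct toy_chain *hash,
                              struct toy_chain *collector)
{
    if (!(d->flags & TOY_MOUNTED) || (d->flags & TOY_INVALIDATION))
        return 0;                 /* not a mountpoint, or already collected */
    assert(d->mp != NULL);
    d->flags |= TOY_INVALIDATION;
    chain_del(hash, d->mp);       /* refcount deliberately untouched */
    chain_add(collector, d->mp);
    return 1;
}

/* Put side: the flag says which chain holds the entry; clear both flags. */
static void toy_put_mountpoint(struct toy_dentry *d, struct toy_chain *hash,
                               struct toy_chain *collector)
{
    struct toy_chain *c = (d->flags & TOY_INVALIDATION) ? collector : hash;

    d->flags &= ~(TOY_MOUNTED | TOY_INVALIDATION);
    chain_del(c, d->mp);
}
```

Running the walk step twice on the same dentry moves the entry only once, which is the property the second flag is there to provide.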

2023-11-27 17:26:16

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote:
> "Eric W. Biederman" <[email protected]> writes:
>
> > I am confused what is going on with ext4 and f2fs. I think they
> > are calling d_invalidate when all they need to call is d_drop.
>
> ext4 and f2fs are buggy in how they call d_invalidate, if I am reading
> the code correctly.
>
> d_invalidate calls detach_mounts.
>
> detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
> prevent races with new mounts.
>
> ext4 and f2fs (in their case insensitive code) are calling d_invalidate
> before dont_mount has been called to set D_CANT_MOUNT.

Not really - note that the place where we check cant_mount() is under
the lock on the mountpoint's inode, so anything inside ->unlink() or
->rmdir() is indistinguishable from the places where we do dont_mount()
in vfs_{unlink,rmdir}.

2023-11-27 18:19:52

by Eric W. Biederman

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

Al Viro <[email protected]> writes:

> On Mon, Nov 27, 2023 at 04:03:18PM +0000, Al Viro wrote:
>> On Mon, Nov 27, 2023 at 09:47:47AM -0600, Eric W. Biederman wrote:
>>
>> > There is a lot going on there. I remember one of the relevant
>> > restrictions was marking dentries dont_mount, and inodes S_DEAD
>> > in unlink and rmdir.
>> >
>> > But even without that marking if d_invalidate is called
>> > from d_revalidate the inode and all of its dentries must be
>> > dead because the inode is stale and must go. There should
>> > be no resurrecting it at that point.
>> >
>> > I suspect the most fruitful way to think of the d_invalidate vs
>> > d_splice_alias races is an unlink vs rename race.
>> >
>> > I don't think the mechanism matters, but deeply and fundamentally
>> > if we detect a directory inode is dead we need to stick with
>> > that decision and not attempt to resurrect it with d_splice_alias.
>>
>> Wrong. Deeply and fundamentally we detect a dentry that does not
>> match the directory contents according to the server.
>>
>> For example, due to rename done on server. With object in question
>> perfectly alive there - fhandle still works, etc.
>>
>> However, it's no longer where it used to be. And we would bloody better
>> not have lookups for the old name result in access to that object.
>> We also should never allow the access to *new* name lead to two live
>> dentries for the same directory inode.
>>
>> Again, this is not about rmdir() or unlink() - invalidation can happen
>> for object that is still open, still accessed and still very much alive.
>> Does that all the time for any filesystem with ->d_revalidate().
>
> Put another way, there used to be very odd song and dance in ->d_revalidate()
> instances along the lines of "we can't possibly tell the caller to invalidate
> a mountpoint"; it was racy in the best case and during the rewrite of
> d_invalidate() to teach it how to evict submounts those attempts had been
> dropped - ->d_revalidate() returning 0 does end up with mounts dissolved
> by d_invalidate() from caller.
>
> It always had been racy, starting with the checks that used to be in
> ->d_revalidate() instances way before all those changes. So the switch
> of d_invalidate() to dissolving submounts had been a step in the right
> direction, but it's not being careful enough.
>
> Again, it's about d_invalidate() caused by pathwalk running into a dentry that
> doesn't match the reality vs. d_splice_alias() finding that it matches the
> inode we had looked up elsewhere.

My point is we should have an atomic way to decide the disposition of
such a dentry, and its children.

Either we should decide it is useless and remove it and all of its
children.

Or we should decide it was renamed and just handle it that way.

If we can record such a decision on the dentry or possibly on the inode
then we can resolve the race by having it be a proper race of which
comes first.
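
A minimal userspace sketch of what "a proper race of which comes first" could look like, assuming a single disposition word per dentry. Everything here is hypothetical (`toy_dentry`, `enum disposition`, `claim()`); it uses C11 atomics rather than anything from the kernel, and it says nothing about how the rest of the subtree or the mounts would be handled.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical states; the kernel has no such field today. */
enum disposition { UNDECIDED, INVALIDATED, RENAMED };

struct toy_dentry {
    _Atomic int disposition;
};

/*
 * Try to record a decision. Returns true if the caller won the race and
 * its decision is now the recorded one, false if someone decided earlier.
 */
static bool claim(struct toy_dentry *d, enum disposition want)
{
    int expected = UNDECIDED;

    assert(want != UNDECIDED);
    return atomic_compare_exchange_strong(&d->disposition, &expected, want);
}
```

The invalidation path would call `claim(d, INVALIDATED)` and the splice path `claim(d, RENAMED)`; whichever loses has to back off and respect the recorded decision.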

It isn't a proper delete of the inode so anything messing with the inode
and marking it S_DEAD is probably wrong.

The code could do something like mark the dentry dont_mount which should
be enough for d_splice_alias to say oops, something is not proper
here. Let the d_invalidate do its thing.

Or the code could remove the dentry from inode->i_dentry and keep
d_splice_alias from finding it, and its children completely.
That is different from unhashing it.

Anyway that is my memory and my general sense of what is going on.
I hope it helps.

Eric

2023-11-27 18:26:57

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 05:25:44PM +0000, Al Viro wrote:
> On Mon, Nov 27, 2023 at 10:01:34AM -0600, Eric W. Biederman wrote:
> > "Eric W. Biederman" <[email protected]> writes:
> >
> > > I am confused what is going on with ext4 and f2fs. I think they
> > > are calling d_invalidate when all they need to call is d_drop.
> >
> > ext4 and f2fs are buggy in how they call d_invalidate, if I am reading
> > the code correctly.
> >
> > d_invalidate calls detach_mounts.
> >
> > detach_mounts relies on setting D_CANT_MOUNT on the top level dentry to
> > prevent races with new mounts.
> >
> > ext4 and f2fs (in their case insensitive code) are calling d_invalidate
> > before dont_mount has been called to set D_CANT_MOUNT.
>
> Not really - note that the place where we check cant_mount() is under
> the lock on the mountpoint's inode, so anything inside ->unlink() or
> ->rmdir() is indistinguishable from the places where we do dont_mount()
> in vfs_{unlink,rmdir}.

That said, we could simply use d_drop() in those, since the caller will
take care of mount eviction - we have ->unlink() or ->rmdir() returning
success, after all.

The same goes for xfs caller and for cifs_prime_dcache() (in the latter
case we have just checked that the sucker is negative, so d_invalidate()
and d_drop() are doing the same thing).

2023-11-27 18:44:26

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 12:19:09PM -0600, Eric W. Biederman wrote:

> Either we should decide it is useless and remove it and all of its
> children.
>
> Or we should decide it was renamed and just handle it that way.

How? An extra roundtrip to server trying to do getattr on the fhandle
we've got?

Cost of that aside, we *still* need to dissolve submounts in such case;
there is no warranty that we'll ever guess the new name and no way
to ask the server for one, so we can't let them sit around. Not that
having mounts (local by definition) suddenly show up in the unexpected
place because of rename on server looks like a good thing, especially
since had that rename on server been done as cp -rl + rm -rf the same
mounts would be gone...

> If we can record such a decision on the dentry or possibly on the inode
> then we can resolve the race by having it be a proper race of which
> comes first.
>
> It isn't a proper delete of the inode so anything messing with the inode
> and marking it S_DEAD is probably wrong.

s/probably/certainly/, but where would d_invalidate() do such a thing?
It's none of its business...

> The code could do something like mark the dentry dont_mount which should
> be enough for d_splice_alias to say oops, something is not proper
> here. Let the d_invalidate do its thing.
>
> Or the code could remove the dentry from inode->i_dentry and keep
> d_splice_alias from finding it, and its children completely.
> That is different from unhashing it.

We might be just in the middle of getdents(2) on the directory in question.
It can be opened; we can't do anything that destructive there.

Again, it's about the d_invalidate() on ->d_revalidate() reporting 0;
uses like proc_invalidate_siblings_dcache() are separate story, simply
because there d_splice_alias() is not going to move anything anywhere.

2023-11-29 04:53:30

by Al Viro

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote:

> > FWIW, I suspect that the right answer would be along the lines of
> > * if d_splice_alias() does move an existing (attached) alias in
> > place, it ought to dissolve all mountpoints in subtree being moved.
> > There might be subtleties,

Are there ever... Starting with the "our test for loop creation
(alias is a direct ancestor, need to fail with -ELOOP) is dependent
upon rename_lock being held all along".

Folks, what semantics do we want for dissolving mounts on splice?
The situation when it happens is when we have a subtree on e.g. NFS
and have some mounts (on client) inside that. Then somebody on
server moves the root of that subtree somewhere else and we try
to do a lookup in new place. Options:

1) our dentry for directory that got moved on server is moved into
new place, along with the entire subtree *and* everything mounted
on it. Very dubious semantics, especially since if we look the
old location up before looking for new one, the mounts will be
dissolved; no way around that.

2) lookup fails. It's already possible; e.g. if server has
/srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved
to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3
doing a lookup for "y", there's no way in hell to handle that -
the lookup will return the fhandle of /srv/nfs/x, which is the
same thing the client has for /mnt/nfs/1/2; we *can't* move that
dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop.
We can also run into -ESTALE if one of the trylocks in
__d_unalias() fails. Having the same happen if there are mounts
in the subtree we are trying to splice would be unpleasant, but
not fatal. The trouble is, that won't be a transient failure -
not until somebody tries to look the old location up.

3) dissolve the mounts. Doable, but it's not easy; especially
since we end up having to redo the loop-prevention check after
the mounts had been dissolved. And that check may be failing
by that time, with no way to undo that dissolving...
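
The loop-prevention test mentioned in option 2 can be illustrated with a toy parent-pointer tree. This is a sketch only: `toy_dentry` and both helpers are hypothetical names, there is no rename_lock or any other locking, and the real check in the dcache is not reproduced here. It shows why the dentry for /mnt/nfs/1/2 cannot be moved to /mnt/nfs/1/2/3/y: it is an ancestor of the directory it would be placed in.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy dentry: only the parent link matters; the root is its own parent. */
struct toy_dentry {
    struct toy_dentry *parent;
};

/* True if 'anc' is a proper ancestor of 'd'. */
static bool toy_is_ancestor(const struct toy_dentry *anc,
                            const struct toy_dentry *d)
{
    assert(anc != NULL && d != NULL);
    while (d->parent != d) {      /* stop once we reach the root */
        d = d->parent;
        if (d == anc)
            return true;
    }
    return false;
}

/*
 * Moving 'alias' under 'new_parent' would detach a loop if 'alias' is
 * 'new_parent' itself or sits above it; such a lookup has to fail.
 */
static bool toy_splice_would_loop(const struct toy_dentry *alias,
                                  const struct toy_dentry *new_parent)
{
    return alias == new_parent || toy_is_ancestor(alias, new_parent);
}
```

With dentries for /mnt/nfs, 1, 2 and 3 chained by parent pointers, `toy_splice_would_loop(&d2, &d3)` is true, matching the "y" lookup from cwd /mnt/nfs/1/2/3 described above. Since the answer depends on the tree not changing underneath the walk, redoing it after dropping locks to dissolve mounts is where option 3 gets hard.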

2023-11-29 10:22:02

by Christian Brauner

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

> 2) lookup fails. It's already possible; e.g. if server has

I think that's the sanest option. The other options seem even less
intuitive.

> not fatal. The trouble is, that won't be a transient failure -
> not until somebody tries to look the old location up.

Eh, nfs semantics are quite special anyway already. I'd rather have that
in lookup than more magic involving moving mounts around or having them
disappear (Yes, we have the detach semantics on removal but that's
different.).

2023-11-29 15:20:04

by Eric W. Biederman

Subject: Re: fun with d_invalidate() vs. d_splice_alias() was Re: [f2fs-dev] [PATCH v6 0/9] Support negative dentries on case-insensitive ext4 and f2fs

Al Viro <[email protected]> writes:

> On Mon, Nov 27, 2023 at 06:38:43AM +0000, Al Viro wrote:
>
>> > FWIW, I suspect that the right answer would be along the lines of
>> > * if d_splice_alias() does move an existing (attached) alias in
>> > place, it ought to dissolve all mountpoints in subtree being moved.
>> > There might be subtleties,
>
> Are there ever... Starting with the "our test for loop creation
> (alias is a direct ancestor, need to fail with -ELOOP) is dependent
> upon rename_lock being held all along".
>
> Folks, what semantics do we want for dissolving mounts on splice?
> The situation when it happens is when we have a subtree on e.g. NFS
> and have some mounts (on client) inside that. Then somebody on
> server moves the root of that subtree somewhere else and we try
> to do a lookup in new place. Options:
>
> 1) our dentry for directory that got moved on server is moved into
> new place, along with the entire subtree *and* everything mounted
> on it. Very dubious semantics, especially since if we look the
> old location up before looking for new one, the mounts will be
> dissolved; no way around that.
>
> 2) lookup fails. It's already possible; e.g. if server has
> /srv/nfs/1/2/3 moved to /srv/nfs/x, then /srv/nfs/1/2 moved
> to /srv/nfs/x/y and client has a process with cwd in /mnt/nfs/1/2/3
> doing a lookup for "y", there's no way in hell to handle that -
> the lookup will return the fhandle of /srv/nfs/x, which is the
> same thing the client has for /mnt/nfs/1/2; we *can't* move that
> dentry to /mnt/nfs/1/2/3/y - not without creating a detached loop.
> We can also run into -ESTALE if one of the trylocks in
> __d_unalias() fails. Having the same happen if there are mounts
> in the subtree we are trying to splice would be unpleasant, but
> not fatal. The trouble is, that won't be a transient failure -
> not until somebody tries to look the old location up.
>
> 3) dissolve the mounts. Doable, but it's not easy; especially
> since we end up having to redo the loop-prevention check after
> the mounts had been dissolved. And that check may be failing
> by that time, with no way to undo that dissolving...

To be clear this is a change in current semantics and has a minuscule
chance of resulting in a regression. That should be called out in the
change log.

If we choose to change the semantics I would suggest that the new
semantics be:

If a different name for a directory already exists:
* Detach the mounts unconditionally (leaving dentry descendants alone).
* Attempt the current splice.
- If the splice succeeds ( return the new dentry )
- If the splice fails ( fail the lookup, and d_invalidate the existing name )

Unconditionally dissolving the mounts before attempting the rename
should simplify everything.

In the worst case a race between d_invalidate and d_splice_alias will
now become a race to see who can detach the mounts first.
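
The ordering suggested above can be sketched as a toy state machine. The names (`toy_alias`, `toy_splice_existing()`) are hypothetical, and `splice_possible` merely stands in for whatever makes the real splice succeed or fail (trylocks, the loop check); nothing here is kernel API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model of an existing alias for a directory inode. */
struct toy_alias {
    int  nr_mounts;        /* mounts somewhere in its subtree */
    bool hashed;           /* still reachable under its old name */
    bool splice_possible;  /* stand-in for trylocks and the loop check */
};

enum toy_result { TOY_SPLICED, TOY_LOOKUP_FAILED };

static enum toy_result toy_splice_existing(struct toy_alias *a)
{
    assert(a != NULL);

    /* Step 1: detach the mounts unconditionally, descendants untouched. */
    a->nr_mounts = 0;

    /* Step 2: attempt the splice as it is done today. */
    if (a->splice_possible)
        return TOY_SPLICED;

    /* Step 3: splice failed, so fail the lookup and drop the old name. */
    a->hashed = false;
    return TOY_LOOKUP_FAILED;
}
```

Either way the mounts are gone by the time the function returns, which is why a concurrent d_invalidate() can only race over who detaches them first.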

Eric