2016-09-20 01:45:03

by Oleg Drokin

Subject: Revalidate failure leads to unmount (was: Mountpoints disappearing from namespace unexpectedly.)

Hello!

I think I have found an interesting condition for filesystems that have a
revalidate op and I am not quite sure this is really what we want?

Basically it all started with mountpoints randomly getting unmounted during
testing that I could not quite explain (see my quoted message at the end).

Now I finally caught the culprit and it's lookup_dcache calling d_invalidate
that in turn detaches all mountpoints on the entire subtree like this:

Breakpoint 1, umount_tree (mnt=<optimized out>, how=<optimized out>)
at /home/green/bk/linux-test/fs/namespace.c:1441
1441 umount_mnt(p);
(gdb) bt
#0 umount_tree (mnt=<optimized out>, how=<optimized out>)
at /home/green/bk/linux-test/fs/namespace.c:1441
#1 0xffffffff8129ec82 in __detach_mounts (dentry=<optimized out>)
at /home/green/bk/linux-test/fs/namespace.c:1572
#2 0xffffffff8129359e in detach_mounts (dentry=<optimized out>)
at /home/green/bk/linux-test/fs/mount.h:100
#3 d_invalidate (dentry=0xffff8800ab38feb0)
at /home/green/bk/linux-test/fs/dcache.c:1534
#4 0xffffffff8128122c in lookup_dcache (name=<optimized out>,
dir=<optimized out>, flags=1536)
at /home/green/bk/linux-test/fs/namei.c:1485
#5 0xffffffff81281d92 in __lookup_hash (name=0xffff88005c1a3eb8,
base=0xffff8800a8609eb0, flags=1536)
at /home/green/bk/linux-test/fs/namei.c:1522
#6 0xffffffff81288196 in filename_create (dfd=<optimized out>,
name=0xffff88006d3e7000, path=0xffff88005c1a3f08,
lookup_flags=<optimized out>) at /home/green/bk/linux-test/fs/namei.c:3604
#7 0xffffffff812891f1 in user_path_create (lookup_flags=<optimized out>,
path=<optimized out>, pathname=<optimized out>, dfd=<optimized out>)
at /home/green/bk/linux-test/fs/namei.c:3661
#8 SYSC_mkdirat (mode=511, pathname=<optimized out>, dfd=<optimized out>)
at /home/green/bk/linux-test/fs/namei.c:3793
#9 SyS_mkdirat (mode=<optimized out>, pathname=<optimized out>,
dfd=<optimized out>) at /home/green/bk/linux-test/fs/namei.c:3785
#10 SYSC_mkdir (mode=<optimized out>, pathname=<optimized out>)
at /home/green/bk/linux-test/fs/namei.c:3812
#11 SyS_mkdir (pathname=-2115143072, mode=<optimized out>)
at /home/green/bk/linux-test/fs/namei.c:3810
#12 0xffffffff8189f03c in entry_SYSCALL_64_fastpath ()
at /home/green/bk/linux-test/arch/x86/entry/entry_64.S:207

While I imagine the original idea was "cannot revalidate? Nuke the whole
tree from orbit", the possible reasons for "why can't we revalidate" were not considered.
In my case it appears that if a bunch of scripts is killed at just the right time,
while they are in the middle of revalidating some path component that has
mountpoints below it, the whole thing gets nuked (somewhat) unexpectedly, because
the nfs/sunrpc code notices the signal and returns ERESTARTSYS in the middle of the lookup.
(I imagine this could even be exploitable in some setups, since it allows an
unprivileged user to unmount anything mounted on top of nfs.)

It's even worse for Lustre, for example, because Lustre never tries to actually
re-lookup anything anymore (that brought a bunch of complexities with it,
so we were glad to get rid of it) and just returns, whether the name is
valid or not, hoping for a retry the next time around.

So this brings up the question:
Is revalidate really required to go to great lengths to avoid returning 0
unless the underlying name has really, truly changed? My reading
of the documentation does not seem to match this, as the whole LOOKUP_REVAL logic
then becomes more or less redundant.

Or is totally nuking the whole underlying tree a little over the top, and
could it be replaced with something less drastic? After all, a following re-lookup
could restore the dentries, but unmounts are not really reversible.

Thanks.

Bye,
Oleg
On Sep 5, 2016, at 12:45 PM, Oleg Drokin wrote:

> Hello!
>
> I am seeing a strange phenomenon here that I have not been able to completely figure
> out and perhaps it might ring some bells for somebody else.
>
> I first noticed this in 4.6-rc testing in early June, but just hit it in a similar
> way in 4.8-rc5
>
> Basically I have a test script that does a bunch of stuff in a limited namespace
> in three related namespaces (the backend is the same, the mountpoints are separate).
>
> When a process (a process group or something) is killed, sometimes one of the
> mountpoints disappears from the namespace completely, even though the scripts
> themselves do not unmount anything.
>
> No traces of the mountpoint anywhere in /proc (including /proc/*/mounts), so it seems
> it's not in any private namespace of any of the processes either.
>
> The filesystems are a locally mounted ext4 (loopback-backed) plus 2 nfs mounts
> (the ext4 re-exported).
> In the past it was always the ext4 mount that was dropping, but today I got one of
> the nfs ones.
>
> Sequence looks like this:
> + mount /tmp/loop /mnt/lustre -o loop
> + mkdir /mnt/lustre/racer
> mkdir: cannot create directory '/mnt/lustre/racer': File exists
> + service nfs-server start
> Redirecting to /bin/systemctl start nfs-server.service
> + mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock
> + mount localhost:/ /mnt/nfs2 -t nfs4
> + DURATION=3600
> + sh racer.sh /mnt/nfs/racer
> + DURATION=3600
> + sh racer.sh /mnt/nfs2/racer
> + wait %1 %2 %3
> + DURATION=3600
> + sh racer.sh /mnt/lustre/racer
> Running racer.sh for 3600 seconds. CTRL-C to exit
> Running racer.sh for 3600 seconds. CTRL-C to exit
> Running racer.sh for 3600 seconds. CTRL-C to exit
> ./file_exec.sh: line 12: 216042 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
> ./file_exec.sh: line 12: 229086 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
> ./file_exec.sh: line 12: 230134 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
> ./file_exec.sh: line 12: 235154 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
> ./file_exec.sh: line 12: 270951 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
> racer cleanup
> racer cleanup
> racer cleanup
> sleeping 5 sec ...
> sleeping 5 sec ...
> sleeping 5 sec ...
> file_create.sh: no process found
> file_create.sh: no process found
> dir_create.sh: no process found
> file_create.sh: no process found
> dir_create.sh: no process found
> file_rm.sh: no process found
> dir_create.sh: no process found
> file_rm.sh: no process found
> file_rename.sh: no process found
> file_rm.sh: no process found
> file_rename.sh: no process found
> file_link.sh: no process found
> file_rename.sh: no process found
> file_link.sh: no process found
> file_symlink.sh: no process found
> file_link.sh: no process found
> file_symlink.sh: no process found
> file_list.sh: no process found
> file_list.sh: no process found
> file_symlink.sh: no process found
> file_concat.sh: no process found
> file_concat.sh: no process found
> file_list.sh: no process found
> file_exec.sh: no process found
> file_concat.sh: no process found
> file_exec.sh: no process found
> file_chown.sh: no process found
> file_exec.sh: no process found
> file_chown.sh: no process found
> file_chmod.sh: no process found
> file_chown.sh: no process found
> file_chmod.sh: no process found
> file_mknod.sh: no process found
> file_chmod.sh: no process found
> file_mknod.sh: no process found
> file_truncate.sh: no process found
> file_mknod.sh: no process found
> file_delxattr.sh: no process found
> file_truncate.sh: no process found
> file_truncate.sh: no process found
> file_getxattr.sh: no process found
> file_delxattr.sh: no process found
> file_delxattr.sh: no process found
> file_setxattr.sh: no process found
> there should be NO racer processes:
> file_getxattr.sh: no process found
> file_getxattr.sh: no process found
> file_setxattr.sh: no process found
> there should be NO racer processes:
> file_setxattr.sh: no process found
> there should be NO racer processes:
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
> df: /mnt/nfs/racer: No such file or directory
> Filesystem 1K-blocks Used Available Use% Mounted on
> /dev/loop0 999320 46376 884132 5% /mnt/lustre
> We survived racer.sh for 3600 seconds.
> Filesystem 1K-blocks Used Available Use% Mounted on
> localhost:/ 999424 46080 884224 5% /mnt/nfs2
> We survived racer.sh for 3600 seconds.
> + umount /mnt/nfs
> umount: /mnt/nfs: not mounted
> + exit 5
>
> Now you see in the middle of that /mnt/nfs suddenly disappeared.
>
> The racer scripts are at
> http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer
> There's absolutely no unmounts in there.
>
> In the past I was just able to do the three racers in parallel, wait ~10 minutes and
> then kill all three of them and with significant probability the ext4 mountpoint would
> disappear.
>
> Any idea on how to better pinpoint this?
>
> Thanks.
>
> Bye,
> Oleg



2016-12-06 02:01:01

by Al Viro

Subject: Re: Revalidate failure leads to unmount

On Mon, Dec 05, 2016 at 08:39:15PM -0500, Oleg Drokin wrote:
> > Basically it all started with mountpoints randomly getting unmounted during
> > testing that I could not quite explain (see my quoted message at the end).
> >
> > Now I finally caught the culprit and it's lookup_dcache calling d_invalidate
> > that in turn detaches all mountpoints on the entire subtree like this:

Yes, it does.

> > While I imagine the original idea was "cannot revalidate? Nuke the whole
> > tree from orbit", the possible reasons for "why can't we revalidate" were not considered.

What would you do instead?

> > So this brings up the question:
> > Is revalidate really required to go to great lengths to avoid returning 0
> > unless the underlying name has really, truly changed? My reading
> > of the documentation does not seem to match this, as the whole LOOKUP_REVAL logic
> > then becomes more or less redundant.

LOOKUP_REVAL is about avoiding false _positives_ on revalidation - i.e. if
you have several layers of actually stale entries in dcache and notice only
when you try to do lookup in the last one, with server telling you to fuck
off, your only hope is to apply full-strength revalidation from the very
beginning. Again, the problem it tries to avoid is over-optimistic fs
assuming that directories are valid without asking the server.

> > Or is totally nuking the whole underlying tree a little over the top, and
> > could it be replaced with something less drastic? After all, a following re-lookup
> > could restore the dentries, but unmounts are not really reversible.

Like what? Seriously, what would you do in such situation? Leave the
damn thing unreachable (and thus impossible to unmount)? Suppose the
/mnt/foo really had been removed (along with everything under it) on
the server. You had something mounted on /mnt/foo/bar/baz; what should
the kernel do?

2016-12-06 02:14:17

by Oleg Drokin

Subject: Re: Revalidate failure leads to unmount

This is still happening in 4.9-rc8, and I still think this is kind of wrong.
Is there a deeper reason why behavior like this is OK?

On Sep 19, 2016, at 9:44 PM, Oleg Drokin wrote:

> Hello!
>
> I think I have found an interesting condition for filesystems that have a
> revalidate op and I am not quite sure this is really what we want?
>
> Basically it all started with mountpoints randomly getting unmounted during
> testing that I could not quite explain (see my quoted message at the end).
>
> Now I finally caught the culprit and it's lookup_dcache calling d_invalidate
> that in turn detaches all mountpoints on the entire subtree like this:
>
> Breakpoint 1, umount_tree (mnt=<optimized out>, how=<optimized out>)
> at /home/green/bk/linux-test/fs/namespace.c:1441
> 1441 umount_mnt(p);
> (gdb) bt
> #0 umount_tree (mnt=<optimized out>, how=<optimized out>)
> at /home/green/bk/linux-test/fs/namespace.c:1441
> #1 0xffffffff8129ec82 in __detach_mounts (dentry=<optimized out>)
> at /home/green/bk/linux-test/fs/namespace.c:1572
> #2 0xffffffff8129359e in detach_mounts (dentry=<optimized out>)
> at /home/green/bk/linux-test/fs/mount.h:100
> #3 d_invalidate (dentry=0xffff8800ab38feb0)
> at /home/green/bk/linux-test/fs/dcache.c:1534
> #4 0xffffffff8128122c in lookup_dcache (name=<optimized out>,
> dir=<optimized out>, flags=1536)
> at /home/green/bk/linux-test/fs/namei.c:1485
> #5 0xffffffff81281d92 in __lookup_hash (name=0xffff88005c1a3eb8,
> base=0xffff8800a8609eb0, flags=1536)
> at /home/green/bk/linux-test/fs/namei.c:1522
> #6 0xffffffff81288196 in filename_create (dfd=<optimized out>,
> name=0xffff88006d3e7000, path=0xffff88005c1a3f08,
> lookup_flags=<optimized out>) at /home/green/bk/linux-test/fs/namei.c:3604
> #7 0xffffffff812891f1 in user_path_create (lookup_flags=<optimized out>,
> path=<optimized out>, pathname=<optimized out>, dfd=<optimized out>)
> at /home/green/bk/linux-test/fs/namei.c:3661
> #8 SYSC_mkdirat (mode=511, pathname=<optimized out>, dfd=<optimized out>)
> at /home/green/bk/linux-test/fs/namei.c:3793
> #9 SyS_mkdirat (mode=<optimized out>, pathname=<optimized out>,
> dfd=<optimized out>) at /home/green/bk/linux-test/fs/namei.c:3785
> #10 SYSC_mkdir (mode=<optimized out>, pathname=<optimized out>)
> at /home/green/bk/linux-test/fs/namei.c:3812
> #11 SyS_mkdir (pathname=-2115143072, mode=<optimized out>)
> at /home/green/bk/linux-test/fs/namei.c:3810
> #12 0xffffffff8189f03c in entry_SYSCALL_64_fastpath ()
> at /home/green/bk/linux-test/arch/x86/entry/entry_64.S:207
>
> While I imagine the original idea was "cannot revalidate? Nuke the whole
> tree from orbit", the possible reasons for "why can't we revalidate" were not considered.
> In my case it appears that if a bunch of scripts is killed at just the right time,
> while they are in the middle of revalidating some path component that has
> mountpoints below it, the whole thing gets nuked (somewhat) unexpectedly, because
> the nfs/sunrpc code notices the signal and returns ERESTARTSYS in the middle of the lookup.
> (I imagine this could even be exploitable in some setups, since it allows an
> unprivileged user to unmount anything mounted on top of nfs.)
>
> It's even worse for Lustre, for example, because Lustre never tries to actually
> re-lookup anything anymore (that brought a bunch of complexities with it,
> so we were glad to get rid of it) and just returns, whether the name is
> valid or not, hoping for a retry the next time around.
>
> So this brings up the question:
> Is revalidate really required to go to great lengths to avoid returning 0
> unless the underlying name has really, truly changed? My reading
> of the documentation does not seem to match this, as the whole LOOKUP_REVAL logic
> then becomes more or less redundant.
>
> Or is totally nuking the whole underlying tree a little over the top, and
> could it be replaced with something less drastic? After all, a following re-lookup
> could restore the dentries, but unmounts are not really reversible.
>
> Thanks.
>
> Bye,
> Oleg
> On Sep 5, 2016, at 12:45 PM, Oleg Drokin wrote:
>
>> Hello!
>>
>> I am seeing a strange phenomenon here that I have not been able to completely figure
>> out and perhaps it might ring some bells for somebody else.
>>
>> I first noticed this in 4.6-rc testing in early June, but just hit it in a similar
>> way in 4.8-rc5
>>
>> Basically I have a test script that does a bunch of stuff in a limited namespace
>> in three related namespaces (the backend is the same, the mountpoints are separate).
>>
>> When a process (a process group or something) is killed, sometimes one of the
>> mountpoints disappears from the namespace completely, even though the scripts
>> themselves do not unmount anything.
>>
>> No traces of the mountpoint anywhere in /proc (including /proc/*/mounts), so it seems
>> it's not in any private namespace of any of the processes either.
>>
>> The filesystems are a locally mounted ext4 (loopback-backed) plus 2 nfs mounts
>> (the ext4 re-exported).
>> In the past it was always the ext4 mount that was dropping, but today I got one of
>> the nfs ones.
>>
>> Sequence looks like this:
>> + mount /tmp/loop /mnt/lustre -o loop
>> + mkdir /mnt/lustre/racer
>> mkdir: cannot create directory '/mnt/lustre/racer': File exists
>> + service nfs-server start
>> Redirecting to /bin/systemctl start nfs-server.service
>> + mount localhost:/mnt/lustre /mnt/nfs -t nfs -o nolock
>> + mount localhost:/ /mnt/nfs2 -t nfs4
>> + DURATION=3600
>> + sh racer.sh /mnt/nfs/racer
>> + DURATION=3600
>> + sh racer.sh /mnt/nfs2/racer
>> + wait %1 %2 %3
>> + DURATION=3600
>> + sh racer.sh /mnt/lustre/racer
>> Running racer.sh for 3600 seconds. CTRL-C to exit
>> Running racer.sh for 3600 seconds. CTRL-C to exit
>> Running racer.sh for 3600 seconds. CTRL-C to exit
>> ./file_exec.sh: line 12: 216042 Bus error $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
>> ./file_exec.sh: line 12: 229086 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
>> ./file_exec.sh: line 12: 230134 Segmentation fault $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
>> ./file_exec.sh: line 12: 235154 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
>> ./file_exec.sh: line 12: 270951 Segmentation fault (core dumped) $DIR/$file 0.$((RANDOM % 5 + 1)) 2> /dev/null
>> racer cleanup
>> racer cleanup
>> racer cleanup
>> sleeping 5 sec ...
>> sleeping 5 sec ...
>> sleeping 5 sec ...
>> file_create.sh: no process found
>> file_create.sh: no process found
>> dir_create.sh: no process found
>> file_create.sh: no process found
>> dir_create.sh: no process found
>> file_rm.sh: no process found
>> dir_create.sh: no process found
>> file_rm.sh: no process found
>> file_rename.sh: no process found
>> file_rm.sh: no process found
>> file_rename.sh: no process found
>> file_link.sh: no process found
>> file_rename.sh: no process found
>> file_link.sh: no process found
>> file_symlink.sh: no process found
>> file_link.sh: no process found
>> file_symlink.sh: no process found
>> file_list.sh: no process found
>> file_list.sh: no process found
>> file_symlink.sh: no process found
>> file_concat.sh: no process found
>> file_concat.sh: no process found
>> file_list.sh: no process found
>> file_exec.sh: no process found
>> file_concat.sh: no process found
>> file_exec.sh: no process found
>> file_chown.sh: no process found
>> file_exec.sh: no process found
>> file_chown.sh: no process found
>> file_chmod.sh: no process found
>> file_chown.sh: no process found
>> file_chmod.sh: no process found
>> file_mknod.sh: no process found
>> file_chmod.sh: no process found
>> file_mknod.sh: no process found
>> file_truncate.sh: no process found
>> file_mknod.sh: no process found
>> file_delxattr.sh: no process found
>> file_truncate.sh: no process found
>> file_truncate.sh: no process found
>> file_getxattr.sh: no process found
>> file_delxattr.sh: no process found
>> file_delxattr.sh: no process found
>> file_setxattr.sh: no process found
>> there should be NO racer processes:
>> file_getxattr.sh: no process found
>> file_getxattr.sh: no process found
>> file_setxattr.sh: no process found
>> there should be NO racer processes:
>> file_setxattr.sh: no process found
>> there should be NO racer processes:
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
>> df: /mnt/nfs/racer: No such file or directory
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> /dev/loop0 999320 46376 884132 5% /mnt/lustre
>> We survived racer.sh for 3600 seconds.
>> Filesystem 1K-blocks Used Available Use% Mounted on
>> localhost:/ 999424 46080 884224 5% /mnt/nfs2
>> We survived racer.sh for 3600 seconds.
>> + umount /mnt/nfs
>> umount: /mnt/nfs: not mounted
>> + exit 5
>>
>> Now you see in the middle of that /mnt/nfs suddenly disappeared.
>>
>> The racer scripts are at
>> http://git.whamcloud.com/fs/lustre-release.git/tree/refs/heads/master:/lustre/tests/racer
>> There's absolutely no unmounts in there.
>>
>> In the past I was just able to do the three racers in parallel, wait ~10 minutes and
>> then kill all three of them and with significant probability the ext4 mountpoint would
>> disappear.
>>
>> Any idea on how to better pinpoint this?
>>
>> Thanks.
>>
>> Bye,
>> Oleg
>


2016-12-06 02:23:45

by Oleg Drokin

Subject: Re: Revalidate failure leads to unmount


On Dec 5, 2016, at 9:00 PM, Al Viro wrote:

> On Mon, Dec 05, 2016 at 08:39:15PM -0500, Oleg Drokin wrote:
>>> Basically it all started with mountpoints randomly getting unmounted during
>>> testing that I could not quite explain (see my quoted message at the end).
>>>
>>> Now I finally caught the culprit and it's lookup_dcache calling d_invalidate
>>> that in turn detaches all mountpoints on the entire subtree like this:
>
> Yes, it does.
>
>>> While I imagine the original idea was "cannot revalidate? Nuke the whole
>>> tree from orbit", the possible reasons for "why can't we revalidate" were not considered.
>
> What would you do instead?

Retry? Not always, of course, but if it was EINTR, why not?
Sure, it needs some more logic to actually propagate those codes, or perhaps
revalidate itself needs to be smarter and not fail in such cases.
Or is this something that you think should be handled wholly within the filesystem,
and as such in this case it's just an nfs bug?

>>> So this brings up the question:
>>> Is revalidate really required to go to great lengths to avoid returning 0
>>> unless the underlying name has really, truly changed? My reading
>>> of the documentation does not seem to match this, as the whole LOOKUP_REVAL logic
>>> then becomes more or less redundant.
>
> LOOKUP_REVAL is about avoiding false _positives_ on revalidation - i.e. if
> you have several layers of actually stale entries in dcache and notice only
> when you try to do lookup in the last one, with server telling you to fuck
> off, your only hope is to apply full-strength revalidation from the very
> beginning. Again, the problem it tries to avoid is over-optimistic fs
> assuming that directories are valid without asking the server.

Right, but in this case it's not the server telling us off; at least, not that
we know of at that point.

>>> Or is totally nuking the whole underlying tree a little over the top, and
>>> could it be replaced with something less drastic? After all, a following re-lookup
>>> could restore the dentries, but unmounts are not really reversible.
>
> Like what? Seriously, what would you do in such situation? Leave the
> damn thing unreachable (and thus impossible to unmount)? Suppose the
> /mnt/foo really had been removed (along with everything under it) on
> the server. You had something mounted on /mnt/foo/bar/baz; what should
> the kernel do?

Well, if *I* ended up in this situation, I'd probably just recreate the missing
path and then do the umount (ESTALE galore?) ;)
(Of course there are other, less sane approaches, like pinning the whole path until
the unmount happens; that's likely rife with a lot of other gotchas, but
there's a limited version of this already: if I have a /mnt/foo mountpoint
and I delete /mnt/foo on the server, nobody would notice, because we pin
the foo part already and all accesses go to the filesystem mounted on top.)
But sure, when stuff is really missing, unmounting the subtrees looks like a very
sensible thing to do.
It's just that I suspect revalidate for a network filesystem is more than just
"valid" and "invalid"; there's a third option of "I don't know, ask me later"
(because the server is busy, down for a moment, or whatever), and there's
at least some value in being able to interrupt a process that's stuck on a network
mountpoint without killing the whole thing under it, no?


2016-12-06 05:02:57

by Al Viro

Subject: Re: Revalidate failure leads to unmount

On Mon, Dec 05, 2016 at 09:22:47PM -0500, Oleg Drokin wrote:

> Retry? Not always, of course, but if it was EINTR, why not?
> Sure, it needs some more logic to actually propagate those codes, or perhaps
> revalidate itself needs to be smarter and not fail in such cases.
> Or is this something that you think should be handled wholly within the filesystem,
> and as such in this case it's just an nfs bug?

Umm... Might be doable, but then there's a nasty question - what if that
happens from umount(2) itself?

> > Like what? Seriously, what would you do in such situation? Leave the
> > damn thing unreachable (and thus impossible to unmount)? Suppose the
> > /mnt/foo really had been removed (along with everything under it) on
> > the server. You had something mounted on /mnt/foo/bar/baz; what should
> > the kernel do?
>
> Well, if *I* ended up in this situation, I'd probably just recreate the missing
> path and then do the umount (ESTALE galore?) ;)
> (Of course there are other, less sane approaches, like pinning the whole path until
> the unmount happens; that's likely rife with a lot of other gotchas, but
> there's a limited version of this already: if I have a /mnt/foo mountpoint
> and I delete /mnt/foo on the server, nobody would notice, because we pin
> the foo part already and all accesses go to the filesystem mounted on top.)

Try it...

> But sure, when stuff is really missing, unmounting the subtrees looks like a very
> sensible thing to do.
> It's just that I suspect revalidate for a network filesystem is more than just
> "valid" and "invalid"; there's a third option of "I don't know, ask me later"
> (because the server is busy, down for a moment, or whatever), and there's
> at least some value in being able to interrupt a process that's stuck on a network
> mountpoint without killing the whole thing under it, no?

It's actually even more interesting - some form of delaying invalidation
might very well be a good thing, *if* we had a way to unhash the sucker
and have it fall through into lookup. With invalidation happening only
if lookup has returned something other than the object we'd just unhashed.
Then e.g. NFS could bail out in all cases when it would have to talk to
server and let the regular lookups do the work. However, right now that
only works for directories - for regular files we just get a new alias and
that's it. If something had been bound on top of the old one, we would lose
it. And turning that check into "new dentry is an alias of what we'd
unhashed" is a bad idea - it's already been hashed by us, so we'd have
a window when dcache lookup would've picked that new alias.

In that respect irregularities in Lustre become very interesting. What if
we taught d_splice_alias() to look for _exact_ unhashed alias (same parent,
same name) in case of non-directories and did "rehash and return that
alias, dropping inode reference" if one has been found? Could we get rid
of the weird dcache games in Lustre that way?

2016-12-06 05:45:25

by Oleg Drokin

Subject: Re: Revalidate failure leads to unmount


On Dec 6, 2016, at 12:02 AM, Al Viro wrote:

> On Mon, Dec 05, 2016 at 09:22:47PM -0500, Oleg Drokin wrote:
>
>> Retry? Not always, of course, but if it was EINTR, why not?
>> Sure, it needs some more logic to actually propagate those codes, or perhaps
>> revalidate itself needs to be smarter and not fail in such cases.
>> Or is this something that you think should be handled wholly within the filesystem,
>> and as such in this case it's just an nfs bug?
>
> Umm... Might be doable, but then there's a nasty question - what if that
> happens from umount(2) itself?

The EINTR? The umount presumably fails, and you can try it again later.

>>> Like what? Seriously, what would you do in such situation? Leave the
>>> damn thing unreachable (and thus impossible to unmount)? Suppose the
>>> /mnt/foo really had been removed (along with everything under it) on
>>> the server. You had something mounted on /mnt/foo/bar/baz; what should
>>> the kernel do?
>>
>> Well, if *I* ended up in this situation, I'd probably just recreate the missing
>> path and then do the umount (ESTALE galore?) ;)
>> (Of course there are other, less sane approaches, like pinning the whole path until
>> the unmount happens; that's likely rife with a lot of other gotchas, but
>> there's a limited version of this already: if I have a /mnt/foo mountpoint
>> and I delete /mnt/foo on the server, nobody would notice, because we pin
>> the foo part already and all accesses go to the filesystem mounted on top.)
> Try it...

Hm, it does unmount it due to the lookup failure,
though I was always under the impression that the mountpoint dentry is pinned.
Perhaps it was in some old version, or it's just my imagination.

>> But sure, when stuff is really missing, unmounting the subtrees looks like a very
>> sensible thing to do.
>> It's just that I suspect revalidate for a network filesystem is more than just
>> "valid" and "invalid"; there's a third option of "I don't know, ask me later"
>> (because the server is busy, down for a moment, or whatever), and there's
>> at least some value in being able to interrupt a process that's stuck on a network
>> mountpoint without killing the whole thing under it, no?
>
> It's actually even more interesting - some form of delaying invalidation
> might very well be a good thing, *if* we had a way to unhash the sucker
> and have it fall through into lookup. With invalidation happening only
> if lookup has returned something other than the object we'd just unhashed.
> Then e.g. NFS could bail out in all cases when it would have to talk to
> server and let the regular lookups do the work. However, right now that
> only works for directories - for regular files we just get a new alias and
> that's it. If something had been bound on top of the old one, we would lose
> it. And turning that check into "new dentry is an alias of what we'd
> unhashed" is a bad idea - it's already been hashed by us, so we'd have
> a window when dcache lookup would've picked that new alias.
>
> In that respect irregularities in Lustre become very interesting. What if
> we taught d_splice_alias() to look for _exact_ unhashed alias (same parent,
> same name) in case of non-directories and did "rehash and return that
> alias, dropping inode reference" if one has been found? Could we get rid
> of the weird dcache games in Lustre that way?

Well, certainly, if d_splice_alias worked like that, so that even a non-directory
dentry would find an alias (not necessarily an unhashed one) for that same inode and use it instead, that would make ll_splice_alias/ll_find_alias unnecessary.

We still retain the weird d_compare() that rejects otherwise perfectly valid aliases
if the lock guarding them is gone, triggering a relookup (and necessitating the
above logic to pick up the just-rejected alias again now that we have the lock again).
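For illustration, here is a minimal userspace sketch of the d_splice_alias() behaviour proposed above: for a non-directory, search the inode's alias list for an *exact* (same parent, same name) unhashed alias, rehash it, and hand it back instead of the freshly allocated dentry. All structs and `toy_*` names are hypothetical simplifications, not the kernel's actual dcache API:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, heavily simplified stand-ins for dentry/inode. */
struct toy_dentry {
	struct toy_dentry *parent;
	const char *name;
	int hashed;			/* "in the dcache hash" */
	struct toy_dentry *next_alias;	/* inode's alias list */
};

struct toy_inode {
	int is_dir;
	struct toy_dentry *aliases;
};

/* Returns the dentry the caller should use: either the rehashed exact
 * alias (dropping the fresh one), or the fresh dentry hashed in. */
static struct toy_dentry *toy_splice_alias(struct toy_inode *inode,
					   struct toy_dentry *fresh)
{
	struct toy_dentry *a;

	if (!inode->is_dir) {
		for (a = inode->aliases; a; a = a->next_alias) {
			if (!a->hashed && a->parent == fresh->parent &&
			    !strcmp(a->name, fresh->name)) {
				a->hashed = 1;	/* rehash the old alias */
				return a;	/* reuse it, discard "fresh" */
			}
		}
	}
	fresh->hashed = 1;
	fresh->next_alias = inode->aliases;
	inode->aliases = fresh;
	return fresh;
}
```

With this shape, a lookup that races with invalidation would converge back on the previously unhashed alias (keeping anything bound on top of it) instead of instantiating a second alias for the same name.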


2016-12-06 06:17:49

by Al Viro

Subject: Re: Revalidate failure leads to unmount

On Tue, Dec 06, 2016 at 12:45:11AM -0500, Oleg Drokin wrote:

> Well, certainly, if d_splice_alias worked like that, so that even a non-directory
> dentry would find an alias (not necessarily an unhashed one) for that same inode and use it instead, that would make ll_splice_alias/ll_find_alias unnecessary.
>
> We still retain the weird d_compare() that rejects otherwise perfectly valid aliases
> if the lock guarding them is gone, triggering a relookup (and necessitating the
> above logic to pick up the just-rejected alias again now that we have the lock again).

Why not have ->d_revalidate() kick them, instead? _IF_ we have a way to
do unhash-and-trigger-lookup that way, do you really need those games with
->d_compare()?
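As a rough userspace sketch of that idea (all `toy_*` names are hypothetical, and the real VFS plumbing is of course more involved): the compare stays purely name-based, the validity check moves into revalidate, and a 0 return from revalidate makes the caller unhash the dentry and fall through to a fresh lookup:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical toy dentry; has_lock stands in for
 * "d_lustre_invalid() is false". */
struct toy_dentry {
	const char *name;
	int has_lock;
	int is_mountpoint;
	int hashed;
};

/* Name-only compare: no validity games (what dropping ll_dcompare means).
 * Like d_compare(), returns 1 for "different", 0 for "same". */
static int toy_compare(const struct toy_dentry *d, const char *name)
{
	return strcmp(d->name, name) != 0;
}

/* The validity check lives in revalidate instead:
 * 1 = valid, 0 = unhash and trigger a fresh lookup. */
static int toy_revalidate(const struct toy_dentry *d)
{
	if (d->is_mountpoint)
		return 1;	/* mountpoint is always valid */
	return d->has_lock;
}

/* What the VFS cached-lookup path would do with a dcache hit. */
static int toy_lookup_dcache(struct toy_dentry *d, const char *name)
{
	if (toy_compare(d, name))
		return 0;		/* miss */
	if (!toy_revalidate(d)) {
		d->hashed = 0;		/* unhash, fall through to lookup */
		return 0;
	}
	return 1;			/* usable cache hit */
}
```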

2016-12-06 06:46:56

by Oleg Drokin

Subject: Re: Revalidate failure leads to unmount


On Dec 6, 2016, at 1:17 AM, Al Viro wrote:

> On Tue, Dec 06, 2016 at 12:45:11AM -0500, Oleg Drokin wrote:
>
>> Well, certainly, if d_splice_alias worked like that, so that even a non-directory
>> dentry would find an alias (not necessarily an unhashed one) for that same inode and use it instead, that would make ll_splice_alias/ll_find_alias unnecessary.
>>
>> We still retain the weird d_compare() that rejects otherwise perfectly valid aliases
>> if the lock guarding them is gone, triggering a relookup (and necessitating the
>> above logic to pick up the just-rejected alias again now that we have the lock again).
>
> Why not have ->d_revalidate() kick them, instead? _IF_ we have a way to
> do unhash-and-trigger-lookup that way, do you really need those games with
> ->d_compare()?

The comment there says:
* This avoids a race where ll_lookup_it() instantiates a dentry, but we get
* an AST before calling d_revalidate_it(). The dentry still exists (marked
* INVALID) so d_lookup() matches it, but we have no lock on it (so
* lock_match() fails) and we spin around real_lookup().

and indeed I seem to have a memory of some code that did a lookup followed
by a revalidate (though, checking the commit history, that was a long time ago).
I checked, and apparently lookup_real() no longer does any revalidation,
and neither does its caller.

So, assuming a lookup is no longer followed by a mandatory revalidate, we can
probably just move the lock check to the start of revalidate; I'll try that tomorrow.



2016-12-08 05:03:47

by Oleg Drokin

Subject: Re: Revalidate failure leads to unmount


On Dec 6, 2016, at 1:17 AM, Al Viro wrote:

> On Tue, Dec 06, 2016 at 12:45:11AM -0500, Oleg Drokin wrote:
>
>> Well, certainly, if d_splice_alias worked like that, so that even a non-directory
>> dentry would find an alias (not necessarily an unhashed one) for that same inode and use it instead, that would make ll_splice_alias/ll_find_alias unnecessary.
>>
>> We still retain the weird d_compare() that rejects otherwise perfectly valid aliases
>> if the lock guarding them is gone, triggering a relookup (and necessitating the
>> above logic to pick up the just-rejected alias again now that we have the lock again).
>
> Why not have ->d_revalidate() kick them, instead? _IF_ we have a way to
> do unhash-and-trigger-lookup that way, do you really need those games with
> ->d_compare()?

Ok, so this does appear to work in mainline, but not in the well-known vendor kernels,
which still live in the past, with d_invalidate() returning non-zero and
lookup_dcache() dutifully ignoring the error code from revalidate.

Anyway, I can submit the patch doing away with ll_dcompare, but I still
need the kludge of always returning 1 when revalidating mountpoints;
otherwise they would be constantly unmounted, I guess.

Is this something you'd like to carry along with the rest of the (yet-to-be-written)
patches that only unmount stuff on lookup failure or a different lookup result, as we
discussed before, or should I shoot this to Greg right away?

The patch pretty much amounts to this now:
diff --git a/drivers/staging/lustre/lustre/llite/dcache.c b/drivers/staging/lustre/lustre/llite/dcache.c
index 0e45d8f..f532167 100644
--- a/drivers/staging/lustre/lustre/llite/dcache.c
+++ b/drivers/staging/lustre/lustre/llite/dcache.c
@@ -69,38 +69,6 @@ static void ll_release(struct dentry *de)
 	call_rcu(&lld->lld_rcu_head, free_dentry_data);
 }
 
-/* Compare if two dentries are the same. Don't match if the existing dentry
- * is marked invalid. Returns 1 if different, 0 if the same.
- *
- * This avoids a race where ll_lookup_it() instantiates a dentry, but we get
- * an AST before calling d_revalidate_it(). The dentry still exists (marked
- * INVALID) so d_lookup() matches it, but we have no lock on it (so
- * lock_match() fails) and we spin around real_lookup().
- */
-static int ll_dcompare(const struct dentry *dentry,
-		       unsigned int len, const char *str,
-		       const struct qstr *name)
-{
-	if (len != name->len)
-		return 1;
-
-	if (memcmp(str, name->name, len))
-		return 1;
-
-	CDEBUG(D_DENTRY, "found name %.*s(%p) flags %#x refc %d\n",
-	       name->len, name->name, dentry, dentry->d_flags,
-	       d_count(dentry));
-
-	/* mountpoint is always valid */
-	if (d_mountpoint((struct dentry *)dentry))
-		return 0;
-
-	if (d_lustre_invalid(dentry))
-		return 1;
-
-	return 0;
-}
-
 /**
  * Called when last reference to a dentry is dropped and dcache wants to know
  * whether or not it should cache it:
@@ -255,6 +223,15 @@ static int ll_revalidate_dentry(struct dentry *dentry,
 {
 	struct inode *dir = d_inode(dentry->d_parent);
 
+	/* mountpoint is always valid */
+	if (d_mountpoint((struct dentry *)dentry))
+		return 1;
+
+	/* No lock? Bail out */
+	if (d_lustre_invalid(dentry))
+		return 0;
+
+
 	/* If this is intermediate component path lookup and we were able to get
 	 * to this dentry, then its lock has not been revoked and the
 	 * path component is valid.
@@ -303,5 +280,4 @@ const struct dentry_operations ll_d_ops = {
 	.d_revalidate = ll_revalidate_nd,
 	.d_release = ll_release,
 	.d_delete = ll_ddelete,
-	.d_compare = ll_dcompare,
 };
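Why the d_mountpoint() check has to come first can be sketched in userspace (a hypothetical toy model of the mainline lookup_dcache()/d_invalidate() interaction, not actual kernel code): once revalidate returns 0, d_invalidate() detaches every mount in the subtree, so a mountpoint whose lock happens to be revoked must still report itself valid.

```c
#include <assert.h>

/* Hypothetical stand-in for a dentry; "mounted" models whether a
 * filesystem is still attached at/below this point. */
struct toy_dentry {
	int is_mountpoint;
	int lock_valid;		/* stands in for !d_lustre_invalid() */
	int mounted;
};

/* Mirrors the check order in the patch above. */
static int toy_revalidate(const struct toy_dentry *d)
{
	if (d->is_mountpoint)
		return 1;	/* the "kludge": mountpoints always valid */
	return d->lock_valid;	/* no lock -> bail out */
}

/* Models d_invalidate() -> detach_mounts(): the surprise unmount. */
static void toy_d_invalidate(struct toy_dentry *d)
{
	d->mounted = 0;
}

/* Models mainline lookup_dcache() reacting to a 0 from ->d_revalidate(). */
static void toy_lookup_dcache(struct toy_dentry *d)
{
	if (!toy_revalidate(d))
		toy_d_invalidate(d);
}
```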