2010-05-24 06:58:07

by NeilBrown

Subject: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT


Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.

In particular, before this patch, the command
ls -l
in an NFS mounted directory would always check if the directory on the server
had changed and if so would flush and refill the pagecache for the dir.
After this patch, the same "ls -l" will repeatedly return stale data until
the cached attributes for the directory time out.

The following patch fixes this by ensuring the d_revalidate is called by
do_last when "." is being looked-up.
link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
is not set so nfs_lookup_verify_inode chooses not to do any validation.

The following patch restores the original behaviour.

Cc: [email protected]
Signed-off-by: NeilBrown <[email protected]>

---
As an aside the whole "FS_REVAL_DOT" concept seems deeply flawed to me.
The apparent purpose for d_revalidate is to ensure that the given name in the
given directory still points to the given inode. For '.' this is pointless
as '.' in a given directory has a very well defined and unchangeable value.

It seems that d_revalidate is also used to update the cached attributes for
the target. While that is a reasonable side effect, it seems wrong to depend
on it, yet that dependence is the basis for FS_REVAL_DOT.

Surely ->open should perform any final validation of cached attributes.

NeilBrown

(Yes, I have posted this before but I didn't seem to be getting any real
progress, so I split this more obviously correct patch out from the other
less obvious one relating to mountpoints. Once I succeed with this I'll try
again with the other).

diff --git a/fs/namei.c b/fs/namei.c
index 48e1f60..868d0cb 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -1621,6 +1621,7 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 	case LAST_DOTDOT:
 		follow_dotdot(nd);
 		dir = nd->path.dentry;
+	case LAST_DOT:
 		if (nd->path.mnt->mnt_sb->s_type->fs_flags & FS_REVAL_DOT) {
 			if (!dir->d_op->d_revalidate(dir, nd)) {
 				error = -ESTALE;
@@ -1628,7 +1629,6 @@ static struct file *do_last(struct nameidata *nd, struct path *path,
 			}
 		}
 		/* fallthrough */
-	case LAST_DOT:
 	case LAST_ROOT:
 		if (open_flag & O_CREAT)
 			goto exit;
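
For reference, the check referred to above (nfs_lookup_verify_inode skipping
validation when LOOKUP_OPEN is not set) has roughly the following shape.
This is a condensed paraphrase for illustration only, not the exact
fs/nfs/dir.c source:

static int sketch_lookup_verify_inode(struct inode *inode,
				      struct nameidata *nd)
{
	if (nd == NULL)
		return 0;
	if (nd->flags & LOOKUP_REVAL)
		goto out_force;
	/* an open(2) in progress revalidates regular files and dirs */
	if ((nd->flags & LOOKUP_OPEN) &&
	    (S_ISREG(inode->i_mode) || S_ISDIR(inode->i_mode)))
		goto out_force;
	return 0;			/* cached attributes accepted */
out_force:
	return __nfs_revalidate_inode(NFS_SERVER(inode), inode);
}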


2010-05-24 15:50:38

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, May 24, 2010 at 12:59:03PM +0100, Al Viro wrote:

> BTW, here's a question for nfs client folks: is it true that for any two
> pathnames on _client_ resolving to pairs (mnt1, dentry) and (mnt2, dentry)
> resp., nfs_devname(mnt1, dentry, ...) and nfs_devname(mnt2, dentry, ...)
> should yield the strings that do not differ past the ':' (i.e. that the
> only possible difference is going to be in spelling the server name)?

Actually, there's a related one: suppose we have two mounts from the same
server, with the same flags, etc., ending up sharing a dentry on client.
What will we get from GETATTR asking for fs_locations, in fs_root field?

Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
allow mounting of both /x/y/c and /a/b/c? Which path would it return to
client that has mounted both, walked to some referral point and called
nfs_do_refmount(), triggering nfs4_proc_fs_locations()?

Trond, Neil?

2010-05-25 01:36:02

by NeilBrown

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, 24 May 2010 17:47:36 +0100
Al Viro <[email protected]> wrote:

> On Mon, May 24, 2010 at 12:21:22PM -0400, Trond Myklebust wrote:
> > > Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
> > > allow mounting of both /x/y/c and /a/b/c? Which path would it return to
> > > client that has mounted both, walked to some referral point and called
> > > nfs_do_refmount(), triggering nfs4_proc_fs_locations()?
> > >
> > > Trond, Neil?
> >
> > When mounting /x/y/c in your example above, the NFSv4 protocol requires
> > the client itself to resolve the symlink, and then walk down /a/b/c
> > (looking up component by component), so it will in practice not see
> > anything other than /a/b/c.
> >
> > If it walks down to a referral, and then calls nfs_do_refmount, it will
> > do the same thing: obtain a path /e/f/g on the new server, and then walk
> > down that component by component while resolving any symlinks and/or
> > referrals that it crosses in the process.
>
> Ho-hum... What happens if the same fs is mounted twice on server? I.e.
> have ext2 from /dev/sda1 mounted on /a and /b on server, then on the client
> do mount -t nfs foo:/a /tmp/a; mount -t nfs foo:/b /tmp/b. Which path
> would we get from GETATTR with fs_locations requested, if we do it for
> /tmp/a/x and /tmp/b/x resp.? Dentry will be the same, since fsid would
> match.
>
> Or would the server refuse to export things that way?

If an explicit fsid or uuid is given for the two different export points,
then the server will happily export both and the client will see two
different filesystems that happen to contain identical content. They could
return different fs_locations, or could return the same depending on what is
specified in /etc/exports.
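
To make that concrete, an /etc/exports along the following lines (the values
here are purely illustrative) exports both paths as distinct filesystems even
though they are backed by the same device:

/a	*(rw,fsid=1)
/b	*(rw,fsid=2)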

If mountd is left to choose the default uuid, then it will only export one of
them at a time. When the client requests the mount of "foo:/b", the
information given to the kernel will override the export of "/a". The
client won't notice much as the filehandles will all be the same.

The value returned for fs_locations is given to the kernel by mountd based on
the contents of /etc/exports.
So if you mount /dev/sda1 on both /a and /b, then in /etc/exports have:

/a *(locations=foo_a)
/b *(locations=foo_b)

(or whatever the correct syntax is), then when the client does

mount foo:/a /tmp/a

and asks for fs_locations in /tmp/a, it will get something based on foo_a.
If it then does
mount foo:/b /tmp/b
then future fs_locations requests in either /tmp/a or /tmp/b will be based on
foo_b.

If one client mounts foo:/a and then another mounts foo:/b, then both will
see foo_b locations.

Obviously setting /etc/exports up like this with different locations for the
same data would be pretty silly. If the different exports of the same fs
had the same locations, then it would all work sensibly.

At least, the above is a worst case. It is not impossible that mountd could
detect that the two exports are the same and would prefer the "first" one in
all cases. In that case the results would be more stable, but not
necessarily more "correct" (whatever that means here).

NeilBrown

2010-05-24 16:21:31

by Trond Myklebust

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, 2010-05-24 at 16:50 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 12:59:03PM +0100, Al Viro wrote:
>
> > BTW, here's a question for nfs client folks: is it true that for any two
> > pathnames on _client_ resolving to pairs (mnt1, dentry) and (mnt2, dentry)
> > resp., nfs_devname(mnt1, dentry, ...) and nfs_devname(mnt2, dentry, ...)
> > should yield the strings that do not differ past the ':' (i.e. that the
> > only possible difference is going to be in spelling the server name)?
>
> Actually, there's a related one: suppose we have two mounts from the same
> server, with the same flags, etc., ending up sharing a dentry on client.
> What will we get from GETATTR asking for fs_locations, in fs_root field?
>
> Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
> allow mounting of both /x/y/c and /a/b/c? Which path would it return to
> client that has mounted both, walked to some referral point and called
> nfs_do_refmount(), triggering nfs4_proc_fs_locations()?
>
> Trond, Neil?

When mounting /x/y/c in your example above, the NFSv4 protocol requires
the client itself to resolve the symlink, and then walk down /a/b/c
(looking up component by component), so it will in practice not see
anything other than /a/b/c.

If it walks down to a referral, and then calls nfs_do_refmount, it will
do the same thing: obtain a path /e/f/g on the new server, and then walk
down that component by component while resolving any symlinks and/or
referrals that it crosses in the process.

Cheers
Trond


2010-05-24 19:08:32

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, May 24, 2010 at 01:06:31PM -0400, Trond Myklebust wrote:

> I believe that the answer is that most filehandle types include an
> encoding of the inode number of the export directory. In other words, as
> long as '/a' and '/b' are different directories, then they will result
> in the generation of different filehandles for /a/x and /b/x.
>
> It seems that is not always the case, though. According to the
> definition of mk_fsid(), it looks as if the 'FSID_UUID8' and
> 'FSID_UUID16' filehandle types only encode the uuid of the filesystem,
> and have no inode information. They will therefore not be able to
> distinguish between an export through '/a' or '/b'.
>
> Neil, Bruce am I right?

Er? On server:

mount -t ext2 /dev/sdb1 /srv/nfs4
mount -t ext2 /dev/sda1 /srv/nfs4/a
mount -t ext2 /dev/sda1 /srv/nfs4/b

after that /srv/nfs4/a and /srv/nfs4/b will have the *same* inode, nevermind
the inode number. I really mean the same filesystem mounted twice; if you
want to include inumber of mountpoint into fsid, fine, turn the above into

mount -t ext2 /dev/sdb1 /srv/nfs4
mount -t ext2 /dev/sda1 /srv/nfs4/a
mount -t ext2 /dev/sda1 /srv/nfs4/b
mount -t ext2 /dev/sda3 /srv/nfs4/a/z
mount -t ext2 /dev/sda3 /srv/nfs4/b/z

At that point you have the same fs (ext2 from sda3) mounted on /srv/nfs4/a/z
and /srv/nfs4/b/z, with the same directory inode overmounted by it in both
mountpoints. Suppose your referral point is on /a/z/x and /b/z/x resp. and
see the question upthread...

2010-05-24 21:13:40

by Trond Myklebust

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, 2010-05-24 at 20:08 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 01:06:31PM -0400, Trond Myklebust wrote:
>
> > I believe that the answer is that most filehandle types include an
> > encoding of the inode number of the export directory. In other words, as
> > long as '/a' and '/b' are different directories, then they will result
> > in the generation of different filehandles for /a/x and /b/x.
> >
> > It seems that is not always the case, though. According to the
> > definition of mk_fsid(), it looks as if the 'FSID_UUID8' and
> > 'FSID_UUID16' filehandle types only encode the uuid of the filesystem,
> > and have no inode information. They will therefore not be able to
> > distinguish between an export through '/a' or '/b'.
> >
> > Neil, Bruce am I right?
>
> Er? On server:
>
> mount -t ext2 /dev/sdb1 /srv/nfs4
> mount -t ext2 /dev/sda1 /srv/nfs4/a
> mount -t ext2 /dev/sda1 /srv/nfs4/b
>
> after that /srv/nfs4/a and /srv/nfs4/b will have the *same* inode, nevermind
> the inode number. I really mean the same filesystem mounted twice; if you
> want to include inumber of mountpoint into fsid, fine, turn the above into
>
> mount -t ext2 /dev/sdb1 /srv/nfs4
> mount -t ext2 /dev/sda1 /srv/nfs4/a
> mount -t ext2 /dev/sda1 /srv/nfs4/b
> mount -t ext2 /dev/sda3 /srv/nfs4/a/z
> mount -t ext2 /dev/sda3 /srv/nfs4/b/z
>
> At that point you have the same fs (ext2 from sda3) mounted on /srv/nfs4/a/z
> and /srv/nfs4/b/z, with the same directory inode overmounted by it in both
> mountpoints. Suppose your referral point is on /a/z/x and /b/z/x resp. and
> see the question upthread...

Sorry... I misunderstood you.

In cases like the above, then the default behaviour of the server would
be to assign the same filehandles to those mount points. The
administrator can, however, make them different by choosing to use the
'fsid' mount option to manually assign different fsids to the different
export points.

If not, then the client will automatically group these things in the
same superblock, so like the server, it too is supposed to share the
same inode for these different objects. It will then use
d_obtain_alias() to get a root dentry for that inode (see
nfs4_get_root()).
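
The pattern referred to is roughly the following; a simplified sketch, not
the actual nfs4_get_root() code:

static struct dentry *sketch_get_root(struct inode *root_inode)
{
	/* d_obtain_alias() reuses an existing dentry for the inode if
	 * there is one, otherwise creates an anonymous one, so two
	 * mounts resolving to the same root inode end up sharing it */
	if (IS_ERR(root_inode))
		return ERR_CAST(root_inode);
	return d_obtain_alias(root_inode);
}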

Cheers
Trond


2010-05-25 13:04:50

by Trond Myklebust

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Tue, 2010-05-25 at 00:44 +0100, Al Viro wrote:
> On Tue, May 25, 2010 at 12:01:09AM +0100, Al Viro wrote:
> > Client question: what stops you from stack overflows in that area? Call
> > chains you've got are *deep*, and I really wonder what happens if you
> > hit a referral point while traversing nested symlink, get pathname
> > resolution (already several levels into recursion) call ->follow_link(),
> > bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
> > vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> > vfs_path_lookup, which will cheerfully add a few more loops like that.
> >
> > Sure, the *total* nesting depth through symlinks is still limited by 8, but
> > that pile of stack frames is _MUCH_ fatter than what we normally have in
> > pathname resolution. You've suddenly added ~60 extra stack frames to the
> > worst-case stack footprint of the pathname resolution. Don't try that
> > on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
> > extra out of its kernel stack... In fact, it's worse than just ~60 stack
> > frames - several will contain (on-stack) struct nameidata in them, which
> > very definitely will _not_ fit into the minimal stack frame. It's about
> > 160 bytes extra, for each of those (up to 7).
>
> Actually, just what will happen if you have a referral that would eventually
> resolve to a directory you have no permissions to access? AFAICS, you'll
> end up trying it on all alternates, since nfs_follow_referral() will cheerfully
> keep trying one variant after another, getting -EACCES from each. Worse,
> if there are nested referrals in it, you'll get all sequences of alternates
> tried before you give up.
>
> ..o*O(at least it's merely exponential; Ackermann would be even more fun)

We could perhaps quit if the referral resolves to EACCES, but there is
the theoretical possibility that the administrator of the particular
replicated server has instead just unexported the filesystem (which
will also result in EACCES).




2010-05-24 23:44:07

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Tue, May 25, 2010 at 12:01:09AM +0100, Al Viro wrote:
> Client question: what stops you from stack overflows in that area? Call
> chains you've got are *deep*, and I really wonder what happens if you
> hit a referral point while traversing nested symlink, get pathname
> resolution (already several levels into recursion) call ->follow_link(),
> bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
> vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> vfs_path_lookup, which will cheerfully add a few more loops like that.
>
> Sure, the *total* nesting depth through symlinks is still limited by 8, but
> that pile of stack frames is _MUCH_ fatter than what we normally have in
> pathname resolution. You've suddenly added ~60 extra stack frames to the
> worst-case stack footprint of the pathname resolution. Don't try that
> on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
> extra out of its kernel stack... In fact, it's worse than just ~60 stack
> frames - several will contain (on-stack) struct nameidata in them, which
> very definitely will _not_ fit into the minimal stack frame. It's about
> 160 bytes extra, for each of those (up to 7).

Actually, just what will happen if you have a referral that would eventually
resolve to a directory you have no permissions to access? AFAICS, you'll
end up trying it on all alternates, since nfs_follow_referral() will cheerfully
keep trying one variant after another, getting -EACCES from each. Worse,
if there are nested referrals in it, you'll get all sequences of alternates
tried before you give up.

..o*O(at least it's merely exponential; Ackermann would be even more fun)

2010-05-26 02:52:39

by NeilBrown

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Tue, 25 May 2010 02:58:22 +0100
Al Viro <[email protected]> wrote:

> On Tue, May 25, 2010 at 11:14:05AM +1000, Neil Brown wrote:
>
> > I must confess though that I don't feel I understand VFS name lookup properly
> > any more. Since intents were added it seems to have become much more obscure
> > and complex. I cannot help thinking that there must be a better way:
> > distinguish between the various cases at a higher level so we don't need as
> > many flags being passed around and interpreted by widely separate pieces of
> > code. I don't have a concrete proposal but I would certainly be interested
> > to work on one if there were any hope of real change.
> > Thoughts?
>
> Intents are vile crap that has been introduced by the nfs folks to start
> with... I've been trying to localize the mess and it's got a _lot_ better
> than it used to be a year ago, but they are still not gone. And yes, I
> plan to kill that crap. Basically, most of the do_last() guts will become
> a method that would get struct file *explicitly* and ask the fs to do
> (possibly atomic) open. With normal filesystems defaulting to what's there
> right now.

That sounds like the sort of direction I was imagining.
I note however that vfat uses intents in a way that would not be addressed by
a '->do_last' method. It wants to invalidate negative dentries in
d_revalidate if they are the target of a rename (or another create), and
presumably rename wouldn't use ->do_last? Or maybe it would, but with a NULL
file ??
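
The pattern in question looks roughly like this (a paraphrase for
illustration, not the actual fs/fat/namei_vfat.c code):

static int sketch_revalidate_ci(struct dentry *dentry, struct nameidata *nd)
{
	/* positive dentries are left alone by this particular check */
	if (dentry->d_inode)
		return 1;
	/*
	 * Negative dentry: if the lookup carries a create or
	 * rename-target intent, drop it so ->lookup() gets another
	 * chance to pick the on-disk name.
	 */
	if (nd && (nd->flags & (LOOKUP_CREATE | LOOKUP_RENAME_TARGET)))
		return 0;
	return 1;
}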

>
> The main obstacle at the moment is in ->d_revalidate() abuses. NFS, CIFS
> *and* autofs, the last one in a way that isn't really compatible with what
> NFS et.al. are trying to do. Overloading of ->d_revalidate() and ->lookup()
> to do the work of open() doesn't help, and the horrors nfs4 piles on top
> of that are even scarier.
>
> _Another_ fine piece of something is ->follow_link() abuses, including
> referrals' treatment. Also tied to the previous messes.
>
> We definitely will need to get VFS-to-fs APIs in that area changed; most of
> the mess has been created by the deeply misguided efforts to keep the API
> changes minimal.
>
> As for the flags, quite a few will be gone once we split "opening the final
> component" from the normal cases. Google for lookup_instantiate_filp+shit
> for details of these plans...

I tried cloning
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6.git
and couldn't find anything in the 'untested' branch. Did I look in the wrong
place? Is there some work-in-progress I can explore?

Thanks,
NeilBrown


2010-05-24 17:06:40

by Trond Myklebust

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, 2010-05-24 at 17:47 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 12:21:22PM -0400, Trond Myklebust wrote:
> > > Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
> > > allow mounting of both /x/y/c and /a/b/c? Which path would it return to
> > > client that has mounted both, walked to some referral point and called
> > > nfs_do_refmount(), triggering nfs4_proc_fs_locations()?
> > >
> > > Trond, Neil?
> >
> > When mounting /x/y/c in your example above, the NFSv4 protocol requires
> > the client itself to resolve the symlink, and then walk down /a/b/c
> > (looking up component by component), so it will in practice not see
> > anything other than /a/b/c.
> >
> > If it walks down to a referral, and then calls nfs_do_refmount, it will
> > do the same thing: obtain a path /e/f/g on the new server, and then walk
> > down that component by component while resolving any symlinks and/or
> > referrals that it crosses in the process.
>
> Ho-hum... What happens if the same fs is mounted twice on server? I.e.
> have ext2 from /dev/sda1 mounted on /a and /b on server, then on the client
> do mount -t nfs foo:/a /tmp/a; mount -t nfs foo:/b /tmp/b. Which path
> would we get from GETATTR with fs_locations requested, if we do it for
> /tmp/a/x and /tmp/b/x resp.? Dentry will be the same, since fsid would
> match.
>
> Or would the server refuse to export things that way?


I believe that the answer is that most filehandle types include an
encoding of the inode number of the export directory. In other words, as
long as '/a' and '/b' are different directories, then they will result
in the generation of different filehandles for /a/x and /b/x.

It seems that is not always the case, though. According to the
definition of mk_fsid(), it looks as if the 'FSID_UUID8' and
'FSID_UUID16' filehandle types only encode the uuid of the filesystem,
and have no inode information. They will therefore not be able to
distinguish between an export through '/a' or '/b'.

Neil, Bruce am I right?

Cheers
Trond


2010-05-24 16:47:39

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, May 24, 2010 at 12:21:22PM -0400, Trond Myklebust wrote:
> > Can an nfs4 server e.g. have /x/y being a symlink that resolves to /a/b and
> > allow mounting of both /x/y/c and /a/b/c? Which path would it return to
> > client that has mounted both, walked to some referral point and called
> > nfs_do_refmount(), triggering nfs4_proc_fs_locations()?
> >
> > Trond, Neil?
>
> When mounting /x/y/c in your example above, the NFSv4 protocol requires
> the client itself to resolve the symlink, and then walk down /a/b/c
> (looking up component by component), so it will in practice not see
> anything other than /a/b/c.
>
> If it walks down to a referral, and then calls nfs_do_refmount, it will
> do the same thing: obtain a path /e/f/g on the new server, and then walk
> down that component by component while resolving any symlinks and/or
> referrals that it crosses in the process.

Ho-hum... What happens if the same fs is mounted twice on server? I.e.
have ext2 from /dev/sda1 mounted on /a and /b on server, then on the client
do mount -t nfs foo:/a /tmp/a; mount -t nfs foo:/b /tmp/b. Which path
would we get from GETATTR with fs_locations requested, if we do it for
/tmp/a/x and /tmp/b/x resp.? Dentry will be the same, since fsid would
match.

Or would the server refuse to export things that way?

2010-05-24 23:01:13

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, May 24, 2010 at 05:13:32PM -0400, Trond Myklebust wrote:

> Sorry... I misunderstood you.
>
> In cases like the above, then the default behaviour of the server would
> be to assign the same filehandles to those mount points. The
> administrator can, however, make them different by choosing to use the
> 'fsid' mount option to manually assign different fsids to the different
> export points.
>
> If not, then the client will automatically group these things in the
> same superblock, so like the server, it too is supposed to share the
> same inode for these different objects. It will then use
> d_obtain_alias() to get a root dentry for that inode (see
> nfs4_get_root()).

Yes, it will. So what will happen in nfs_follow_referral()? Note that
we check the rootpath returned by the server (whatever it will end up
being) against the mnt_devname + relative path from mnt_root to referral
point. In this case it'll be /a/z or /b/z (depending on which export
the server selects when it sees the fsid) vs /a/z/x or /b/z/x (depending
on which one the client walks into). And the calls of nfs4_proc_fs_locations()
will get identical arguments whether the client walks into a/z/x or b/z/x.
So will the actual RPC requests seen by the server, so it looks like in
at least one of those cases we will get the rootpath that is _not_ a prefix
we are expecting, stepping into
	if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
		dprintk("%s: path %s does not begin with fsroot %s\n",
			__func__, path, fs_path);
		return -ENOENT;
	}
in nfs4_validate_fspath().

Question regarding RFC3530: is it actually allowed to have the same fhandle
show up in two different locations in server's namespace? If so, what
should GETATTR with FS_LOCATIONS return for it?

Client question: what stops you from stack overflows in that area? Call
chains you've got are *deep*, and I really wonder what happens if you
hit a referral point while traversing nested symlink, get pathname
resolution (already several levels into recursion) call ->follow_link(),
bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
vfs_path_lookup, which will cheerfully add a few more loops like that.

Sure, the *total* nesting depth through symlinks is still limited by 8, but
that pile of stack frames is _MUCH_ fatter than what we normally have in
pathname resolution. You've suddenly added ~60 extra stack frames to the
worst-case stack footprint of the pathname resolution. Don't try that
on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
extra out of its kernel stack... In fact, it's worse than just ~60 stack
frames - several will contain (on-stack) struct nameidata in them, which
very definitely will _not_ fit into the minimal stack frame. It's about
160 bytes extra, for each of those (up to 7).

Come to think of that, x86 variants might get rather upset about that kind
of treatment as well. Minimal stack frames are smaller, but so's the stack...
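
(Back of the envelope, using the numbers above: the quoted call chain is
roughly eight frames per referral level, so up to 7 nested levels gives on
the order of 60 extra frames, plus 7 x ~160 bytes, i.e. roughly another
1.1Kb, for the on-stack struct nameidata copies alone.)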

2010-05-25 01:14:15

by NeilBrown

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, 24 May 2010 12:59:03 +0100
Al Viro <[email protected]> wrote:

> On Mon, May 24, 2010 at 04:57:56PM +1000, Neil Brown wrote:
> >
> > Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.
> >
> > In particular, before this patch, the command
> > ls -l
> > in an NFS mounted directory would always check if the directory on the server
> > had changed and if so would flush and refill the pagecache for the dir.
> > After this patch, the same "ls -l" will repeatedly return stale data until
> > the cached attributes for the directory time out.
> >
> > The following patch fixes this by ensuring the d_revalidate is called by
> > do_last when "." is being looked-up.
> > link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
> > is not set so nfs_lookup_verify_inode chooses not to do any validation.
> >
> > The following patch restores the original behaviour.
> >
> > Cc: [email protected]
> > Signed-off-by: NeilBrown <[email protected]>
>
> Applied, but I really don't like the way you do it; note that e.g. foo/bar/.
> gets that revalidation as well, for no good reason. If anything, shouldn't
> we handle that thing in the _beginning_ of pathname resolution, not in
> the end? For now it'd do, and it's a genuine regression, but...
>

Thanks.

I think I see what you mean by "at the beginning" - the problem path is
simply ".", and both the start and end of that are "." so we can handle at
either end... But I don't think there is any other special handling of the
'start' of a path, so I imagine it would be a fairly ugly special case.

We could avoid the extra GETATTR in "foo/bar/." by allowing NFS to keep some
state in the nameidata to record that it has valid attributes for a given
dentry so if it sees the same dentry again it doesn't need to revalidate.

I must confess though that I don't feel I understand VFS name lookup properly
any more. Since intents were added it seems to have become much more obscure
and complex. I cannot help thinking that there must be a better way:
distinguish between the various cases at a higher level so we don't need as
many flags being passed around and interpreted by widely separate pieces of
code. I don't have a concrete proposal but I would certainly be interested
to work on one if there were any hope of real change.
Thoughts?

Thanks,
NeilBrown

> BTW, here's a question for nfs client folks: is it true that for any two
> pathnames on _client_ resolving to pairs (mnt1, dentry) and (mnt2, dentry)
> resp., nfs_devname(mnt1, dentry, ...) and nfs_devname(mnt2, dentry, ...)
> should yield the strings that do not differ past the ':' (i.e. that the
> only possible difference is going to be in spelling the server name)?


2010-05-25 12:57:22

by Trond Myklebust

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Tue, 2010-05-25 at 00:01 +0100, Al Viro wrote:
> On Mon, May 24, 2010 at 05:13:32PM -0400, Trond Myklebust wrote:
>
> > Sorry... I misunderstood you.
> >
> > In cases like the above, then the default behaviour of the server would
> > be to assign the same filehandles to those mount points. The
> > administrator can, however, make them different by choosing to use the
> > 'fsid' mount option to manually assign different fsids to the different
> > export points.
> >
> > If not, then the client will automatically group these things in the
> > same superblock, so like the server, it too is supposed to share the
> > same inode for these different objects. It will then use
> > d_obtain_alias() to get a root dentry for that inode (see
> > nfs4_get_root()).
>
> Yes, it will. So what will happen in nfs_follow_referral()? Note that
> we check the rootpath returned by the server (whatever it will end up
> being) against the mnt_devname + relative path from mnt_root to referral
> point. In this case it'll be /a/z or /b/z (depending on which export
> the server selects when it sees the fsid) vs /a/z/x or /b/z/x (depending
> on which one the client walks into). And the calls of nfs4_proc_fs_locations()
> will get identical arguments whether the client walks into a/z/x or b/z/x.
> So will the actual RPC requests seen by the server, so it looks like in
> at least one of those cases we will get the rootpath that is _not_ a prefix
> we are expecting, stepping into
> 	if (strncmp(path, fs_path, strlen(fs_path)) != 0) {
> 		dprintk("%s: path %s does not begin with fsroot %s\n",
> 			__func__, path, fs_path);
> 		return -ENOENT;
> 	}
> in nfs4_validate_fspath().
>
> Question regarding RFC3530: is it actually allowed to have the same fhandle
> show up in two different locations in server's namespace? If so, what
> should GETATTR with FS_LOCATIONS return for it?

I think the general expectation in the protocol is that there are no
hard linked directories. This assumption is reflected in the fact that
we have operations such as LOOKUPP (the NFSv4 equivalent of
lookup("..")) which only take a filehandle argument.

> Client question: what stops you from stack overflows in that area? Call
> chains you've got are *deep*, and I really wonder what happens if you
> hit a referral point while traversing nested symlink, get pathname
> resolution (already several levels into recursion) call ->follow_link(),
> bounce down through nfs_do_refmount/nfs_follow_referral/try_location/
> vfs_kern_mount/nfs4_referral_get_sb/nfs_follow_remote_path into
> vfs_path_lookup, which will cheerfully add a few more loops like that.
>
> Sure, the *total* nesting depth through symlinks is still limited by 8, but
> that pile of stack frames is _MUCH_ fatter than what we normally have in
> pathname resolution. You've suddenly added ~60 extra stack frames to the
> worst-case stack footprint of the pathname resolution. Don't try that
> on sparc64, boys and girls, it won't be happy with attempt to carve ~12Kb
> extra out of its kernel stack... In fact, it's worse than just ~60 stack
> frames - several will contain (on-stack) struct nameidata in them, which
> very definitely will _not_ fit into the minimal stack frame. It's about
> 160 bytes extra, for each of those (up to 7).
>
> Come to think of that, x86 variants might get rather upset about that kind
> of treatment as well. Minimal stack frames are smaller, but so's the stack...

See commit ce587e07ba2e25b5c9d286849885b82676661f3e (NFS: Prevent the
mount code from looping forever on broken exports) which was just
merged. It prevents nesting > 2 levels deep (ignore the changelog
comment about MAX_NESTED_LINKS - that is a typo).

Cheers
Trond


2010-05-24 11:59:10

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Mon, May 24, 2010 at 04:57:56PM +1000, Neil Brown wrote:
>
> Commit 1f36f774b22a0ceb7dd33eca626746c81a97b6a5 broke FS_REVAL_DOT semantics.
>
> In particular, before this patch, the command
> ls -l
> in an NFS mounted directory would always check if the directory on the server
> had changed and if so would flush and refill the pagecache for the dir.
> After this patch, the same "ls -l" will repeatedly return stale data until
> the cached attributes for the directory time out.
>
> The following patch fixes this by ensuring the d_revalidate is called by
> do_last when "." is being looked-up.
> link_path_walk has already called d_revalidate, but in that case LOOKUP_OPEN
> is not set so nfs_lookup_verify_inode chooses not to do any validation.
>
> The following patch restores the original behaviour.
>
> Cc: [email protected]
> Signed-off-by: NeilBrown <[email protected]>

Applied, but I really don't like the way you do it; note that e.g. foo/bar/.
gets that revalidation as well, for no good reason. If anything, shouldn't
we handle that thing in the _beginning_ of pathname resolution, not in
the end? For now it'd do, and it's a genuine regression, but...

BTW, here's a question for nfs client folks: is it true that for any two
pathnames on _client_ resolving to pairs (mnt1, dentry) and (mnt2, dentry)
resp., nfs_devname(mnt1, dentry, ...) and nfs_devname(mnt2, dentry, ...)
should yield the strings that do not differ past the ':' (i.e. that the
only possible difference is going to be in spelling the server name)?

2010-05-25 01:58:25

by Al Viro

Subject: Re: [PATCH] VFS: fix recent breakage of FS_REVAL_DOT

On Tue, May 25, 2010 at 11:14:05AM +1000, Neil Brown wrote:

> I must confess though that I don't feel I understand VFS name lookup properly
> any more. Since intents were added it seems to have become much more obscure
> and complex. I cannot help thinking that there must be a better way:
> distinguish between the various cases at a higher level so we don't need as
> many flags being passed around and interpreted by widely separate pieces of
> code. I don't have a concrete proposal but I would certainly be interested
> to work on one if there were any hope of real change.
> Thoughts?

Intents are vile crap that has been introduced by the nfs folks to start
with... I've been trying to localize the mess and it's got a _lot_ better
than it used to be a year ago, but they are still not gone. And yes, I
plan to kill that crap. Basically, most of the do_last() guts will become
a method that would get struct file *explicitly* and ask the fs to do
(possibly atomic) open. With normal filesystems defaulting to what's there
right now.
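
Something along these lines, purely to illustrate the shape of such a method
(the name and signature here are hypothetical, not a committed interface):

/* hypothetical: the VFS hands the fs the struct file and lets the fs
 * perform the (possibly atomic) lookup-and-open itself */
struct sketch_inode_operations {
	int (*open_last)(struct inode *dir, struct dentry *dentry,
			 struct file *filp, int open_flag, int mode);
};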

The main obstacle at the moment is in ->d_revalidate() abuses. NFS, CIFS
*and* autofs, the last one in a way that isn't really compatible with what
NFS et.al. are trying to do. Overloading of ->d_revalidate() and ->lookup()
to do the work of open() doesn't help, and the horrors nfs4 piles on top
of that are even scarier.

_Another_ fine piece of something is ->follow_link() abuses, including
referrals' treatment. Also tied to the previous messes.

We definitely will need to get VFS-to-fs APIs in that area changed; most of
the mess has been created by the deeply misguided efforts to keep the API
changes minimal.

As for the flags, quite a few will be gone once we split "opening the final
component" from the normal cases. Google for lookup_instantiate_filp+shit
for details of these plans...