2008-08-01 13:56:21

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote:
> Those are needed by NFSv3 too -- and can be handled with a lookup_fh()
> method in the file system which is guaranteed to be called from within
> the filldir callback, and some support in the VFS for checking if it's a
> mountpoint.
>
> NFSv4 introduces another problem though, which is that it seems to be
> able to return the _full_ getattr() results for each object, and there's
> no real way round the fact that we need to do the ->lookup() for that.
>
> If sane clients aren't expected to _ask_ for that, though, then perhaps
> it would be OK to fall back to something like the existing
> readdir-to-buffer hack for that case, while most normal clients won't
> trigger it.

Or maybe we could just mask the offending attrs out of ->rd_bmval for
readdir calls, and say we don't support them? Would anyone scream if we
did that?

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation





2008-08-01 16:05:01

by Chuck Lever III

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

Hi David-

On Fri, Aug 1, 2008 at 9:56 AM, David Woodhouse <[email protected]> wrote:
> On Fri, 2008-08-01 at 14:36 +0100, David Woodhouse wrote:
>> Those are needed by NFSv3 too -- and can be handled with a lookup_fh()
>> method in the file system which is guaranteed to be called from within
>> the filldir callback, and some support in the VFS for checking if it's a
>> mountpoint.
>>
>> NFSv4 introduces another problem though, which is that it seems to be
>> able to return the _full_ getattr() results for each object, and there's
>> no real way round the fact that we need to do the ->lookup() for that.
>>
>> If sane clients aren't expected to _ask_ for that, though, then perhaps
>> it would be OK to fall back to something like the existing
>> readdir-to-buffer hack for that case, while most normal clients won't
>> trigger it.

Sanity / insanity is probably not the right description for these
different types of clients... NFSv4 is designed to work with clients
that have a typical VFS (like an in-kernel Unix client), or user-space
clients, or clients on systems that don't have typical VFS APIs (like
Windows). So, servers have to be prepared to expect a wide gamut of
different combinations of individual operations via compound RPCs.

In practice, nearly all client implementations so far are VFS-based
in-kernel clients, and have roughly the same kinds of readdir (ie
driven by getdents(3) and allowing seek offsets like a byte stream).
But they are at generally different stages of implementation, and have
different ways to go about their work.

However, speaking generally, the advanced features of NFSv4, like
FS_LOCATIONS and pseudofs often do require some special sauce that is
sometimes not terribly friendly to the server-side VFS.

I only mentioned that the Linux client doesn't use WORD0_FILEHANDLE to
caution against testing any server change with a Linux NFSv4 client --
it wouldn't necessarily exercise the server code in question.

Some NFSv3 clients don't support READDIRPLUS at all, while some can
disable it (like Linux, Mac OS, and FreeBSD), and others use it only
in certain cases (Linux). I wouldn't describe any of these as saner
or more commonly encountered than another.

> Or maybe we could just mask the offending attrs out of ->rd_bmval for
> readdir calls, and say we don't support them? Would anyone scream if we
> did that?

I'm not an NFSv4 expert (hence my initial incorrect assertion about
NFSv4 not supporting readdirplus at all). I defer to those who are
actually working on the standard and Linux implementation (Bruce?)
But typically masking out these features could potentially cause
severe interoperability problems for certain client implementations.
We can only know for sure after a lot of testing at multivendor events
like Connectathon; it's not something I would disable cavalierly.

I believe the server can also indicate to clients that NFSv3
READDIRPLUS is not supported, and that wouldn't cause as much of a
disaster for clients. It would even be feasible to disable
READDIRPLUS only for certain physical file systems that would have a
problem with lookup-during-filldir.

I rather prefer making NFSD do the right thing here -- it seems to
localize and document the issue and provide a solution that all file
systems can use with a minimum of real fuss.

I know that OCFS2 has this locking inversion issue as well, and we
know OCFS2 users do share data from it with NFS. So Oracle is
interested in a good solution here too.

--
Chuck Lever

2008-08-01 16:19:30

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote:
> Some NFSv3 clients don't support READDIRPLUS at all, while some can
> disable it (like Linux, Mac OS, and FreeBSD), and others use it only
> in certain cases (Linux). I wouldn't describe any of these as saner
> or more commonly encountered than another.

Readdirplus is easy enough to fix with a lookup_fh() method and some way
to check if a given entry is a mountpoint. I'm not worried about that.

> > Or maybe we could just mask the offending attrs out of ->rd_bmval for
> > readdir calls, and say we don't support them? Would anyone scream if we
> > did that?
>
> I'm not an NFSv4 expert (hence my initial incorrect assertion about
> NFSv4 not supporting readdirplus at all). I defer to those who are
> actually working on the standard and Linux implementation (Bruce?)
> But typically masking out these features could potentially cause
> severe interoperability problems for certain client implementations.
> We can only know for sure after a lot of testing at multivendor events
> like Connectathon; it's not something I would disable cavalierly.

It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about
masking out -- as I said, we can cope with that. It's things that would
_really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE,
etc.

But still, if masking them out would be a problem, we could use the
existing trick of doing readdir into a buffer and then going back and
doing the actual lookup later. But _only_ for that relatively rare case,
rather than for _all_ users of readdirplus, as we do at the moment.

> I rather prefer making NFSD do the right thing here -- it seems to
> localize and document the issue and provide a solution that all file
> systems can use with a minimum of real fuss.

Yes, that's definitely my preference too. What I've posted is a good
first step to that, and we can talk about improving it later.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation




2008-08-01 17:47:58

by Chuck Lever III

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Fri, Aug 1, 2008 at 12:19 PM, David Woodhouse <[email protected]> wrote:
> On Fri, 2008-08-01 at 12:05 -0400, Chuck Lever wrote:
>> Some NFSv3 clients don't support READDIRPLUS at all, while some can
>> disable it (like Linux, Mac OS, and FreeBSD), and others use it only
>> in certain cases (Linux). I wouldn't describe any of these as saner
>> or more commonly encountered than another.
>
> Readdirplus is easy enough to fix with a lookup_fh() method and some way
> to check if a given entry is a mountpoint. I'm not worried about that.
>
>> > Or maybe we could just mask the offending attrs out of ->rd_bmval for
>> > readdir calls, and say we don't support them? Would anyone scream if we
>> > did that?
>>
>> I'm not an NFSv4 expert (hence my initial incorrect assertion about
>> NFSv4 not supporting readdirplus at all). I defer to those who are
>> actually working on the standard and Linux implementation (Bruce?)
>> But typically masking out these features could potentially cause
>> severe interoperability problems for certain client implementations.
>> We can only know for sure after a lot of testing at multivendor events
>> like Connectathon; it's not something I would disable cavalierly.
>
> It's not readdirplus (FATTR4_WORD0_FILEHANDLE?) I was talking about
> masking out -- as I said, we can cope with that. It's things that would
> _really_ require the inode, like FATTR4_WORD0_ACL, FATTR4_WORD0_CHANGE,
> etc.

AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect
data changes on the server. That's one I think needs to remain
enabled in the server capabilities bitmask. In fact most local file
systems on Linux don't handle that one correctly yet anyway, but
"saner" clients do depend on that one to manage their page caches.

I don't think there's a way to indicate to a client that a server
supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a
NFSv4 READDIR.

> But still, if masking them out would be a problem, we could use the
> existing trick of doing readdir into a buffer and then going back and
> doing the actual lookup later. But _only_ for that relatively rare case,
> rather than for _all_ users of readdirplus, as we do at the moment.

Personally I don't have an immediate problem with the double-buffering
here. The extra data copy wouldn't add significant overhead to an
already expensive and not terribly frequent operation. There may be a
caching opportunity to saving the buffered result, but I would bet
either the contents of a directory or the attributes of objects
residing in it would change too often for there to be much benefit.

I think we have an opportunity for a little better concurrency here,
though. Doing all the lookups and the directory read asynchronously
and concurrently, if that's possible, might be a performance
enhancement in most normal cases. This would allow the requests to be
passed all at once to the local file system, which could then organize
any necessary disk accesses in a more optimal way.

That would be a reason to continue to do a double-buffered readdir in
every case.

--
Chuck Lever

2008-08-02 18:26:44

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Fri, Aug 01, 2008 at 01:47:58PM -0400, Chuck Lever wrote:
> AFAIK WORD0_CHANGE is a required value for NFSv4 clients to detect
> data changes on the server. That's one I think needs to remain
> enabled in the server capabilities bitmask. In fact most local file
> systems on Linux don't handle that one correctly yet anyway, but
> "saner" clients do depend on that one to manage their page caches.
>
> I don't think there's a way to indicate to a client that a server
> supports WORD0_CHANGE for a normal NFSv4 GETATTR, but not during a
> NFSv4 READDIR.
>
> > But still, if masking them out would be a problem, we could use the
> > existing trick of doing readdir into a buffer and then going back and
> > doing the actual lookup later. But _only_ for that relatively rare case,
> > rather than for _all_ users of readdirplus, as we do at the moment.
>
> Personally I don't have an immediate problem with the double-buffering
> here. The extra data copy wouldn't add significant overhead to an
> already expensive and not terribly frequent operation. There may be a
> caching opportunity to saving the buffered result, but I would bet
> either the contents of a directory or the attributes of objects
> residing in it would change too often for there to be much benefit.
>
> I think we have an opportunity for a little better concurrency here,
> though. Doing all the lookups and the directory read asynchronously
> and concurrently, if that's possible, might be a performance
> enhancement in most normal cases. This would allow the requests to be
> passed all at once to the local file system, which could then organize
> any necessary disk accesses in a more optimal way.
>
> That would be a reason to continue to do a double-buffered readdir in
> every case.

I don't see how xfs's double-buffered readdir allows you to do the
"lookups and directory reads asynchronously and concurrently"? The
filesystem doesn't know the lookups are going to be coming at the time
that it gets the readdir, so it doesn't know to start them yet.

Could you add a readdirplus vfs operation which took a flag indicating
how much extra information you're going to need?

The three cases where readdir can be called are:

- ordinary readdir, for which the existing filldir_t provides
all the information needed.
- nfsv3 readdirplus, where the only additional information
needed is whether there's a mountpoint.
- nfsv4 readdir, where we may need all the stat info, acls,
etc., etc.

So you could define a readdirplus file operation like

readdirplus(file, void, plusfiller, flag)

with plusfiller defined just as filldir_t but with an extra struct
dentry * argument. The only nonzero flag value would be DENTRY_NEEDED,
and without DENTRY_NEEDED set, the filesystem would have the option of
passing NULL in the plusfiller callback's dentry argument. So nfsv3
would use flag == 0, and get a dentry passed back to it only when it was
already cached (all it needs to detect mountpoints). But nfsv4 would
pass in DENTRY_NEEDED, and get a dentry on every callback.

Um. That assumes it's only the lookup that has the locking problems,
and that the calls back into the filesystem for acls and so on will be
safe. Etc.

Or you could define a more complicated set of flags indicating exactly
what would be needed, and have the callback pass back a big structure,
most of whose fields could be left unset depending on what was
requested.

Though really I can't see any great objection to just moving xfs's hack
up into nfsd. It may not do everything, but it seems like an
incremental improvement.

--b.

2008-08-02 20:42:58

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote:
> Could you add a readdirplus vfs operation which took a flag indicating
> how much extra information you're going to need?

Actually, if we're screwing with readdir then xfs would like to know how
much it's going to be asked to read, rather than just have the filldir
callback return zero when it's done.

> The three cases where readdir can be called are:
> - ordinary readdir, for which the existing filldir_t provides
> all the information needed.
> - nfsv3 readdirplus, where the only additional information
> needed is whether there's a mountpoint.

It also wants a file handle. For which I think it just needs
i_generation in addition to the information it already has.

> - nfsv4 readdir, where we may need all the stat info, acls,
> etc., etc.

We _might_, but most of the time we won't. It might be OK to fall back
to the existing double-buffer hack for the cases where we _do_ need that
extra information.

I think a ->lookup_fh() method (which _expects_ to be called from within
filldir, and hence does its locking automatically) is the way to go.

One alternative might just be a full lookup method with the same locking
rules: ->lookup_locked(). That might be easier to implement, and would
solve the problem even for the corner cases of NFSv4. Although I still
think we'd do best to avoid populating the inode cache with _all_
children when we do a readdirplus.

--
David Woodhouse Open Source Technology Centre
[email protected] Intel Corporation


2008-08-02 21:33:38

by J. Bruce Fields

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote:
> On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote:
> > Could you add a readdirplus vfs operation which took a flag indicating
> > how much extra information you're going to need?
>
> Actually, if we're screwing with readdir then xfs would like to know how
> much it's going to be asked to read, rather than just have the filldir
> callback return zero when it's done.

OK.

> > The three cases where readdir can be called are:
> > - ordinary readdir, for which the existing filldir_t provides
> > all the information needed.
> > - nfsv3 readdirplus, where the only additional information
> > needed is whether there's a mountpoint.
>
> It also wants a file handle.

Oops, right.

> For which I think it just needs
> i_generation in addition to the information it already has.

Typically, right, though the filesystem's allowed some choice about what
exactly it wants to use in the filehandle. I don't know how the various
filesystems are actually using that in practice.

> > - nfsv4 readdir, where we may need all the stat info, acls,
> > etc., etc.
>
> We _might_, but most of the time we won't. It might be OK to fall back
> to the existing double-buffer hack for the cases where we _do_ need that
> extra information.

How bad is the "double-buffer hack" anyway? Rather than have this as a
fallback case that's rarely used (hence rarely tested), it might be
simpler just to use it for everything if we're going to use it at all.

--b.

> I think a ->lookup_fh() method (which _expects_ to be called from within
> filldir, and hence does its locking automatically) is the way to go.
>
> One alternative might just be a full lookup method with the same locking
> rules: ->lookup_locked(). That might be easier to implement, and would
> solve the problem even for the corner cases of NFSv4. Although I still
> think we'd do best to avoid populating the inode cache with _all_
> children when we do a readdirplus.
>
> --
> David Woodhouse Open Source Technology Centre
> [email protected] Intel Corporation
>

2008-08-03 08:40:01

by David Woodhouse

[permalink] [raw]
Subject: Re: [RFC] Reinstate NFS exportability for JFFS2.

On Sat, 2008-08-02 at 17:33 -0400, J. Bruce Fields wrote:
> On Sat, Aug 02, 2008 at 09:42:32PM +0100, David Woodhouse wrote:
> > On Sat, 2008-08-02 at 14:26 -0400, J. Bruce Fields wrote:
> > > Could you add a readdirplus vfs operation which took a flag indicating
> > > how much extra information you're going to need?
> >
> > Actually, if we're screwing with readdir then xfs would like to know how
> > much it's going to be asked to read, rather than just have the filldir
> > callback return zero when it's done.
>
> OK.
>
> > > The three cases where readdir can be called are:
> > > - ordinary readdir, for which the existing filldir_t provides
> > > all the information needed.
> > > - nfsv3 readdirplus, where the only additional information
> > > needed is whether there's a mountpoint.
> >
> > It also wants a file handle.
>
> Oops, right.
>
> > For which I think it just needs
> > i_generation in addition to the information it already has.
>
> Typically, right, though the filesystem's allowed some choice about what
> exactly it wants to use in the filehandle. I don't know how the various
> filesystems are actually using that in practice.

That's OK. If we do ->lookup_fh() and they're making their own, they can
put what they like in it.

> > > - nfsv4 readdir, where we may need all the stat info, acls,
> > > etc., etc.
> >
> > We _might_, but most of the time we won't. It might be OK to fall back
> > to the existing double-buffer hack for the cases where we _do_ need that
> > extra information.
>
> How bad is the "double-buffer hack" anyway? Rather than have this as a
> fallback case that's rarely used (hence rarely tested), it might be
> simpler just to use it for everything if we're going to use it at all.

It's certainly a good enough answer for now, which is why I've posted
the patches to do exactly that.

And yes, I have wondered the same myself, since realising that we'll
need a full lookup for some NFSv4 clients anyway. Or maybe the full
->lookup_locked(), perhaps...

--
dwmw2