G'day,
I'm looking at porting to 2.6 the SGI ProPack code for supporting
Hierarchical Storage Managers (HSMs) in nfsd. Basically, this
comprises telling the filesystem to perform certain operations (read,
write and setattr) non-blocking, and converting the -EAGAIN that
results when the HSM has to go out to tape into NFSERR_JUKEBOX for
the client.
So I was somewhat surprised to see that in 2.6, the mapping from
Linux errno to NFS error in fs/nfsd/nfsproc.c:nfserrno() maps
-EAGAIN to nfserr_dropit, causing nfsd_dispatch to just drop the
call and not reply. Furthermore I need to return the network
error -ETIMEDOUT to get NFSERR_JUKEBOX. I don't get it...can
someone explain both of these?
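For reference, the relevant entries in fs/nfsd/nfsproc.c look roughly
like this (quoting from memory, so the exact neighbours may differ):

        static struct {
                int     nfserr;
                int     syserr;
        } nfs_errtbl[] = {
                ...
                { nfserr_jukebox, -ETIMEDOUT }, /* -> NFSERR_JUKEBOX on the wire */
                { nfserr_dropit,  -EAGAIN },    /* -> drop the call, no reply */
                ...
        };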
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Wed, Aug 04, 2004 at 10:13:46AM -0400, J. Bruce Fields wrote:
> On Wed, Aug 04, 2004 at 05:35:00PM +1000, Greg Banks wrote:
> > On Tue, Aug 03, 2004 at 03:16:10PM -0400, J. Bruce Fields wrote:
> >
> > Let me see if I understand this...the *NFS* request is silently dropped,
> > and the *sunrpc cache* request is remembered on the server machine and
> > sent upstairs later, presumably as the userspace daemon replies to
> > earlier upcalls. The NFS client gets nothing...no reply and no
> > indication that it should retry the original NFS request.
>
> Right. But it shouldn't have to retry.
But if the upcall latency (the upcall time itself plus the queue delay)
is greater than the client's initial timeout (e.g. 1.1 sec), the client
*will* retry anyway. I don't have a feel for how long your upcalls
are...
> > So you're implicitly relying on the normal clientside timeout and
> > retry mechanism to get the NFS request resubmitted?
>
> No. When the userspace daemon replies, the request data that was copied
> in svc_defer() is used to make a new request.
You construct a new *NFS* request?
> So from the point of view
> of the NFS server code, it does look like a retry, but the NFS client
> isn't involved--the server rpc code did the retry on its own.
Aha...I see now.
> This is only right if upcalls are done before you've done anything
> non-idempotent, which makes it hard to handle NFSv4 compounds
> correctly.
Ouch, this is not a good assumption, especially considering servers
rebooting and cache timeouts.
> > Why not send EJUKEBOX to the client, and let it manage retry using a
> > retry strategy designed for a slow server instead of the one designed
> > for lossy networks?
>
> That might mean returning EJUKEBOX on a lot of common operations (e.g.
> on the first rpc request from a new client), when the server usually
> could have replied very quickly.
Potentially. But in your experience, do idmapper upcalls proceed quickly?
> Not that I'm happy with this internal retry. Personally I'd rather just
> put the thread to sleep on a short timeout (1 second or less) and then
> return EJUKEBOX. That's currently what we're doing for NFSv4 idmapping
> upcalls.
That sounds like a more balanced approach, although I'd want the
timeout a bit shorter, say 500 to 800 ms.
> > Anyway, the problem I have is the use of EAGAIN. [...]
> Well, you could either translate those EAGAIN's to ETIMEDOUT's, which
> will do what you want, or you could change all the cache code to use
> some other error in place of EAGAIN and change EAGAIN to map to
> NFSERR_JUKEBOX....
I don't fancy futzing with the cache code, or more precisely I don't
fancy having to test it. Perhaps I'll just swallow my pride and
translate EAGAIN from the VFS layer to ETIMEDOUT in the three places
where it can happen.
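Something like this at each of the three call sites (sketch only; the
helper name is just a stand-in for however the VFS call is actually
made):

        err = do_vfs_op(...);           /* read, write or setattr */
        if (err == -EAGAIN)             /* DMF has gone out to tape */
                err = -ETIMEDOUT;       /* nfserrno() -> NFSERR_JUKEBOX */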
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Thu, Aug 05, 2004 at 12:26:08PM +1000, Greg Banks wrote:
> On Wed, Aug 04, 2004 at 10:13:46AM -0400, J. Bruce Fields wrote:
> > So from the point of view
> > of the NFS server code, it does look like a retry, but the NFS client
> > isn't involved--the server rpc code did the retry on its own.
>
> Aha...I see now.
>
> > This is only right if upcalls are done before you've done anything
> > non-idempotent, which makes it hard to handle NFSv4 compounds
> > correctly.
>
> Ouch, this is not a good assumption, especially considering servers
> rebooting and cache timeouts.
I don't see the problem you're referring to. I don't believe that in
either case these internal replays add any problems that we don't
already have, but perhaps I'm missing something.
> > > Why not send EJUKEBOX to the client, and let it manage retry using a
> > > retry strategy designed for a slow server instead of the one designed
> > > for lossy networks?
> >
> > That might mean returning EJUKEBOX on a lot of common operations (e.g.
> > on the first rpc request from a new client), when the server usually
> > could have replied very quickly.
>
> Potentially. But in your experience, do idmapper upcalls proceed quickly?
It's a good question; it would be interesting to measure them sometime
with a variety of different configurations (local /etc/passwd, ldap,
etc.), but we haven't gotten around to that yet.
--Bruce Fields
On Thu, Aug 05, 2004 at 10:21:59AM -0400, J. Bruce Fields wrote:
> On Thu, Aug 05, 2004 at 12:26:08PM +1000, Greg Banks wrote:
> > On Wed, Aug 04, 2004 at 10:13:46AM -0400, J. Bruce Fields wrote:
> > > This is only right if upcalls are done before you've done anything
> > > non-idempotent, which makes it hard to handle NFSv4 compounds
> > > correctly.
> >
> > Ouch, this is not a good assumption, especially considering servers
> > rebooting and cache timeouts.
>
> I don't see the problem you're referring to. I don't believe that in
> either case these internal replays add any problems that we don't
> already have, but perhaps I'm missing something.
Perhaps I misunderstood your sentence...did you instead mean the
assumption is that nothing non-idempotent happens *between* when the
cache is missed and an upcall enqueued, and when the upcall completes?
I was thinking something different...
Perhaps that still sounds like a problem. For example, imagine a test
where two clients both try to create a file (a typical file locking
strategy, e.g. for mail clients), and one of them sends creds which
miss the cache and need an upcall while the other's creds hit the
cache. Does that cache-missing client always lose the race
gracefully? Does something unpleasant happen with the file?
> > > > Why not send EJUKEBOX to the client, and let it manage retry using a
> > > > retry strategy designed for a slow server instead of the one designed
> > > > for lossy networks?
> > >
> > > That might mean returning EJUKEBOX on a lot of common operations (e.g.
> > > on the first rpc request from a new client), when the server usually
> > > could have replied very quickly.
> >
> > Potentially. But in your experience, do idmapper upcalls proceed quickly?
>
> It's a good question; it would be interesting to measure them sometime
> with a variety of different configurations (local /etc/passwd, ldap,
> etc.), but we haven't gotten around to that yet.
You might want to look into what happens when the userspace daemon
queries an LDAP server that is very slow, e.g. multiple seconds of
delay. There was some unpleasantness in situations like this when
upcalls to rpc.mountd, which involved mountd doing reverse DNS
lookups, were added to IRIX a few releases ago. One of the results
was that mountd needed to become multithreaded. The lesson from that
was to assume that upcalls are going to take more time than you want
them to, and to have a sensible strategy with timeouts and
parallelism.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Fri, Aug 06, 2004 at 10:50:50AM +1000, Greg Banks wrote:
> Perhaps I misunderstood your sentence...did you instead mean the
> assumption is that nothing non-idempotent happens *between* when the
> cache is missed and an upcall enqueued, and when the upcall completes?
> I was thinking something different...
No, the problem is just that you can't, as part of processing a single
request, perform, for example, a mkdir followed by a cache lookup that
might result in deferring and revisiting the request, because the wrong
thing will happen when you revisit the request.
With v2/v3, requests are simple enough that this shouldn't be a problem.
But this is a bit harder to avoid when processing a v4 compound.
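To spell out the failure mode (hypothetical compound, not real nfsd4
code):

        /*
         * compound = { CREATE "foo", <op whose lookup misses the cache> }
         *
         * first pass: CREATE succeeds (a side effect), then the cache miss
         *             defers the request -- no reply is sent.
         * replay:     the saved request is reprocessed from the start, so
         *             CREATE runs again and now fails with EEXIST, even
         *             though the original CREATE was fine.
         */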
--Bruce Fields
On Tue, Aug 03, 2004 at 06:15:03PM +1000, Greg Banks wrote:
> So I was somewhat surprised to see that in 2.6, the mapping from
> Linux errno to NFS error in fs/nfsd/nfsproc.c:nfserrno() maps
> -EAGAIN to nfserr_dropit, causing nfsd_dispatch to just drop the
> call and not reply. Furthermore I need to return the network
> error -ETIMEDOUT to get NFSERR_JUKEBOX. I don't get it...can
> someone explain both of these?
The server does upcalls to userspace daemons (usually to mountd to get
export options or IP address->client name mappings) by doing a lookup in
a cache, and returning -EAGAIN if an upcall is required. The request is
then dropped, a copy of the request is made (see svcsock.c:svc_defer())
and reprocessed at a later time (see svcsock.c:svc_revisit()).
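The consumers of this all follow roughly the same pattern (simplified
from the client-name lookup in svcauth_unix.c, quoting from memory):

        switch (cache_check(&ip_map_cache, &ipm->h, &rqstp->rq_chandle)) {
        case -EAGAIN:                   /* upcall queued, request deferred */
                return SVC_DROP;        /* send nothing; svc_revisit() will
                                         * re-queue the saved request when
                                         * the daemon answers */
        case -ENOENT:
                return SVC_DENIED;
        case 0:
                break;                  /* cache hit, carry on */
        }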
I have some rough notes on all this at
http://www.fieldses.org/~bfields/kernel/svc_caches/sunrpc_svc_cache.txt
and Neil has some documentation in Documentation/sunrpc-cache.txt.
--Bruce Fields
On Tue, Aug 03, 2004 at 03:16:10PM -0400, J. Bruce Fields wrote:
> On Tue, Aug 03, 2004 at 06:15:03PM +1000, Greg Banks wrote:
> > So I was somewhat surprised to see that in 2.6, the mapping from
> > Linux errno to NFS error in fs/nfsd/nfsproc.c:nfserrno() maps
> > -EAGAIN to nfserr_dropit, causing nfsd_dispatch to just drop the
> > call and not reply. Furthermore I need to return the network
> > error -ETIMEDOUT to get NFSERR_JUKEBOX. I don't get it...can
> > someone explain both of these?
>
> The server does upcalls to userspace daemons (usually to mountd to get
> export options or IP address->client name mappings) by doing a lookup in
> a cache, and returning -EAGAIN if an upcall is required.
With, if I understand your document correctly, the side effect of
queuing an upcall which can be expected to fill the cache at some
later date depending on the action of the userspace daemon (and
its LDAP or NIS lookups or whatever).
> The request is
> then dropped, a copy of the request is made (see svcsock.c:svc_defer())
> and reprocessed at a later time (see svcsock.c:svc_revisit()).
Let me see if I understand this...the *NFS* request is silently dropped,
and the *sunrpc cache* request is remembered on the server machine and
sent upstairs later, presumably as the userspace daemon replies to
earlier upcalls. The NFS client gets nothing...no reply and no
indication that it should retry the original NFS request.
So you're implicitly relying on the normal clientside timeout and
retry mechanism to get the NFS request resubmitted? Why not send
EJUKEBOX to the client, and let it manage retry using a retry strategy
designed for a slow server instead of the one designed for lossy
networks? Otherwise the exponential backoff could hurt the client's
latency more than it deserves.
Anyway, the problem I have is the use of EAGAIN. The normal semantics
of EAGAIN are that the receiver should cause a retry of whatever it was
doing. So if (say) nfsd_write() returned -EAGAIN it would make
sense to translate that into NFSERR_JUKEBOX which is designed to have
the same effect at the client.
This is what happens on an Altix with DMF installed: the DMF hooks
in the filesystem return EAGAIN when they need to pull a file in
from tape, and that percolates naturally through several layers to
be translated to NFSERR_JUKEBOX to the client.
http://www.sgi.com/products/storage/tech/dmf.html
That DMF code returns EAGAIN because it's been given O_NONBLOCK,
and that's the semantics of O_NONBLOCK: return EAGAIN if you can't
succeed without blocking. This happens in code that's common to the
IRIX and Linux implementations of DMF. In other words, there's a
whole body of history for using EAGAIN this way.
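I.e. the standard O_NONBLOCK contract (trivial illustration, nothing
DMF-specific about it):

        char buf[4096];
        int fd = open(path, O_RDONLY | O_NONBLOCK);
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n < 0 && errno == EAGAIN) {
                /* the data can't be produced without blocking (for DMF,
                 * the file is out on tape); the caller decides how and
                 * when to retry */
        }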
Instead you're using EAGAIN in another part of the code to indicate
that a cache entry has expired or is missing and that an upcall is
in progress and might succeed later, which then drives a completely
different retry strategy (with the client being kept in the dark).
We need to figure out how to resolve these two competing usages of
EAGAIN.
Greg.
--
Greg Banks, R&D Software Engineer, SGI Australian Software Group.
I don't speak for SGI.
On Wed, Aug 04, 2004 at 05:35:00PM +1000, Greg Banks wrote:
> On Tue, Aug 03, 2004 at 03:16:10PM -0400, J. Bruce Fields wrote:
> > The server does upcalls to userspace daemons (usually to mountd to get
> > export options or IP address->client name mappings) by doing a lookup in
> > a cache, and returning -EAGAIN if an upcall is required.
>
> With, if I understand your document correctly, the side effect of
> queuing an upcall which can be expected to fill the cache at some
> later date depending on the action of the userspace daemon (and
> its LDAP or NIS lookups or whatever).
Yes.
> > The request is
> > then dropped, a copy of the request is made (see svcsock.c:svc_defer())
> > and reprocessed at a later time (see svcsock.c:svc_revisit()).
>
> Let me see if I understand this...the *NFS* request is silently dropped,
> and the *sunrpc cache* request is remembered on the server machine and
> sent upstairs later, presumably as the userspace daemon replies to
> earlier upcalls. The NFS client gets nothing...no reply and no
> indication that it should retry the original NFS request.
Right. But it shouldn't have to retry.
> So you're implicitly relying on the normal clientside timeout and
> retry mechanism to get the NFS request resubmitted?
No. When the userspace daemon replies, the request data that was copied
in svc_defer() is used to make a new request. So from the point of view
of the NFS server code, it does look like a retry, but the NFS client
isn't involved--the server rpc code did the retry on its own.
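Very roughly, the two halves of that look like this (names from
svcsock.c, heavily abridged):

        /*
         * svc_defer():   called via rqstp->rq_chandle.defer when a cache
         *                lookup returns -EAGAIN; copies the undecoded rpc
         *                arguments into a struct svc_deferred_req and
         *                attaches it to the cache entry.
         *
         * svc_revisit(): called when the daemon fills the entry; puts the
         *                saved request on the socket's deferred queue and
         *                wakes a server thread, which processes it as if
         *                it had just arrived off the wire.
         */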
This is only right if upcalls are done before you've done anything
non-idempotent, which makes it hard to handle NFSv4 compounds
correctly.
> Why not send EJUKEBOX to the client, and let it manage retry using a
> retry strategy designed for a slow server instead of the one designed
> for lossy networks?
That might mean returning EJUKEBOX on a lot of common operations (e.g.
on the first rpc request from a new client), when the server usually
could have replied very quickly.
Not that I'm happy with this internal retry. Personally I'd rather just
put the thread to sleep on a short timeout (1 second or less) and then
return EJUKEBOX. That's currently what we're doing for NFSv4 idmapping
upcalls.
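The shape of that approach, very roughly (not the actual idmap code;
the upcall structure here is made up):

        /* give the daemon up to ~1 second, then punt back to the client */
        wait_event_interruptible_timeout(upc->waitq, upc->done, HZ);

        if (!upc->done)
                return nfserr_jukebox;  /* client retries on its own schedule */
        return upc->status;             /* the daemon answered in time */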
> Anyway, the problem I have is the use of EAGAIN. The normal semantics
> of EAGAIN are that the receiver should cause a retry of whatever it was
> doing. So if (say) nfsd_write() returned -EAGAIN it would make
> sense to translate that into NFSERR_JUKEBOX which is designed to have
> the same effect at the client.
>
> This is what happens on an Altix with DMF installed: the DMF hooks
> in the filesystem return EAGAIN when they need to pull a file in
> from tape, and that percolates naturally through several layers to
> be translated to NFSERR_JUKEBOX to the client.
Well, you could either translate those EAGAIN's to ETIMEDOUT's, which
will do what you want, or you could change all the cache code to use
some other error in place of EAGAIN and change EAGAIN to map to
NFSERR_JUKEBOX....
--Bruce Fields