2003-07-10 18:30:28

by Danny Smith

Subject: NFSERR_EAGAIN

I've been trying to resolve some issues we have with a set of systems
running 2.4.20+NFS_ALL, dual CPU and Gigabit Ethernet. They're talking
to SGI IRIX servers (6.5.19) and having intermittent problems: sometimes
an NFS-mounted directory will "disappear", but subsequently be accessible
again. No errors are returned to the shell; the directory just appears
to have no entries.

This seems (although I don't have proof positive yet, more testing is in
progress) to coincide with errors in the logs:

Jul 9 14:18:09 trout-node13 kernel: nfs_stat_to_errno: bad nfs status
return value: 11

Looking through nfs2xdr.c and nfs.h and googling, it seems that error
number 11 is not properly defined, but certainly seems to be in use by
SGI. From nfs2xdr.c:

{ NFSERR_NXIO, ENXIO },
/* { NFSERR_EAGAIN, EAGAIN }, */
{ NFSERR_ACCES, EACCES },

(EAGAIN having value 11)
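
To make the lookup concrete, here is a minimal userspace sketch of a
status-to-errno table of the kind quoted above. It is not the kernel
source: the SK_ names, the sk_stat_to_errno() function and its
accept_eagain switch are hypothetical, and the fallback to EIO for an
unknown status is an assumption about how the real function behaves.
The numeric status values are the ones listed in RFC 1094, plus the
undocumented 11.

```c
#include <errno.h>
#include <stdio.h>

/* NFSv2 status values from RFC 1094 (subset), plus the undocumented
 * value 11. The SK_ names are hypothetical, used only in this sketch. */
enum {
    SK_NFS_OK        = 0,
    SK_NFSERR_PERM   = 1,
    SK_NFSERR_NOENT  = 2,
    SK_NFSERR_IO     = 5,
    SK_NFSERR_NXIO   = 6,
    SK_NFSERR_EAGAIN = 11,  /* not defined by RFC 1094 */
    SK_NFSERR_ACCES  = 13
};

struct sk_errmap {
    int stat;   /* status value received from the server */
    int err;    /* local errno it is translated to */
};

/* Translation table; note there is no entry for status 11, which
 * mirrors the commented-out line quoted above. */
static const struct sk_errmap sk_errtbl[] = {
    { SK_NFS_OK,       0      },
    { SK_NFSERR_PERM,  EPERM  },
    { SK_NFSERR_NOENT, ENOENT },
    { SK_NFSERR_IO,    EIO    },
    { SK_NFSERR_NXIO,  ENXIO  },
    { SK_NFSERR_ACCES, EACCES }
};

/* Map a status value to an errno. If accept_eagain is non-zero,
 * status 11 is translated to EAGAIN; otherwise it is treated like
 * any other unknown status. */
int sk_stat_to_errno(int stat, int accept_eagain)
{
    size_t i;

    if (accept_eagain && stat == SK_NFSERR_EAGAIN)
        return EAGAIN;

    for (i = 0; i < sizeof(sk_errtbl) / sizeof(sk_errtbl[0]); i++) {
        if (sk_errtbl[i].stat == stat)
            return sk_errtbl[i].err;
    }

    /* Unknown status: report it and fall back to EIO (assumed). */
    fprintf(stderr, "sk_stat_to_errno: bad nfs status return value: %d\n",
            stat);
    return EIO;
}
```

With accept_eagain set to 0, a status of 11 takes the fallback path,
which is the situation the log line above reports. With it set to 1,
the same status is translated to EAGAIN instead.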

Does anyone know much about the history of this? Was this removed in
order to be RFC compliant, or is there a stronger motivation not to have
this?

It would maybe explain what I'm seeing if SGI are interpreting error 11
as "Try again" - this mainly happens when server load is high.

Any insight into this would be welcome - meanwhile I'm going to try some
tests with NFSERR_EAGAIN put back in.

Danny

--
Danny Smith
Senior Systems Administrator, Cinesite (Europe) Ltd
020 7973 4000 - x4055 / [email protected]




-------------------------------------------------------
This SF.Net email sponsored by: Parasoft
Error proof Web apps, automate testing & more.
Download & eval WebKing and get a free book.
http://www.parasoft.com/bulletproofapps1
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs


2003-07-10 22:41:27

by Trond Myklebust

Subject: Re: NFSERR_EAGAIN

>>>>> " " == Danny Smith <[email protected]> writes:

> Looking through nfs2xdr.c and nfs.h and googling, it seems that
> error number 11 is not properly defined, but certainly seems to
> be in use by SGI. From nfs2xdr.c:

> { NFSERR_NXIO, ENXIO },

> /* { NFSERR_EAGAIN, EAGAIN }, */
> { NFSERR_ACCES, EACCES },

> (EAGAIN having value 11)

> Does anyone know much about the history of this? Was this
> removed in order to be RFC compliant, or is there a stronger
> motivation not to have this?

Why should we be supporting something which isn't documented in the RFCs?

What is this error anyway? Is it some SGI hack for emulating
NFS3ERR_JUKEBOX under NFSv2?

Cheers,
Trond



2003-07-11 09:55:39

by Danny Smith

Subject: Re: NFSERR_EAGAIN

Trond Myklebust wrote:

>>>>>>" " == Danny Smith <[email protected]> writes:
>>>>>>
>>>>>>
>
> > Looking through nfs2xdr.c and nfs.h and googling, it seems that
> > error number 11 is not properly defined, but certainly seems to
> > be in use by SGI. From nfs2xdr.c:
>
> > { NFSERR_NXIO, ENXIO },
>
> > /* { NFSERR_EAGAIN, EAGAIN }, */
> > { NFSERR_ACCES, EACCES },
>
> > (EAGAIN having value 11)
>
> > Does anyone know much about the history of this? Was this
> > removed in order to be RFC compliant, or is there a stronger
> > motivation not to have this?
>
>Why should we be supporting something which isn't documented in the RFCs?
>
I really want to know how it got into the source in the first place. Is
it a feature that's "not part of the standard, but seem to be widely
used nevertheless" (quoting from nfs.h)?

Searching around shows a (very) few occurrences of this, with Solaris and
IRIX servers, although often accompanied by other issues.

If the server is broken, but we can work with it anyway without
upsetting anything else, this would be a "good thing" in my book (of
course, we would take it up with SGI too).

>What is this error anyway? Is it some SGI hack for emulating
>NFS3ERR_JUKEBOX under NFSv2?
>
>
Right now, I don't really know, but I'm guessing it's something of the
sort (seems to make sense with what we're seeing).
I'm trying to get a packet trace to verify this is indeed what is being
sent - will update when I have further evidence.

Danny

--
Danny Smith
Senior Systems Administrator, Cinesite (Europe) Ltd
020 7973 4000 - x4055 / [email protected]





2003-07-11 10:13:55

by Trond Myklebust

Subject: Re: NFSERR_EAGAIN

>>>>> " " == Danny Smith <[email protected]> writes:

> I really want to know how it got into the source in the first
> place?

That's a question for Olaf Kirch. I've never touched that entry.

> If the server is broken, but we can work with it anyway without
> upsetting anything else, this would be a "good thing" in my
> book (of course, we would take it up with SGI too).

Depends what it is. If it *is* jukebox, then we might as well just
recommend that people use NFSv3 (NFSv2 is legacy anyway). Jukebox is
only used for slow media (tape, cd-exchangers,...), so it's never
going to be a common case.

cheers,
Trond





2003-07-16 15:54:22

by Danny Smith

Subject: Re: NFSERR_EAGAIN - resolved.

Danny Smith wrote:

> I've been trying to resolve some issues we have with a set of systems
> running 2.4.20+NFS_ALL, dual CPU and Gigabit Ethernet. They're talking
> to SGI IRIX servers, (6.5.19), and having intermittent problems - this
> can be seen sometimes where an NFS mounted directory will "disappear",
> but subsequently be accessible. No errors are returned to the shell -
> the directory just appears to have no entries.
>
> This seems (although I don't have proof positive yet, more testing is
> in progrees) to coincide with errors in the logs:
>
> Jul 9 14:18:09 trout-node13 kernel: nfs_stat_to_errno: bad nfs status
> return value: 11
>
I've found the cause of these messages - SGI are in the clear - it's not
coming from their servers.
It's 'amd' (am-utils) which is providing the incorrect responses.

The trigger seems to be an attempt to access an automount on a host which
is not running an NFS server. amd then sends back a garbled response over
the loopback interface (after getting several "RPC - program not
registered" responses).

Whether this is causing our other problems remains to be seen.

Danny

--
Danny Smith
Senior Systems Administrator, Cinesite (Europe) Ltd
020 7973 4000 - x4055 / [email protected]



