2003-11-13 14:19:14

by martin.knoblauch

Subject: nfs_statfs: statfs error = 116

Hi,

sorry if OT, but what is the above message trying to tell me? Where can I
find a translation of the numbers? We are seeing 116 very frequently,
512 and 5 on occasion.


We have a bunch of Linux clients (Dual P4, RH7.3, 2.4.20-18.7smp
errata kernel) hanging off two Sun NFS Servers (Solaris 8) in a
Veritas/VCS HA configuration. All of the clients show the 116 messages,
while some of them show the 512 in addition. Those with the 512s seem to
"hang" for some periods of time.

The mount options are "vers=3,proto=tcp,hard,intr,bg". Some of the
filesystems are mounted at boot time, quite a few via "amd".

Any ideas are welcome.

Thanks
Martin



2003-11-13 14:38:19

by Richard B. Johnson

Subject: Re: nfs_statfs: statfs error = 116

On Thu, 13 Nov 2003, martin.knoblauch wrote:

> Hi,
>
> sorry if OT, but what is the above message trying to tell me? Where can I
> find a translation of the numbers? We are seeing 116 very frequently,
> 512 and 5 on occasion.
>

ESTALE is "errno" 116
EIO is "errno" 5
ERESTARTSYS is "errno" 512

You can find these in /usr/include/asm/errno.h (not good to
directly include in a program).

The program reporting these errors should have included:

<errno.h>
<string.h>

Then used...
strerror(errno);
or
perror("");
etc.


Errno 512 should never be seen by a user-mode program, so the
header file, /usr/include/linux/errno.h, states...

ESTALE happens when a mounted file-system is on a server that
went down or re-booted. The file-handles are then "stale".

EIO is a general catch-all for an I/O error.

ERESTARTSYS is the error returned by a server that has
re-booted that is supposed to tell the client-side software
to get a new file-handle because of an attempt to access with
a stale file-handle. When getting this error, the client
should have reopened the file(s) to obtain a new handle.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-13 14:55:36

by martin.knoblauch

Subject: Re: nfs_statfs: statfs error = 116





"Richard B. Johnson" <[email protected]> wrote on 11/13/2003 03:39:53
PM:

> On Thu, 13 Nov 2003, martin.knoblauch wrote:
>
> > Hi,
> >
> > sorry if OT, but what is the above message trying to tell me? Where can I
> > find a translation of the numbers? We are seeing 116 very frequently,
> > 512 and 5 on occasion.
> >
>
> ESTALE is "errno" 116
> EIO is "errno" 5
> ERESTARTSYS is "errno" 512
>
> You can find these in /usr/include/asm/errno.h (not good to
> directly include in a program).
>
> The program reporting these errors should have included:
>
> <errno.h>
> <string.h>
>

The messages actually come out of the kernel-nfs code (inode.c). Should
have mentioned "dmesg" :-)

> Then used...
> strerror(errno);
> or
> perror("");
> etc.
>
>
> Errno 512 should never be seen by a user-mode program, so the
> header file, /usr/include/linux/errno.h, states...
>

This worries me a bit :-)

> ESTALE happens when a mounted file-system is on a server that
> went down or re-booted. The file-handles are then "stale".
>

I am "almost" sure that there were no reboot or failover events at the
time of most of the stale messages. But I'm not going to lay my hand on the
book for that.

> EIO is a general catch-all for an I/O error.
>
> ERESTARTSYS is the error returned by a server that has
> re-booted that is supposed to tell the client-side software
> to get a new file-handle because of an attempt to access with
> a stale file-handle. When getting this error, the client
> should have reopened the file(s) to obtain a new handle.
>

Definitely no server reboot or HA Failover at the time of the messages.

Thanks
Martin

2003-11-13 15:30:49

by Trond Myklebust

Subject: Re: nfs_statfs: statfs error = 116

>>>>> " " == Richard B Johnson <[email protected]> writes:

> ESTALE happens when a mounted file-system is on a server that
> went down or re-booted. The file-handles are then "stale".

Sort of. It means that the server is unable to find the file that
corresponds to the filehandle that the client sent it. If the server
strictly follows the NFS specs, then this is only supposed to happen
if somebody else has deleted the file (and this is why designing a
scheme for generating filehandles is such a difficult job).

Some broken servers do, however, "lose" the file in other interesting
and unpredictable ways.

> ERESTARTSYS is the error returned by a server that has
> re-booted that is supposed to tell the client-side software to
> get a new file-handle because of an attempt to access with a
> stale file-handle. When getting this error, the client should
> have reopened the file(s) to obtain a new handle.

ERESTARTSYS actually just means that a signal was received while
inside a system call. If this results in an interruption of that
syscall, the kernel is supposed to translate ERESTARTSYS into the user
error EINTR.

Userland should therefore never have to handle ERESTARTSYS errors.

Cheers,
Trond

2003-11-13 15:58:07

by Richard B. Johnson

Subject: Re: nfs_statfs: statfs error = 116

On Thu, 13 Nov 2003, Trond Myklebust wrote:

> >>>>> " " == Richard B Johnson <[email protected]> writes:
>
> > ESTALE happens when a mounted file-system is on a server that
> > went down or re-booted. The file-handles are then "stale".
>
> Sort of. It means that the server is unable to find the file that
> corresponds to the filehandle that the client sent it. If the server
> strictly follows the NFS specs, then this is only supposed to happen
> if somebody else has deleted the file (and this is why designing a
> scheme for generating filehandles is such a difficult job).
>
> Some broken servers do, however, "lose" the file in other interesting
> and unpredictable ways.
>
> > ERESTARTSYS is the error returned by a server that has
> > re-booted that is supposed to tell the client-side software to
> > get a new file-handle because of an attempt to access with a
> > stale file-handle. When getting this error, the client should
> > have reopened the file(s) to obtain a new handle.
>
> ERESTARTSYS actually just means that a signal was received while
> inside a system call. If this results in an interruption of that
> syscall, the kernel is supposed to translate ERESTARTSYS into the user
> error EINTR.
>
> Userland should therefore never have to handle ERESTARTSYS errors.
>

Hmmm, Maybe I'm getting confused by all the winning-lottery messages,
but it's in the syscall specifications for connect() and
even fcntl(). http://www.infran.ru/Techinfo/syscalls/syscalls_43.html

Also, maybe Linux now claims exclusive ownership and keeps it internal,
but some networking software, nfsd and pcnfsd, might not know about that.
I've seen ERESTARTSYS returned from a DOS (actually FAT) file-handle use
after a server has crashed and come back on-line.

Moot point, though, the reported errors were internal via syslog, which
was not previously known when I responded.


Cheers,
Dick Johnson
Penguin : Linux version 2.4.22 on an i686 machine (797.90 BogoMips).
Note 96.31% of all statistics are fiction.


2003-11-13 17:03:55

by Trond Myklebust

Subject: Re: nfs_statfs: statfs error = 116

>>>>> " " == Richard B Johnson <[email protected]> writes:

>> ERESTARTSYS actually just means that a signal was received
>> while inside a system call. If this results in an interruption
>> of that syscall, the kernel is supposed to translate
>> ERESTARTSYS into the user error EINTR.

> Hmmm, Maybe I'm getting confused by all the winning-lottery
> messages, but it's in the syscall specifications for connect()
> and even
> fcntl(). http://www.infran.ru/Techinfo/syscalls/syscalls_43.html

AFAICS that documentation was written in 1994, and refers to Linux
v1.0. We've come a long way since then...

Today's Linux userland is supposed to try to comply with the Single
Unix Specification (see http://www.unix-systems.org/version3/)
whenever possible. ERESTARTSYS is missing altogether from the SUSv3
definitions in <errno.h> (and hence does not appear as a valid return
value for any SUSv3-compliant functions).

Note: the Linux manpages do list ERESTARTSYS as still being returned
by the accept() and syslog() system calls. In both those cases,
however, they point out that your libc is supposed to intercept it
before it gets to the user.

> Also, maybe Linux now claims exclusive ownership and keeps it
> internal, but some networking software, nfsd and pcnfsd, might
> not know about that. I've seen ERESTARTSYS returned from a DOS
> (actually FAT) file-handle use after a server has crashed and
> come back on-line.

Linux used to be buggy/non-compliant w.r.t. NFS exporting of FAT
filesystems. I'm not sure if that has been fixed yet.

Cheers,
Trond

2003-11-13 20:27:16

by Jesse Pollard

Subject: Re: nfs_statfs: statfs error = 116

On Thursday 13 November 2003 08:52, [email protected] wrote:
> "Richard B. Johnson" <[email protected]> wrote on 11/13/2003 03:39:53
>
> PM:
> > On Thu, 13 Nov 2003, martin.knoblauch wrote:
[snip]
> > ESTALE happens when a mounted file-system is on a server that
> > went down or re-booted. The file-handles are then "stale".
>
> I am "almost" sure that there were no reboot or failover events at the
> time of most of the stale messages. But I'm not going to lay my hand on the
> book for that.

ESTALE should occur whenever the client loses connection to the server,
or thinks it has lost connection. It isn't directly related to the server
other than the fact that a server reboot will also cause it to happen.

This should be a transient failure that recovers once communication is
verified by the timeouts/retries associated with NFS.

At worst, it can require a remount of the NFS volume.

2003-11-13 20:35:46

by Trond Myklebust

Subject: Re: nfs_statfs: statfs error = 116

>>>>> " " == Jesse Pollard <[email protected]> writes:

> ESTALE should occur whenever the client loses connection to
> the server, or thinks it has lost connection.

No it should not.

Cheers,
Trond

2003-11-14 08:44:20

by martin.knoblauch

Subject: Re: nfs_statfs: statfs error = 116






Trond Myklebust <[email protected]> wrote on 11/13/2003 09:34:55
PM:

> >>>>> " " == Jesse Pollard <[email protected]> writes:
>
> > ESTALE should occur whenever the client loses connection to
> > the server, or thinks it has lost connection.
>
> No it should not.
>
> Cheers,
> Trond
Hi Trond,

just by accident I found one way a user-space application can get
ESTALE in our setup (Linux client RH-2.4.20-18.7smp, Solaris 2.8
Server). I accidentally ran iozone on two clients with the output file
being the same and residing on the NFS Server. Pure luser error, but it
produced ESTALE pretty much reproducibly.

B^HCheers
Martin
--
Martin Knoblauch
Senior System Architect
MSC.software GmbH
Am Moosfeld 13
D-81829 Muenchen, Germany

e-mail: [email protected]
http://www.mscsoftware.com
Phone/Fax: +49-89-431987-189 / -7189
Mobile: +49-174-3069245


2003-11-14 13:49:51

by Trond Myklebust

Subject: Re: nfs_statfs: statfs error = 116

>>>>> " " == Martin Knoblauch <[email protected]> writes:

> I accidentally ran iozone on two clients with the output file
> being the same and residing on the NFS Server. Pure luser
> error, but it produced ESTALE pretty much reproducibly.

Sure. This is a prime example of where ESTALE *is* appropriate. One
NFS client is deleting a file on the server while the other is still
using it.

In the NFSv2/v3 protocols, the assumption is that filehandles are
valid for the entire lifetime of the file on the server. IOW only
"unlink()" can cause a valid filehandle to become stale. This is
mainly because there is no notion of open()/close(), so the server
would never be capable of determining when your client has stopped
using the filehandle.

If your 2 processes were running on the same machine, you would have
seen the kernel temporarily rename your file to .nfsXXXXXX in order to
work around the above problem. Delete that file, and you will generate
ESTALE reproducibly too....

Cheers,
Trond

2003-11-14 14:25:13

by martin.knoblauch

Subject: Re: nfs_statfs: statfs error = 116






Trond Myklebust <[email protected]> wrote on 11/14/2003 02:49:31
PM:

> >>>>> " " == Martin Knoblauch <[email protected]> writes:
>
> > I accidentally ran iozone on two clients with the output file
> > being the same and residing on the NFS Server. Pure luser
> > error, but it produced ESTALE pretty much reproducibly.
>
> Sure. This is a prime example of where ESTALE *is* appropriate. One
> NFS client is deleting a file on the server while the other is still
> using it.
>
> In the NFSv2/v3 protocols, the assumption is that filehandles are
> valid for the entire lifetime of the file on the server. IOW only
> "unlink()" can cause a valid filehandle to become stale. This is
> mainly because there is no notion of open()/close(), so the server
> would never be capable of determining when your client has stopped
> using the filehandle.
>
> If your 2 processes were running on the same machine, you would have
> seen the kernel temporarily rename your file to .nfsXXXXXX in order to
> work around the above problem. Delete that file, and you will generate
> ESTALE reproducibly too....
>
> Cheers,
> Trond
Trond,

cool. Great explanation. Always good if you can get those who know to
start talking :-)

Cheers
Martin