Hi,
We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
When the NetApp gets very busy (for example, one user is deleting 1 TB
of data while another runs a 30-client throughput test), it stops
responding to some requests.
Although we are using hard mounts, some users report that during the
hammering period some of their file operations produce "I/O Error"
messages on their terminal.
We checked, and the hosts are indeed using hard mounts. From our
reading, I/O errors should only ever make it back to the user if we are
using soft mounts.
We're pretty sure the filer is not sending back an NFS_ERR response (and we're
pretty sure that wouldn't get reported to the user as an I/O Error...)
At this point, we suspect there must be a path in the NFS implementation
that returns an I/O error to user space even with a hard mount.
Any ideas?
Dave
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://sourceforge.net/services/buy/index.php
_______________________________________________
NFS maillist - [email protected]
https://lists.sourceforge.net/lists/listinfo/nfs
_______________________________________________
Please note that [email protected] is being discontinued.
Please subscribe to [email protected] instead.
http://vger.kernel.org/vger-lists.html#linux-nfs
Can you take a trace on the client when the "I/O error" message shows up? I
suppose that'd be hard, but that'd tell us pretty quick where the message is
coming from. You could also take a trace from the filer, if that's easier,
using pktt.
-Blake
On 6/4/08 6:33 AM, "David Konerding" <[email protected]> wrote:
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting. From our
> reading, I/O Errors
> should only ever make it back to the user if are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
>
> Dave
On Wed, 4 Jun 2008 06:33:18 -0700
"David Konerding" <[email protected]> wrote:
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting. From our
> reading, I/O Errors
> should only ever make it back to the user if are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
>
hard/soft only governs what happens when there is a major timeout (i.e.
the server doesn't respond within a given time). If there are other
errors (for instance, client side memory shortage, server starts
refusing connections, etc), then there can be errors returned to the
application.
EIO is pretty generic, and is often what you see when a more obscure
error is translated into what a syscall would expect. It can happen for
other reasons besides an RPC timeout.
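Since the hard/soft distinction is central here, it may be worth confirming what the kernel actually applied rather than what was requested at mount time. The following is a minimal sketch, assuming Linux-style /proc/mounts text and assuming that an NFS entry without an explicit "soft" option is a hard mount (hard being the default). The function name nfs_mount_modes is made up for illustration.

```python
def nfs_mount_modes(mounts_text):
    """Map each NFS mount point to 'hard' or 'soft'.

    `mounts_text` is the content of /proc/mounts (or text in the same layout).
    Assumption: an NFS entry that does not list "soft" is a hard mount.
    """
    modes = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Layout: device, mount point, fs type, options, dump, pass
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        options = fields[3].split(",")
        modes[fields[1]] = "soft" if "soft" in options else "hard"
    return modes
```

On a client, the input would come from reading /proc/mounts; any mount point reported as "soft" would explain EIO after a major timeout, while "hard" everywhere points at one of the other error paths described above.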
--
Jeff Layton <[email protected]>
>> Although we are using hard mounts, some users report that during the
>> hammering period, some of their
>> file operations produce "I/O Error" messages on their terminal.
>>
>> We checked, and the hosts are indeed using hard mounting. From our
>> reading, I/O Errors
>> should only ever make it back to the user if are using soft mounting.
>>
> hard/soft only governs what happens when there is a major timeout (i.e.
> the server doesn't respond within a given time). If there are other
> errors (for instance, client side memory shortage, server starts
> refusing connections, etc), then there can be errors returned to the
> application.
>
OK; we're already using TCP mounts, so I don't think that any new
client->server connections should occur after the mount is established.
Second, memory is not an issue; this happens on lightly loaded clients
with 64 GB of RAM, and the RAM is all cache and buffer.
> EIO is pretty generic, and is often what you see when a more obscure
> error is translated into what a syscall would expect. It can happen for
> other reasons besides an RPC timeout.
OK, so our best bet to debug this is to:
1) reproduce the problem
2) when the problem occurs, make sure the command that got the EIO was
running under strace, so we know what syscall was being made
3) once we know what syscall was being made, backtrack to the kernel
source for that syscall
4) inspect the source to see what paths generate EIO
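Steps 2 and 3 above could be partly automated by scanning saved strace output for calls that failed with EIO. This is only a sketch: the regular expression assumes the default strace line layout (optionally prefixed by a PID, as with "strace -f"), it does not handle timestamp prefixes, and the name count_errors is invented for this example.

```python
import re
from collections import Counter

# Matches strace lines such as:
#   getdents64(3, 0x534808, 32768) = -1 EIO (Input/output error)
# An optional leading "[pid N]" or bare PID is skipped.
FAILED_CALL = re.compile(
    r"^(?:\[pid\s+\d+\]\s+|\d+\s+)?(\w+)\(.*\)\s+=\s+-1\s+(E[A-Z0-9]+)\b"
)


def count_errors(strace_lines, errno_name="EIO"):
    """Count failing syscalls, keyed by syscall name, for one errno."""
    counts = Counter()
    for line in strace_lines:
        match = FAILED_CALL.match(line.strip())
        if match and match.group(2) == errno_name:
            counts[match.group(1)] += 1
    return counts
```

If the counts cluster on one or two syscalls, those are the entry points worth tracing into the kernel source first.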
Dave
>
> --
> Jeff Layton <[email protected]>
>
On Wed, 4 Jun 2008 10:00:16 -0700
"David Konerding" <[email protected]> wrote:
> >> Although we are using hard mounts, some users report that during the
> >> hammering period, some of their
> >> file operations produce "I/O Error" messages on their terminal.
> >>
> >> We checked, and the hosts are indeed using hard mounting. From our
> >> reading, I/O Errors
> >> should only ever make it back to the user if are using soft mounting.
> >>
> > hard/soft only governs what happens when there is a major timeout (i.e.
> > the server doesn't respond within a given time). If there are other
> > errors (for instance, client side memory shortage, server starts
> > refusing connections, etc), then there can be errors returned to the
> > application.
> >
>
> OK; we're already using TCP mounts, so I don't think that any new
> client->server connections
> should occur after the mount is established.
>
Unless the connection is broken for some reason and the socket has
to be reconnected.
> Second, memory is not an issue; this happens on lightly loaded clients
> with 64Gbytes RAM,
> and RAM is all cache and buffer.
>
Yeah, you'd probably get a -ENOMEM or something if memory were short. I
was just offering up that as an obvious way to get errors even if
you're hard mounting.
>
> > EIO is pretty generic, and is often what you see when a more obscure
> > error is translated into what a syscall would expect. It can happen for
> > other reasons besides an RPC timeout.
>
>
> OK, so, our best bet to debug this, is to:
> 1) reproduce the problem
> 2) when the problem occurs, make sure the command that run that got an
> EIO was running
> under strace, so we know what syscall was being made
> 3) when we know what syscall was being made, backtrack to the kernel
> source for that syscall
> 4) inspect the source to see what paths generate EIO
>
> Dave
Getting straces of the apps failing might be helpful, particularly if
it's always in the same syscalls. I have a hunch though that you'll find
yourself in the twisty maze of RPC code. In that case, knowing the
particular syscalls might not be that informative.
Looking at network captures might also be helpful. If you can correlate
the straces with what's going over the wire, then you might be able to
determine whether this error is being generated as a result of an NFS
error from the server or something else entirely.
NFS/RPC debugging might also be helpful (see rpcdebug manpage and note
that it can have significant performance impact).
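To make that correlation easier, a small probe could list a directory on the filer repeatedly and record a timestamp for every failure, so the times can be matched against a packet capture taken in parallel. The sketch below is one possible shape for such a probe, not something from this thread; probe_directory and the log format are assumptions.

```python
import errno
import os
import time


def probe_directory(path, log):
    """List `path` once and record any failure.

    On error, appends (epoch seconds, errno name, message) to `log` and
    returns None; otherwise returns the number of entries seen.
    """
    try:
        with os.scandir(path) as entries:
            return sum(1 for _ in entries)
    except OSError as exc:
        name = errno.errorcode.get(exc.errno, str(exc.errno))
        log.append((time.time(), name, str(exc)))
        return None
```

Calling this in a loop against a directory on the NFS mount while the filer is under load would yield a list of failure times that can be lined up with the capture and with any rpcdebug output.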
--
Jeff Layton <[email protected]>
On Jun 4, 2008, at 9:33 AM, David Konerding wrote:
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp
> filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting. From our
> reading, I/O Errors
> should only ever make it back to the user if are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response
> (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
One place where this can occur is if XDR encoding or decoding fails.
This is not too likely though. I would look at the RPC client's
decoding logic first: call_decode() and friends.
--
Chuck Lever
chuck[dot]lever[at]oracle[dot]com
>
> Getting straces of the apps failing might be helpful, particularly if
> it's always in the same syscalls. I have a hunch though that you'll find
> yourself in the twisty maze of RPC code. In that case, knowing the
> particular syscalls might not be that informative.
>
> Looking at network captures might also be helpful. If you can correlate
> the straces with what's going over the wire, then you might be able to
> determine whether this error is being generated as a result of a NFS
> error from the server or something else entirely.
>
One hint is that if I run ls and hit Control-C while it's trolling
through filer directories (but not local dirs), I get an I/O Error on
the command line. This may not reproduce our rm problems (since those
don't involve Control-C events), but here's the last part of the strace:
open("src/modules", O_RDONLY|O_NONBLOCK|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
fcntl(3, F_SETFD, FD_CLOEXEC) = 0
getdents64(3, 0x534808, 32768) = -1 EIO (Input/output error)
--- SIGINT (Interrupt) @ 0 (0) ---
We're trying to reproduce the problem with an uninterrupted rm under strace.
Dave
Does /var/log/messages show any errors around the same time? In addition to the network trace and rpcdebug on the client, take a look at "nfsstat -d" on the filer. Is the filer dropping the connection? Look for "dropped with EAGAIN" or "dropped from vol offline" in the output. This will help narrow down the problem.
- ricardo
> -----Original Message-----
> From: David Konerding [mailto:[email protected]]
> Sent: Wednesday, June 04, 2008 6:33 AM
> To: [email protected]
> Subject: [NFS] I/O Errors with hard mounts
>
> Hi,
>
> We have a bunch of Linux clients (SLES 10 SP1) which mount a NetApp filer.
>
> When the NetApp gets very, very busy, for example, one user is
> deleting 1Tbyte of data
> while another user is doing a 30 client throughput test, it will stop
> responding to some requests.
>
> Although we are using hard mounts, some users report that during the
> hammering period, some of their
> file operations produce "I/O Error" messages on their terminal.
>
> We checked, and the hosts are indeed using hard mounting. From our
> reading, I/O Errors
> should only ever make it back to the user if are using soft mounting.
>
> We're pretty sure the filer is not sending back an NFS_ERR response (and we're
> pretty sure that wouldn't get reported to the user as an I/O Error...)
>
> At this point, we suspect there must be a path in the NFS
> implementation that returns I/O Error to user
> space even with a hard mount.
>
> Any ideas?
>
> Dave
On Wed, Jun 4, 2008 at 3:45 PM, Ricardo Labiaga <[email protected]> wrote:
> Does /var/log/messages show any errors around the same time? In addition to the network trace and rpcdebug on the client, take a look at "nfsstat -d" on the filer. Is the filer dropping the connection? Look for "dropped with EAGAIN" or "dropped from vol offline" in the output. This will help narrow down the problem.
So, sometimes when somebody deletes a lot of data (like the problem we
just observed), the deleting host, and often other hosts, do report
'filer not responding' in the logs.
However, operations that aren't happening in the delete dir tend to
work just fine (for example, iozone could be running and doing pretty
well). Further, the most recent time this happened, the host didn't
report 'filer not responding'.
This is the only EAGAIN reference I see:
assist queue (queued, split mbufs, drop for EAGAIN) = (0, 64478612, 94340)
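A single snapshot of that counter says little; what matters is whether "drop for EAGAIN" climbs during the hammering period. As a rough sketch, a line in that layout could be parsed as below so two snapshots can be compared. The layout is inferred only from the one line quoted above, and parse_counter_line is a hypothetical helper, not part of any NetApp tooling.

```python
import re

# Layout assumed from the quoted line: "group (name, name, ...) = (n, n, ...)"
COUNTER_LINE = re.compile(
    r"^\s*(?P<group>[^()=]+?)\s*\((?P<names>[^)]*)\)\s*=\s*\((?P<values>[^)]*)\)\s*$"
)


def parse_counter_line(line):
    """Return (group, {name: value}) for one counter line, or None if it does not fit."""
    match = COUNTER_LINE.match(line)
    if match is None:
        return None
    names = [n.strip() for n in match.group("names").split(",")]
    try:
        values = [int(v) for v in match.group("values").split(",")]
    except ValueError:
        return None
    if len(names) != len(values):
        return None
    return match.group("group"), dict(zip(names, values))
```

Subtracting the "drop for EAGAIN" value of a snapshot taken before the incident from one taken after it would show how many requests were dropped in that window.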
Dave
Can you provide the entire nfsstat -d output on the filer?
(Apologies for the lack of subject line in previous reply)
- ricardo
> -----Original Message-----
> From: David Konerding [mailto:[email protected]]
> Sent: Wednesday, June 04, 2008 3:56 PM
> To: [email protected]
> Subject: Re: [NFS] I/O Errors with hard mounts
>
> On Wed, Jun 4, 2008 at 3:45 PM, Ricardo Labiaga <[email protected]> wrote:
> > Does /var/log/messages show any errors around the same time? In addition to the network trace and rpcdebug on the client, take a look at "nfsstat -d" on the filer. Is the filer dropping the connection? Look for "dropped with EAGAIN" or "dropped from vol offline" in the output. This will help narrow down the problem.
>
> So, sometimes when somebody deletes a lot of data (like the problem we
> just observed),
> the deleting host, and often other hosts, do report 'filer not
> responding' in the logs.
> However, operations that aren't happening in the delete dir, tend to
> work just fine (for example, iozone could be running and doing pretty
> well)). Further, the most recent time this happened, the host didn't
> report filer not responding.
>
> This is the only EAGAN reference I see:
>
> assist queue (queued, split mbufs, drop for EAGAIN) = (0, 64478612, 94340)
>
> Dave