2020-04-29 17:21:54

by Alberto Gonzalez Iniesta

Subject: Random IO errors on nfs clients running linux > 4.20

Hello NFS maintainers,

I'm sorry for reporting this (a little bit) late, but it took us (Miguel
in Cc:) some time to track this issue to an exact kernel update.

We're running an NFS server with 200+ Ubuntu 16.04 and 18.04 clients.
The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working for some years
now without issues.

But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04,
some of them have started experiencing failures while working on NFS
mounts. The failures are arbitrary and can sometimes take more than 20
minutes to show up (which made finding out which kernel version
introduced this a pain). We are almost sure that some directories are
more prone to this than others (maybe related to path length/characters?).

The error is also not very "verbose", from an strace:

execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
[lots of uninteresting output]
openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
getdents(3, /* 35 entries */, 32768) = 1936
[lots of lstats]
lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)

(I can send you the full output if you need it)

We can run the previous "ls -lR" 20 times and get no error, or get
this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
(ls: reading directory 'Becas y ayudas/': Input/output error) every
now and then.

The error happens (obviously?) with ls, rsync and the users' GUI tools.

There's nothing in dmesg (or elsewhere).
These are the kernels we tried:
4.18.0-25 -> Can't reproduce
4.19.0 -> Can't reproduce
4.20.17 -> Happening (hard to reproduce)
5.0.0-15 -> Happening (hard to reproduce)
5.3.0-45 -> Happening (more frequently)
5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing

We did long (as in day-long) testing trying to reproduce this with all
those kernel versions, so we are pretty sure that 4.18 and 4.19 are not
affected and that our Ubuntu 16.04 clients don't have any issue.

I know we aren't providing much info, but we are more than willing to do
all the testing required (we have already spent a lot of time on it).

Thanks for your work.

Regards,

Alberto

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid


2020-04-30 06:17:09

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> Hello NFS maintainers,
>
> I know we aren't providing much info but we are really looking forward
> to doing all the testing required (we already spent lots of time in it).

Hi,

Sorry, I was providing way too little info...
We're using NFSv4 with Kerberos; mounts are done with:
mount -t nfs4 -o sec=krb5p,exec,noauto pluto.XXXX:/publico /media/pluto
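
To confirm what was actually negotiated for that mount, something like the
following works (just a sketch, using the mount point from the command
above):

  # show the effective options for the NFS mounts on this client
  nfsstat -m
  # or read the kernel's view directly
  grep /media/pluto /proc/mounts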

Regards,

Alberto

--
Alberto Gonzalez Iniesta | Formación, consultoría y soporte técnico
mailto/sip: [email protected] | en GNU/Linux y software libre
Encrypted mail preferred | http://inittab.com

Key fingerprint = 5347 CBD8 3E30 A9EB 4D7D 4BF2 009B 3375 6B9A AA55

2020-04-30 17:33:49

by J. Bruce Fields

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> I'm sorry for reporting this (a little bit) late, but it took us (Miguel
> in Cc:) some time to track this issue to an exact kernel update.
>
> We're running a +200 clients NFS server with Ubuntu 16.04 and 18.04
> clients. The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
> nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working some years now
> without issues.
>
> But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04
> some of them started experiencing failures while working on NFS mounts.
> The failures are arbitrary and sometimes it may take more than 20 minutes
> to come out (which made finding out which kernel version introduced
> this a pain). We are almost sure that some directories are more prone to
> suffer from this than others (maybe related to path length/chars?).
>
> The error is also not very "verbose", from an strace:
>
> execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
> [lots of uninteresting output]
> openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
> ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
> getdents(3, /* 35 entries */, 32768) = 1936
> [lots of lstats)
> lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)

Ideas off the top of my head....

It'd be really useful to get a network trace--something like tcpdump -s0
-wtmp.pcap -i<interface>, then reproduce the problem, then look through
it to see if you can find the READDIR or STAT or whatever that results
in the unexpected EIO. But if it takes a while to reproduce, that may be
difficult.

Is there anything in the logs?

It might be worth turning on some more debug logging--see the "rpcdebug"
command.
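
A rough sketch of both suggestions (the interface name and file paths are
placeholders):

  # capture full packets while reproducing the EIO
  tcpdump -s0 -w /tmp/nfs_eio.pcap -i eth0 port 2049

  # turn on extra NFS/RPC client debug logging, reproduce, then turn it off
  rpcdebug -m nfs -s all
  rpcdebug -m rpc -s all
  #   ... reproduce, then check dmesg / syslog ...
  rpcdebug -m nfs -c all
  rpcdebug -m rpc -c all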

--b.

>
> (I can send you the full output if you need it)
>
> We can run the previous "ls -lR" 20 times and get no error, or get
> this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
> now and then.
>
> The error happens (obviously?) with ls, rsync and the users's GUI tools.
>
> There's nothing in dmesg (or elsewhere).
> These are the kernels with tried:
> 4.18.0-25 -> Can't reproduce
> 4.19.0 -> Can't reproduce
> 4.20.17 -> Happening (hard to reproduce)
> 5.0.0-15 -> Happening (hard to reproduce)
> 5.3.0-45 -> Happening (more frequently)
> 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
>
> We did long (as in daylong) testing trying to reproduce this with all
> those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> experience this and our Ubuntu 16.04 clients don't have any issue.
>
> I know we aren't providing much info but we are really looking forward
> to doing all the testing required (we already spent lots of time in it).
>
> Thanks for your work.
>
> Regards,
>
> Alberto
>
> --
> Alberto González Iniesta | Universidad a Distancia
> [email protected] | de Madrid

2020-09-09 09:38:59

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Thu, Apr 30, 2020 at 01:32:00PM -0400, J. Bruce Fields wrote:
> On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> > I'm sorry for reporting this (a little bit) late, but it took us (Miguel
> > in Cc:) some time to track this issue to an exact kernel update.
> >
> > We're running a +200 clients NFS server with Ubuntu 16.04 and 18.04
> > clients. The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
> > nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working some years now
> > without issues.
> >
> > But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04
> > some of them started experiencing failures while working on NFS mounts.
> > The failures are arbitrary and sometimes it may take more than 20 minutes
> > to come out (which made finding out which kernel version introduced
> > this a pain). We are almost sure that some directories are more prone to
> > suffer from this than others (maybe related to path length/chars?).
> >
> > The error is also not very "verbose", from an strace:
> >
> > execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
> > [lots of uninteresting output]
> > openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
> > fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
> > ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
> > getdents(3, /* 35 entries */, 32768) = 1936
> > [lots of lstats)
> > lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)
>
> Ideas off the top of my head....
>
> It'd be really useful to get a network trace--something like tcpdump -s0
> -wtmp.pcap -i<interface>, then reproduce the problem, then look through
> it to see if you can find the READDIR or STAT or whatever that results
> in the unexpected EIO. But if takes a while to reproduce, that may be
> difficult.
>
> Is there anything in the logs?
>
> It might be worth turning on some more debug logging--see the "rpcdebug"
> command.

Hi, Bruce et al.

I'm sorry this reply took so long, but with debugging enabled the error
was harder to reproduce.

I'm attaching 3 log files (2 with just "nfs" debugging and one with
"nfs" and "rpc" debugging modules enabled).

I'm also attaching a pcap file; I don't know whether it will be useful
since we run "sec=krb5p".

Let me know if there's anything else I can test/provide.
These tests were done with Ubuntu's 5.3.0-53-generic.

Thanks,

Alberto

> > (I can send you the full output if you need it)
> >
> > We can run the previous "ls -lR" 20 times and get no error, or get
> > this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> > (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
> > now and then.
> >
> > The error happens (obviously?) with ls, rsync and the users's GUI tools.
> >
> > There's nothing in dmesg (or elsewhere).
> > These are the kernels with tried:
> > 4.18.0-25 -> Can't reproduce
> > 4.19.0 -> Can't reproduce
> > 4.20.17 -> Happening (hard to reproduce)
> > 5.0.0-15 -> Happening (hard to reproduce)
> > 5.3.0-45 -> Happening (more frequently)
> > 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
> >
> > We did long (as in daylong) testing trying to reproduce this with all
> > those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> > experience this and our Ubuntu 16.04 clients don't have any issue.
> >
> > I know we aren't providing much info but we are really looking forward
> > to doing all the testing required (we already spent lots of time in it).
> >
> > Thanks for your work.
> >
> > Regards,
> >
> > Alberto
> >
> > --
> > Alberto González Iniesta | Universidad a Distancia
> > [email protected] | de Madrid

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid


Attachments:
err3.txt (10.00 kB)
err2.txt (8.52 kB)
err1.txt (16.64 kB)
nfs_error.pcap (6.34 kB)

2020-09-09 16:48:02

by J. Bruce Fields

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Sep 09, 2020 at 11:29:00AM +0200, Alberto Gonzalez Iniesta wrote:
> On Thu, Apr 30, 2020 at 01:32:00PM -0400, J. Bruce Fields wrote:
> > On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> > > I'm sorry for reporting this (a little bit) late, but it took us (Miguel
> > > in Cc:) some time to track this issue to an exact kernel update.
> > >
> > > We're running a +200 clients NFS server with Ubuntu 16.04 and 18.04
> > > clients. The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
> > > nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working some years now
> > > without issues.
> > >
> > > But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04
> > > some of them started experiencing failures while working on NFS mounts.
> > > The failures are arbitrary and sometimes it may take more than 20 minutes
> > > to come out (which made finding out which kernel version introduced
> > > this a pain). We are almost sure that some directories are more prone to
> > > suffer from this than others (maybe related to path length/chars?).
> > >
> > > The error is also not very "verbose", from an strace:
> > >
> > > execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
> > > [lots of uninteresting output]
> > > openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
> > > fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > > fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > > fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
> > > ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
> > > getdents(3, /* 35 entries */, 32768) = 1936
> > > [lots of lstats)
> > > lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> > > getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)
> >
> > Ideas off the top of my head....
> >
> > It'd be really useful to get a network trace--something like tcpdump -s0
> > -wtmp.pcap -i<interface>, then reproduce the problem, then look through
> > it to see if you can find the READDIR or STAT or whatever that results
> > in the unexpected EIO. But if takes a while to reproduce, that may be
> > difficult.
> >
> > Is there anything in the logs?
> >
> > It might be worth turning on some more debug logging--see the "rpcdebug"
> > command.
>
> Hi, Bruce et at.
>
> I'm sorry this reply took so long, but with debugging enabled the error
> was harder to reproduce.
>
> I'm attaching 3 log files (2 with just "nfs" debugging and one with
> "nfs" and "rpc" debugging modules enabled).
>
> I'm also attaching a pcap file, don't know if it would be useful since
> we run "sec=krb5p".

Yeah, not too useful.

> Let me know if there's anything else I can test/provide.
> These tests were done with Ubuntu's 5.3.0-53-generic.
>
> Thanks,
>
> Alberto
>
> > > (I can send you the full output if you need it)
> > >
> > > We can run the previous "ls -lR" 20 times and get no error, or get
> > > this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> > > (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
> > > now and then.
> > >
> > > The error happens (obviously?) with ls, rsync and the users's GUI tools.
> > >
> > > There's nothing in dmesg (or elsewhere).
> > > These are the kernels with tried:
> > > 4.18.0-25 -> Can't reproduce
> > > 4.19.0 -> Can't reproduce
> > > 4.20.17 -> Happening (hard to reproduce)
> > > 5.0.0-15 -> Happening (hard to reproduce)
> > > 5.3.0-45 -> Happening (more frequently)
> > > 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
> > >
> > > We did long (as in daylong) testing trying to reproduce this with all
> > > those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> > > experience this and our Ubuntu 16.04 clients don't have any issue.
> > >
> > > I know we aren't providing much info but we are really looking forward
> > > to doing all the testing required (we already spent lots of time in it).
> > >
> > > Thanks for your work.

So all I notice from this one is the readdir EIO came from call_decode.
I suspect that means it failed in the xdr decoding. Looks like xdr
decoding of the actual directory data (which is the complicated part) is
done later, so this means it failed decoding the header or verifier,
which is a little odd:

> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5

--b.

2020-10-26 14:19:52

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Sep 09, 2020 at 09:47:27AM -0400, J. Bruce Fields wrote:
> On Wed, Sep 09, 2020 at 11:29:00AM +0200, Alberto Gonzalez Iniesta wrote:
> > On Thu, Apr 30, 2020 at 01:32:00PM -0400, J. Bruce Fields wrote:
> > > On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> > > > We can run the previous "ls -lR" 20 times and get no error, or get
> > > > this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> > > > (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
> > > > now and then.
> > > >
> > > > The error happens (obviously?) with ls, rsync and the users's GUI tools.
> > > >
> > > > There's nothing in dmesg (or elsewhere).
> > > > These are the kernels with tried:
> > > > 4.18.0-25 -> Can't reproduce
> > > > 4.19.0 -> Can't reproduce
> > > > 4.20.17 -> Happening (hard to reproduce)
> > > > 5.0.0-15 -> Happening (hard to reproduce)
> > > > 5.3.0-45 -> Happening (more frequently)
> > > > 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
> > > >
> > > > We did long (as in daylong) testing trying to reproduce this with all
> > > > those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> > > > experience this and our Ubuntu 16.04 clients don't have any issue.
> > > >
> > > > I know we aren't providing much info but we are really looking forward
> > > > to doing all the testing required (we already spent lots of time in it).
> > > >
> > > > Thanks for your work.
>
> So all I notice from this one is the readdir EIO came from call_decode.
> I suspect that means it failed in the xdr decoding. Looks like xdr
> decoding of the actual directory data (which is the complicated part) is
> done later, so this means it failed decoding the header or verifier,
> which is a little odd:
>
> > Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> > Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> > Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> > Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> > Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> > Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> > Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> > Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> > Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> > Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> > Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> > Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> > Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5

Hi, Bruce et al.

Is there anything we can do to help debug/fix this? It's still
biting our users running 4.20+ kernels.

Thanks,

Alberto

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid

2020-10-26 14:20:45

by Chuck Lever III

Subject: Re: Random IO errors on nfs clients running linux > 4.20



> On Oct 26, 2020, at 9:42 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
>
> On Wed, Sep 09, 2020 at 09:47:27AM -0400, J. Bruce Fields wrote:
>> On Wed, Sep 09, 2020 at 11:29:00AM +0200, Alberto Gonzalez Iniesta wrote:
>>> On Thu, Apr 30, 2020 at 01:32:00PM -0400, J. Bruce Fields wrote:
>>>> On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
>>>>> We can run the previous "ls -lR" 20 times and get no error, or get
>>>>> this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
>>>>> (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
>>>>> now and then.
>>>>>
>>>>> The error happens (obviously?) with ls, rsync and the users's GUI tools.
>>>>>
>>>>> There's nothing in dmesg (or elsewhere).
>>>>> These are the kernels with tried:
>>>>> 4.18.0-25 -> Can't reproduce
>>>>> 4.19.0 -> Can't reproduce
>>>>> 4.20.17 -> Happening (hard to reproduce)
>>>>> 5.0.0-15 -> Happening (hard to reproduce)
>>>>> 5.3.0-45 -> Happening (more frequently)
>>>>> 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
>>>>>
>>>>> We did long (as in daylong) testing trying to reproduce this with all
>>>>> those kernel versions, so we are pretty sure 4.18 and 4.19 don't
>>>>> experience this and our Ubuntu 16.04 clients don't have any issue.
>>>>>
>>>>> I know we aren't providing much info but we are really looking forward
>>>>> to doing all the testing required (we already spent lots of time in it).
>>>>>
>>>>> Thanks for your work.
>>
>> So all I notice from this one is the readdir EIO came from call_decode.
>> I suspect that means it failed in the xdr decoding. Looks like xdr
>> decoding of the actual directory data (which is the complicated part) is
>> done later, so this means it failed decoding the header or verifier,
>> which is a little odd:
>>
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>
> Hi, Bruce et al.
>
> Is there anything we can do to help debugging/fixing this? It's still
> biting our users with a +4.20.x kernel.

Alberto, can you share a snippet of a raw network capture that shows
the READDIR Reply that fails to decode?


--
Chuck Lever



2020-10-26 14:25:43

by J. Bruce Fields

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 02:42:16PM +0100, Alberto Gonzalez Iniesta wrote:
> On Wed, Sep 09, 2020 at 09:47:27AM -0400, J. Bruce Fields wrote:
> > On Wed, Sep 09, 2020 at 11:29:00AM +0200, Alberto Gonzalez Iniesta wrote:
> > > On Thu, Apr 30, 2020 at 01:32:00PM -0400, J. Bruce Fields wrote:
> > > > On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> > > > > We can run the previous "ls -lR" 20 times and get no error, or get
> > > > > this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> > > > > (ls: reading directorio 'Becas y ayudas/': Input/Output Error") every
> > > > > now and then.
> > > > >
> > > > > The error happens (obviously?) with ls, rsync and the users's GUI tools.
> > > > >
> > > > > There's nothing in dmesg (or elsewhere).
> > > > > These are the kernels with tried:
> > > > > 4.18.0-25 -> Can't reproduce
> > > > > 4.19.0 -> Can't reproduce
> > > > > 4.20.17 -> Happening (hard to reproduce)
> > > > > 5.0.0-15 -> Happening (hard to reproduce)
> > > > > 5.3.0-45 -> Happening (more frequently)
> > > > > 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
> > > > >
> > > > > We did long (as in daylong) testing trying to reproduce this with all
> > > > > those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> > > > > experience this and our Ubuntu 16.04 clients don't have any issue.
> > > > >
> > > > > I know we aren't providing much info but we are really looking forward
> > > > > to doing all the testing required (we already spent lots of time in it).
> > > > >
> > > > > Thanks for your work.
> >
> > So all I notice from this one is the readdir EIO came from call_decode.
> > I suspect that means it failed in the xdr decoding. Looks like xdr
> > decoding of the actual directory data (which is the complicated part) is
> > done later, so this means it failed decoding the header or verifier,
> > which is a little odd:
> >
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> > > Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>
> Hi, Bruce et al.
>
> Is there anything we can do to help debugging/fixing this? It's still
> biting our users with a +4.20.x kernel.

Sorry, I just don't know what this is off the top of my head. If I had
the time, stuff I might try:

- look at the wire traffic with wireshark: try to figure out
  which operation this is happening on, and if there's anything
  unusual about the reply. May be difficult if a lot of traffic
  is required to reproduce.
- if you're using krb5, try without just to see if that makes a
  difference.
- look at history: gitk v4.19..v4.20.17 fs/nfs net/sunrpc, see
  if anything looks relevant; or even just try a git bisect
  (difficult again given the intermittent failure).
- trace through the code to work out where call_decode might be
  returning -EIO, try to narrow it down with tracepoints or even
  just debugging printk's.
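
Rough command sketches for the above (paths, interface and directory are
placeholders; the trace-cmd event globs simply enable whatever NFS/SUNRPC
tracepoints the running kernel provides):

  # wire traffic while reproducing (decrypt later if krb5p)
  tcpdump -s0 -w /tmp/readdir_eio.pcap -i eth0 port 2049

  # browse or bisect the suspect history
  gitk v4.19..v4.20.17 -- fs/nfs net/sunrpc
  git bisect start v4.20.17 v4.19 -- fs/nfs net/sunrpc

  # tracepoints around a failing readdir (exact event names vary by kernel)
  trace-cmd record -e 'nfs4:*' -e 'sunrpc:*' ls -lR '/media/pluto/Becas y ayudas'
  trace-cmd report | grep -i -e readdir -e 'status=-5'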

--b.

2020-10-26 14:44:59

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
> >> So all I notice from this one is the readdir EIO came from call_decode.
> >> I suspect that means it failed in the xdr decoding. Looks like xdr
> >> decoding of the actual directory data (which is the complicated part) is
> >> done later, so this means it failed decoding the header or verifier,
> >> which is a little odd:
> >>
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> >>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
> >
> > Hi, Bruce et al.
> >
> > Is there anything we can do to help debugging/fixing this? It's still
> > biting our users with a +4.20.x kernel.
>
> Alberto, can you share a snippet of a raw network capture that shows
> the READDIR Reply that fails to decode?

Hi, Chuck.

Thanks for your reply. We're using "sec=krb5p", which makes the network
capture useless :-(
I'm afraid it's quite difficult to disable that, since we have 200+ NFS
clients in production.

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid

2020-10-26 14:49:04

by Chuck Lever III

Subject: Re: Random IO errors on nfs clients running linux > 4.20



> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
>
> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
>>>> So all I notice from this one is the readdir EIO came from call_decode.
>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
>>>> decoding of the actual directory data (which is the complicated part) is
>>>> done later, so this means it failed decoding the header or verifier,
>>>> which is a little odd:
>>>>
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>>>
>>> Hi, Bruce et al.
>>>
>>> Is there anything we can do to help debugging/fixing this? It's still
>>> biting our users with a +4.20.x kernel.
>>
>> Alberto, can you share a snippet of a raw network capture that shows
>> the READDIR Reply that fails to decode?
>
> Hi, Chuck.
>
> Thanks for your reply. We're using "sec=krb5p", which makes the network
> capture useless :-(

You can plug keytabs into Wireshark to enable it to decrypt the traffic.
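
For example (a sketch; preference names can vary slightly between Wireshark
versions, and the keytab path is a placeholder that must hold the server's
nfs/ service key):

  tshark -r nfs_error.pcap \
         -o "kerberos.decrypt:TRUE" \
         -o "kerberos.file:/etc/krb5.keytab" \
         -Y nfs -V | less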


--
Chuck Lever



2020-10-26 14:55:15

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 10:14:52AM -0400, J. Bruce Fields wrote:
> Sorry, I just don't know what this is off the top of my head. If I had
> the time, stuff I might try:
>
> - look at the wire traffic with wireshark: try to figure out
> which operation this is happening on, and if there's anything
> unusual about the reply. May be difficult if a lot of traffic
> is required to reproduce.
> - if you're using krb5, try without just to see if that makes a
> difference.

Using "sec=krb5p" here... And with lots of active clients. :-/

Oh! Could I add another entry in /etc/exports with krb5 without affecting
current krb5p clients?
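
For what it's worth, one common way to do that is to list both flavors on
the existing export instead of adding a second entry; existing krb5p mounts
keep working and a test client can mount with sec=krb5. A sketch (the host
pattern and other options are placeholders):

  # /etc/exports
  /publico  *.example.org(rw,sync,sec=krb5p:krb5,no_subtree_check)

  # on a test client
  mount -t nfs4 -o sec=krb5 pluto.XXXX:/publico /mnt/test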

> - look at history: gitk v4.19..v4.20.17 fs/nfs net/sunrpc, see
> if anything looks relevant; or even just try a git bisect
> (difficult again given the intermittent failure).
> - trace through the code to work out where call_decode might be
> returning -EIO, try to narrow it down with tracepoints or even
> just debugging printk's.

I'm afraid I don't have the skills to carry out those tasks; I haven't
coded in C for decades, not to mention kernel/NFS hacking.
I guess we'll have to wait until someone else hits this too, or until we
update the server and it (hopefully/miraculously) fixes itself :-)

Thanks a lot,

Alberto

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid

2020-10-26 14:58:30

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>
>
> > On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
> >
> > On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
> >>>> So all I notice from this one is the readdir EIO came from call_decode.
> >>>> I suspect that means it failed in the xdr decoding. Looks like xdr
> >>>> decoding of the actual directory data (which is the complicated part) is
> >>>> done later, so this means it failed decoding the header or verifier,
> >>>> which is a little odd:
> >>>>
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
> >>>
> >>> Hi, Bruce et al.
> >>>
> >>> Is there anything we can do to help debugging/fixing this? It's still
> >>> biting our users with a +4.20.x kernel.
> >>
> >> Alberto, can you share a snippet of a raw network capture that shows
> >> the READDIR Reply that fails to decode?
> >
> > Hi, Chuck.
> >
> > Thanks for your reply. We're using "sec=krb5p", which makes the network
> > capture useless :-(
>
> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
>

Cool, I'll look into that then.

Thanks!

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid

2020-10-26 15:04:58

by J. Bruce Fields

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>
>
> > On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
> >
> > On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
> >>>> So all I notice from this one is the readdir EIO came from call_decode.
> >>>> I suspect that means it failed in the xdr decoding. Looks like xdr
> >>>> decoding of the actual directory data (which is the complicated part) is
> >>>> done later, so this means it failed decoding the header or verifier,
> >>>> which is a little odd:
> >>>>
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> >>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
> >>>
> >>> Hi, Bruce et al.
> >>>
> >>> Is there anything we can do to help debugging/fixing this? It's still
> >>> biting our users with a +4.20.x kernel.
> >>
> >> Alberto, can you share a snippet of a raw network capture that shows
> >> the READDIR Reply that fails to decode?
> >
> > Hi, Chuck.
> >
> > Thanks for your reply. We're using "sec=krb5p", which makes the network
> > capture useless :-(
>
> You can plug keytabs into Wireshark to enable it to decrypt the traffic.

Just skimming that range of history, there are some changes to the
handling of gss sequence numbers; I wonder if there's a chance he could
be hitting that? You had a workload that would lead to calls dropping
out of the sequence number window, didn't you, Chuck? Is there a quick
way to check whether that's happening?

--b.

2020-10-26 15:09:48

by Chuck Lever III

Subject: Re: Random IO errors on nfs clients running linux > 4.20



> On Oct 26, 2020, at 11:02 AM, Bruce Fields <[email protected]> wrote:
>
> On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>>
>>
>>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
>>>>>> So all I notice from this one is the readdir EIO came from call_decode.
>>>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
>>>>>> decoding of the actual directory data (which is the complicated part) is
>>>>>> done later, so this means it failed decoding the header or verifier,
>>>>>> which is a little odd:
>>>>>>
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>>>>>
>>>>> Hi, Bruce et al.
>>>>>
>>>>> Is there anything we can do to help debugging/fixing this? It's still
>>>>> biting our users with a +4.20.x kernel.
>>>>
>>>> Alberto, can you share a snippet of a raw network capture that shows
>>>> the READDIR Reply that fails to decode?
>>>
>>> Hi, Chuck.
>>>
>>> Thanks for your reply. We're using "sec=krb5p", which makes the network
>>> capture useless :-(
>>
>> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
>
> Just skimming that range of history, there's some changes to the
> handling of gss sequence numbers, I wonder if there's a chance he could
> be hitting that? You had a workload that would lead to calls dropping
> out of the sequence number window, didn't you, Chuck? Is there a quick
> way to check whether that's happening?

The server is supposed to drop the connection when that happens, though
I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
itself wouldn't result in EIO.


--
Chuck Lever



2020-10-26 18:16:35

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Mon, Oct 26, 2020 at 11:06:24AM -0400, Chuck Lever wrote:
>
>
> > On Oct 26, 2020, at 11:02 AM, Bruce Fields <[email protected]> wrote:
> >
> > On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
> >>
> >>
> >>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
> >>>
> >>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
> >>>>>> So all I notice from this one is the readdir EIO came from call_decode.
> >>>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
> >>>>>> decoding of the actual directory data (which is the complicated part) is
> >>>>>> done later, so this means it failed decoding the header or verifier,
> >>>>>> which is a little odd:
> >>>>>>
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
> >>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
> >>>>>
> >>>>> Hi, Bruce et al.
> >>>>>
> >>>>> Is there anything we can do to help debugging/fixing this? It's still
> >>>>> biting our users with a +4.20.x kernel.
> >>>>
> >>>> Alberto, can you share a snippet of a raw network capture that shows
> >>>> the READDIR Reply that fails to decode?
> >>>
> >>> Hi, Chuck.
> >>>
> >>> Thanks for your reply. We're using "sec=krb5p", which makes the network
> >>> capture useless :-(
> >>
> >> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
> >
> > Just skimming that range of history, there's some changes to the
> > handling of gss sequence numbers, I wonder if there's a chance he could
> > be hitting that? You had a workload that would lead to calls dropping
> > out of the sequence number window, didn't you, Chuck? Is there a quick
> > way to check whether that's happening?
>
> The server is supposed to drop the connection when that happens, though
> I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
> itself wouldn't result in EIO.

In case this is relevant, the server is running 3.16.0, and the clients
(with issues) are running 4.20+.

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid

2020-10-26 20:11:20

by Chuck Lever III

Subject: Re: Random IO errors on nfs clients running linux > 4.20



> On Oct 26, 2020, at 11:26 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
>
> On Mon, Oct 26, 2020 at 11:06:24AM -0400, Chuck Lever wrote:
>>
>>
>>> On Oct 26, 2020, at 11:02 AM, Bruce Fields <[email protected]> wrote:
>>>
>>> On Mon, Oct 26, 2020 at 10:46:05AM -0400, Chuck Lever wrote:
>>>>
>>>>
>>>>> On Oct 26, 2020, at 10:43 AM, Alberto Gonzalez Iniesta <[email protected]> wrote:
>>>>>
>>>>> On Mon, Oct 26, 2020 at 09:58:17AM -0400, Chuck Lever wrote:
>>>>>>>> So all I notice from this one is the readdir EIO came from call_decode.
>>>>>>>> I suspect that means it failed in the xdr decoding. Looks like xdr
>>>>>>>> decoding of the actual directory data (which is the complicated part) is
>>>>>>>> done later, so this means it failed decoding the header or verifier,
>>>>>>>> which is a little odd:
>>>>>>>>
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016276] RPC: 3284 call_decode result -5
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016281] nfs41_sequence_process: Error 1 free the slot
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016286] RPC: wake_up_first(00000000d3f50f4d "ForeChannel Slot table")
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016288] nfs4_free_slot: slotid 0 highest_used_slotid 4294967295
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016290] RPC: 3284 return 0, status -5
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016291] RPC: 3284 release task
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016295] RPC: freeing buffer of size 4144 at 00000000a3649daf
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016298] RPC: 3284 release request 0000000079df89b2
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016300] RPC: wake_up_first(00000000c5ee49ee "xprt_backlog")
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016302] RPC: rpc_release_client(00000000b930c343)
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016304] RPC: 3284 freeing task
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016309] _nfs4_proc_readdir: returns -5
>>>>>>>>> Sep 8 16:03:23 portatil264 kernel: [15033.016318] NFS: readdir(departamentos/innovacion) returns -5
>>>>>>>
>>>>>>> Hi, Bruce et al.
>>>>>>>
>>>>>>> Is there anything we can do to help debugging/fixing this? It's still
>>>>>>> biting our users with a +4.20.x kernel.
>>>>>>
>>>>>> Alberto, can you share a snippet of a raw network capture that shows
>>>>>> the READDIR Reply that fails to decode?
>>>>>
>>>>> Hi, Chuck.
>>>>>
>>>>> Thanks for your reply. We're using "sec=krb5p", which makes the network
>>>>> capture useless :-(
>>>>
>>>> You can plug keytabs into Wireshark to enable it to decrypt the traffic.
>>>
>>> Just skimming that range of history, there's some changes to the
>>> handling of gss sequence numbers, I wonder if there's a chance he could
>>> be hitting that? You had a workload that would lead to calls dropping
>>> out of the sequence number window, didn't you, Chuck? Is there a quick
>>> way to check whether that's happening?
>>
>> The server is supposed to drop the connection when that happens, though
>> I'm not sure 4.20's NFSD was perfect in that regard. Connection loss in
>> itself wouldn't result in EIO.
>
> In case this is relevant, server is running 3.16.0. Clients (with
> issues) +4.20.

Ah, I see. Well that's an old kernel. Have you engaged your distributor?
They might be able to provide builds with debugging instrumentation, for
example, if we can give them some instructions or a patch.

My experience tells me that this is probably an issue with either the
server's GSS wrap function, or the client's GSS unwrap function, if
you don't ever see this failure without krb5p in the picture.

If you don't see this problem on older clients, then I would start
looking at the client's GSS unwrap function, fwiw.


--
Chuck Lever