2020-04-29 17:21:54

by Alberto Gonzalez Iniesta

Subject: Random IO errors on nfs clients running linux > 4.20

Hello NFS maintainers,

I'm sorry for reporting this (a little bit) late, but it took us (Miguel
in Cc:) some time to track this issue to an exact kernel update.

We're running an NFS server with 200+ clients, a mix of Ubuntu 16.04 and
18.04. The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working for some years now
without issues.

But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04
some of them started experiencing failures while working on NFS mounts.
The failures appear at random, and sometimes it takes more than 20 minutes
for one to show up (which made finding out which kernel version introduced
this a pain). We are almost sure that some directories are more prone to
suffer from this than others (maybe related to path length/chars?).

The error is also not very "verbose", from an strace:

execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
[lots of uninteresting output]
openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
getdents(3, /* 35 entries */, 32768) = 1936
[lots of lstats]
lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)

(I can send you the full output if you need it)

We can run the previous "ls -lR" 20 times and get no error, or get
this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
("ls: reading directory 'Becas y ayudas/': Input/output error") every
now and then.

The error happens (obviously?) with ls, rsync and the users' GUI tools.

There's nothing in dmesg (or elsewhere).
These are the kernels we tried:
4.18.0-25 -> Can't reproduce
4.19.0 -> Can't reproduce
4.20.17 -> Happening (hard to reproduce)
5.0.0-15 -> Happening (hard to reproduce)
5.3.0-45 -> Happening (more frequently)
5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing

We did long (as in daylong) testing trying to reproduce this with all
those kernel versions, so we are pretty sure 4.18 and 4.19 don't
experience this and our Ubuntu 16.04 clients don't have any issue.

I know we aren't providing much info but we are really looking forward
to doing all the testing required (we already spent lots of time in it).

Thanks for your work.

Regards,

Alberto

--
Alberto González Iniesta | Universidad a Distancia
[email protected] | de Madrid


2020-04-30 06:17:09

by Alberto Gonzalez Iniesta

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> Hello NFS maintainers,
>
> I know we aren't providing much info but we are really looking forward
> to doing all the testing required (we already spent lots of time in it).

Hi,

Sorry, I was providing way too little info...
We're using NFSv4 with kerberos, mounts are done with:
mount -t nfs4 -o sec=krb5p,exec,noauto pluto.XXXX:/publico /media/pluto

Regards,

Alberto

--
Alberto Gonzalez Iniesta | Formación, consultoría y soporte técnico
mailto/sip: [email protected] | en GNU/Linux y software libre
Encrypted mail preferred | http://inittab.com

Key fingerprint = 5347 CBD8 3E30 A9EB 4D7D 4BF2 009B 3375 6B9A AA55

2020-04-30 17:33:49

by J. Bruce Fields

Subject: Re: Random IO errors on nfs clients running linux > 4.20

On Wed, Apr 29, 2020 at 07:15:27PM +0200, Alberto Gonzalez Iniesta wrote:
> I'm sorry for reporting this (a little bit) late, but it took us (Miguel
> in Cc:) some time to track this issue to an exact kernel update.
>
> We're running an NFS server with 200+ clients, a mix of Ubuntu 16.04 and
> 18.04. The server runs Debian 8.11 (jessie) with Linux 3.16.0 and
> nfs-kernel-server 1:1.2.8-9+deb8u1. It has been working for some years now
> without issues.
>
> But since we started moving clients from Ubuntu 16.04 to Ubuntu 18.04
> some of them started experiencing failures while working on NFS mounts.
> The failures appear at random, and sometimes it takes more than 20 minutes
> for one to show up (which made finding out which kernel version introduced
> this a pain). We are almost sure that some directories are more prone to
> suffer from this than others (maybe related to path length/chars?).
>
> The error is also not very "verbose", from an strace:
>
> execve("/bin/ls", ["ls", "-lR", "Becas y ayudas/"], 0x7ffccb7f5b20 /* 16 vars */) = 0
> [lots of uninteresting output]
> openat(AT_FDCWD, "Becas y ayudas/", O_RDONLY|O_NONBLOCK|O_CLOEXEC|O_DIRECTORY) = 3
> fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> fstat(3, {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> fstat(1, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 3), ...}) = 0
> ioctl(1, TCGETS, 0x7ffd8b725c80) = -1 ENOTTY (Inappropriate ioctl for device)
> getdents(3, /* 35 entries */, 32768) = 1936
> [lots of lstats]
> lstat("Becas y ayudas/Convocatorias", {st_mode=S_IFDIR|0775, st_size=4096, ...}) = 0
> getdents(3, 0x561af78de890, 32768) = -1 EIO (Input/output error)

Ideas off the top of my head....

It'd be really useful to get a network trace--something like tcpdump -s0
-wtmp.pcap -i<interface>, then reproduce the problem, then look through
it to see if you can find the READDIR or STAT or whatever that results
in the unexpected EIO. But if it takes a while to reproduce, that may be
difficult.
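
Since the failure can take a long time to show up, a small loop that
re-reads the directory until the first EIO, and prints a timestamp when it
happens, might make it easier to find the matching packets in the capture.
A minimal sketch in Python, roughly imitating what "ls -lR" does (list each
directory, lstat each entry). The function names and parameters are made up
for illustration, and the directory to walk is whatever path shows the
problem on your client:

```python
"""Re-read a directory tree until the first EIO, then report when it happened."""
import errno
import os
import time
from datetime import datetime


def walk_once(top):
    """List every directory under `top` and lstat each entry, like `ls -lR`.

    Returns the number of entries seen. Any OSError (including EIO) propagates.
    """
    count = 0
    pending = [top]
    while pending:
        current = pending.pop()
        with os.scandir(current) as entries:
            for entry in entries:
                entry.stat(follow_symlinks=False)  # lstat, as ls -l does
                count += 1
                if entry.is_dir(follow_symlinks=False):
                    pending.append(entry.path)
    return count


def run_until_eio(top, max_rounds=None, delay=1.0):
    """Call walk_once() repeatedly.

    Returns the round number that hit EIO, or None if max_rounds passed
    without one. Errors other than EIO are re-raised.
    """
    round_no = 0
    while max_rounds is None or round_no < max_rounds:
        round_no += 1
        try:
            walk_once(top)
        except OSError as exc:
            if exc.errno != errno.EIO:
                raise
            stamp = datetime.now().isoformat()
            print(f"{stamp} EIO in round {round_no}: {exc.filename}")
            return round_no
        time.sleep(delay)
    return None
```

Start the tcpdump first, then call run_until_eio() on the affected
directory and stop the capture once it returns. The printed timestamp
should narrow down which part of the trace to look at.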

Is there anything in the logs?

It might be worth turning on some more debug logging--see the "rpcdebug"
command.
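
For reference, rpcdebug takes a module with -m and sets or clears debug
flags with -s and -c; the resulting messages go to the kernel log. A rough
sketch of wrapping a reproduction attempt so the flags get cleared again
afterwards. It needs root, and the choice of the "nfs" and "rpc" modules
with the "all" flag is only a guess at a starting point; the helper names
are hypothetical:

```python
import subprocess
from contextlib import contextmanager


def rpcdebug_command(module, enable, flags=("all",)):
    """Build the rpcdebug argv that sets (-s) or clears (-c) flags for a module."""
    return ["rpcdebug", "-m", module, "-s" if enable else "-c", *flags]


@contextmanager
def rpc_debugging(modules=("nfs", "rpc"), run=subprocess.run):
    """Enable debug flags on entry and clear them on exit (needs root).

    `run` is injectable so the command sequence can be checked without
    actually invoking rpcdebug.
    """
    enabled = []
    try:
        for module in modules:
            run(rpcdebug_command(module, True), check=True)
            enabled.append(module)
        yield
    finally:
        # Clear only what was successfully enabled.
        for module in enabled:
            run(rpcdebug_command(module, False), check=True)
```

Running the reproduction inside "with rpc_debugging():" keeps the extra
logging limited to the time window of interest, which matters because the
output can be very noisy.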

--b.

>
> (I can send you the full output if you need it)
>
> We can run the previous "ls -lR" 20 times and get no error, or get
> this "ls: leyendo el directorio 'Becas y ayudas/': Error de entrada/salida"
> ("ls: reading directory 'Becas y ayudas/': Input/output error") every
> now and then.
>
> The error happens (obviously?) with ls, rsync and the users' GUI tools.
>
> There's nothing in dmesg (or elsewhere).
> These are the kernels we tried:
> 4.18.0-25 -> Can't reproduce
> 4.19.0 -> Can't reproduce
> 4.20.17 -> Happening (hard to reproduce)
> 5.0.0-15 -> Happening (hard to reproduce)
> 5.3.0-45 -> Happening (more frequently)
> 5.6.0-rc7 -> Reproduced a couple of times after boot, then nothing
>
> We did long (as in daylong) testing trying to reproduce this with all
> those kernel versions, so we are pretty sure 4.18 and 4.19 don't
> experience this and our Ubuntu 16.04 clients don't have any issue.
>
> I know we aren't providing much info but we are really looking forward
> to doing all the testing required (we already spent lots of time in it).
>
> Thanks for your work.
>
> Regards,
>
> Alberto
>
> --
> Alberto González Iniesta | Universidad a Distancia
> [email protected] | de Madrid