2017-09-23 22:06:34

by Armin Größlinger

[permalink] [raw]
Subject: NFS client hangs with NFSv4 and Kerberos

Hi,

we are experiencing issues with Linux as NFS client with NFSv4 and Kerberos.
Occasionally, processes get stuck in NFS operations and the kernel reports
a blocked task and claims that the NFS server is not responding. When this
happens, the hanging process and the kernel never recover from the situation.

After some experimenting, I could create a small test program which seems
to trigger this behavior; it can be found here:
https://www.infosun.fim.uni-passau.de/cl/staff/groesslinger/writesync.c

The test program only performs lseek(), write() and fsync() on a file
(given as command line argument) in an endless loop. For the hang to occur,
it seems to be important not to write linearly to the file (therefore the
lseek()s to write with different strides). Usually after a few thousand
iterations, the process hangs uninterruptibly and the kernel logs something
like

[ 1272.779598] nfs: server nexstaff not responding, still trying
[ 1330.122586] INFO: task writesync:1443 blocked for more than 120 seconds.
[ 1330.122905] Not tainted 4.14.0-rc1 #1
[ 1330.123141] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
[ 1330.123407] writesync D 0 1443 1442 0x00000000
[ 1330.123413] Call Trace:
[ 1330.123436] __schedule+0x3c2/0x8b0
[ 1330.123441] schedule+0x36/0x80
[ 1330.123452] io_schedule+0x16/0x40
[ 1330.123468] wait_on_page_bit_common+0xf8/0x190
[ 1330.123473] ? page_cache_tree_insert+0xb0/0xb0
[ 1330.123478] __filemap_fdatawait_range+0x123/0x190
[ 1330.123484] filemap_write_and_wait_range+0x4b/0x90
[ 1330.123515] nfs_file_fsync+0x3b/0x1f0 [nfs]
[ 1330.123536] vfs_fsync_range+0x4b/0xb0
[ 1330.123541] do_fsync+0x3d/0x70
[ 1330.123546] SyS_fsync+0x10/0x20
[ 1330.123552] entry_SYSCALL_64_fastpath+0x1e/0xa9
[ 1330.123556] RIP: 0033:0x7fa8e61ff730
[ 1330.123558] RSP: 002b:00007ffe159f2a28 EFLAGS: 00000246 ORIG_RAX: 000000000000004a
[ 1330.123562] RAX: ffffffffffffffda RBX: 0000000000000005 RCX: 00007fa8e61ff730
[ 1330.123564] RDX: 0000000000000001 RSI: 00000000004008f0 RDI: 0000000000000003
[ 1330.123565] RBP: 00007ffe159f03a0 R08: 00007fa8e64c8770 R09: 00007fa8e66cb700
[ 1330.123567] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000005
[ 1330.123568] R13: 0000000000000001 R14: 00007fa8e64c7540 R15: 0000000000000000

I have performed a few experiments to see under which configurations
these hangs occurs:

- Kernels 3.16.48, 4.9.51, 4.13.2, 4.14-rc1 all show this behavior.
(I'm not aware of a kernel version which does not show this behavior.)

- The hangs seem to occur only when NFSv4 is used together with Kerberos
(sec=krb5, krb5i or krb5p all show the hangs), but I do not observe
these hangs with sec=sys.

- nfs_utils versions on the clients used were 1.2.8 (Ubuntu 16.04) and
1.3.4 (Debian 9).

- It does not occur with NFSv3 or NFSv4.1 or NFSv4.2 (with or without Kerberos).

- I can reproduce the hangs with Nexenta and Debian 9 (Kernel 4.9) as NFS server.

- I could not reproduce the hangs with FreeBSD 11.1 as client.

Since our department's productive server (running Nexenta) does not offer
NFSv4.1/4.2, using one of these protocol versions, which do not seem to be
affected, is unfortunately not an option. Our users are affected by the kernel
hangs every few days. We first observed the hangs with programs using SQLite,
e.g., Firefox often triggers the hangs when writing its databases.
(I constructed the test program by looking at what operations SQLite
does on the databases when the hangs occur.)

I'll be happy to provide further information (kernel dumps or similar);
please instruct me how to gather the information needed. I'd also be
happy to try kernel patches if somebody can suggest a fix for the problem.

Best regards,
Armin Groesslinger