2017-05-07 22:45:28

by Paul Menzel

Subject: Locking problems with Linux 4.9 with NFSD and `fs/iomap.c`

Dear Linux folks,


There seems to be a regression in Linux 4.9 compared to 4.4. Maybe you
have an idea.

The system used 4.4.38 without issues, and was updated to 4.9.23 on
April 24th. Since Friday, the NFS exports were not accessible anymore.
Rebooting the system into 4.9.24 and 4.9.25 didn’t change anything, and
the system went into the same lock right away. Booting 4.4.38 fixed the
issue though.

Here is more information.

The NFS server doesn’t respond to a NULL call.

```
# tshark -r x.x
  1 0.000000 141.14.16.23 -> 141.14.26.2 TCP 74 1005 → 2049 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=327954228 TSecr=0 WS=512
  2 0.000245 141.14.26.2 -> 141.14.16.23 TCP 74 2049 → 1005 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1126765132 TSecr=327954228 WS=512
  3 0.000269 141.14.16.23 -> 141.14.26.2 TCP 66 1005 → 2049 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=327954229 TSecr=1126765132
  4 0.000300 141.14.16.23 -> 141.14.26.2 NFS 110 V4 NULL Call
  5 0.202652 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327954432 TSecr=1126765132
  6 0.410651 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327954640 TSecr=1126765132
  7 0.818649 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327955048 TSecr=1126765132
  8 1.012429 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious Retransmission] 2049 → 1005 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1126766144 TSecr=327955048 WS=512
  9 1.012448 141.14.16.23 -> 141.14.26.2 TCP 66 [TCP Dup ACK 3#1] 1005 → 2049 [ACK] Seq=45 Ack=1 Win=29696 Len=0 TSval=327955241 TSecr=1126765132
 10 1.674650 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327955904 TSecr=1126765132
 11 3.060502 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious Retransmission] 2049 → 1005 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1126768193 TSecr=327955904 WS=512
 12 3.060521 141.14.16.23 -> 141.14.26.2 TCP 66 [TCP Dup ACK 3#2] 1005 → 2049 [ACK] Seq=45 Ack=1 Win=29696 Len=0 TSval=327957289 TSecr=1126765132
 13 3.338653 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327957568 TSecr=1126765132
```

```
# tshark -r x.x
  1 0.000000 141.14.16.23 -> 141.14.26.2 TCP 74 690 → 2049 [SYN] Seq=0 Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=328229577 TSecr=0 WS=512
  2 0.000039 141.14.26.2 -> 141.14.16.23 TCP 74 2049 → 690 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1127040480 TSecr=328229577 WS=512
  3 0.000155 141.14.16.23 -> 141.14.26.2 TCP 66 690 → 2049 [ACK] Seq=1 Ack=1 Win=29696 Len=0 TSval=328229577 TSecr=1127040480
  4 0.000180 141.14.16.23 -> 141.14.26.2 NFS 110 V4 NULL Call
  5 0.206937 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328229784 TSecr=1127040480
  6 0.414925 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328229992 TSecr=1127040480
  7 0.822928 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328230400 TSecr=1127040480
  8 1.056498 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious Retransmission] 2049 → 690 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1127041536 TSecr=328230400 WS=512
```

Looking at the stack traces of the nfsd processes with

```
for p in $(pgrep 'nfsd$'); do echo $p; cat /proc/$p/stack; done
```

all but one have the trace below.


```
[<ffffffff813eded7>] call_rwsem_down_write_failed+0x17/0x30
[<ffffffff8133aa1c>] xfs_file_buffered_aio_write+0x5c/0x2d0
[<ffffffff8133ad2e>] xfs_file_write_iter+0x9e/0x150
[<ffffffff81196b50>] do_iter_readv_writev+0xa0/0xf0
[<ffffffff81197d4a>] do_readv_writev+0x18a/0x230
[<ffffffff8119803c>] vfs_writev+0x3c/0x50
[<ffffffffa02826b5>] nfsd_vfs_write+0xc5/0x280 [nfsd]
[<ffffffffa028e48a>] nfsd4_write+0x17a/0x1d0 [nfsd]
[<ffffffffa02902f9>] nfsd4_proc_compound+0x369/0x5e0 [nfsd]
[<ffffffffa027e218>] nfsd_dispatch+0xa8/0x180 [nfsd]
[<ffffffffa01a66af>] svc_process_common+0x3ff/0x580 [sunrpc]
[<ffffffffa01a696b>] svc_process+0x13b/0x1b0 [sunrpc]
[<ffffffffa027dc7d>] nfsd+0xed/0x160 [nfsd]
[<ffffffff8107b0ba>] kthread+0xca/0xe0
[<ffffffff81a97092>] ret_from_fork+0x22/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
```

The remaining one has the stack trace below.

```
[<ffffffff811f6f6c>] iomap_write_begin+0x8c/0x120
[<ffffffff811f758b>] iomap_zero_range_actor+0xeb/0x210 # or iomap_zero_range_actor+0x90/0x210
[<ffffffff811f77e2>] iomap_apply+0xa2/0x110
[<ffffffff811f79b8>] iomap_zero_range+0x58/0x80
[<ffffffff8133a38e>] xfs_zero_eof+0x4e/0xb0
[<ffffffff8133a58d>] xfs_file_aio_write_checks+0x19d/0x1c0
[<ffffffff8133aa39>] xfs_file_buffered_aio_write+0x79/0x2d0
[<ffffffff8133ad2e>] xfs_file_write_iter+0x9e/0x150
[<ffffffff81196b50>] do_iter_readv_writev+0xa0/0xf0
[<ffffffff81197d4a>] do_readv_writev+0x18a/0x230
[<ffffffff8119803c>] vfs_writev+0x3c/0x50
[<ffffffffa02826b5>] nfsd_vfs_write+0xc5/0x280 [nfsd]
[<ffffffffa028e48a>] nfsd4_write+0x17a/0x1d0 [nfsd]
[<ffffffffa02902f9>] nfsd4_proc_compound+0x369/0x5e0 [nfsd]
[<ffffffffa027e218>] nfsd_dispatch+0xa8/0x180 [nfsd]
[<ffffffffa01a66af>] svc_process_common+0x3ff/0x580 [sunrpc]
[<ffffffffa01a696b>] svc_process+0x13b/0x1b0 [sunrpc]
[<ffffffffa027dc7d>] nfsd+0xed/0x160 [nfsd]
[<ffffffff8107b0ba>] kthread+0xca/0xe0
[<ffffffff81a97092>] ret_from_fork+0x22/0x30
[<ffffffffffffffff>] 0xffffffffffffffff
```

Here are the corresponding lines in the source code.

```
(gdb) info line *0xffffffff811f6f6c
Line 124 of "fs/iomap.c" starts at address 0xffffffff811f6f6c <iomap_write_begin+140> and ends at 0xffffffff811f6f70 <iomap_write_begin+144>.
(gdb) info line *0xffffffff811f758b
Line 345 of "fs/iomap.c" starts at address 0xffffffff811f758b <iomap_zero_range_actor+235> and ends at 0xffffffff811f758f <iomap_zero_range_actor+239>.
(gdb) info line *0xffffffff811f7530
Line 385 of "fs/iomap.c" starts at address 0xffffffff811f7530 <iomap_zero_range_actor+144> and ends at 0xffffffff811f7538 <iomap_zero_range_actor+152>.
```

Here is the code.

```
[…]
109 static int
110 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len, unsigned flags,
111 		struct page **pagep, struct iomap *iomap)
112 {
113 	pgoff_t index = pos >> PAGE_SHIFT;
114 	struct page *page;
115 	int status = 0;
116 
117 	BUG_ON(pos + len > iomap->offset + iomap->length);
118 
119 	if (fatal_signal_pending(current))
120 		return -EINTR;
121 
122 	page = grab_cache_page_write_begin(inode->i_mapping, index, flags);
123 	if (!page)
124 		return -ENOMEM;
125 
126 	status = __block_write_begin_int(page, pos, len, NULL, iomap);
127 	if (unlikely(status)) {
128 		unlock_page(page);
129 		put_page(page);
130 		page = NULL;
131 
132 		iomap_write_failed(inode, pos, len);
133 	}
134 
135 	*pagep = page;
136 	return status;
137 }
[…]
```

So it looks like insufficient memory? But the machine has 1 TB of RAM,
and I didn’t see anything suspicious running `htop`. Also, it should
have returned with that error, shouldn’t it?
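
For reference, whether the box was actually under memory or writeback pressure at the time can be sanity-checked from `/proc/meminfo`; a generic sketch, not specific to this machine:

```shell
# Snapshot the memory counters most relevant to a stalled buffered
# write: total/available memory plus dirty and writeback page totals.
if [ -r /proc/meminfo ]; then
    grep -E '^(MemTotal|MemAvailable|Dirty|Writeback):' /proc/meminfo
else
    echo "/proc/meminfo not available"
fi
```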


Kind regards,

Paul


2017-05-08 13:18:48

by Brian Foster

Subject: Re: Locking problems with Linux 4.9 with NFSD and `fs/iomap.c`

cc Christoph, who's more familiar with nfs and wrote the iomap bits.

On Sun, May 07, 2017 at 09:09:49PM +0200, Paul Menzel wrote:
> Dear Linux folks,
>
>
> There seems to be a regression in Linux 4.9 compared to 4.4. Maybe you have
> an idea.
>
> The system used 4.4.38 without issues, and was updated to 4.9.23 on April
> 24th. Since Friday, the NFS exports were not accessible anymore. Rebooting
> the system into 4.9.24 and 4.9.25 didn’t change anything, and the system
> went into the same lock right away. Booting 4.4.38 fixed the issue
> though.
>

The buffered write path was rewritten with the iomap mechanism around
4.7 or so, so there's a pretty big functionality gap between 4.4 and
4.9.

> Here is more information.
>
> NFS doesn’t respond to a null call.
>

What exactly is a NULL call? Can this be reproduced easily?

Otherwise, it's not clear to me whether you've hit a deadlock or some
kind of livelock. Have you checked syslog for any crash or hung task
messages? Please also provide the hung task output (echo w >
/proc/sysrq-trigger) once you've hit this state. It would be
particularly interesting to see whether the iomap_zero_range() path is
included in that output.
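
The hung-task dump Brian asks for can be captured roughly like this (needs root; `w` dumps the stacks of all blocked tasks to the kernel log, and the `tail` length is an arbitrary choice):

```shell
# Trigger a dump of blocked (uninterruptible) tasks, then read the
# resulting stack traces back out of the kernel ring buffer.
if [ -w /proc/sysrq-trigger ]; then
    echo w > /proc/sysrq-trigger
    dmesg | tail -n 200
else
    echo "need root (writable /proc/sysrq-trigger) to dump hung tasks"
fi
```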

It may also be interesting to enable the xfs_zero_eof() tracepoint
(trace-cmd start -e 'xfs:xfs_zero_eof') and see what the last few
entries are from /sys/kernel/debug/tracing/trace_pipe.

Brian

> ```
> # tshark -r x.x
> 1 0.000000 141.14.16.23 -> 141.14.26.2 TCP 74 1005 → 2049 [SYN] Seq=0
> Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=327954228 TSecr=0 WS=512
> 2 0.000245 141.14.26.2 -> 141.14.16.23 TCP 74 2049 → 1005 [SYN, ACK]
> Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1126765132
> TSecr=327954228 WS=512
> 3 0.000269 141.14.16.23 -> 141.14.26.2 TCP 66 1005 → 2049 [ACK] Seq=1
> Ack=1 Win=29696 Len=0 TSval=327954229 TSecr=1126765132
> 4 0.000300 141.14.16.23 -> 141.14.26.2 NFS 110 V4 NULL Call
> 5 0.202652 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission]
> 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327954432
> TSecr=1126765132
> 6 0.410651 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission]
> 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327954640
> TSecr=1126765132
> 7 0.818649 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission]
> 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327955048
> TSecr=1126765132
> 8 1.012429 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious
> Retransmission] 2049 → 1005 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460
> SACK_PERM=1 TSval=1126766144 TSecr=327955048 WS=512
> 9 1.012448 141.14.16.23 -> 141.14.26.2 TCP 66 [TCP Dup ACK 3#1] 1005 →
> 2049 [ACK] Seq=45 Ack=1 Win=29696 Len=0 TSval=327955241 TSecr=1126765132
> 10 1.674650 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission]
> 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327955904
> TSecr=1126765132
> 11 3.060502 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious
> Retransmission] 2049 → 1005 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460
> SACK_PERM=1 TSval=1126768193 TSecr=327955904 WS=512
> 12 3.060521 141.14.16.23 -> 141.14.26.2 TCP 66 [TCP Dup ACK 3#2] 1005 →
> 2049 [ACK] Seq=45 Ack=1 Win=29696 Len=0 TSval=327957289 TSecr=1126765132
> 13 3.338653 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission]
> 1005 → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=327957568
> TSecr=1126765132
> ```
>
> ```
> # tshark -r x.x
> 1 0.000000 141.14.16.23 -> 141.14.26.2 TCP 74 690 → 2049 [SYN] Seq=0
> Win=29200 Len=0 MSS=1460 SACK_PERM=1 TSval=328229577 TSecr=0 WS=512
> 2 0.000039 141.14.26.2 -> 141.14.16.23 TCP 74 2049 → 690 [SYN, ACK]
> Seq=0 Ack=1 Win=28960 Len=0 MSS=1460 SACK_PERM=1 TSval=1127040480
> TSecr=328229577 WS=512
> 3 0.000155 141.14.16.23 -> 141.14.26.2 TCP 66 690 → 2049 [ACK] Seq=1
> Ack=1 Win=29696 Len=0 TSval=328229577 TSecr=1127040480
> 4 0.000180 141.14.16.23 -> 141.14.26.2 NFS 110 V4 NULL Call
> 5 0.206937 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690
> → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328229784
> TSecr=1127040480
> 6 0.414925 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690
> → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328229992
> TSecr=1127040480
> 7 0.822928 141.14.16.23 -> 141.14.26.2 TCP 110 [TCP Retransmission] 690
> → 2049 [PSH, ACK] Seq=1 Ack=1 Win=29696 Len=44 TSval=328230400
> TSecr=1127040480
> 8 1.056498 141.14.26.2 -> 141.14.16.23 TCP 74 [TCP Spurious
> Retransmission] 2049 → 690 [SYN, ACK] Seq=0 Ack=1 Win=28960 Len=0 MSS=1460
> SACK_PERM=1 TSval=1127041536 TSecr=328230400 WS=512
> ```
>
> Looking at the stack traces of the nfsd process with
>
> ```
> for p in $(pgrep 'nfsd$'); do echo $p; cat /proc/$p/stack; done
> ```
>
> all but one have the trace below.
>
>
> ```
> [<ffffffff813eded7>] call_rwsem_down_write_failed+0x17/0x30
> [<ffffffff8133aa1c>] xfs_file_buffered_aio_write+0x5c/0x2d0
> [<ffffffff8133ad2e>] xfs_file_write_iter+0x9e/0x150
> [<ffffffff81196b50>] do_iter_readv_writev+0xa0/0xf0
> [<ffffffff81197d4a>] do_readv_writev+0x18a/0x230
> [<ffffffff8119803c>] vfs_writev+0x3c/0x50
> [<ffffffffa02826b5>] nfsd_vfs_write+0xc5/0x280 [nfsd]
> [<ffffffffa028e48a>] nfsd4_write+0x17a/0x1d0 [nfsd]
> [<ffffffffa02902f9>] nfsd4_proc_compound+0x369/0x5e0 [nfsd]
> [<ffffffffa027e218>] nfsd_dispatch+0xa8/0x180 [nfsd]
> [<ffffffffa01a66af>] svc_process_common+0x3ff/0x580 [sunrpc]
> [<ffffffffa01a696b>] svc_process+0x13b/0x1b0 [sunrpc]
> [<ffffffffa027dc7d>] nfsd+0xed/0x160 [nfsd]
> [<ffffffff8107b0ba>] kthread+0xca/0xe0
> [<ffffffff81a97092>] ret_from_fork+0x22/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> ```
>
> The one problematic one has the stack trace below.
>
> ```
> [<ffffffff811f6f6c>] iomap_write_begin+0x8c/0x120
> [<ffffffff811f758b>] iomap_zero_range_actor+0xeb/0x210 # or iomap_zero_range_actor+0x90/0x210
> [<ffffffff811f77e2>] iomap_apply+0xa2/0x110
> [<ffffffff811f79b8>] iomap_zero_range+0x58/0x80
> [<ffffffff8133a38e>] xfs_zero_eof+0x4e/0xb0
> [<ffffffff8133a58d>] xfs_file_aio_write_checks+0x19d/0x1c0
> [<ffffffff8133aa39>] xfs_file_buffered_aio_write+0x79/0x2d0
> [<ffffffff8133ad2e>] xfs_file_write_iter+0x9e/0x150
> [<ffffffff81196b50>] do_iter_readv_writev+0xa0/0xf0
> [<ffffffff81197d4a>] do_readv_writev+0x18a/0x230
> [<ffffffff8119803c>] vfs_writev+0x3c/0x50
> [<ffffffffa02826b5>] nfsd_vfs_write+0xc5/0x280 [nfsd]
> [<ffffffffa028e48a>] nfsd4_write+0x17a/0x1d0 [nfsd]
> [<ffffffffa02902f9>] nfsd4_proc_compound+0x369/0x5e0 [nfsd]
> [<ffffffffa027e218>] nfsd_dispatch+0xa8/0x180 [nfsd]
> [<ffffffffa01a66af>] svc_process_common+0x3ff/0x580 [sunrpc]
> [<ffffffffa01a696b>] svc_process+0x13b/0x1b0 [sunrpc]
> [<ffffffffa027dc7d>] nfsd+0xed/0x160 [nfsd]
> [<ffffffff8107b0ba>] kthread+0xca/0xe0
> [<ffffffff81a97092>] ret_from_fork+0x22/0x30
> [<ffffffffffffffff>] 0xffffffffffffffff
> ```
>
> Here are the corresponding lines in the source code.
>
> ```
> info line *0xffffffff811f6f6c Line 124 of "fs/iomap.c" starts at address
> 0xffffffff811f6f6c <iomap_write_begin+140> and ends at 0xffffffff811f6f70
> <iomap_write_begin+144>.
> info line *0xffffffff811f758b Line 345 of "fs/iomap.c" starts at address
> 0xffffffff811f758b <iomap_zero_range_actor+235> and ends at
> 0xffffffff811f758f <iomap_zero_range_actor+239>.
> info line *0xffffffff811f7530 Line 385 of "fs/iomap.c" starts at address
> 0xffffffff811f7530 <iomap_zero_range_actor+144> and ends at
> 0xffffffff811f7538 <iomap_zero_range_actor+152>.
> ```
>
> Here is the code.
>
> ```
> […]
> 109 static int
> 110 iomap_write_begin(struct inode *inode, loff_t pos, unsigned len,
> unsigned flags,
> 111 struct page **pagep, struct iomap *iomap)
> 112 {
> 113 pgoff_t index = pos >> PAGE_SHIFT;
> 114 struct page *page;
> 115 int status = 0;
> 116
> 117 BUG_ON(pos + len > iomap->offset + iomap->length);
> 118
> 119 if (fatal_signal_pending(current))
> 120 return -EINTR;
> 121
> 122 page = grab_cache_page_write_begin(inode->i_mapping, index,
> flags);
> 123 if (!page)
> 124 return -ENOMEM;
> 125
> 126 status = __block_write_begin_int(page, pos, len, NULL, iomap);
> 127 if (unlikely(status)) {
> 128 unlock_page(page);
> 129 put_page(page);
> 130 page = NULL;
> 131
> 132 iomap_write_failed(inode, pos, len);
> 133 }
> 134
> 135 *pagep = page;
> 136 return status;
> 137 }
> […]
> ```
>
> So it looks like insufficient memory? But the machine has 1 TB of RAM, and I
> didn’t see anything suspicious running `htop`. Also, it should have returned
> with that error, shouldn’t it?
>
>
> Kind regards,
>
> Paul

2017-05-09 09:05:36

by Christoph Hellwig

Subject: Re: Locking problems with Linux 4.9 with NFSD and `fs/iomap.c`

On Mon, May 08, 2017 at 09:18:46AM -0400, Brian Foster wrote:
> What exactly is a NULL call? Can this be reproduced easily?

It's a sunrpc concept: a dummy procedure that does no work, but
ensures the communication works.

> It may also be interesting to enable the xfs_zero_eof() tracepoint
> (trace-cmd start -e 'xfs:xfs_zero_eof') and see what the last few
> entries are from /sys/kernel/debug/tracing/trace_pipe.

Yeah. I have a customer that uses Linux 4.9-stable with XFS and
nfsd exports in a product, so it's certainly not something that happens
too easily, but memory pressure might be a factor. I'll see if
I can figure out what happens if I inject ENOMEM in that
spot.

2017-05-10 09:08:54

by Paul Menzel

Subject: Re: Locking problems with Linux 4.9 with NFSD and `fs/iomap.c`

Dear Brian,


On 05/08/17 15:18, Brian Foster wrote:
> cc Christoph, who's more familiar with nfs and wrote the iomap bits.

Thank you.

> On Sun, May 07, 2017 at 09:09:49PM +0200, Paul Menzel wrote:

>> There seems to be a regression in Linux 4.9 compared to 4.4. Maybe you have
>> an idea.
>>
>> The system used 4.4.38 without issues, and was updated to 4.9.23 on April
>> 24th. Since Friday, the NFS exports were not accessible anymore. Rebooting
>> the system into 4.9.24 and 4.9.25 didn’t change anything, and the
>> system went into the same lock right away. Booting 4.4.38 fixed the issue
>> though.
>
> The buffered write path was rewritten with the iomap mechanism around
> 4.7 or so, so there's a pretty big functionality gap between 4.4 and
> 4.9.
>
>> Here is more information.
>>
>> NFS doesn’t respond to a null call.
>
> What exactly is a NULL call?

Sorry for not making that clear for non-NFS people. From *NFS Version 3
Protocol Specification* [1]:

> Procedure NULL does not do any work. It is made available to
> allow server response testing and timing.

> Can this be reproduced easily?

Unfortunately, we don’t know how to reproduce it. It seems to happen
after heavy input/output operations though.

```
$ sudo nfsstat -s
Server rpc stats:
calls      badcalls   badclnt    badauth    xdrcall
15644232   0          0          0          0

Server nfs v4:
null         compound
1071      0% 15643006 99%

Server nfs v4 operations:
op0-unused   op1-unused   op2-future   access       close        commit
0         0% 0         0% 0         0% 87846     0% 42798     0% 50658     0%
create       delegpurge   delegreturn  getattr      getfh        link
2805      0% 0         0% 16866     0% 8271204  21% 82924     0% 0         0%
lock         lockt        locku        lookup       lookup_root  nverify
0         0% 0         0% 0         0% 64424     0% 0         0% 0         0%
open         openattr     open_conf    open_dgrd    putfh        putpubfh
53848     0% 0         0% 1081      0% 20        0% 15569041 39% 0         0%
putrootfh    read         readdir      readlink     remove       rename
1072      0% 7187366  18% 2045      0% 73        0% 9116      0% 5836      0%
renew        restorefh    savefh       secinfo      setattr      setcltid
72534     0% 0         0% 5836      0% 0         0% 21817     0% 1854      0%
setcltidconf verify       write        rellockowner bc_ctl       bind_conn
1854      0% 0         0% 7794634  19% 0         0% 0         0% 0         0%
exchange_id  create_ses   destroy_ses  free_stateid getdirdeleg  getdevinfo
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
getdevlist   layoutcommit layoutget    layoutreturn secinfononam sequence
0         0% 0         0% 0         0% 0         0% 0         0% 0         0%
set_ssv      test_stateid want_deleg   destroy_clid reclaim_comp
0         0% 0         0% 0         0% 0         0% 0         0%
```

> Otherwise, it's not clear to me whether you've hit a deadlock or some
> kind of livelock. Have you checked syslog for any crash or hung task
> messages? Please also provide the hung task output (echo w >
> /proc/sysrq-trigger) once you've hit this state. It would be
> particularly interesting to see whether the iomap_zero_range() path is
> included in that output.

Please see the Linux messages in my reply to Christoph’s message.

> It may also be interesting to enable the xfs_zero_eof() tracepoint
> (trace-cmd start -e 'xfs:xfs_zero_eof') and see what the last few
> entries are from /sys/kernel/debug/tracing/trace_pipe.

I built `trace-cmd`, and did what you asked, but there are no messages.

```
$ sudo strace ~/src/trace-cmd/trace-cmd start -e 'xfs:xfs_zero_eof'
$ sudo cat /sys/kernel/tracing/events/xfs/xfs_zero_eof/enable
1
$ sudo cat /sys/kernel/debug/tracing/tracing_on
1
$ sudo cat /sys/kernel/debug/tracing/trace_pipe

```


Kind regards,

Paul


[1] https://www.ietf.org/rfc/rfc1813.txt

2017-05-10 17:23:23

by J. Bruce Fields

Subject: Re: Locking problems with Linux 4.9 with NFSD and `fs/iomap.c`

On Wed, May 10, 2017 at 11:08:52AM +0200, Paul Menzel wrote:
> On 05/08/17 15:18, Brian Foster wrote:
> >>NFS doesn’t respond to a null call.
> >
> >What exactly is a NULL call?
>
> Sorry for not making that clear for non-NFS people. From *NFS
> Version 3 Protocol Specification* [1]:
>
> >Procedure NULL does not do any work. It is made available to
> >allow server response testing and timing.

NFSD has a fixed number of threads for processing requests, so if one of
them hangs in the filesystem while holding a lock, all of them are
likely to eventually end up waiting for that lock, and there will be
none left even to answer NULL requests. So this is a pretty normal
symptom of a deadlock or similar problem.
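
That failure mode is easy to spot from userspace by counting how many nfsd threads sit in uninterruptible (D) sleep; a rough sketch, assuming the usual `nfsd` kernel-thread naming:

```shell
# Count nfsd kernel threads, and how many of them are stuck in
# uninterruptible sleep (process state D). If the two numbers match,
# every worker is blocked and even a NULL call goes unanswered.
total=$(ps -eo stat=,comm= | awk '$2 == "nfsd" { n++ } END { print n+0 }')
stuck=$(ps -eo stat=,comm= | awk '$2 == "nfsd" && $1 ~ /^D/ { n++ } END { print n+0 }')
echo "nfsd threads: $total total, $stuck in D state"
```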

--b.