Dear all,
for our researchers we are running file servers in the hundreds-of-TiB to
low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
we prepared an upgrade to Debian bookworm and tests went well. About a week
after one of the upgrades, we ran into the first occurrence of our problem: all
of a sudden, all nfsds enter the D state and are not recoverable. However, the
underlying file systems seem fine and can be read and written to. The only way
out appears to be to reboot the server. The only clues are the frozen nfsds
and stack traces like
[<0>] rq_qos_wait+0xbc/0x130
[<0>] wbt_wait+0xa2/0x110
[<0>] __rq_qos_throttle+0x20/0x40
[<0>] blk_mq_submit_bio+0x2d3/0x580
[<0>] submit_bio_noacct_nocheck+0xf7/0x2c0
[<0>] iomap_submit_ioend+0x4b/0x80
[<0>] iomap_do_writepage+0x4b4/0x820
[<0>] write_cache_pages+0x180/0x4c0
[<0>] iomap_writepages+0x1c/0x40
[<0>] xfs_vm_writepages+0x79/0xb0 [xfs]
[<0>] do_writepages+0xbd/0x1c0
[<0>] filemap_fdatawrite_wbc+0x5f/0x80
[<0>] __filemap_fdatawrite_range+0x58/0x80
[<0>] file_write_and_wait_range+0x41/0x90
[<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]
[<0>] nfsd_commit+0x93/0x190 [nfsd]
[<0>] nfsd4_commit+0x5e/0x90 [nfsd]
[<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
[<0>] nfsd_dispatch+0x167/0x280 [nfsd]
[<0>] svc_process_common+0x286/0x5e0 [sunrpc]
[<0>] svc_process+0xad/0x100 [sunrpc]
[<0>] nfsd+0xd5/0x190 [nfsd]
[<0>] kthread+0xe6/0x110
[<0>] ret_from_fork+0x1f/0x30
(we've also seen nfsd3). It's very sporadic; we have no idea what's triggering
it, and it has now happened 4 times on one server and once on a second.
Needless to say, these are production systems, so we have a window of a few
minutes for debugging before people start yelling. We've thrown everything we
could at our test setup but so far haven't been able to trigger it.
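For reference, this is roughly what we can grab in those few minutes (a rough
sketch; it assumes root, and the last step assumes sysrq is enabled):

ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /D/'    # D-state tasks and their wait channel
for pid in $(pgrep -x nfsd); do
    echo "== nfsd $pid =="
    cat /proc/$pid/stack                          # kernel stack of each nfsd thread
done
echo w > /proc/sysrq-trigger                      # ask the kernel to dump all blocked tasks
dmesg -T | tail -n 300                            # collect the sysrq-w output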
Any pointers would be highly appreciated.
thanks and best regards,
-Christian
cat /etc/os-release
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
uname -vr
6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19)
apt list --installed '*nfs*'
libnfsidmap1/testing,now 1:2.6.2-4 amd64 [installed,automatic]
nfs-common/testing,now 1:2.6.2-4 amd64 [installed]
nfs-kernel-server/testing,now 1:2.6.2-4 amd64 [installed]
nfsconf -d
[exportd]
debug = all
[exportfs]
debug = all
[general]
pipefs-directory = /run/rpc_pipefs
[lockd]
port = 32769
udp-port = 32769
[mountd]
debug = all
manage-gids = True
port = 892
[nfsd]
debug = all
port = 2049
threads = 48
[nfsdcld]
debug = all
[nfsdcltrack]
debug = all
[sm-notify]
debug = all
outgoing-port = 846
[statd]
debug = all
outgoing-port = 2020
port = 662
--
Dr. Christian Herzog <[email protected]> support: +41 44 633 26 68
Head, IT Services Group, HPT H 8 voice: +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland http://isg.phys.ethz.ch/
> On Apr 6, 2023, at 7:09 AM, Christian Herzog <[email protected]> wrote:
>
> Dear all,
>
> for our researchers we are running file servers in the hundreds-of-TiB to
> low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
> LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
> we prepared an upgrade to Debian bookworm and tests went well. About a week
> after one of the upgrades, we ran into the first occurrence of our problem: all
> of a sudden, all nfsds enter the D state and are not recoverable. However, the
> underlying file systems seem fine and can be read and written to. The only way
> out appears to be to reboot the server. The only clues are the frozen nfsds
> and stack traces like
>
> [<0>] rq_qos_wait+0xbc/0x130
> [<0>] wbt_wait+0xa2/0x110
Hi Christian, you have a pretty deep storage stack!
rq_qos_wait is a few layers below NFSD. Jens Axboe
and linux-block are the folks who maintain that.
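Since wbt_wait shows up in that trace, it might be worth checking
whether writeback throttling is active on the devices backing those
LUNs and what its latency target is. A sketch (sdX is a placeholder;
whether turning it off is sensible is a question for linux-block):

grep . /sys/block/*/queue/wbt_lat_usec         # latency target per device; 0 means wbt is disabled
echo 0  > /sys/block/sdX/queue/wbt_lat_usec    # experiment only: disable wbt on one device
echo -1 > /sys/block/sdX/queue/wbt_lat_usec    # restore the kernel default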
> [<0>] __rq_qos_throttle+0x20/0x40
> [<0>] blk_mq_submit_bio+0x2d3/0x580
> [<0>] submit_bio_noacct_nocheck+0xf7/0x2c0
> [<0>] iomap_submit_ioend+0x4b/0x80
> [<0>] iomap_do_writepage+0x4b4/0x820
> [<0>] write_cache_pages+0x180/0x4c0
> [<0>] iomap_writepages+0x1c/0x40
> [<0>] xfs_vm_writepages+0x79/0xb0 [xfs]
> [<0>] do_writepages+0xbd/0x1c0
> [<0>] filemap_fdatawrite_wbc+0x5f/0x80
> [<0>] __filemap_fdatawrite_range+0x58/0x80
> [<0>] file_write_and_wait_range+0x41/0x90
> [<0>] xfs_file_fsync+0x5a/0x2a0 [xfs]
> [<0>] nfsd_commit+0x93/0x190 [nfsd]
> [<0>] nfsd4_commit+0x5e/0x90 [nfsd]
> [<0>] nfsd4_proc_compound+0x352/0x660 [nfsd]
> [<0>] nfsd_dispatch+0x167/0x280 [nfsd]
> [<0>] svc_process_common+0x286/0x5e0 [sunrpc]
> [<0>] svc_process+0xad/0x100 [sunrpc]
> [<0>] nfsd+0xd5/0x190 [nfsd]
> [<0>] kthread+0xe6/0x110
> [<0>] ret_from_fork+0x1f/0x30
>
> (we've also seen nfsd3). It's very sporadic, we have no idea what's triggering
> it and it has now happened 4 times on one server and once on a second.
> Needless to say, these are production systems, so we have a window of a few
> minutes for debugging before people start yelling. We've thrown everything we
> could at our test setup but so far haven't been able to trigger it.
> Any pointers would be highly appreciated.
>
>
> thanks and best regards,
> -Christian
>
>
>
> cat /etc/os-release
> PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
>
> uname -vr
> 6.1.0-7-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.20-1 (2023-03-19)
>
> apt list --installed '*nfs*'
> libnfsidmap1/testing,now 1:2.6.2-4 amd64 [installed,automatic]
> nfs-common/testing,now 1:2.6.2-4 amd64 [installed]
> nfs-kernel-server/testing,now 1:2.6.2-4 amd64 [installed]
>
> nfsconf -d
> [exportd]
> debug = all
> [exportfs]
> debug = all
> [general]
> pipefs-directory = /run/rpc_pipefs
> [lockd]
> port = 32769
> udp-port = 32769
> [mountd]
> debug = all
> manage-gids = True
> port = 892
> [nfsd]
> debug = all
> port = 2049
> threads = 48
> [nfsdcld]
> debug = all
> [nfsdcltrack]
> debug = all
> [sm-notify]
> debug = all
> outgoing-port = 846
> [statd]
> debug = all
> outgoing-port = 2020
> port = 662
>
>
>
> --
> Dr. Christian Herzog <[email protected]> support: +41 44 633 26 68
> Head, IT Services Group, HPT H 8 voice: +41 44 633 39 50
> Department of Physics, ETH Zurich
> 8093 Zurich, Switzerland http://isg.phys.ethz.ch/
--
Chuck Lever
Dear Chuck,
> > for our researchers we are running file servers in the hundreds-of-TiB to
> > low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
> > LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
> > we prepared an upgrade to Debian bookworm and tests went well. About a week
> > after one of the upgrades, we ran into the first occurrence of our problem: all
> > of a sudden, all nfsds enter the D state and are not recoverable. However, the
> > underlying file systems seem fine and can be read and written to. The only way
> > out appears to be to reboot the server. The only clues are the frozen nfsds
> > and stack traces like
> >
> > [<0>] rq_qos_wait+0xbc/0x130
> > [<0>] wbt_wait+0xa2/0x110
>
> Hi Christian, you have a pretty deep storage stack!
> rq_qos_wait is a few layers below NFSD. Jens Axboe
> and linux-block are the folks who maintain that.
are you saying the root cause isn't nfs*, but the file system? That was our
first idea too, but we haven't found any indication that this is the case. The
xfs file systems seem perfectly fine when all nfsds are in D state, and we can
read from them and write to them. If xfs were to block nfs IO, this should
affect other processes too, right?
thanks and Happy Easter,
-Christian
> On Apr 6, 2023, at 11:33 AM, Christian Herzog <[email protected]> wrote:
>
> Dear Chuck,
>
>>> for our researchers we are running file servers in the hundreds-of-TiB to
>>> low-PiB range that export via NFS and SMB. Storage is iSCSI-over-Infiniband
>>> LUNs LVM'ed into individual XFS file systems. With Ubuntu 18.04 nearing EOL,
>>> we prepared an upgrade to Debian bookworm and tests went well. About a week
>>> after one of the upgrades, we ran into the first occurrence of our problem: all
>>> of a sudden, all nfsds enter the D state and are not recoverable. However, the
>>> underlying file systems seem fine and can be read and written to. The only way
>>> out appears to be to reboot the server. The only clues are the frozen nfsds
>>> and stack traces like
>>>
>>> [<0>] rq_qos_wait+0xbc/0x130
>>> [<0>] wbt_wait+0xa2/0x110
>>
>> Hi Christian, you have a pretty deep storage stack!
>> rq_qos_wait is a few layers below NFSD. Jens Axboe
>> and linux-block are the folks who maintain that.
> are you saying the root cause isn't nfs*, but the file system?
I can't possibly know what the root cause is at this point.
> That was our first idea too, but we haven't found any indication that this is the case. The xfs file systems seem perfectly fine when all nfsds are in D state, and we can
> read from them and write to them. If xfs were to block nfs IO, this should
> affect other processes too, right?
It's possible that the NFSD threads are waiting on I/O to a particular filesystem block. XFS is not likely to block other activity in this case.
I'm merely suggesting that you should start troubleshooting at the bottom of the stack instead of the top. The wait is far outside the realm of NFSD.
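If you can catch it live, a few quick reads at the block layer should
show whether requests are actually stuck below the filesystem. A sketch
(sdX is a placeholder; the last path needs debugfs and blk-mq debugfs
support):

cat /sys/block/sdX/inflight                  # requests currently in flight (reads  writes)
cat /sys/block/sdX/queue/scheduler           # active I/O scheduler
iostat -x 1 5                                # per-device utilization, queue depth, service times
cat /sys/kernel/debug/block/sdX/hctx*/busy   # busy requests per hardware queue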
--
Chuck Lever
Dear Chuck,
> > That was our first idea too, but we haven't found any indication that this is the case. The xfs file systems seem perfectly fine when all nfsds are in D state, and we can
> > read from them and write to them. If xfs were to block nfs IO, this should
> > affect other processes too, right?
>
> It's possible that the NFSD threads are waiting on I/O to a particular filesystem block. XFS is not likely to block other activity in this case.
OK, good to know. So far we were under the impression that a file system would
block as a whole.
> I'm merely suggesting that you should start troubleshooting at the bottom of the stack instead of the top. The wait is far outside the realm of NFSD.
thanks, point taken. So next time it happens we'll make sure to poke in this
direction during the few minutes we have for debugging before we get tarred
and feathered by the users.
-Christian
--
Dr. Christian Herzog <[email protected]> support: +41 44 633 26 68
Head, IT Services Group, HPT H 8 voice: +41 44 633 39 50
Department of Physics, ETH Zurich
8093 Zurich, Switzerland http://isg.phys.ethz.ch/
> On Apr 6, 2023, at 11:54 AM, Christian Herzog <[email protected]> wrote:
>
> Dear Chuck,
>
>>> That was our first idea too, but we haven't found any indication that this is the case. The xfs file systems seem perfectly fine when all nfsds are in D state, and we can
>>> read from them and write to them. If xfs were to block nfs IO, this should
>>> affect other processes too, right?
>>
>> It's possible that the NFSD threads are waiting on I/O to a particular filesystem block. XFS is not likely to block other activity in this case.
> ok good to know. So far we were under the impression that a file system would
> block as a whole.
XFS tries to operate in parallel as much as it can. Maybe other filesystems aren't as capable.
If the unresponsive block is part of a superblock or the journal (ie, shared metadata) I would expect XFS to become unresponsive. For I/O on blocks containing file data, it is likely to have more robust behavior.
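If you want a quick way to tell those two cases apart while it is
happening, a small synchronous write on the affected export forces the
log and should hang if shared metadata is what's blocked. A sketch (the
path is a placeholder, and xfs_io is just one way to do it):

xfs_io -f -c 'pwrite 0 4k' -c fsync /export/data/.hangprobe   # create, write and fsync a small file
rm -f /export/data/.hangprobe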
>> I'm merely suggesting that you should start troubleshooting at the bottom of the stack instead of the top. The wait is far outside the realm of NFSD.
> thanks, point taken. So next time it happens we'll make sure to poke in this
> direction during the few minutes we have for debugging before we get tarred
> and feathered by the users.
I encourage you to discuss debugging tactics with Jens and the block folks -- you can probably capture a lot of info during those few minutes if you have some expert guidance.
Good luck!
--
Chuck Lever
Dear Bob,
thanks a lot for your input.
> >>>> That was our first idea too, but we haven't found any indication that this is the case. The xfs file systems seem perfectly fine when all nfsds are in D state, and we can
> >>>> read from them and write to them. If xfs were to block nfs IO, this should
> >>>> affect other processes too, right?
> >>> It's possible that the NFSD threads are waiting on I/O to a particular filesystem block. XFS is not likely to block other activity in this case.
> >> ok good to know. So far we were under the impression that a file system would
> >> block as a whole.
> >
> > XFS tries to operate in parallel as much as it can. Maybe other filesystems aren't as capable.
> >
> > If the unresponsive block is part of a superblock or the journal (ie, shared metadata) I would expect XFS to become unresponsive. For I/O on blocks containing file data, it is likely to have more robust behavior.
> >
>
> Pretty sure we have seen a similar issue - never fully explained. From what
> I recall, the server gets into a low-memory state. At that point, efforts to
> coalesce writes are abandoned and each write request is processed inline
> rather than scheduled, so all nfsd's pile up in D while writes continue to
> arrive faster than they can be completed. But the back-end store (a high-end
> NetApp RAID 6 with 240 drives, also with XFS) had very little load - not too
> busy. Never fully explained it, but Chuck's point about a shared metadata
> block may be a good place to look, and whether inline writes under low
> memory could have synergy. IIRC, we worked around it with newer releases and
> tunables like minfree/kmem et al. that came into play to reduce - but not
> eliminate - the problem. I'm away from reference material for a while, but
> I'll review and update if I find anything.
we'll certainly investigate this topic, but right now it's kinda hard to
imagine, since I've never seen the file server use more than ~10G of its 64G
of RAM (excluding page cache, of course). We're not even sure heavy writes
trigger the problem; in one case our monitoring hinted at a lot of reads
leading up to the freeze.
OTOH, if our issue could be resolved by throwing a few more RAM modules into
the server, all the better.
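In the meantime, this is the kind of memory/writeback snapshot we plan to
grab next time, to rule that in or out (a sketch):

free -h                                           # overall memory and cache usage
grep -E 'Dirty|Writeback' /proc/meminfo           # pages waiting to be written back
sysctl vm.dirty_ratio vm.dirty_background_ratio   # current writeback thresholds
vmstat 1 5                                        # swapping and block-out rate over a few seconds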
thanks,
-Christian