2022-08-22 08:25:42

by Igor Raits

Subject: Regression: deadlock in io_schedule / nfs_writepage_locked

Hello everyone,

Hopefully I'm sending this to the right place…
We recently started to see the following stacktrace quite often on our
VMs that are using NFS extensively (I think after upgrading to
5.18.11+, but not sure when exactly. For sure it happens on 5.18.15):

INFO: task kworker/u36:10:377691 blocked for more than 122 seconds.
     Tainted: G            E     5.18.15-1.gdc.el8.x86_64 #1
"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
task:kworker/u36:10  state:D stack:    0 pid:377691 ppid:     2 flags:0x00004000
Workqueue: writeback wb_workfn (flush-0:308)
Call Trace:
<TASK>
__schedule+0x38c/0x7d0
schedule+0x41/0xb0
io_schedule+0x12/0x40
__folio_lock+0x110/0x260
? filemap_alloc_folio+0x90/0x90
write_cache_pages+0x1e3/0x4d0
? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
nfs_writepages+0xe1/0x200 [nfs]
do_writepages+0xd2/0x1b0
? check_preempt_curr+0x47/0x70
? ttwu_do_wakeup+0x17/0x180
__writeback_single_inode+0x41/0x360
writeback_sb_inodes+0x1f0/0x460
__writeback_inodes_wb+0x5f/0xd0
wb_writeback+0x235/0x2d0
wb_workfn+0x348/0x4a0
? put_prev_task_fair+0x1b/0x30
? pick_next_task+0x84/0x940
? __update_idle_core+0x1b/0xb0
process_one_work+0x1c5/0x390
worker_thread+0x30/0x360
? process_one_work+0x390/0x390
kthread+0xd7/0x100
? kthread_complete_and_exit+0x20/0x20
ret_from_fork+0x1f/0x30
</TASK>

I see that something very similar was fixed in btrfs
(https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commit/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
but I could not find anything similar for NFS.

Do you happen to know if this is already fixed? If so, would you mind
sharing some commits? If not, could you help getting this addressed?

Thanks in advance!
--
Igor Raits


2022-08-22 14:06:52

by Trond Myklebust

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> [You don't often get email from [email protected]. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Hello everyone,
>
> Hopefully I'm sending this to the right place…
> We recently started to see the following stacktrace quite often on
> our
> VMs that are using NFS extensively (I think after upgrading to
> 5.18.11+, but not sure when exactly. For sure it happens on 5.18.15):
>
> INFO: task kworker/u36:10:377691 blocked for more than 122 seconds.
>      Tainted: G            E     5.18.15-1.gdc.el8.x86_64 #1
> "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> message.
> task:kworker/u36:10  state:D stack:    0 pid:377691 ppid:     2
> flags:0x00004000
> Workqueue: writeback wb_workfn (flush-0:308)
> Call Trace:
> <TASK>
> __schedule+0x38c/0x7d0
> schedule+0x41/0xb0
> io_schedule+0x12/0x40
> __folio_lock+0x110/0x260
> ? filemap_alloc_folio+0x90/0x90
> write_cache_pages+0x1e3/0x4d0
> ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> nfs_writepages+0xe1/0x200 [nfs]
> do_writepages+0xd2/0x1b0
> ? check_preempt_curr+0x47/0x70
> ? ttwu_do_wakeup+0x17/0x180
> __writeback_single_inode+0x41/0x360
> writeback_sb_inodes+0x1f0/0x460
> __writeback_inodes_wb+0x5f/0xd0
> wb_writeback+0x235/0x2d0
> wb_workfn+0x348/0x4a0
> ? put_prev_task_fair+0x1b/0x30
> ? pick_next_task+0x84/0x940
> ? __update_idle_core+0x1b/0xb0
> process_one_work+0x1c5/0x390
> worker_thread+0x30/0x360
> ? process_one_work+0x390/0x390
> kthread+0xd7/0x100
> ? kthread_complete_and_exit+0x20/0x20
> ret_from_fork+0x1f/0x30
> </TASK>
>
> I see that something very similar was fixed in btrfs
> (
> https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> t/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> but I could not find anything similar for NFS.
>
> Do you happen to know if this is already fixed? If so, would you mind
> sharing some commits? If not, could you help getting this addressed?
>

The stack trace you show above isn't particularly helpful for
diagnosing what the problem is.

All it is saying is that 'thread A' is waiting to take a page lock that
is being held by a different 'thread B'. Without information on what
'thread B' is doing, and why it isn't releasing the lock, there is
nothing we can conclude.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-08-22 14:47:10

by Igor Raits

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

Hello Trond,

On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > [You don't often get email from [email protected]. Learn why this is
> > important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > Hello everyone,
> >
> > Hopefully I'm sending this to the right place…
> > We recently started to see the following stacktrace quite often on
> > our
> > VMs that are using NFS extensively (I think after upgrading to
> > 5.18.11+, but not sure when exactly. For sure it happens on 5.18.15):
> >
> > INFO: task kworker/u36:10:377691 blocked for more than 122 seconds.
> > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > message.
> > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid: 2
> > flags:0x00004000
> > Workqueue: writeback wb_workfn (flush-0:308)
> > Call Trace:
> > <TASK>
> > __schedule+0x38c/0x7d0
> > schedule+0x41/0xb0
> > io_schedule+0x12/0x40
> > __folio_lock+0x110/0x260
> > ? filemap_alloc_folio+0x90/0x90
> > write_cache_pages+0x1e3/0x4d0
> > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > nfs_writepages+0xe1/0x200 [nfs]
> > do_writepages+0xd2/0x1b0
> > ? check_preempt_curr+0x47/0x70
> > ? ttwu_do_wakeup+0x17/0x180
> > __writeback_single_inode+0x41/0x360
> > writeback_sb_inodes+0x1f0/0x460
> > __writeback_inodes_wb+0x5f/0xd0
> > wb_writeback+0x235/0x2d0
> > wb_workfn+0x348/0x4a0
> > ? put_prev_task_fair+0x1b/0x30
> > ? pick_next_task+0x84/0x940
> > ? __update_idle_core+0x1b/0xb0
> > process_one_work+0x1c5/0x390
> > worker_thread+0x30/0x360
> > ? process_one_work+0x390/0x390
> > kthread+0xd7/0x100
> > ? kthread_complete_and_exit+0x20/0x20
> > ret_from_fork+0x1f/0x30
> > </TASK>
> >
> > I see that something very similar was fixed in btrfs
> > (
> > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > t/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > but I could not find anything similar for NFS.
> >
> > Do you happen to know if this is already fixed? If so, would you mind
> > sharing some commits? If not, could you help getting this addressed?
> >
>
> The stack trace you show above isn't particularly helpful for
> diagnosing what the problem is.
>
> All it is saying is that 'thread A' is waiting to take a page lock that
> is being held by a different 'thread B'. Without information on what
> 'thread B' is doing, and why it isn't releasing the lock, there is
> nothing we can conclude.

Do you have some hint how to debug this issue further (when it happens
again)? Would `virsh dump` to get a memory dump and then some kind of
"bt all" via crash help to get more information?
Or something else?
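(Concretely, I was thinking of something along these lines -- the paths
are just placeholders:

  # on the hypervisor
  virsh dump --memory-only --format kdump-zlib <domain> /var/tmp/guest.core
  # then, with a vmlinux carrying debuginfo for the *guest* kernel
  crash /path/to/guest-vmlinux-debuginfo /var/tmp/guest.core
  crash> foreach bt

)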

Thanks in advance!
--
Igor Raits

2022-08-22 15:06:55

by Trond Myklebust

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> [You don't often get email from [email protected]. Learn why this is
> important at https://aka.ms/LearnAboutSenderIdentification ]
>
> Hello Trond,
>
> On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> <[email protected]> wrote:
> >
> > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > [You don't often get email from [email protected]. Learn why this
> > > is
> > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > >
> > > Hello everyone,
> > >
> > > Hopefully I'm sending this to the right place…
> > > We recently started to see the following stacktrace quite often
> > > on
> > > our
> > > VMs that are using NFS extensively (I think after upgrading to
> > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > 5.18.15):
> > >
> > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > seconds.
> > >      Tainted: G            E     5.18.15-1.gdc.el8.x86_64 #1
> > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > message.
> > > task:kworker/u36:10  state:D stack:    0 pid:377691 ppid:     2
> > > flags:0x00004000
> > > Workqueue: writeback wb_workfn (flush-0:308)
> > > Call Trace:
> > > <TASK>
> > > __schedule+0x38c/0x7d0
> > > schedule+0x41/0xb0
> > > io_schedule+0x12/0x40
> > > __folio_lock+0x110/0x260
> > > ? filemap_alloc_folio+0x90/0x90
> > > write_cache_pages+0x1e3/0x4d0
> > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > nfs_writepages+0xe1/0x200 [nfs]
> > > do_writepages+0xd2/0x1b0
> > > ? check_preempt_curr+0x47/0x70
> > > ? ttwu_do_wakeup+0x17/0x180
> > > __writeback_single_inode+0x41/0x360
> > > writeback_sb_inodes+0x1f0/0x460
> > > __writeback_inodes_wb+0x5f/0xd0
> > > wb_writeback+0x235/0x2d0
> > > wb_workfn+0x348/0x4a0
> > > ? put_prev_task_fair+0x1b/0x30
> > > ? pick_next_task+0x84/0x940
> > > ? __update_idle_core+0x1b/0xb0
> > > process_one_work+0x1c5/0x390
> > > worker_thread+0x30/0x360
> > > ? process_one_work+0x390/0x390
> > > kthread+0xd7/0x100
> > > ? kthread_complete_and_exit+0x20/0x20
> > > ret_from_fork+0x1f/0x30
> > > </TASK>
> > >
> > > I see that something very similar was fixed in btrfs
> > > (
> > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > t/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > but I could not find anything similar for NFS.
> > >
> > > Do you happen to know if this is already fixed? If so, would you
> > > mind
> > > sharing some commits? If not, could you help getting this
> > > addressed?
> > >
> >
> > The stack trace you show above isn't particularly helpful for
> > diagnosing what the problem is.
> >
> > All it is saying is that 'thread A' is waiting to take a page lock
> > that
> > is being held by a different 'thread B'. Without information on
> > what
> > 'thread B' is doing, and why it isn't releasing the lock, there is
> > nothing we can conclude.
>
> Do you have some hint how to debug this issue further (when it
> happens
> again)? Would `virsh dump` to get a memory dump and then some kind of
> "bt all" via crash help to get more information?
> Or something else?
>
> Thanks in advance!
> --
> Igor Raits

Please try running the following two lines of 'bash' script as root:

(for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) >/tmp/nfs_threads.txt

cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt

and then send us the output from the two files /tmp/nfs_threads.txt and
/tmp/rpc_tasks.txt.

The file nfs_threads.txt gives us a full set of stack traces from all
processes that are currently in the NFS client code. So it should
contain both the stack trace from your 'thread A' above, and the traces
from all candidates for the 'thread B' process that is causing the
blockage.
The file rpc_tasks.txt gives us the status of any RPC calls that might
be outstanding and might help diagnose any issues with the TCP
connection.

That should therefore give us a better starting point for root causing
the problem.
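
If the hang is hard to catch by hand, you could also loop those two
captures in the background so the data is already on disk when the
hung-task warning fires. A rough sketch (the interval and file naming
are arbitrary):

  while true; do
      ts=$(date +%s)
      (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) > /tmp/nfs_threads.${ts}.txt
      cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.${ts}.txt
      sleep 60
  done &

Just remember to prune the old files once the problem has been captured.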

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-08-23 15:09:01

by Igor Raits

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

Hello Trond,

On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > [You don't often get email from [email protected]. Learn why this is
> > important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > Hello Trond,
> >
> > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > [You don't often get email from [email protected]. Learn why this
> > > > is
> > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hello everyone,
> > > >
> > > > Hopefully I'm sending this to the right place…
> > > > We recently started to see the following stacktrace quite often
> > > > on
> > > > our
> > > > VMs that are using NFS extensively (I think after upgrading to
> > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > 5.18.15):
> > > >
> > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > seconds.
> > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > message.
> > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid: 2
> > > > flags:0x00004000
> > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > Call Trace:
> > > > <TASK>
> > > > __schedule+0x38c/0x7d0
> > > > schedule+0x41/0xb0
> > > > io_schedule+0x12/0x40
> > > > __folio_lock+0x110/0x260
> > > > ? filemap_alloc_folio+0x90/0x90
> > > > write_cache_pages+0x1e3/0x4d0
> > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > do_writepages+0xd2/0x1b0
> > > > ? check_preempt_curr+0x47/0x70
> > > > ? ttwu_do_wakeup+0x17/0x180
> > > > __writeback_single_inode+0x41/0x360
> > > > writeback_sb_inodes+0x1f0/0x460
> > > > __writeback_inodes_wb+0x5f/0xd0
> > > > wb_writeback+0x235/0x2d0
> > > > wb_workfn+0x348/0x4a0
> > > > ? put_prev_task_fair+0x1b/0x30
> > > > ? pick_next_task+0x84/0x940
> > > > ? __update_idle_core+0x1b/0xb0
> > > > process_one_work+0x1c5/0x390
> > > > worker_thread+0x30/0x360
> > > > ? process_one_work+0x390/0x390
> > > > kthread+0xd7/0x100
> > > > ? kthread_complete_and_exit+0x20/0x20
> > > > ret_from_fork+0x1f/0x30
> > > > </TASK>
> > > >
> > > > I see that something very similar was fixed in btrfs
> > > > (
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > t/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > but I could not find anything similar for NFS.
> > > >
> > > > Do you happen to know if this is already fixed? If so, would you
> > > > mind
> > > > sharing some commits? If not, could you help getting this
> > > > addressed?
> > > >
> > >
> > > The stack trace you show above isn't particularly helpful for
> > > diagnosing what the problem is.
> > >
> > > All it is saying is that 'thread A' is waiting to take a page lock
> > > that
> > > is being held by a different 'thread B'. Without information on
> > > what
> > > 'thread B' is doing, and why it isn't releasing the lock, there is
> > > nothing we can conclude.
> >
> > Do you have some hint how to debug this issue further (when it
> > happens
> > again)? Would `virsh dump` to get a memory dump and then some kind of
> > "bt all" via crash help to get more information?
> > Or something else?
> >
> > Thanks in advance!
> > --
> > Igor Raits
>
> Please try running the following two lines of 'bash' script as root:
>
> (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) >/tmp/nfs_threads.txt
>
> cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
>
> and then send us the output from the two files /tmp/nfs_threads.txt and
> /tmp/rpc_tasks.txt.
>
> The file nfs_threads.txt gives us a full set of stack traces from all
> processes that are currently in the NFS client code. So it should
> contain both the stack trace from your 'thread A' above, and the traces
> from all candidates for the 'thread B' process that is causing the
> blockage.
> The file rpc_tasks.txt gives us the status of any RPC calls that might
> be outstanding and might help diagnose any issues with the TCP
> connection.
>
> That should therefore give us a better starting point for root causing
> the problem.

We just hit the problem again and here are the files (copying in-line
as the output is not very large):

22/08/23 13:11:37 PD277054 rack-na3/k8s_worker, Cluster 53 @ rack-na3 (id 53)
[root@na3-k8s-worker03:~] cat /tmp/rpc_tasks.txt

22/08/23 13:11:43 PD277054 rack-na3/k8s_worker, Cluster 53 @ rack-na3 (id 53)
[root@na3-k8s-worker03:~] cat /tmp/nfs_threads.txt
/proc/3093002/stack:
[<0>] __folio_lock+0x110/0x260
[<0>] write_cache_pages+0x1e3/0x4d0
[<0>] nfs_writepages+0xe1/0x200 [nfs]
[<0>] do_writepages+0xd2/0x1b0
[<0>] __writeback_single_inode+0x41/0x360
[<0>] writeback_sb_inodes+0x1f0/0x460
[<0>] __writeback_inodes_wb+0x5f/0xd0
[<0>] wb_writeback+0x235/0x2d0
[<0>] wb_workfn+0x312/0x4a0
[<0>] process_one_work+0x1c5/0x390
[<0>] worker_thread+0x30/0x360
[<0>] kthread+0xd7/0x100
[<0>] ret_from_fork+0x1f/0x30

/proc/89011/stack:
[<0>] nfs41_callback_svc+0x1af/0x1c0 [nfsv4]
[<0>] kthread+0xd7/0x100
[<0>] ret_from_fork+0x1f/0x30

>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>


--
Igor Raits

2022-09-09 15:13:52

by Igor Raits

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

Hello Trond,

On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust <[email protected]> wrote:
>
> On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > [You don't often get email from [email protected]. Learn why this is
> > important at https://aka.ms/LearnAboutSenderIdentification ]
> >
> > Hello Trond,
> >
> > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > [You don't often get email from [email protected]. Learn why this
> > > > is
> > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hello everyone,
> > > >
> > > > Hopefully I'm sending this to the right place…
> > > > We recently started to see the following stacktrace quite often
> > > > on
> > > > our
> > > > VMs that are using NFS extensively (I think after upgrading to
> > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > 5.18.15):
> > > >
> > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > seconds.
> > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this
> > > > message.
> > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid: 2
> > > > flags:0x00004000
> > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > Call Trace:
> > > > <TASK>
> > > > __schedule+0x38c/0x7d0
> > > > schedule+0x41/0xb0
> > > > io_schedule+0x12/0x40
> > > > __folio_lock+0x110/0x260
> > > > ? filemap_alloc_folio+0x90/0x90
> > > > write_cache_pages+0x1e3/0x4d0
> > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > do_writepages+0xd2/0x1b0
> > > > ? check_preempt_curr+0x47/0x70
> > > > ? ttwu_do_wakeup+0x17/0x180
> > > > __writeback_single_inode+0x41/0x360
> > > > writeback_sb_inodes+0x1f0/0x460
> > > > __writeback_inodes_wb+0x5f/0xd0
> > > > wb_writeback+0x235/0x2d0
> > > > wb_workfn+0x348/0x4a0
> > > > ? put_prev_task_fair+0x1b/0x30
> > > > ? pick_next_task+0x84/0x940
> > > > ? __update_idle_core+0x1b/0xb0
> > > > process_one_work+0x1c5/0x390
> > > > worker_thread+0x30/0x360
> > > > ? process_one_work+0x390/0x390
> > > > kthread+0xd7/0x100
> > > > ? kthread_complete_and_exit+0x20/0x20
> > > > ret_from_fork+0x1f/0x30
> > > > </TASK>
> > > >
> > > > I see that something very similar was fixed in btrfs
> > > > (
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > t/?h=linux-5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > but I could not find anything similar for NFS.
> > > >
> > > > Do you happen to know if this is already fixed? If so, would you
> > > > mind
> > > > sharing some commits? If not, could you help getting this
> > > > addressed?
> > > >
> > >
> > > The stack trace you show above isn't particularly helpful for
> > > diagnosing what the problem is.
> > >
> > > All it is saying is that 'thread A' is waiting to take a page lock
> > > that
> > > is being held by a different 'thread B'. Without information on
> > > what
> > > 'thread B' is doing, and why it isn't releasing the lock, there is
> > > nothing we can conclude.
> >
> > Do you have some hint how to debug this issue further (when it
> > happens
> > again)? Would `virsh dump` to get a memory dump and then some kind of
> > "bt all" via crash help to get more information?
> > Or something else?
> >
> > Thanks in advance!
> > --
> > Igor Raits
>
> Please try running the following two lines of 'bash' script as root:
>
> (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:"; cat ${tt}; echo; done) >/tmp/nfs_threads.txt
>
> cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
>
> and then send us the output from the two files /tmp/nfs_threads.txt and
> /tmp/rpc_tasks.txt.
>
> The file nfs_threads.txt gives us a full set of stack traces from all
> processes that are currently in the NFS client code. So it should
> contain both the stack trace from your 'thread A' above, and the traces
> from all candidates for the 'thread B' process that is causing the
> blockage.
> The file rpc_tasks.txt gives us the status of any RPC calls that might
> be outstanding and might help diagnose any issues with the TCP
> connection.
>
> That should therefore give us a better starting point for root causing
> the problem.

rpc_tasks.txt is empty, but I did get nfs_threads.txt from the moment it
is stuck (see attached file).

It still happens with 5.19.3, 5.19.6.
--
Igor Raits


Attachments:
nfs_threads.txt (5.42 kB)

2022-09-09 16:52:48

by Trond Myklebust

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Fri, 2022-09-09 at 16:47 +0000, Trond Myklebust wrote:
> This looks like it might be the root cause issue. It looks like
> you're using pNFS:
>
> /proc/3278822/stack:
> [<0>] pnfs_update_layout+0x603/0xed0 [nfsv4]
> [<0>] fl_pnfs_update_layout.constprop.18+0x23/0x1e0
> [nfs_layout_nfsv41_files]
> [<0>] filelayout_pg_init_write+0x3a/0x70 [nfs_layout_nfsv41_files]
> [<0>] __nfs_pageio_add_request+0x294/0x470 [nfs]
> [<0>] nfs_pageio_add_request_mirror+0x2f/0x40 [nfs]
> [<0>] nfs_pageio_add_request+0x200/0x2d0 [nfs]
> [<0>] nfs_page_async_flush+0x120/0x310 [nfs]
> [<0>] nfs_writepages_callback+0x5b/0xc0 [nfs]
> [<0>] write_cache_pages+0x187/0x4d0
> [<0>] nfs_writepages+0xe1/0x200 [nfs]
> [<0>] do_writepages+0xd2/0x1b0
> [<0>] __writeback_single_inode+0x41/0x360
> [<0>] writeback_sb_inodes+0x1f0/0x460
> [<0>] __writeback_inodes_wb+0x5f/0xd0
> [<0>] wb_writeback+0x235/0x2d0
> [<0>] wb_workfn+0x312/0x4a0
> [<0>] process_one_work+0x1c5/0x390
> [<0>] worker_thread+0x30/0x360
> [<0>] kthread+0xd7/0x100
> [<0>] ret_from_fork+0x1f/0x30
>
> What is the pNFS server you are running against? I see you're using
> the files pNFS layout type, so is this a NetApp?
>

Sorry for the HTML spam... Resending with all that crap stripped out.

> From: Igor Raits <[email protected]>
> Sent: Friday, September 9, 2022 11:09
> To: Trond Myklebust <[email protected]>
> Cc: [email protected] <[email protected]>; [email protected]
> <[email protected]>
> Subject: Re: Regression: deadlock in io_schedule /
> nfs_writepage_locked
>  
> Hello Trond,
>
> On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust
> <[email protected]> wrote:
> >
> > On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > > [You don't often get email from [email protected]. Learn why this
> > > is
> > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > >
> > > Hello Trond,
> > >
> > > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > > [You don't often get email from [email protected]. Learn why
> > > > > this
> > > > > is
> > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > >
> > > > > Hello everyone,
> > > > >
> > > > > Hopefully I'm sending this to the right place…
> > > > > We recently started to see the following stacktrace quite
> > > > > often
> > > > > on
> > > > > our
> > > > > VMs that are using NFS extensively (I think after upgrading
> > > > > to
> > > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > > 5.18.15):
> > > > >
> > > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > > seconds.
> > > > >       Tainted: G            E     5.18.15-1.gdc.el8.x86_64 #1
> > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > this
> > > > > message.
> > > > > task:kworker/u36:10  state:D stack:    0 pid:377691 ppid:    
> > > > > 2
> > > > > flags:0x00004000
> > > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > > Call Trace:
> > > > > <TASK>
> > > > > __schedule+0x38c/0x7d0
> > > > > schedule+0x41/0xb0
> > > > > io_schedule+0x12/0x40
> > > > > __folio_lock+0x110/0x260
> > > > > ? filemap_alloc_folio+0x90/0x90
> > > > > write_cache_pages+0x1e3/0x4d0
> > > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > > do_writepages+0xd2/0x1b0
> > > > > ? check_preempt_curr+0x47/0x70
> > > > > ? ttwu_do_wakeup+0x17/0x180
> > > > > __writeback_single_inode+0x41/0x360
> > > > > writeback_sb_inodes+0x1f0/0x460
> > > > > __writeback_inodes_wb+0x5f/0xd0
> > > > > wb_writeback+0x235/0x2d0
> > > > > wb_workfn+0x348/0x4a0
> > > > > ? put_prev_task_fair+0x1b/0x30
> > > > > ? pick_next_task+0x84/0x940
> > > > > ? __update_idle_core+0x1b/0xb0
> > > > > process_one_work+0x1c5/0x390
> > > > > worker_thread+0x30/0x360
> > > > > ? process_one_work+0x390/0x390
> > > > > kthread+0xd7/0x100
> > > > > ? kthread_complete_and_exit+0x20/0x20
> > > > > ret_from_fork+0x1f/0x30
> > > > > </TASK>
> > > > >
> > > > > I see that something very similar was fixed in btrfs
> > > > > (
> > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > > t/?h=linux-
> > > > > 5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > > but I could not find anything similar for NFS.
> > > > >
> > > > > Do you happen to know if this is already fixed? If so, would
> > > > > you
> > > > > mind
> > > > > sharing some commits? If not, could you help getting this
> > > > > addressed?
> > > > >
> > > >
> > > > The stack trace you show above isn't particularly helpful for
> > > > diagnosing what the problem is.
> > > >
> > > > All it is saying is that 'thread A' is waiting to take a page
> > > > lock
> > > > that
> > > > is being held by a different 'thread B'. Without information on
> > > > what
> > > > 'thread B' is doing, and why it isn't releasing the lock, there
> > > > is
> > > > nothing we can conclude.
> > >
> > > Do you have some hint how to debug this issue further (when it
> > > happens
> > > again)? Would `virsh dump` to get a memory dump and then some
> > > kind of
> > > "bt all" via crash help to get more information?
> > > Or something else?
> > >
> > > Thanks in advance!
> > > --
> > > Igor Raits
> >
> > Please try running the following two lines of 'bash' script as
> > root:
> >
> > (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:";
> > cat ${tt}; echo; done) >/tmp/nfs_threads.txt
> >
> > cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
> >
> > and then send us the output from the two files /tmp/nfs_threads.txt
> > and
> > /tmp/rpc_tasks.txt.
> >
> > The file nfs_threads.txt gives us a full set of stack traces from
> > all
> > processes that are currently in the NFS client code. So it should
> > contain both the stack trace from your 'thread A' above, and the
> > traces
> > from all candidates for the 'thread B' process that is causing the
> > blockage.
> > The file rpc_tasks.txt gives us the status of any RPC calls that
> > might
> > be outstanding and might help diagnose any issues with the TCP
> > connection.
> >
> > That should therefore give us a better starting point for root
> > causing
> > the problem.
>
> The rpc_tasks is empty but I got nfs_threads from the moment it is
> stuck (see attached file).
>
> It still happens with 5.19.3, 5.19.6.

--
Trond Myklebust
Linux NFS client maintainer, Hammerspace
[email protected]


2022-09-09 17:18:32

by Igor Raits

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Fri, Sep 9, 2022 at 6:50 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2022-09-09 at 16:47 +0000, Trond Myklebust wrote:
> > This looks like it might be the root cause issue. It looks like
> > you're using pNFS:
> >
> > /proc/3278822/stack:
> > [<0>] pnfs_update_layout+0x603/0xed0 [nfsv4]
> > [<0>] fl_pnfs_update_layout.constprop.18+0x23/0x1e0
> > [nfs_layout_nfsv41_files]
> > [<0>] filelayout_pg_init_write+0x3a/0x70 [nfs_layout_nfsv41_files]
> > [<0>] __nfs_pageio_add_request+0x294/0x470 [nfs]
> > [<0>] nfs_pageio_add_request_mirror+0x2f/0x40 [nfs]
> > [<0>] nfs_pageio_add_request+0x200/0x2d0 [nfs]
> > [<0>] nfs_page_async_flush+0x120/0x310 [nfs]
> > [<0>] nfs_writepages_callback+0x5b/0xc0 [nfs]
> > [<0>] write_cache_pages+0x187/0x4d0
> > [<0>] nfs_writepages+0xe1/0x200 [nfs]
> > [<0>] do_writepages+0xd2/0x1b0
> > [<0>] __writeback_single_inode+0x41/0x360
> > [<0>] writeback_sb_inodes+0x1f0/0x460
> > [<0>] __writeback_inodes_wb+0x5f/0xd0
> > [<0>] wb_writeback+0x235/0x2d0
> > [<0>] wb_workfn+0x312/0x4a0
> > [<0>] process_one_work+0x1c5/0x390
> > [<0>] worker_thread+0x30/0x360
> > [<0>] kthread+0xd7/0x100
> > [<0>] ret_from_fork+0x1f/0x30
> >
> > What is the pNFS server you are running against? I see you're using
> > the files pNFS layout type, so is this a NetApp?

Exactly! The NFS is provided by Netapp AFF-A700s. Mount options are below.
nas-prod01.int.na3.pcigdc.com:/vol/prod_DataLoading on
/mnt/kubelet/pods/c018cfaa-ca32-4864-b141-0e3b8852949c/volumes/kubernetes.io~nfs/datawarehouse-query-worker-nfs-pv
type nfs4 (rw,relatime,vers=4.2,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,clientaddr=10.7.3.9,local_lock=none,addr=10.7.0.11)

> >
>
> Sorry for the HTML spam... Resending with all that crap stripped out.
>
> > From: Igor Raits <[email protected]>
> > Sent: Friday, September 9, 2022 11:09
> > To: Trond Myklebust <[email protected]>
> > Cc: [email protected] <[email protected]>; [email protected]
> > <[email protected]>
> > Subject: Re: Regression: deadlock in io_schedule /
> > nfs_writepage_locked
> >
> > Hello Trond,
> >
> > On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > > > [You don't often get email from [email protected]. Learn why this
> > > > is
> > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hello Trond,
> > > >
> > > > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > > > [You don't often get email from [email protected]. Learn why
> > > > > > this
> > > > > > is
> > > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > Hopefully I'm sending this to the right place…
> > > > > > We recently started to see the following stacktrace quite
> > > > > > often
> > > > > > on
> > > > > > our
> > > > > > VMs that are using NFS extensively (I think after upgrading
> > > > > > to
> > > > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > > > 5.18.15):
> > > > > >
> > > > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > > > seconds.
> > > > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > this
> > > > > > message.
> > > > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid:
> > > > > > 2
> > > > > > flags:0x00004000
> > > > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > __schedule+0x38c/0x7d0
> > > > > > schedule+0x41/0xb0
> > > > > > io_schedule+0x12/0x40
> > > > > > __folio_lock+0x110/0x260
> > > > > > ? filemap_alloc_folio+0x90/0x90
> > > > > > write_cache_pages+0x1e3/0x4d0
> > > > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > > > do_writepages+0xd2/0x1b0
> > > > > > ? check_preempt_curr+0x47/0x70
> > > > > > ? ttwu_do_wakeup+0x17/0x180
> > > > > > __writeback_single_inode+0x41/0x360
> > > > > > writeback_sb_inodes+0x1f0/0x460
> > > > > > __writeback_inodes_wb+0x5f/0xd0
> > > > > > wb_writeback+0x235/0x2d0
> > > > > > wb_workfn+0x348/0x4a0
> > > > > > ? put_prev_task_fair+0x1b/0x30
> > > > > > ? pick_next_task+0x84/0x940
> > > > > > ? __update_idle_core+0x1b/0xb0
> > > > > > process_one_work+0x1c5/0x390
> > > > > > worker_thread+0x30/0x360
> > > > > > ? process_one_work+0x390/0x390
> > > > > > kthread+0xd7/0x100
> > > > > > ? kthread_complete_and_exit+0x20/0x20
> > > > > > ret_from_fork+0x1f/0x30
> > > > > > </TASK>
> > > > > >
> > > > > > I see that something very similar was fixed in btrfs
> > > > > > (
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > > > t/?h=linux-
> > > > > > 5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > > > but I could not find anything similar for NFS.
> > > > > >
> > > > > > Do you happen to know if this is already fixed? If so, would
> > > > > > you
> > > > > > mind
> > > > > > sharing some commits? If not, could you help getting this
> > > > > > addressed?
> > > > > >
> > > > >
> > > > > The stack trace you show above isn't particularly helpful for
> > > > > diagnosing what the problem is.
> > > > >
> > > > > All it is saying is that 'thread A' is waiting to take a page
> > > > > lock
> > > > > that
> > > > > is being held by a different 'thread B'. Without information on
> > > > > what
> > > > > 'thread B' is doing, and why it isn't releasing the lock, there
> > > > > is
> > > > > nothing we can conclude.
> > > >
> > > > Do you have some hint how to debug this issue further (when it
> > > > happens
> > > > again)? Would `virsh dump` to get a memory dump and then some
> > > > kind of
> > > > "bt all" via crash help to get more information?
> > > > Or something else?
> > > >
> > > > Thanks in advance!
> > > > --
> > > > Igor Raits
> > >
> > > Please try running the following two lines of 'bash' script as
> > > root:
> > >
> > > (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:";
> > > cat ${tt}; echo; done) >/tmp/nfs_threads.txt
> > >
> > > cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
> > >
> > > and then send us the output from the two files /tmp/nfs_threads.txt
> > > and
> > > /tmp/rpc_tasks.txt.
> > >
> > > The file nfs_threads.txt gives us a full set of stack traces from
> > > all
> > > processes that are currently in the NFS client code. So it should
> > > contain both the stack trace from your 'thread A' above, and the
> > > traces
> > > from all candidates for the 'thread B' process that is causing the
> > > blockage.
> > > The file rpc_tasks.txt gives us the status of any RPC calls that
> > > might
> > > be outstanding and might help diagnose any issues with the TCP
> > > connection.
> > >
> > > That should therefore give us a better starting point for root
> > > causing
> > > the problem.
> >
> > The rpc_tasks is empty but I got nfs_threads from the moment it is
> > stuck (see attached file).
> >
> > It still happens with 5.19.3, 5.19.6.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>


--

Igor Raits | Principal SW Engineer

[email protected]

+420 775 117 817


Moravske namesti 1007/14

602 00 Brno-Veveri, Czech Republic

2022-09-09 18:01:52

by Olga Kornievskaia

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Fri, Sep 9, 2022 at 12:52 PM Trond Myklebust <[email protected]> wrote:
>
> On Fri, 2022-09-09 at 16:47 +0000, Trond Myklebust wrote:
> > This looks like it might be the root cause issue. It looks like
> > you're using pNFS:
> >
> > /proc/3278822/stack:
> > [<0>] pnfs_update_layout+0x603/0xed0 [nfsv4]
> > [<0>] fl_pnfs_update_layout.constprop.18+0x23/0x1e0
> > [nfs_layout_nfsv41_files]
> > [<0>] filelayout_pg_init_write+0x3a/0x70 [nfs_layout_nfsv41_files]
> > [<0>] __nfs_pageio_add_request+0x294/0x470 [nfs]
> > [<0>] nfs_pageio_add_request_mirror+0x2f/0x40 [nfs]
> > [<0>] nfs_pageio_add_request+0x200/0x2d0 [nfs]
> > [<0>] nfs_page_async_flush+0x120/0x310 [nfs]
> > [<0>] nfs_writepages_callback+0x5b/0xc0 [nfs]
> > [<0>] write_cache_pages+0x187/0x4d0
> > [<0>] nfs_writepages+0xe1/0x200 [nfs]
> > [<0>] do_writepages+0xd2/0x1b0
> > [<0>] __writeback_single_inode+0x41/0x360
> > [<0>] writeback_sb_inodes+0x1f0/0x460
> > [<0>] __writeback_inodes_wb+0x5f/0xd0
> > [<0>] wb_writeback+0x235/0x2d0
> > [<0>] wb_workfn+0x312/0x4a0
> > [<0>] process_one_work+0x1c5/0x390
> > [<0>] worker_thread+0x30/0x360
> > [<0>] kthread+0xd7/0x100
> > [<0>] ret_from_fork+0x1f/0x30
> >
> > What is the pNFS server you are running against? I see you're using
> > the files pNFS layout type, so is this a NetApp?


This reminds me of the problem that was supposed to be fixed by the
patches that went into 5.19-rc3(?):
pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
pNFS: Avoid a live lock condition in pnfs_update_layout()

Igor,

Is the server constantly returning LAYOUT_UNAVAILABLE? And does this
happen to be co-located with a volume move operation?


> >
>
> Sorry for the HTML spam... Resending with all that crap stripped out.
>
> > From: Igor Raits <[email protected]>
> > Sent: Friday, September 9, 2022 11:09
> > To: Trond Myklebust <[email protected]>
> > Cc: [email protected] <[email protected]>; [email protected]
> > <[email protected]>
> > Subject: Re: Regression: deadlock in io_schedule /
> > nfs_writepage_locked
> >
> > Hello Trond,
> >
> > On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust
> > <[email protected]> wrote:
> > >
> > > On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > > > [You don't often get email from [email protected]. Learn why this
> > > > is
> > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > >
> > > > Hello Trond,
> > > >
> > > > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > > > [You don't often get email from [email protected]. Learn why
> > > > > > this
> > > > > > is
> > > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > > >
> > > > > > Hello everyone,
> > > > > >
> > > > > > Hopefully I'm sending this to the right place…
> > > > > > We recently started to see the following stacktrace quite
> > > > > > often
> > > > > > on
> > > > > > our
> > > > > > VMs that are using NFS extensively (I think after upgrading
> > > > > > to
> > > > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > > > 5.18.15):
> > > > > >
> > > > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > > > seconds.
> > > > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > this
> > > > > > message.
> > > > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid:
> > > > > > 2
> > > > > > flags:0x00004000
> > > > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > > > Call Trace:
> > > > > > <TASK>
> > > > > > __schedule+0x38c/0x7d0
> > > > > > schedule+0x41/0xb0
> > > > > > io_schedule+0x12/0x40
> > > > > > __folio_lock+0x110/0x260
> > > > > > ? filemap_alloc_folio+0x90/0x90
> > > > > > write_cache_pages+0x1e3/0x4d0
> > > > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > > > do_writepages+0xd2/0x1b0
> > > > > > ? check_preempt_curr+0x47/0x70
> > > > > > ? ttwu_do_wakeup+0x17/0x180
> > > > > > __writeback_single_inode+0x41/0x360
> > > > > > writeback_sb_inodes+0x1f0/0x460
> > > > > > __writeback_inodes_wb+0x5f/0xd0
> > > > > > wb_writeback+0x235/0x2d0
> > > > > > wb_workfn+0x348/0x4a0
> > > > > > ? put_prev_task_fair+0x1b/0x30
> > > > > > ? pick_next_task+0x84/0x940
> > > > > > ? __update_idle_core+0x1b/0xb0
> > > > > > process_one_work+0x1c5/0x390
> > > > > > worker_thread+0x30/0x360
> > > > > > ? process_one_work+0x390/0x390
> > > > > > kthread+0xd7/0x100
> > > > > > ? kthread_complete_and_exit+0x20/0x20
> > > > > > ret_from_fork+0x1f/0x30
> > > > > > </TASK>
> > > > > >
> > > > > > I see that something very similar was fixed in btrfs
> > > > > > (
> > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > > > t/?h=linux-
> > > > > > 5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > > > but I could not find anything similar for NFS.
> > > > > >
> > > > > > Do you happen to know if this is already fixed? If so, would
> > > > > > you
> > > > > > mind
> > > > > > sharing some commits? If not, could you help getting this
> > > > > > addressed?
> > > > > >
> > > > >
> > > > > The stack trace you show above isn't particularly helpful for
> > > > > diagnosing what the problem is.
> > > > >
> > > > > All it is saying is that 'thread A' is waiting to take a page
> > > > > lock
> > > > > that
> > > > > is being held by a different 'thread B'. Without information on
> > > > > what
> > > > > 'thread B' is doing, and why it isn't releasing the lock, there
> > > > > is
> > > > > nothing we can conclude.
> > > >
> > > > Do you have some hint how to debug this issue further (when it
> > > > happens
> > > > again)? Would `virsh dump` to get a memory dump and then some
> > > > kind of
> > > > "bt all" via crash help to get more information?
> > > > Or something else?
> > > >
> > > > Thanks in advance!
> > > > --
> > > > Igor Raits
> > >
> > > Please try running the following two lines of 'bash' script as
> > > root:
> > >
> > > (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:";
> > > cat ${tt}; echo; done) >/tmp/nfs_threads.txt
> > >
> > > cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
> > >
> > > and then send us the output from the two files /tmp/nfs_threads.txt
> > > and
> > > /tmp/rpc_tasks.txt.
> > >
> > > The file nfs_threads.txt gives us a full set of stack traces from
> > > all
> > > processes that are currently in the NFS client code. So it should
> > > contain both the stack trace from your 'thread A' above, and the
> > > traces
> > > from all candidates for the 'thread B' process that is causing the
> > > blockage.
> > > The file rpc_tasks.txt gives us the status of any RPC calls that
> > > might
> > > be outstanding and might help diagnose any issues with the TCP
> > > connection.
> > >
> > > That should therefore give us a better starting point for root
> > > causing
> > > the problem.
> >
> > The rpc_tasks is empty but I got nfs_threads from the moment it is
> > stuck (see attached file).
> >
> > It still happens with 5.19.3, 5.19.6.
>
> --
> Trond Myklebust
> Linux NFS client maintainer, Hammerspace
> [email protected]
>
>

2022-09-13 09:37:36

by Igor Raits

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

Seems those two were backported to 5.18.6, which is around when we
started to see the issues…
I'll try to revert them and see if it helps.
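(In case it helps, a quick way to double-check which stable release
picked them up is to grep the git history of a linux-stable clone -- the
clone path here is just an example:

  cd ~/src/linux-stable
  git log --oneline v5.18.6 \
      --grep='NFS4ERR_LAYOUTUNAVAILABLE' \
      --grep='live lock condition in pnfs_update_layout'

Both subjects should show up below whichever tag first contains the
backports.)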

> Is the server constantly returning LAYOUT_UNAVAILABLE?

Not sure, any hint how to check it?

> And does this happen to be co-located with a volume move operation?

We don't really touch the volumes during that time, although we use
FlexVolumes so it could expand/shrink on Netapp's side.


On Fri, Sep 9, 2022 at 7:00 PM Olga Kornievskaia <[email protected]> wrote:
>
> On Fri, Sep 9, 2022 at 12:52 PM Trond Myklebust <[email protected]> wrote:
> >
> > On Fri, 2022-09-09 at 16:47 +0000, Trond Myklebust wrote:
> > > This looks like it might be the root cause issue. It looks like
> > > you're using pNFS:
> > >
> > > /proc/3278822/stack:
> > > [<0>] pnfs_update_layout+0x603/0xed0 [nfsv4]
> > > [<0>] fl_pnfs_update_layout.constprop.18+0x23/0x1e0
> > > [nfs_layout_nfsv41_files]
> > > [<0>] filelayout_pg_init_write+0x3a/0x70 [nfs_layout_nfsv41_files]
> > > [<0>] __nfs_pageio_add_request+0x294/0x470 [nfs]
> > > [<0>] nfs_pageio_add_request_mirror+0x2f/0x40 [nfs]
> > > [<0>] nfs_pageio_add_request+0x200/0x2d0 [nfs]
> > > [<0>] nfs_page_async_flush+0x120/0x310 [nfs]
> > > [<0>] nfs_writepages_callback+0x5b/0xc0 [nfs]
> > > [<0>] write_cache_pages+0x187/0x4d0
> > > [<0>] nfs_writepages+0xe1/0x200 [nfs]
> > > [<0>] do_writepages+0xd2/0x1b0
> > > [<0>] __writeback_single_inode+0x41/0x360
> > > [<0>] writeback_sb_inodes+0x1f0/0x460
> > > [<0>] __writeback_inodes_wb+0x5f/0xd0
> > > [<0>] wb_writeback+0x235/0x2d0
> > > [<0>] wb_workfn+0x312/0x4a0
> > > [<0>] process_one_work+0x1c5/0x390
> > > [<0>] worker_thread+0x30/0x360
> > > [<0>] kthread+0xd7/0x100
> > > [<0>] ret_from_fork+0x1f/0x30
> > >
> > > What is the pNFS server you are running against? I see you're using
> > > the files pNFS layout type, so is this a NetApp?
>
>
> This reminds me of the problem that was supposed to be fixed by the
> patches that went into 5.19-rc3?.
> pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
> pNFS: Avoid a live lock condition in pnfs_update_layout()
>
> Igor,
>
> Is the server constantly returning LAYOUT_UNAVAILABLE? And does this
> happen to be co-located with a volume move operation?
>
>
> > >
> >
> > Sorry for the HTML spam... Resending with all that crap stripped out.
> >
> > > From: Igor Raits <[email protected]>
> > > Sent: Friday, September 9, 2022 11:09
> > > To: Trond Myklebust <[email protected]>
> > > Cc: [email protected] <[email protected]>; [email protected]
> > > <[email protected]>
> > > Subject: Re: Regression: deadlock in io_schedule /
> > > nfs_writepage_locked
> > >
> > > Hello Trond,
> > >
> > > On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust
> > > <[email protected]> wrote:
> > > >
> > > > On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > > > > [You don't often get email from [email protected]. Learn why this
> > > > > is
> > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > >
> > > > > Hello Trond,
> > > > >
> > > > > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > > > > [You don't often get email from [email protected]. Learn why
> > > > > > > this
> > > > > > > is
> > > > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > Hopefully I'm sending this to the right place…
> > > > > > > We recently started to see the following stacktrace quite
> > > > > > > often
> > > > > > > on
> > > > > > > our
> > > > > > > VMs that are using NFS extensively (I think after upgrading
> > > > > > > to
> > > > > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > > > > 5.18.15):
> > > > > > >
> > > > > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > > > > seconds.
> > > > > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > > this
> > > > > > > message.
> > > > > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid:
> > > > > > > 2
> > > > > > > flags:0x00004000
> > > > > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > > > > Call Trace:
> > > > > > > <TASK>
> > > > > > > __schedule+0x38c/0x7d0
> > > > > > > schedule+0x41/0xb0
> > > > > > > io_schedule+0x12/0x40
> > > > > > > __folio_lock+0x110/0x260
> > > > > > > ? filemap_alloc_folio+0x90/0x90
> > > > > > > write_cache_pages+0x1e3/0x4d0
> > > > > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > > > > do_writepages+0xd2/0x1b0
> > > > > > > ? check_preempt_curr+0x47/0x70
> > > > > > > ? ttwu_do_wakeup+0x17/0x180
> > > > > > > __writeback_single_inode+0x41/0x360
> > > > > > > writeback_sb_inodes+0x1f0/0x460
> > > > > > > __writeback_inodes_wb+0x5f/0xd0
> > > > > > > wb_writeback+0x235/0x2d0
> > > > > > > wb_workfn+0x348/0x4a0
> > > > > > > ? put_prev_task_fair+0x1b/0x30
> > > > > > > ? pick_next_task+0x84/0x940
> > > > > > > ? __update_idle_core+0x1b/0xb0
> > > > > > > process_one_work+0x1c5/0x390
> > > > > > > worker_thread+0x30/0x360
> > > > > > > ? process_one_work+0x390/0x390
> > > > > > > kthread+0xd7/0x100
> > > > > > > ? kthread_complete_and_exit+0x20/0x20
> > > > > > > ret_from_fork+0x1f/0x30
> > > > > > > </TASK>
> > > > > > >
> > > > > > > I see that something very similar was fixed in btrfs
> > > > > > > (
> > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > > > > t/?h=linux-
> > > > > > > 5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > > > > but I could not find anything similar for NFS.
> > > > > > >
> > > > > > > Do you happen to know if this is already fixed? If so, would
> > > > > > > you
> > > > > > > mind
> > > > > > > sharing some commits? If not, could you help getting this
> > > > > > > addressed?
> > > > > > >
> > > > > >
> > > > > > The stack trace you show above isn't particularly helpful for
> > > > > > diagnosing what the problem is.
> > > > > >
> > > > > > All it is saying is that 'thread A' is waiting to take a page
> > > > > > lock
> > > > > > that
> > > > > > is being held by a different 'thread B'. Without information on
> > > > > > what
> > > > > > 'thread B' is doing, and why it isn't releasing the lock, there
> > > > > > is
> > > > > > nothing we can conclude.
> > > > >
> > > > > Do you have some hint how to debug this issue further (when it
> > > > > happens
> > > > > again)? Would `virsh dump` to get a memory dump and then some
> > > > > kind of
> > > > > "bt all" via crash help to get more information?
> > > > > Or something else?
> > > > >
> > > > > Thanks in advance!
> > > > > --
> > > > > Igor Raits
> > > >
> > > > Please try running the following two lines of 'bash' script as
> > > > root:
> > > >
> > > > (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:";
> > > > cat ${tt}; echo; done) >/tmp/nfs_threads.txt
> > > >
> > > > cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
> > > >
> > > > and then send us the output from the two files /tmp/nfs_threads.txt
> > > > and
> > > > /tmp/rpc_tasks.txt.
> > > >
> > > > The file nfs_threads.txt gives us a full set of stack traces from
> > > > all
> > > > processes that are currently in the NFS client code. So it should
> > > > contain both the stack trace from your 'thread A' above, and the
> > > > traces
> > > > from all candidates for the 'thread B' process that is causing the
> > > > blockage.
> > > > The file rpc_tasks.txt gives us the status of any RPC calls that
> > > > might
> > > > be outstanding and might help diagnose any issues with the TCP
> > > > connection.
> > > >
> > > > That should therefore give us a better starting point for root
> > > > causing
> > > > the problem.
> > >
> > > The rpc_tasks is empty but I got nfs_threads from the moment it is
> > > stuck (see attached file).
> > >
> > > It still happens with 5.19.3, 5.19.6.
> >
> > --
> > Trond Myklebust
> > Linux NFS client maintainer, Hammerspace
> > [email protected]
> >
> >



--
Igor Raits

2022-09-13 13:46:02

by Olga Kornievskaia

Subject: Re: Regression: deadlock in io_schedule / nfs_writepage_locked

On Tue, Sep 13, 2022 at 5:34 AM Igor Raits <[email protected]> wrote:
>
> Seems those 2 were backported to 5.18.6 which is around when we
> started to see the issues…
> I'll try to revert them and see if it helps.

I'm definitely not suggesting to revert them.

> > Is the server constantly returning LAYOUT_UNAVAILABLE?
>
> Not sure, any hint how to check it?

Could you try to keep a set of rotating network traces and/or nfs4
tracepoints and try to hit your condition?
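Something along these lines should be enough (just a sketch -- adjust
the interface, capture sizes and paths to your environment):

  # rotating packet capture of NFS traffic, ring of 50 files x ~100 MB
  tcpdump -i eth0 -s 0 -C 100 -W 50 -w /var/tmp/nfs-trace.pcap port 2049 &

  # record the nfs4 tracepoints (includes the layoutget/pnfs events)
  trace-cmd record -e nfs4 -o /var/tmp/nfs4-trace.dat &

A LAYOUTUNAVAILABLE reply should then be visible either in the LAYOUTGET
replies in the capture or in the nfs4 layout trace events (e.g.
nfs4_layoutget / pnfs_update_layout, if your kernel has them).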

> > And does this happen to be co-located with a volume move operation?
>
> We don't really touch the volumes during that time, although we use
> FlexVolumes so it could expand/shrink on Netapp's side.

Ok thanks. My experience was with an SFO event,

> On Fri, Sep 9, 2022 at 7:00 PM Olga Kornievskaia <[email protected]> wrote:
> >
> > On Fri, Sep 9, 2022 at 12:52 PM Trond Myklebust <[email protected]> wrote:
> > >
> > > On Fri, 2022-09-09 at 16:47 +0000, Trond Myklebust wrote:
> > > > This looks like it might be the root cause issue. It looks like
> > > > you're using pNFS:
> > > >
> > > > /proc/3278822/stack:
> > > > [<0>] pnfs_update_layout+0x603/0xed0 [nfsv4]
> > > > [<0>] fl_pnfs_update_layout.constprop.18+0x23/0x1e0
> > > > [nfs_layout_nfsv41_files]
> > > > [<0>] filelayout_pg_init_write+0x3a/0x70 [nfs_layout_nfsv41_files]
> > > > [<0>] __nfs_pageio_add_request+0x294/0x470 [nfs]
> > > > [<0>] nfs_pageio_add_request_mirror+0x2f/0x40 [nfs]
> > > > [<0>] nfs_pageio_add_request+0x200/0x2d0 [nfs]
> > > > [<0>] nfs_page_async_flush+0x120/0x310 [nfs]
> > > > [<0>] nfs_writepages_callback+0x5b/0xc0 [nfs]
> > > > [<0>] write_cache_pages+0x187/0x4d0
> > > > [<0>] nfs_writepages+0xe1/0x200 [nfs]
> > > > [<0>] do_writepages+0xd2/0x1b0
> > > > [<0>] __writeback_single_inode+0x41/0x360
> > > > [<0>] writeback_sb_inodes+0x1f0/0x460
> > > > [<0>] __writeback_inodes_wb+0x5f/0xd0
> > > > [<0>] wb_writeback+0x235/0x2d0
> > > > [<0>] wb_workfn+0x312/0x4a0
> > > > [<0>] process_one_work+0x1c5/0x390
> > > > [<0>] worker_thread+0x30/0x360
> > > > [<0>] kthread+0xd7/0x100
> > > > [<0>] ret_from_fork+0x1f/0x30
> > > >
> > > > What is the pNFS server you are running against? I see you're using
> > > > the files pNFS layout type, so is this a NetApp?
> >
> >
> > This reminds me of the problem that was supposed to be fixed by the
> > patches that went into 5.19-rc3?.
> > pNFS: Don't keep retrying if the server replied NFS4ERR_LAYOUTUNAVAILABLE
> > pNFS: Avoid a live lock condition in pnfs_update_layout()
> >
> > Igor,
> >
> > Is the server constantly returning LAYOUT_UNAVAILABLE? And does this
> > happen to be co-located with a volume move operation?
> >
> >
> > > >
> > >
> > > Sorry for the HTML spam... Resending with all that crap stripped out.
> > >
> > > > From: Igor Raits <[email protected]>
> > > > Sent: Friday, September 9, 2022 11:09
> > > > To: Trond Myklebust <[email protected]>
> > > > Cc: [email protected] <[email protected]>; [email protected]
> > > > <[email protected]>
> > > > Subject: Re: Regression: deadlock in io_schedule /
> > > > nfs_writepage_locked
> > > >
> > > > Hello Trond,
> > > >
> > > > On Mon, Aug 22, 2022 at 5:01 PM Trond Myklebust
> > > > <[email protected]> wrote:
> > > > >
> > > > > On Mon, 2022-08-22 at 16:43 +0200, Igor Raits wrote:
> > > > > > [You don't often get email from [email protected]. Learn why this
> > > > > > is
> > > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > > >
> > > > > > Hello Trond,
> > > > > >
> > > > > > On Mon, Aug 22, 2022 at 4:02 PM Trond Myklebust
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > > On Mon, 2022-08-22 at 10:16 +0200, Igor Raits wrote:
> > > > > > > > [You don't often get email from [email protected]. Learn why
> > > > > > > > this
> > > > > > > > is
> > > > > > > > important at https://aka.ms/LearnAboutSenderIdentification ]
> > > > > > > >
> > > > > > > > Hello everyone,
> > > > > > > >
> > > > > > > > Hopefully I'm sending this to the right place…
> > > > > > > > We recently started to see the following stacktrace quite
> > > > > > > > often
> > > > > > > > on
> > > > > > > > our
> > > > > > > > VMs that are using NFS extensively (I think after upgrading
> > > > > > > > to
> > > > > > > > 5.18.11+, but not sure when exactly. For sure it happens on
> > > > > > > > 5.18.15):
> > > > > > > >
> > > > > > > > INFO: task kworker/u36:10:377691 blocked for more than 122
> > > > > > > > seconds.
> > > > > > > > Tainted: G E 5.18.15-1.gdc.el8.x86_64 #1
> > > > > > > > "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
> > > > > > > > this
> > > > > > > > message.
> > > > > > > > task:kworker/u36:10 state:D stack: 0 pid:377691 ppid:
> > > > > > > > 2
> > > > > > > > flags:0x00004000
> > > > > > > > Workqueue: writeback wb_workfn (flush-0:308)
> > > > > > > > Call Trace:
> > > > > > > > <TASK>
> > > > > > > > __schedule+0x38c/0x7d0
> > > > > > > > schedule+0x41/0xb0
> > > > > > > > io_schedule+0x12/0x40
> > > > > > > > __folio_lock+0x110/0x260
> > > > > > > > ? filemap_alloc_folio+0x90/0x90
> > > > > > > > write_cache_pages+0x1e3/0x4d0
> > > > > > > > ? nfs_writepage_locked+0x1d0/0x1d0 [nfs]
> > > > > > > > nfs_writepages+0xe1/0x200 [nfs]
> > > > > > > > do_writepages+0xd2/0x1b0
> > > > > > > > ? check_preempt_curr+0x47/0x70
> > > > > > > > ? ttwu_do_wakeup+0x17/0x180
> > > > > > > > __writeback_single_inode+0x41/0x360
> > > > > > > > writeback_sb_inodes+0x1f0/0x460
> > > > > > > > __writeback_inodes_wb+0x5f/0xd0
> > > > > > > > wb_writeback+0x235/0x2d0
> > > > > > > > wb_workfn+0x348/0x4a0
> > > > > > > > ? put_prev_task_fair+0x1b/0x30
> > > > > > > > ? pick_next_task+0x84/0x940
> > > > > > > > ? __update_idle_core+0x1b/0xb0
> > > > > > > > process_one_work+0x1c5/0x390
> > > > > > > > worker_thread+0x30/0x360
> > > > > > > > ? process_one_work+0x390/0x390
> > > > > > > > kthread+0xd7/0x100
> > > > > > > > ? kthread_complete_and_exit+0x20/0x20
> > > > > > > > ret_from_fork+0x1f/0x30
> > > > > > > > </TASK>
> > > > > > > >
> > > > > > > > I see that something very similar was fixed in btrfs
> > > > > > > > (
> > > > > > > > https://git.kernel.org/pub/scm/linux/kernel/git/stable/linux.git/commi
> > > > > > > > t/?h=linux-
> > > > > > > > 5.18.y&id=9535ec371d741fa037e37eddc0a5b25ba82d0027)
> > > > > > > > but I could not find anything similar for NFS.
> > > > > > > >
> > > > > > > > Do you happen to know if this is already fixed? If so, would
> > > > > > > > you
> > > > > > > > mind
> > > > > > > > sharing some commits? If not, could you help getting this
> > > > > > > > addressed?
> > > > > > > >
> > > > > > >
> > > > > > > The stack trace you show above isn't particularly helpful for
> > > > > > > diagnosing what the problem is.
> > > > > > >
> > > > > > > All it is saying is that 'thread A' is waiting to take a page
> > > > > > > lock
> > > > > > > that
> > > > > > > is being held by a different 'thread B'. Without information on
> > > > > > > what
> > > > > > > 'thread B' is doing, and why it isn't releasing the lock, there
> > > > > > > is
> > > > > > > nothing we can conclude.
> > > > > >
> > > > > > Do you have some hint how to debug this issue further (when it
> > > > > > happens
> > > > > > again)? Would `virsh dump` to get a memory dump and then some
> > > > > > kind of
> > > > > > "bt all" via crash help to get more information?
> > > > > > Or something else?
> > > > > >
> > > > > > Thanks in advance!
> > > > > > --
> > > > > > Igor Raits
> > > > >
> > > > > Please try running the following two lines of 'bash' script as
> > > > > root:
> > > > >
> > > > > (for tt in $(grep -l 'nfs[^d]' /proc/*/stack); do echo "${tt}:";
> > > > > cat ${tt}; echo; done) >/tmp/nfs_threads.txt
> > > > >
> > > > > cat /sys/kernel/debug/sunrpc/rpc_clnt/*/tasks > /tmp/rpc_tasks.txt
> > > > >
> > > > > and then send us the output from the two files /tmp/nfs_threads.txt
> > > > > and
> > > > > /tmp/rpc_tasks.txt.
> > > > >
> > > > > The file nfs_threads.txt gives us a full set of stack traces from
> > > > > all
> > > > > processes that are currently in the NFS client code. So it should
> > > > > contain both the stack trace from your 'thread A' above, and the
> > > > > traces
> > > > > from all candidates for the 'thread B' process that is causing the
> > > > > blockage.
> > > > > The file rpc_tasks.txt gives us the status of any RPC calls that
> > > > > might
> > > > > be outstanding and might help diagnose any issues with the TCP
> > > > > connection.
> > > > >
> > > > > That should therefore give us a better starting point for root
> > > > > causing
> > > > > the problem.
> > > >
> > > > The rpc_tasks is empty but I got nfs_threads from the moment it is
> > > > stuck (see attached file).
> > > >
> > > > It still happens with 5.19.3, 5.19.6.
> > >
> > > --
> > > Trond Myklebust
> > > Linux NFS client maintainer, Hammerspace
> > > [email protected]
> > >
> > >
>
>
>
> --
> Igor Raits