2020-11-09 21:12:35

by Chris Friesen

Subject: looking for assistance with jbd2 (and other processes) hung trying to write to disk

Hi,

I'm running a 3.10.0-1127.rt56.1093 CentOS kernel. I realize you don't
support this particular kernel but I'm hoping for some general pointers.

I've got a system with four "housekeeping" CPUs, with rcu_nocbs and
nohz_full used to reduce system overhead on the "application" CPUs, and
four CPUs set as "isolcpus" to try to isolate them even further. I
have a crashdump vmcore file from a softdog expiry when the process that
pets the softdog hung trying to write to /dev/log after the unix socket
backlog had been reached.

I can see a "jbd2/nvme2n1p4-" process that appears to be hung for over 9
minutes waiting to commit a transaction. /dev/nvme2n1p4 corresponds to
the root filesystem. The "ps" and "bt" output from the crashdump are
below. syslog-ng is also blocked waiting on filesystem access, and
there are other tasks also blocked on disk, including a few jbd2 tasks
that are associated with device mapper.

Can anyone give some suggestions on how to track down what's causing the
delay here? I suspect there's a race condition somewhere similar to
what happened with https://access.redhat.com/solutions/3226391, although
that one was specific to device-mapper and the root filesystem here is
directly on the nvme device.

Any suggestions would be appreciated.

Additional info below:

crash> ps -m 930
[0 00:09:11.694] [UN] PID: 930 TASK: ffffa14b5f9032c0 CPU: 1
COMMAND: "jbd2/nvme2n1p4-"

crash> bt 930
PID: 930 TASK: ffffa14b5f9032c0 CPU: 1 COMMAND: "jbd2/nvme2n1p4-"
#0 [ffffa14b5ff0ba20] __schedule at ffffffffafe1b959
#1 [ffffa14b5ff0bab0] schedule at ffffffffafe1be80
#2 [ffffa14b5ff0bac8] schedule_timeout at ffffffffafe19d4c
#3 [ffffa14b5ff0bb70] io_schedule_timeout at ffffffffafe1ab6d
#4 [ffffa14b5ff0bba0] io_schedule at ffffffffafe1ac08
#5 [ffffa14b5ff0bbb0] bit_wait_io at ffffffffafe1a561
#6 [ffffa14b5ff0bbc8] __wait_on_bit at ffffffffafe1a087
#7 [ffffa14b5ff0bc08] out_of_line_wait_on_bit at ffffffffafe1a1f1
#8 [ffffa14b5ff0bc80] __wait_on_buffer at ffffffffaf85068a
#9 [ffffa14b5ff0bc90] jbd2_journal_commit_transaction at
ffffffffc0e543fc [jbd2]
#10 [ffffa14b5ff0be48] kjournald2 at ffffffffc0e5a6ad [jbd2]
#11 [ffffa14b5ff0bec8] kthread at ffffffffaf6ad781
#12 [ffffa14b5ff0bf50] ret_from_fork_nospec_begin at ffffffffafe1fe5d

Possibly of interest, syslog-ng is also blocked waiting on filesystem
access:

crash> bt 1912
PID: 1912 TASK: ffffa14b62dc2610 CPU: 1 COMMAND: "syslog-ng"
#0 [ffffa14b635b7980] __schedule at ffffffffafe1b959
#1 [ffffa14b635b7a10] schedule at ffffffffafe1be80
#2 [ffffa14b635b7a28] schedule_timeout at ffffffffafe19d4c
#3 [ffffa14b635b7ad0] io_schedule_timeout at ffffffffafe1ab6d
#4 [ffffa14b635b7b00] io_schedule at ffffffffafe1ac08
#5 [ffffa14b635b7b10] bit_wait_io at ffffffffafe1a561
#6 [ffffa14b635b7b28] __wait_on_bit at ffffffffafe1a087
#7 [ffffa14b635b7b68] out_of_line_wait_on_bit at ffffffffafe1a1f1
#8 [ffffa14b635b7be0] do_get_write_access at ffffffffc0e51e94 [jbd2]
#9 [ffffa14b635b7c80] jbd2_journal_get_write_access at
ffffffffc0e521b7 [jbd2]
#10 [ffffa14b635b7ca0] __ext4_journal_get_write_access at
ffffffffc0eb8e31 [ext4]
#11 [ffffa14b635b7cd0] ext4_reserve_inode_write at ffffffffc0e87fa0 [ext4]
#12 [ffffa14b635b7d00] ext4_mark_inode_dirty at ffffffffc0e8801e [ext4]
#13 [ffffa14b635b7d58] ext4_dirty_inode at ffffffffc0e8bc40 [ext4]
#14 [ffffa14b635b7d78] __mark_inode_dirty at ffffffffaf84855d
#15 [ffffa14b635b7da8] ext4_setattr at ffffffffc0e8b558 [ext4]
#16 [ffffa14b635b7e18] notify_change at ffffffffaf8363fc
#17 [ffffa14b635b7e60] chown_common at ffffffffaf8128ac
#18 [ffffa14b635b7f08] sys_fchown at ffffffffaf813fb7
#19 [ffffa14b635b7f50] tracesys at ffffffffafe202a8 (via system_call)

One of the hung jbd2 tasks associated with device mapper:

crash> bt 1489
PID: 1489 TASK: ffffa14b641f0000 CPU: 1 COMMAND: "jbd2/dm-0-8"
#0 [ffffa14b5fab7a20] __schedule at ffffffffafe1b959
#1 [ffffa14b5fab7ab0] schedule at ffffffffafe1be80
#2 [ffffa14b5fab7ac8] schedule_timeout at ffffffffafe19d4c
#3 [ffffa14b5fab7b70] io_schedule_timeout at ffffffffafe1ab6d
#4 [ffffa14b5fab7ba0] io_schedule at ffffffffafe1ac08
#5 [ffffa14b5fab7bb0] bit_wait_io at ffffffffafe1a561
#6 [ffffa14b5fab7bc8] __wait_on_bit at ffffffffafe1a087
#7 [ffffa14b5fab7c08] out_of_line_wait_on_bit at ffffffffafe1a1f1
#8 [ffffa14b5fab7c80] __wait_on_buffer at ffffffffaf85068a
#9 [ffffa14b5fab7c90] jbd2_journal_commit_transaction at
ffffffffc0e543fc [jbd2]
#10 [ffffa14b5fab7e48] kjournald2 at ffffffffc0e5a6ad [jbd2]
#11 [ffffa14b5fab7ec8] kthread at ffffffffaf6ad781
#12 [ffffa14b5fab7f50] ret_from_fork_nospec_begin at ffffffffafe1fe5d

Thanks,

Chris


2020-11-10 11:42:36

by Jan Kara

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

Hello!

On Mon 09-11-20 15:11:58, Chris Friesen wrote:
> I'm running a 3.10.0-1127.rt56.1093 CentOS kernel. I realize you don't
> support this particular kernel but I'm hoping for some general pointers.
>
> I've got a system with four "housekeeping" CPUs, with rcu_nocbs and
> nohz_full used to reduce system overhead on the "application" CPUs, and
> four CPUs set as "isolcpus" to try to isolate them even further. I have a
> crashdump vmcore file from a softdog expiry when the process that pets the
> softdog hung trying to write to /dev/log after the unix socket backlog had
> been reached.
>
> I can see a "jbd2/nvme2n1p4-" process that appears to be hung for over 9
> minutes waiting to commit a transaction. /dev/nvme2n1p4 corresponds to the
> root filesystem. The "ps" and "bt" output from the crashdump are below.
> syslog-ng is also blocked waiting on filesystem access, and there are other
> tasks also blocked on disk, including a few jbd2 tasks that are associated
> with device mapper.

Yeah, 9 minutes seems far too long.

> Can anyone give some suggestions on how to track down what's causing the
> delay here? I suspect there's a race condition somewhere similar to what
> happened with https://access.redhat.com/solutions/3226391, although that one
> was specific to device-mapper and the root filesystem here is directly on
> the nvme device.

Sadly I don't have access to RH portal to be able to check what that hang
was about...

> crash> ps -m 930
> [0 00:09:11.694] [UN] PID: 930 TASK: ffffa14b5f9032c0 CPU: 1 COMMAND:
> "jbd2/nvme2n1p4-"
>

Are the tasks below the only ones hanging in D state (UN state in crash)?
Because I can see processes are waiting for the locked buffer but it is
unclear who is holding the buffer lock...

> crash> bt 930
> PID: 930 TASK: ffffa14b5f9032c0 CPU: 1 COMMAND: "jbd2/nvme2n1p4-"
> #0 [ffffa14b5ff0ba20] __schedule at ffffffffafe1b959
> #1 [ffffa14b5ff0bab0] schedule at ffffffffafe1be80
> #2 [ffffa14b5ff0bac8] schedule_timeout at ffffffffafe19d4c
> #3 [ffffa14b5ff0bb70] io_schedule_timeout at ffffffffafe1ab6d
> #4 [ffffa14b5ff0bba0] io_schedule at ffffffffafe1ac08
> #5 [ffffa14b5ff0bbb0] bit_wait_io at ffffffffafe1a561
> #6 [ffffa14b5ff0bbc8] __wait_on_bit at ffffffffafe1a087
> #7 [ffffa14b5ff0bc08] out_of_line_wait_on_bit at ffffffffafe1a1f1
> #8 [ffffa14b5ff0bc80] __wait_on_buffer at ffffffffaf85068a
> #9 [ffffa14b5ff0bc90] jbd2_journal_commit_transaction at ffffffffc0e543fc
> [jbd2]
> #10 [ffffa14b5ff0be48] kjournald2 at ffffffffc0e5a6ad [jbd2]
> #11 [ffffa14b5ff0bec8] kthread at ffffffffaf6ad781
> #12 [ffffa14b5ff0bf50] ret_from_fork_nospec_begin at ffffffffafe1fe5d
>
> Possibly of interest, syslog-ng is also blocked waiting on filesystem
> access:
>
> crash> bt 1912
> PID: 1912 TASK: ffffa14b62dc2610 CPU: 1 COMMAND: "syslog-ng"
> #0 [ffffa14b635b7980] __schedule at ffffffffafe1b959
> #1 [ffffa14b635b7a10] schedule at ffffffffafe1be80
> #2 [ffffa14b635b7a28] schedule_timeout at ffffffffafe19d4c
> #3 [ffffa14b635b7ad0] io_schedule_timeout at ffffffffafe1ab6d
> #4 [ffffa14b635b7b00] io_schedule at ffffffffafe1ac08
> #5 [ffffa14b635b7b10] bit_wait_io at ffffffffafe1a561
> #6 [ffffa14b635b7b28] __wait_on_bit at ffffffffafe1a087
> #7 [ffffa14b635b7b68] out_of_line_wait_on_bit at ffffffffafe1a1f1
> #8 [ffffa14b635b7be0] do_get_write_access at ffffffffc0e51e94 [jbd2]
> #9 [ffffa14b635b7c80] jbd2_journal_get_write_access at ffffffffc0e521b7
> [jbd2]
> #10 [ffffa14b635b7ca0] __ext4_journal_get_write_access at ffffffffc0eb8e31
> [ext4]
> #11 [ffffa14b635b7cd0] ext4_reserve_inode_write at ffffffffc0e87fa0 [ext4]
> #12 [ffffa14b635b7d00] ext4_mark_inode_dirty at ffffffffc0e8801e [ext4]
> #13 [ffffa14b635b7d58] ext4_dirty_inode at ffffffffc0e8bc40 [ext4]
> #14 [ffffa14b635b7d78] __mark_inode_dirty at ffffffffaf84855d
> #15 [ffffa14b635b7da8] ext4_setattr at ffffffffc0e8b558 [ext4]
> #16 [ffffa14b635b7e18] notify_change at ffffffffaf8363fc
> #17 [ffffa14b635b7e60] chown_common at ffffffffaf8128ac
> #18 [ffffa14b635b7f08] sys_fchown at ffffffffaf813fb7
> #19 [ffffa14b635b7f50] tracesys at ffffffffafe202a8 (via system_call)
>
> One of the hung jbd2 tasks associated with device mapper:
>
> crash> bt 1489
> PID: 1489 TASK: ffffa14b641f0000 CPU: 1 COMMAND: "jbd2/dm-0-8"
> #0 [ffffa14b5fab7a20] __schedule at ffffffffafe1b959
> #1 [ffffa14b5fab7ab0] schedule at ffffffffafe1be80
> #2 [ffffa14b5fab7ac8] schedule_timeout at ffffffffafe19d4c
> #3 [ffffa14b5fab7b70] io_schedule_timeout at ffffffffafe1ab6d
> #4 [ffffa14b5fab7ba0] io_schedule at ffffffffafe1ac08
> #5 [ffffa14b5fab7bb0] bit_wait_io at ffffffffafe1a561
> #6 [ffffa14b5fab7bc8] __wait_on_bit at ffffffffafe1a087
> #7 [ffffa14b5fab7c08] out_of_line_wait_on_bit at ffffffffafe1a1f1
> #8 [ffffa14b5fab7c80] __wait_on_buffer at ffffffffaf85068a
> #9 [ffffa14b5fab7c90] jbd2_journal_commit_transaction at ffffffffc0e543fc
> [jbd2]
> #10 [ffffa14b5fab7e48] kjournald2 at ffffffffc0e5a6ad [jbd2]
> #11 [ffffa14b5fab7ec8] kthread at ffffffffaf6ad781
> #12 [ffffa14b5fab7f50] ret_from_fork_nospec_begin at ffffffffafe1fe5d
>
> Thanks,
>
> Chris
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-11-10 15:58:17

by Chris Friesen

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On 11/10/2020 5:42 AM, Jan Kara wrote:
> On Mon 09-11-20 15:11:58, Chris Friesen wrote:

>> Can anyone give some suggestions on how to track down what's causing the
>> delay here? I suspect there's a race condition somewhere similar to what
>> happened with https://access.redhat.com/solutions/3226391, although that one
>> was specific to device-mapper and the root filesystem here is directly on
>> the nvme device.
>
> Sadly I don't have access to RH portal to be able to check what that hang
> was about...

They had exactly the same stack trace (different addresses) with
"jbd2/dm-16-8" trying to commit a journal transaction. In their case it
was apparently due to two problems, "the RAID1 code leaking the r1bio",
and "dm-raid not handling a needed retry scenario". They fixed it by
backporting upstream commits. The kernel we're running should have
those fixes, and in our case we're operating directly on an nvme device.

>> crash> ps -m 930
>> [0 00:09:11.694] [UN] PID: 930 TASK: ffffa14b5f9032c0 CPU: 1 COMMAND:
>> "jbd2/nvme2n1p4-"
>>
>
> Are the tasks below the only ones hanging in D state (UN state in crash)?
> Because I can see processes are waiting for the locked buffer but it is
> unclear who is holding the buffer lock...

No, there are quite a few of them. I've included them below. I agree,
it's not clear who's holding the lock. Is there a way to find that out?

Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
buffer_head "b_state" field, right? I don't see any ownership field the
way we have for mutexes. Is there some way to find out who would have
locked the buffer?
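
Concretely, I'm assuming what I need to test is the equivalent of the
kernel's buffer_locked() helper, i.e. roughly the following (a sketch from
memory of include/linux/buffer_head.h, not the exact source):

/*
 * Sketch, paraphrased from memory of include/linux/buffer_head.h (not the
 * exact source): BH_Lock is bit 2 of b_state, and buffer_locked() is just
 * a test of that bit, so there is no owner to read out of the structure.
 */
#include <linux/buffer_head.h>

static inline int buffer_locked_sketch(const struct buffer_head *bh)
{
	return test_bit(BH_Lock, &bh->b_state);	/* i.e. b_state & (1UL << 2) */
}

So in crash I can read b_state for the buffer_head, but that only tells me
the bit is set, not who set it.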

Do you think it would help at all to enable CONFIG_JBD2_DEBUG?

Processes in "UN" state in crashdump:

crash> ps|grep UN
1 0 1 ffffa14b687d8000 UN 0.0 193616 6620 systemd
930 2 1 ffffa14b5f9032c0 UN 0.0 0 0
[jbd2/nvme2n1p4-]
1489 2 1 ffffa14b641f0000 UN 0.0 0 0
[jbd2/dm-0-8]
1494 2 1 ffffa14b641f2610 UN 0.0 0 0
[jbd2/dm-11-8]
1523 2 1 ffffa14b64182610 UN 0.0 0 0
[jbd2/dm-1-8]
1912 1 1 ffffa14b62dc2610 UN 0.0 117868 17568 syslog-ng
86293 1 1 ffffa14ae4650cb0 UN 0.1 4618100 116664 containerd
86314 1 1 ffffa14ae2639960 UN 0.1 4618100 116664 containerd
88019 1 1 ffffa14ae26ad8d0 UN 0.2 651196 210260 safe_timer
90539 1 1 ffffa13caca3bf70 UN 0.0 25868 2140 fsmond
94006 93595 1 ffffa14ae31fe580 UN 0.1 13843140 113604 etcd
95061 93508 1 ffffa14a913e8cb0 UN 0.1 721888 114652 log
96367 1 1 ffffa14af53f9960 UN 0.0 119404 19084 python
121292 1 1 ffffa14ae18932c0 UN 0.1 4618100 116664 containerd
122042 1 1 ffffa14a950a6580 UN 0.0 111680 9496
containerd-shim
126119 122328 23 ffffa14b3d76a610 UN 0.0 0 0 com.xcg
126171 122328 47 ffffa14a91571960 UN 0.0 0 0 com.xcg
126173 122328 23 ffffa14a91573f70 UN 0.0 0 0 com.xcg
126177 122328 23 ffffa14a91888000 UN 0.0 0 0 com.xcg
128049 124763 47 ffffa14a964e6580 UN 0.1 1817292 80388 confd
136938 136924 1 ffffa14b5bb7d8d0 UN 0.0 146256 25672 coredns
136972 136924 1 ffffa14a9aae2610 UN 0.0 146256 25672 coredns
136978 136924 1 ffffa14ae2238000 UN 0.0 146256 25672 coredns
143026 142739 1 ffffa14b035e0000 UN 0.0 0 0 cainjector
166456 165537 44 ffffa14af3cb8000 UN 0.0 325468 10736 nronmd.xcg
166712 165537 44 ffffa149a2fecc20 UN 0.0 200116 3728 vpms.xcg
166725 165537 44 ffffa14962fb6580 UN 0.1 2108336 58176 vrlcb.xcg
166860 165537 45 ffffa14afd22bf70 UN 0.0 848320 12180 gcci.xcg
166882 165537 45 ffffa14aff3c58d0 UN 0.0 693256 11624 ndc.xcg
167556 165537 44 ffffa14929a6cc20 UN 0.0 119604 2612 gcdm.xcg
170732 122328 23 ffffa1492987bf70 UN 0.0 616660 4348 com.xcg
170741 122328 46 ffffa1492987cc20 UN 0.0 0 0 com.xcg
170745 122328 23 ffffa1492987e580 UN 0.0 0 0 com.xcg
170750 122328 23 ffffa14924d4f230 UN 0.0 0 0 com.xcg
170774 122328 23 ffffa14924d4bf70 UN 0.0 0 0 com.xcg
189870 187717 46 ffffa14873591960 UN 0.1 881516 83840 filebeat
332649 136924 1 ffffa147efd49960 UN 0.0 146256 25672 coredns
1036113 3779184 23 ffffa13c9317bf70 UN 0.9 6703644 878848
pool-3-thread-1
1793349 2 1 ffffa14ae2402610 UN 0.0 0 0
[kworker/1:0]
1850718 166101 0 ffffa14807448cb0 UN 0.0 18724 6068 exe
1850727 1850722 0 ffffa147e18dd8d0 UN 0.0 18724 6068 exe
1850733 120305 1 ffffa147e18da610 UN 0.0 135924 6512 runc
1850792 128006 46 ffffa14ae1948cb0 UN 0.0 21716 1280 logrotate
1850914 1850911 1 ffffa147086dbf70 UN 0.0 18724 6068 exe
1851274 127909 46 ffffa14703661960 UN 0.0 53344 3232
redis-server
1851474 1850787 1 ffffa1470026cc20 UN 0.0 115704 1244 ceph
1853422 1853340 44 ffffa146dfdc1960 UN 0.0 12396 2312 sh
1854005 1 1 ffffa146d7d8f230 UN 0.0 116872 812 mkdir
1854955 2847282 1 ffffa146c5d18cb0 UN 0.0 18724 6068 exe
1856515 166108 1 ffffa146aa071960 UN 0.0 18724 6068 exe
1856602 84624 1 ffffa146aa073f70 UN 0.0 184416 1988 crond
1859661 1859658 1 ffffa14672090000 UN 0.0 116872 812 mkdir
2232051 165443 7 ffffa147e1ac0000 UN 0.0 0 0
eal-intr-thread


Thanks,
Chris

2020-11-10 19:48:32

by Theodore Ts'o

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On Tue, Nov 10, 2020 at 09:57:39AM -0600, Chris Friesen wrote:
> No, there are quite a few of them. I've included them below. I agree, it's
> not clear who's holding the lock. Is there a way to find that out?
>
> Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
> buffer_head "b_state" field, right? I don't see any ownership field the way
> we have for mutexes. Is there some way to find out who would have locked
> the buffer?

It's quite possible that the buffer was locked as part of doing I/O,
and we are just waiting for the I/O to complete. An example of this
is in journal_submit_commit_record(), where we lock the buffer using
lock_buffer(), and then call submit_bh() to submit the buffer for I/O.
When the I/O is completed, the buffer head will be unlocked, and we
can check the buffer_uptodate flag to see if the I/O completed
successfully. (See journal_wait_on_commit_record() for an example of
this.)
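
Roughly, the pattern looks like this (paraphrasing fs/jbd2/commit.c from
memory; not the exact code in your 3.10-based kernel):

/*
 * Paraphrase of the jbd2 commit-record submit/wait pattern, from memory of
 * fs/jbd2/commit.c.  Details (arguments, flags) differ between kernel
 * versions; the point is that the bh stays locked from submit_bh() until
 * the I/O completion handler unlocks it.
 */
static int submit_commit_record_sketch(struct buffer_head *bh)
{
	lock_buffer(bh);			/* bh is now locked ... */
	clear_buffer_dirty(bh);
	set_buffer_uptodate(bh);
	bh->b_end_io = journal_end_buffer_io_sync; /* ... until this runs at I/O completion */
	return submit_bh(WRITE_SYNC, bh);
}

static int wait_on_commit_record_sketch(struct buffer_head *bh)
{
	wait_on_buffer(bh);			/* sleeps in bit_wait_io, like your traces */
	if (unlikely(!buffer_uptodate(bh)))
		return -EIO;			/* I/O completed with an error */
	return 0;
}

So a buffer sitting locked is not by itself suspicious; the question is why
the I/O that was supposed to unlock it never completed.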

So the first thing I'd suggest doing is looking at the console output
or dmesg output from the crashdump to see if there are any clues in
terms of kernel messages from the device driver before things locked
up. This could be as simple as the device falling off the bus, in
which case there might be some kernel error messages from the block
layer or device driver that would give some insight.

Good luck,

- Ted

2020-11-10 20:15:09

by Chris Friesen

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On 11/10/2020 1:46 PM, Theodore Y. Ts'o wrote:
> On Tue, Nov 10, 2020 at 09:57:39AM -0600, Chris Friesen wrote:

>> Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
>> buffer_head "b_state" field, right? I don't see any ownership field the way
>> we have for mutexes. Is there some way to find out who would have locked
>> the buffer?
>
> It's quite possible that the buffer was locked as part of doing I/O,
> and we are just waiting for the I/O to complete. An example of this
> is in journal_submit_commit_record(), where we lock the buffer using
> lock_buffer(), and then call submit_bh() to submit the buffer for I/O.
> When the I/O is completed, the buffer head will be unlocked, and we
> can check the buffer_uptodate flag to see if the I/O completed
> successfully. (See journal_wait_on_commit_record() for an example of
> this.)

Running "ps -m 'jbd2'" in the crashdump shows jbd2/nvme2n1p4- in the
uninterruptible state, with a "last run" timestamp of over 9 minutes
before the crash. Same for a number of jbd2/dm* tasks. This seems like
a very long time to wait for I/O to complete, which is why I'm assuming
something's gone off the rails.

> So the first thing I'd suggest doing is looking at the console output
> or dmesg output from the crashdump to see if there are any clues in
> terms of kernel messages from the device driver before things locked
> up. This could be as simple as the device falling off the bus, in
> which case there might be some kernel error messages from the block
> layer or device driver that would give some insight.

The timeline looks like this (CPUs 0, 1, 24, 25 are the housekeeping CPUs):

The only device-related issue I see is this, just over 9 minutes before
the eventual panic. There are no crashdump dmesg logs for the couple of
hours before it.
[119982.636995] WARNING: CPU: 1 PID: 21 at net/sched/sch_generic.c:360
dev_watchdog+0x268/0x280
[119982.636997] NETDEV WATCHDOG: mh0 (iavf): transmit queue 3 timed out

Then I see rcu_sched self-detecting stalls:
[120024.146369] INFO: rcu_sched self-detected stall on CPU { 25}
(t=60000 jiffies g=10078853 c=10078852 q=250)
[120203.725976] INFO: rcu_sched self-detected stall on CPU { 25}
(t=240003 jiffies g=10078853 c=10078852 q=361)
[120383.305584] INFO: rcu_sched self-detected stall on CPU { 25}
(t=420006 jiffies g=10078853 c=10078852 q=401)

The actual panic is here:
[120536.886219] Kernel panic - not syncing: Software Watchdog Timer expired
[120536.886221] CPU: 1 PID: 21 Comm: ktimersoftd/1 Kdump: loaded
Tainted: G W O ------------ T
3.10.0-1127.rt56.1093.el7.tis.2.x86_64 #1


Chris

2020-11-11 15:58:24

by Jan Kara

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On Tue 10-11-20 09:57:39, Chris Friesen wrote:
> On 11/10/2020 5:42 AM, Jan Kara wrote:
> > On Mon 09-11-20 15:11:58, Chris Friesen wrote:
>
> > > Can anyone give some suggestions on how to track down what's causing the
> > > delay here? I suspect there's a race condition somewhere similar to what
> > > happened with https://access.redhat.com/solutions/3226391, although that one
> > > was specific to device-mapper and the root filesystem here is directly on
> > > the nvme device.
> >
> > Sadly I don't have access to RH portal to be able to check what that hang
> > was about...
>
> They had exactly the same stack trace (different addresses) with
> "jbd2/dm-16-8" trying to commit a journal transaction. In their case it was
> apparently due to two problems, "the RAID1 code leaking the r1bio", and
> "dm-raid not handling a needed retry scenario". They fixed it by
> backporting upstream commits. The kernel we're running should have those
> fixes, and in our case we're operating directly on an nvme device.

I see. Thanks for explanation.

> > > crash> ps -m 930
> > > [0 00:09:11.694] [UN] PID: 930 TASK: ffffa14b5f9032c0 CPU: 1 COMMAND:
> > > "jbd2/nvme2n1p4-"
> > >
> >
> > Are the tasks below the only ones hanging in D state (UN state in crash)?
> > Because I can see processes are waiting for the locked buffer but it is
> > unclear who is holding the buffer lock...
>
> No, there are quite a few of them. I've included them below. I agree, it's
> not clear who's holding the lock. Is there a way to find that out?
>
> Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
> buffer_head "b_state" field, right? I don't see any ownership field the way
> we have for mutexes. Is there some way to find out who would have locked
> the buffer?

The buffer lock is a bit lock, so there's no owner field. If you can
reproduce the problem at will and can use debug kernels, then it's easiest
to add code to lock_buffer() (and fields to struct buffer_head) to track
the lock owner and then see who locked the buffer. Without this, the only
way is to check the stack traces of all UN processes and see whether some
stack trace looks suspicious, like it could be holding the buffer locked
(e.g. recursing into memory allocation and reclaim while holding the
buffer lock, or something like that)...
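
Something like the following would be enough (just a sketch, under the
assumption that you build your own debug kernel; the b_lock_owner field is
made up for this purpose and does not exist upstream):

/*
 * Debug-only sketch, not upstream code: add to struct buffer_head in
 * include/linux/buffer_head.h something like
 *
 *	struct task_struct *b_lock_owner;	// debug only
 *
 * and record / clear it around the BH_Lock bit (function bodies are
 * paraphrased from memory of 3.10-era fs/buffer.c and
 * include/linux/buffer_head.h):
 */
static inline void lock_buffer(struct buffer_head *bh)
{
	might_sleep();
	if (!trylock_buffer(bh))
		__lock_buffer(bh);
	bh->b_lock_owner = current;		/* remember who took the lock */
}

void unlock_buffer(struct buffer_head *bh)
{
	bh->b_lock_owner = NULL;		/* clear before releasing BH_Lock */
	clear_bit_unlock(BH_Lock, &bh->b_state);
	smp_mb__after_clear_bit();
	wake_up_bit(&bh->b_state, BH_Lock);
}

/* Callers that take the lock through trylock_buffer() directly would need
 * the same treatment. */

Then in the vmcore, crash's 'struct buffer_head <address>' would show you
the task that holds the lock on the buffer everyone is waiting for.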

As Ted wrote, the buffer is indeed usually locked because the IO is
running, and that would be the expected situation with the jbd2 stacktrace
you posted. So it could also be that the IO got stuck somewhere in the
block layer or in the NVMe driver (frankly, AFAIR NVMe support was pretty
rudimentary in 3.10). To see whether that's the case, you need to find the
'bio' pointing to the buffer_head (through its bi_private field), and
possibly also the struct request for that bio, and see what state they are
in... Again, if you can run debug kernels, you can write code to simplify
this search for you...
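
For reference, when the buffer was submitted through the normal fs/buffer.c
path, the bio should be recognizable by its bi_end_io / bi_private fields.
Roughly (paraphrasing 3.10-era _submit_bh() from memory; the CentOS kernel
may differ in the details):

/*
 * Paraphrase of _submit_bh() from 3.10-era fs/buffer.c, from memory -- not
 * the exact code.  What matters for the search: the in-flight bio for a
 * buffer_head has bi_private pointing at the bh and bi_end_io set to
 * end_bio_bh_io_sync().
 */
static int _submit_bh_sketch(int rw, struct buffer_head *bh)
{
	struct bio *bio = bio_alloc(GFP_NOIO, 1);

	bio->bi_sector = bh->b_blocknr * (bh->b_size >> 9);
	bio->bi_bdev = bh->b_bdev;
	bio->bi_io_vec[0].bv_page = bh->b_page;
	bio->bi_io_vec[0].bv_len = bh->b_size;
	bio->bi_io_vec[0].bv_offset = bh_offset(bh);
	bio->bi_vcnt = 1;
	bio->bi_size = bh->b_size;

	bio->bi_end_io = end_bio_bh_io_sync;	/* completion handler, unlocks the bh */
	bio->bi_private = bh;			/* what to search for in the dump */

	submit_bio(rw, bio);
	return 0;
}

So in the dump you are looking for a bio whose bi_private equals the
address of the buffer_head the commit thread is waiting on; once you have
it, you can chase the struct request that carries it and see whether the
driver ever completed it.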

> Do you think it would help at all to enable CONFIG_JBD2_DEBUG?

I don't think so.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2020-11-11 16:27:42

by Chris Friesen

Subject: Re: looking for assistance with jbd2 (and other processes) hung trying to write to disk

On 11/11/2020 9:57 AM, Jan Kara wrote:
> On Tue 10-11-20 09:57:39, Chris Friesen wrote:
>> Just to be sure, I'm looking for whoever has the BH_Lock bit set on the
>> buffer_head "b_state" field, right? I don't see any ownership field the way
>> we have for mutexes. Is there some way to find out who would have locked
>> the buffer?
>
> The buffer lock is a bit lock, so there's no owner field. If you can
> reproduce the problem at will and can use debug kernels, then it's easiest
> to add code to lock_buffer() (and fields to struct buffer_head) to track
> the lock owner and then see who locked the buffer. Without this, the only
> way is to check the stack traces of all UN processes and see whether some
> stack trace looks suspicious, like it could be holding the buffer locked
> (e.g. recursing into memory allocation and reclaim while holding the
> buffer lock, or something like that)...

That's what I thought. :) Debug kernels are doable, but unfortunately
we can't (yet) reproduce the problem at will. Naturally it's only shown
up in a couple of customer sites so far and not in any test labs.

> As Ted wrote, the buffer is indeed usually locked because the IO is
> running, and that would be the expected situation with the jbd2 stacktrace
> you posted. So it could also be that the IO got stuck somewhere in the
> block layer or in the NVMe driver (frankly, AFAIR NVMe support was pretty
> rudimentary in 3.10). To see whether that's the case, you need to find the
> 'bio' pointing to the buffer_head (through its bi_private field), and
> possibly also the struct request for that bio, and see what state they are
> in... Again, if you can run debug kernels, you can write code to simplify
> this search for you...

Thanks, that's helpful.

Chris