2021-03-12 08:08:37

by Shashidhar Patil

Subject: jbd2 task hung in jbd2_journal_commit_transaction

Hi,
We are seeing a problem where many tasks are hung in 'D' state, and
the stack of the jbd2 thread shows it stuck in
jbd2_journal_commit_transaction().

The stack trace of the jbd2 task is:

[<0>] jbd2_journal_commit_transaction+0x26e/0x1870
[<0>] kjournald2+0xc8/0x250
[<0>] kthread+0x105/0x140
[<0>] ret_from_fork+0x35/0x40
[<0>] 0xffffffffffffffff

The symptoms look similar to
https://www.spinics.net/lists/linux-ext4/msg53502.html.

The system has zswap configured, with swap backed by a file on ext4.
There are oom kills recorded in the dmesg.

As per Theodore in the above link, the issue is caused by a leaked
atomic handle. But one of the backtraces collected (below) points to
a possible cyclic dependency during direct memory reclaim, as
pointed out here:
https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html

This issue is explained in the steps below:
1. sys_write() processing allocates an atomic handle for journaling; the
handle is allocated and journalling is started on it.
2. As part of write processing, the page cache allocation succeeded and the
write operation also started.
3. During the write iteration, jbd2_alloc() was called with (GFP_NOFS |
__GFP_NOFAIL), which implies __GFP_DIRECT_RECLAIM.

4. The zswap pool in RAM is full, so zswap tries to free up pool pages and
initiates writeback of a swap page to disk. In this case the
swap file is on the same partition to which the write() call is done.

5. Until the write request is processed, the atomic handle is not
released (journal_stop).

6. Elsewhere, the jbd2 layer hangs in the transaction commit since there
is an open atomic handle.

So this becomes a deadlock condition.

As per the backtrace and the code, this looks like a possibility. I have
more logs captured with /proc/sysrq-trigger which I can share as
required.
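
For reference, the hung-task stacks were gathered with the standard sysrq
and procfs interfaces; a minimal sketch (generic commands, not the exact
ones from our logs, and <pid> is a placeholder):

  # Dump the stacks of all uninterruptible (D-state) tasks into the kernel log
  echo w > /proc/sysrq-trigger
  dmesg | tail -n 200

  # Or inspect a single hung task directly
  cat /proc/<pid>/stack
  grep State /proc/<pid>/status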

Appreciate your help.

The backtrace which might cause the deadlock

4,1736499,1121675690224,-;Call Trace:
4,1736500,1121675690231,-; __schedule+0x3d6/0x8b0
4,1736501,1121675690237,-; schedule+0x36/0x80
4,1736502,1121675690245,-; io_schedule+0x16/0x40
4,1736503,1121675690253,-; wbt_wait+0x22f/0x380
4,1736504,1121675690261,-; ? trace_event_raw_event_wbt_timer+0x100/0x100
4,1736505,1121675690269,-; ? end_swap_bio_read+0xd0/0xd0
4,1736506,1121675690275,-; blk_mq_make_request+0x103/0x5b0
4,1736507,1121675690283,-; ? end_swap_bio_read+0xd0/0xd0
4,1736508,1121675690288,-; generic_make_request+0x122/0x2f0
4,1736509,1121675690295,-; submit_bio+0x73/0x140
4,1736510,1121675690302,-; ? submit_bio+0x73/0x140
4,1736511,1121675690308,-; ? get_swap_bio+0xcf/0x100
4,1736512,1121675690316,-; __swap_writepage+0x33f/0x3b0
4,1736513,1121675690322,-; ? lru_cache_add_file+0x37/0x40
4,1736514,1121675690329,-; ? lzo_decompress+0x38/0x70
4,1736515,1121675690336,-; zswap_writeback_entry+0x249/0x350
4,1736516,1121675690343,-; zbud_zpool_evict+0x31/0x40
4,1736517,1121675690349,-; zbud_reclaim_page+0x1e9/0x250
4,1736518,1121675690356,-; zbud_zpool_shrink+0x3b/0x60
4,1736519,1121675690362,-; zpool_shrink+0x1c/0x20
4,1736520,1121675690369,-; zswap_frontswap_store+0x274/0x530
4,1736521,1121675690376,-; __frontswap_store+0x78/0x100
4,1736522,1121675690382,-; swap_writepage+0x3f/0x80
4,1736523,1121675690390,-; pageout.isra.53+0x1e6/0x340
4,1736524,1121675690397,-; shrink_page_list+0x992/0xbe0
4,1736525,1121675690403,-; shrink_inactive_list+0x2af/0x5f0
4,1736526,1121675690409,-; ? _find_next_bit+0x40/0x70
4,1736527,1121675690416,-; shrink_node_memcg+0x36f/0x7f0
4,1736528,1121675690423,-; shrink_node+0xe1/0x310
4,1736529,1121675690429,-; ? shrink_node+0xe1/0x310
4,1736530,1121675690435,-; do_try_to_free_pages+0xee/0x360
4,1736531,1121675690439,-; try_to_free_pages+0xf1/0x1c0
4,1736532,1121675690442,-; __alloc_pages_slowpath+0x399/0xe90
4,1736533,1121675690446,-; __alloc_pages_nodemask+0x289/0x2d0
4,1736534,1121675690449,-; alloc_pages_current+0x6a/0xe0
4,1736535,1121675690452,-; __get_free_pages+0xe/0x30
4,1736536,1121675690455,-; jbd2_alloc+0x3a/0x60
4,1736537,1121675690458,-; do_get_write_access+0x182/0x3e0
4,1736538,1121675690461,-; jbd2_journal_get_write_access+0x51/0x80
4,1736539,1121675690464,-; __ext4_journal_get_write_access+0x3b/0x80
4,1736540,1121675690466,-; ext4_reserve_inode_write+0x95/0xc0
4,1736541,1121675690467,-; ? ext4_dirty_inode+0x48/0x70
4,1736542,1121675690469,-; ext4_mark_inode_dirty+0x53/0x1d0
4,1736543,1121675690470,-; ? __ext4_journal_start_sb+0x6d/0x120
4,1736544,1121675690473,-; ext4_dirty_inode+0x48/0x70
4,1736545,1121675690475,-; __mark_inode_dirty+0x184/0x3b0
4,1736546,1121675690479,-; generic_update_time+0x7b/0xd0
4,1736547,1121675690482,-; ? current_time+0x32/0x70
4,1736548,1121675690484,-; file_update_time+0xbe/0x110
4,1736549,1121675690487,-; __generic_file_write_iter+0x9d/0x1f0
4,1736550,1121675690490,-; ext4_file_write_iter+0xc4/0x3b0
4,1736551,1121675690492,-; ? sock_sendmsg+0x3e/0x50
4,1736552,1121675690495,-; new_sync_write+0xe5/0x140
4,1736553,1121675690498,-; __vfs_write+0x29/0x40
4,1736554,1121675690501,-; vfs_write+0xb8/0x1b0
4,1736555,1121675690503,-; ? syscall_trace_enter+0x1d6/0x2f0
4,1736556,1121675690505,-; SyS_write+0x5c/0xe0
4,1736557,1121675690508,-; do_syscall_64+0x73/0x130
4,1736558,1121675690509,-; entry_SYSCALL_64_after_hwframe+0x3d/0xa2


2021-03-12 12:52:45

by Theodore Ts'o

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

On Fri, Mar 12, 2021 at 01:33:26PM +0530, Shashidhar Patil wrote:
> Hi,
> We are seeing a problem where many tasks are hung in 'D' state, and
> the stack of the jbd2 thread shows it stuck in
> jbd2_journal_commit_transaction().
>
> The stack trace of the jbd2 task is:
>
> [<0>] jbd2_journal_commit_transaction+0x26e/0x1870
> [<0>] kjournald2+0xc8/0x250
> [<0>] kthread+0x105/0x140
> [<0>] ret_from_fork+0x35/0x40
> [<0>] 0xffffffffffffffff
>
> The symptoms look similar to
> https://www.spinics.net/lists/linux-ext4/msg53502.html.
>
> The system has zswap configured, with swap backed by a file on ext4.
> There are oom kills recorded in the dmesg.
>
> As per Theodore in the above link, the issue is caused by a leaked
> atomic handle. But one of the backtraces collected (below) points to
> a possible cyclic dependency during direct memory reclaim, as
> pointed out here:
> https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html
>
> This issue is explained in the steps below:
> 1. sys_write() processing allocates an atomic handle for journaling; the
> handle is allocated and journalling is started on it.
> 2. As part of write processing, the page cache allocation succeeded and the
> write operation also started.
> 3. During the write iteration, jbd2_alloc() was called with (GFP_NOFS |
> __GFP_NOFAIL), which implies __GFP_DIRECT_RECLAIM.
>
> 4. The zswap pool in RAM is full, so zswap tries to free up pool pages and
> initiates writeback of a swap page to disk. In this case the
> swap file is on the same partition to which the write() call is done.

From what I can tell zswap is using writepage(), and since the swap
file should be *completely* preallocated and initialized, we should
never be trying to start a handle from zswap. This should prevent the
deadlock from happening. If zswap is doing something which is causing
ext4 to start a handle when it tries to writeout a swap page, then
that would certainly be a problem. But that really shouldn't be the
case.
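
For context, a fully preallocated, initialized swap file of the kind
described above is typically created along these lines (a generic sketch;
the path and size are placeholders, not this system's actual configuration):

  # Write every block so the file has no holes and no unwritten extents
  dd if=/dev/zero of=/swapfile bs=1M count=4096
  chmod 600 /swapfile
  mkswap /swapfile
  swapon /swapfile   # swap directly on the file, no loop device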

>
> The backtrace which might cause the deadlock
>
> 4,1736499,1121675690224,-;Call Trace:
> 4,1736500,1121675690231,-; __schedule+0x3d6/0x8b0
> 4,1736501,1121675690237,-; schedule+0x36/0x80
> 4,1736502,1121675690245,-; io_schedule+0x16/0x40
> 4,1736503,1121675690253,-; wbt_wait+0x22f/0x380
> 4,1736504,1121675690261,-; ? trace_event_raw_event_wbt_timer+0x100/0x100
> 4,1736505,1121675690269,-; ? end_swap_bio_read+0xd0/0xd0
> 4,1736506,1121675690275,-; blk_mq_make_request+0x103/0x5b0

We can see that zswap has called __swap_writepage(), and then the I/O
has been submitted... and we're waiting for the I/O to complete ---
see the call to io_schedule(). So we're not actually deadlocking on
anything other than... the storage device not getting back to us
with the I/O completion notification. And that's not really a cyclic
deadlock, but probably a lost I/O completion interrupt, or the storage
device firmware panicking/crashing/show-stopping, or the device falling
off the SATA bus and then needing to reconnect, etc.

- Ted


> 4,1736507,1121675690283,-; ? end_swap_bio_read+0xd0/0xd0
> 4,1736508,1121675690288,-; generic_make_request+0x122/0x2f0
> 4,1736509,1121675690295,-; submit_bio+0x73/0x140
> 4,1736510,1121675690302,-; ? submit_bio+0x73/0x140
> 4,1736511,1121675690308,-; ? get_swap_bio+0xcf/0x100
> 4,1736512,1121675690316,-; __swap_writepage+0x33f/0x3b0
> 4,1736513,1121675690322,-; ? lru_cache_add_file+0x37/0x40
> 4,1736514,1121675690329,-; ? lzo_decompress+0x38/0x70
> 4,1736515,1121675690336,-; zswap_writeback_entry+0x249/0x350
> 4,1736516,1121675690343,-; zbud_zpool_evict+0x31/0x40
> 4,1736517,1121675690349,-; zbud_reclaim_page+0x1e9/0x250
> 4,1736518,1121675690356,-; zbud_zpool_shrink+0x3b/0x60
> 4,1736519,1121675690362,-; zpool_shrink+0x1c/0x20
> 4,1736520,1121675690369,-; zswap_frontswap_store+0x274/0x530
> 4,1736521,1121675690376,-; __frontswap_store+0x78/0x100
> 4,1736522,1121675690382,-; swap_writepage+0x3f/0x80
> 4,1736523,1121675690390,-; pageout.isra.53+0x1e6/0x340
> 4,1736524,1121675690397,-; shrink_page_list+0x992/0xbe0
> 4,1736525,1121675690403,-; shrink_inactive_list+0x2af/0x5f0
> 4,1736526,1121675690409,-; ? _find_next_bit+0x40/0x70
> 4,1736527,1121675690416,-; shrink_node_memcg+0x36f/0x7f0
> 4,1736528,1121675690423,-; shrink_node+0xe1/0x310
> 4,1736529,1121675690429,-; ? shrink_node+0xe1/0x310
> 4,1736530,1121675690435,-; do_try_to_free_pages+0xee/0x360
> 4,1736531,1121675690439,-; try_to_free_pages+0xf1/0x1c0
> 4,1736532,1121675690442,-; __alloc_pages_slowpath+0x399/0xe90
> 4,1736533,1121675690446,-; __alloc_pages_nodemask+0x289/0x2d0
> 4,1736534,1121675690449,-; alloc_pages_current+0x6a/0xe0
> 4,1736535,1121675690452,-; __get_free_pages+0xe/0x30
> 4,1736536,1121675690455,-; jbd2_alloc+0x3a/0x60
> 4,1736537,1121675690458,-; do_get_write_access+0x182/0x3e0
> 4,1736538,1121675690461,-; jbd2_journal_get_write_access+0x51/0x80
> 4,1736539,1121675690464,-; __ext4_journal_get_write_access+0x3b/0x80
> 4,1736540,1121675690466,-; ext4_reserve_inode_write+0x95/0xc0
> 4,1736541,1121675690467,-; ? ext4_dirty_inode+0x48/0x70
> 4,1736542,1121675690469,-; ext4_mark_inode_dirty+0x53/0x1d0
> 4,1736543,1121675690470,-; ? __ext4_journal_start_sb+0x6d/0x120
> 4,1736544,1121675690473,-; ext4_dirty_inode+0x48/0x70
> 4,1736545,1121675690475,-; __mark_inode_dirty+0x184/0x3b0
> 4,1736546,1121675690479,-; generic_update_time+0x7b/0xd0
> 4,1736547,1121675690482,-; ? current_time+0x32/0x70
> 4,1736548,1121675690484,-; file_update_time+0xbe/0x110
> 4,1736549,1121675690487,-; __generic_file_write_iter+0x9d/0x1f0
> 4,1736550,1121675690490,-; ext4_file_write_iter+0xc4/0x3b0
> 4,1736551,1121675690492,-; ? sock_sendmsg+0x3e/0x50
> 4,1736552,1121675690495,-; new_sync_write+0xe5/0x140
> 4,1736553,1121675690498,-; __vfs_write+0x29/0x40
> 4,1736554,1121675690501,-; vfs_write+0xb8/0x1b0
> 4,1736555,1121675690503,-; ? syscall_trace_enter+0x1d6/0x2f0
> 4,1736556,1121675690505,-; SyS_write+0x5c/0xe0
> 4,1736557,1121675690508,-; do_syscall_64+0x73/0x130
> 4,1736558,1121675690509,-; entry_SYSCALL_64_after_hwframe+0x3d/0xa2

2021-03-13 08:15:21

by Shashidhar Patil

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

Hi Theodore,
Thank you for the explanation and clarification about the real
problem. Some more queries below to understand the
problem better.

On Fri, Mar 12, 2021 at 6:21 PM Theodore Ts'o <[email protected]> wrote:
>
> On Fri, Mar 12, 2021 at 01:33:26PM +0530, Shashidhar Patil wrote:
> > Hi,
> > We are seeing a problem where many tasks are hung in 'D' state, and
> > the stack of the jbd2 thread shows it stuck in
> > jbd2_journal_commit_transaction().
> >
> > The stack trace of the jbd2 task is:
> >
> > [<0>] jbd2_journal_commit_transaction+0x26e/0x1870
> > [<0>] kjournald2+0xc8/0x250
> > [<0>] kthread+0x105/0x140
> > [<0>] ret_from_fork+0x35/0x40
> > [<0>] 0xffffffffffffffff
> >
> > The symptoms look similar to
> > https://www.spinics.net/lists/linux-ext4/msg53502.html.
> >
> > The system has zswap configured, with swap backed by a file on ext4.
> > There are oom kills recorded in the dmesg.
> >
> > As per Theodore in the above link, the issue is caused by a leaked
> > atomic handle. But one of the backtraces collected (below) points to
> > a possible cyclic dependency during direct memory reclaim, as
> > pointed out here:
> > https://www.kernel.org/doc/html/latest/core-api/gfp_mask-from-fs-io.html
> >
> > This issue is explained in the steps below:
> > 1. sys_write() processing allocates an atomic handle for journaling; the
> > handle is allocated and journalling is started on it.
> > 2. As part of write processing, the page cache allocation succeeded and the
> > write operation also started.
> > 3. During the write iteration, jbd2_alloc() was called with (GFP_NOFS |
> > __GFP_NOFAIL), which implies __GFP_DIRECT_RECLAIM.
> >
> > 4. The zswap pool in RAM is full, so zswap tries to free up pool pages and
> > initiates writeback of a swap page to disk. In this case the
> > swap file is on the same partition to which the write() call is done.
>
> From what I can tell zswap is using writepage(), and since the swap
> file should be *completely* preallocated and initialized, we should
> never be trying to start a handle from zswap. This should prevent the
> deadlock from happening. If zswap is doing something which is causing
> ext4 to start a handle when it tries to writeout a swap page, then
> that would certainly be a problem. But that really shouldn't be the
> case.

Yes. But the first sys_write() called by the application did
allocate a journal handle as required, and since
this specific request is now waiting for I/O to complete, the handle is
not closed. Elsewhere, in the jbd2 task, the transaction commit is
blocked since there are one or more open journalling handles. Is my
understanding correct?

>
> >
> > The backtrace which might cause the deadlock
> >
> > 4,1736499,1121675690224,-;Call Trace:
> > 4,1736500,1121675690231,-; __schedule+0x3d6/0x8b0
> > 4,1736501,1121675690237,-; schedule+0x36/0x80
> > 4,1736502,1121675690245,-; io_schedule+0x16/0x40
> > 4,1736503,1121675690253,-; wbt_wait+0x22f/0x380
> > 4,1736504,1121675690261,-; ? trace_event_raw_event_wbt_timer+0x100/0x100
> > 4,1736505,1121675690269,-; ? end_swap_bio_read+0xd0/0xd0
> > 4,1736506,1121675690275,-; blk_mq_make_request+0x103/0x5b0
>
> We can see that zswap has called __swap_writepage(), and then the I/O
> has been submitted... and we're waiting for the I/O to complete ---
> see the call to io_schedule(). So we're not actually deadlocking on

I missed this important part: the requests are submitted directly to the
generic bio layer, so clearly no journalling handles are allocated there.
I do see many such I/O requests in a "waiting" state, which
suggests the storage device is not responding.
For example:

4,1737846,1121675697013,-; schedule+0x36/0x80
4,1737847,1121675697015,-; io_schedule+0x16/0x40
4,1737848,1121675697016,-; blk_mq_get_tag+0x161/0x250
4,1737849,1121675697018,-; ? wait_woken+0x80/0x80
4,1737850,1121675697020,-; blk_mq_get_request+0xdc/0x3b0
4,1737851,1121675697021,-; blk_mq_make_request+0x128/0x5b0
4,1737852,1121675697023,-; generic_make_request+0x122/0x2f0
4,1737853,1121675697024,-; ? bio_alloc_bioset+0xd2/0x1e0
4,1737854,1121675697026,-; submit_bio+0x73/0x140
.....
So all those I/O requests are waiting for a response from the RAID
controller, is that right?

But the megaraid_sas driver (the system has an LSI MegaRAID controller)
in most cases handles such unresponsive behavior
by resetting the device. In this case the reset did not happen; maybe
there is some other bug in the megaraid driver.

Thanks
-Shashidhar

> anything other than... the storage device not getting back to us
> with the I/O completion notification. And that's not really a cyclic
> deadlock, but probably a lost I/O completion interrupt, or the storage
> device firmware panicking/crashing/show-stopping, or the device falling
> off the SATA bus and then needing to reconnect, etc.
>
> - Ted
>
>
> > 4,1736507,1121675690283,-; ? end_swap_bio_read+0xd0/0xd0
> > 4,1736508,1121675690288,-; generic_make_request+0x122/0x2f0
> > 4,1736509,1121675690295,-; submit_bio+0x73/0x140
> > 4,1736510,1121675690302,-; ? submit_bio+0x73/0x140
> > 4,1736511,1121675690308,-; ? get_swap_bio+0xcf/0x100
> > 4,1736512,1121675690316,-; __swap_writepage+0x33f/0x3b0
> > 4,1736513,1121675690322,-; ? lru_cache_add_file+0x37/0x40
> > 4,1736514,1121675690329,-; ? lzo_decompress+0x38/0x70
> > 4,1736515,1121675690336,-; zswap_writeback_entry+0x249/0x350
> > 4,1736516,1121675690343,-; zbud_zpool_evict+0x31/0x40
> > 4,1736517,1121675690349,-; zbud_reclaim_page+0x1e9/0x250
> > 4,1736518,1121675690356,-; zbud_zpool_shrink+0x3b/0x60
> > 4,1736519,1121675690362,-; zpool_shrink+0x1c/0x20
> > 4,1736520,1121675690369,-; zswap_frontswap_store+0x274/0x530
> > 4,1736521,1121675690376,-; __frontswap_store+0x78/0x100
> > 4,1736522,1121675690382,-; swap_writepage+0x3f/0x80
> > 4,1736523,1121675690390,-; pageout.isra.53+0x1e6/0x340
> > 4,1736524,1121675690397,-; shrink_page_list+0x992/0xbe0
> > 4,1736525,1121675690403,-; shrink_inactive_list+0x2af/0x5f0
> > 4,1736526,1121675690409,-; ? _find_next_bit+0x40/0x70
> > 4,1736527,1121675690416,-; shrink_node_memcg+0x36f/0x7f0
> > 4,1736528,1121675690423,-; shrink_node+0xe1/0x310
> > 4,1736529,1121675690429,-; ? shrink_node+0xe1/0x310
> > 4,1736530,1121675690435,-; do_try_to_free_pages+0xee/0x360
> > 4,1736531,1121675690439,-; try_to_free_pages+0xf1/0x1c0
> > 4,1736532,1121675690442,-; __alloc_pages_slowpath+0x399/0xe90
> > 4,1736533,1121675690446,-; __alloc_pages_nodemask+0x289/0x2d0
> > 4,1736534,1121675690449,-; alloc_pages_current+0x6a/0xe0
> > 4,1736535,1121675690452,-; __get_free_pages+0xe/0x30
> > 4,1736536,1121675690455,-; jbd2_alloc+0x3a/0x60
> > 4,1736537,1121675690458,-; do_get_write_access+0x182/0x3e0
> > 4,1736538,1121675690461,-; jbd2_journal_get_write_access+0x51/0x80
> > 4,1736539,1121675690464,-; __ext4_journal_get_write_access+0x3b/0x80
> > 4,1736540,1121675690466,-; ext4_reserve_inode_write+0x95/0xc0
> > 4,1736541,1121675690467,-; ? ext4_dirty_inode+0x48/0x70
> > 4,1736542,1121675690469,-; ext4_mark_inode_dirty+0x53/0x1d0
> > 4,1736543,1121675690470,-; ? __ext4_journal_start_sb+0x6d/0x120
> > 4,1736544,1121675690473,-; ext4_dirty_inode+0x48/0x70
> > 4,1736545,1121675690475,-; __mark_inode_dirty+0x184/0x3b0
> > 4,1736546,1121675690479,-; generic_update_time+0x7b/0xd0
> > 4,1736547,1121675690482,-; ? current_time+0x32/0x70
> > 4,1736548,1121675690484,-; file_update_time+0xbe/0x110
> > 4,1736549,1121675690487,-; __generic_file_write_iter+0x9d/0x1f0
> > 4,1736550,1121675690490,-; ext4_file_write_iter+0xc4/0x3b0
> > 4,1736551,1121675690492,-; ? sock_sendmsg+0x3e/0x50
> > 4,1736552,1121675690495,-; new_sync_write+0xe5/0x140
> > 4,1736553,1121675690498,-; __vfs_write+0x29/0x40
> > 4,1736554,1121675690501,-; vfs_write+0xb8/0x1b0
> > 4,1736555,1121675690503,-; ? syscall_trace_enter+0x1d6/0x2f0
> > 4,1736556,1121675690505,-; SyS_write+0x5c/0xe0
> > 4,1736557,1121675690508,-; do_syscall_64+0x73/0x130
> > 4,1736558,1121675690509,-; entry_SYSCALL_64_after_hwframe+0x3d/0xa2

2021-03-14 04:08:10

by Theodore Ts'o

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

On Sat, Mar 13, 2021 at 01:29:43PM +0530, Shashidhar Patil wrote:
> > From what I can tell zswap is using writepage(), and since the swap
> > file should be *completely* preallocated and initialized, we should
> > never be trying to start a handle from zswap. This should prevent the
> > deadlock from happening. If zswap is doing something which is causing
> > ext4 to start a handle when it tries to writeout a swap page, then
> > that would certainly be a problem. But that really shouldn't be the
> > case.
>
> Yes. But the first sys_write() called by the application did
> allocate a journal handle as required, and since
> this specific request is now waiting for I/O to complete, the handle is
> not closed. Elsewhere, in the jbd2 task, the transaction commit is
> blocked since there are one or more open journalling handles. Is my
> understanding correct?

Yes, that's correct. When we start a transaction commit, either
because the 5 second commit interval has been reached, or there isn't
enough room in the journal for a particular handle to start (when we
start a file system mutation, we estimate the worst case number of
blocks that might need to be modified, and hence require space in the
journal), we first (a) stop any new handles from being started, and
then (b) wait for all currently running handles to complete.

If one handle takes a lot longer to complete than all the others,
while we are waiting for that last handle to finish, the commit can
not make forward progress, and no other file system operation which
requires modifying metadata can proceed. As a result, we try to keep
the time between starting a handle and stopping a handle as short as
possible. For example, if possible, we will try to read a block that
might be needed by a mutation operation *before* we start the handle.
That's not always possible, but we try to do that whenever possible,
and there are various tracepoints and other debugging facilities so we
can see which types of file system mutations require holding handles
longest, so we can try to optimize them.
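
As an illustration of the tracepoints mentioned above, handle hold times
can be observed with standard ftrace, assuming the jbd2 events (e.g.
jbd2:jbd2_handle_stats) are present in the running kernel:

  cd /sys/kernel/debug/tracing
  grep jbd2 available_events                      # check which jbd2 tracepoints this kernel provides
  echo 1 > events/jbd2/jbd2_handle_stats/enable   # per-handle statistics, including how long it was held
  cat trace_pipe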

> 4,1737846,1121675697013,-; schedule+0x36/0x80
> 4,1737847,1121675697015,-; io_schedule+0x16/0x40
> 4,1737848,1121675697016,-; blk_mq_get_tag+0x161/0x250
> 4,1737849,1121675697018,-; ? wait_woken+0x80/0x80
> 4,1737850,1121675697020,-; blk_mq_get_request+0xdc/0x3b0
> 4,1737851,1121675697021,-; blk_mq_make_request+0x128/0x5b0
> 4,1737852,1121675697023,-; generic_make_request+0x122/0x2f0
> 4,1737853,1121675697024,-; ? bio_alloc_bioset+0xd2/0x1e0
> 4,1737854,1121675697026,-; submit_bio+0x73/0x140
> .....
> So all those I/O requests are waiting for a response from the RAID
> controller, is that right?
>
> But the megaraid_sas driver (the system has an LSI MegaRAID controller)
> in most cases handles such unresponsive behavior
> by resetting the device. In this case the reset did not happen; maybe
> there is some other bug in the megaraid driver.

Yes, it's not necessarily a problem with the storage device or the
host bus adapter; it could also be some kind of bug in the device
driver --- or even the block layer, although that's much, much less
likely (mostly because a lot of people would be complaining if that
were the case).

If you have access to a SCSI/SATA bus snooper which can be inserted in
between the storage device (HDD/SSD) and the LSI Megaraid, that might
be helpful in terms of trying to figure out what is going on. Failing
that, you'll probably find some way to add/use debugging
hooks/tracepoints in the driver.
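
Short of a bus snooper, the generic block-layer tracepoints can at least
show whether requests are handed to the driver and then never completed;
a sketch using standard ftrace events (nothing megaraid-specific):

  cd /sys/kernel/debug/tracing
  echo 1 > events/block/block_rq_issue/enable      # request issued to the driver/device
  echo 1 > events/block/block_rq_complete/enable   # completion reported back
  cat trace_pipe                                   # issues with no matching completion point at the device/driver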

Good luck,

- Ted

2021-03-17 16:20:23

by Shashidhar Patil

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

Hi Theodore,
Thank you for the details about the journalling layer and the
insight into the block device layer.
I think your "Good luck" may have clicked. The swap file in our case is
attached to a loop block device before enabling swap using swapon.
Since the loop driver processes its I/O requests by calling
vfs_iter_write(), the write requests re-enter the ext4
filesystem/journalling code.
Is that right? There seems to be a possibility of a cyclic dependency.

Thanks
-Shashidhar
On Sun, Mar 14, 2021 at 9:08 AM Theodore Ts'o <[email protected]> wrote:
>
> On Sat, Mar 13, 2021 at 01:29:43PM +0530, Shashidhar Patil wrote:
> > > From what I can tell zswap is using writepage(), and since the swap
> > > file should be *completely* preallocated and initialized, we should
> > > never be trying to start a handle from zswap. This should prevent the
> > > deadlock from happening. If zswap is doing something which is causing
> > > ext4 to start a handle when it tries to writeout a swap page, then
> > > that would certainly be a problem. But that really shouldn't be the
> > > case.
> >
> > Yes. But the first sys_write() called by the application did
> > allocate a journal handle as required, and since
> > this specific request is now waiting for I/O to complete, the handle is
> > not closed. Elsewhere, in the jbd2 task, the transaction commit is
> > blocked since there are one or more open journalling handles. Is my
> > understanding correct?
>
> Yes, that's correct. When we start a transaction commit, either
> because the 5 second commit interval has been reached, or there isn't
> enough room in the journal for a particular handle to start (when we
> start a file system mutation, we estimate the worst case number of
> blocks that might need to be modified, and hence require space in the
> journal), we first (a) stop any new handles from being started, and
> then (b) wait for all currently running handles to complete.
>
> If one handle takes a lot longer to complete than all the others,
> while we are waiting for that last handle to finish, the commit can
> not make forward progress, and no other file system operation which
> requires modifying metadata can proceed. As a result, we try to keep
> the time between starting a handle and stopping a handle as short as
> possible. For example, if possible, we will try to read a block that
> might be needed by a mutation operation *before* we start the handle.
> That's not always possible, but we try to do that whenever possible,
> and there are various tracepoints and other debugging facilities so we
> can see which types of file system mutations require holding handles
> longest, so we can try to optimize them.
>
> > 4,1737846,1121675697013,-; schedule+0x36/0x80
> > 4,1737847,1121675697015,-; io_schedule+0x16/0x40
> > 4,1737848,1121675697016,-; blk_mq_get_tag+0x161/0x250
> > 4,1737849,1121675697018,-; ? wait_woken+0x80/0x80
> > 4,1737850,1121675697020,-; blk_mq_get_request+0xdc/0x3b0
> > 4,1737851,1121675697021,-; blk_mq_make_request+0x128/0x5b0
> > 4,1737852,1121675697023,-; generic_make_request+0x122/0x2f0
> > 4,1737853,1121675697024,-; ? bio_alloc_bioset+0xd2/0x1e0
> > 4,1737854,1121675697026,-; submit_bio+0x73/0x140
> > .....
> > So all those I/O requests are waiting for a response from the RAID
> > controller, is that right?
> >
> > But the megaraid_sas driver (the system has an LSI MegaRAID controller)
> > in most cases handles such unresponsive behavior
> > by resetting the device. In this case the reset did not happen; maybe
> > there is some other bug in the megaraid driver.
>
> Yes, it's not necessarily a problem with the storage device or the
> host bus adapter; it could also be some kind of bug in the device
> driver --- or even the block layer, although that's much, much less
> likely (mostly because a lot of people would be complaining if that
> were the case).
>
> If you have access to a SCSI/SATA bus snooper which can be inserted in
> between the storage device (HDD/SSD) and the LSI Megaraid, that might
> be helpful in terms of trying to figure out what is going on. Failing
> that, you'll probably find some way to add/use debugging
> hooks/tracepoints in the driver.
>
> Good luck,
>
> - Ted

2021-03-17 17:25:27

by Theodore Ts'o

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

On Wed, Mar 17, 2021 at 08:30:56PM +0530, Shashidhar Patil wrote:
> Hi Theodore,
> Thank you for the details about the journalling layer and the
> insight into the block device layer.
> I think your "Good luck" may have clicked. The swap file in our case is
> attached to a loop block device before enabling swap using swapon.
> Since the loop driver processes its I/O requests by calling
> vfs_iter_write(), the write requests re-enter the ext4
> filesystem/journalling code.
> Is that right? There seems to be a possibility of a cyclic dependency.

If that hypothesis is correct, you should see an example of that in
one of your stack traces; do you? The loop device creates a struct file
where the file is opened using O_DIRECT. In the O_DIRECT code path,
assuming the file was fully allocated and initialized, it shouldn't
involve starting a journal handle.

That being said, why are you using a loop device for a swap device at
all? Using a swap file directly is going to be much more efficient,
and decrease the stack depth and CPU cycles needed to do a swap out if
nothing else. If you can reliably reproduce the problem, what happens
if you use a swap file directly and cut out the loop device as a swap
device? Does it make the problem go away?
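
For comparison, the two configurations contrasted above look roughly like
the following sketch (device and file names are placeholders, and
--direct-io requires a reasonably recent util-linux):

  # Loop-backed swap, asking losetup to open the backing file with O_DIRECT
  losetup --find --show --direct-io=on /swapfile   # prints e.g. /dev/loop0
  swapon /dev/loop0

  # Swap file used directly, no loop device in the path
  swapon /swapfile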

- Ted

2021-03-18 06:59:02

by Shashidhar Patil

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

Hi Theodore,
I forgot to include two important details, the stack trace of the
loop0 driver and the sar output, which clearly nail the problem.
The losetup shipped with Ubuntu 16.05 does not have O_DIRECT support, and we
were not aware that journalling is bypassed when
O_DIRECT is combined with a preallocated file.

Using the loop device we could track the swap load using sysstat, but
otherwise there was no other major requirement for it.
With the loop device I could reproduce the problem only twice, running
stress-ng 20 times for 15-second intervals.
The problem is highly random, so it may take a larger number of tries
in some cases.
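
A reproduction loop along those lines might look like the following sketch
(the stressor type and sizes are guesses; the exact stress-ng invocation is
not recorded here):

  # Hypothetical reproduction attempt: repeatedly push the system deep into swap
  for i in $(seq 1 20); do
      stress-ng --vm 8 --vm-bytes 90% --timeout 15s
  done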

With a direct swap file I tried many times and could not reproduce the issue.
The system continues to function fine after terminating the tests.
The loop device has now been removed and the swap file is activated directly
on the deployed systems, where the issue was seen every couple of
weeks. We are waiting a few weeks before drawing conclusions about the
problem/solution.

Logs below.

backtrace:

6,1725650,1121675639291,-;loop0 D 0 2090 2 0x80000080
4,1725651,1121675639293,-;Call Trace:
4,1725652,1121675639295,-; __schedule+0x3d6/0x8b0
4,1725653,1121675639296,-; schedule+0x36/0x80
4,1725654,1121675639298,-; wait_transaction_locked+0x8a/0xd0
4,1725655,1121675639300,-; ? wait_woken+0x80/0x80
4,1725656,1121675639302,-; add_transaction_credits+0x1cd/0x2b0
4,1725657,1121675639303,-; ? __wake_up_common_lock+0x8e/0xc0
4,1725658,1121675639305,-; start_this_handle+0x103/0x410
4,1725659,1121675639306,-; ? _cond_resched+0x1a/0x50
4,1725660,1121675639310,-; ? kmem_cache_alloc+0x115/0x1c0
4,1725661,1121675639311,-; jbd2__journal_start+0xdb/0x1f0
4,1725662,1121675639314,-; ? ext4_dirty_inode+0x32/0x70
4,1725663,1121675639317,-; __ext4_journal_start_sb+0x6d/0x120
4,1725664,1121675639318,-; ext4_dirty_inode+0x32/0x70
4,1725665,1121675639322,-; __mark_inode_dirty+0x184/0x3b0
4,1725666,1121675639325,-; generic_update_time+0x7b/0xd0
4,1725667,1121675639326,-; ? current_time+0x32/0x70
4,1725668,1121675639328,-; file_update_time+0xbe/0x110
4,1725669,1121675639330,-; __generic_file_write_iter+0x9d/0x1f0
4,1725670,1121675639331,-; ? kmem_cache_free+0x1d1/0x1e0
4,1725671,1121675639333,-; ext4_file_write_iter+0xc4/0x3b0
4,1725672,1121675639336,-; do_iter_readv_writev+0x111/0x180
4,1725673,1121675639337,-; do_iter_write+0x87/0x1a0
4,1725674,1121675639339,-; vfs_iter_write+0x19/0x30
4,1725675,1121675639343,-; lo_write_bvec+0x69/0x110
4,1725676,1121675639344,-; loop_queue_work+0x8ff/0xa60
4,1725677,1121675639346,-; ? __switch_to_asm+0x35/0x70
4,1725678,1121675639347,-; ? __switch_to_asm+0x35/0x70
4,1725679,1121675639348,-; ? __schedule+0x3de/0x8b0
4,1725680,1121675639350,-; kthread_worker_fn+0x85/0x1f0
4,1725681,1121675639351,-; loop_kthread_worker_fn+0x1e/0x20
4,1725682,1121675639352,-; kthread+0x105/0x140
4,1725683,1121675639353,-; ? loop_get_status64+0x90/0x90
4,1725684,1121675639354,-; ? kthread_bind+0x40/0x40
4,1725685,1121675639356,-; ret_from_fork+0x35/0x40

sar output:

root@maglev-master-190:/data/tmp/customers/weber/var/log/sysstat# sar -d -f sa10 | grep dev7

Linux 4.15.0-117-generic (maglev-master-1)  03/10/21  _x86_64_  (88 CPU)

00:00:01     DEV      tps    rkB/s    wkB/s  areq-sz   aqu-sz    await    svctm    %util
00:05:01  dev7-0     1.74     0.92     6.02     4.00     0.29   184.17    28.01     4.86
00:15:01  dev7-0     4.77     3.35    15.74     4.00     0.53   119.11    14.67     7.00
00:25:01  dev7-0     0.48     1.54     0.37     4.00     0.00     1.31     0.20     0.01
00:35:01  dev7-0     0.48     1.03     0.88     4.00     0.00     4.80     0.25     0.01
00:45:01  dev7-0     0.56     1.43     0.82     4.00     0.00     1.13     0.31     0.02
00:55:01  dev7-0     0.88     2.39     1.13     4.00     0.03    34.22    11.67     1.03
01:05:01  dev7-0     0.88     1.47     2.03     4.00     0.06    75.12    18.69     1.64
01:15:01  dev7-0     1.45     0.08     5.71     4.00    94.30   387.62   605.50    87.62
01:25:01  dev7-0     0.00     0.00     0.00     0.00   128.00     0.00     0.00   100.00
01:35:01  dev7-0     0.00     0.00     0.00     0.00   128.00     0.00     0.00   100.00

On Wed, Mar 17, 2021 at 10:36 PM Theodore Ts'o <[email protected]> wrote:
>
> On Wed, Mar 17, 2021 at 08:30:56PM +0530, Shashidhar Patil wrote:
> > Hi Theodore,
> >          Thank you for the details about the journalling layer and the
> > insight into the block device layer.
> > I think your "Good luck" may have clicked. The swap file in our case is
> > attached to a loop block device before enabling swap using swapon.
> > Since the loop driver processes its I/O requests by calling
> > vfs_iter_write(), the write requests re-enter the ext4
> > filesystem/journalling code.
> > Is that right? There seems to be a possibility of a cyclic dependency.
>
> If that hypothesis is correct, you should see an example of that in
> one of your stack traces; do you? The loop device creates a struct file
> where the file is opened using O_DIRECT. In the O_DIRECT code path,
> assuming the file was fully allocated and initialized, it shouldn't
> involve starting a journal handle.
>
> That being said, why are you using a loop device for a swap device at
> all? Using a swap file directly is going to be much more efficient,
> and decrease the stack depth and CPU cycles needed to do a swap out if
> nothing else. If you can reliably reproduce the problem, what happens
> if you use a swap file directly and cut out the loop device as a swap
> device? Does it make the problem go away?
>
> - Ted

2021-03-19 15:18:51

by Theodore Ts'o

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

On Thu, Mar 18, 2021 at 12:27:44PM +0530, Shashidhar Patil wrote:
> Hi Theodore,
> I forgot to include two important details, the stack trace of the
> loop0 driver and the sar output, which clearly nail the problem.
> The losetup shipped with Ubuntu 16.05 does not have O_DIRECT support, and we
> were not aware that journalling is bypassed when
> O_DIRECT is combined with a preallocated file.

Which version of the kernel are you using? The use of O_DIRECT for
loop devices requires kernel and losetup support. (Also note that
upstream developers generally don't pay attention to --- or
support --- distribution kernels unless they work for the company to
which you are paying $$$ for support, and Ubuntu 16.05 isn't even a
Long-Term Support distribution.)

My suggestion is to see if you can replicate the problem on a modern
userspace with the latest kernel. And if not, then that's an object
lesson about why using an antediluvian is not a great life choice. :-)

Cheers,

- Ted

2021-03-22 06:36:39

by Shashidhar Patil

Subject: Re: jbd2 task hung in jbd2_journal_commit_transaction

Hi Theodore

On Fri, Mar 19, 2021 at 8:46 PM Theodore Ts'o <[email protected]> wrote:
>
> On Thu, Mar 18, 2021 at 12:27:44PM +0530, Shashidhar Patil wrote:
> > Hi Theodore,
> > I forgot to include two important details , the stack trace of
> > loop0 driver and sar output, which clearly nail the problem.
> > The losetup with Ubuntu16.05 does not have O_DIRECT support and we
> > were not aware of bypassing of journalling if
> > O_DIRECT combined with preallocated file scenario.
>
> Which version of the kernel are you using? The use of O_DIRECT for

I am using the *antediluvian* Ubuntu 16.04 LTS, as below:

$ uname -a
Linux ubuntu 4.15.0-117-generic #118~16.04.1-Ubuntu SMP Sat Sep 5
23:35:06 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux

> loop devices requires kernel and losetup support. (Also note that

As per the attached backtrace of the loop0 driver, it does look like a
problem, right?

> upstream developers generally don't pay attention to --- or
> support --- distribution kernels unless they work for the company to
> which you are paying $$$ for support, and Ubuntu 16.05 isn't even a

$$$ again. No respite from $$ urge. $ Not good :-)

$ $$$
-bash: 31163$: command not found
:-)
> Long-Term Support distribution.)
>
> My suggestion is to see if you can replicate the problem on a modern
> userspace with the latest kernel. And if not, then that's an object

I will try to recreate the problem with the latest upstream kernel used by
upstream developers.

> lesson about why using an antediluvian is not a great life choice. :-)
>
I agree with "antediluvian is not a great life choice", but I can't
comment on whether "antediluvian is not a great choice in corporates".

Thank you
-Shashidhar
> Cheers,
>
> - Ted