On 10/17/2014 12:55 PM, Austin Schuh wrote:
> Use the 121 patch. This sounds very similar to the issue that I helped
> debug with XFS. There ended up being a deadlock due to a bug in the
> kernel work queues. You can search the RT archives for more info.
I can confirm that the problem still shows up with the rt121 patch. (And
also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
We added some instrumentation and it looks like we've tracked down the problem.
Figuring out how to fix it is proving to be tricky.
Basically it looks like we have a circular dependency involving the
inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list. It
goes something like this:
jbd2_journal_commit_transaction:
1) set page for writeback (set PG_writeback bit)
2) put jbd2 journal head on BJ_Shadow list
3) sleep on PG_writeback bit waiting for page writeback complete
ext4_da_writepages:
1) ext4_map_blocks() acquires inode->i_data_sem for writing
2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
the BJ_Shadow list
At this point the flush code can't run because it can't acquire
inode->i_data_sem for reading, so the page will never get written out.
Deadlock.
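To make the shape of the cycle easier to see, here is a userspace analogue
using pthreads. This is purely illustrative and obviously not kernel code;
the names are only borrowed to mark which thread plays which role, and the
sleeps just force the ordering we see in the traces:

/*
 * Userspace analogue of the wait cycle above.  Run it (cc -pthread) and
 * all three threads should wedge the same way the kernel tasks do.
 */
#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_rwlock_t i_data_sem = PTHREAD_RWLOCK_INITIALIZER;
static pthread_mutex_t m = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cv = PTHREAD_COND_INITIALIZER;
static bool pg_writeback, jh_on_bj_shadow;

static void *commit_thread(void *arg)   /* jbd2_journal_commit_transaction */
{
    (void)arg;
    pthread_mutex_lock(&m);
    pg_writeback = true;                /* 1) set PG_writeback             */
    jh_on_bj_shadow = true;             /* 2) file the jh on BJ_Shadow     */
    while (pg_writeback)                /* 3) wait for writeback complete  */
        pthread_cond_wait(&cv, &m);
    jh_on_bj_shadow = false;            /* never reached: would unfile jh  */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
    return NULL;
}

static void *writer_thread(void *arg)   /* ext4_da_writepages path         */
{
    (void)arg;
    sleep(1);                           /* let the commit get there first  */
    pthread_rwlock_wrlock(&i_data_sem); /* 1) i_data_sem for writing       */
    pthread_mutex_lock(&m);
    while (jh_on_bj_shadow)             /* 2) wait for jh off BJ_Shadow    */
        pthread_cond_wait(&cv, &m);
    pthread_mutex_unlock(&m);
    pthread_rwlock_unlock(&i_data_sem);
    return NULL;
}

static void *flush_thread(void *arg)    /* flusher finishing the writeback */
{
    (void)arg;
    sleep(2);                           /* let the writer take the sem     */
    pthread_rwlock_rdlock(&i_data_sem); /* blocks forever: writer holds it */
    pthread_mutex_lock(&m);
    pg_writeback = false;               /* would have unblocked the commit */
    pthread_cond_broadcast(&cv);
    pthread_mutex_unlock(&m);
    pthread_rwlock_unlock(&i_data_sem);
    return NULL;
}

int main(void)
{
    pthread_t t[3];
    pthread_create(&t[0], NULL, commit_thread, NULL);
    pthread_create(&t[1], NULL, writer_thread, NULL);
    pthread_create(&t[2], NULL, flush_thread, NULL);
    for (int i = 0; i < 3; i++)
        pthread_join(t[i], NULL);       /* never returns: three-way deadlock */
    puts("no deadlock");
    return 0;
}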
The following is a more detailed timeline with information from added trace
events:
nfsd-2012 [003] ....1.. 8612.903541: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903546: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903559: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903565: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903611: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903616: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960 [004] ....1.. 8612.903628: jbd2_submit_inode_data: dev 147,3 ino 2097160
<...>-5960 [004] ....111 8612.903651: jbd2_list_add_bjshadow: adding jh ffff880415350000 to transaction ffff880415391180 BJ_Shadow list
<...>-5960 [004] ....111 8612.903653: jbd2_list_add_bjshadow: adding jh ffff8803eb08dbd0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960 [004] ....111 8612.903655: jbd2_list_add_bjshadow: adding jh ffff8803eb08d150 to transaction ffff880415391180 BJ_Shadow list
<...>-5960 [004] ....111 8612.903656: jbd2_list_add_bjshadow: adding jh ffff8803eb08d0e0 to transaction ffff880415391180 BJ_Shadow list
<...>-5960 [004] ....111 8612.903657: jbd2_list_add_bjshadow: adding jh ffff88031c9449a0 to transaction ffff880415391180 BJ_Shadow list
nfsd-2012 [003] ....1.. 8612.903658: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-5960 [004] ....111 8612.903658: jbd2_list_add_bjshadow: adding jh ffff88031c944310 to transaction ffff880415391180 BJ_Shadow list
nfsd-2012 [003] ....1.. 8612.903665: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-5960 [004] ....1.. 8612.903696: jbd2_finish_inode_data: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903706: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903714: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903802: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903814: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903960: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.903983: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904311: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904318: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904331: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904337: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904399: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904408: ext4_map_blocks_up_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904678: ext4_map_blocks_down_write: dev 147,3 ino 2097161
nfsd-2012 [003] ....1.. 8612.904772: ext4_map_blocks_up_write: dev 147,3 ino 2097161
<...>-2015 [007] ....1.. 8612.934515: ext4_map_blocks_down_write: dev 147,3 ino 2097161
<...>-2015 [007] ....1.. 8612.934525: jbd2_list_sleep_bjshadow: waiting for jh ffff8803eb08dbd0 from transaction ffff880415391180 to be removed from BJ_Shadow list
Timeline:
pid 5960 is [jbd2/drbd3-8]
pid 2015 is [nfsd]
pid 2012 is [nfsd]
pid 5960:
8612.903628: jbd2_submit_inode_data for inode 2097160.
This is right before calling journal_submit_inode_data_buffers(), which
ends up calling set_page_writeback().
8612.903653: Add jh ffff8803eb08dbd0 to BJ_Shadow list.
This is in jbd2_journal_write_metadata_buffer() right before calling
__jbd2_journal_file_buffer().
8612.903696: This is in journal_finish_inode_data_buffers(), right before calling
filemap_fdatawait() which ends up calling wait_on_page_bit(page,
PG_writeback)
<we see no more logs for pid 5960 after this>
pid 2015:
8612.934515: takes write lock on inode->i_data_sem for inode 2097161
8612.934525: goes to sleep waiting for jh ffff8803eb08dbd0 to be removed from
BJ_Shadow list
<we see no more logs for pid 2015 after this>
pid 2012:
8617.963896: hits 5-sec retry limit and stops the trace. This means it blocked
trying to get a read lock on inode->i_data_sem for inode 2097161
at time 8612.963.
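For reference, the ext4_map_blocks_* and jbd2_list_* events in the trace
above come from our local instrumentation, not from mainline. A
reconstruction of the jbd2 one (a sketch of the idea, not the exact patch we
carry) would go in include/trace/events/jbd2.h and be called right before
__jbd2_journal_file_buffer(jh, transaction, BJ_Shadow):

/* Local instrumentation, reconstructed -- not a mainline trace event. */
TRACE_EVENT(jbd2_list_add_bjshadow,

    TP_PROTO(struct journal_head *jh, transaction_t *transaction),

    TP_ARGS(jh, transaction),

    TP_STRUCT__entry(
        __field(void *, jh)
        __field(void *, transaction)
    ),

    TP_fast_assign(
        __entry->jh = jh;
        __entry->transaction = transaction;
    ),

    TP_printk("adding jh %p to transaction %p BJ_Shadow list",
              __entry->jh, __entry->transaction)
);

jbd2_list_sleep_bjshadow is analogous and fires in do_get_write_access()
just before it blocks, and the ext4_map_blocks_down_write/up_write pair
brackets the i_data_sem acquisition in ext4_map_blocks().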
Chris
On Thu, 23 Oct 2014, Chris Friesen wrote:
> On 10/17/2014 12:55 PM, Austin Schuh wrote:
> > Use the 121 patch. This sounds very similar to the issue that I helped
> > debug with XFS. There ended up being a deadlock due to a bug in the
> > kernel work queues. You can search the RT archives for more info.
>
> I can confirm that the problem still shows up with the rt121 patch. (And
> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
> We added some instrumentation and it looks like we've tracked down the problem.
> Figuring out how to fix it is proving to be tricky.
>
> Basically it looks like we have a circular dependency involving the
> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list. It
> goes something like this:
>
> jbd2_journal_commit_transaction:
> 1) set page for writeback (set PG_writeback bit)
> 2) put jbd2 journal head on BJ_Shadow list
> 3) sleep on PG_writeback bit waiting for page writeback complete
>
> ext4_da_writepages:
> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
> the BJ_Shadow list
>
> At this point the flush code can't run because it can't acquire
> inode->i_data_sem for reading, so the page will never get written out.
> Deadlock.
Sorry, I really cannot map that sparse description to any code
flow. Proper callchains for the involved parts might help to actually
understand what you are looking for.
Thanks,
tglx
On 10/26/2014 08:25 AM, Thomas Gleixner wrote:
> On Thu, 23 Oct 2014, Chris Friesen wrote:
>> On 10/17/2014 12:55 PM, Austin Schuh wrote:
>>> Use the 121 patch. This sounds very similar to the issue that I helped
>>> debug with XFS. There ended up being a deadlock due to a bug in the
>>> kernel work queues. You can search the RT archives for more info.
>>
>> I can confirm that the problem still shows up with the rt121 patch. (And
>> also with Paul Gortmaker's proposed 3.4.103-rt127 patch.)
>
>> We added some instrumentation and it looks like we've tracked down the problem.
>> Figuring out how to fix it is proving to be tricky.
>>
>> Basically it looks like we have a circular dependency involving the
>> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list. It
>> goes something like this:
>>
>> jbd2_journal_commit_transaction:
>> 1) set page for writeback (set PG_writeback bit)
>> 2) put jbd2 journal head on BJ_Shadow list
>> 3) sleep on PG_writeback bit waiting for page writeback complete
>>
>> ext4_da_writepages:
>> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
>> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
>> the BJ_Shadow list
>>
>> At this point the flush code can't run because it can't acquire
>> inode->i_data_sem for reading, so the page will never get written out.
>> Deadlock.
>
> Sorry, I really cannot map that sparse description to any code
> flow. Proper callchains for the involved parts might help to actually
> understand what you are looking for.
There are details (stack traces, etc.) in the first message in the thread:
http://www.spinics.net/lists/linux-rt-users/msg12261.html
Originally we had thought that nfsd might have been implicated somehow,
but it seems like it was just a trigger (possibly by increasing the rate
of sync I/O).
In the interest of full disclosure I should point out that we're using a
modified kernel so there is a chance that we have introduced the problem
ourselves. That said, we have not made significant changes to either
ext4 or jbd2. (Just a couple of minor cherry-picked bugfixes.)
The relevant code paths are:
Journal commit. The important thing here is that we set the
PG_writeback on a page, put the jbd2 journal head on BJ_Shadow list,
then sleep waiting for page writeback complete. If the page writeback
never completes, then the journal head never comes off the BJ_Shadow list.
jbd2_journal_commit_transaction
journal_submit_data_buffers
journal_submit_inode_data_buffers
generic_writepages
set_page_writeback(page) [PG_writeback]
jbd2_journal_write_metadata_buffer
__jbd2_journal_file_buffer(jh_in, transaction, BJ_Shadow);
journal_finish_inode_data_buffers
filemap_fdatawait
filemap_fdatawait_range
wait_on_page_writeback(page)
wait_on_page_bit(page, PG_writeback) <--stuck here
jbd2_journal_unfile_buffer(journal, jh) [delete from BJ_Shadow list]
We can get to the code path below a couple of different ways (see
further down). The important stuff here is:
1) There is a code path that takes i_data_sem and then goes to sleep
waiting for the jbd2 journal head to be removed from the BJ_Shadow list.
If the journal head never comes off the list, the sema will never be
released.
2) ext4_map_blocks() always takes a read lock on i_data_sem. If the
sema is held by someone waiting for the journal head to come off the
list, it will block.
ext4_da_writepages
write_cache_pages_da
mpage_da_map_and_submit
ext4_map_blocks
down_read((&EXT4_I(inode)->i_data_sem))
up_read((&EXT4_I(inode)->i_data_sem))
down_write((&EXT4_I(inode)->i_data_sem))
ext4_ext_map_blocks
ext4_mb_new_blocks
ext4_mb_mark_diskspace_used
__ext4_journal_get_write_access
jbd2_journal_get_write_access
do_get_write_access
wait on BJ_Shadow list
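For reference, the last step of that chain -- do_get_write_access() in 3.4 --
waits roughly like this (paraphrased from memory, not a verbatim quote of the
source). Note that, per the commit chain above, the only wakeup comes late in
jbd2_journal_commit_transaction(), after journal_finish_inode_data_buffers()
has already waited on PG_writeback:

/* do_get_write_access(), paraphrased: the committing transaction is
 * writing this buffer to the journal, so we must not modify it until
 * the commit unfiles it from BJ_Shadow. */
if (jh->b_jlist == BJ_Shadow) {
    DEFINE_WAIT_BIT(wait, &bh->b_state, BH_Unshadow);
    wait_queue_head_t *wqh = bit_waitqueue(&bh->b_state, BH_Unshadow);

    JBUFFER_TRACE(jh, "on shadow: sleep");
    jbd_unlock_bh_state(bh);
    /* the commit wakes all shadow buffers only after its IO is done */
    for (;;) {
        prepare_to_wait(wqh, &wait.wait, TASK_UNINTERRUPTIBLE);
        if (jh->b_jlist != BJ_Shadow)
            break;
        schedule();
    }
    finish_wait(wqh, &wait.wait);
    goto repeat;
}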
One of the ways we end up at ext4_da_writepages() is via the page
writeback thread. If i_data_sem is already held by someone that is
sleeping, this can result in pages not getting written out.
bdi_writeback_thread
wb_do_writeback
wb_check_old_data_flush
wb_writeback
__writeback_inodes_wb
writeback_sb_inodes
writeback_single_inode
do_writepages
ext4_da_writepages
Another way to end up at ext4_da_writepages() is via sync writev()
calls. In the traces from my original report this ended up taking the
sema and then going to sleep waiting for the journal head to get removed
from the BJ_Shadow list.
sys_writev
vfs_writev
do_readv_writev
do_sync_readv_writev
ext4_file_write
generic_file_aio_write
generic_write_sync
ext4_sync_file
filemap_write_and_wait_range
__filemap_fdatawrite_range
do_writepages
ext4_da_writepages
Chris
On 10/23/2014 11:54 AM, Chris Friesen wrote:
> Basically it looks like we have a circular dependency involving the
> inode->i_data_sem rt_mutex, the PG_writeback bit, and the BJ_Shadow list. It
> goes something like this:
>
> jbd2_journal_commit_transaction:
> 1) set page for writeback (set PG_writeback bit)
> 2) put jbd2 journal head on BJ_Shadow list
> 3) sleep on PG_writeback bit waiting for page writeback complete
>
> ext4_da_writepages:
> 1) ext4_map_blocks() acquires inode->i_data_sem for writing
> 2) do_get_write_access() sleeps waiting for jbd2 journal head to come off
> the BJ_Shadow list
>
> At this point the flush code can't run because it can't acquire
> inode->i_data_sem for reading, so the page will never get written out.
> Deadlock.
Just curious...would we expect lockdep to detect this sort of thing? I
wasn't sure if the introduction of the two wait queues would cause
complications.
Chris
On Mon, 27 Oct 2014, Chris Friesen wrote:
> There are details (stack traces, etc.) in the first message in the thread:
> http://www.spinics.net/lists/linux-rt-users/msg12261.html
>
>
> Originally we had thought that nfsd might have been implicated somehow, but it
> seems like it was just a trigger (possibly by increasing the rate of sync
> I/O).
>
> In the interest of full disclosure I should point out that we're using a
> modified kernel so there is a chance that we have introduced the problem
> ourselves. That said, we have not made significant changes to either ext4 or
> jbd2. (Just a couple of minor cherry-picked bugfixes.)
I don't think it's an ext4/jbd2 problem.
> The relevant code paths are:
>
> Journal commit. The important thing here is that we set the PG_writeback on a
> page, put the jbd2 journal head on BJ_Shadow list, then sleep waiting for page
> writeback complete. If the page writeback never completes, then the journal
> head never comes off the BJ_Shadow list.
And that's what you need to investigate.
The rest of the threads being stuck waiting for the journal writeback
or inode->sem are just the consequence of it and have nothing to do
with the root cause of the problem.
ftrace with the block/writeback/jbd2/ext4/sched tracepoints enabled
should provide a first insight into the issue.
Thanks,
tglx
On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> On Mon, 27 Oct 2014, Chris Friesen wrote:
>> There are details (stack traces, etc.) in the first message in the thread:
>> http://www.spinics.net/lists/linux-rt-users/msg12261.html
>>
>>
>> Originally we had thought that nfsd might have been implicated somehow, but it
>> seems like it was just a trigger (possibly by increasing the rate of sync
>> I/O).
>>
>> In the interest of full disclosure I should point out that we're using a
>> modified kernel so there is a chance that we have introduced the problem
>> ourselves. That said, we have not made significant changes to either ext4 or
>> jbd2. (Just a couple of minor cherry-picked bugfixes.)
>
> I don't think it's an ext4/jbd2 problem.
If we turn off journalling in ext4 we can't reproduce the problem. Not
conclusive, I'll admit...but interesting.
>> The relevant code paths are:
>>
>> Journal commit. The important thing here is that we set the PG_writeback on a
>> page, put the jbd2 journal head on BJ_Shadow list, then sleep waiting for page
>> writeback complete. If the page writeback never completes, then the journal
>> head never comes off the BJ_Shadow list.
>
> And that's what you need to investigate.
>
> The rest of the threads being stuck waiting for the journal writeback
> or inode->sem are just the consequence of it and have nothing to do
> with the root cause of the problem.
>
> ftrace with the block/writeback/jbd2/ext4/sched tracepoints enabled
> should provide a first insight into the issue.
It seems plausible that the reason why page writeback never completes is
that it's blocking trying to take inode->i_data_sem for reading, as seen
in the following stack trace (from a hung system):
[<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
[<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
[<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
[<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
[<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
[<ffffffff8111a5f4>] do_writepages+0x24/0x40
[<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
[<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
[<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
[<ffffffff811928e3>] wb_writeback+0x223/0x3f0
[<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
[<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
[<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
I have ftrace logs for two of the three components that we think are
involved. I don't have ftrace logs for the above writeback case. My
instrumentation was set up to end tracing when someone blocked for 5
seconds trying to get inode->i_data_sem, and it happened to be an nfsd
task instead of the page writeback code. I could conceivably modify the
instrumentation to only get triggered by page writeback blocking.
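(In case it matters: the 5-second trigger is nothing fancy. It's along these
lines -- a sketch of the idea rather than our actual patch, so treat the
details as assumptions:)

/* Poll for i_data_sem instead of blocking outright; if we can't get it
 * within 5 seconds, freeze the ftrace ring buffer so the events leading
 * up to the stall survive for post-mortem. */
static void ext4_down_read_traced(struct inode *inode)
{
    unsigned long deadline = jiffies + 5 * HZ;

    while (!down_read_trylock(&EXT4_I(inode)->i_data_sem)) {
        if (time_after(jiffies, deadline)) {
            tracing_off();      /* stop ftrace, keep the buffer */
            pr_err("ext4: ino %lu blocked >5s on i_data_sem\n",
                   inode->i_ino);
            down_read(&EXT4_I(inode)->i_data_sem);
            return;
        }
        msleep(10);
    }
}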
For what it's worth, I'm currently testing a backport of commit b34090e
from mainline (which in turn required backporting commits e5a120a and
f5113ef). It switches from using the BJ_Shadow list to using the
BH_Shadow flag on the buffer head. More interestingly, waiters now get
woken up from journal_end_buffer_io_sync() instead of from
jbd2_journal_commit_transaction().
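The guts of the change, as far as I can tell, look roughly like this (my
paraphrase of the backported code, not a verbatim quote; sleep_on_shadow_bh()
is the little helper the commit adds, which basically just calls
io_schedule()):

/* do_get_write_access() after the backport, paraphrased: wait on a
 * buffer-head bit instead of on BJ_Shadow list membership. */
if (buffer_shadow(bh)) {
    JBUFFER_TRACE(jh, "on shadow: sleep");
    jbd_unlock_bh_state(bh);
    wait_on_bit(&bh->b_state, BH_Shadow,
                sleep_on_shadow_bh, TASK_UNINTERRUPTIBLE);
    goto repeat;
}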
So far this seems to be helping a lot. It's lasted about 15x as long
under stress as without the patches.
Chris
On Wed, 29 Oct 2014, Chris Friesen wrote:
> On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> It seems plausible that the reason why page writeback never completes is that
> it's blocking trying to take inode->i_data_sem for reading, as seen in the
> following stack trace (from a hung system):
>
> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
Well, the point is that the JBD write out is not completed. The above
is just the consequence. So really looking at ext4 inode write backs
and something stuck on BJ_Shadow or the inode sem is the wrong
place. It's all just caused by the JBD writeout not being completed
for whatever reason.
> For what it's worth, I'm currently testing a backport of commit b34090e from
> mainline (which in turn required backporting commits e5a120a and f5113ef). It
> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> buffer head. More interestingly, waiters now get woken up from
> journal_end_buffer_io_sync() instead of from
> jbd2_journal_commit_transaction().
>
> So far this seems to be helping a lot. It's lasted about 15x as long under
> stress as without the patches.
I fear that this is just papering over the problem, but you have to
talk to the jbd2 folks about that.
I personally prefer a reasonable explanation for the current behaviour
rather than a magic "solution" to the problem. But that's up to you.
Thanks,
tglx
On 10/29/2014 01:26 PM, Thomas Gleixner wrote:
> On Wed, 29 Oct 2014, Chris Friesen wrote:
>> On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
>> It seems plausible that the reason why page writeback never completes is that
>> it's blocking trying to take inode->i_data_sem for reading, as seen in the
>> following stack trace (from a hung system):
>>
>> [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
>> [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
>> [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
>> [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
>> [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
>> [<ffffffff8111a5f4>] do_writepages+0x24/0x40
>> [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
>> [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
>> [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
>> [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
>> [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
>> [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
>> [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
>
> Well, the point is that the JBD write out is not completed. The above
> is just the consequence. So really looking at ext4 inode write backs
> and something stuck on BJ_Shadow or the inode sem is the wrong
> place. It's all just caused by the JBD writeout not being completed
> for whatever reason.
I'll willingly confess my ignorance of filesystem code before I started
looking at this issue. I was under the impression that the above stack
trace (for the "flush-147:3" task, in this case) was performing the
write out of the page that had been flagged for writeback by JBD...is
that not the case? If not, then could you point me in the right direction?
>> For what it's worth, I'm currently testing a backport of commit b34090e from
>> mainline (which in turn required backporting commits e5a120a and f5113ef). It
>> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
>> buffer head. More interestingly, waiters now get woken up from
>> journal_end_buffer_io_sync() instead of from
>> jbd2_journal_commit_transaction().
>>
>> So far this seems to be helping a lot. It's lasted about 15x as long under
>> stress as without the patches.
>
> I fear that this is just papering over the problem, but you have to
> talk to the jbd2 folks about that.
They're on the CC list, hopefully they'll chime in...
> I personally prefer a reasonable explanation for the current behaviour
> rather than a magic "solution" to the problem. But that's up to you.
Well sure...but if nothing else it helps to point to a possible cause.
I'm currently looking at the locking implications of the patch to see if
it completely closes the race window or just narrows it.
Chris
On Wed, 29 Oct 2014, Chris Friesen wrote:
> On 10/29/2014 01:26 PM, Thomas Gleixner wrote:
> > On Wed, 29 Oct 2014, Chris Friesen wrote:
> > > On 10/29/2014 12:05 PM, Thomas Gleixner wrote:
> > > It seems plausible that the reason why page writeback never completes is
> > > that
> > > it's blocking trying to take inode->i_data_sem for reading, as seen in the
> > > following stack trace (from a hung system):
> > >
> > > [<ffffffff8109cd0c>] rt_down_read+0x2c/0x40
> > > [<ffffffff8120ac91>] ext4_map_blocks+0x41/0x270
> > > [<ffffffff8120f0dc>] mpage_da_map_and_submit+0xac/0x4c0
> > > [<ffffffff8120f9c9>] write_cache_pages_da+0x3f9/0x420
> > > [<ffffffff8120fd30>] ext4_da_writepages+0x340/0x720
> > > [<ffffffff8111a5f4>] do_writepages+0x24/0x40
> > > [<ffffffff81191b71>] writeback_single_inode+0x181/0x4b0
> > > [<ffffffff811922a2>] writeback_sb_inodes+0x1b2/0x290
> > > [<ffffffff8119241e>] __writeback_inodes_wb+0x9e/0xd0
> > > [<ffffffff811928e3>] wb_writeback+0x223/0x3f0
> > > [<ffffffff81192b4f>] wb_check_old_data_flush+0x9f/0xb0
> > > [<ffffffff8119403f>] wb_do_writeback+0x12f/0x250
> > > [<ffffffff811941f4>] bdi_writeback_thread+0x94/0x320
> >
> > Well, the point is that the JBD write out is not completed. The above
> > is just the consequence. So really looking at ext4 inode write backs
> > and something stuck on BJ_Shadow or the inode sem is the wrong
> > place. It's all just caused by the JBD writeout not being completed
> > for whatever reason.
>
> I'll willingly confess my ignorance of filesystem code before I started
> looking at this issue. I was under the impression that the above stack trace
> (for the "flush-147:3" task, in this case) was performing the write out of the
> page that had been flagged for writeback by JBD...is that not the case? If
> not, then could you point me in the right direction?
jbd2_journal_commit_transaction
journal_submit_data_buffers
journal_submit_inode_data_buffers
generic_writepages
set_page_writeback(page) [PG_writeback]
generic_writepages() is issuing the writeout to the block layer and
the underlying physical device.
journal_finish_inode_data_buffers
filemap_fdatawait
filemap_fdatawait_range
wait_on_page_writeback(page)
wait_on_page_bit(page, PG_writeback) <--stuck here
That part waits for the IO completion. And because that thing is
stuck, everything else gets stuck as well.
Thanks,
tglx
On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
> > For what it's worth, I'm currently testing a backport of commit b34090e from
> > mainline (which in turn required backporting commits e5a120a and f5113ef). It
> > switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> > buffer head. More interestingly, waiters now get woken up from
> > journal_end_buffer_io_sync() instead of from
> > jbd2_journal_commit_transaction().
> >
> > So far this seems to be helping a lot. It's lasted about 15x as long under
> > stress as without the patches.
>
> I fear that this is just papering over the problem, but you have to
> talk to the jbd2 folks about that.
No, it's a clean fix for the problem. The main issue is that the jbd2
commit was starting inode writeback for the blocks needed to guarantee
data=ordered mode (which is what caused various pages to have the
writeback flag set) as well as starting the metadata writes for the
commit (which is what caused the shadow bit to be set on the metadata
buffers).
Now that we clear the shadow flag when the metadata writes are
complete, the writeback will eventually be allowed to complete, and
this prevents the deadlock.
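Roughly speaking (paraphrasing the upstream change rather than quoting it
verbatim), the I/O completion handler for the temporary journal buffer now
clears the shadow bit and does the wakeup itself:

static void journal_end_buffer_io_sync(struct buffer_head *bh, int uptodate)
{
    /* paraphrase of the post-b34090e handler, not verbatim */
    struct buffer_head *orig_bh = bh->b_private;

    if (uptodate)
        set_buffer_uptodate(bh);
    else
        clear_buffer_uptodate(bh);
    if (orig_bh) {
        /* metadata write done: drop the shadow bit on the original
         * buffer and wake anyone stuck in do_get_write_access() */
        clear_bit_unlock(BH_Shadow, &orig_bh->b_state);
        smp_mb__after_clear_bit();
        wake_up_bit(&orig_bh->b_state, BH_Shadow);
    }
    unlock_buffer(bh);
}

So do_get_write_access() no longer has to wait for the whole commit to
finish; its wait ends as soon as the journal write of that particular
buffer completes.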
Cheers,
- Ted
On 10/29/2014 05:19 PM, Theodore Ts'o wrote:
> On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
>>> For what it's worth, I'm currently testing a backport of commit b34090e from
>>> mainline (which in turn required backporting commits e5a120a and f5113ef). It
>>> switches from using the BJ_Shadow list to using the BH_Shadow flag on the
>>> buffer head. More interestingly, waiters now get woken up from
>>> journal_end_buffer_io_sync() instead of from
>>> jbd2_journal_commit_transaction().
>>>
>>> So far this seems to be helping a lot. It's lasted about 15x as long under
>>> stress as without the patches.
>>
>> I fear that this is just papering over the problem, but you have to
>> talk to the jbd2 folks about that.
>
> No, it's a clean fix for the problem. The main issue is that the jbd2
> commit was starting inode writeback for the blocks needed to guarantee
> data=ordered mode (which is what caused various pages to have the
> writeback flag set) as well as starting the metadata writes for the
> commit (which is what caused the shadow bit to be set on the metadata
> buffers).
>
> Now that we clear the shadow flag when the metadata writes are
> complete, the writeback will eventually be allowed to complete, and
> this prevents the deadlock.
Thanks for the explanation.
A few questions:
1) Is this something that could hit mainline as well, or just the RT kernel?
2) If it can hit mainline, is this something that should be considered
for the various longterm-support kernels? (3.10, maybe 3.4)
3) For 3.4, do you think that it's sufficient to backport the three
commits I mentioned, or are you aware of others that I should be looking
at as well?
Chris
On Wed, Oct 29, 2014 at 05:37:03PM -0600, Chris Friesen wrote:
>
> 1) Is this something that could hit mainline as well, or just the RT kernel?
Yes, it could hit the mainline as well. It's more about the workload
and how the files are written to and/or sync'ed.
> 2) If it can hit mainline, is this something that should be considered for
> the various longterm-support kernels? (3.10, maybe 3.4)
Yes, probably.
> 3) For 3.4, do you think that it's sufficient to backport the three commits
> I mentioned, or are you aware of others that I should be looking at as well?
No, those three commits would be quite sufficient. If you could
handle doing the backports, I would greatly appreciate it! Sounds
like you've done so for 3.4 already, and it should be fairly
straightforward to get them into 3.10, since b34090e landed in 3.11.
Thanks,
- Ted
2014-10-30 9:44 GMT+08:00 Theodore Ts'o <[email protected]>:
> On Wed, Oct 29, 2014 at 05:37:03PM -0600, Chris Friesen wrote:
>> 3) For 3.4, do you think that it's sufficient to backport the three commits
>> I mentioned, or are you aware of others that I should be looking at as well?
>
> No, those three commits would be quite sufficient. If you could
> handle doing the backports, I would greatly appreciate it! Sounds
> like you've done so for 3.4 already, and it should be fairly
> straightforward to get them into 3.10, since b34090e landed in 3.11.
>
Hi Ted and Chris,
I also have a similar jbd2 hang problem on kernel 3.4. Could you kindly
specify the three commits that fix the issue? I cannot find them using
the keyword "commit b34090e e5a120a f5113ef".
Thanks a lot for your help.
Kevin
On Thu, Oct 30, 2014 at 04:15:45PM +0800, Kevin Liao wrote:
>
> I also have a similar jbd2 hang problem on kernel 3.4. Could you kindly
> specify the three commits that fix the issue? I cannot find them using
> the keyword "commit b34090e e5a120a f5113ef".
Those *are* the three commits:
% git log -3 --oneline b34090e
b34090e jbd2: refine waiting for shadow buffers
e5a120a jbd2: remove journal_head from descriptor buffers
f5113ef jbd2: don't create journal_head for temporary journal buffers
Or if you need the full git commit IDs:
% git log --pretty=oneline -3 b34090e
b34090e5e22a02fba0e4473056cce9420ad9dd0b jbd2: refine waiting for shadow buffers
e5a120aeb57f40ae568a5ca1dd6ace53d0213582 jbd2: remove journal_head from descriptor buffers
f5113effc2a2ee6b86a4b345ce557353dcbcfffe jbd2: don't create journal_head for temporary journal buffers
Or the URLs:
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=b34090e5e22a02fba0e4473056cce9420ad9dd0b
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=e5a120aeb57f40ae568a5ca1dd6ace53d0213582
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=f5113effc2a2ee6b86a4b345ce557353dcbcfffe
Cheers,
- Ted
On Wed, 29 Oct 2014, Theodore Ts'o wrote:
> On Wed, Oct 29, 2014 at 08:26:36PM +0100, Thomas Gleixner wrote:
> > > For what it's worth, I'm currently testing a backport of commit b34090e from
> > > mainline (which in turn required backporting commits e5a120a and f5113ef). It
> > > switches from using the BJ_Shadow list to using the BH_Shadow flag on the
> > > buffer head. More interestingly, waiters now get woken up from
> > > journal_end_buffer_io_sync() instead of from
> > > jbd2_journal_commit_transaction().
> > >
> > > So far this seems to be helping a lot. It's lasted about 15x as long under
> > > stress as without the patches.
> >
> > I fear that this is just papering over the problem, but you have to
> > talk to the jbd2 folks about that.
>
> No, it's a clean fix for the problem. The main issue is that the jbd2
> commit was starting inode writeback for the blocks needed to guarantee
> data=ordered mode (which is what caused various pages to have the
> writeback flag set) as well as starting the metadata writes for the
> commit (which is what caused the shadow bit to be set on the metadata
> buffers).
>
> Now that we clear the shadow flag when the metadata writes are
> complete, the writeback will eventually be allowed to complete, and
> this prevents the deadlock.
That's a way better explanation than what I saw in the commit logs and
it actually maps to the observed traces and stackdumps.
Thanks for the clarification! I'm just getting nervous when 'picked
some backports' magically 'fixes' an issue without a proper
explanation.
Thanks,
tglx
On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
>
> That's a way better explanation than what I saw in the commit logs and
> it actually maps to the observed traces and stackdumps.
I can't speak for Jan, but I suspect he didn't realize that there was
a problem. The commit description in b34090e5e2 makes it clear that
the intent was a performance improvement, and not an attempt to fix a
potential deadlock bug.
Looking at the commit history, the problem was introduced in 2.6.27
(July 2008), in commit c851ed54017373, so this problem wasn't noticed
in the RHEL 6 and RHEL 7 enterprise linux QA runs, and it wasn't
noticed in all of the regression testing that we've been doing.
I've certainly seen this before. Two years ago we found a bug that
was only noticed when we deployed ext4 in production at Google, and
stress tested it at Google scale with the appropriate monitoring
systems so we could find a bug that had existed since the very
beginning of ext3, and which had never been noticed in all of the
enterprise testing done by Red Hat, SuSE, IBM, HP, etc. Actually, it
probably was noticed, but never in a reproducible way, and so it was
probably written off as some kind of flaky hardware induced
corruption.
The difference is that in this case, it seems that Chris and Kevin were
able to reproduce the problem reliably. (It also might be that the RT
patch kit widens the race window and makes it much more likely to
trigger.) Chris or Kevin, if you have time to try to create a
reliable repro that is small/simple enough that we could propose it as
a new test to add to xfstests, that would be great. If you can't,
that's completely understandable.
In the case I described above, it was an extremely hard-to-hit race
that only happened under high memory pressure, so we were never able to
create a reliable repro. Instead we had a theory that was consistent
with the pattern of metadata corruption we were seeing, deployed a kernel
with the fix, and after a few weeks were able to conclude we had finally
fixed the bug. Welcome to file system debugging. :-)
> Thanks for the clarification! I'm just getting nervous when 'picked
> some backports' magically 'fixes' an issue without a proper
> explanation.
Well, thanks to Chris for pointing out that b34090e5 seemed to make
the problem go away. Once I looked at what that patch changed, it was
a lot more obvious what might have been going wrong. It's always
helpful if you can peek at the answer key, even if it's only a
potential answer key. :-)
Cheers,
- Ted
On 10/30/2014 05:24 PM, Theodore Ts'o wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
>>
>> That's a way better explanation than what I saw in the commit logs and
>> it actually maps to the observed traces and stackdumps.
>
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem. The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.
>
> Looking at the commit history, the problem was introduced in 2.6.27
> (July 2008), in commit c851ed54017373, so this problem wasn't noticed
> in the RHEL 6 and RHEL 7 enterprise linux QA runs, and it wasn't
> noticed in all of the regression testing that we've been doing.
>
> I've certainly seen this before. Two years ago we found a bug that
> was only noticed when we deployed ext4 in production at Google, and
> stress tested it at Google scale with the appropriate monitoring
> systems so we could find a bug that had existed since the very
> beginning of ext3, and which had never been noticed in all of the
> enterprise testing done by Red Hat, SuSE, IBM, HP, etc. Actually, it
> probably was noticed, but never in a reproducible way, and so it was
> probably written off as some kind of flaky hardware induced
> corruption.
>
> The difference is that in this case, it seems that Chris and Kevin were
> able to reproduce the problem reliably. (It also might be that the RT
> patch kit widens the race window and makes it much more likely to
> trigger.) Chris or Kevin, if you have time to try to create a
> reliable repro that is small/simple enough that we could propose it as
> a new test to add to xfstests, that would be great. If you can't,
> that's completely understandable.
It appears that EXT4_I(inode)->i_data_sem is involved, so I wonder if it
might have something to do with the fact that the RT patches modify the
reader-writer semaphores so that the read-side is exclusive?
I suspect I won't have time to isolate a useful testcase, unfortunately.
For what it's worth, we initially discovered the problem when copying
large (10GB) files from an NFS client onto an NFS-mounted ext4
filesystem that was mounted with "noatime,nodiratime,data=ordered".
Initially it failed quite reliably, then something in our environment
changed and it became more intermittent (could take several hours of
stress testing to reproduce).
We discovered somewhat by accident that we could more reliably reproduce
it running on a pair of VirtualBox VMs. The server exported the
filesystem as per above, and on the client I just used dd to copy from
/dev/zero to the NFS-mounted filesystem. Generally it would hang before
copying 5GB of data.
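If it helps anyone put together an xfstests case, the client side of our
reproducer boils down to something like the sketch below. It's hypothetical
in the details -- /mnt/nfs/bigfile and the fsync() cadence are assumptions,
and our actual runs just used dd from /dev/zero -- but it's the same I/O
pattern:

/* Hypothetical client-side repro sketch.  Assumes /mnt/nfs is the NFS
 * mount of the ext4 export described above; the periodic fsync() stands
 * in for the sync I/O the server-side nfsd ends up doing. */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    const size_t chunk = 1 << 20;           /* 1 MiB per write           */
    const long long total = 10LL << 30;     /* ~10 GiB, as in our copies */
    char *buf = calloc(1, chunk);
    int fd = open("/mnt/nfs/bigfile", O_WRONLY | O_CREAT | O_TRUNC, 0644);
    long long done = 0;

    if (!buf || fd < 0) {
        perror("setup");
        return 1;
    }
    while (done < total) {
        if (write(fd, buf, chunk) != (ssize_t)chunk) {
            perror("write");
            return 1;
        }
        done += chunk;
        if ((done & ((256LL << 20) - 1)) == 0)
            fsync(fd);                      /* sync every 256 MiB */
    }
    fsync(fd);
    close(fd);
    free(buf);
    return 0;
}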
Chris
Ted,
On Thu, 30 Oct 2014, Theodore Ts'o wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
> >
> > That's a way better explanation than what I saw in the commit logs and
> > it actually maps to the observed traces and stackdumps.
>
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem. The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.
Indeed. The changelog does not look like a bug fix. That's why I was
worried about Chris's 'magic resolution' report.
> The difference is that in this case, it seems that Chris and Kevin were
> able to reproduce the problem reliably. (It also might be that the RT
> patch kit widens the race window and makes it much more likely to
> trigger.)
We've seen that before. RT indeed makes race conditions more prominent
by opening way more race windows than a mainline kernel ever does.
But there is a caveat. With the recent per-CPU assumptions, RT might
very well run into RT-specific pitfalls that are not relevant to
mainline at all. And that's becoming worse with every day that RT is
out of tree. But that's a totally different story.
Thanks again for your profound analysis of the issue at hand!
tglx
On Thu 30-10-14 19:24:37, Ted Tso wrote:
> On Thu, Oct 30, 2014 at 10:11:26PM +0100, Thomas Gleixner wrote:
> >
> > That's a way better explanation than what I saw in the commit logs and
> > it actually maps to the observed traces and stackdumps.
>
> I can't speak for Jan, but I suspect he didn't realize that there was
> a problem. The commit description in b34090e5e2 makes it clear that
> the intent was a performance improvement, and not an attempt to fix a
> potential deadlock bug.
Yeah, correct. When I wrote b34090e5e2 I did it because I had realized
that journal reservations would create a similar deadlock opportunity to
the one Chris has observed. I hadn't realized the problem was already in
the code :)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR