2015-06-25 10:16:42

by Angel Shtilianov

[permalink] [raw]
Subject: Lockup in wait_transaction_locked under memory pressure

[Resending to linux-ext4 as the first try failed]

On 06/25/2015 01:13 PM, Nikolay Borisov wrote:
> Hello,
>
> On a fairly busy server, running LXC I'm observing that sometimes
> the processes for a particular container lock up by going into D
> (uninterruptible sleep) state. Obtaining backtraces for those
> processes one thing which stands out is that they are all
> blocked in wait_transaction_locked (part of JBD2).
> This occurs sporadically when the particular container
> is under memory pressure and a process is selected by
> OOM for killing. I'm running kernel 4.0.0 and
> oom_kill_allocating_task is enabled.
>
> Here are backtraces from multiple such processes:
>
>
> alxc9 kernel: nginx D ffff8820a90b39a8 11496 9331 30627
> 0x00000004
> alxc9 kernel: ffff8820a90b39a8 ffff881ff284f010 ffff88396c6d1e90
> 0000000000000282
> alxc9 kernel: ffff8820a90b0010 ffff883ff12f3870 ffff883ff12f3800
> 0000000000027df1
> alxc9 kernel: ffff880413c08000 ffff8820a90b39c8 ffffffff815ab76e
> ffff883ff12f3870
> alxc9 kernel: Call Trace:
> alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
> alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
> alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
> alxc9 kernel: [<ffffffff815ab18a>] ? __schedule+0x39a/0x870
> alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
> alxc9 kernel: [<ffffffff815ab76e>] ? schedule+0x3e/0x90
> alxc9 kernel: [<ffffffff815ae5e5>] ? schedule_timeout+0x165/0x1f0
> alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
> alxc9 kernel: [<ffffffff810e0880>] ? __delayacct_blkio_end+0x30/0x50
> alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
> alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
> alxc9 kernel: [<ffffffff8121da5c>] ? ext4_dirty_inode+0x3c/0x80
> alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
> alxc9 kernel: [<ffffffff8121da5c>] ext4_dirty_inode+0x3c/0x80
> alxc9 kernel: [<ffffffff811be5c3>] __mark_inode_dirty+0x173/0x400
> alxc9 kernel: [<ffffffff811ae9c5>] generic_update_time+0x85/0xd0
> alxc9 kernel: [<ffffffff81120f5a>] ? filemap_map_pages+0x1ca/0x210
> alxc9 kernel: [<ffffffff811ae632>] file_update_time+0xb2/0x110
> alxc9 kernel: [<ffffffff811226c2>] __generic_file_write_iter+0x172/0x3a0
> alxc9 kernel: [<ffffffff81214814>] ext4_file_write_iter+0x134/0x460
> alxc9 kernel: [<ffffffff810ad910>] ? update_rmtp+0x80/0x80
> alxc9 kernel: [<ffffffff81194047>] new_sync_write+0x97/0xc0
> alxc9 kernel: [<ffffffff8119445e>] vfs_write+0xce/0x180
> alxc9 kernel: [<ffffffff81194bda>] SyS_write+0x5a/0xd0
> alxc9 kernel: [<ffffffff815afa89>] system_call_fastpath+0x12/0x17
>
> alxc9 kernel: mysqld D ffff8821352638d8 11936 5176 30627
> 0x00000006
> alxc9 kernel: ffff8821352638d8 ffff881ff2848000 ffff8812d3d28a30
> 0000000000000286
> alxc9 kernel: ffff882135260010 ffff883ff12f3870 ffff883ff12f3800
> 0000000000027df1
> alxc9 kernel: ffff880413c08000 ffff8821352638f8 ffffffff815ab76e
> ffff883ff12f3870
> alxc9 kernel: Call Trace:
> alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
> alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
> alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
> alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
> alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
> alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
> alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
> alxc9 kernel: [<ffffffff8121fa70>] ? ext4_evict_inode+0x190/0x490
> alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
> alxc9 kernel: [<ffffffff8121fa70>] ext4_evict_inode+0x190/0x490
> alxc9 kernel: [<ffffffff811af6d8>] evict+0xb8/0x1a0
> alxc9 kernel: [<ffffffff811af8b6>] iput_final+0xf6/0x190
> alxc9 kernel: [<ffffffff811b0230>] iput+0xa0/0xe0
> alxc9 kernel: [<ffffffff811ab068>] dentry_iput+0xa8/0xf0
> alxc9 kernel: [<ffffffff811ac1c5>] __dentry_kill+0x85/0x130
> alxc9 kernel: [<ffffffff811ac42c>] dput+0x1bc/0x220
> alxc9 kernel: [<ffffffff811966b4>] __fput+0x144/0x200
> alxc9 kernel: [<ffffffff8119681e>] ____fput+0xe/0x10
> alxc9 kernel: [<ffffffff8106dc85>] task_work_run+0xd5/0x120
> alxc9 kernel: [<ffffffff810537d9>] do_exit+0x1b9/0x560
> alxc9 kernel: [<ffffffff810aebc2>] ? hrtimer_cancel+0x22/0x30
> alxc9 kernel: [<ffffffff81053bd6>] do_group_exit+0x56/0x100
> alxc9 kernel: [<ffffffff81061787>] get_signal+0x237/0x530
> alxc9 kernel: [<ffffffff81002d45>] do_signal+0x25/0x130
> alxc9 kernel: [<ffffffff810a57d9>] ? rcu_eqs_exit+0x79/0xb0
> alxc9 kernel: [<ffffffff810a5823>] ? rcu_user_exit+0x13/0x20
> alxc9 kernel: [<ffffffff81002ec8>] do_notify_resume+0x78/0xb0
> alxc9 kernel: [<ffffffff815afce3>] int_signal+0x12/0x17
>
> My hypothesis is that the OOM killer is killing a process while it is performing
> a write to a transaction, without giving it a chance to complete it, which
> leaves all other processes waiting to be woken up, which naturally never
> happens. I wonder whether such a failure scenario is even possible? If
> not, then what could possibly force a transaction to stall for hours?
>
> Regards,
> Nikolay
>

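The hypothesized failure mode - a writer dying inside an open transaction so that the journal's wait queue is never woken - can be pictured with a small user-space sketch. This is plain pthreads, not JBD2 code; the names are made up purely for illustration:

#include <pthread.h>
#include <stdbool.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond = PTHREAD_COND_INITIALIZER;
static bool transaction_locked = true;   /* stands in for the transaction's locked state */
static bool writer_killed = true;        /* simulate the writer being OOM-killed mid-transaction */

/* Rough analogue of wait_transaction_locked(): sleep until the flag clears. */
static void *waiter(void *arg)
{
    pthread_mutex_lock(&lock);
    while (transaction_locked)
        pthread_cond_wait(&cond, &lock); /* never returns if nobody signals */
    pthread_mutex_unlock(&lock);
    printf("waiter %ld woken\n", (long)arg);
    return NULL;
}

/* The writer is supposed to clear the flag and wake everyone up. */
static void *writer(void *arg)
{
    if (writer_killed)
        return NULL;                     /* dies before completing its handle */
    pthread_mutex_lock(&lock);
    transaction_locked = false;
    pthread_cond_broadcast(&cond);
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void)
{
    pthread_t w, t[3];
    long i;

    for (i = 0; i < 3; i++)
        pthread_create(&t[i], NULL, waiter, (void *)i);
    pthread_create(&w, NULL, writer, NULL);
    sleep(2);
    puts("waiters are still blocked, like the D-state tasks above");
    return 0;
}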

2015-06-25 11:21:19

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 13:16:39, Nikolay Borisov wrote:
> [Resending to linux-ext4 as the first try failed]
>
> On 06/25/2015 01:13 PM, Nikolay Borisov wrote:
> > Hello,
> >
> > On a fairly busy server, running LXC I'm observing that sometimes
> > the processes for a particular container lock up by going into D
> > (uninterruptible sleep) state. Obtaining backtraces for those
> > processes one thing which stands out is that they are all
> > blocked in wait_transaction_locked (part of JBD2).
> > This occurs sporadically when the particular container
> > is under memory pressure and a process is selected by
> > OOM for killing. I'm running kernel 4.0.0 and
> > oom_kill_allocating_task is enabled.

Do you have the full OOM report?

> > Here are backtraces from multiple such processes:
> >
> >
> > alxc9 kernel: nginx D ffff8820a90b39a8 11496 9331 30627
> > 0x00000004
> > alxc9 kernel: ffff8820a90b39a8 ffff881ff284f010 ffff88396c6d1e90
> > 0000000000000282
> > alxc9 kernel: ffff8820a90b0010 ffff883ff12f3870 ffff883ff12f3800
> > 0000000000027df1
> > alxc9 kernel: ffff880413c08000 ffff8820a90b39c8 ffffffff815ab76e
> > ffff883ff12f3870
> > alxc9 kernel: Call Trace:
> > alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
> > alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
> > alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
> > alxc9 kernel: [<ffffffff815ab18a>] ? __schedule+0x39a/0x870
> > alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
> > alxc9 kernel: [<ffffffff815ab76e>] ? schedule+0x3e/0x90
> > alxc9 kernel: [<ffffffff815ae5e5>] ? schedule_timeout+0x165/0x1f0
> > alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
> > alxc9 kernel: [<ffffffff810e0880>] ? __delayacct_blkio_end+0x30/0x50
> > alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
> > alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
> > alxc9 kernel: [<ffffffff8121da5c>] ? ext4_dirty_inode+0x3c/0x80
> > alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
> > alxc9 kernel: [<ffffffff8121da5c>] ext4_dirty_inode+0x3c/0x80
> > alxc9 kernel: [<ffffffff811be5c3>] __mark_inode_dirty+0x173/0x400
> > alxc9 kernel: [<ffffffff811ae9c5>] generic_update_time+0x85/0xd0
> > alxc9 kernel: [<ffffffff81120f5a>] ? filemap_map_pages+0x1ca/0x210
> > alxc9 kernel: [<ffffffff811ae632>] file_update_time+0xb2/0x110
> > alxc9 kernel: [<ffffffff811226c2>] __generic_file_write_iter+0x172/0x3a0
> > alxc9 kernel: [<ffffffff81214814>] ext4_file_write_iter+0x134/0x460
> > alxc9 kernel: [<ffffffff810ad910>] ? update_rmtp+0x80/0x80
> > alxc9 kernel: [<ffffffff81194047>] new_sync_write+0x97/0xc0
> > alxc9 kernel: [<ffffffff8119445e>] vfs_write+0xce/0x180
> > alxc9 kernel: [<ffffffff81194bda>] SyS_write+0x5a/0xd0
> > alxc9 kernel: [<ffffffff815afa89>] system_call_fastpath+0x12/0x17
> >
> > alxc9 kernel: mysqld D ffff8821352638d8 11936 5176 30627
> > 0x00000006
> > alxc9 kernel: ffff8821352638d8 ffff881ff2848000 ffff8812d3d28a30
> > 0000000000000286
> > alxc9 kernel: ffff882135260010 ffff883ff12f3870 ffff883ff12f3800
> > 0000000000027df1
> > alxc9 kernel: ffff880413c08000 ffff8821352638f8 ffffffff815ab76e
> > ffff883ff12f3870
> > alxc9 kernel: Call Trace:
> > alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
> > alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
> > alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
> > alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
> > alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
> > alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
> > alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
> > alxc9 kernel: [<ffffffff8121fa70>] ? ext4_evict_inode+0x190/0x490
> > alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
> > alxc9 kernel: [<ffffffff8121fa70>] ext4_evict_inode+0x190/0x490
> > alxc9 kernel: [<ffffffff811af6d8>] evict+0xb8/0x1a0
> > alxc9 kernel: [<ffffffff811af8b6>] iput_final+0xf6/0x190
> > alxc9 kernel: [<ffffffff811b0230>] iput+0xa0/0xe0
> > alxc9 kernel: [<ffffffff811ab068>] dentry_iput+0xa8/0xf0
> > alxc9 kernel: [<ffffffff811ac1c5>] __dentry_kill+0x85/0x130
> > alxc9 kernel: [<ffffffff811ac42c>] dput+0x1bc/0x220
> > alxc9 kernel: [<ffffffff811966b4>] __fput+0x144/0x200
> > alxc9 kernel: [<ffffffff8119681e>] ____fput+0xe/0x10
> > alxc9 kernel: [<ffffffff8106dc85>] task_work_run+0xd5/0x120
> > alxc9 kernel: [<ffffffff810537d9>] do_exit+0x1b9/0x560
> > alxc9 kernel: [<ffffffff810aebc2>] ? hrtimer_cancel+0x22/0x30
> > alxc9 kernel: [<ffffffff81053bd6>] do_group_exit+0x56/0x100
> > alxc9 kernel: [<ffffffff81061787>] get_signal+0x237/0x530
> > alxc9 kernel: [<ffffffff81002d45>] do_signal+0x25/0x130
> > alxc9 kernel: [<ffffffff810a57d9>] ? rcu_eqs_exit+0x79/0xb0
> > alxc9 kernel: [<ffffffff810a5823>] ? rcu_user_exit+0x13/0x20
> > alxc9 kernel: [<ffffffff81002ec8>] do_notify_resume+0x78/0xb0
> > alxc9 kernel: [<ffffffff815afce3>] int_signal+0x12/0x17
> >
> > My hypotheses is that the OOM is killing a process while its performing
> > a write to a transaction without it having a chance to complete it which
> > leaves all other processes waiting to be woken up, which naturally never
> > happens. I wonder whether such a failure scenario is even possible? If
> > not then what could possibly force a transaction to stall for hours?

I am not very familiar with the ext4 code, but I assume that even if the oom
victim was writing to the transaction and the allocation had failed, then
the transaction should have failed as well.
It would be a bit different if that allocation was GFP_NOFAIL and the
memory reserves were completely depleted: then it would loop inside the
allocator and never finish the write (until something else releases
memory). You can increase min_free_kbytes to increase the memory reserves.

But this is just a wild guess. sysrq+m would tell us more about the
current state of the memory. It would be also good to check whether the
OOM victim managed to die (oom report and sysrq+t).
--
Michal Hocko
SUSE Labs
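A GFP_NOFAIL-style allocation never reports failure to its caller: under depleted reserves it keeps retrying until somebody else frees memory, which is exactly the "loop inside the allocator" case described above (the min_free_kbytes knob mentioned is the vm.min_free_kbytes sysctl, i.e. /proc/sys/vm/min_free_kbytes). A minimal user-space sketch of that semantic, not the kernel allocator itself:

#include <sched.h>
#include <stdio.h>
#include <stdlib.h>

/* User-space sketch of GFP_NOFAIL semantics: the caller never sees NULL,
 * so if memory cannot be freed the loop below never terminates and the
 * write that needed the allocation never finishes. Illustration only. */
static void *alloc_nofail(size_t size)
{
    void *p;

    while ((p = malloc(size)) == NULL) {
        /* the kernel would try reclaim here; if nothing else frees
         * memory, this retry loop is where the task gets stuck */
        sched_yield();
    }
    return p;
}

int main(void)
{
    void *buf = alloc_nofail(4096);

    printf("allocated %p\n", buf);
    free(buf);
    return 0;
}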

2015-06-25 11:43:45

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

I do have several OOM reports; unfortunately, I don't think I can
correlate them in any sensible way to be able to answer the question
"Which was the process that was writing prior to the D state occurring?".
Maybe you can be more specific as to what I am likely looking for?

On 06/25/2015 02:21 PM, Michal Hocko wrote:
> On Thu 25-06-15 13:16:39, Nikolay Borisov wrote:
>> [Resending to linux-ext4 as the first try failed]
>>
>> On 06/25/2015 01:13 PM, Nikolay Borisov wrote:
>>> Hello,
>>>
>>> On a fairly busy server, running LXC I'm observing that sometimes
>>> the processes for a particular container lock up by going into D
>>> (uninterruptible sleep) state. Obtaining backtraces for those
>>> processes one thing which stands out is that they are all
>>> blocked in wait_transaction_locked (part of JBD2).
>>> This occurs sporadically when the particular container
>>> is under memory pressure and a process is selected by
>>> OOM for killing. I'm running kernel 4.0.0 and
>>> oom_kill_allocating_task is enabled.
>
> Do you have the full OOM report?
>
>>> Here are backtraces from multiple such processes:
>>>
>>>
>>> alxc9 kernel: nginx D ffff8820a90b39a8 11496 9331 30627
>>> 0x00000004
>>> alxc9 kernel: ffff8820a90b39a8 ffff881ff284f010 ffff88396c6d1e90
>>> 0000000000000282
>>> alxc9 kernel: ffff8820a90b0010 ffff883ff12f3870 ffff883ff12f3800
>>> 0000000000027df1
>>> alxc9 kernel: ffff880413c08000 ffff8820a90b39c8 ffffffff815ab76e
>>> ffff883ff12f3870
>>> alxc9 kernel: Call Trace:
>>> alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
>>> alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
>>> alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
>>> alxc9 kernel: [<ffffffff815ab18a>] ? __schedule+0x39a/0x870
>>> alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
>>> alxc9 kernel: [<ffffffff815ab76e>] ? schedule+0x3e/0x90
>>> alxc9 kernel: [<ffffffff815ae5e5>] ? schedule_timeout+0x165/0x1f0
>>> alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
>>> alxc9 kernel: [<ffffffff810e0880>] ? __delayacct_blkio_end+0x30/0x50
>>> alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
>>> alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
>>> alxc9 kernel: [<ffffffff8121da5c>] ? ext4_dirty_inode+0x3c/0x80
>>> alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
>>> alxc9 kernel: [<ffffffff8121da5c>] ext4_dirty_inode+0x3c/0x80
>>> alxc9 kernel: [<ffffffff811be5c3>] __mark_inode_dirty+0x173/0x400
>>> alxc9 kernel: [<ffffffff811ae9c5>] generic_update_time+0x85/0xd0
>>> alxc9 kernel: [<ffffffff81120f5a>] ? filemap_map_pages+0x1ca/0x210
>>> alxc9 kernel: [<ffffffff811ae632>] file_update_time+0xb2/0x110
>>> alxc9 kernel: [<ffffffff811226c2>] __generic_file_write_iter+0x172/0x3a0
>>> alxc9 kernel: [<ffffffff81214814>] ext4_file_write_iter+0x134/0x460
>>> alxc9 kernel: [<ffffffff810ad910>] ? update_rmtp+0x80/0x80
>>> alxc9 kernel: [<ffffffff81194047>] new_sync_write+0x97/0xc0
>>> alxc9 kernel: [<ffffffff8119445e>] vfs_write+0xce/0x180
>>> alxc9 kernel: [<ffffffff81194bda>] SyS_write+0x5a/0xd0
>>> alxc9 kernel: [<ffffffff815afa89>] system_call_fastpath+0x12/0x17
>>>
>>> alxc9 kernel: mysqld D ffff8821352638d8 11936 5176 30627
>>> 0x00000006
>>> alxc9 kernel: ffff8821352638d8 ffff881ff2848000 ffff8812d3d28a30
>>> 0000000000000286
>>> alxc9 kernel: ffff882135260010 ffff883ff12f3870 ffff883ff12f3800
>>> 0000000000027df1
>>> alxc9 kernel: ffff880413c08000 ffff8821352638f8 ffffffff815ab76e
>>> ffff883ff12f3870
>>> alxc9 kernel: Call Trace:
>>> alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
>>> alxc9 kernel: [<ffffffff81264265>] wait_transaction_locked+0x85/0xc0
>>> alxc9 kernel: [<ffffffff810910d0>] ? woken_wake_function+0x20/0x20
>>> alxc9 kernel: [<ffffffff81264540>] add_transaction_credits+0xf0/0x250
>>> alxc9 kernel: [<ffffffff81264824>] start_this_handle+0x184/0x420
>>> alxc9 kernel: [<ffffffff8117a48e>] ? kmem_cache_alloc+0xee/0x1c0
>>> alxc9 kernel: [<ffffffff81265220>] jbd2__journal_start+0x100/0x200
>>> alxc9 kernel: [<ffffffff8121fa70>] ? ext4_evict_inode+0x190/0x490
>>> alxc9 kernel: [<ffffffff8124bb49>] __ext4_journal_start_sb+0x79/0x100
>>> alxc9 kernel: [<ffffffff8121fa70>] ext4_evict_inode+0x190/0x490
>>> alxc9 kernel: [<ffffffff811af6d8>] evict+0xb8/0x1a0
>>> alxc9 kernel: [<ffffffff811af8b6>] iput_final+0xf6/0x190
>>> alxc9 kernel: [<ffffffff811b0230>] iput+0xa0/0xe0
>>> alxc9 kernel: [<ffffffff811ab068>] dentry_iput+0xa8/0xf0
>>> alxc9 kernel: [<ffffffff811ac1c5>] __dentry_kill+0x85/0x130
>>> alxc9 kernel: [<ffffffff811ac42c>] dput+0x1bc/0x220
>>> alxc9 kernel: [<ffffffff811966b4>] __fput+0x144/0x200
>>> alxc9 kernel: [<ffffffff8119681e>] ____fput+0xe/0x10
>>> alxc9 kernel: [<ffffffff8106dc85>] task_work_run+0xd5/0x120
>>> alxc9 kernel: [<ffffffff810537d9>] do_exit+0x1b9/0x560
>>> alxc9 kernel: [<ffffffff810aebc2>] ? hrtimer_cancel+0x22/0x30
>>> alxc9 kernel: [<ffffffff81053bd6>] do_group_exit+0x56/0x100
>>> alxc9 kernel: [<ffffffff81061787>] get_signal+0x237/0x530
>>> alxc9 kernel: [<ffffffff81002d45>] do_signal+0x25/0x130
>>> alxc9 kernel: [<ffffffff810a57d9>] ? rcu_eqs_exit+0x79/0xb0
>>> alxc9 kernel: [<ffffffff810a5823>] ? rcu_user_exit+0x13/0x20
>>> alxc9 kernel: [<ffffffff81002ec8>] do_notify_resume+0x78/0xb0
>>> alxc9 kernel: [<ffffffff815afce3>] int_signal+0x12/0x17
>>>
>>> My hypotheses is that the OOM is killing a process while its performing
>>> a write to a transaction without it having a chance to complete it which
>>> leaves all other processes waiting to be woken up, which naturally never
>>> happens. I wonder whether such a failure scenario is even possible? If
>>> not then what could possibly force a transaction to stall for hours?
>
> I am not familiar with ext4 code much but I assume that even if the oom
> victim was writing to the transaction and the allocation had failed then
> the transaction should have failed as well.
> It would be a bit different if that allocation was GFP_NOFAIL and the
> memory reserves were completely depleted then it would loop inside the
> allocator and never finish the write (until something else releases
> memory). You can increase min_free_kbytes to increase memory reserves.
>
> But this is just a wild guess. sysrq+m would tell us more about the
> current state of the memory. It would be also good to check whether the
> OOM victim managed to die (oom report and sysrq+t).
>

2015-06-25 11:50:26

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
> I do have several OOM reports unfortunately I don't think I can
> correlate them in any sensible way to be able to answer the question
> "Which was the process that was writing prior to the D state occuring".
> Maybe you can be more specific as to what am I likely looking for?

Is the system still in this state? If yes I would check the last few OOM
reports which will tell you the pid of the oom victim and then I would
check sysrq+t whether they are still alive. And if yes check their stack
traces to see whether they are still in the allocation path or they got
stuck somewhere else or maybe they are not related at all...

sysrq+t might be useful even when this is not oom related because it can
pinpoint the task which is blocking your waiters.
--
Michal Hocko
SUSE Labs
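For reference, the sysrq+m and sysrq+t dumps requested above can also be triggered without console access by writing the corresponding key to /proc/sysrq-trigger (root required, and kernel.sysrq must permit it), e.g. echo t > /proc/sysrq-trigger. A minimal sketch:

#include <stdio.h>

/* Write a sysrq key to /proc/sysrq-trigger: 'm' dumps memory state and
 * 't' dumps the stack of every task to the kernel log, the same as
 * pressing sysrq+m / sysrq+t on the console. Needs root. */
static int sysrq(char key)
{
    FILE *f = fopen("/proc/sysrq-trigger", "w");

    if (!f) {
        perror("/proc/sysrq-trigger");
        return -1;
    }
    fputc(key, f);
    fclose(f);
    return 0;
}

int main(void)
{
    sysrq('m');   /* sysrq+m: show memory usage */
    sysrq('t');   /* sysrq+t: show all task states and stacks */
    return 0;
}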

2015-06-25 12:05:03

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

Unfortunately the system is not in this state, the server had to be
rebooted, however I've got the syslog from that time frame and the pids
which were OOM'ed. I will sift through the logs for anything useful.

On 06/25/2015 02:50 PM, Michal Hocko wrote:
> On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
>> I do have several OOM reports unfortunately I don't think I can
>> correlate them in any sensible way to be able to answer the question
>> "Which was the process that was writing prior to the D state occuring".
>> Maybe you can be more specific as to what am I likely looking for?
>
> Is the system still in this state? If yes I would check the last few OOM
> reports which will tell you the pid of the oom victim and then I would
> check sysrq+t whether they are still alive. And if yes check their stack
> traces to see whether they are still in the allocation path or they got
> stuck somewhere else or maybe they are not related at all...
>
> sysrq+t might be useful even when this is not oom related because it can
> pinpoint the task which is blocking your waiters.
>

2015-06-25 13:29:34

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

I couldn't find any particular OOM which stands out; here is how a typical
one looks:

alxc9 kernel: Memory cgroup out of memory (oom_kill_allocating_task): Kill process 9703 (postmaster) score 0 or sacrifice child
alxc9 kernel: Killed process 9703 (postmaster) total-vm:205800kB, anon-rss:1128kB, file-rss:0kB
alxc9 kernel: php invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
alxc9 kernel: php cpuset=cXXXX mems_allowed=0-1
alxc9 kernel: CPU: 12 PID: 1000 Comm: php Not tainted 4.0.0-clouder9+ #31
alxc9 kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.2 01/16/2015
alxc9 kernel: ffff8805d8440400 ffff88208d863c78 ffffffff815aaca3 ffff8820b947c750
alxc9 kernel: ffff8820b947c750 ffff88208d863cc8 ffffffff81123b2e ffff882000000000
alxc9 kernel: ffffffff000000d0 ffff8805d8440400 ffff8820b947c750 ffff8820b947cee0
alxc9 kernel: Call Trace:
alxc9 kernel: [<ffffffff815aaca3>] dump_stack+0x48/0x5d
alxc9 kernel: [<ffffffff81123b2e>] dump_header+0x8e/0xe0
alxc9 kernel: [<ffffffff81123fa7>] oom_kill_process+0x1d7/0x3c0
alxc9 kernel: [<ffffffff810d85a1>] ? cpuset_mems_allowed_intersects+0x21/0x30
alxc9 kernel: [<ffffffff8118c2bd>] mem_cgroup_out_of_memory+0x2bd/0x370
alxc9 kernel: [<ffffffff81189b37>] ? mem_cgroup_iter+0x177/0x390
alxc9 kernel: [<ffffffff8118c5d7>] mem_cgroup_oom_synchronize+0x267/0x290
alxc9 kernel: [<ffffffff811874f0>] ? mem_cgroup_wait_acct_move+0x140/0x140
alxc9 kernel: [<ffffffff81124504>] pagefault_out_of_memory+0x24/0xe0
alxc9 kernel: [<ffffffff81041927>] mm_fault_error+0x47/0x160
alxc9 kernel: [<ffffffff81041db0>] __do_page_fault+0x340/0x3c0
alxc9 kernel: [<ffffffff81041e6c>] do_page_fault+0x3c/0x90
alxc9 kernel: [<ffffffff815b1758>] page_fault+0x28/0x30
alxc9 kernel: Task in /lxc/cXXXX killed as a result of limit of /lxc/cXXXX
alxc9 kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 7832302
alxc9 kernel: memory+swap: usage 2097152kB, limit 2621440kB, failcnt 0
alxc9 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
alxc9 kernel: Memory cgroup stats for /lxc/cXXXX: cache:22708KB rss:2074444KB rss_huge:0KB
mapped_file:19960KB writeback:4KB swap:0KB inactive_anon:20364KB active_anon:2074896KB
inactive_file:1236KB active_file:464KB unevictable:0KB

The backtrace for other processes is exactly the same.


On 06/25/2015 02:50 PM, Michal Hocko wrote:
> On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
>> I do have several OOM reports unfortunately I don't think I can
>> correlate them in any sensible way to be able to answer the question
>> "Which was the process that was writing prior to the D state occuring".
>> Maybe you can be more specific as to what am I likely looking for?
>
> Is the system still in this state? If yes I would check the last few OOM
> reports which will tell you the pid of the oom victim and then I would
> check sysrq+t whether they are still alive. And if yes check their stack
> traces to see whether they are still in the allocation path or they got
> stuck somewhere else or maybe they are not related at all...
>
> sysrq+t might be useful even when this is not oom related because it can
> pinpoint the task which is blocking your waiters.
>

2015-06-25 13:31:43

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu, Jun 25, 2015 at 01:50:25PM +0200, Michal Hocko wrote:
> On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
> > I do have several OOM reports unfortunately I don't think I can
> > correlate them in any sensible way to be able to answer the question
> > "Which was the process that was writing prior to the D state occuring".
> > Maybe you can be more specific as to what am I likely looking for?
>
> Is the system still in this state? If yes I would check the last few OOM
> reports which will tell you the pid of the oom victim and then I would
> check sysrq+t whether they are still alive. And if yes check their stack
> traces to see whether they are still in the allocation path or they got
> stuck somewhere else or maybe they are not related at all...
>
> sysrq+t might be useful even when this is not oom related because it can
> pinpoint the task which is blocking your waiters.

In addition to sysrq+t, the other thing to do is to sample sysrq-p a
half-dozen times or so, so we can see if there are any processes in some
memory allocation retry loop. Also useful is to enable soft lockup
detection.

Something that perhaps we should have (and maybe GFP_NOFAIL should
imply this) is for places where your choices are either (a) let the
memory allocation succeed eventually, or (b) remount the file system
read-only and/or panic the system, is in the case where we're under
severe memory pressure due to cgroup settings, to simply allow the
kmalloc to bypass the cgroup allocation limits, since otherwise the
stall could end up impacting processes in other cgroups.

This is basically the same issue as a misconfigured cgroup which has
very tiny disk I/O and memory allocated to it, such that when a
process in that cgroup does a directory lookup, VFS locks the
directory *before* calling into the file system layer, and then if the
cgroup isn't allowed much in the way of memory and disk time, it's
likely that the directory block has been pushed out of memory, and on
a sufficiently busy system, the directory read might not happen for
minutes or *hours* (both because of the disk I/O limits as well as the
time needed to clean memory to allow the necessary memory allocation
to succeed).

In the meantime, if a process in another cgroup, with plenty of
disk-time and memory, tries to do anything else with that directory,
it will run into locked directory mutex, and *wham*. Priority
inversion. It gets even more amusing if this process is the overall
docker or other cgroup manager, since then the entire system is out to
lunch, and so then a watchdog daemon fires, and reboots the entire
system....

- Ted
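The priority-inversion scenario described above can be pictured with a small user-space sketch using a plain mutex in place of the directory i_mutex; the thread names are hypothetical and the sleep stands in for I/O stalled by cgroup limits:

#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

static pthread_mutex_t dir_mutex = PTHREAD_MUTEX_INITIALIZER;

/* A task from a heavily throttled cgroup takes the "directory mutex"
 * before its lookup and then stalls on slow I/O. */
static void *throttled_lookup(void *arg)
{
    pthread_mutex_lock(&dir_mutex);
    sleep(10);                        /* stands in for throttled disk I/O / reclaim */
    pthread_mutex_unlock(&dir_mutex);
    return NULL;
}

/* A task from an unconstrained cgroup blocks behind it: priority inversion. */
static void *healthy_lookup(void *arg)
{
    pthread_mutex_lock(&dir_mutex);
    puts("healthy task finally got the directory mutex");
    pthread_mutex_unlock(&dir_mutex);
    return NULL;
}

int main(void)
{
    pthread_t slow, fast;

    pthread_create(&slow, NULL, throttled_lookup, NULL);
    sleep(1);                         /* let the throttled task grab the lock first */
    pthread_create(&fast, NULL, healthy_lookup, NULL);
    pthread_join(fast, NULL);
    pthread_join(slow, NULL);
    return 0;
}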

2015-06-25 13:45:59

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 16:29:31, Nikolay Borisov wrote:
> I couldn't find any particular OOM which stands out, here how a typical
> one looks like:
>
> alxc9 kernel: Memory cgroup out of memory (oom_kill_allocating_task): Kill process 9703 (postmaster) score 0 or sacrifice child
> alxc9 kernel: Killed process 9703 (postmaster) total-vm:205800kB, anon-rss:1128kB, file-rss:0kB
> alxc9 kernel: php invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> alxc9 kernel: php cpuset=cXXXX mems_allowed=0-1
> alxc9 kernel: CPU: 12 PID: 1000 Comm: php Not tainted 4.0.0-clouder9+ #31
> alxc9 kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.2 01/16/2015
> alxc9 kernel: ffff8805d8440400 ffff88208d863c78 ffffffff815aaca3 ffff8820b947c750
> alxc9 kernel: ffff8820b947c750 ffff88208d863cc8 ffffffff81123b2e ffff882000000000
> alxc9 kernel: ffffffff000000d0 ffff8805d8440400 ffff8820b947c750 ffff8820b947cee0
> alxc9 kernel: Call Trace:
> alxc9 kernel: [<ffffffff815aaca3>] dump_stack+0x48/0x5d
> alxc9 kernel: [<ffffffff81123b2e>] dump_header+0x8e/0xe0
> alxc9 kernel: [<ffffffff81123fa7>] oom_kill_process+0x1d7/0x3c0
> alxc9 kernel: [<ffffffff810d85a1>] ? cpuset_mems_allowed_intersects+0x21/0x30
> alxc9 kernel: [<ffffffff8118c2bd>] mem_cgroup_out_of_memory+0x2bd/0x370
> alxc9 kernel: [<ffffffff81189b37>] ? mem_cgroup_iter+0x177/0x390
> alxc9 kernel: [<ffffffff8118c5d7>] mem_cgroup_oom_synchronize+0x267/0x290
> alxc9 kernel: [<ffffffff811874f0>] ? mem_cgroup_wait_acct_move+0x140/0x140
> alxc9 kernel: [<ffffffff81124504>] pagefault_out_of_memory+0x24/0xe0
> alxc9 kernel: [<ffffffff81041927>] mm_fault_error+0x47/0x160
> alxc9 kernel: [<ffffffff81041db0>] __do_page_fault+0x340/0x3c0
> alxc9 kernel: [<ffffffff81041e6c>] do_page_fault+0x3c/0x90
> alxc9 kernel: [<ffffffff815b1758>] page_fault+0x28/0x30
> alxc9 kernel: Task in /lxc/cXXXX killed as a result of limit of /lxc/cXXXX
> alxc9 kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 7832302
> alxc9 kernel: memory+swap: usage 2097152kB, limit 2621440kB, failcnt 0
> alxc9 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
> alxc9 kernel: Memory cgroup stats for /lxc/cXXXX: cache:22708KB rss:2074444KB rss_huge:0KB
> mapped_file:19960KB writeback:4KB swap:0KB inactive_anon:20364KB active_anon:2074896KB
> inactive_file:1236KB active_file:464KB unevictable:0KB
>
> The backtrace for other processes is exactly the same.

OK, so this is not the global OOM killer. That wasn't clear from your
previous description. It makes a difference because it means that the
system is still healthy globally and allocation requests will not loop
forever in the allocator. The memcg charging path will not get blocked
until the OOM resolves; when not called from the page fault path it
simply returns ENOMEM.

The memcg oom killer ignores oom_kill_allocating_task, so the victim might be
different from the current task. That means the victim might get stuck
behind a lock held by somebody else. If the ext4 journaling code depends
on memcg charges and retries endlessly, then the waiters would get stuck as
well.

I can see some calls to find_or_create_page from fs/ext4/mballoc.c, but
AFAIU they are handling ENOMEM and lead to a transaction abort - but I am
not familiar enough with this code, so somebody familiar with ext4 should
double-check that.

This all suggests that your lockup is caused by something else than OOM
most probably.
--
Michal Hocko
SUSE Labs

2015-06-25 13:49:47

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/25/2015 04:31 PM, Theodore Ts'o wrote:
> On Thu, Jun 25, 2015 at 01:50:25PM +0200, Michal Hocko wrote:
>> On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
>>> I do have several OOM reports unfortunately I don't think I can
>>> correlate them in any sensible way to be able to answer the question
>>> "Which was the process that was writing prior to the D state occuring".
>>> Maybe you can be more specific as to what am I likely looking for?
>>
>> Is the system still in this state? If yes I would check the last few OOM
>> reports which will tell you the pid of the oom victim and then I would
>> check sysrq+t whether they are still alive. And if yes check their stack
>> traces to see whether they are still in the allocation path or they got
>> stuck somewhere else or maybe they are not related at all...
>>
>> sysrq+t might be useful even when this is not oom related because it can
>> pinpoint the task which is blocking your waiters.
>
> In addition to sysrq+t, the other thing to do is to sample sysrq-p a
> few half-dozen times so we can see if there are any processes in some
> memory allocation retry loop. Also useful is to enable soft lockup
> detection.
>
> Something that perhaps we should have (and maybe GFP_NOFAIL should
> imply this) is for places where your choices are either (a) let the
> memory allocation succeed eventually, or (b) remount the file system
> read-only and/or panic the system, is in the case where we're under
> severe memory pressure due to cgroup settings, to simply allow the
> kmalloc to bypass the cgroup allocation limits, since otherwise the
> stall could end up impacting processes in other cgroups.
>
> This is basically the same issue as a misconfigured cgroup which as
> very tiny disk I/O and memory allocated to it, such that when a
> process in that cgroup does a directory lookup, VFS locks the
> directory *before* calling into the file system layer, and then if
> cgroup isn't allow much in the way of memory and disk time, it's
> likely that the directory block has been pushed out of memory, and on
> a sufficiently busy system, the directory read might not happen for
> minutes or *hours* (both because of the disk I/O limits as well as the
> time needed to clean memory to allow the necessary memory allocation
> to succeed).
>
> In the meantime, if a process in another cgroup, with plenty of
> disk-time and memory, tries to do anything else with that directory,
> it will run into locked directory mutex, and *wham*. Priority
> inversion. It gets even more amusing if this process is the overall
> docker or other cgroup manager, since then the entire system is out to
> lunch, and so then a watchdog daemon fires, and reboots the entire
> system....
>

You know, it might be possible that I'm observing exactly this,
since the other place where processes are blocked (which I
omitted initially since I thought it was inconsequential)
is the following code path:

Jun 24 11:22:59 alxc9 kernel: crond D ffff8820b8affe58 14784 30568 30627 0x00000004
Jun 24 11:22:59 alxc9 kernel: ffff8820b8affe58 ffff8820ca72b2f0 ffff882c3534b2f0 000000000000fe4e
Jun 24 11:22:59 alxc9 kernel: ffff8820b8afc010 ffff882c3534b2f0 ffff8808d2d7e34c 00000000ffffffff
Jun 24 11:22:59 alxc9 kernel: ffff8808d2d7e350 ffff8820b8affe78 ffffffff815ab76e ffff882c3534b2f0
Jun 24 11:22:59 alxc9 kernel: Call Trace:
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ab9de>] schedule_preempt_disabled+0xe/0x10
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ad505>] __mutex_lock_slowpath+0x95/0x110
Jun 24 11:22:59 alxc9 kernel: [<ffffffff810a57d9>] ? rcu_eqs_exit+0x79/0xb0
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ad59b>] mutex_lock+0x1b/0x30
Jun 24 11:22:59 alxc9 kernel: [<ffffffff811b1fbd>] __fdget_pos+0x3d/0x50
Jun 24 11:22:59 alxc9 kernel: [<ffffffff810119d7>] ? syscall_trace_leave+0xa7/0xf0
Jun 24 11:22:59 alxc9 kernel: [<ffffffff81194bb3>] SyS_write+0x33/0xd0
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815afcc8>] ? int_check_syscall_exit_work+0x34/0x3d
Jun 24 11:22:59 alxc9 kernel: [<ffffffff815afa89>] system_call_fastpath+0x12/0x17

In particular, I can see a lot of processes locked up
in __fdget_pos -> mutex_lock, and this all sounds very
similar to what you just described.

How would you advise to rectify such situation?

> - Ted
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>

2015-06-25 13:54:25

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/25/2015 04:45 PM, Michal Hocko wrote:
> On Thu 25-06-15 16:29:31, Nikolay Borisov wrote:
>> I couldn't find any particular OOM which stands out, here how a typical
>> one looks like:
>>
>> alxc9 kernel: Memory cgroup out of memory (oom_kill_allocating_task): Kill process 9703 (postmaster) score 0 or sacrifice child
>> alxc9 kernel: Killed process 9703 (postmaster) total-vm:205800kB, anon-rss:1128kB, file-rss:0kB
>> alxc9 kernel: php invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
>> alxc9 kernel: php cpuset=cXXXX mems_allowed=0-1
>> alxc9 kernel: CPU: 12 PID: 1000 Comm: php Not tainted 4.0.0-clouder9+ #31
>> alxc9 kernel: Hardware name: Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.2 01/16/2015
>> alxc9 kernel: ffff8805d8440400 ffff88208d863c78 ffffffff815aaca3 ffff8820b947c750
>> alxc9 kernel: ffff8820b947c750 ffff88208d863cc8 ffffffff81123b2e ffff882000000000
>> alxc9 kernel: ffffffff000000d0 ffff8805d8440400 ffff8820b947c750 ffff8820b947cee0
>> alxc9 kernel: Call Trace:
>> alxc9 kernel: [<ffffffff815aaca3>] dump_stack+0x48/0x5d
>> alxc9 kernel: [<ffffffff81123b2e>] dump_header+0x8e/0xe0
>> alxc9 kernel: [<ffffffff81123fa7>] oom_kill_process+0x1d7/0x3c0
>> alxc9 kernel: [<ffffffff810d85a1>] ? cpuset_mems_allowed_intersects+0x21/0x30
>> alxc9 kernel: [<ffffffff8118c2bd>] mem_cgroup_out_of_memory+0x2bd/0x370
>> alxc9 kernel: [<ffffffff81189b37>] ? mem_cgroup_iter+0x177/0x390
>> alxc9 kernel: [<ffffffff8118c5d7>] mem_cgroup_oom_synchronize+0x267/0x290
>> alxc9 kernel: [<ffffffff811874f0>] ? mem_cgroup_wait_acct_move+0x140/0x140
>> alxc9 kernel: [<ffffffff81124504>] pagefault_out_of_memory+0x24/0xe0
>> alxc9 kernel: [<ffffffff81041927>] mm_fault_error+0x47/0x160
>> alxc9 kernel: [<ffffffff81041db0>] __do_page_fault+0x340/0x3c0
>> alxc9 kernel: [<ffffffff81041e6c>] do_page_fault+0x3c/0x90
>> alxc9 kernel: [<ffffffff815b1758>] page_fault+0x28/0x30
>> alxc9 kernel: Task in /lxc/cXXXX killed as a result of limit of /lxc/cXXXX
>> alxc9 kernel: memory: usage 2097152kB, limit 2097152kB, failcnt 7832302
>> alxc9 kernel: memory+swap: usage 2097152kB, limit 2621440kB, failcnt 0
>> alxc9 kernel: kmem: usage 0kB, limit 9007199254740988kB, failcnt 0
>> alxc9 kernel: Memory cgroup stats for /lxc/cXXXX: cache:22708KB rss:2074444KB rss_huge:0KB
>> mapped_file:19960KB writeback:4KB swap:0KB inactive_anon:20364KB active_anon:2074896KB
>> inactive_file:1236KB active_file:464KB unevictable:0KB
>>
>> The backtrace for other processes is exactly the same.
>
> OK, so this is not the global OOM killer. That wasn't clear from your
> previous description. It makes a difference because it means that the
> system is still healthy globaly and allocation requests will not loop
> for ever in the allocator. Memcg charging path will not get blocked
> until the OOM resolves and return ENOMEM when not called from the page
> fault path.

Yes, overall the machine is healthy; it's only the processes for a
particular container that would all go into uninterruptible sleep.

>
> memcg oom killer ignores oom_kill_allocating_task so the victim might be
> different from the current task. That means the victim might get stuck
> behind a lock held by somebody else. If the ext4 journaling code depends
> on memcg charges and retry endlessly then the waiters would get stuck as
> well.

I've patched the cgroup OOM so that it takes into account the
oom_kill_allocating_task.

>
> I can see some calls to find_or_create_page from fs/ext4/mballoc.c but
> AFAIU they are handling ENOMEM and lead to transaction abort - but I am
> not familiar with this code enough so somebody familiar with ext4 should
> double check that.
>
> This all suggests that your lockup is caused by something else than OOM
> most probably.
>

2015-06-25 13:57:08

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 09:31:38, Theodore Ts'o wrote:
> On Thu, Jun 25, 2015 at 01:50:25PM +0200, Michal Hocko wrote:
> > On Thu 25-06-15 14:43:42, Nikolay Borisov wrote:
> > > I do have several OOM reports unfortunately I don't think I can
> > > correlate them in any sensible way to be able to answer the question
> > > "Which was the process that was writing prior to the D state occuring".
> > > Maybe you can be more specific as to what am I likely looking for?
> >
> > Is the system still in this state? If yes I would check the last few OOM
> > reports which will tell you the pid of the oom victim and then I would
> > check sysrq+t whether they are still alive. And if yes check their stack
> > traces to see whether they are still in the allocation path or they got
> > stuck somewhere else or maybe they are not related at all...
> >
> > sysrq+t might be useful even when this is not oom related because it can
> > pinpoint the task which is blocking your waiters.
>
> In addition to sysrq+t, the other thing to do is to sample sysrq-p a
> few half-dozen times so we can see if there are any processes in some
> memory allocation retry loop. Also useful is to enable soft lockup
> detection.
>
> Something that perhaps we should have (and maybe GFP_NOFAIL should
> imply this) is for places where your choices are either (a) let the
> memory allocation succeed eventually, or (b) remount the file system
> read-only and/or panic the system, is in the case where we're under
> severe memory pressure due to cgroup settings, to simply allow the
> kmalloc to bypass the cgroup allocation limits, since otherwise the
> stall could end up impacting processes in other cgroups.

GFP_NOFAIL will fall back to the root memcg if the charge fails after
several attempts to reclaim the memcg.

Besides that, kmalloc would require the kmem limit to be set, which is not
the case here. And even if it were, it would fall back as well.

As explained in the other email, memcg charges are not blocked until the
memcg oom gets resolved. We simply return ENOMEM. The memcg oom killer is
called only from the page fault path, and even there from
pagefault_out_of_memory, where no locks are held.
--
Michal Hocko
SUSE Labs

2015-06-25 13:59:00

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 16:54:23, Nikolay Borisov wrote:
>
>
> On 06/25/2015 04:45 PM, Michal Hocko wrote:
[...]
> > memcg oom killer ignores oom_kill_allocating_task so the victim might be
> > different from the current task. That means the victim might get stuck
> > behind a lock held by somebody else. If the ext4 journaling code depends
> > on memcg charges and retry endlessly then the waiters would get stuck as
> > well.
>
> I've patched the cgroup OOM so that it takes into account the
> oom_kill_allocating_task.

I am not very keen on this configuration because it behaves very
randomly, and I am pretty sure it will just paper over the real issue
here.

--
Michal Hocko
SUSE Labs

2015-06-25 14:05:12

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
[...]
> How would you advise to rectify such situation?

As I've said. Check the oom victim traces and see if it is holding any
of those locks.
--
Michal Hocko
SUSE Labs

2015-06-25 14:34:26

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On 06/25/2015 05:05 PM, Michal Hocko wrote:
> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> [...]
>> How would you advise to rectify such situation?
>
> As I've said. Check the oom victim traces and see if it is holding any
> of those locks.

As mentioned previously, all OOM traces are identical to the one I've
sent - OOM being called from the page fault path.


2015-06-25 14:45:57

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu, Jun 25, 2015 at 04:49:43PM +0300, Nikolay Borisov wrote:
>
> You know it might be possible that I'm observing exactly this,
> since the other places where processes are blocked (but I
> omitted initially since I thought it's inconsequential)
> is in the following code path:
>
> Jun 24 11:22:59 alxc9 kernel: crond D ffff8820b8affe58 14784 30568 30627 0x00000004
> Jun 24 11:22:59 alxc9 kernel: ffff8820b8affe58 ffff8820ca72b2f0 ffff882c3534b2f0 000000000000fe4e
> Jun 24 11:22:59 alxc9 kernel: ffff8820b8afc010 ffff882c3534b2f0 ffff8808d2d7e34c 00000000ffffffff
> Jun 24 11:22:59 alxc9 kernel: ffff8808d2d7e350 ffff8820b8affe78 ffffffff815ab76e ffff882c3534b2f0
> Jun 24 11:22:59 alxc9 kernel: Call Trace:
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ab76e>] schedule+0x3e/0x90
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ab9de>] schedule_preempt_disabled+0xe/0x10
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ad505>] __mutex_lock_slowpath+0x95/0x110
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff810a57d9>] ? rcu_eqs_exit+0x79/0xb0
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815ad59b>] mutex_lock+0x1b/0x30
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff811b1fbd>] __fdget_pos+0x3d/0x50
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff810119d7>] ? syscall_trace_leave+0xa7/0xf0
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff81194bb3>] SyS_write+0x33/0xd0
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815afcc8>] ? int_check_syscall_exit_work+0x34/0x3d
> Jun 24 11:22:59 alxc9 kernel: [<ffffffff815afa89>] system_call_fastpath+0x12/0x17
>
> Particularly, I can see a lot of processes locked up
> in __fdget_pos -> mutex_lock. And this all sounds very
> similar to what you just described.

What we would need to do is to analyze the stack traces of *all* of
the processes. It's clear that you have a lot of processes waiting on
something to clear, but we need to figure out what that might be. We
could be waiting on some memory allocation to complete; we could be
waiting for disk I/O to complete (which could get throttled for any
number of different reasons, including a cgroup's disk I/O limits), etc.

> How would you advise to rectify such situation?

In addition to trying to figure this out by analyzing all of the
kernel stack traces, you could also try to figure it out by
experimental methods.

Determine which containers hold all of the processes that are stalled in
disk wait, and try relaxing the memory, disk, and cpu constraints on
each of those containers, one at a time. Say, add 50% to each limit, or
make it unlimited. I would suggest starting with containers that
contain processes that are trying to exit due to being OOM killed.

After you change each limit, see if it unclogs the system. If it
does, then you'll know a bit more about what caused the stall. It
also suggests a terrible hack, namely a process which scrapes dmesg
output, and when it sees a process that has been oom killed, if it is
in a container, sends kill signals to all of the processes in that
container (since if one process has exited, it's likely that container
isn't going to be functioning correctly anyway), and then unconstrains
the cgroup limits for all of the processes in that container.

The container manager can then restart the container, after it has
exited cleanly. Yes, it's a kludge. But it's a kludge that I bet
will work, which makes it a devops procedure. :-)

- Ted
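A rough sketch of the kludge described above, assuming the cgroup-v1 memory controller mounted at /sys/fs/cgroup/memory (as on a 4.0-era kernel) with its memory.limit_in_bytes and cgroup.procs files; the dmesg-scraping trigger is left out and the container's cgroup path is simply taken from the command line. This is an illustration of the idea, not a tested tool:

#include <signal.h>
#include <stdio.h>

static void unconstrain_and_kill(const char *cgroup)
{
    char path[512];
    FILE *f;
    int pid;

    /* 1) lift the memory limit so stuck allocations can make progress
     * (writing -1 means "unlimited" in cgroup v1) */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/memory/%s/memory.limit_in_bytes", cgroup);
    f = fopen(path, "w");
    if (f) {
        fputs("-1\n", f);
        fclose(f);
    }

    /* 2) kill every task in the container so the manager can restart it */
    snprintf(path, sizeof(path),
             "/sys/fs/cgroup/memory/%s/cgroup.procs", cgroup);
    f = fopen(path, "r");
    if (!f)
        return;
    while (fscanf(f, "%d", &pid) == 1)
        kill(pid, SIGKILL);
    fclose(f);
}

int main(int argc, char **argv)
{
    if (argc != 2) {
        fprintf(stderr, "usage: %s <cgroup path, e.g. lxc/cXXXX>\n", argv[0]);
        return 1;
    }
    unconstrain_and_kill(argv[1]);
    return 0;
}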

2015-06-25 15:18:44

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
> On 06/25/2015 05:05 PM, Michal Hocko wrote:
> > On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> > [...]
> >> How would you advise to rectify such situation?
> >
> > As I've said. Check the oom victim traces and see if it is holding any
> > of those locks.
>
> As mentioned previously all OOM traces are identical to the one I've
> sent - OOM being called form the page fault path.

By identical you mean that all of them kill the same task? Or just that
the path is the same (which wouldn't be surprising, as this is the only
path which triggers the memcg oom killer)?

--
Michal Hocko
SUSE Labs

2015-06-25 15:27:13

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/25/2015 06:18 PM, Michal Hocko wrote:
> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
>>> [...]
>>>> How would you advise to rectify such situation?
>>>
>>> As I've said. Check the oom victim traces and see if it is holding any
>>> of those locks.
>>
>> As mentioned previously all OOM traces are identical to the one I've
>> sent - OOM being called form the page fault path.
>
> By identical you mean that all of them kill the same task? Or just that
> the path is same (which wouldn't be surprising as this is the only path
> which triggers memcg oom killer)?

The code path is the same; the tasks being killed are different.
>

2015-06-29 08:32:46

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
>
>
> On 06/25/2015 06:18 PM, Michal Hocko wrote:
> > On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
> >> On 06/25/2015 05:05 PM, Michal Hocko wrote:
> >>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> >>> [...]
> >>>> How would you advise to rectify such situation?
> >>>
> >>> As I've said. Check the oom victim traces and see if it is holding any
> >>> of those locks.
> >>
> >> As mentioned previously all OOM traces are identical to the one I've
> >> sent - OOM being called form the page fault path.
> >
> > By identical you mean that all of them kill the same task? Or just that
> > the path is same (which wouldn't be surprising as this is the only path
> > which triggers memcg oom killer)?
>
> The code path is the same, the tasks being killed are different

Is the OOM killer triggered only for a single memcg, or do others
misbehave as well?
--
Michal Hocko
SUSE Labs

2015-06-29 09:01:52

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

Today I observed the issue again, this time on a different server. What
is particularly strange is the fact that the OOM wasn't triggered for
the cgroup whose tasks have entered D state. There were a couple of
SSHD processes and an RSYNC performing some backup tasks. Here is what
the stack trace for the rsync looks like:

crash> set 18308
PID: 18308
COMMAND: "rsync"
TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
CPU: 1
STATE: TASK_UNINTERRUPTIBLE
crash> bt
PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
#0 [ffff88177374ac60] __schedule at ffffffff815ab152
#1 [ffff88177374acb0] schedule at ffffffff815ab76e
#2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
#3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
#4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
#5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
#6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
#7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
#8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
#9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
#10 [ffff88177374b150] shrink_zone at ffffffff811360c3
#11 [ffff88177374b220] shrink_zones at ffffffff81136eff
#12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
#13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
#14 [ffff88177374b380] try_charge at ffffffff81189423
#15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
#16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
#17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
#18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
#19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
#20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
#21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
#22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
#23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
#24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
#25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
#26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
#27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
#28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
#29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
#30 [ffff88177374bb20] do_writepages at ffffffff8112c490
#31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
#32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
#33 [ffff88177374bb90] ext4_alloc_da_blocks at ffffffff81219da1
#34 [ffff88177374bbb0] ext4_rename at ffffffff81229b91
#35 [ffff88177374bcd0] ext4_rename2 at ffffffff81229e32
#36 [ffff88177374bce0] vfs_rename at ffffffff811a08a5
#37 [ffff88177374bd60] SYSC_renameat2 at ffffffff811a3ffc
#38 [ffff88177374bf60] sys_renameat2 at ffffffff811a408e
#39 [ffff88177374bf70] sys_rename at ffffffff8119e51e
#40 [ffff88177374bf80] system_call_fastpath at ffffffff815afa89
RIP: 00002ba0169b61d7 RSP: 00007ffde830f428 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002ba016a2e007
RDX: 0000000000000000 RSI: 00007ffde83126f0 RDI: 00007ffde83106f0
RBP: 00007ffde83106f0 R8: 0000000000000000 R9: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: ffffffff8119e51e
R13: ffff88177374bf78 R14: ffffffff811a408e R15: ffff88177374bf68
ORIG_RAX: 0000000000000052 CS: 0033 SS: 002b


All the other processes in the container have locked up when trying to
access file system resources. The SSHD processes, for example, have
blocked trying to access /var/log/btmp:

PID: 11919 TASK: ffff8815613c1460 CPU: 2 COMMAND: "sshd"
#0 [ffff880158d03c70] __schedule at ffffffff815ab152
#1 [ffff880158d03cc0] schedule at ffffffff815ab76e
#2 [ffff880158d03ce0] schedule_preempt_disabled at ffffffff815ab9de
#3 [ffff880158d03cf0] __mutex_lock_slowpath at ffffffff815ad505
#4 [ffff880158d03d50] mutex_lock at ffffffff815ad59b
#5 [ffff880158d03d70] ext4_file_write_iter at ffffffff8121477b
#6 [ffff880158d03e30] new_sync_write at ffffffff81194047
#7 [ffff880158d03f00] vfs_write at ffffffff8119445e
#8 [ffff880158d03f30] sys_write at ffffffff81194bda
#9 [ffff880158d03f80] system_call_fastpath at ffffffff815afa89
RIP: 00002adf56857790 RSP: 00007fffaf514788 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002adf568657f7
RDX: 0000000000000180 RSI: 00007fffaf5147a0 RDI: 0000000000000005
RBP: 0000000000000005 R8: 00002adf53ebf7a0 R9: 0000000080000000
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffaf514740
R13: 0000000000000180 R14: 0000000000000000 R15: 00007fffaf5147a0
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

Except for one which has blocked with the following stack:
PID: 7833 TASK: ffff88175b3ba8c0 CPU: 3 COMMAND: "sshd"
#0 [ffff88001d983960] __schedule at ffffffff815ab152
#1 [ffff88001d9839b0] schedule at ffffffff815ab76e
#2 [ffff88001d9839d0] wait_transaction_locked at ffffffff81264265
#3 [ffff88001d983a30] add_transaction_credits at ffffffff81264540
#4 [ffff88001d983ab0] start_this_handle at ffffffff81264824
#5 [ffff88001d983b60] jbd2__journal_start at ffffffff81265220
#6 [ffff88001d983bc0] __ext4_journal_start_sb at ffffffff8124bb49
#7 [ffff88001d983c10] ext4_dirty_inode at ffffffff8121da5c
#8 [ffff88001d983c30] __mark_inode_dirty at ffffffff811be5c3
#9 [ffff88001d983c70] generic_update_time at ffffffff811ae9c5
#10 [ffff88001d983cb0] file_update_time at ffffffff811ae632
#11 [ffff88001d983d10] __generic_file_write_iter at ffffffff811226c2
#12 [ffff88001d983d70] ext4_file_write_iter at ffffffff81214814
#13 [ffff88001d983e30] new_sync_write at ffffffff81194047
#14 [ffff88001d983f00] vfs_write at ffffffff8119445e
#15 [ffff88001d983f30] sys_write at ffffffff81194bda
#16 [ffff88001d983f80] system_call_fastpath at ffffffff815afa89
RIP: 00002adf56857790 RSP: 00007fffaf514788 RFLAGS: 00000206
RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002adf568657f7
RDX: 0000000000000180 RSI: 00007fffaf5147a0 RDI: 0000000000000005
RBP: 0000000000000005 R8: 00002adf53ebf7a0 R9: 0000000080000000
R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffaf514740
R13: 0000000000000180 R14: 0000000000000000 R15: 00007fffaf5147a0
ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b


So basically, what I have managed to determine so far is that "something"
(this is the bit which I've yet to uncover, hopefully with your help)
starves the file system of resources, which causes processes to block in
the jbd2_journal_start path. The other SSHD processes essentially block
on the i_mutex of /var/log/btmp (nothing unusual).

So I guess the most interesting bit here is the stack of the rsync
process: it is blocking on trying to free some pages, yet the OOM
hasn't triggered. Also, I did try increasing the memory of the cgroup
from 6 to 8 gigs and nothing changed (and this is logical when you think
about it, since everything is in D state).

Currently, our only way to fix an issue like that is to restart the whole
node, since there is no way to forcefully remove the block device the
file system is hosted on (which is a device mapper target).




2015-06-29 09:07:57

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/29/2015 11:32 AM, Michal Hocko wrote:
> On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
>>
>>
>> On 06/25/2015 06:18 PM, Michal Hocko wrote:
>>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
>>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
>>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
>>>>> [...]
>>>>>> How would you advise to rectify such situation?
>>>>>
>>>>> As I've said. Check the oom victim traces and see if it is holding any
>>>>> of those locks.
>>>>
>>>> As mentioned previously all OOM traces are identical to the one I've
>>>> sent - OOM being called form the page fault path.
>>>
>>> By identical you mean that all of them kill the same task? Or just that
>>> the path is same (which wouldn't be surprising as this is the only path
>>> which triggers memcg oom killer)?
>>
>> The code path is the same, the tasks being killed are different
>
> Is the OOM killer triggered only for a singe memcg or others misbehave
> as well?

Generally OOM would be triggered for whichever memcg runs out of
resources, but so far I've only observed that the D state issue happens
in a single container. However, this in turn might affect other
processes if they try to sleep on the same jbd2 journal.

2015-06-29 09:16:31

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Mon 29-06-15 12:07:54, Nikolay Borisov wrote:
>
>
> On 06/29/2015 11:32 AM, Michal Hocko wrote:
> > On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
> >>
> >>
> >> On 06/25/2015 06:18 PM, Michal Hocko wrote:
> >>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
> >>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
> >>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> >>>>> [...]
> >>>>>> How would you advise to rectify such situation?
> >>>>>
> >>>>> As I've said. Check the oom victim traces and see if it is holding any
> >>>>> of those locks.
> >>>>
> >>>> As mentioned previously all OOM traces are identical to the one I've
> >>>> sent - OOM being called form the page fault path.
> >>>
> >>> By identical you mean that all of them kill the same task? Or just that
> >>> the path is same (which wouldn't be surprising as this is the only path
> >>> which triggers memcg oom killer)?
> >>
> >> The code path is the same, the tasks being killed are different
> >
> > Is the OOM killer triggered only for a singe memcg or others misbehave
> > as well?
>
> Generally OOM would be triggered for whichever memcg runs out of
> resources but so far I've only observed that the D state issue happens
> in a single containers.

It is not clear whether it is the OOM memcg which has tasks in the D
state. Anyway I think it all smells like one memcg is throttling others
on another shared resource - journal in your case.

> However, this in turn might affect other processes if they try to
> sleep on the same jbd2 journal .

Sure, if the journal is shared then this is an inherent problem. Memcg
restrictions can easily cause priority inversion problems, as Ted has
already mentioned.

--
Michal Hocko
SUSE Labs

2015-06-29 09:23:20

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/29/2015 12:16 PM, Michal Hocko wrote:
> On Mon 29-06-15 12:07:54, Nikolay Borisov wrote:
>>
>>
>> On 06/29/2015 11:32 AM, Michal Hocko wrote:
>>> On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 06/25/2015 06:18 PM, Michal Hocko wrote:
>>>>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
>>>>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
>>>>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
>>>>>>> [...]
>>>>>>>> How would you advise to rectify such situation?
>>>>>>>
>>>>>>> As I've said. Check the oom victim traces and see if it is holding any
>>>>>>> of those locks.
>>>>>>
>>>>>> As mentioned previously all OOM traces are identical to the one I've
>>>>>> sent - OOM being called form the page fault path.
>>>>>
>>>>> By identical you mean that all of them kill the same task? Or just that
>>>>> the path is same (which wouldn't be surprising as this is the only path
>>>>> which triggers memcg oom killer)?
>>>>
>>>> The code path is the same, the tasks being killed are different
>>>
>>> Is the OOM killer triggered only for a singe memcg or others misbehave
>>> as well?
>>
>> Generally OOM would be triggered for whichever memcg runs out of
>> resources but so far I've only observed that the D state issue happens
>> in a single containers.
>
> It is not clear whether it is the OOM memcg which has tasks in the D
> state. Anyway I think it all smells like one memcg is throttling others
> on another shared resource - journal in your case.

Be that as it may, how do I find which cgroup is the culprit?

>
>> However, this in turn might affect other processes if they try to
>> sleep on the same jbd2 journal .
>
> Sure, if the journal is shared then this is an inherent problem. Memcg
> restrictions can easily cause priority inheritance problems as Ted has
> already mentioned.
>

2015-06-29 09:36:42

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> Today I observed the issue again, this time on a different server. What
> is particularly strange is the fact that the OOM wasn't triggered for
> the cgroup, whose tasks have entered D state. There were a couple of
> SSHD processes and an RSYNC performing some backup tasks. Here is what
> the stacktrace for the rsync looks like:
>
> crash> set 18308
> PID: 18308
> COMMAND: "rsync"
> TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
> CPU: 1
> STATE: TASK_UNINTERRUPTIBLE
> crash> bt
> PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
> #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445

This is most probably wait_on_page_writeback, because the reclaim has
encountered a dirty page which is currently under writeback. This is
a memcg-specific throttling on dirty memory, because there is nothing
like that integrated into balance_dirty_pages. This also suggests that
dirtiers in this memcg are able to fill up the LRU with dirty memory
before the global writeback thread can clean it at a sufficient pace.

I would try to tune dirty limits and the writeback timeout.
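
For reference, the knobs in question are the global vm.dirty_* sysctls
(there is no per-memcg equivalent in this kernel); the values below are
purely illustrative, just to show the direction of the tuning:

  sysctl -w vm.dirty_background_ratio=5      (start background writeback earlier)
  sysctl -w vm.dirty_ratio=10                (throttle dirtiers earlier)
  sysctl -w vm.dirty_expire_centisecs=1500   (treat dirty data as old sooner)
  sysctl -w vm.dirty_writeback_centisecs=100 (wake the flusher threads more often)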

> #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> #14 [ffff88177374b380] try_charge at ffffffff81189423
> #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
> #33 [ffff88177374bb90] ext4_alloc_da_blocks at ffffffff81219da1
> #34 [ffff88177374bbb0] ext4_rename at ffffffff81229b91
> #35 [ffff88177374bcd0] ext4_rename2 at ffffffff81229e32
> #36 [ffff88177374bce0] vfs_rename at ffffffff811a08a5
> #37 [ffff88177374bd60] SYSC_renameat2 at ffffffff811a3ffc
> #38 [ffff88177374bf60] sys_renameat2 at ffffffff811a408e
> #39 [ffff88177374bf70] sys_rename at ffffffff8119e51e
> #40 [ffff88177374bf80] system_call_fastpath at ffffffff815afa89
> RIP: 00002ba0169b61d7 RSP: 00007ffde830f428 RFLAGS: 00000206
> RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002ba016a2e007
> RDX: 0000000000000000 RSI: 00007ffde83126f0 RDI: 00007ffde83106f0
> RBP: 00007ffde83106f0 R8: 0000000000000000 R9: 0000000000000000
> R10: 0000000000000000 R11: 0000000000000202 R12: ffffffff8119e51e
> R13: ffff88177374bf78 R14: ffffffff811a408e R15: ffff88177374bf68
> ORIG_RAX: 0000000000000052 CS: 0033 SS: 002b
>
>
> All the other processes in the container have locked up when trying to
> access file system resources. The SSHD processes for example have
> blocked on tryng to access /var/log/btmp:
>
> PID: 11919 TASK: ffff8815613c1460 CPU: 2 COMMAND: "sshd"
> #0 [ffff880158d03c70] __schedule at ffffffff815ab152
> #1 [ffff880158d03cc0] schedule at ffffffff815ab76e
> #2 [ffff880158d03ce0] schedule_preempt_disabled at ffffffff815ab9de
> #3 [ffff880158d03cf0] __mutex_lock_slowpath at ffffffff815ad505
> #4 [ffff880158d03d50] mutex_lock at ffffffff815ad59b
> #5 [ffff880158d03d70] ext4_file_write_iter at ffffffff8121477b
> #6 [ffff880158d03e30] new_sync_write at ffffffff81194047
> #7 [ffff880158d03f00] vfs_write at ffffffff8119445e
> #8 [ffff880158d03f30] sys_write at ffffffff81194bda
> #9 [ffff880158d03f80] system_call_fastpath at ffffffff815afa89
> RIP: 00002adf56857790 RSP: 00007fffaf514788 RFLAGS: 00000206
> RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002adf568657f7
> RDX: 0000000000000180 RSI: 00007fffaf5147a0 RDI: 0000000000000005
> RBP: 0000000000000005 R8: 00002adf53ebf7a0 R9: 0000000080000000
> R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffaf514740
> R13: 0000000000000180 R14: 0000000000000000 R15: 00007fffaf5147a0
> ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b
>
> Except for one which has blocked with the following stack:
> PID: 7833 TASK: ffff88175b3ba8c0 CPU: 3 COMMAND: "sshd"
> #0 [ffff88001d983960] __schedule at ffffffff815ab152
> #1 [ffff88001d9839b0] schedule at ffffffff815ab76e
> #2 [ffff88001d9839d0] wait_transaction_locked at ffffffff81264265
> #3 [ffff88001d983a30] add_transaction_credits at ffffffff81264540
> #4 [ffff88001d983ab0] start_this_handle at ffffffff81264824
> #5 [ffff88001d983b60] jbd2__journal_start at ffffffff81265220
> #6 [ffff88001d983bc0] __ext4_journal_start_sb at ffffffff8124bb49
> #7 [ffff88001d983c10] ext4_dirty_inode at ffffffff8121da5c
> #8 [ffff88001d983c30] __mark_inode_dirty at ffffffff811be5c3
> #9 [ffff88001d983c70] generic_update_time at ffffffff811ae9c5
> #10 [ffff88001d983cb0] file_update_time at ffffffff811ae632
> #11 [ffff88001d983d10] __generic_file_write_iter at ffffffff811226c2
> #12 [ffff88001d983d70] ext4_file_write_iter at ffffffff81214814
> #13 [ffff88001d983e30] new_sync_write at ffffffff81194047
> #14 [ffff88001d983f00] vfs_write at ffffffff8119445e
> #15 [ffff88001d983f30] sys_write at ffffffff81194bda
> #16 [ffff88001d983f80] system_call_fastpath at ffffffff815afa89
> RIP: 00002adf56857790 RSP: 00007fffaf514788 RFLAGS: 00000206
> RAX: ffffffffffffffda RBX: ffffffff815afa89 RCX: 00002adf568657f7
> RDX: 0000000000000180 RSI: 00007fffaf5147a0 RDI: 0000000000000005
> RBP: 0000000000000005 R8: 00002adf53ebf7a0 R9: 0000000080000000
> R10: 0000000000000001 R11: 0000000000000246 R12: 00007fffaf514740
> R13: 0000000000000180 R14: 0000000000000000 R15: 00007fffaf5147a0
> ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b

This is something for the ext4 people to look at, but it seems this waiting
for the transaction is causing others to block on i_mutex. I would check
who is performing the transaction and why it is blocked or why it is
making slow progress.

> So basically what I have managed to determine so far is that "something"
> (this is the bit which I've yet to uncover, hopefully with your help)
> starves the file system of resources. Which causes processes to block in
> the jbd2_journal_start path. The other SSHD processes essentially block
> on the i_mutex of the /var/log/btmp (nothing unusual).
>
> So I guess the most interesting bit here is the stack of the rsync
> process, so it is blocking on trying to free some pages, yet the OOM
> hasn't triggered.

This one is most probably holding an i_mutex but it should be blocking
the transaction. It sees a page under writeback so it is waiting for it.
The assumption is that the writeback will finish in a reasonable time.
And maybe that is what happens but you just have too many pages under
writeback on the LRU of that memcg.

--
Michal Hocko
SUSE Labs

2015-06-29 09:38:28

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Mon 29-06-15 12:23:16, Nikolay Borisov wrote:
>
>
> On 06/29/2015 12:16 PM, Michal Hocko wrote:
> > On Mon 29-06-15 12:07:54, Nikolay Borisov wrote:
> >>
> >>
> >> On 06/29/2015 11:32 AM, Michal Hocko wrote:
> >>> On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
> >>>>
> >>>>
> >>>> On 06/25/2015 06:18 PM, Michal Hocko wrote:
> >>>>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
> >>>>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
> >>>>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> >>>>>>> [...]
> >>>>>>>> How would you advise to rectify such situation?
> >>>>>>>
> >>>>>>> As I've said. Check the oom victim traces and see if it is holding any
> >>>>>>> of those locks.
> >>>>>>
> >>>>>> As mentioned previously all OOM traces are identical to the one I've
> >>>>>> sent - OOM being called form the page fault path.
> >>>>>
> >>>>> By identical you mean that all of them kill the same task? Or just that
> >>>>> the path is same (which wouldn't be surprising as this is the only path
> >>>>> which triggers memcg oom killer)?
> >>>>
> >>>> The code path is the same, the tasks being killed are different
> >>>
> >>> Is the OOM killer triggered only for a singe memcg or others misbehave
> >>> as well?
> >>
> >> Generally OOM would be triggered for whichever memcg runs out of
> >> resources but so far I've only observed that the D state issue happens
> >> in a single containers.
> >
> > It is not clear whether it is the OOM memcg which has tasks in the D
> > state. Anyway I think it all smells like one memcg is throttling others
> > on another shared resource - journal in your case.
>
> Be that as it may, how do I find which cgroup is the culprit?

Ted has already described that. You have to check all the running tasks
and try to find which of them is doing the operation which blocks
others. Transaction commit sounds like the first one to check.
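
For instance, from a crash session like the ones you already pasted
(illustrative only; the jbd2 thread name depends on the device):

  crash> foreach UN bt
  crash> ps | grep jbd2
  crash> bt <pid of the jbd2/<device>-8 thread>

The first command dumps a backtrace for every task in D state, the second
locates the journal commit thread(s), and the last shows what that commit
thread is actually stuck on.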
--
Michal Hocko
SUSE Labs

2015-06-29 10:21:29

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure



On 06/29/2015 12:38 PM, Michal Hocko wrote:
> On Mon 29-06-15 12:23:16, Nikolay Borisov wrote:
>>
>>
>> On 06/29/2015 12:16 PM, Michal Hocko wrote:
>>> On Mon 29-06-15 12:07:54, Nikolay Borisov wrote:
>>>>
>>>>
>>>> On 06/29/2015 11:32 AM, Michal Hocko wrote:
>>>>> On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
>>>>>>
>>>>>>
>>>>>> On 06/25/2015 06:18 PM, Michal Hocko wrote:
>>>>>>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
>>>>>>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
>>>>>>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
>>>>>>>>> [...]
>>>>>>>>>> How would you advise to rectify such situation?
>>>>>>>>>
>>>>>>>>> As I've said. Check the oom victim traces and see if it is holding any
>>>>>>>>> of those locks.
>>>>>>>>
>>>>>>>> As mentioned previously all OOM traces are identical to the one I've
>>>>>>>> sent - OOM being called form the page fault path.
>>>>>>>
>>>>>>> By identical you mean that all of them kill the same task? Or just that
>>>>>>> the path is same (which wouldn't be surprising as this is the only path
>>>>>>> which triggers memcg oom killer)?
>>>>>>
>>>>>> The code path is the same, the tasks being killed are different
>>>>>
>>>>> Is the OOM killer triggered only for a singe memcg or others misbehave
>>>>> as well?
>>>>
>>>> Generally OOM would be triggered for whichever memcg runs out of
>>>> resources but so far I've only observed that the D state issue happens
>>>> in a single containers.
>>>
>>> It is not clear whether it is the OOM memcg which has tasks in the D
>>> state. Anyway I think it all smells like one memcg is throttling others
>>> on another shared resource - journal in your case.
>>
>> Be that as it may, how do I find which cgroup is the culprit?
>
> Ted has already described that. You have to check all the running tasks
> and try to find which of them is doing the operation which blocks
> others. Transaction commit sounds like the first one to check.

One other, fairly crucial detail - each and every container is on a
separate block device, meaning the journals for the different block
devices are not shared, since there is one journal per block device. I
guess this means that whatever is happening is more or less constrained
to that block device, and thus the possibility of different memcgs
competing for the journal can be eliminated?

>

2015-06-29 11:44:14

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Mon 29-06-15 13:21:25, Nikolay Borisov wrote:
>
>
> On 06/29/2015 12:38 PM, Michal Hocko wrote:
> > On Mon 29-06-15 12:23:16, Nikolay Borisov wrote:
> >>
> >>
> >> On 06/29/2015 12:16 PM, Michal Hocko wrote:
> >>> On Mon 29-06-15 12:07:54, Nikolay Borisov wrote:
> >>>>
> >>>>
> >>>> On 06/29/2015 11:32 AM, Michal Hocko wrote:
> >>>>> On Thu 25-06-15 18:27:10, Nikolay Borisov wrote:
> >>>>>>
> >>>>>>
> >>>>>> On 06/25/2015 06:18 PM, Michal Hocko wrote:
> >>>>>>> On Thu 25-06-15 17:34:22, Nikolay Borisov wrote:
> >>>>>>>> On 06/25/2015 05:05 PM, Michal Hocko wrote:
> >>>>>>>>> On Thu 25-06-15 16:49:43, Nikolay Borisov wrote:
> >>>>>>>>> [...]
> >>>>>>>>>> How would you advise to rectify such situation?
> >>>>>>>>>
> >>>>>>>>> As I've said. Check the oom victim traces and see if it is holding any
> >>>>>>>>> of those locks.
> >>>>>>>>
> >>>>>>>> As mentioned previously all OOM traces are identical to the one I've
> >>>>>>>> sent - OOM being called form the page fault path.
> >>>>>>>
> >>>>>>> By identical you mean that all of them kill the same task? Or just that
> >>>>>>> the path is same (which wouldn't be surprising as this is the only path
> >>>>>>> which triggers memcg oom killer)?
> >>>>>>
> >>>>>> The code path is the same, the tasks being killed are different
> >>>>>
> >>>>> Is the OOM killer triggered only for a singe memcg or others misbehave
> >>>>> as well?
> >>>>
> >>>> Generally OOM would be triggered for whichever memcg runs out of
> >>>> resources but so far I've only observed that the D state issue happens
> >>>> in a single containers.
> >>>
> >>> It is not clear whether it is the OOM memcg which has tasks in the D
> >>> state. Anyway I think it all smells like one memcg is throttling others
> >>> on another shared resource - journal in your case.
> >>
> >> Be that as it may, how do I find which cgroup is the culprit?
> >
> > Ted has already described that. You have to check all the running tasks
> > and try to find which of them is doing the operation which blocks
> > others. Transaction commit sounds like the first one to check.
>
> One other, fairly crucial detail - each and every container is on a
> separate block device, meaning the journals for different block devices
> is not being shared, since the journal is per-block device.

Yes, this is quite an important "detail". My understanding is that the
journal is per superblock, so they shouldn't interfere on the journal
commit. i_mutex and other per-inode/file locks shouldn't matter either.

> I guess this
> means that whatever is happening is more or less constrained to the
> block device and thus the possibility that different memcg competing for
> the journal can be eliminated?

They might still be competing over IO bandwidth AFAIU...
--
Michal Hocko
SUSE Labs

2015-06-30 01:52:36

by Dave Chinner

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > Today I observed the issue again, this time on a different server. What
> > is particularly strange is the fact that the OOM wasn't triggered for
> > the cgroup, whose tasks have entered D state. There were a couple of
> > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > the stacktrace for the rsync looks like:
> >
> > crash> set 18308
> > PID: 18308
> > COMMAND: "rsync"
> > TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
> > CPU: 1
> > STATE: TASK_UNINTERRUPTIBLE
> > crash> bt
> > PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
> > #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
>
> This is most probably wait_on_page_writeback because the reclaim has
> encountered a dirty page which is under writeback currently.

Yes, and looks at the caller path....

> > #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c

That's a potential self-deadlocking path, isn't it? i.e. the
writeback path has been entered, may hold pages locked in the
current bio being built (waiting for submission), then memory
reclaim has been entered while trying to map more contiguous blocks
to submit, and that waits on page IO to complete on a page in a bio
that ext4 hasn't yet submitted?

i.e. shouldn't ext4 be doing GFP_NOFS allocations all through this
writeback path?

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-06-30 03:03:16

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Tue, Jun 30, 2015 at 11:52:06AM +1000, Dave Chinner wrote:
> Yes, and looks at the caller path....
>
> > > #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
>
> That's a potential self deadlocking path, isn't it? i.e. the
> writeback path has been entered, may hold pages locked in the
> current bio being built (waiting for submission), then memory
> reclaim has been entered while trying to map more contiguous blocks
> to submit, and that waits on page IO to complete on a page in a bio
> that ext4 hasn't yet submitted?
>
> i.e. shouldn't ext4 be doing GFP_NOFS allocations all through this
> writeback path?

All of the direct allocations in fs/ext4/extents.c are using GFP_NOFS.
The problem is that we're calling sb_getblk(), which does _not_ set
GFP_NOFS. What we need to do is to add a sb_getblk_gfp() inline
function in include/linux/buffer_head.h, and use that in
fs/ext4/extents.c.
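
A minimal sketch of such a helper, mirroring the existing sb_getblk()
wrapper (the eventual patch may of course look slightly different):

static inline struct buffer_head *
sb_getblk_gfp(struct super_block *sb, sector_t block, gfp_t gfp)
{
        /* same as sb_getblk(), but the caller chooses the gfp mask */
        return __getblk_gfp(sb->s_bdev, block, sb->s_blocksize, gfp);
}

The ext4 extent tree code would then call it with __GFP_MOVABLE | GFP_NOFS
instead of using plain sb_getblk().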

Thanks for pointing that out! I'll create a patch as soon as I get
back from vacation.

- Ted

2015-06-30 06:35:27

by Angel Shtilianov

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On 06/30/2015 06:02 AM, Theodore Ts'o wrote:
> On Tue, Jun 30, 2015 at 11:52:06AM +1000, Dave Chinner wrote:
>> Yes, and looks at the caller path....
>>
>>>> #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
>>>> #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
>>>> #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
>>>> #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
>>>> #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
>>>> #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
>>>> #14 [ffff88177374b380] try_charge at ffffffff81189423
>>>> #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
>>>> #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
>>>> #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
>>>> #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
>>>> #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
>>>> #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
>>>> #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
>>>> #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
>>>> #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
>>>> #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
>>>> #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
>>>> #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
>>>> #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
>>>> #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
>>>> #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
>>>> #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
>>>> #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
>>>> #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
>>
>> That's a potential self deadlocking path, isn't it? i.e. the
>> writeback path has been entered, may hold pages locked in the
>> current bio being built (waiting for submission), then memory
>> reclaim has been entered while trying to map more contiguous blocks
>> to submit, and that waits on page IO to complete on a page in a bio
>> that ext4 hasn't yet submitted?
>>
>> i.e. shouldn't ext4 be doing GFP_NOFS allocations all through this
>> writeback path?
>
> All of the direct allocations in fs/ext4/extents.c are using GFP_NOFS.
> The problem is that we're calling sb_getblk(), which does _not_ set
> GFP_NOFS. What we need to do is to add a sb_getblk_gfp() inline
> function in include/linux/buffer_head.h, and use that in
> fs/ext4/extents.c.

I've already sent your suggestion as a separate patch series, but what
about just changing sb_getblk() to pass GFP_NOFS alongside __GFP_MOVABLE?
I see this function is purely FS related and is used a lot in various
filesystems. Wouldn't it be prudent to change it and eliminate possible
issues in the other filesystems as well (I haven't looked in detail at
the execution context in the other filesystems)?
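
In other words, something along these lines (just a sketch of the idea,
not a tested patch):

static inline struct buffer_head *
sb_getblk(struct super_block *sb, sector_t block)
{
        /* make every filesystem caller of this wrapper allocate with GFP_NOFS */
        return __getblk_gfp(sb->s_bdev, block, sb->s_blocksize,
                            GFP_NOFS | __GFP_MOVABLE);
}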

>
> Thanks for pointing that out! I'll create a patch as soon as I get
> back from vacation.
>
> - Ted
>

2015-06-30 12:30:37

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Tue 30-06-15 11:52:06, Dave Chinner wrote:
> On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> > On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > > Today I observed the issue again, this time on a different server. What
> > > is particularly strange is the fact that the OOM wasn't triggered for
> > > the cgroup, whose tasks have entered D state. There were a couple of
> > > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > > the stacktrace for the rsync looks like:
> > >
> > > crash> set 18308
> > > PID: 18308
> > > COMMAND: "rsync"
> > > TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
> > > CPU: 1
> > > STATE: TASK_UNINTERRUPTIBLE
> > > crash> bt
> > > PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
> > > #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > > #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > > #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > > #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > > #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > > #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > > #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > > #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
> >
> > This is most probably wait_on_page_writeback because the reclaim has
> > encountered a dirty page which is under writeback currently.
>
> Yes, and looks at the caller path....
>
> > > #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
>
> That's a potential self deadlocking path, isn't it? i.e. the
> writeback path has been entered, may hold pages locked in the
> current bio being built (waiting for submission), then memory
> reclaim has been entered while trying to map more contiguous blocks
> to submit, and that waits on page IO to complete on a page in a bio
> that ext4 hasn't yet submitted?

I am not sure I understand. Pages are marked writeback in
ext4_bio_write_page after all of this has been done already, and then
the IO is submitted, so the reclaim shouldn't block it. Or am I missing
something?

> i.e. shouldn't ext4 be doing GFP_NOFS allocations all through this
> writeback path?

GFP_NOFS wouldn't prevent shrink_page_list from waiting on the page
under writeback.

--
Michal Hocko
SUSE Labs

2015-06-30 14:32:00

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Tue 30-06-15 14:30:33, Michal Hocko wrote:
> On Tue 30-06-15 11:52:06, Dave Chinner wrote:
> > On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> > > On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > > > Today I observed the issue again, this time on a different server. What
> > > > is particularly strange is the fact that the OOM wasn't triggered for
> > > > the cgroup, whose tasks have entered D state. There were a couple of
> > > > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > > > the stacktrace for the rsync looks like:
> > > >
> > > > crash> set 18308
> > > > PID: 18308
> > > > COMMAND: "rsync"
> > > > TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
> > > > CPU: 1
> > > > STATE: TASK_UNINTERRUPTIBLE
> > > > crash> bt
> > > > PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
> > > > #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > > > #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > > > #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > > > #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > > > #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > > > #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > > > #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > > > #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
> > >
> > > This is most probably wait_on_page_writeback because the reclaim has
> > > encountered a dirty page which is under writeback currently.
> >
> > Yes, and looks at the caller path....
> >
> > > > #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > > #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
> >
> > That's a potential self deadlocking path, isn't it? i.e. the
> > writeback path has been entered, may hold pages locked in the
> > current bio being built (waiting for submission), then memory
> > reclaim has been entered while trying to map more contiguous blocks
> > to submit, and that waits on page IO to complete on a page in a bio
> > that ext4 hasn't yet submitted?
>
> I am not sure I understand. Pages are marked writeback in
> ext4_bio_write_page after all of this has been done already and then
> the IO is submitted and the reclaim shouldn't block it. Or am I missing
> something?

Thanks to Jan Kara for the off-list clarification. I misunderstood the
code. You are right, ext4 is really deadlock prone. The heuristic in the
reclaim code assumes that waiting on page_writeback is guaranteed to
make progress (from the memcg POV), and that is not true for ext4, as it
really charges after set_page_writeback (called from ext4_bio_write_page)
and before the page is really submitted (when the bio is full or
explicitly via ext4_io_submit). I thought that io_submit_add_bh submits
the page, but it doesn't necessarily do that.

Now the question is how to handle that. We cannot simply remove the
heuristic because that would reintroduce premature OOM issues. I guess
we can make it depend on __GFP_FS. This could lead to premature charge
failures, but that should be tolerable. I will cook up a patch.
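
Roughly along these lines in the writeback branch of shrink_page_list()
(only a sketch of the idea, not the actual patch; may_enter_fs already
reflects __GFP_FS there):

        if (PageWriteback(page) && !global_reclaim(sc) && !may_enter_fs) {
                /*
                 * A GFP_NOFS/GFP_NOIO charger may itself be holding the
                 * not-yet-submitted bio this page belongs to, so do not
                 * wait for the writeback; mark the page and revisit it
                 * on a later pass instead.
                 */
                SetPageReclaim(page);
                goto keep_locked;
        }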

Thanks!
--
Michal Hocko
SUSE Labs

2015-06-30 22:58:55

by Dave Chinner

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Tue, Jun 30, 2015 at 04:31:58PM +0200, Michal Hocko wrote:
> On Tue 30-06-15 14:30:33, Michal Hocko wrote:
> > On Tue 30-06-15 11:52:06, Dave Chinner wrote:
> > > On Mon, Jun 29, 2015 at 11:36:40AM +0200, Michal Hocko wrote:
> > > > On Mon 29-06-15 12:01:49, Nikolay Borisov wrote:
> > > > > Today I observed the issue again, this time on a different server. What
> > > > > is particularly strange is the fact that the OOM wasn't triggered for
> > > > > the cgroup, whose tasks have entered D state. There were a couple of
> > > > > SSHD processes and an RSYNC performing some backup tasks. Here is what
> > > > > the stacktrace for the rsync looks like:
> > > > >
> > > > > crash> set 18308
> > > > > PID: 18308
> > > > > COMMAND: "rsync"
> > > > > TASK: ffff883d7c9b0a30 [THREAD_INFO: ffff881773748000]
> > > > > CPU: 1
> > > > > STATE: TASK_UNINTERRUPTIBLE
> > > > > crash> bt
> > > > > PID: 18308 TASK: ffff883d7c9b0a30 CPU: 1 COMMAND: "rsync"
> > > > > #0 [ffff88177374ac60] __schedule at ffffffff815ab152
> > > > > #1 [ffff88177374acb0] schedule at ffffffff815ab76e
> > > > > #2 [ffff88177374acd0] schedule_timeout at ffffffff815ae5e5
> > > > > #3 [ffff88177374ad70] io_schedule_timeout at ffffffff815aad6a
> > > > > #4 [ffff88177374ada0] bit_wait_io at ffffffff815abfc6
> > > > > #5 [ffff88177374adb0] __wait_on_bit at ffffffff815abda5
> > > > > #6 [ffff88177374ae00] wait_on_page_bit at ffffffff8111fd4f
> > > > > #7 [ffff88177374ae50] shrink_page_list at ffffffff81135445
> > > >
> > > > This is most probably wait_on_page_writeback because the reclaim has
> > > > encountered a dirty page which is under writeback currently.
> > >
> > > Yes, and looks at the caller path....
> > >
> > > > > #8 [ffff88177374af50] shrink_inactive_list at ffffffff81135845
> > > > > #9 [ffff88177374b060] shrink_lruvec at ffffffff81135ead
> > > > > #10 [ffff88177374b150] shrink_zone at ffffffff811360c3
> > > > > #11 [ffff88177374b220] shrink_zones at ffffffff81136eff
> > > > > #12 [ffff88177374b2a0] do_try_to_free_pages at ffffffff8113712f
> > > > > #13 [ffff88177374b300] try_to_free_mem_cgroup_pages at ffffffff811372be
> > > > > #14 [ffff88177374b380] try_charge at ffffffff81189423
> > > > > #15 [ffff88177374b430] mem_cgroup_try_charge at ffffffff8118c6f5
> > > > > #16 [ffff88177374b470] __add_to_page_cache_locked at ffffffff8112137d
> > > > > #17 [ffff88177374b4e0] add_to_page_cache_lru at ffffffff81121618
> > > > > #18 [ffff88177374b510] pagecache_get_page at ffffffff8112170b
> > > > > #19 [ffff88177374b560] grow_dev_page at ffffffff811c8297
> > > > > #20 [ffff88177374b5c0] __getblk_slow at ffffffff811c91d6
> > > > > #21 [ffff88177374b600] __getblk_gfp at ffffffff811c92c1
> > > > > #22 [ffff88177374b630] ext4_ext_grow_indepth at ffffffff8124565c
> > > > > #23 [ffff88177374b690] ext4_ext_create_new_leaf at ffffffff81246ca8
> > > > > #24 [ffff88177374b6e0] ext4_ext_insert_extent at ffffffff81246f09
> > > > > #25 [ffff88177374b750] ext4_ext_map_blocks at ffffffff8124a848
> > > > > #26 [ffff88177374b870] ext4_map_blocks at ffffffff8121a5b7
> > > > > #27 [ffff88177374b910] mpage_map_one_extent at ffffffff8121b1fa
> > > > > #28 [ffff88177374b950] mpage_map_and_submit_extent at ffffffff8121f07b
> > > > > #29 [ffff88177374b9b0] ext4_writepages at ffffffff8121f6d5
> > > > > #30 [ffff88177374bb20] do_writepages at ffffffff8112c490
> > > > > #31 [ffff88177374bb30] __filemap_fdatawrite_range at ffffffff81120199
> > > > > #32 [ffff88177374bb80] filemap_flush at ffffffff8112041c
> > >
> > > That's a potential self deadlocking path, isn't it? i.e. the
> > > writeback path has been entered, may hold pages locked in the
> > > current bio being built (waiting for submission), then memory
> > > reclaim has been entered while trying to map more contiguous blocks
> > > to submit, and that waits on page IO to complete on a page in a bio
> > > that ext4 hasn't yet submitted?
> >
> > I am not sure I understand. Pages are marked writeback in
> > ext4_bio_write_page after all of this has been done already and then
> > the IO is submitted and the reclaim shouldn't block it. Or am I missing
> > something?
>
> Thanks to Jan Kara for the off list clarification. I misunderstood the
> code. You are right ext4 is really deadlock prone. The heuristic in the
> reclaim code assumes that waiting on page_writeback is guaranteed to
> make a progress (from memcg POV) and that is not true for ext4 as it

*blink*

/me re-reads again

That assumption is fundamentally broken. Filesystems use GFP_NOFS
because the filesystem holds resources that can prevent memory
reclaim from making forward progress if it re-enters the filesystem or
blocks on anything filesystem related. memcg does not change that,
and I'm kinda scared to learn that memcg plays fast and loose like
this.

For example: IO completion might require unwritten extent conversion
which executes filesystem transactions and GFP_NOFS allocations. The
writeback flag on the pages can not be cleared until unwritten
extent conversion completes. Hence memory reclaim cannot wait on
page writeback to complete in GFP_NOFS context because it is not
safe to do so, memcg reclaim or otherwise.

> really charge after set_page_writeback (called from ext4_bio_write_page)
> and before the page is really submitted (when the bio is full or
> explicitly via ext4_io_submit). I thought that io_submit_add_bh submits
> the page but it doesn't do that necessarily.

XFS does exactly the same thing - the underlying algorithm ext4 uses
to build large bios efficiently was copied from XFS. And FWIW XFS has
been using this algorithm since 2.6.15....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2015-07-01 06:10:18

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Wed 01-07-15 08:58:51, Dave Chinner wrote:
[...]
> *blink*
>
> /me re-reads again
>
> That assumption is fundamentally broken. Filesystems use GFP_NOFS
> because the filesystem holds resources that can prevent memory
> reclaim making forwards progress if it re-enters the filesystem or
> blocks on anything filesystem related. memcg does not change that,
> and I'm kinda scared to learn that memcg plays fast and loose like
> this.
>
> For example: IO completion might require unwritten extent conversion
> which executes filesystem transactions and GFP_NOFS allocations. The
> writeback flag on the pages can not be cleared until unwritten
> extent conversion completes. Hence memory reclaim cannot wait on
> page writeback to complete in GFP_NOFS context because it is not
> safe to do so, memcg reclaim or otherwise.

Thanks for the clarification.

> > really charge after set_page_writeback (called from ext4_bio_write_page)
> > and before the page is really submitted (when the bio is full or
> > explicitly via ext4_io_submit). I thought that io_submit_add_bh submits
> > the page but it doesn't do that necessarily.
>
> XFS does exactly the same thing - the underlying alogrithm ext4 uses
> to build large bios efficiently was copied from XFS. And FWIW XFS has
> been using this algorithm since 2.6.15....

OK, I will mark the patch for stable then.

Thanks!
--
Michal Hocko
SUSE Labs

2015-07-01 11:13:56

by Theodore Ts'o

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Wed, Jul 01, 2015 at 08:10:14AM +0200, Michal Hocko wrote:
> On Wed 01-07-15 08:58:51, Dave Chinner wrote:
> [...]
> > *blink*
> >
> > /me re-reads again
> >
> > That assumption is fundamentally broken. Filesystems use GFP_NOFS
> > because the filesystem holds resources that can prevent memory
> > reclaim making forwards progress if it re-enters the filesystem or
> > blocks on anything filesystem related. memcg does not change that,
> > and I'm kinda scared to learn that memcg plays fast and loose like
> > this.
> >
> > For example: IO completion might require unwritten extent conversion
> > which executes filesystem transactions and GFP_NOFS allocations. The
> > writeback flag on the pages can not be cleared until unwritten
> > extent conversion completes. Hence memory reclaim cannot wait on
> > page writeback to complete in GFP_NOFS context because it is not
> > safe to do so, memcg reclaim or otherwise.
>
> Thanks for the clarification.

Perhaps we need to make the documentation a bit more explicit?
All that is stated in include/linux/slab.h is:

* %GFP_NOIO - Do not do any I/O at all while trying to get memory.
*
* %GFP_NOFS - Do not make any fs calls while trying to get memory.

I thought this was obvious, but these flags are used by code which is in
the I/O or FS paths, and so it's always possible that they are trying
to write back the page which you decide to block on when trying to
do the memory allocation, at which point, *boom*, deadlock.

So it's not just "do not make any FS or I/O calls", but also "the mm
layer must not wait for any FS or I/O operations to complete,
since the operation you block on might be the one they were in the
middle of trying to complete --- or they may be holding a lock at the
time when they were trying to do a memory allocation which blocks the
I/O or FS operation from completing".
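
For example, the comments could be expanded along these lines (one
possible wording only, not a patch):

 * %GFP_NOIO - Do not do any I/O at all while trying to get memory, and do
 *   not block waiting for any I/O to complete: the caller may be the one
 *   responsible for submitting or completing that I/O.
 *
 * %GFP_NOFS - Do not make any fs calls while trying to get memory, and do
 *   not block waiting for any FS or I/O operation (e.g. page writeback) to
 *   complete: the caller may hold locks the operation needs, or may itself
 *   be in the middle of performing it.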

- Ted

2015-07-01 14:21:59

by Michal Hocko

[permalink] [raw]
Subject: Re: Lockup in wait_transaction_locked under memory pressure

On Wed 01-07-15 07:13:55, Theodore Ts'o wrote:
[...]
> Perhaps we need to make the documentation a bit more explicit?
> All which is stated in include/slab.h:
>
> * %GFP_NOIO - Do not do any I/O at all while trying to get memory.
> *
> * %GFP_NOFS - Do not make any fs calls while trying to get memory.
>
> I thought this was obvious, but these flags are used by code which in
> the I/O or FS paths, and so it's always possible that they are trying
> to write back the page which you decide to blocking on when trying to
> do the memory allocation, at which point, *boom*, deadlock.
>
> So it's just not "do not make any FS or I/O calls", but also "the mm
> layer must not not wait for any FS or I/O operations from completing,
> since the operation you block on might be the one they were in the
> middle of trying to complete --- or they may be holding a lock at the
> time when they were trying to do a memory allocation which blocks the
> I/O or FS operation from completing".

Sure, any refinement here would be more than welcome.
--
Michal Hocko
SUSE Labs