2013-07-17 15:32:36

by Dave Jones

Subject: hugepage related lockdep trace.

[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138] [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138] Possible unsafe locking scenario:

[128095.480138] CPU0
[128095.480138] ----
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138] <Interrupt>
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138]
*** DEADLOCK ***

[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138] [<c112e531>] kswapd+0x517/0xa75
[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138] [<c10661ff>] kthread+0xa8/0xaa
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63


2013-07-18 00:09:06

by Minchan Kim

Subject: Re: hugepage related lockdep trace.

Ccing people get_maintainer says.

On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> [128095.470960] =================================
> [128095.471315] [ INFO: inconsistent lock state ]
> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> [128095.472156] ---------------------------------
> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
> [128095.480138] [<c15b927c>] error_code+0x6c/0x74
> [128095.480138] irq event stamp: 3136917
> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> [128095.480138]
> other info that might help us debug this:
> [128095.480138] Possible unsafe locking scenario:
>
> [128095.480138] CPU0
> [128095.480138] ----
> [128095.480138] lock(&mapping->i_mmap_mutex);
> [128095.480138] <Interrupt>
> [128095.480138] lock(&mapping->i_mmap_mutex);
> [128095.480138]
> *** DEADLOCK ***
>
> [128095.480138] no locks held by kswapd0/49.
> [128095.480138]
> stack backtrace:
> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> [128095.480138] Call Trace:
> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
> [128095.480138] [<c112e531>] kswapd+0x517/0xa75
> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa
> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63

IMHO, it's a false positive: here i_mmap_mutex is taken by kswapd, while
the acquisition in the middle of the fault path can never happen in kswapd
context.

It seems the reclaim-over-fs lockdep checking isn't smart enough to
distinguish between background and direct reclaim.

Let's wait for others' opinions.

>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-07-18 17:42:44

by Aneesh Kumar K.V

Subject: Re: hugepage related lockdep trace.

Minchan Kim <[email protected]> writes:

> Ccing people get_maintainer says.
>
> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>> [128095.470960] =================================
>> [128095.471315] [ INFO: inconsistent lock state ]
>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>> [128095.472156] ---------------------------------
>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
>> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
>> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
>> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
>> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
>> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
>> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
>> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
>> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
>> [128095.480138] [<c15b927c>] error_code+0x6c/0x74
>> [128095.480138] irq event stamp: 3136917
>> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>> [128095.480138]
>> other info that might help us debug this:
>> [128095.480138] Possible unsafe locking scenario:
>>
>> [128095.480138] CPU0
>> [128095.480138] ----
>> [128095.480138] lock(&mapping->i_mmap_mutex);
>> [128095.480138] <Interrupt>
>> [128095.480138] lock(&mapping->i_mmap_mutex);
>> [128095.480138]
>> *** DEADLOCK ***
>>
>> [128095.480138] no locks held by kswapd0/49.
>> [128095.480138]
>> stack backtrace:
>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
>> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>> [128095.480138] Call Trace:
>> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
>> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
>> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
>> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
>> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
>> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
>> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
>> [128095.480138] [<c112e531>] kswapd+0x517/0xa75
>> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa
>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63
>
> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> while one in the middle of fault path could be never on kswapd context.
>
> It seems lockdep for reclaim-over-fs isn't enough smart to identify
> between background and direct reclaim.
>
> Wait for other's opinion.

Is that reasoning correct? We may not deadlock because hugetlb pages
cannot be reclaimed, so the fault path in hugetlb won't end up
reclaiming pages from the same inode. But the report is correct, right?


Looking at the hugetlb code we have in huge_pmd_share

out:
	pte = (pte_t *)pmd_alloc(mm, pud, addr);
	mutex_unlock(&mapping->i_mmap_mutex);
	return pte;

I guess we should move that pmd_alloc outside the i_mmap_mutex critical
section. Otherwise that pmd_alloc can result in a reclaim which can call
shrink_page_list?

Something like ?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..2cb1be3 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 		put_page(virt_to_page(spte));
 	spin_unlock(&mm->page_table_lock);
 out:
-	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	mutex_unlock(&mapping->i_mmap_mutex);
+	pte = (pte_t *)pmd_alloc(mm, pud, addr);
 	return pte;
 }

-aneesh


2013-07-19 00:13:09

by Minchan Kim

Subject: Re: hugepage related lockdep trace.

On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> Minchan Kim <[email protected]> writes:
>
> > Ccing people get_maintainer says.
> >
> > On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >> [128095.470960] =================================
> >> [128095.471315] [ INFO: inconsistent lock state ]
> >> [128095.471660] 3.11.0-rc1+ #9 Not tainted
> >> [128095.472156] ---------------------------------
> >> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
> >> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
> >> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
> >> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
> >> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
> >> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
> >> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
> >> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
> >> [128095.480138] [<c15b927c>] error_code+0x6c/0x74
> >> [128095.480138] irq event stamp: 3136917
> >> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >> [128095.480138]
> >> other info that might help us debug this:
> >> [128095.480138] Possible unsafe locking scenario:
> >>
> >> [128095.480138] CPU0
> >> [128095.480138] ----
> >> [128095.480138] lock(&mapping->i_mmap_mutex);
> >> [128095.480138] <Interrupt>
> >> [128095.480138] lock(&mapping->i_mmap_mutex);
> >> [128095.480138]
> >> *** DEADLOCK ***
> >>
> >> [128095.480138] no locks held by kswapd0/49.
> >> [128095.480138]
> >> stack backtrace:
> >> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
> >> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >> [128095.480138] Call Trace:
> >> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
> >> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
> >> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
> >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
> >> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
> >> [128095.480138] [<c112e531>] kswapd+0x517/0xa75
> >> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa
> >> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63
> >
> > IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> > while one in the middle of fault path could be never on kswapd context.
> >
> > It seems lockdep for reclaim-over-fs isn't enough smart to identify
> > between background and direct reclaim.
> >
> > Wait for other's opinion.
>
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
>
>
> Looking at the hugetlb code we have in huge_pmd_share
>
> out:
> pte = (pte_t *)pmd_alloc(mm, pud, addr);
> mutex_unlock(&mapping->i_mmap_mutex);
> return pte;
>
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?

True. Sorry, I didn't review the code carefully; I have been rather paranoid
about reclaim-over-fs because of some internal work. :(

>
> Something like ?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> put_page(virt_to_page(spte));
> spin_unlock(&mm->page_table_lock);
> out:
> - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> mutex_unlock(&mapping->i_mmap_mutex);
> + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> return pte;

I am not well versed in hugetlb, but I am not sure this doesn't break eb48c071.
Michal?


> }
>
> -aneesh
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Kind regards,
Minchan Kim

2013-07-19 02:08:07

by Hillf Danton

Subject: Re: hugepage related lockdep trace.

On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
<[email protected]> wrote:
> Minchan Kim <[email protected]> writes:
>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>> while one in the middle of fault path could be never on kswapd context.
>>
>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>> between background and direct reclaim.
>>
>> Wait for other's opinion.
>
> Is that reasoning correct ?. We may not deadlock because hugetlb pages
> cannot be reclaimed. So the fault path in hugetlb won't end up
> reclaiming pages from same inode. But the report is correct right ?
>
>
> Looking at the hugetlb code we have in huge_pmd_share
>
> out:
> pte = (pte_t *)pmd_alloc(mm, pud, addr);
> mutex_unlock(&mapping->i_mmap_mutex);
> return pte;
>
> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>
Hm, can huge pages be reclaimed, say by kswapd currently?

> Something like ?
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 83aff0a..2cb1be3 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> put_page(virt_to_page(spte));
> spin_unlock(&mm->page_table_lock);
> out:
> - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> mutex_unlock(&mapping->i_mmap_mutex);
> + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> return pte;
> }
>

2013-07-19 03:20:12

by Aneesh Kumar K.V

Subject: Re: hugepage related lockdep trace.

Hillf Danton <[email protected]> writes:

> On Fri, Jul 19, 2013 at 1:42 AM, Aneesh Kumar K.V
> <[email protected]> wrote:
>> Minchan Kim <[email protected]> writes:
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>>
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> mutex_unlock(&mapping->i_mmap_mutex);
>> return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
>>
> Hm, can huge pages be reclaimed, say by kswapd currently?

No, we don't reclaim hugetlb pages.

-aneesh

2013-07-23 07:24:29

by Hush Bensen

Subject: Re: hugepage related lockdep trace.

On 07/18/2013 06:13 PM, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
>> Minchan Kim <[email protected]> writes:
>>
>>> Ccing people get_maintainer says.
>>>
>>> On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
>>>> [128095.470960] =================================
>>>> [128095.471315] [ INFO: inconsistent lock state ]
>>>> [128095.471660] 3.11.0-rc1+ #9 Not tainted
>>>> [128095.472156] ---------------------------------
>>>> [128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
>>>> [128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
>>>> [128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.475128] {RECLAIM_FS-ON-W} state was registered at:
>>>> [128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
>>>> [128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
>>>> [128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
>>>> [128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
>>>> [128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
>>>> [128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
>>>> [128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
>>>> [128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
>>>> [128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
>>>> [128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
>>>> [128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
>>>> [128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
>>>> [128095.480138] [<c15b927c>] error_code+0x6c/0x74
>>>> [128095.480138] irq event stamp: 3136917
>>>> [128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
>>>> [128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
>>>> [128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
>>>> [128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
>>>> [128095.480138]
>>>> other info that might help us debug this:
>>>> [128095.480138] Possible unsafe locking scenario:
>>>>
>>>> [128095.480138] CPU0
>>>> [128095.480138] ----
>>>> [128095.480138] lock(&mapping->i_mmap_mutex);
>>>> [128095.480138] <Interrupt>
>>>> [128095.480138] lock(&mapping->i_mmap_mutex);
>>>> [128095.480138]
>>>> *** DEADLOCK ***
>>>>
>>>> [128095.480138] no locks held by kswapd0/49.
>>>> [128095.480138]
>>>> stack backtrace:
>>>> [128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
>>>> [128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
>>>> [128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
>>>> [128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
>>>> [128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
>>>> [128095.480138] Call Trace:
>>>> [128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
>>>> [128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
>>>> [128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
>>>> [128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
>>>> [128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
>>>> [128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
>>>> [128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
>>>> [128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
>>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
>>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
>>>> [128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
>>>> [128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
>>>> [128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
>>>> [128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
>>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
>>>> [128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
>>>> [128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
>>>> [128095.480138] [<c112e531>] kswapd+0x517/0xa75
>>>> [128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
>>>> [128095.480138] [<c10661ff>] kthread+0xa8/0xaa
>>>> [128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
>>>> [128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
>>>> [128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63
>>> IMHO, it's a false positive because i_mmap_mutex was held by kswapd
>>> while one in the middle of fault path could be never on kswapd context.
>>>
>>> It seems lockdep for reclaim-over-fs isn't enough smart to identify
>>> between background and direct reclaim.
>>>
>>> Wait for other's opinion.
>> Is that reasoning correct ?. We may not deadlock because hugetlb pages
>> cannot be reclaimed. So the fault path in hugetlb won't end up
>> reclaiming pages from same inode. But the report is correct right ?
>>
>>
>> Looking at the hugetlb code we have in huge_pmd_share
>>
>> out:
>> pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> mutex_unlock(&mapping->i_mmap_mutex);
>> return pte;
>>
>> I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
>> that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> True. Sorry for that I didn't review the code carefully and I was very paranoid
> in reclaim-over-fs due to internal works. :(

Could you explain more about reclaim-over-fs stuff?

>
>> Something like ?
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 83aff0a..2cb1be3 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
>> put_page(virt_to_page(spte));
>> spin_unlock(&mm->page_table_lock);
>> out:
>> - pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> mutex_unlock(&mapping->i_mmap_mutex);
>> + pte = (pte_t *)pmd_alloc(mm, pud, addr);
>> return pte;
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?
>
>
>> }
>>
>> -aneesh
>>
>>
>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
>> Please read the FAQ at http://www.tux.org/lkml/

2013-07-23 14:01:24

by Michal Hocko

Subject: Re: hugepage related lockdep trace.

On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
[...]
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index 83aff0a..2cb1be3 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > put_page(virt_to_page(spte));
> > spin_unlock(&mm->page_table_lock);
> > out:
> > - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > mutex_unlock(&mapping->i_mmap_mutex);
> > + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > return pte;
>
> I am blind on hugetlb but not sure it doesn't break eb48c071.
> Michal?

Well, it has been some time since I debugged the race and the details have
faded in the meantime. But this part of the changelog suggests that this
would indeed break the fix:
"
This patch addresses the issue by moving pmd_alloc into huge_pmd_share
which guarantees that the shared pud is populated in the same critical
section as pmd. This also means that huge_pte_offset test in
huge_pmd_share is serialized correctly now which in turn means that the
success of the sharing will be higher as the racing tasks see the pud
and pmd populated together.
"

Besides that, I fail to see how moving pmd_alloc down changes anything.
Even if pmd_alloc triggered reclaim, we could not trip over the same
i_mmap_mutex, as hugetlb pages are not reclaimable because they are not
on the LRU.
--
Michal Hocko
SUSE Labs

2013-07-24 02:44:32

by Minchan Kim

Subject: Re: hugepage related lockdep trace.

On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> [...]
> > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > index 83aff0a..2cb1be3 100644
> > > --- a/mm/hugetlb.c
> > > +++ b/mm/hugetlb.c
> > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > put_page(virt_to_page(spte));
> > > spin_unlock(&mm->page_table_lock);
> > > out:
> > > - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > mutex_unlock(&mapping->i_mmap_mutex);
> > > + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > return pte;
> >
> > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > Michal?
>
> Well, it is some time since I debugged the race and all the details
> vanished in the meantime. But this part of the changelog suggests that
> this indeed breaks the fix:
> "
> This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> which guarantees that the shared pud is populated in the same critical
> section as pmd. This also means that huge_pte_offset test in
> huge_pmd_share is serialized correctly now which in turn means that the
> success of the sharing will be higher as the racing tasks see the pud
> and pmd populated together.
> "
>
> Besides that I fail to see how moving pmd_alloc down changes anything.
> Even if pmd_alloc triggered reclaim then we cannot trip over the same
> i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> on the LRU.

I thought we could map one part of a binary with normal pages and another part
with MAP_HUGETLB, so that an address space could have both normal pages and
HugeTLB pages. Okay, that is impossible, so the pages here are HugeTLB pages,
which are never on the LRU, and the lockdep warning above is a pure false
positive.
The best solution would be to avoid calling pmd_alloc while holding
i_mmap_mutex, but we need that ordering to fix eb48c071, so how about this if
we can't come up with a better idea?


diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 83aff0a..e7c3a15 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -3240,7 +3240,15 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
 	if (!vma_shareable(vma, addr))
 		return (pte_t *)pmd_alloc(mm, pud, addr);
 
-	mutex_lock(&mapping->i_mmap_mutex);
+	/*
+	 * This annotation silences the lockdep warning caused by i_mmap_mutex:
+	 * the pmd_alloc below allocates with GFP_KERNEL while holding
+	 * i_mmap_mutex, so it could enter direct reclaim, where rmap takes
+	 * i_mmap_mutex again.  That is no problem for hugetlb because hugetlb
+	 * pages never live on the LRU, so this is a false positive.  Better to
+	 * avoid pmd_alloc under i_mmap_mutex than to use a nesting annotation.
+	 */
+	mutex_lock_nested(&mapping->i_mmap_mutex, SINGLE_DEPTH_NESTING);
 	vma_interval_tree_foreach(svma, &mapping->i_mmap, idx, idx) {
 		if (svma == vma)
 			continue;

> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-07-24 02:57:14

by Minchan Kim

Subject: Re: hugepage related lockdep trace.

On Tue, Jul 23, 2013 at 01:24:17AM -0600, Hush Bensen wrote:
> On 07/18/2013 06:13 PM, Minchan Kim wrote:
> >On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> >>Minchan Kim <[email protected]> writes:
> >>
> >>>Ccing people get_maintainer says.
> >>>
> >>>On Wed, Jul 17, 2013 at 11:32:23AM -0400, Dave Jones wrote:
> >>>>[128095.470960] =================================
> >>>>[128095.471315] [ INFO: inconsistent lock state ]
> >>>>[128095.471660] 3.11.0-rc1+ #9 Not tainted
> >>>>[128095.472156] ---------------------------------
> >>>>[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
> >>>>[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
> >>>>[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
> >>>>[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
> >>>>[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
> >>>>[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
> >>>>[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
> >>>>[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
> >>>>[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
> >>>>[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
> >>>>[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
> >>>>[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
> >>>>[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
> >>>>[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
> >>>>[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
> >>>>[128095.480138] [<c15b927c>] error_code+0x6c/0x74
> >>>>[128095.480138] irq event stamp: 3136917
> >>>>[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
> >>>>[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
> >>>>[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
> >>>>[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
> >>>>[128095.480138]
> >>>>other info that might help us debug this:
> >>>>[128095.480138] Possible unsafe locking scenario:
> >>>>
> >>>>[128095.480138] CPU0
> >>>>[128095.480138] ----
> >>>>[128095.480138] lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138] <Interrupt>
> >>>>[128095.480138] lock(&mapping->i_mmap_mutex);
> >>>>[128095.480138]
> >>>> *** DEADLOCK ***
> >>>>
> >>>>[128095.480138] no locks held by kswapd0/49.
> >>>>[128095.480138]
> >>>>stack backtrace:
> >>>>[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
> >>>>[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
> >>>>[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
> >>>>[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
> >>>>[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
> >>>>[128095.480138] Call Trace:
> >>>>[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
> >>>>[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
> >>>>[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
> >>>>[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
> >>>>[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
> >>>>[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
> >>>>[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
> >>>>[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
> >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
> >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
> >>>>[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
> >>>>[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
> >>>>[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
> >>>>[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
> >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
> >>>>[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
> >>>>[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
> >>>>[128095.480138] [<c112e531>] kswapd+0x517/0xa75
> >>>>[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
> >>>>[128095.480138] [<c10661ff>] kthread+0xa8/0xaa
> >>>>[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
> >>>>[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
> >>>>[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63
> >>>IMHO, it's a false positive because i_mmap_mutex was held by kswapd
> >>>while one in the middle of fault path could be never on kswapd context.
> >>>
> >>>It seems lockdep for reclaim-over-fs isn't enough smart to identify
> >>>between background and direct reclaim.
> >>>
> >>>Wait for other's opinion.
> >>Is that reasoning correct ?. We may not deadlock because hugetlb pages
> >>cannot be reclaimed. So the fault path in hugetlb won't end up
> >>reclaiming pages from same inode. But the report is correct right ?
> >>
> >>
> >>Looking at the hugetlb code we have in huge_pmd_share
> >>
> >>out:
> >> pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >> mutex_unlock(&mapping->i_mmap_mutex);
> >> return pte;
> >>
> >>I guess we should move that pmd_alloc outside i_mmap_mutex. Otherwise
> >>that pmd_alloc can result in a reclaim which can call shrink_page_list ?
> >True. Sorry for that I didn't review the code carefully and I was very paranoid
> >in reclaim-over-fs due to internal works. :(
>
> Could you explain more about reclaim-over-fs stuff?

It's the lockdep machinery for catching reclaim deadlocks, which could happen,
for example, as follows:

fs_write
    lock(fs_lock);
    alloc memory
        reclaim path
            pageout
                fs_write
                    lock(fs_lock)
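
To make that concrete, below is a minimal C sketch of the same pattern; the
my_fs_lock, my_fs_write() and my_fs_writepage() names are made up purely for
illustration and are not from any real filesystem. An allocation made with
__GFP_FS while a filesystem lock is held may enter direct reclaim, and
reclaim's pageout path can then call back into the filesystem and try to take
the same lock:

/*
 * Minimal sketch of the reclaim-over-fs deadlock that lockdep's RECLAIM_FS
 * tracking is meant to catch.  All names here are hypothetical.
 */
static DEFINE_MUTEX(my_fs_lock);

static ssize_t my_fs_write(struct file *file, const char __user *buf,
			   size_t len, loff_t *ppos)
{
	void *tmp;

	mutex_lock(&my_fs_lock);
	/*
	 * GFP_KERNEL includes __GFP_FS, so under memory pressure this
	 * allocation may enter direct reclaim and end up in the
	 * filesystem's writeback path via pageout().
	 */
	tmp = kmalloc(len, GFP_KERNEL);
	kfree(tmp);
	mutex_unlock(&my_fs_lock);
	return len;
}

/* ->writepage, reached from the reclaim path via pageout() */
static int my_fs_writepage(struct page *page, struct writeback_control *wbc)
{
	/* deadlocks if reclaim was entered from my_fs_write()'s allocation */
	mutex_lock(&my_fs_lock);
	unlock_page(page);
	mutex_unlock(&my_fs_lock);
	return 0;
}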



>
> >
> >>Something like ?
> >>
> >>diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> >>index 83aff0a..2cb1be3 100644
> >>--- a/mm/hugetlb.c
> >>+++ b/mm/hugetlb.c
> >>@@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> >> put_page(virt_to_page(spte));
> >> spin_unlock(&mm->page_table_lock);
> >> out:
> >>- pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >> mutex_unlock(&mapping->i_mmap_mutex);
> >>+ pte = (pte_t *)pmd_alloc(mm, pud, addr);
> >> return pte;
> >I am blind on hugetlb but not sure it doesn't break eb48c071.
> >Michal?
> >
> >
> >> }
> >>-aneesh
> >>
> >>
> >>
> >>--
> >>To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> >>the body of a message to [email protected]
> >>More majordomo info at http://vger.kernel.org/majordomo-info.html
> >>Please read the FAQ at http://www.tux.org/lkml/
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-07-25 13:30:44

by Michal Hocko

Subject: Re: hugepage related lockdep trace.

On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > [...]
> > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > index 83aff0a..2cb1be3 100644
> > > > --- a/mm/hugetlb.c
> > > > +++ b/mm/hugetlb.c
> > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > put_page(virt_to_page(spte));
> > > > spin_unlock(&mm->page_table_lock);
> > > > out:
> > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > mutex_unlock(&mapping->i_mmap_mutex);
> > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > return pte;
> > >
> > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > Michal?
> >
> > Well, it is some time since I debugged the race and all the details
> > vanished in the meantime. But this part of the changelog suggests that
> > this indeed breaks the fix:
> > "
> > This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > which guarantees that the shared pud is populated in the same critical
> > section as pmd. This also means that huge_pte_offset test in
> > huge_pmd_share is serialized correctly now which in turn means that the
> > success of the sharing will be higher as the racing tasks see the pud
> > and pmd populated together.
> > "
> >
> > Besides that I fail to see how moving pmd_alloc down changes anything.
> > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > on the LRU.
>
> I thought we could map some part of binary with normal page and other part
> of the one with MAP_HUGETLB so that a address space could have both normal
> page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> Then, above lockdep warning is totally false positive.
> Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> to fix eb48c071 so how about this if we couldn't have a better idea?

Shouldn't we rather use a hugetlb-specific lock_class_key? I am not
that familiar with lockdep, but something like the below should do the
trick?

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..40a61f6 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
 	return inode;
 }
 
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
 static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 					struct inode *dir,
 					umode_t mode, dev_t dev)
@@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
 		struct hugetlbfs_inode_info *info;
 		inode->i_ino = get_next_ino();
 		inode_init_owner(inode, dir, mode);
+		lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
 		inode->i_mapping->a_ops = &hugetlbfs_aops;
 		inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
 		inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
--
Michal Hocko
SUSE Labs

2013-07-29 08:24:39

by Minchan Kim

Subject: Re: hugepage related lockdep trace.

Hi Michal,

On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > [...]
> > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > index 83aff0a..2cb1be3 100644
> > > > > --- a/mm/hugetlb.c
> > > > > +++ b/mm/hugetlb.c
> > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > > put_page(virt_to_page(spte));
> > > > > spin_unlock(&mm->page_table_lock);
> > > > > out:
> > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > mutex_unlock(&mapping->i_mmap_mutex);
> > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > return pte;
> > > >
> > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > Michal?
> > >
> > > Well, it is some time since I debugged the race and all the details
> > > vanished in the meantime. But this part of the changelog suggests that
> > > this indeed breaks the fix:
> > > "
> > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > > which guarantees that the shared pud is populated in the same critical
> > > section as pmd. This also means that huge_pte_offset test in
> > > huge_pmd_share is serialized correctly now which in turn means that the
> > > success of the sharing will be higher as the racing tasks see the pud
> > > and pmd populated together.
> > > "
> > >
> > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > on the LRU.
> >
> > I thought we could map some part of binary with normal page and other part
> > of the one with MAP_HUGETLB so that a address space could have both normal
> > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > Then, above lockdep warning is totally false positive.
> > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > to fix eb48c071 so how about this if we couldn't have a better idea?
>
> Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> familiar with lockdep much but something like bellow should do the
> trick?

Looks good to me.
Could you resend it as a formal patch, Ccing Peter, so that Dave can confirm it?
Below is just a nitpick.

>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..40a61f6 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,8 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> return inode;
> }
>
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;

Let's add a comment explaining why we need this special class.
How about this?

/*
 * Now, the reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it
 * could hold a normal inode->i_mmap_mutex, so this annotation avoids a
 * lockdep splat.
 */
static struct lock_class_key hugetlbctx_mutex;


> +
> static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> struct inode *dir,
> umode_t mode, dev_t dev)
> @@ -474,6 +476,7 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> struct hugetlbfs_inode_info *info;
> inode->i_ino = get_next_ino();
> inode_init_owner(inode, dir, mode);
> + lockdep_set_class(&inode->i_mapping->i_mmap_mutex, &hugetlbfs_i_mmap_mutex_key);
> inode->i_mapping->a_ops = &hugetlbfs_aops;
> inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
> inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> --
> Michal Hocko
> SUSE Labs
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

--
Kind regards,
Minchan Kim

2013-07-29 14:53:12

by Michal Hocko

Subject: Re: hugepage related lockdep trace.

On Mon 29-07-13 17:24:53, Minchan Kim wrote:
> Hi Michal,
>
> On Thu, Jul 25, 2013 at 03:30:40PM +0200, Michal Hocko wrote:
> > On Wed 24-07-13 11:44:28, Minchan Kim wrote:
> > > On Tue, Jul 23, 2013 at 04:01:20PM +0200, Michal Hocko wrote:
> > > > On Fri 19-07-13 09:13:03, Minchan Kim wrote:
> > > > > On Thu, Jul 18, 2013 at 11:12:24PM +0530, Aneesh Kumar K.V wrote:
> > > > [...]
> > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > > > > > index 83aff0a..2cb1be3 100644
> > > > > > --- a/mm/hugetlb.c
> > > > > > +++ b/mm/hugetlb.c
> > > > > > @@ -3266,8 +3266,8 @@ pte_t *huge_pmd_share(struct mm_struct *mm, unsigned long addr, pud_t *pud)
> > > > > > put_page(virt_to_page(spte));
> > > > > > spin_unlock(&mm->page_table_lock);
> > > > > > out:
> > > > > > - pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > > mutex_unlock(&mapping->i_mmap_mutex);
> > > > > > + pte = (pte_t *)pmd_alloc(mm, pud, addr);
> > > > > > return pte;
> > > > >
> > > > > I am blind on hugetlb but not sure it doesn't break eb48c071.
> > > > > Michal?
> > > >
> > > > Well, it is some time since I debugged the race and all the details
> > > > vanished in the meantime. But this part of the changelog suggests that
> > > > this indeed breaks the fix:
> > > > "
> > > > This patch addresses the issue by moving pmd_alloc into huge_pmd_share
> > > > which guarantees that the shared pud is populated in the same critical
> > > > section as pmd. This also means that huge_pte_offset test in
> > > > huge_pmd_share is serialized correctly now which in turn means that the
> > > > success of the sharing will be higher as the racing tasks see the pud
> > > > and pmd populated together.
> > > > "
> > > >
> > > > Besides that I fail to see how moving pmd_alloc down changes anything.
> > > > Even if pmd_alloc triggered reclaim then we cannot trip over the same
> > > > i_mmap_mutex as hugetlb pages are not reclaimable because they are not
> > > > on the LRU.
> > >
> > > I thought we could map some part of binary with normal page and other part
> > > of the one with MAP_HUGETLB so that a address space could have both normal
> > > page and HugeTLB page. Okay, it's impossible so HugeTLB pages are not on LRU.
> > > Then, above lockdep warning is totally false positive.
> > > Best solution is avoiding pmd_alloc with holding i_mmap_mutex but we need it
> > > to fix eb48c071 so how about this if we couldn't have a better idea?
> >
> > Shouldn't we rather use a hugetlb specific lock_class_key. I am not
> > familiar with lockdep much but something like bellow should do the
> > trick?
>
> Looks good to me.
> Could you resend it with formal patch with Ccing Peter for Dave to confirm it?
> Below just a nitpick.

I would have posted it already, but I have to confess I am really not
familiar with lockdep and with what the right way is to fix such a false
positive.

Peter, for your context, the lockdep splat has been reported
here: https://lkml.org/lkml/2013/7/17/381

Minchan has proposed to work around it by using SINGLE_DEPTH_NESTING
https://lkml.org/lkml/2013/7/23/812

my idea was to use a separate class key for hugetlb as it is quite
special in many ways:
https://lkml.org/lkml/2013/7/25/277

What is the preferred way of fixing such an issue?

Thanks!
--
Michal Hocko
SUSE Labs

2013-07-29 15:20:18

by Peter Zijlstra

Subject: Re: hugepage related lockdep trace.

On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> Peter, for you context the lockdep splat has been reported
> here: https://lkml.org/lkml/2013/7/17/381
>
> Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> https://lkml.org/lkml/2013/7/23/812
>
> my idea was to use a separate class key for hugetlb as it is quite
> special in many ways:
> https://lkml.org/lkml/2013/7/25/277
>
> What is the preferred way of fixing such an issue?

The class is the safer annotation.

That said, it is a rather horrible issue any which way. This PMD sharing
is unique to hugetlbfs (also, is it really worth the effort these
days?), and it will make it impossible to make hugetlbfs swappable.

The other solution is to make the pmd allocation GFP_NOFS.
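
For reference, lockdep's reclaim tracking only records the RECLAIM_FS-ON-W
state for allocations that carry __GFP_FS, which is why a GFP_NOFS pmd
allocation would also make this splat disappear. A rough sketch of that
direction, assuming the 32-bit x86 pmd_alloc_one() that the trace shows
calling get_zeroed_page() (illustration only, not a proposed patch):

/*
 * Illustration only, not a proposed patch: the "make the pmd allocation
 * GFP_NOFS" alternative.  GFP_NOFS is GFP_KERNEL without __GFP_FS, and
 * lockdep only records RECLAIM_FS usage for __GFP_FS allocations, so no
 * warning would be generated for locks held across this allocation.
 */
static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long addr)
{
	return (pmd_t *)get_zeroed_page(GFP_NOFS | __GFP_REPEAT);
}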

2013-07-30 14:30:04

by Michal Hocko

Subject: Re: hugepage related lockdep trace.

On Mon 29-07-13 17:20:01, Peter Zijlstra wrote:
> On Mon, Jul 29, 2013 at 04:53:08PM +0200, Michal Hocko wrote:
> > Peter, for you context the lockdep splat has been reported
> > here: https://lkml.org/lkml/2013/7/17/381
> >
> > Minchan has proposed to workaround it by using SINGLE_DEPTH_NESTING
> > https://lkml.org/lkml/2013/7/23/812
> >
> > my idea was to use a separate class key for hugetlb as it is quite
> > special in many ways:
> > https://lkml.org/lkml/2013/7/25/277
> >
> > What is the preferred way of fixing such an issue?
>
> The class is the safer annotation.

OK, I will use the class then. It should prevent other false positives
AFAIU.

> That said; it is a rather horrible issue any which way. This PMD sharing
> is very unique to hugetlbfs (also is that really worth the effort these
> days?) and it will make it impossible to make hugetlbfs swappable.

No idea.

> The other solution is to make the pmd allocation GFP_NOFS.

That would be just papering over the lockdep limitation. So I would
rather stick with something lockdep specific.

I will cook up a patch.

Thanks!
--
Michal Hocko
SUSE Labs

2013-07-30 14:46:11

by Michal Hocko

Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

Dave has reported the following lockdep splat:
[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138] [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138] Possible unsafe locking scenario:
[128095.480138] CPU0
[128095.480138] ----
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138] <Interrupt>
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138]
*** DEADLOCK ***
[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138] [<c112e531>] kswapd+0x517/0xa75
[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138] [<c10661ff>] kthread+0xa8/0xaa
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63

which is a false positive caused by the hugetlb pmd sharing code, which
allocates a new pmd from within mapping->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will never be touched from the reclaim path.

The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
special by assigning it a separate lockdep class, so it won't report
possible deadlocks on unrelated mappings.

Reported-by: Dave Jones <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/hugetlbfs/inode.c | 8 ++++++++
1 file changed, 8 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..230533d 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
return inode;
}

+/*
+ * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
+ * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
+ */
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct inode *dir,
umode_t mode, dev_t dev)
@@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct hugetlbfs_inode_info *info;
inode->i_ino = get_next_ino();
inode_init_owner(inode, dir, mode);
+ lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
+ &hugetlbfs_i_mmap_mutex_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
--
1.8.3.2

2013-07-30 14:58:48

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> which is a false positive caused by the hugetlb pmd sharing code, which
> allocates a new pmd from within mapping->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
>
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will never be touched from the reclaim path.
>
> The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
> special by assigning it a separate lockdep class, so it won't report
> possible deadlocks on unrelated mappings.
>
> Reported-by: Dave Jones <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
> ---
> fs/hugetlbfs/inode.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index a3f868a..230533d 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -463,6 +463,12 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
> return inode;
> }
>
> +/*
> + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.

How about something like:

/*
* Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
* be taken from reclaim -- unlike regular filesystems. This needs an
* annotation because huge_pmd_share() does an allocation under
* i_mmap_mutex.
*/

It clarifies the exact conditions and makes it easier to verify the
validity of the annotation.

> + */
> +struct lock_class_key hugetlbfs_i_mmap_mutex_key;
> +
> static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> struct inode *dir,
> umode_t mode, dev_t dev)
> @@ -474,6 +480,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
> struct hugetlbfs_inode_info *info;
> inode->i_ino = get_next_ino();
> inode_init_owner(inode, dir, mode);
> + lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
> + &hugetlbfs_i_mmap_mutex_key);
> inode->i_mapping->a_ops = &hugetlbfs_aops;
> inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
> inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
> --
> 1.8.3.2
>

2013-07-30 15:23:45

by Michal Hocko

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
[...]
> > +/*
> > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
>
> How about something like:
>
> /*
> * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
> * be taken from reclaim -- unlike regular filesystems. This needs an
> * annotation because huge_pmd_share() does an allocation under
> * i_mmap_mutex.
> */
>
> It clarifies the exact conditions and makes it easier to verify the
> validity of the annotation.

Yes, looks much better. Thanks!
---
From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 30 Jul 2013 17:22:14 +0200
Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

Dave has reported the following lockdep splat:
[128095.470960] =================================
[128095.471315] [ INFO: inconsistent lock state ]
[128095.471660] 3.11.0-rc1+ #9 Not tainted
[128095.472156] ---------------------------------
[128095.472905] inconsistent {RECLAIM_FS-ON-W} -> {IN-RECLAIM_FS-W} usage.
[128095.473650] kswapd0/49 [HC0[0]:SC0[0]:HE1:SE1] takes:
[128095.474373] (&mapping->i_mmap_mutex){+.+.?.}, at: [<c114971b>] page_referenced+0x87/0x5e3
[128095.475128] {RECLAIM_FS-ON-W} state was registered at:
[128095.475866] [<c10a6232>] mark_held_locks+0x81/0xe7
[128095.476597] [<c10a8db3>] lockdep_trace_alloc+0x5e/0xbc
[128095.477322] [<c112316b>] __alloc_pages_nodemask+0x8b/0x9b6
[128095.478049] [<c1123ab6>] __get_free_pages+0x20/0x31
[128095.478769] [<c1123ad9>] get_zeroed_page+0x12/0x14
[128095.479477] [<c113fe1e>] __pmd_alloc+0x1c/0x6b
[128095.480138] [<c1155ea7>] huge_pmd_share+0x265/0x283
[128095.480138] [<c1155f22>] huge_pte_alloc+0x5d/0x71
[128095.480138] [<c115612e>] hugetlb_fault+0x7c/0x64a
[128095.480138] [<c114087c>] handle_mm_fault+0x255/0x299
[128095.480138] [<c15bbab0>] __do_page_fault+0x142/0x55c
[128095.480138] [<c15bbed7>] do_page_fault+0xd/0x16
[128095.480138] [<c15b927c>] error_code+0x6c/0x74
[128095.480138] irq event stamp: 3136917
[128095.480138] hardirqs last enabled at (3136917): [<c15b8139>] _raw_spin_unlock_irq+0x27/0x50
[128095.480138] hardirqs last disabled at (3136916): [<c15b7f4e>] _raw_spin_lock_irq+0x15/0x78
[128095.480138] softirqs last enabled at (3136180): [<c1048e4a>] __do_softirq+0x137/0x30f
[128095.480138] softirqs last disabled at (3136175): [<c1049195>] irq_exit+0xa8/0xaa
[128095.480138]
other info that might help us debug this:
[128095.480138] Possible unsafe locking scenario:
[128095.480138] CPU0
[128095.480138] ----
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138] <Interrupt>
[128095.480138] lock(&mapping->i_mmap_mutex);
[128095.480138]
*** DEADLOCK ***
[128095.480138] no locks held by kswapd0/49.
[128095.480138]
stack backtrace:
[128095.480138] CPU: 1 PID: 49 Comm: kswapd0 Not tainted 3.11.0-rc1+ #9
[128095.480138] Hardware name: Dell Inc. Precision WorkStation 490 /0DT031, BIOS A08 04/25/2008
[128095.480138] c1d32630 00000000 ee39fb18 c15b001e ee395780 ee39fb54 c15acdcb c1751845
[128095.480138] c1751bbf 00000031 00000000 00000000 00000000 00000000 00000001 00000001
[128095.480138] c1751bbf 00000008 ee395c44 00000100 ee39fb88 c10a6130 00000008 0000d8fb
[128095.480138] Call Trace:
[128095.480138] [<c15b001e>] dump_stack+0x4b/0x79
[128095.480138] [<c15acdcb>] print_usage_bug+0x1d9/0x1e3
[128095.480138] [<c10a6130>] mark_lock+0x1e0/0x261
[128095.480138] [<c10a5878>] ? check_usage_backwards+0x109/0x109
[128095.480138] [<c10a6cde>] __lock_acquire+0x623/0x17f2
[128095.480138] [<c107aa43>] ? sched_clock_cpu+0xcd/0x130
[128095.480138] [<c107a7e8>] ? sched_clock_local+0x42/0x12e
[128095.480138] [<c10a84cf>] lock_acquire+0x7d/0x195
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c15b3671>] mutex_lock_nested+0x6c/0x3a7
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c114971b>] ? page_referenced+0x87/0x5e3
[128095.480138] [<c11661d5>] ? mem_cgroup_charge_statistics.isra.24+0x61/0x9e
[128095.480138] [<c114971b>] page_referenced+0x87/0x5e3
[128095.480138] [<f8433030>] ? raid0_congested+0x26/0x8a [raid0]
[128095.480138] [<c112b9c7>] shrink_page_list+0x3d9/0x947
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c112c3cf>] shrink_inactive_list+0x155/0x4cb
[128095.480138] [<c112cd07>] shrink_lruvec+0x300/0x5ce
[128095.480138] [<c112d028>] shrink_zone+0x53/0x14e
[128095.480138] [<c112e531>] kswapd+0x517/0xa75
[128095.480138] [<c112e01a>] ? mem_cgroup_shrink_node_zone+0x280/0x280
[128095.480138] [<c10661ff>] kthread+0xa8/0xaa
[128095.480138] [<c10a6457>] ? trace_hardirqs_on+0xb/0xd
[128095.480138] [<c15bf737>] ret_from_kernel_thread+0x1b/0x28
[128095.480138] [<c1066157>] ? insert_kthread_work+0x63/0x63

which is a false positive caused by the hugetlb pmd sharing code, which
allocates a new pmd from within mapping->i_mmap_mutex. If this
allocation causes reclaim then the lockdep detector complains that we
might self-deadlock.

This is not correct though, because hugetlb pages are not reclaimable so
their mapping will never be touched from the reclaim path.

The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
special by assigning it a separate lockdep class, so it won't report
possible deadlocks on unrelated mappings.

[[email protected]: comment for annotation]
Reported-by: Dave Jones <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
fs/hugetlbfs/inode.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a3f868a..3442397 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -463,6 +463,14 @@ static struct inode *hugetlbfs_get_root(struct super_block *sb,
return inode;
}

+/*
+ * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
+ * be taken from reclaim -- unlike regular filesystems. This needs an
+ * annotation because huge_pmd_share() does an allocation under
+ * i_mmap_mutex.
+ */
+struct lock_class_key hugetlbfs_i_mmap_mutex_key;
+
static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct inode *dir,
umode_t mode, dev_t dev)
@@ -474,6 +482,8 @@ static struct inode *hugetlbfs_get_inode(struct super_block *sb,
struct hugetlbfs_inode_info *info;
inode->i_ino = get_next_ino();
inode_init_owner(inode, dir, mode);
+ lockdep_set_class(&inode->i_mapping->i_mmap_mutex,
+ &hugetlbfs_i_mmap_mutex_key);
inode->i_mapping->a_ops = &hugetlbfs_aops;
inode->i_mapping->backing_dev_info =&hugetlbfs_backing_dev_info;
inode->i_atime = inode->i_mtime = inode->i_ctime = CURRENT_TIME;
--
1.8.3.2

--
Michal Hocko
SUSE Labs

2013-07-30 23:39:10

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

On Wed, Jul 31, 2013 at 08:35:30AM +0900, Minchan Kim wrote:
> > which is a false positive caused by the hugetlb pmd sharing code, which
> > allocates a new pmd from within mapping->i_mmap_mutex. If this
> > allocation causes reclaim then the lockdep detector complains that we
> > might self-deadlock.
> >
> > This is not correct though, because hugetlb pages are not reclaimable so
> > their mapping will never be touched from the reclaim path.
> >
> > The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
> > special by assigning it a separate lockdep class, so it won't report
> > possible deadlocks on unrelated mappings.
> >
> > [[email protected]: comment for annotation]
> > Reported-by: Dave Jones <[email protected]>
> > Signed-off-by: Michal Hocko <[email protected]>
> Reviewed-by: Minchan Kim <[email protected]>
>
> Thanks, Michal!
> The only remaining thing is Dave's testing.

I've added it to my builds; all is quiet so far.

Dave




2013-07-30 23:50:11

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing

On Tue, Jul 30, 2013 at 05:23:33PM +0200, Michal Hocko wrote:
> On Tue 30-07-13 16:58:34, Peter Zijlstra wrote:
> > On Tue, Jul 30, 2013 at 04:46:00PM +0200, Michal Hocko wrote:
> [...]
> > > +/*
> > > + * Now, reclaim path never holds hugetlbfs_inode->i_mmap_mutex while it could
> > > + * hold normal inode->i_mmap_mutex so this annotation avoids a lockdep splat.
> >
> > How about something like:
> >
> > /*
> > * Hugetlbfs is not reclaimable; therefore its i_mmap_mutex will never
> > * be taken from reclaim -- unlike regular filesystems. This needs an
> > * annotation because huge_pmd_share() does an allocation under
> > * i_mmap_mutex.
> > */
> >
> > It clarifies the exact conditions and makes it easier to verify the
> > validity of the annotation.
>
> Yes, looks much better. Thanks!
> ---
> From 673cbe2ca7df0decd7320987d97585660542e468 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 30 Jul 2013 17:22:14 +0200
> Subject: [PATCH] hugetlb: fix lockdep splat caused by pmd sharing
>
> Dave has reported the following lockdep splat:
> [...]
>
> which is a false positive caused by the hugetlb pmd sharing code, which
> allocates a new pmd from within mapping->i_mmap_mutex. If this
> allocation causes reclaim then the lockdep detector complains that we
> might self-deadlock.
>
> This is not correct though, because hugetlb pages are not reclaimable so
> their mapping will never be touched from the reclaim path.
>
> The patch tells the lockdep detector that the hugetlb i_mmap_mutex is
> special by assigning it a separate lockdep class, so it won't report
> possible deadlocks on unrelated mappings.
>
> [[email protected]: comment for annotation]
> Reported-by: Dave Jones <[email protected]>
> Signed-off-by: Michal Hocko <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>

Thanks, Michal!
The only remaining thing is Dave's testing.

--
Kind regards,
Minchan Kim