2021-07-07 11:14:29

by Desmond Cheong Zhi Xi

Subject: [PATCH] mm/page_alloc: avoid deadlocks for &pagesets.lock

Syzbot reports a number of potential deadlocks for &pagesets.lock. It
seems that this new lock is being used as both an inner and outer
lock, which makes it prone to creating circular dependencies.

For example, one such call trace goes as follows:
__alloc_pages_bulk()
  local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
  prep_new_page():
    post_alloc_hook():
      set_page_owner():
        __set_page_owner():
          save_stack():
            stack_depot_save():
              alloc_pages():
                alloc_page_interleave():
                  __alloc_pages():
                    get_page_from_freelist():
                      rmqueue():
                        rmqueue_pcplist():
                          local_lock_irqsave(&pagesets.lock, flags);
                          *** DEADLOCK ***

The common culprit for the lockdep splats seems to be the call to
local_lock_irqsave(&pagesets.lock, flags) inside
__alloc_pages_bulk(). &pagesets.lock becomes an outer lock if it's
held during the call to prep_new_page().

As the local lock is used to protect the PCP structure, we adjust the
locking in __alloc_pages_bulk so that only the necessary structures
are protected.

Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Reported-and-tested-by: [email protected]
Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>
---
mm/page_alloc.c | 15 +++++----------
1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0817d88383d5..0e005b1a60e3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5288,10 +5288,6 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		goto failed;

 	/* Attempt the batch allocation */
-	local_lock_irqsave(&pagesets.lock, flags);
-	pcp = this_cpu_ptr(zone->per_cpu_pageset);
-	pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
-
 	while (nr_populated < nr_pages) {

 		/* Skip existing pages */
@@ -5300,12 +5296,16 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 			continue;
 		}

+		local_lock_irqsave(&pagesets.lock, flags);
+		pcp = this_cpu_ptr(zone->per_cpu_pageset);
+		pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
 		page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
 								pcp, pcp_list);
+		local_unlock_irqrestore(&pagesets.lock, flags);
 		if (unlikely(!page)) {
 			/* Try and get at least one page */
 			if (!nr_populated)
-				goto failed_irq;
+				goto failed;
 			break;
 		}
 		nr_account++;
@@ -5318,16 +5318,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 		nr_populated++;
 	}

-	local_unlock_irqrestore(&pagesets.lock, flags);
-
 	__count_zid_vm_events(PGALLOC, zone_idx(zone), nr_account);
 	zone_statistics(ac.preferred_zoneref->zone, zone, nr_account);

 	return nr_populated;

-failed_irq:
-	local_unlock_irqrestore(&pagesets.lock, flags);
-
 failed:
 	page = __alloc_pages(gfp, 0, preferred_nid, nodemask);
 	if (page) {
--
2.25.1
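
A minimal user-space sketch of the hazard this patch describes, assuming a toy non-recursive lock. It is not kernel code: every toy_* name is hypothetical, and toy_lock_acquire() returns false on re-entry (rather than hanging) so the nested acquisition can be counted. toy_prep_hook() stands in for the nested allocation made under prep_new_page() when page owner tracking is active. The sketch contrasts holding the lock across the hook with taking it only around the list operation, which is the idea behind the narrowed locking in the diff above.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for a non-recursive lock such as pagesets.lock. */
struct toy_lock {
	bool held;
};

static struct toy_lock pcp_lock;	/* zero-initialised: not held */
static int reentry_attempts;		/* nested acquisitions observed */

/* Report re-entry by returning false instead of blocking. */
static bool toy_lock_acquire(struct toy_lock *l)
{
	if (l->held)
		return false;
	l->held = true;
	return true;
}

static void toy_lock_release(struct toy_lock *l)
{
	assert(l->held);
	l->held = false;
}

/* Models a hook whose own allocation path takes the same lock again. */
static void toy_prep_hook(void)
{
	if (!toy_lock_acquire(&pcp_lock)) {
		/* The case lockdep would flag as a potential deadlock. */
		reentry_attempts++;
		return;
	}
	toy_lock_release(&pcp_lock);
}

/* Wide critical section: the hook runs while the lock is held. */
static int toy_bulk_wide(int nr_pages)
{
	int i, n = 0;
	bool ok = toy_lock_acquire(&pcp_lock);

	assert(ok);
	(void)ok;
	for (i = 0; i < nr_pages; i++) {
		n++;			/* take one page off the list */
		toy_prep_hook();	/* re-enters the lock */
	}
	toy_lock_release(&pcp_lock);
	return n;
}

/* Narrow critical section: the lock covers only the list operation. */
static int toy_bulk_narrow(int nr_pages)
{
	int i, n = 0;

	for (i = 0; i < nr_pages; i++) {
		bool ok = toy_lock_acquire(&pcp_lock);

		assert(ok);
		(void)ok;
		n++;			/* take one page off the list */
		toy_lock_release(&pcp_lock);
		toy_prep_hook();	/* lock is free here */
	}
	return n;
}
```

In this model toy_bulk_wide() records one re-entry per page, while toy_bulk_narrow() records none, at the price of one acquire/release pair per page.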


2021-07-07 12:30:03

by Mel Gorman

Subject: Re: [PATCH] mm/page_alloc: avoid deadlocks for &pagesets.lock

On Wed, Jul 07, 2021 at 07:12:45PM +0800, Desmond Cheong Zhi Xi wrote:
> Syzbot reports a number of potential deadlocks for &pagesets.lock. It
> seems that this new lock is being used as both an inner and outer
> lock, which makes it prone to creating circular dependencies.
>
> For example, one such call trace goes as follows:
> __alloc_pages_bulk()
> local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
> prep_new_page():
> post_alloc_hook():
> set_page_owner():
> __set_page_owner():
> save_stack():
> stack_depot_save():
> alloc_pages():
> alloc_page_interleave():
> __alloc_pages():
> get_page_from_freelist():
> rmqueue():
> rmqueue_pcplist():
> local_lock_irqsave(&pagesets.lock, flags);
> *** DEADLOCK ***
>
> The common culprit for the lockdep splats seems to be the call to
> local_lock_irqsave(&pagesets.lock, flags) inside
> __alloc_pages_bulk(). &pagesets.lock becomes an outer lock if it's
> held during the call to prep_new_page().
>
> As the local lock is used to protect the PCP structure, we adjust the
> locking in __alloc_pages_bulk so that only the necessary structures
> are protected.
>
> Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
> Reported-and-tested-by: [email protected]
> Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>

Hi Desmond,

Thanks for the report. Unfortunately, this patch incurs a performance
penalty for the bulk allocator even if PAGE_OWNER is disabled. Can you
try the following as an alternative please? It passed a build and boot
test but I didn't try triggering the actual bug.

--8<--
mm/page_alloc: Avoid page allocator recursion with pagesets.lock held

Syzbot is reporting potential deadlocks due to pagesets.lock when
PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is
as follows

__alloc_pages_bulk()
  local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
  prep_new_page():
    post_alloc_hook():
      set_page_owner():
        __set_page_owner():
          save_stack():
            stack_depot_save():
              alloc_pages():
                alloc_page_interleave():
                  __alloc_pages():
                    get_page_from_freelist():
                      rmqueue():
                        rmqueue_pcplist():
                          local_lock_irqsave(&pagesets.lock, flags);
                          *** DEADLOCK ***

Zhang, Qiang also reported

BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
.....
__dump_stack lib/dump_stack.c:79 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
__alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
save_stack+0x15e/0x1e0 mm/page_owner.c:120
__set_page_owner+0x50/0x290 mm/page_owner.c:181
prep_new_page mm/page_alloc.c:2445 [inline]
__alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
__vmalloc_area_node mm/vmalloc.c:2845 [inline]
__vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
__vmalloc_node mm/vmalloc.c:2996 [inline]
vzalloc+0x67/0x80 mm/vmalloc.c:3066

There are a number of ways it could be fixed. The page owner code could
be audited to strip GFP flags that allow sleeping but it'll impair the
functionality of PAGE_OWNER if allocations fail. The bulk allocator
could add a special case to release/reacquire the lock for prep_new_page
and lookup PCP after the lock is reacquired at the cost of performance.
Both options are relatively complex and the second one still incurs a
performance penalty when PAGE_OWNER is active so this patch takes the
simple approach -- disable bulk allocation if PAGE_OWNER is active. The
caller will be forced to allocate one page at a time incurring a
performance penalty but PAGE_OWNER is already a performance penalty.

Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
Reported-by: Desmond Cheong Zhi Xi <[email protected]>
Reported-by: "Zhang, Qiang" <[email protected]>
Reported-by: [email protected]
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 12 ++++++++++++
1 file changed, 12 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3b97e17806be..6ef86f338151 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -5239,6 +5239,18 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
 	if (nr_pages - nr_populated == 1)
 		goto failed;

+#ifdef CONFIG_PAGE_OWNER
+	/*
+	 * PAGE_OWNER may recurse into the allocator to allocate space to
+	 * save the stack with pagesets.lock held. Releasing/reacquiring
+	 * removes much of the performance benefit of bulk allocation so
+	 * force the caller to allocate one page at a time as it'll have
+	 * similar performance to added complexity to the bulk allocator.
+	 */
+	if (static_branch_unlikely(&page_owner_inited))
+		goto failed;
+#endif
+
 	/* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
 	gfp &= gfp_allowed_mask;
 	alloc_gfp = gfp;
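
A rough user-space sketch of the fallback pattern used by this patch, under stated assumptions. All toy_* names are hypothetical: toy_owner_tracking stands in for the page_owner_inited static key, and toy_fill() models a caller that keeps calling the bulk interface until its request is filled. The point it illustrates is that bailing out to the single-page path keeps the result correct and only changes how many calls the caller needs.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for the page_owner_inited static key. */
static bool toy_owner_tracking;

/* Single-page slow path: always yields exactly one page in this model. */
static int toy_alloc_one(void)
{
	return 1;
}

/* Returns the number of pages populated by this call. */
static int toy_alloc_bulk(int nr_pages)
{
	int n = 0;

	if (nr_pages <= 0)
		return 0;

	/* Tracking active: skip the batched path entirely. */
	if (toy_owner_tracking)
		goto failed;

	/* Batched path: populate everything in one call. */
	while (n < nr_pages)
		n++;
	return n;

failed:
	/* Fall back to one page, as the failed: label does above. */
	return toy_alloc_one();
}

/* Caller loop: retry until the request is filled or no progress is made. */
static int toy_fill(int nr_pages, int *calls)
{
	int got = 0;

	*calls = 0;
	while (got < nr_pages) {
		int n = toy_alloc_bulk(nr_pages - got);

		(*calls)++;
		if (n == 0)
			break;
		got += n;
	}
	return got;
}
```

With tracking off, toy_fill(8, &calls) completes in a single call; with tracking on, it still returns 8 pages but needs eight calls, which mirrors the one-page-at-a-time cost described in the commit message.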

2021-07-08 01:50:32

by Desmond Cheong Zhi Xi

Subject: Re: [PATCH] mm/page_alloc: avoid deadlocks for &pagesets.lock

On 7/7/21 8:25 pm, Mel Gorman wrote:
> On Wed, Jul 07, 2021 at 07:12:45PM +0800, Desmond Cheong Zhi Xi wrote:
>> Syzbot reports a number of potential deadlocks for &pagesets.lock. It
>> seems that this new lock is being used as both an inner and outer
>> lock, which makes it prone to creating circular dependencies.
>>
>> For example, one such call trace goes as follows:
>> __alloc_pages_bulk()
>> local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
>> prep_new_page():
>> post_alloc_hook():
>> set_page_owner():
>> __set_page_owner():
>> save_stack():
>> stack_depot_save():
>> alloc_pages():
>> alloc_page_interleave():
>> __alloc_pages():
>> get_page_from_freelist():
>> rmqueue():
>> rmqueue_pcplist():
>> local_lock_irqsave(&pagesets.lock, flags);
>> *** DEADLOCK ***
>>
>> The common culprit for the lockdep splats seems to be the call to
>> local_lock_irqsave(&pagesets.lock, flags) inside
>> __alloc_pages_bulk(). &pagesets.lock becomes an outer lock if it's
>> held during the call to prep_new_page().
>>
>> As the local lock is used to protect the PCP structure, we adjust the
>> locking in __alloc_pages_bulk so that only the necessary structures
>> are protected.
>>
>> Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
>> Reported-and-tested-by: [email protected]
>> Signed-off-by: Desmond Cheong Zhi Xi <[email protected]>
>
> Hi Desmond,
>
> Thanks for the report. Unfortunately, this patch incurs a performance
> penalty for the bulk allocator even if PAGE_OWNER is disabled. Can you
> try the following as an alternative please? It passed a build and boot
> test but I didn't try triggering the actual bug.
>

Hi Mel,

Thanks for the feedback, I hadn't thought of the performance penalty. I
think you're right that if the recursive call to __set_page_owner is
avoided, then that also avoids creating the circular lock hierarchy.

Your proposed patch passed the Syzbot repro test:

Reported-and-tested-by:
[email protected]

Best wishes,
Desmond

> --8<--
> mm/page_alloc: Avoid page allocator recursion with pagesets.lock held
>
> Syzbot is reporting potential deadlocks due to pagesets.lock when
> PAGE_OWNER is enabled. One example from Desmond Cheong Zhi Xi is
> as follows
>
> __alloc_pages_bulk()
> local_lock_irqsave(&pagesets.lock, flags) <---- outer lock here
> prep_new_page():
> post_alloc_hook():
> set_page_owner():
> __set_page_owner():
> save_stack():
> stack_depot_save():
> alloc_pages():
> alloc_page_interleave():
> __alloc_pages():
> get_page_from_freelist():
> rmqueue():
> rmqueue_pcplist():
> local_lock_irqsave(&pagesets.lock, flags);
> *** DEADLOCK ***
>
> Zhang, Qiang also reported
>
> BUG: sleeping function called from invalid context at mm/page_alloc.c:5179
> in_atomic(): 0, irqs_disabled(): 1, non_block: 0, pid: 1, name: swapper/0
> .....
> __dump_stack lib/dump_stack.c:79 [inline]
> dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:96
> ___might_sleep.cold+0x1f1/0x237 kernel/sched/core.c:9153
> prepare_alloc_pages+0x3da/0x580 mm/page_alloc.c:5179
> __alloc_pages+0x12f/0x500 mm/page_alloc.c:5375
> alloc_page_interleave+0x1e/0x200 mm/mempolicy.c:2147
> alloc_pages+0x238/0x2a0 mm/mempolicy.c:2270
> stack_depot_save+0x39d/0x4e0 lib/stackdepot.c:303
> save_stack+0x15e/0x1e0 mm/page_owner.c:120
> __set_page_owner+0x50/0x290 mm/page_owner.c:181
> prep_new_page mm/page_alloc.c:2445 [inline]
> __alloc_pages_bulk+0x8b9/0x1870 mm/page_alloc.c:5313
> alloc_pages_bulk_array_node include/linux/gfp.h:557 [inline]
> vm_area_alloc_pages mm/vmalloc.c:2775 [inline]
> __vmalloc_area_node mm/vmalloc.c:2845 [inline]
> __vmalloc_node_range+0x39d/0x960 mm/vmalloc.c:2947
> __vmalloc_node mm/vmalloc.c:2996 [inline]
> vzalloc+0x67/0x80 mm/vmalloc.c:3066
>
> There are a number of ways it could be fixed. The page owner code could
> be audited to strip GFP flags that allow sleeping but it'll impair the
> functionality of PAGE_OWNER if allocations fail. The bulk allocator
> could add a special case to release/reacquire the lock for prep_new_page
> and lookup PCP after the lock is reacquired at the cost of performance.
> Both options are relatively complex and the second one still incurs a
> performance penalty when PAGE_OWNER is active so this patch takes the
> simple approach -- disable bulk allocation if PAGE_OWNER is active. The
> caller will be forced to allocate one page at a time incurring a
> performance penalty but PAGE_OWNER is already a performance penalty.
>
> Fixes: dbbee9d5cd83 ("mm/page_alloc: convert per-cpu list protection to local_lock")
> Reported-by: Desmond Cheong Zhi Xi <[email protected]>
> Reported-by: "Zhang, Qiang" <[email protected]>
> Reported-by: [email protected]
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> mm/page_alloc.c | 12 ++++++++++++
> 1 file changed, 12 insertions(+)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3b97e17806be..6ef86f338151 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -5239,6 +5239,18 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
> if (nr_pages - nr_populated == 1)
> goto failed;
>
> +#ifdef CONFIG_PAGE_OWNER
> + /*
> + * PAGE_OWNER may recurse into the allocator to allocate space to
> + * save the stack with pagesets.lock held. Releasing/reacquiring
> + * removes much of the performance benefit of bulk allocation so
> + * force the caller to allocate one page at a time as it'll have
> + * similar performance to added complexity to the bulk allocator.
> + */
> + if (static_branch_unlikely(&page_owner_inited))
> + goto failed;
> +#endif
> +
> /* May set ALLOC_NOFRAGMENT, fragmentation will return 1 page. */
> gfp &= gfp_allowed_mask;
> alloc_gfp = gfp;
>