2022-04-22 10:36:38

by Mel Gorman

Subject: [RFC PATCH 0/6] Drain remote per-cpu directly

This series has the same intent as Nicolas' series "mm/page_alloc: Remote
per-cpu lists drain support" -- avoid interference of a high priority
task due to a workqueue item draining per-cpu page lists. While many
workloads can tolerate a brief interruption, it may cause a real-time
task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
the draining is non-deterministic.

Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
The local_lock on its own prevents migration and the IRQ disabling protects
from corruption due to an interrupt arriving while a page allocation is
in progress. The locking is inherently unsafe for remote access unless
the CPU is hot-removed.
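
For illustration, the existing pattern is roughly the following (a
simplified sketch only, not the mainline code; the helper name is made
up and order/statistics handling is omitted):

/*
 * Simplified sketch of the pre-series per-cpu list access: the
 * local_lock pins the task to this CPU and the IRQ disable prevents
 * reentry from interrupt context, but nothing makes the list safe to
 * touch from another CPU.
 */
static struct page *pcp_pop_sketch(struct zone *zone, unsigned int pindex)
{
	struct per_cpu_pages *pcp;
	struct page *page = NULL;
	unsigned long flags;

	local_lock_irqsave(&pagesets.lock, flags);	/* CPU-local protection only */
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	if (!list_empty(&pcp->lists[pindex])) {
		page = list_first_entry(&pcp->lists[pindex], struct page, lru);
		list_del(&page->lru);
		pcp->count--;				/* assumes an order-0 page */
	}
	local_unlock_irqrestore(&pagesets.lock, flags);

	return page;
}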

This series adjusts the locking. A spin-lock is added to struct
per_cpu_pages to protect the list contents while local_lock_irq continues
to prevent migration and IRQ reentry. This allows a remote CPU to safely
drain a remote per-cpu list.
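
In outline, the intended scheme looks like this (a sketch only; the
pcp->lock field and the helpers below are illustrative, the real
changes are in patches 5 and 6):

/*
 * Sketch of the intended scheme: pcp->lock protects the list contents,
 * while local_lock still provides migration/IRQ protection locally.
 */
static void pcp_local_op_sketch(struct zone *zone)
{
	struct per_cpu_pages *pcp;
	unsigned long flags;

	local_lock_irqsave(&pagesets.lock, flags);	/* pin CPU, block IRQs */
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	spin_lock(&pcp->lock);				/* protect list contents */
	/* add to or remove from pcp->lists here */
	spin_unlock(&pcp->lock);
	local_unlock_irqrestore(&pagesets.lock, flags);
}

static void pcp_remote_drain_sketch(struct zone *zone, int cpu)
{
	struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
	unsigned long flags;

	spin_lock_irqsave(&pcp->lock, flags);		/* no workqueue bounce */
	/* return pcp->count pages to the buddy free lists here */
	spin_unlock_irqrestore(&pcp->lock, flags);
}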

This series is a partial series. Follow-on work would allow the
local_irq_save to be converted to a local_irq to avoid IRQs being
disabled/enabled in most cases. However, there are enough corner cases
that it deserves a series on its own separated by one kernel release and
the priority right now is to avoid interference of high priority tasks.

Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
and when it is storing per-cpu pages.

Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
this is not necessary but it avoids per_cpu_pages consuming another
cache line.

Patch 3 is a preparation patch to avoid code duplication.

Patch 4 is a simple micro-optimisation that improves code flow necessary for
a later patch to avoid code duplication.

Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
relying on local_lock to prevent migration, stabilise the pcp
lookup and prevent IRQ reentrancy.

Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

include/linux/mm_types.h | 5 +
include/linux/mmzone.h | 12 +-
mm/page_alloc.c | 333 ++++++++++++++++++++++++---------------
3 files changed, 222 insertions(+), 128 deletions(-)

--
2.34.1


2022-04-22 14:23:05

by Mel Gorman

Subject: [PATCH 3/6] mm/page_alloc: Split out buddy removal code from rmqueue into separate helper

This is a preparation patch to allow the buddy removal code to be reused
in a later patch.

No functional change.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 87 ++++++++++++++++++++++++++++---------------------
1 file changed, 50 insertions(+), 37 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ed2deb93a758..4c1acf666056 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3622,6 +3622,46 @@ static inline void zone_statistics(struct zone *preferred_zone, struct zone *z,
 #endif
 }
 
+static __always_inline
+struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
+			   unsigned int order, unsigned int alloc_flags,
+			   int migratetype)
+{
+	struct page *page;
+	unsigned long flags;
+
+	do {
+		page = NULL;
+		spin_lock_irqsave(&zone->lock, flags);
+		/*
+		 * order-0 request can reach here when the pcplist is skipped
+		 * due to non-CMA allocation context. HIGHATOMIC area is
+		 * reserved for high-order atomic allocation, so order-0
+		 * request should skip it.
+		 */
+		if (order > 0 && alloc_flags & ALLOC_HARDER) {
+			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
+			if (page)
+				trace_mm_page_alloc_zone_locked(page, order, migratetype);
+		}
+		if (!page) {
+			page = __rmqueue(zone, order, migratetype, alloc_flags);
+			if (!page) {
+				spin_unlock_irqrestore(&zone->lock, flags);
+				return NULL;
+			}
+		}
+		__mod_zone_freepage_state(zone, -(1 << order),
+					  get_pcppage_migratetype(page));
+		spin_unlock_irqrestore(&zone->lock, flags);
+	} while (check_new_pages(page, order));
+
+	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
+	zone_statistics(preferred_zone, zone, 1);
+
+	return page;
+}
+
 /* Remove page from the per-cpu list, caller must protect the list */
 static inline
 struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -3702,9 +3742,14 @@ struct page *rmqueue(struct zone *preferred_zone,
 			gfp_t gfp_flags, unsigned int alloc_flags,
 			int migratetype)
 {
-	unsigned long flags;
 	struct page *page;

+	/*
+	 * We most definitely don't want callers attempting to
+	 * allocate greater than order-1 page units with __GFP_NOFAIL.
+	 */
+	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
+
 	if (likely(pcp_allowed_order(order))) {
 		/*
 		 * MIGRATE_MOVABLE pcplist could have the pages on CMA area and
@@ -3718,38 +3763,10 @@ struct page *rmqueue(struct zone *preferred_zone,
 		}
 	}

-	/*
-	 * We most definitely don't want callers attempting to
-	 * allocate greater than order-1 page units with __GFP_NOFAIL.
-	 */
-	WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
-
-	do {
-		page = NULL;
-		spin_lock_irqsave(&zone->lock, flags);
-		/*
-		 * order-0 request can reach here when the pcplist is skipped
-		 * due to non-CMA allocation context. HIGHATOMIC area is
-		 * reserved for high-order atomic allocation, so order-0
-		 * request should skip it.
-		 */
-		if (order > 0 && alloc_flags & ALLOC_HARDER) {
-			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
-			if (page)
-				trace_mm_page_alloc_zone_locked(page, order, migratetype);
-		}
-		if (!page) {
-			page = __rmqueue(zone, order, migratetype, alloc_flags);
-			if (!page)
-				goto failed;
-		}
-		__mod_zone_freepage_state(zone, -(1 << order),
-				get_pcppage_migratetype(page));
-		spin_unlock_irqrestore(&zone->lock, flags);
-	} while (check_new_pages(page, order));
-
-	__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
-	zone_statistics(preferred_zone, zone, 1);
+	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
+							migratetype);
+	if (unlikely(!page))
+		return NULL;

 out:
 	/* Separate test+clear to avoid unnecessary atomics */
@@ -3760,10 +3777,6 @@ struct page *rmqueue(struct zone *preferred_zone,

 	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
 	return page;
-
-failed:
-	spin_unlock_irqrestore(&zone->lock, flags);
-	return NULL;
 }

 #ifdef CONFIG_FAIL_PAGE_ALLOC
--
2.34.1

2022-04-22 19:51:00

by Mel Gorman

Subject: [PATCH 2/6] mm/page_alloc: Use only one PCP list for THP-sized allocations

The per_cpu_pages structure is cache-aligned on a standard x86-64
distribution configuration but a later patch will add a new field which
would push the structure into the next cache line. Use only one list to
store THP-sized pages on the per-cpu list. This assumes that the vast
majority of THP-sized allocations are GFP_MOVABLE but, even if another
type were used, it would not contribute to serious fragmentation that
potentially causes a later THP allocation failure. Align per_cpu_pages
on the cacheline boundary to ensure there is no false cache-line sharing.

After this patch, the structure sizing is:

struct per_cpu_pages {
        int                        count;                /*     0     4 */
        int                        high;                 /*     4     4 */
        int                        batch;                /*     8     4 */
        short int                  free_factor;          /*    12     2 */
        short int                  expire;               /*    14     2 */
        struct list_head           lists[13];            /*    16   208 */

        /* size: 256, cachelines: 4, members: 6 */
        /* padding: 32 */
} __attribute__((__aligned__(64)));
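
As an illustration of the resulting index layout (a standalone sketch
with a made-up helper name, not code from this patch), the mapping from
(migratetype, order) to a pcp list index becomes:

/*
 * Illustrative-only helper showing the resulting mapping: orders
 * 0..PAGE_ALLOC_COSTLY_ORDER keep one list per migratetype and all
 * THP-sized (pageblock_order) pages share the single trailing list.
 */
static inline unsigned int pindex_sketch(int migratetype, unsigned int order)
{
	if (order > PAGE_ALLOC_COSTLY_ORDER)
		return NR_LOWORDER_PCP_LISTS;	/* the one shared THP list */

	return (MIGRATE_PCPTYPES * order) + migratetype;
}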

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 11 +++++++----
mm/page_alloc.c | 4 ++--
2 files changed, 9 insertions(+), 6 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 962b14d403e8..abe530748de6 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -358,15 +358,18 @@ enum zone_watermarks {
 };

 /*
- * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER plus one additional
- * for pageblock size for THP if configured.
+ * One per migratetype for each PAGE_ALLOC_COSTLY_ORDER. One additional list
+ * for THP which will usually be GFP_MOVABLE. Even if it is another type,
+ * it should not contribute to serious fragmentation causing THP allocation
+ * failures.
  */
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 #define NR_PCP_THP 1
 #else
 #define NR_PCP_THP 0
 #endif
-#define NR_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1 + NR_PCP_THP))
+#define NR_LOWORDER_PCP_LISTS (MIGRATE_PCPTYPES * (PAGE_ALLOC_COSTLY_ORDER + 1))
+#define NR_PCP_LISTS (NR_LOWORDER_PCP_LISTS + NR_PCP_THP)

 /*
  * Shift to encode migratetype and order in the same integer, with order
@@ -392,7 +395,7 @@ struct per_cpu_pages {

 	/* Lists of pages, one per migrate type stored on the pcp-lists */
 	struct list_head lists[NR_PCP_LISTS];
-};
+} ____cacheline_aligned_in_smp;

 struct per_cpu_zonestat {
 #ifdef CONFIG_SMP
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 63976ad4b7f1..ed2deb93a758 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -648,7 +648,7 @@ static inline unsigned int order_to_pindex(int migratetype, int order)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	if (order > PAGE_ALLOC_COSTLY_ORDER) {
 		VM_BUG_ON(order != pageblock_order);
-		base = PAGE_ALLOC_COSTLY_ORDER + 1;
+		return NR_LOWORDER_PCP_LISTS;
 	}
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
@@ -662,7 +662,7 @@ static inline int pindex_to_order(unsigned int pindex)
 	int order = pindex / MIGRATE_PCPTYPES;

 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-	if (order > PAGE_ALLOC_COSTLY_ORDER)
+	if (pindex == NR_LOWORDER_PCP_LISTS)
 		order = pageblock_order;
 #else
 	VM_BUG_ON(order > PAGE_ALLOC_COSTLY_ORDER);
--
2.34.1

2022-04-22 22:04:22

by Mel Gorman

Subject: [PATCH 4/6] mm/page_alloc: Remove unnecessary page == NULL check in rmqueue

The VM_BUG_ON check for a valid page can be avoided with a simple
change in the flow. The ZONE_BOOSTED_WATERMARK bit is unlikely to be set
in general, and even more unlikely if the page allocation failed, so mark
the branch unlikely.

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 9 +++++----
1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4c1acf666056..dc0fdeb3795c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3765,17 +3765,18 @@ struct page *rmqueue(struct zone *preferred_zone,

 	page = rmqueue_buddy(preferred_zone, zone, order, alloc_flags,
 							migratetype);
-	if (unlikely(!page))
-		return NULL;

 out:
 	/* Separate test+clear to avoid unnecessary atomics */
-	if (test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags)) {
+	if (unlikely(test_bit(ZONE_BOOSTED_WATERMARK, &zone->flags))) {
 		clear_bit(ZONE_BOOSTED_WATERMARK, &zone->flags);
 		wakeup_kswapd(zone, 0, 0, zone_idx(zone));
 	}

-	VM_BUG_ON_PAGE(page && bad_range(zone, page), page);
+	if (unlikely(!page))
+		return NULL;
+
+	VM_BUG_ON_PAGE(bad_range(zone, page), page);
 	return page;
 }

--
2.34.1

2022-04-26 06:59:12

by Suren Baghdasaryan

Subject: Re: [RFC PATCH 0/6] Drain remote per-cpu directly

On Wed, Apr 20, 2022 at 2:59 AM Mel Gorman <[email protected]> wrote:
>
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may cause a real-time
> task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> the draining is non-deterministic.
>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spin-lock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work would allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. However, there are enough corner cases
> that it deserves a series on its own separated by one kernel release and
> the priority right now is to avoid interference of high priority tasks.
>
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> and when it is storing per-cpu pages.
>
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> this is not necessary but it avoids per_cpu_pages consuming another
> cache line.
>
> Patch 3 is a preparation patch to avoid code duplication.
>
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> a later patch to avoid code duplication.
>
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> relying on local_lock to prevent migration, stabilise the pcp
> lookup and prevent IRQ reentrancy.
>
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.

This quite possibly solves the issue I was trying to fix in
https://lore.kernel.org/all/[email protected].
I will give it a try and see how it looks.
Thanks!

>
> include/linux/mm_types.h | 5 +
> include/linux/mmzone.h | 12 +-
> mm/page_alloc.c | 333 ++++++++++++++++++++++++---------------
> 3 files changed, 222 insertions(+), 128 deletions(-)
>
> --
> 2.34.1
>
>

2022-04-26 08:29:43

by Suren Baghdasaryan

Subject: Re: [RFC PATCH 0/6] Drain remote per-cpu directly

On Mon, Apr 25, 2022 at 7:49 PM Suren Baghdasaryan <[email protected]> wrote:
>
> On Wed, Apr 20, 2022 at 2:59 AM Mel Gorman <[email protected]> wrote:
> >
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may cause a real-time
> > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > the draining is non-deterministic.
> >
> > Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> > The local_lock on its own prevents migration and the IRQ disabling protects
> > from corruption due to an interrupt arriving while a page allocation is
> > in progress. The locking is inherently unsafe for remote access unless
> > the CPU is hot-removed.
> >
> > This series adjusts the locking. A spin-lock is added to struct
> > per_cpu_pages to protect the list contents while local_lock_irq continues
> > to prevent migration and IRQ reentry. This allows a remote CPU to safely
> > drain a remote per-cpu list.
> >
> > This series is a partial series. Follow-on work would allow the
> > local_irq_save to be converted to a local_irq to avoid IRQs being
> > disabled/enabled in most cases. However, there are enough corner cases
> > that it deserves a series on its own separated by one kernel release and
> > the priority right now is to avoid interference of high priority tasks.
> >
> > Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> > and when it is storing per-cpu pages.
> >
> > Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> > this is not necessary but it avoids per_cpu_pages consuming another
> > cache line.
> >
> > Patch 3 is a preparation patch to avoid code duplication.
> >
> > Patch 4 is a simple micro-optimisation that improves code flow necessary for
> > a later patch to avoid code duplication.
> >
> > Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> > relying on local_lock to prevent migration, stabilise the pcp
> > lookup and prevent IRQ reentrancy.
> >
> > Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
>
> This quite possibly solves the issue I was trying to fix in
> https://lore.kernel.org/all/[email protected].
> I will give it a try and see how it looks.

My test shows a sizable improvement in the worst-case drain_all_pages
duration. Before the change I caught cases where a drain_local_pages_wq
work item was delayed by 100+ms (not even counting the
drain_local_pages_wq execution time itself). With this patchset the
worst drain_all_pages duration I was able to record was 17ms.

> Thanks!
>
> >
> > include/linux/mm_types.h | 5 +
> > include/linux/mmzone.h | 12 +-
> > mm/page_alloc.c | 333 ++++++++++++++++++++++++---------------
> > 3 files changed, 222 insertions(+), 128 deletions(-)
> >
> > --
> > 2.34.1
> >
> >

2022-04-26 08:52:05

by Minchan Kim

Subject: Re: [RFC PATCH 0/6] Drain remote per-cpu directly

On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> per-cpu lists drain support" -- avoid interference of a high priority
> task due to a workqueue item draining per-cpu page lists. While many
> workloads can tolerate a brief interruption, it may cause a real-time
> task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> the draining is non-deterministic.

Yeah, the non-determinism is a problem. I have seen the kworker-based
draining take 100+ms (up to 300ms observed) in alloc_contig_range when the
CPUs are heavily loaded.

I am not sure whether Nicolas has already observed this: it is not only a
problem for per_cpu_pages but also for lru_pvecs (pagevec) draining.
Do we need to introduce a similar solution (allowing remote draining with
a spin_lock) for the pagevecs?

>
> Currently an IRQ-safe local_lock protects the page allocator per-cpu lists.
> The local_lock on its own prevents migration and the IRQ disabling protects
> from corruption due to an interrupt arriving while a page allocation is
> in progress. The locking is inherently unsafe for remote access unless
> the CPU is hot-removed.
>
> This series adjusts the locking. A spin-lock is added to struct
> per_cpu_pages to protect the list contents while local_lock_irq continues
> to prevent migration and IRQ reentry. This allows a remote CPU to safely
> drain a remote per-cpu list.
>
> This series is a partial series. Follow-on work would allow the
> local_irq_save to be converted to a local_irq to avoid IRQs being
> disabled/enabled in most cases. However, there are enough corner cases
> that it deserves a series on its own separated by one kernel release and
> the priority right now is to avoid interference of high priority tasks.
>
> Patch 1 is a cosmetic patch to clarify when page->lru is storing buddy pages
> and when it is storing per-cpu pages.
>
> Patch 2 shrinks per_cpu_pages to make room for a spin lock. Strictly speaking
> this is not necessary but it avoids per_cpu_pages consuming another
> cache line.
>
> Patch 3 is a preparation patch to avoid code duplication.
>
> Patch 4 is a simple micro-optimisation that improves code flow necessary for
> a later patch to avoid code duplication.
>
> Patch 5 uses a spin_lock to protect the per_cpu_pages contents while still
> relying on local_lock to prevent migration, stabilise the pcp
> lookup and prevent IRQ reentrancy.
>
> Patch 6 remote drains per-cpu pages directly instead of using a workqueue.
>
> include/linux/mm_types.h | 5 +
> include/linux/mmzone.h | 12 +-
> mm/page_alloc.c | 333 ++++++++++++++++++++++++---------------
> 3 files changed, 222 insertions(+), 128 deletions(-)
>
> --
> 2.34.1
>
>

2022-04-26 16:07:42

by Nicolas Saenz Julienne

Subject: Re: [RFC PATCH 0/6] Drain remote per-cpu directly

On Mon, 2022-04-25 at 15:58 -0700, Minchan Kim wrote:
> On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > per-cpu lists drain support" -- avoid interference of a high priority
> > task due to a workqueue item draining per-cpu page lists. While many
> > workloads can tolerate a brief interruption, it may cause a real-time
> > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > the draining is non-deterministic.
>
> Yeah, the non-determinism is a problem. I have seen the kworker-based
> draining take 100+ms (up to 300ms observed) in alloc_contig_range when the
> CPUs are heavily loaded.
>
> I am not sure whether Nicolas has already observed this: it is not only a
> problem for per_cpu_pages but also for lru_pvecs (pagevec) draining.
> Do we need to introduce a similar solution (allowing remote draining with
> a spin_lock) for the pagevecs?

Yes, I'm aware of the lru problem. I'll start working on it too once we're done
with the page allocator (and if no-one beats me to it). That said, I don't know
if we can apply the exact same approach, the devil is in the details. :)

--
Nicolás Sáenz

2022-04-27 15:49:24

by Marcelo Tosatti

Subject: Re: [RFC PATCH 0/6] Drain remote per-cpu directly

On Tue, Apr 26, 2022 at 01:06:12PM +0200, Nicolas Saenz Julienne wrote:
> On Mon, 2022-04-25 at 15:58 -0700, Minchan Kim wrote:
> > On Wed, Apr 20, 2022 at 10:59:00AM +0100, Mel Gorman wrote:
> > > This series has the same intent as Nicolas' series "mm/page_alloc: Remote
> > > per-cpu lists drain support" -- avoid interference of a high priority
> > > task due to a workqueue item draining per-cpu page lists. While many
> > > workloads can tolerate a brief interruption, it may cause a real-time
> > > task running on a NOHZ_FULL CPU to miss a deadline and, at minimum,
> > > the draining is non-deterministic.
> >
> > Yeah, the non-determinism is a problem. I have seen the kworker-based
> > draining take 100+ms (up to 300ms observed) in alloc_contig_range when the
> > CPUs are heavily loaded.
> >
> > I am not sure whether Nicolas has already observed this: it is not only a
> > problem for per_cpu_pages but also for lru_pvecs (pagevec) draining.
> > Do we need to introduce a similar solution (allowing remote draining with
> > a spin_lock) for the pagevecs?
>
> Yes, I'm aware of the lru problem. I'll start working on it too once we're done
> with the page allocator (and if no-one beats me to it). That said, I don't know
> if we can apply the exact same approach, the devil is in the details. :)

I think one necessary step for that (adding a spinlock to protect the
per-CPU lru_pvecs) would be to find a suitable testcase.

Mel, do you have anything in mind?



>
> --
> Nicolás Sáenz
>
>