2022-02-09 02:46:37

by Nicolas Saenz Julienne

Subject: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
one that allows accessing the lists remotely. Currently, only the local CPU is
permitted to change its per-cpu lists, and it's expected to do so on demand,
whenever another process requires it, by queueing a drain task on that CPU.
This causes problems for NOHZ_FULL CPUs and real-time systems that can't
tolerate any sort of interruption, and to a lesser extent it inconveniences
idle and virtualised systems.

The new algorithm atomically switches the pointer to the per-cpu page lists
and uses RCU to make sure the old lists are no longer being used before
draining them. Its main benefit is that it fixes the issue for good, avoiding
the need for configuration-based heuristics or having to modify applications
(i.e. using the isolation prctl currently being worked on by Marcelo Tosatti).
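
For reference, the drain side of the new scheme boils down to roughly the
following, condensed from patch 2 (per remote CPU and zone; the racy
non-empty check and cpumask handling are omitted here):

    /* Swap the remote CPU's in-use and spare pcplists instances. */
    pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
    pcp->drain = rcu_replace_pointer(pcp->lp, pcp->drain,
                                     mutex_is_locked(&pcpu_drain_mutex));

    /* Wait until no CPU can still hold a reference to the old instance. */
    synchronize_rcu();

    /* Now it's safe to free the old instance's pages back to the buddy. */
    free_pcppages_bulk(zone, pcp->drain->count, READ_ONCE(pcp->batch),
                       pcp->drain);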

All this with minimal performance implications: a page allocation
microbenchmark was run on multiple systems and architectures, generally showing
no performance difference; only the more extreme cases showed a 1-3%
degradation. See the data below. Needless to say, I'd appreciate it if someone
could validate my numbers independently.

The approach has been stress-tested: I forced 100 drains/s while running
mmtests' pft in a loop for a full day on multiple machines and archs (arm64,
x86_64, ppc64le).
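
The series doesn't include the code used to force the drains. A minimal,
hypothetical sketch of one way to trigger ~100 drains/s from built-in test
code (drain_all_pages() isn't exported to modules) could look like this:

    /* Hypothetical stress helper, not part of the series. */
    #include <linux/workqueue.h>
    #include <linux/jiffies.h>
    #include <linux/gfp.h>

    static struct delayed_work drain_stress_work;

    static void drain_stress_fn(struct work_struct *work)
    {
            /* Drain every populated zone's pcplists on all CPUs. */
            drain_all_pages(NULL);
            schedule_delayed_work(&drain_stress_work, msecs_to_jiffies(10));
    }

    static void drain_stress_start(void)
    {
            INIT_DELAYED_WORK(&drain_stress_work, drain_stress_fn);
            schedule_delayed_work(&drain_stress_work, msecs_to_jiffies(10));
    }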

Note that this is not the first attempt at fixing the per-cpu page lists drain
mechanism:
- The first attempt[1] tried to conditionally change the pagesets locking
scheme based on the NOHZ_FULL config. It was deemed hard to maintain, as the
NOHZ_FULL code path would be rarely tested. Also, it only solves the issue
for NOHZ_FULL setups, which isn't ideal.
- The second[2] unconditionally switched the local_locks to per-cpu spinlocks.
The performance degradation was too big.

Previous RFC: https://lkml.org/lkml/2021/10/8/793

Thanks!

[1] https://lkml.org/lkml/2021/9/21/599
[2] https://lkml.org/lkml/2021/11/3/644

---

Changes since RFC:
- Get rid of aesthetic changes that affected performance
- Add more documentation
- Add better commit messages
- Pass sparse tests
- Verify this_cpu_*() usage
- Performance measurements

Nicolas Saenz Julienne (2):
mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
mm/page_alloc: Add remote draining support to per-cpu lists

include/linux/mmzone.h | 28 +++++-
mm/page_alloc.c | 212 ++++++++++++++++++++++++++---------------
mm/vmstat.c | 6 +-
3 files changed, 162 insertions(+), 84 deletions(-)


-------------------------Performance results-----------------------------

I'm focusing on mmtests' Page Fault Test (pft), as it's page allocator
intensive.

- AMD Daytona Reference System, 2 sockets, AMD EPYC 7742, Zen 2, 64-Core,
4 NUMA nodes, x86_64

pft timings:
vanilla rcu
Amean system-1 58.52 ( 0.00%) 58.92 * -0.68%*
Amean system-4 61.00 ( 0.00%) 61.41 * -0.67%*
Amean system-7 61.55 ( 0.00%) 61.74 * -0.30%*
Amean system-12 64.91 ( 0.00%) 64.94 * -0.05%*
Amean system-21 98.80 ( 0.00%) 99.92 * -1.13%*
Amean system-30 147.68 ( 0.00%) 145.83 * 1.25%*
Amean system-48 237.04 ( 0.00%) 241.29 * -1.79%*
Amean system-79 286.61 ( 0.00%) 283.72 * 1.01%*
Amean system-110 303.40 ( 0.00%) 299.91 * 1.15%*
Amean system-128 345.07 ( 0.00%) 342.10 * 0.86%*
Amean elapsed-1 61.21 ( 0.00%) 61.65 * -0.71%*
Amean elapsed-4 15.94 ( 0.00%) 16.05 * -0.69%*
Amean elapsed-7 9.24 ( 0.00%) 9.28 * -0.47%*
Amean elapsed-12 5.70 ( 0.00%) 5.70 * -0.07%*
Amean elapsed-21 5.11 ( 0.00%) 5.06 * 1.13%*
Amean elapsed-30 5.28 ( 0.00%) 5.14 * 2.73%*
Amean elapsed-48 5.28 ( 0.00%) 5.24 * 0.74%*
Amean elapsed-79 4.41 ( 0.00%) 4.31 * 2.17%*
Amean elapsed-110 3.45 ( 0.00%) 3.44 * 0.40%*
Amean elapsed-128 2.75 ( 0.00%) 2.75 * -0.28%*

- AMD Speedway Reference System, 2 sockets, AMD EPYC 7601, Zen 1, 64-core, 8
NUMA nodes, x86_64. Lots of variance between tests on this platform. It'll
easily swing ±5% on each result.

pft timings:
vanilla rcu
Amean system-1 69.20 ( 0.00%) 66.21 * 4.32%*
Amean system-4 70.79 ( 0.00%) 69.01 * 2.52%*
Amean system-7 71.34 ( 0.00%) 69.16 * 3.05%*
Amean system-12 74.00 ( 0.00%) 72.74 * 1.70%*
Amean system-21 86.01 ( 0.00%) 85.70 * 0.36%*
Amean system-30 89.21 ( 0.00%) 89.93 * -0.80%*
Amean system-48 92.39 ( 0.00%) 92.43 * -0.04%*
Amean system-79 120.19 ( 0.00%) 121.30 * -0.92%*
Amean system-110 172.79 ( 0.00%) 179.37 * -3.81%*
Amean system-128 201.70 ( 0.00%) 212.57 * -5.39%*
Amean elapsed-1 72.23 ( 0.00%) 69.29 * 4.08%*
Amean elapsed-4 18.69 ( 0.00%) 18.28 * 2.20%*
Amean elapsed-7 10.80 ( 0.00%) 10.54 * 2.41%*
Amean elapsed-12 6.62 ( 0.00%) 6.53 * 1.30%*
Amean elapsed-21 4.68 ( 0.00%) 4.69 * -0.14%*
Amean elapsed-30 3.44 ( 0.00%) 3.50 * -1.66%*
Amean elapsed-48 2.40 ( 0.00%) 2.42 * -1.00%*
Amean elapsed-79 2.05 ( 0.00%) 2.09 * -1.90%*
Amean elapsed-110 1.83 ( 0.00%) 1.91 * -4.60%*
Amean elapsed-128 1.75 ( 0.00%) 1.85 * -5.99%*

- IBM 9006-22C system, 2 sockets, POWER9, 64-Core, 1 NUMA node per CPU,
ppc64le.

pft timings:
vanilla rcu
Amean system-1 1.82 ( 0.00%) 1.85 * -1.43%*
Amean system-4 2.18 ( 0.00%) 2.22 * -2.02%*
Amean system-7 3.27 ( 0.00%) 3.28 * -0.15%*
Amean system-12 5.22 ( 0.00%) 5.20 * 0.26%*
Amean system-21 10.10 ( 0.00%) 10.20 * -1.00%*
Amean system-30 15.00 ( 0.00%) 14.52 * 3.20%*
Amean system-48 26.41 ( 0.00%) 25.96 * 1.71%*
Amean system-79 29.35 ( 0.00%) 29.70 * -1.21%*
Amean system-110 24.01 ( 0.00%) 23.40 * 2.54%*
Amean system-128 24.57 ( 0.00%) 25.32 * -3.06%*
Amean elapsed-1 1.85 ( 0.00%) 1.87 * -1.28%*
Amean elapsed-4 0.56 ( 0.00%) 0.57 * -1.72%*
Amean elapsed-7 0.51 ( 0.00%) 0.50 * 0.07%*
Amean elapsed-12 0.51 ( 0.00%) 0.51 * 0.06%*
Amean elapsed-21 0.54 ( 0.00%) 0.54 * 0.06%*
Amean elapsed-30 0.54 ( 0.00%) 0.53 * 2.22%*
Amean elapsed-48 0.58 ( 0.00%) 0.57 * 1.73%*
Amean elapsed-79 0.49 ( 0.00%) 0.48 * 0.89%*
Amean elapsed-110 0.37 ( 0.00%) 0.37 * -1.08%*
Amean elapsed-128 0.33 ( 0.00%) 0.33 * 0.00%*

- Ampere MtSnow, 1 socket, Neoverse-N1, 80-Cores, 1 NUMA node, arm64.

pft timings:
vanilla rcu
Amean system-1 11.92 ( 0.00%) 11.99 * -0.61%*
Amean system-4 13.13 ( 0.00%) 13.09 * 0.31%*
Amean system-7 13.91 ( 0.00%) 13.94 * -0.20%*
Amean system-12 15.77 ( 0.00%) 15.69 * 0.48%*
Amean system-21 21.32 ( 0.00%) 21.42 * -0.46%*
Amean system-30 28.58 ( 0.00%) 29.12 * -1.90%*
Amean system-48 47.41 ( 0.00%) 46.91 * 1.04%*
Amean system-79 76.76 ( 0.00%) 77.16 * -0.52%*
Amean system-80 77.98 ( 0.00%) 78.23 * -0.32%*
Amean elapsed-1 12.46 ( 0.00%) 12.53 * -0.58%*
Amean elapsed-4 3.47 ( 0.00%) 3.46 * 0.34%*
Amean elapsed-7 2.18 ( 0.00%) 2.21 * -1.58%*
Amean elapsed-12 1.41 ( 0.00%) 1.42 * -0.80%*
Amean elapsed-21 1.09 ( 0.00%) 1.12 * -2.60%*
Amean elapsed-30 0.98 ( 0.00%) 1.01 * -3.08%*
Amean elapsed-48 1.08 ( 0.00%) 1.10 * -1.78%*
Amean elapsed-79 1.32 ( 0.00%) 1.28 * 2.71%*
Amean elapsed-80 1.32 ( 0.00%) 1.28 * 3.23%*

- Dell R430, 2 sockets, Intel Xeon E5-2640 v3, Haswell, 16-Cores, 2 NUMA
nodes, x86_64.

pft timings:
vanilla rcu
Amean system-1 11.10 ( 0.00%) 11.07 * 0.24%*
Amean system-3 11.14 ( 0.00%) 11.10 * 0.34%*
Amean system-5 11.18 ( 0.00%) 11.13 * 0.47%*
Amean system-7 11.21 ( 0.00%) 11.17 * 0.38%*
Amean system-12 11.28 ( 0.00%) 11.28 ( -0.03%)
Amean system-18 13.24 ( 0.00%) 13.25 * -0.11%*
Amean system-24 17.12 ( 0.00%) 17.14 ( -0.13%)
Amean system-30 21.10 ( 0.00%) 21.23 * -0.60%*
Amean system-32 22.31 ( 0.00%) 22.47 * -0.71%*
Amean elapsed-1 11.76 ( 0.00%) 11.73 * 0.29%*
Amean elapsed-3 3.93 ( 0.00%) 3.93 * 0.17%*
Amean elapsed-5 2.39 ( 0.00%) 2.37 * 0.74%*
Amean elapsed-7 1.72 ( 0.00%) 1.71 * 0.81%*
Amean elapsed-12 1.02 ( 0.00%) 1.03 ( -0.42%)
Amean elapsed-18 1.13 ( 0.00%) 1.14 ( -0.18%)
Amean elapsed-24 0.87 ( 0.00%) 0.88 * -0.65%*
Amean elapsed-30 0.77 ( 0.00%) 0.78 * -0.86%*
Amean elapsed-32 0.74 ( 0.00%) 0.74 ( 0.00%)

- HPE Apollo 70, 2 sockets, Cavium ThunderX2, 128-Cores, 2 NUMA nodes, arm64.
NOTE: The test here only goes up to 128 for some reason, although there are
256 CPUs. Maybe an mmtests issue? I didn't investigate.

pft timings:
vanilla rcu
Amean system-1 4.42 ( 0.00%) 4.36 * 1.29%*
Amean system-4 4.56 ( 0.00%) 4.51 * 1.05%*
Amean system-7 4.63 ( 0.00%) 4.65 * -0.42%*
Amean system-12 5.96 ( 0.00%) 6.02 * -1.00%*
Amean system-21 10.97 ( 0.00%) 11.01 * -0.32%*
Amean system-30 16.01 ( 0.00%) 16.04 * -0.19%*
Amean system-48 26.81 ( 0.00%) 26.78 * 0.09%*
Amean system-79 30.80 ( 0.00%) 30.85 * -0.16%*
Amean system-110 31.87 ( 0.00%) 31.93 * -0.19%*
Amean system-128 36.27 ( 0.00%) 36.31 * -0.10%*
Amean elapsed-1 4.88 ( 0.00%) 4.85 * 0.60%*
Amean elapsed-4 1.27 ( 0.00%) 1.26 * 1.00%*
Amean elapsed-7 0.73 ( 0.00%) 0.74 * -0.46%*
Amean elapsed-12 0.55 ( 0.00%) 0.55 * 1.09%*
Amean elapsed-21 0.59 ( 0.00%) 0.60 * -0.96%*
Amean elapsed-30 0.60 ( 0.00%) 0.60 * 0.28%*
Amean elapsed-48 0.60 ( 0.00%) 0.60 * 0.44%*
Amean elapsed-79 0.49 ( 0.00%) 0.49 * -0.07%*
Amean elapsed-110 0.36 ( 0.00%) 0.36 * 0.28%*
Amean elapsed-128 0.31 ( 0.00%) 0.31 * -0.43%*

- Raspberry Pi 4, 1 socket, bcm2711, Cortex-A72, 4-Cores, 1 NUMA node, arm64.

pft timings:
vanilla rcu
Amean system-1 0.67 ( 0.00%) 0.67 * -1.25%*
Amean system-3 1.30 ( 0.00%) 1.29 * 0.62%*
Amean system-4 1.61 ( 0.00%) 1.59 * 0.95%*
Amean elapsed-1 0.71 ( 0.00%) 0.72 * -1.17%*
Amean elapsed-3 0.45 ( 0.00%) 0.45 * 0.88%*
Amean elapsed-4 0.42 ( 0.00%) 0.42 * 1.19%*


--
2.34.1



2022-02-09 07:55:32

by Nicolas Saenz Julienne

Subject: [PATCH 1/2] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly

In preparation for adding remote per-cpu page list drain support, let's
bundle 'struct per_cpu_pages's page lists and page count into a new
structure, 'struct pcplists', and have all code access it indirectly
through a pointer. Upcoming patches will use it in order to maintain
multiple instances of 'pcplists' and switch the pointer atomically.

The 'struct pcplists' instance lives inside 'struct per_cpu_pages' and
shouldn't be accessed directly. It is set up this way because these
structures are used during early boot, when no memory allocation is
possible, and to simplify the memory hotplug code paths.

free_pcppages_bulk()'s and __rmqueue_pcplist()'s function signatures
change slightly so as to accommodate these changes without affecting
performance.

No functional change intended.
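
For illustration, the hot-path access pattern changes as follows (condensed
from the free_unref_page_commit() hunk below):

    /* before */
    pcp = this_cpu_ptr(zone->per_cpu_pageset);
    list_add(&page->lru, &pcp->lists[pindex]);
    pcp->count += 1 << order;

    /* after: the lists and count are reached through the 'lp' pointer */
    pcp = this_cpu_ptr(zone->per_cpu_pageset);
    lp = pcp->lp;
    list_add(&page->lru, &lp->lists[pindex]);
    lp->count += 1 << order;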

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---

Changes since RFC:
- Add more info in commit message.
- Removed the __private attribute; in hindsight it doesn't really fit what
we're doing here.
- Use raw_cpu_ptr() where relevant to avoid warnings.

include/linux/mmzone.h | 10 +++--
mm/page_alloc.c | 87 +++++++++++++++++++++++++-----------------
mm/vmstat.c | 6 +--
3 files changed, 62 insertions(+), 41 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 3fff6deca2c0..b4cb85d9c6e8 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -381,7 +381,6 @@ enum zone_watermarks {

/* Fields and list protected by pagesets local_lock in page_alloc.c */
struct per_cpu_pages {
- int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
short free_factor; /* batch scaling factor during free */
@@ -389,8 +388,13 @@ struct per_cpu_pages {
short expire; /* When 0, remote pagesets are drained */
#endif

- /* Lists of pages, one per migrate type stored on the pcp-lists */
- struct list_head lists[NR_PCP_LISTS];
+ struct pcplists *lp;
+ struct pcplists {
+ /* Number of pages in the pcplists */
+ int count;
+ /* Lists of pages, one per migrate type stored on the pcp-lists */
+ struct list_head lists[NR_PCP_LISTS];
+ } __pcplists; /* Do not access directly */
};

struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f549123150c..4f37815b0e4c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1449,13 +1449,12 @@ static inline void prefetch_buddy(struct page *page)
* count is the number of pages to free.
*/
static void free_pcppages_bulk(struct zone *zone, int count,
- struct per_cpu_pages *pcp)
+ int batch, struct pcplists *lp)
{
int pindex = 0;
int batch_free = 0;
int nr_freed = 0;
unsigned int order;
- int prefetch_nr = READ_ONCE(pcp->batch);
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1464,7 +1463,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
* Ensure proper count is passed which otherwise would stuck in the
* below while (list_empty(list)) loop.
*/
- count = min(pcp->count, count);
+ count = min(lp->count, count);
while (count > 0) {
struct list_head *list;

@@ -1479,7 +1478,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free++;
if (++pindex == NR_PCP_LISTS)
pindex = 0;
- list = &pcp->lists[pindex];
+ list = &lp->lists[pindex];
} while (list_empty(list));

/* This is the only non-empty list. Free them all. */
@@ -1513,13 +1512,13 @@ static void free_pcppages_bulk(struct zone *zone, int count,
* avoid excessive prefetching due to large count, only
* prefetch buddy for the first pcp->batch nr of pages.
*/
- if (prefetch_nr) {
+ if (batch) {
prefetch_buddy(page);
- prefetch_nr--;
+ batch--;
}
} while (count > 0 && --batch_free && !list_empty(list));
}
- pcp->count -= nr_freed;
+ lp->count -= nr_freed;

/*
* local_lock_irq held so equivalent to spin_lock_irqsave for
@@ -3130,14 +3129,16 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
+ struct pcplists *lp;
unsigned long flags;
int to_drain, batch;

local_lock_irqsave(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
- to_drain = min(pcp->count, batch);
+ lp = pcp->lp;
+ to_drain = min(lp->count, batch);
if (to_drain > 0)
- free_pcppages_bulk(zone, to_drain, pcp);
+ free_pcppages_bulk(zone, to_drain, batch, lp);
local_unlock_irqrestore(&pagesets.lock, flags);
}
#endif
@@ -3153,12 +3154,14 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
{
unsigned long flags;
struct per_cpu_pages *pcp;
+ struct pcplists *lp;

local_lock_irqsave(&pagesets.lock, flags);

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- if (pcp->count)
- free_pcppages_bulk(zone, pcp->count, pcp);
+ lp = pcp->lp;
+ if (lp->count)
+ free_pcppages_bulk(zone, lp->count, READ_ONCE(pcp->batch), lp);

local_unlock_irqrestore(&pagesets.lock, flags);
}
@@ -3219,7 +3222,7 @@ static void drain_local_pages_wq(struct work_struct *work)
*
* drain_all_pages() is optimized to only execute on cpus where pcplists are
* not empty. The check for non-emptiness can however race with a free to
- * pcplist that has not yet increased the pcp->count from 0 to 1. Callers
+ * pcplist that has not yet increased the lp->count from 0 to 1. Callers
* that need the guarantee that every CPU has drained can disable the
* optimizing racy check.
*/
@@ -3258,24 +3261,24 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
* disables preemption as part of its processing
*/
for_each_online_cpu(cpu) {
- struct per_cpu_pages *pcp;
struct zone *z;
bool has_pcps = false;
+ struct pcplists *lp;

if (force_all_cpus) {
/*
- * The pcp.count check is racy, some callers need a
+ * The lp->count check is racy, some callers need a
* guarantee that no cpu is missed.
*/
has_pcps = true;
} else if (zone) {
- pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- if (pcp->count)
+ lp = per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp;
+ if (lp->count)
has_pcps = true;
} else {
for_each_populated_zone(z) {
- pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
- if (pcp->count) {
+ lp = per_cpu_ptr(z->per_cpu_pageset, cpu)->lp;
+ if (lp->count) {
has_pcps = true;
break;
}
@@ -3427,19 +3430,21 @@ static void free_unref_page_commit(struct page *page, int migratetype,
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
+ struct pcplists *lp;
int high;
int pindex;

__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ lp = pcp->lp;
pindex = order_to_pindex(migratetype, order);
- list_add(&page->lru, &pcp->lists[pindex]);
- pcp->count += 1 << order;
+ list_add(&page->lru, &lp->lists[pindex]);
+ lp->count += 1 << order;
high = nr_pcp_high(pcp, zone);
- if (pcp->count >= high) {
+ if (lp->count >= high) {
int batch = READ_ONCE(pcp->batch);

- free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
+ free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), batch, lp);
}
}

@@ -3660,7 +3665,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
int migratetype,
unsigned int alloc_flags,
struct per_cpu_pages *pcp,
- struct list_head *list)
+ struct list_head *list,
+ int *count)
{
struct page *page;

@@ -3682,14 +3688,14 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
batch, list,
migratetype, alloc_flags);

- pcp->count += alloced << order;
+ *count += alloced << order;
if (unlikely(list_empty(list)))
return NULL;
}

page = list_first_entry(list, struct page, lru);
list_del(&page->lru);
- pcp->count -= 1 << order;
+ *count -= 1 << order;
} while (check_new_pcp(page));

return page;
@@ -3703,8 +3709,10 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
{
struct per_cpu_pages *pcp;
struct list_head *list;
+ struct pcplists *lp;
struct page *page;
unsigned long flags;
+ int *count;

local_lock_irqsave(&pagesets.lock, flags);

@@ -3715,8 +3723,11 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
*/
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp->free_factor >>= 1;
- list = &pcp->lists[order_to_pindex(migratetype, order)];
- page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
+ lp = pcp->lp;
+ list = &lp->lists[order_to_pindex(migratetype, order)];
+ count = &lp->count;
+ page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
+ pcp, list, count);
local_unlock_irqrestore(&pagesets.lock, flags);
if (page) {
__count_zid_vm_events(PGALLOC, page_zonenum(page), 1);
@@ -5255,9 +5266,11 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
struct per_cpu_pages *pcp;
struct list_head *pcp_list;
struct alloc_context ac;
+ struct pcplists *lp;
gfp_t alloc_gfp;
unsigned int alloc_flags = ALLOC_WMARK_LOW;
int nr_populated = 0, nr_account = 0;
+ int *count;

/*
* Skip populated array elements to determine if any pages need
@@ -5333,7 +5346,9 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
/* Attempt the batch allocation */
local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
- pcp_list = &pcp->lists[order_to_pindex(ac.migratetype, 0)];
+ lp = pcp->lp;
+ pcp_list = &lp->lists[order_to_pindex(ac.migratetype, 0)];
+ count = &lp->count;

while (nr_populated < nr_pages) {

@@ -5344,7 +5359,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
}

page = __rmqueue_pcplist(zone, 0, ac.migratetype, alloc_flags,
- pcp, pcp_list);
+ pcp, pcp_list, count);
if (unlikely(!page)) {
/* Try and get at least one page */
if (!nr_populated)
@@ -5947,7 +5962,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
continue;

for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+ free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
}

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -6041,7 +6056,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)

free_pcp = 0;
for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->count;
+ free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;

show_node(zone);
printk(KERN_CONT
@@ -6084,7 +6099,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
- K(this_cpu_read(zone->per_cpu_pageset->count)),
+ K(raw_cpu_ptr(zone->per_cpu_pageset)->lp->count),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -6971,7 +6986,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)

/*
* pcp->high and pcp->batch values are related and generally batch is lower
- * than high. They are also related to pcp->count such that count is lower
+ * than high. They are also related to pcp->lp->count such that count is lower
* than high, and as soon as it reaches high, the pcplist is flushed.
*
* However, guaranteeing these relations at all times would require e.g. write
@@ -6979,7 +6994,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* thus be prone to error and bad for performance. Thus the update only prevents
* store tearing. Any new users of pcp->batch and pcp->high should ensure they
* can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * pcp->lp->count field on the local CPU with interrupts disabled.
*
* mutex_is_locked(&pcp_batch_high_lock) required when calling this function
* outside of boot time (or some other assurance that no concurrent updaters
@@ -6999,8 +7014,10 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
memset(pcp, 0, sizeof(*pcp));
memset(pzstats, 0, sizeof(*pzstats));

+ pcp->lp = &pcp->__pcplists;
+
for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
- INIT_LIST_HEAD(&pcp->lists[pindex]);
+ INIT_LIST_HEAD(&pcp->lp->lists[pindex]);

/*
* Set batch and high values safe for a boot pageset. A true percpu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index d5cc8d739fac..576b2b932ccd 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
* if not then there is nothing to expire.
*/
if (!__this_cpu_read(pcp->expire) ||
- !__this_cpu_read(pcp->count))
+ !this_cpu_ptr(pcp)->lp->count)
continue;

/*
@@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
if (__this_cpu_dec_return(pcp->expire))
continue;

- if (__this_cpu_read(pcp->count)) {
+ if (this_cpu_ptr(pcp)->lp->count) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
changes++;
}
@@ -1728,7 +1728,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n high: %i"
"\n batch: %i",
i,
- pcp->count,
+ pcp->lp->count,
pcp->high,
pcp->batch);
#ifdef CONFIG_SMP
--
2.34.1


2022-02-09 12:34:45

by Nicolas Saenz Julienne

Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Wed, 2022-02-09 at 19:26 +0800, Xiongfeng Wang wrote:
> Hi,
>
> On 2022/2/9 17:45, Nicolas Saenz Julienne wrote:
> > Hi Xiongfeng, thanks for taking the time to look at this.
> >
> > On Wed, 2022-02-09 at 16:55 +0800, Xiongfeng Wang wrote:
> > > Hi Nicolas,
> > >
> > > When I applied the patchset on the following commit and tested on QEMU, I came
> > > accross the following CallTrace.
> > > commit dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
> > >
> > > I wrote a userspace application to consume the memory. When the memory is used
> > > out, the OOM killer is triggered and the following Calltrace is printed. I am
> > > not sure if it is related to this patchset. But when I reverted this patchset,
> > > the 'NULL pointer' Calltrace didn't show.
> >
> > It's a silly mistake on my part, while cleaning up the code I messed up one of
> > the 'struct per_cpu_pages' accessors. This should fix it:
> >
> > ------------------------->8-------------------------
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index 0caa7155ca34..e65b991c3dc8 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -3279,7 +3279,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> > has_pcps = true;
> > } else {
> > for_each_populated_zone(z) {
> > - pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> > + pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
> > lp = rcu_dereference_protected(pcp->lp,
> > mutex_is_locked(&pcpu_drain_mutex));
> > if (lp->count) {
>
> I have tested it. It works well. No more 'NULL pointer' Calltrace.

Thanks!

--
Nicolás Sáenz


2022-02-09 13:06:22

by Xiongfeng Wang

Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Nicolas,

When I applied the patchset on top of the following commit and tested on QEMU,
I came across the following call trace:
commit dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0

I wrote a userspace application to consume the memory. When the memory is used
up, the OOM killer is triggered and the following call trace is printed. I am
not sure if it is related to this patchset, but when I reverted this patchset,
the 'NULL pointer' call trace didn't show.

root@syzkaller:~# [ 148.041840][ T476] Unable to handle kernel NULL pointer
dereference at virtual address 0000000000000058
[ 148.042838][ T476] Mem abort info:
[ 148.043188][ T476] ESR = 0x96000004
[ 148.043540][ T476] EC = 0x25: DABT (current EL), IL = 32 bits
[ 148.044102][ T476] SET = 0, FnV = 0
[ 148.044451][ T476] EA = 0, S1PTW = 0
[ 148.044812][ T476] FSC = 0x04: level 0 translation fault
[ 148.045333][ T476] Data abort info:
[ 148.045666][ T476] ISV = 0, ISS = 0x00000004
[ 148.046103][ T476] CM = 0, WnR = 0
[ 148.046444][ T476] user pgtable: 4k pages, 48-bit VAs, pgdp=00000000abcec000
[ 148.047187][ T476] [0000000000000058] pgd=0000000000000000, p4d=0000000000000000
[ 148.047915][ T476] Internal error: Oops: 96000004 [#1] SMP
[ 148.048450][ T476] Dumping ftrace buffer:
[ 148.048843][ T476] (ftrace buffer empty)
[ 148.049473][ T476] Modules linked in:
[ 148.049834][ T476] CPU: 1 PID: 476 Comm: test Not tainted 5.17.0-rc1+ #19
[ 148.050485][ T476] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
02/06/2015
[ 148.051242][ T476] pstate: 80400005 (Nzcv daif +PAN -UAO -TCO -DIT -SSBS
BTYPE=--)
[ 148.052019][ T476] pc : __drain_all_pages+0x1c4/0x318
[ 148.052548][ T476] lr : __drain_all_pages+0x1a0/0x318
[ 148.053070][ T476] sp : ffff80000ce9ba20
[ 148.053477][ T476] x29: ffff80000ce9ba20 x28: 0000000000000000 x27:
0000000000000000
[ 148.054268][ T476] x26: 0000000000000000 x25: ffff800009799000 x24:
0000000000000000
[ 148.055067][ T476] x23: ffff80000979f460 x22: ffff800009bac000 x21:
ffff800009ed7110
[ 148.055997][ T476] x20: ffff80000979a000 x19: 0000000000000000 x18:
0000000000000000
[ 148.056784][ T476] x17: 0000000000000000 x16: 0000000000000000 x15:
0000000000000000
[ 148.058493][ T476] x14: 0000000000000000 x13: 0000000000000000 x12:
ffff80000ce9b558
[ 148.059649][ T476] x11: 0000000000000000 x10: ffff000bfffe6240 x9 :
0000000000000002
[ 148.060775][ T476] x8 : 0000000000000000 x7 : ffffffffffffffff x6 :
0000000000000000
[ 148.061609][ T476] x5 : 0000000000000000 x4 : ffff80000979bb70 x3 :
ffff800bf6a18000
[ 148.062435][ T476] x2 : 0000000000000080 x1 : 00000000000c0000 x0 :
ffff000bfffe6240
[ 148.063267][ T476] Call trace:
[ 148.063602][ T476] __drain_all_pages+0x1c4/0x318
[ 148.064110][ T476] __alloc_pages_slowpath.constprop.126+0x328/0xac8
[ 148.064790][ T476] __alloc_pages+0x270/0x330
[ 148.065260][ T476] alloc_pages+0xb0/0x160
[ 148.065703][ T476] __pte_alloc+0x44/0x180
[ 148.066147][ T476] __handle_mm_fault+0x59c/0xb30
[ 148.066656][ T476] handle_mm_fault+0xc4/0x210
[ 148.067139][ T476] do_page_fault+0x1cc/0x480
[ 148.067607][ T476] do_translation_fault+0x80/0x98
[ 148.068265][ T476] do_mem_abort+0x50/0xa0
[ 148.068724][ T476] el0_da+0x40/0x118
[ 148.069132][ T476] el0t_64_sync_handler+0x60/0xb0
[ 148.069659][ T476] el0t_64_sync+0x16c/0x170
[ 148.070127][ T476] Code: b4fff7e0 f9404401 b4ffffa1 f87c7ae3 (f9402f01)
[ 148.070874][ T476] ---[ end trace 0000000000000000 ]---
[ 149.059678][ T229] systemd-journal invoked oom-killer:
gfp_mask=0x1140cca(GFP_HIGHUSER_MOVABLE|__GFP_COMP), order=0, oom_score_adj=0
[ 149.064280][ T229] CPU: 0 PID: 229 Comm: systemd-journal Tainted: G D
5.17.0-rc1+ #19
[ 149.069110][ T229] Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0
02/06/2015
[ 149.073070][ T229] Call trace:
[ 149.074392][ T229] dump_backtrace+0x114/0x120
[ 149.077045][ T229] show_stack+0x2c/0x78
[ 149.079680][ T229] dump_stack_lvl+0x64/0x7c
[ 149.081741][ T229] dump_stack+0x14/0x2c
[ 149.083760][ T229] dump_header+0x48/0x200
[ 149.085772][ T229] oom_kill_process+0x230/0x2e0
[ 149.087877][ T229] out_of_memory+0xf0/0x5a8
[ 149.090776][ T229] __alloc_pages_slowpath.constprop.126+0x830/0xac8
[ 149.103397][ T229] __alloc_pages+0x270/0x330
[ 149.112662][ T229] alloc_pages+0xb0/0x160
[ 149.113088][ T229] folio_alloc+0x28/0x50
[ 149.113504][ T229] filemap_alloc_folio+0x98/0xc8
[ 149.113990][ T229] __filemap_get_folio+0x194/0x2f0
[ 149.114492][ T229] filemap_fault+0x434/0x708
[ 149.115799][ T229] __do_fault+0x3c/0x118
[ 149.117842][ T229] do_fault+0x160/0x428
[ 149.119893][ T229] __handle_mm_fault+0x498/0xb30
[ 149.123249][ T229] handle_mm_fault+0xc4/0x210
[ 149.125558][ T229] do_page_fault+0x1cc/0x480
[ 149.128347][ T229] do_translation_fault+0x80/0x98
[ 149.130444][ T229] do_mem_abort+0x50/0xa0
[ 149.132513][ T229] el0_da+0x40/0x118
[ 149.135152][ T229] el0t_64_sync_handler+0x60/0xb0
[ 149.137802][ T229] el0t_64_sync+0x16c/0x170
[ 149.140088][ T229] Mem-Info:
[ 149.141197][ T229] active_anon:108 inactive_anon:12226341 isolated_anon:0
[ 149.141197][ T229] active_file:68 inactive_file:6 isolated_file:0
[ 149.141197][ T229] unevictable:0 dirty:0 writeback:0
[ 149.141197][ T229] slab_reclaimable:5964 slab_unreclaimable:14242
[ 149.141197][ T229] mapped:41 shmem:995 pagetables:24564 bounce:0
[ 149.141197][ T229] kernel_misc_reclaimable:0
[ 149.141197][ T229] free:52077 free_pcp:1845 free_cma:13312

Thanks,
Xiongfeng


2022-02-09 13:10:12

by Nicolas Saenz Julienne

Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Xiongfeng, thanks for taking the time to look at this.

On Wed, 2022-02-09 at 16:55 +0800, Xiongfeng Wang wrote:
> Hi Nicolas,
>
> When I applied the patchset on the following commit and tested on QEMU, I came
> accross the following CallTrace.
> commit dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
>
> I wrote a userspace application to consume the memory. When the memory is used
> out, the OOM killer is triggered and the following Calltrace is printed. I am
> not sure if it is related to this patchset. But when I reverted this patchset,
> the 'NULL pointer' Calltrace didn't show.

It's a silly mistake on my part, while cleaning up the code I messed up one of
the 'struct per_cpu_pages' accessors. This should fix it:

------------------------->8-------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0caa7155ca34..e65b991c3dc8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3279,7 +3279,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
has_pcps = true;
} else {
for_each_populated_zone(z) {
- pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
lp = rcu_dereference_protected(pcp->lp,
mutex_is_locked(&pcpu_drain_mutex));
if (lp->count) {
------------------------->8-------------------------

Thanks!

--
Nicolás Sáenz


2022-02-09 13:34:23

by Nicolas Saenz Julienne

Subject: [PATCH 2/2] mm/page_alloc: Add remote draining support to per-cpu lists

The page allocator's per-cpu page lists (pcplists) are currently
protected using local_locks. While good for performance, this doesn't
allow for remote access to these structures. CPUs requiring system-wide
changes to the per-cpu lists get around this by scheduling workers on
each CPU. That said, some setups, like NOHZ_FULL CPUs, aren't well
suited to this since they can't handle interruptions of any sort.

To mitigate this, replace the current draining mechanism with one that
allows remotely draining the lists:

- Each CPU now has two pcplists pointers: one that points to a pcplists
instance that is in-use, 'pcp->lp', another that points to an idle
and empty instance, 'pcp->drain'. CPUs access their local pcplists
through 'pcp->lp' and the pointer is dereferenced atomically.

- When a CPU decides it needs to empty some remote pcplists, it'll
atomically exchange the remote CPU's 'pcp->lp' and 'pcp->drain'
pointers. A remote CPU racing with this will either have:

- An old 'pcp->lp' reference, it'll soon be emptied by the drain
process, we just have to wait for it to finish using it.

- The new 'pcp->lp' reference, that is, an empty pcplists instance.
rcu_replace_pointer()'s release semantics ensures any prior
changes will be visible to the remote CPU, for example: changes
to 'pcp->high' and 'pcp->batch' when disabling the pcplists.

- The CPU that started the drain can now wait for an RCU grace period
to make sure the remote CPU is done using the old pcplists.
synchronize_rcu() counts as a full memory barrier, so any changes the
local CPU makes to the soon to be drained pcplists will be visible to
the draining CPU once it returns.

- Then the CPU can safely free the old pcplists. Nobody else holds a
reference to it. Note that concurrent access to the remote pcplists
drain is protected by the 'pcpu_drain_mutex'.

From an RCU perspective, we're only protecting access to the pcplists
pointer, the drain operation is the writer and the local_lock critical
sections are the readers. RCU guarantees atomicity both while
dereferencing the pcplists pointer and replacing it. It also checks for
RCU critical section/locking correctness, as all readers have to hold
their per-cpu pagesets local_lock, which also counts as a critical
section from RCU's perspective.
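
In code, a reader-side critical section in the hot paths ends up looking
like this (condensed from the free_unref_page_commit() hunk below):

    local_lock_irqsave(&pagesets.lock, flags);
    pcp = this_cpu_ptr(zone->per_cpu_pageset);
    lp = rcu_dereference_check(pcp->lp,
                               lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
    list_add(&page->lru, &lp->lists[pindex]);
    lp->count += 1 << order;
    local_unlock_irqrestore(&pagesets.lock, flags);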

From a performance perspective, on production configurations, the patch
adds an extra dereference to all hot paths (under such circumstances
rcu_dereference() will simplify to READ_ONCE()). Extensive measurements
have been performed on different architectures to ascertain the
performance impact is minimal. Most platforms don't see any difference
and the worst-case scenario shows a 1-3% degradation on a page
allocation micro-benchmark. See cover letter for in-depth results.

Accesses to the pcplists like the ones in mm/vmstat.c don't require RCU
supervision since they can handle outdated data, but they do use
rcu_access_pointer() to avoid compiler weirdness and keep sparse happy.

Note that special care has been taken to verify there are no races with
the memory hotplug code paths. Notably with calls to zone_pcp_reset().
As Mel Gorman explains in a previous patch[1]: "The existing hotplug
paths guarantees the pcplists cannot be used after zone_pcp_enable()
[the one in offline_pages()]. That should be the case already because
all the pages have been freed and there is no page to put on the PCP
lists."

All in all, this technique allows for remote draining on all setups with
an acceptable performance impact. It benefits all sorts of use cases:
low-latency, real-time, HPC, idle systems, KVM guests.

[1] 8ca559132a2d ("mm/memory_hotplug: remove broken locking of zone PCP
structures during hot remove")

Signed-off-by: Nicolas Saenz Julienne <[email protected]>
---

Changes since RFC:
- Avoid unnecessary spin_lock_irqsave/restore() in free_pcppages_bulk()
- Add more detail to commit and code comments.
- Use synchronize_rcu() instead of synchronize_rcu_expedited(); the RCU
documentation says to avoid the latter unless really justified. I don't
think it's justified here: if we can schedule and join works, waiting
for an RCU grace period is OK.
- Avoid sparse warnings by using rcu_access_pointer() and
rcu_dereference_protected().

include/linux/mmzone.h | 22 +++++-
mm/page_alloc.c | 155 ++++++++++++++++++++++++++---------------
mm/vmstat.c | 6 +-
3 files changed, 120 insertions(+), 63 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b4cb85d9c6e8..b0b593fd8e48 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -388,13 +388,31 @@ struct per_cpu_pages {
short expire; /* When 0, remote pagesets are drained */
#endif

- struct pcplists *lp;
+ /*
+ * As a rule of thumb, any access to struct per_cpu_pages's 'lp' has to
+ * happen with the pagesets local_lock held and using
+ * rcu_dereference_check(). If there is a need to modify both
+ * 'lp->count' and 'lp->lists' in the same critical section 'pcp->lp'
+ * can only be dereferenced once. See for example:
+ *
+ * local_lock_irqsave(&pagesets.lock, flags);
+ * lp = rcu_dereference_check(pcp->lp, ...);
+ * list_add(&page->lru, &lp->lists[pindex]);
+ * lp->count += 1 << order;
+ * local_unlock_irqrestore(&pagesets.lock, flags);
+ *
+ * vmstat code only needs to check the page count and can deal with
+ * outdated data. In that case rcu_access_pointer() is good enough and
+ * the locking is not needed.
+ */
+ struct pcplists __rcu *lp;
+ struct pcplists *drain;
struct pcplists {
/* Number of pages in the pcplists */
int count;
/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
- } __pcplists; /* Do not access directly */
+ } __pcplists[2]; /* Do not access directly */
};

struct per_cpu_zonestat {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4f37815b0e4c..4680dd458184 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -150,13 +150,7 @@ DEFINE_PER_CPU(int, _numa_mem_); /* Kernel "local memory" node */
EXPORT_PER_CPU_SYMBOL(_numa_mem_);
#endif

-/* work_structs for global per-cpu drains */
-struct pcpu_drain {
- struct zone *zone;
- struct work_struct work;
-};
static DEFINE_MUTEX(pcpu_drain_mutex);
-static DEFINE_PER_CPU(struct pcpu_drain, pcpu_drain);

#ifdef CONFIG_GCC_PLUGIN_LATENT_ENTROPY
volatile unsigned long latent_entropy __latent_entropy;
@@ -3135,7 +3129,7 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)

local_lock_irqsave(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
- lp = pcp->lp;
+ lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
to_drain = min(lp->count, batch);
if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, batch, lp);
@@ -3159,7 +3153,7 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
local_lock_irqsave(&pagesets.lock, flags);

pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- lp = pcp->lp;
+ lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
if (lp->count)
free_pcppages_bulk(zone, lp->count, READ_ONCE(pcp->batch), lp);

@@ -3198,36 +3192,48 @@ void drain_local_pages(struct zone *zone)
drain_pages(cpu);
}

-static void drain_local_pages_wq(struct work_struct *work)
-{
- struct pcpu_drain *drain;
-
- drain = container_of(work, struct pcpu_drain, work);
-
- /*
- * drain_all_pages doesn't use proper cpu hotplug protection so
- * we can race with cpu offline when the WQ can move this from
- * a cpu pinned worker to an unbound one. We can operate on a different
- * cpu which is alright but we also have to make sure to not move to
- * a different one.
- */
- migrate_disable();
- drain_local_pages(drain->zone);
- migrate_enable();
-}
-
/*
* The implementation of drain_all_pages(), exposing an extra parameter to
* drain on all cpus.
*
- * drain_all_pages() is optimized to only execute on cpus where pcplists are
- * not empty. The check for non-emptiness can however race with a free to
- * pcplist that has not yet increased the lp->count from 0 to 1. Callers
- * that need the guarantee that every CPU has drained can disable the
- * optimizing racy check.
+ * drain_all_pages() is optimized to only affect cpus where pcplists are not
+ * empty. The check for non-emptiness can however race with a free to pcplist
+ * that has not yet increased the lp->count from 0 to 1. Callers that need the
+ * guarantee that every CPU has drained can disable the optimizing racy check.
+ *
+ * The drain mechanism does the following:
+ *
+ * - Each CPU has two pcplists pointers: one that points to a pcplists
+ * instance that is in-use, 'pcp->lp', another that points to an idle
+ * and empty instance, 'pcp->drain'. CPUs atomically dereference their local
+ * pcplists through 'pcp->lp' while holding the pagesets local_lock.
+ *
+ * - When a CPU decides it needs to empty some remote pcplists, it'll
+ * atomically exchange the remote CPU's 'pcp->lp' and 'pcp->drain' pointers.
+ * A remote CPU racing with this will either have:
+ *
+ * - An old 'pcp->lp' reference, it'll soon be emptied by the drain
+ * process, we just have to wait for it to finish using it.
+ *
+ * - The new 'pcp->lp' reference, that is, an empty pcplists instance.
+ * rcu_replace_pointer()'s release semantics ensures any prior changes
+ * will be visible to the remote CPU, for example changes to 'pcp->high'
+ * and 'pcp->batch' when disabling the pcplists.
+ *
+ * - The CPU that started the drain can now wait for an RCU grace period to
+ * make sure the remote CPU is done using the old pcplists.
+ * synchronize_rcu() counts as a full memory barrier, so any changes the
+ * local CPU makes to the soon to be drained pcplists will be visible to the
+ * draining CPU once it returns.
+ *
+ * - Then the CPU can safely free the old pcplists. Nobody else holds a
+ * reference to it. Note that concurrent write access to remote pcplists
+ * pointers is protected by the 'pcpu_drain_mutex'.
*/
static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
{
+ struct per_cpu_pages *pcp;
+ struct zone *z;
int cpu;

/*
@@ -3236,13 +3242,6 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*/
static cpumask_t cpus_with_pcps;

- /*
- * Make sure nobody triggers this path before mm_percpu_wq is fully
- * initialized.
- */
- if (WARN_ON_ONCE(!mm_percpu_wq))
- return;
-
/*
* Do not drain if one is already in progress unless it's specific to
* a zone. Such callers are primarily CMA and memory hotplug and need
@@ -3261,6 +3260,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
* disables preemption as part of its processing
*/
for_each_online_cpu(cpu) {
+ struct per_cpu_pages *pcp;
struct zone *z;
bool has_pcps = false;
struct pcplists *lp;
@@ -3272,12 +3272,16 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*/
has_pcps = true;
} else if (zone) {
- lp = per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp;
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ lp = rcu_dereference_protected(pcp->lp,
+ mutex_is_locked(&pcpu_drain_mutex));
if (lp->count)
has_pcps = true;
} else {
for_each_populated_zone(z) {
- lp = per_cpu_ptr(z->per_cpu_pageset, cpu)->lp;
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ lp = rcu_dereference_protected(pcp->lp,
+ mutex_is_locked(&pcpu_drain_mutex));
if (lp->count) {
has_pcps = true;
break;
@@ -3291,16 +3295,40 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
cpumask_clear_cpu(cpu, &cpus_with_pcps);
}

+ if (cpumask_empty(&cpus_with_pcps))
+ goto exit;
+
for_each_cpu(cpu, &cpus_with_pcps) {
- struct pcpu_drain *drain = per_cpu_ptr(&pcpu_drain, cpu);
+ for_each_populated_zone(z) {
+ if (zone && zone != z)
+ continue;
+
+ pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+ WARN_ON(pcp->drain->count);
+ pcp->drain = rcu_replace_pointer(pcp->lp, pcp->drain,
+ mutex_is_locked(&pcpu_drain_mutex));
+ }
+ }
+
+ synchronize_rcu();

- drain->zone = zone;
- INIT_WORK(&drain->work, drain_local_pages_wq);
- queue_work_on(cpu, mm_percpu_wq, &drain->work);
+ for_each_cpu(cpu, &cpus_with_pcps) {
+ for_each_populated_zone(z) {
+ unsigned long flags;
+ int count;
+
+ pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
+ count = pcp->drain->count;
+ if (!count)
+ continue;
+
+ local_irq_save(flags);
+ free_pcppages_bulk(z, count, READ_ONCE(pcp->batch), pcp->drain);
+ local_irq_restore(flags);
+ }
}
- for_each_cpu(cpu, &cpus_with_pcps)
- flush_work(&per_cpu_ptr(&pcpu_drain, cpu)->work);

+exit:
mutex_unlock(&pcpu_drain_mutex);
}

@@ -3309,7 +3337,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
*
* When zone parameter is non-NULL, spill just the single zone's pages.
*
- * Note that this can be extremely slow as the draining happens in a workqueue.
+ * Note that this can be extremely slow.
*/
void drain_all_pages(struct zone *zone)
{
@@ -3436,7 +3464,7 @@ static void free_unref_page_commit(struct page *page, int migratetype,

__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
- lp = pcp->lp;
+ lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
pindex = order_to_pindex(migratetype, order);
list_add(&page->lru, &lp->lists[pindex]);
lp->count += 1 << order;
@@ -3723,7 +3751,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
*/
pcp = this_cpu_ptr(zone->per_cpu_pageset);
pcp->free_factor >>= 1;
- lp = pcp->lp;
+ lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
list = &lp->lists[order_to_pindex(migratetype, order)];
count = &lp->count;
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags,
@@ -5346,7 +5374,7 @@ unsigned long __alloc_pages_bulk(gfp_t gfp, int preferred_nid,
/* Attempt the batch allocation */
local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
- lp = pcp->lp;
+ lp = rcu_dereference_check(pcp->lp, lockdep_is_held(this_cpu_ptr(&pagesets.lock)));
pcp_list = &lp->lists[order_to_pindex(ac.migratetype, 0)];
count = &lp->count;

@@ -5961,8 +5989,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
if (show_mem_node_skip(filter, zone_to_nid(zone), nodemask))
continue;

- for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
+ for_each_online_cpu(cpu) {
+ struct per_cpu_pages *pcp;
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ free_pcp += rcu_access_pointer(pcp->lp)->count;
+ }
}

printk("active_anon:%lu inactive_anon:%lu isolated_anon:%lu\n"
@@ -6055,8 +6087,12 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
continue;

free_pcp = 0;
- for_each_online_cpu(cpu)
- free_pcp += per_cpu_ptr(zone->per_cpu_pageset, cpu)->lp->count;
+ for_each_online_cpu(cpu) {
+ struct per_cpu_pages *pcp;
+
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ free_pcp += rcu_access_pointer(pcp->lp)->count;
+ }

show_node(zone);
printk(KERN_CONT
@@ -6099,7 +6135,7 @@ void show_free_areas(unsigned int filter, nodemask_t *nodemask)
K(zone_page_state(zone, NR_MLOCK)),
K(zone_page_state(zone, NR_BOUNCE)),
K(free_pcp),
- K(raw_cpu_ptr(zone->per_cpu_pageset)->lp->count),
+ K(rcu_access_pointer(raw_cpu_ptr(zone->per_cpu_pageset)->lp)->count),
K(zone_page_state(zone, NR_FREE_CMA_PAGES)));
printk("lowmem_reserve[]:");
for (i = 0; i < MAX_NR_ZONES; i++)
@@ -7014,10 +7050,13 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
memset(pcp, 0, sizeof(*pcp));
memset(pzstats, 0, sizeof(*pzstats));

- pcp->lp = &pcp->__pcplists;
+ pcp->lp = RCU_INITIALIZER(&pcp->__pcplists[0]);
+ pcp->drain = &pcp->__pcplists[1];

- for (pindex = 0; pindex < NR_PCP_LISTS; pindex++)
- INIT_LIST_HEAD(&pcp->lp->lists[pindex]);
+ for (pindex = 0; pindex < NR_PCP_LISTS; pindex++) {
+ INIT_LIST_HEAD(&rcu_access_pointer(pcp->lp)->lists[pindex]);
+ INIT_LIST_HEAD(&pcp->drain->lists[pindex]);
+ }

/*
* Set batch and high values safe for a boot pageset. A true percpu
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 576b2b932ccd..9c33ff4a580a 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -856,7 +856,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
* if not then there is nothing to expire.
*/
if (!__this_cpu_read(pcp->expire) ||
- !this_cpu_ptr(pcp)->lp->count)
+ !rcu_access_pointer(this_cpu_ptr(pcp)->lp)->count)
continue;

/*
@@ -870,7 +870,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
if (__this_cpu_dec_return(pcp->expire))
continue;

- if (this_cpu_ptr(pcp)->lp->count) {
+ if (rcu_access_pointer(this_cpu_ptr(pcp)->lp)->count) {
drain_zone_pages(zone, this_cpu_ptr(pcp));
changes++;
}
@@ -1728,7 +1728,7 @@ static void zoneinfo_show_print(struct seq_file *m, pg_data_t *pgdat,
"\n high: %i"
"\n batch: %i",
i,
- pcp->lp->count,
+ rcu_access_pointer(pcp->lp)->count,
pcp->high,
pcp->batch);
#ifdef CONFIG_SMP
--
2.34.1


2022-02-09 15:13:10

by Xiongfeng Wang

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi,

On 2022/2/9 17:45, Nicolas Saenz Julienne wrote:
> Hi Xiongfeng, thanks for taking the time to look at this.
>
> On Wed, 2022-02-09 at 16:55 +0800, Xiongfeng Wang wrote:
>> Hi Nicolas,
>>
>> When I applied the patchset on the following commit and tested on QEMU, I came
>> accross the following CallTrace.
>> commit dd81e1c7d5fb126e5fbc5c9e334d7b3ec29a16a0
>>
>> I wrote a userspace application to consume the memory. When the memory is used
>> out, the OOM killer is triggered and the following Calltrace is printed. I am
>> not sure if it is related to this patchset. But when I reverted this patchset,
>> the 'NULL pointer' Calltrace didn't show.
>
> It's a silly mistake on my part, while cleaning up the code I messed up one of
> the 'struct per_cpu_pages' accessors. This should fix it:
>
> ------------------------->8-------------------------
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0caa7155ca34..e65b991c3dc8 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3279,7 +3279,7 @@ static void __drain_all_pages(struct zone *zone, bool force_all_cpus)
> has_pcps = true;
> } else {
> for_each_populated_zone(z) {
> - pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
> + pcp = per_cpu_ptr(z->per_cpu_pageset, cpu);
> lp = rcu_dereference_protected(pcp->lp,
> mutex_is_locked(&pcpu_drain_mutex));
> if (lp->count) {

I have tested it. It works well. No more 'NULL pointer' call trace.

Thanks,
Xiongfeng

> ------------------------->8-------------------------
>
> Thanks!
>

2022-02-10 14:22:36

by Xiongfeng Wang

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Nicolas,

I found there are several 'lru_add_drain_per_cpu()' worker threads on the
NOHZ_FULL CPUs. I am just wondering, do you have plans to add support for remote
per-cpu LRU pagevec draining?

Thanks,
Xiongfeng

On 2022/2/8 18:07, Nicolas Saenz Julienne wrote:
> This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> one that allows accessing the lists remotely. Currently, only a local CPU is
> permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> whenever a process demands it by means of queueing a drain task on the local
> CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> take any sort of interruption and to some lesser extent inconveniences idle and
> virtualised systems.
>
> The new algorithm will atomically switch the pointer to the per-cpu page lists
> and use RCU to make sure it's not being concurrently used before draining the
> lists. And its main benefit of is that it fixes the issue for good, avoiding
> the need for configuration based heuristics or having to modify applications
> (i.e. using the isolation prctrl being worked by Marcello Tosatti ATM).
>
> All this with minimal performance implications: a page allocation
> microbenchmark was run on multiple systems and architectures generally showing
> no performance differences, only the more extreme cases showed a 1-3%
> degradation. See data below. Needless to say that I'd appreciate if someone
> could validate my values independently.
>
> The approach has been stress-tested: I forced 100 drains/s while running
> mmtests' pft in a loop for a full day on multiple machines and archs (arm64,
> x86_64, ppc64le).
>
> Note that this is not the first attempt at fixing this per-cpu page lists:
> - The first attempt[1] tried to conditionally change the pagesets locking
> scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> for NOHZ_FULL setups, which isn't ideal.
> - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> performance degradation was too big.
>
> Previous RFC: https://lkml.org/lkml/2021/10/8/793
>
> Thanks!
>
> [1] https://lkml.org/lkml/2021/9/21/599
> [2] https://lkml.org/lkml/2021/11/3/644
>
> ---
>
> Changes since RFC:
> - Get rid of aesthetic changes that affected performance
> - Add more documentation
> - Add better commit messages
> - Pass sparse tests
> - Verify this_cpu_*() usage
> - Performance measurements
>
> Nicolas Saenz Julienne (2):
> mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
> mm/page_alloc: Add remote draining support to per-cpu lists
>
> include/linux/mmzone.h | 28 +++++-
> mm/page_alloc.c | 212 ++++++++++++++++++++++++++---------------
> mm/vmstat.c | 6 +-
> 3 files changed, 162 insertions(+), 84 deletions(-)
>
>
> -------------------------Performance results-----------------------------
>
> I'm focusing on mmtests' Page Fault Test (pft), as it's page allocator
> intensive.
>
> - AMD Daytona Reference System, 2 sockets, AMD EPYC 7742, Zen 2, 64-Core,
> 4 NUMA nodes, x86_64
>
> pft timings:
> vanilla rcu
> Amean system-1 58.52 ( 0.00%) 58.92 * -0.68%*
> Amean system-4 61.00 ( 0.00%) 61.41 * -0.67%*
> Amean system-7 61.55 ( 0.00%) 61.74 * -0.30%*
> Amean system-12 64.91 ( 0.00%) 64.94 * -0.05%*
> Amean system-21 98.80 ( 0.00%) 99.92 * -1.13%*
> Amean system-30 147.68 ( 0.00%) 145.83 * 1.25%*
> Amean system-48 237.04 ( 0.00%) 241.29 * -1.79%*
> Amean system-79 286.61 ( 0.00%) 283.72 * 1.01%*
> Amean system-110 303.40 ( 0.00%) 299.91 * 1.15%*
> Amean system-128 345.07 ( 0.00%) 342.10 * 0.86%*
> Amean elapsed-1 61.21 ( 0.00%) 61.65 * -0.71%*
> Amean elapsed-4 15.94 ( 0.00%) 16.05 * -0.69%*
> Amean elapsed-7 9.24 ( 0.00%) 9.28 * -0.47%*
> Amean elapsed-12 5.70 ( 0.00%) 5.70 * -0.07%*
> Amean elapsed-21 5.11 ( 0.00%) 5.06 * 1.13%*
> Amean elapsed-30 5.28 ( 0.00%) 5.14 * 2.73%*
> Amean elapsed-48 5.28 ( 0.00%) 5.24 * 0.74%*
> Amean elapsed-79 4.41 ( 0.00%) 4.31 * 2.17%*
> Amean elapsed-110 3.45 ( 0.00%) 3.44 * 0.40%*
> Amean elapsed-128 2.75 ( 0.00%) 2.75 * -0.28%*
>
> - AMD Speedway Reference System, 2 sockets, AMD EPYC 7601, Zen 1, 64-core, 8
> NUMA nodes, x86_64. Lots of variance between tests on this platform. It'll
> easily swing -+5% on each result.
>
> pft timings:
> vanilla rcu
> Amean system-1 69.20 ( 0.00%) 66.21 * 4.32%*
> Amean system-4 70.79 ( 0.00%) 69.01 * 2.52%*
> Amean system-7 71.34 ( 0.00%) 69.16 * 3.05%*
> Amean system-12 74.00 ( 0.00%) 72.74 * 1.70%*
> Amean system-21 86.01 ( 0.00%) 85.70 * 0.36%*
> Amean system-30 89.21 ( 0.00%) 89.93 * -0.80%*
> Amean system-48 92.39 ( 0.00%) 92.43 * -0.04%*
> Amean system-79 120.19 ( 0.00%) 121.30 * -0.92%*
> Amean system-110 172.79 ( 0.00%) 179.37 * -3.81%*
> Amean system-128 201.70 ( 0.00%) 212.57 * -5.39%*
> Amean elapsed-1 72.23 ( 0.00%) 69.29 * 4.08%*
> Amean elapsed-4 18.69 ( 0.00%) 18.28 * 2.20%*
> Amean elapsed-7 10.80 ( 0.00%) 10.54 * 2.41%*
> Amean elapsed-12 6.62 ( 0.00%) 6.53 * 1.30%*
> Amean elapsed-21 4.68 ( 0.00%) 4.69 * -0.14%*
> Amean elapsed-30 3.44 ( 0.00%) 3.50 * -1.66%*
> Amean elapsed-48 2.40 ( 0.00%) 2.42 * -1.00%*
> Amean elapsed-79 2.05 ( 0.00%) 2.09 * -1.90%*
> Amean elapsed-110 1.83 ( 0.00%) 1.91 * -4.60%*
> Amean elapsed-128 1.75 ( 0.00%) 1.85 * -5.99%*
>
> - IBM 9006-22C system, 2 sockets, POWER9, 64-Core, 1 NUMA node per cpu,
> pppc64le.
>
> pft timings:
> vanilla rcu
> Amean system-1 1.82 ( 0.00%) 1.85 * -1.43%*
> Amean system-4 2.18 ( 0.00%) 2.22 * -2.02%*
> Amean system-7 3.27 ( 0.00%) 3.28 * -0.15%*
> Amean system-12 5.22 ( 0.00%) 5.20 * 0.26%*
> Amean system-21 10.10 ( 0.00%) 10.20 * -1.00%*
> Amean system-30 15.00 ( 0.00%) 14.52 * 3.20%*
> Amean system-48 26.41 ( 0.00%) 25.96 * 1.71%*
> Amean system-79 29.35 ( 0.00%) 29.70 * -1.21%*
> Amean system-110 24.01 ( 0.00%) 23.40 * 2.54%*
> Amean system-128 24.57 ( 0.00%) 25.32 * -3.06%*
> Amean elapsed-1 1.85 ( 0.00%) 1.87 * -1.28%*
> Amean elapsed-4 0.56 ( 0.00%) 0.57 * -1.72%*
> Amean elapsed-7 0.51 ( 0.00%) 0.50 * 0.07%*
> Amean elapsed-12 0.51 ( 0.00%) 0.51 * 0.06%*
> Amean elapsed-21 0.54 ( 0.00%) 0.54 * 0.06%*
> Amean elapsed-30 0.54 ( 0.00%) 0.53 * 2.22%*
> Amean elapsed-48 0.58 ( 0.00%) 0.57 * 1.73%*
> Amean elapsed-79 0.49 ( 0.00%) 0.48 * 0.89%*
> Amean elapsed-110 0.37 ( 0.00%) 0.37 * -1.08%*
> Amean elapsed-128 0.33 ( 0.00%) 0.33 * 0.00%*
>
> - Ampere MtSnow, 1 socket, Neoverse-N1, 80-Cores, 1 NUMA node, arm64.
>
> pft timings:
> vanilla rcu
> Amean system-1 11.92 ( 0.00%) 11.99 * -0.61%*
> Amean system-4 13.13 ( 0.00%) 13.09 * 0.31%*
> Amean system-7 13.91 ( 0.00%) 13.94 * -0.20%*
> Amean system-12 15.77 ( 0.00%) 15.69 * 0.48%*
> Amean system-21 21.32 ( 0.00%) 21.42 * -0.46%*
> Amean system-30 28.58 ( 0.00%) 29.12 * -1.90%*
> Amean system-48 47.41 ( 0.00%) 46.91 * 1.04%*
> Amean system-79 76.76 ( 0.00%) 77.16 * -0.52%*
> Amean system-80 77.98 ( 0.00%) 78.23 * -0.32%*
> Amean elapsed-1 12.46 ( 0.00%) 12.53 * -0.58%*
> Amean elapsed-4 3.47 ( 0.00%) 3.46 * 0.34%*
> Amean elapsed-7 2.18 ( 0.00%) 2.21 * -1.58%*
> Amean elapsed-12 1.41 ( 0.00%) 1.42 * -0.80%*
> Amean elapsed-21 1.09 ( 0.00%) 1.12 * -2.60%*
> Amean elapsed-30 0.98 ( 0.00%) 1.01 * -3.08%*
> Amean elapsed-48 1.08 ( 0.00%) 1.10 * -1.78%*
> Amean elapsed-79 1.32 ( 0.00%) 1.28 * 2.71%*
> Amean elapsed-80 1.32 ( 0.00%) 1.28 * 3.23%*
>
> - Dell R430, 2 sockets, Intel Xeon E5-2640 v3, Sandy Bridge, 16-Cores, 2 NUMA
> nodes, x86_64.
>
> pft timings:
> vanilla rcu
> Amean system-1 11.10 ( 0.00%) 11.07 * 0.24%*
> Amean system-3 11.14 ( 0.00%) 11.10 * 0.34%*
> Amean system-5 11.18 ( 0.00%) 11.13 * 0.47%*
> Amean system-7 11.21 ( 0.00%) 11.17 * 0.38%*
> Amean system-12 11.28 ( 0.00%) 11.28 ( -0.03%)
> Amean system-18 13.24 ( 0.00%) 13.25 * -0.11%*
> Amean system-24 17.12 ( 0.00%) 17.14 ( -0.13%)
> Amean system-30 21.10 ( 0.00%) 21.23 * -0.60%*
> Amean system-32 22.31 ( 0.00%) 22.47 * -0.71%*
> Amean elapsed-1 11.76 ( 0.00%) 11.73 * 0.29%*
> Amean elapsed-3 3.93 ( 0.00%) 3.93 * 0.17%*
> Amean elapsed-5 2.39 ( 0.00%) 2.37 * 0.74%*
> Amean elapsed-7 1.72 ( 0.00%) 1.71 * 0.81%*
> Amean elapsed-12 1.02 ( 0.00%) 1.03 ( -0.42%)
> Amean elapsed-18 1.13 ( 0.00%) 1.14 ( -0.18%)
> Amean elapsed-24 0.87 ( 0.00%) 0.88 * -0.65%*
> Amean elapsed-30 0.77 ( 0.00%) 0.78 * -0.86%*
> Amean elapsed-32 0.74 ( 0.00%) 0.74 ( 0.00%)
>
> - HPE Apollo 70, 2 sockets, Cavium ThunderX2, 128-Cores, 2 NUMA nodes, arm64.
> NOTE: The test here only goes up to 128 for some reason, although there are
> 256 CPUs. Maybe a mmtests issue? I didn't investigate.
>
> pft timings:
> vanilla rcu
> Amean system-1 4.42 ( 0.00%) 4.36 * 1.29%*
> Amean system-4 4.56 ( 0.00%) 4.51 * 1.05%*
> Amean system-7 4.63 ( 0.00%) 4.65 * -0.42%*
> Amean system-12 5.96 ( 0.00%) 6.02 * -1.00%*
> Amean system-21 10.97 ( 0.00%) 11.01 * -0.32%*
> Amean system-30 16.01 ( 0.00%) 16.04 * -0.19%*
> Amean system-48 26.81 ( 0.00%) 26.78 * 0.09%*
> Amean system-79 30.80 ( 0.00%) 30.85 * -0.16%*
> Amean system-110 31.87 ( 0.00%) 31.93 * -0.19%*
> Amean system-128 36.27 ( 0.00%) 36.31 * -0.10%*
> Amean elapsed-1 4.88 ( 0.00%) 4.85 * 0.60%*
> Amean elapsed-4 1.27 ( 0.00%) 1.26 * 1.00%*
> Amean elapsed-7 0.73 ( 0.00%) 0.74 * -0.46%*
> Amean elapsed-12 0.55 ( 0.00%) 0.55 * 1.09%*
> Amean elapsed-21 0.59 ( 0.00%) 0.60 * -0.96%*
> Amean elapsed-30 0.60 ( 0.00%) 0.60 * 0.28%*
> Amean elapsed-48 0.60 ( 0.00%) 0.60 * 0.44%*
> Amean elapsed-79 0.49 ( 0.00%) 0.49 * -0.07%*
> Amean elapsed-110 0.36 ( 0.00%) 0.36 * 0.28%*
> Amean elapsed-128 0.31 ( 0.00%) 0.31 * -0.43%*
>
> - Raspberry Pi 4, 1 socket, bcm2711, Cortex-A72, 4-Cores, 1 NUMA node, arm64.
>
> pft timings:
> vanilla rcu
> Amean system-1 0.67 ( 0.00%) 0.67 * -1.25%*
> Amean system-3 1.30 ( 0.00%) 1.29 * 0.62%*
> Amean system-4 1.61 ( 0.00%) 1.59 * 0.95%*
> Amean elapsed-1 0.71 ( 0.00%) 0.72 * -1.17%*
> Amean elapsed-3 0.45 ( 0.00%) 0.45 * 0.88%*
> Amean elapsed-4 0.42 ( 0.00%) 0.42 * 1.19%*
>
>

2022-02-10 14:23:00

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Thu, 2022-02-10 at 18:59 +0800, Xiongfeng Wang wrote:
> Hi Nicolas,
>
> I found there are several 'lru_add_drain_per_cpu()' worker threads on the
> NOHZ_FULL CPUs. I am just wondering, do you have plans to add support for remote
> per-cpu LRU pagevec draining?

Yes, this is something we've been meaning to fix too. But haven't started
looking into it yet (maybe Marcelo can contradict me here).

Regards,
Nicolas


2022-03-03 12:19:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Tue, Feb 08, 2022 at 11:07:48AM +0100, Nicolas Saenz Julienne wrote:
> This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> one that allows accessing the lists remotely. Currently, only a local CPU is
> permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> whenever a process demands it by means of queueing a drain task on the local
> CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> take any sort of interruption and to some lesser extent inconveniences idle and
> virtualised systems.
>

I know this has been sitting here for a long while. Last few weeks have
not been fun.

> Note that this is not the first attempt at fixing this per-cpu page lists:
> - The first attempt[1] tried to conditionally change the pagesets locking
> scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> for NOHZ_FULL setups, which isn't ideal.
> - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> performance degradation was too big.
>

For unrelated reasons I looked at using llist to avoid locks entirely. It
turns out it's not possible and needs a lock. We know "local_locks to
per-cpu spinlocks" took a large penalty so I considered alternatives on
how a lock could be used. I found it's possible to both remote drain
the lists and avoid the disable/enable of IRQs entirely as long as a
preempting IRQ is willing to take the zone lock instead (should be very
rare). The IRQ part is a bit hairy though as softirqs are also a problem
and preempt-rt needs different rules and the llist has to sort PCP
refills which might be a loss in total. However, the remote draining may
still be interesting. The full series is at
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2

It's still waiting on tests to complete and not all the changelogs are
complete which is why it's not posted.

This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
for the page faulting microbench I originally complained about. The test
machine is a 2-socket CascadeLake machine.

pft timings
5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)

Note the local_lock conversion took a 1-17% penalty while the git tree
takes a negligible penalty while still allowing remote drains. It might
have some potential while being less complex than the RCU approach.
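
For what it's worth, the remote drain itself stays simple because the lists
are protected by a spinlock embedded in the pcp structure rather than a
CPU-local lock. A simplified, illustrative sketch (not verbatim from the
branch):

	static void drain_pages_zone(unsigned int cpu, struct zone *zone)
	{
		struct per_cpu_pages *pcp;

		/* Any CPU may drain: no IPI or workqueue needed. */
		pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		if (pcp->count) {
			spin_lock(&pcp->lock);
			free_pcppages_bulk(zone, pcp->count, pcp, 0);
			spin_unlock(&pcp->lock);
		}
	}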

--
Mel Gorman
SUSE Labs

2022-03-03 14:52:50

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Thu, 2022-03-03 at 14:27 +0100, Vlastimil Babka wrote:
> On 2/8/22 11:07, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> > one that allows accessing the lists remotely. Currently, only a local CPU is
> > permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> > whenever a process demands it by means of queueing a drain task on the local
> > CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> > take any sort of interruption and to some lesser extent inconveniences idle and
> > virtualised systems.
> >
> > The new algorithm will atomically switch the pointer to the per-cpu page lists
> > and use RCU to make sure it's not being concurrently used before draining the
> > lists. And its main benefit of is that it fixes the issue for good, avoiding
> > the need for configuration based heuristics or having to modify applications
> > (i.e. using the isolation prctrl being worked by Marcello Tosatti ATM).
> >
> > All this with minimal performance implications: a page allocation
> > microbenchmark was run on multiple systems and architectures generally showing
> > no performance differences, only the more extreme cases showed a 1-3%
> > degradation. See data below. Needless to say that I'd appreciate if someone
> > could validate my values independently.
> >
> > The approach has been stress-tested: I forced 100 drains/s while running
> > mmtests' pft in a loop for a full day on multiple machines and archs (arm64,
> > x86_64, ppc64le).
> >
> > Note that this is not the first attempt at fixing this per-cpu page lists:
> > - The first attempt[1] tried to conditionally change the pagesets locking
> > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > for NOHZ_FULL setups, which isn't ideal.
> > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > performance degradation was too big.
>
> For completeness, what was the fate of the approach to have pcp->high = 0
> for NOHZ cpus? [1] It would be nice to have documented why it wasn't
> feasible. Too much overhead for when these CPUs eventually do allocate, or
> some other unforeseen issue? Thanks.

Yes sorry, should've been more explicit on why I haven't gone that way yet.

Some points:
- As I mention above, it's not only CPU isolation users who care about this;
  RT and HPC do too. This is my main motivation for focusing on this solution,
  or potentially Mel's.

- Fully disabling pcplists on nohz_full CPUs is too drastic, as isolated CPUs
  might want to retain the performance edge while not running their sensitive
  workloads. (I remember Christoph Lameter commenting about this on the
  previous RFC).

- So the idea would be to selectively disable pcplists upon entering the
  really 'isolated' area. This could be achieved with Marcelo Tosatti's new
  WIP prctl[1]. And if we decide the current solutions are unacceptable I'll
  have a go at it.

Thanks!

[1] https://lore.kernel.org/lkml/[email protected]/T/

--
Nicolás Sáenz

2022-03-03 15:40:27

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 1/2] mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly

On Tue, Feb 08, 2022 at 11:07:49AM +0100, Nicolas Saenz Julienne wrote:
> In preparation to adding remote per-cpu page list drain support, let's
> bundle 'struct per_cpu_pages's' page lists and page count into a new
> structure: 'struct pcplists', and have all code access it indirectly
> through a pointer. It'll be used by upcoming patches in order to
> maintain multiple instances of 'pcplists' and switch the pointer
> atomically.
>
> The 'struct pcplists' instance lives inside 'struct per_cpu_pages', and
> shouldn't be accessed directly. It is setup as such since these
> structures are used during early boot when no memory allocation is
> possible and to simplify memory hotplug code paths.
>
> free_pcppages_bulk() and __rmqueue_pcplist()'s function signatures
> change a bit so as to accommodate these changes without affecting
> performance.
>
> No functional change intended.
>
> Signed-off-by: Nicolas Saenz Julienne <[email protected]>
> ---
>
> Changes since RFC:
> - Add more info in commit message.
> - Removed __private attribute, in hindsight doesn't really fit what
> we're doing here.
> - Use raw_cpu_ptr() where relevant to avoid warnings.
>
> include/linux/mmzone.h | 10 +++--
> mm/page_alloc.c | 87 +++++++++++++++++++++++++-----------------
> mm/vmstat.c | 6 +--
> 3 files changed, 62 insertions(+), 41 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 3fff6deca2c0..b4cb85d9c6e8 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -381,7 +381,6 @@ enum zone_watermarks {
>
> /* Fields and list protected by pagesets local_lock in page_alloc.c */
> struct per_cpu_pages {
> - int count; /* number of pages in the list */
> int high; /* high watermark, emptying needed */
> int batch; /* chunk size for buddy add/remove */
> short free_factor; /* batch scaling factor during free */
> @@ -389,8 +388,13 @@ struct per_cpu_pages {
> short expire; /* When 0, remote pagesets are drained */
> #endif
>
> - /* Lists of pages, one per migrate type stored on the pcp-lists */
> - struct list_head lists[NR_PCP_LISTS];
> + struct pcplists *lp;
> + struct pcplists {
> + /* Number of pages in the pcplists */
> + int count;
> + /* Lists of pages, one per migrate type stored on the pcp-lists */
> + struct list_head lists[NR_PCP_LISTS];
> + } __pcplists; /* Do not access directly */
> };

Perhaps it's useful to add something like: "should be accessed through ..."

Looks good otherwise.

2022-03-03 16:29:24

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On 2/8/22 11:07, Nicolas Saenz Julienne wrote:
> This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> one that allows accessing the lists remotely. Currently, only a local CPU is
> permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> whenever a process demands it by means of queueing a drain task on the local
> CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> take any sort of interruption and to some lesser extent inconveniences idle and
> virtualised systems.
>
> The new algorithm will atomically switch the pointer to the per-cpu page lists
> and use RCU to make sure it's not being concurrently used before draining the
> lists. And its main benefit of is that it fixes the issue for good, avoiding
> the need for configuration based heuristics or having to modify applications
> (i.e. using the isolation prctrl being worked by Marcello Tosatti ATM).
>
> All this with minimal performance implications: a page allocation
> microbenchmark was run on multiple systems and architectures generally showing
> no performance differences, only the more extreme cases showed a 1-3%
> degradation. See data below. Needless to say that I'd appreciate if someone
> could validate my values independently.
>
> The approach has been stress-tested: I forced 100 drains/s while running
> mmtests' pft in a loop for a full day on multiple machines and archs (arm64,
> x86_64, ppc64le).
>
> Note that this is not the first attempt at fixing this per-cpu page lists:
> - The first attempt[1] tried to conditionally change the pagesets locking
> scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> for NOHZ_FULL setups, which isn't ideal.
> - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> performance degradation was too big.

For completeness, what was the fate of the approach to have pcp->high = 0
for NOHZ cpus? [1] It would be nice to have documented why it wasn't
feasible. Too much overhead for when these CPUs eventually do allocate, or
some other unforeseen issue? Thanks.

[1]
https://lore.kernel.org/all/[email protected]/

> Previous RFC: https://lkml.org/lkml/2021/10/8/793
>
> Thanks!
>
> [1] https://lkml.org/lkml/2021/9/21/599
> [2] https://lkml.org/lkml/2021/11/3/644
>
> ---
>
> Changes since RFC:
> - Get rid of aesthetic changes that affected performance
> - Add more documentation
> - Add better commit messages
> - Pass sparse tests
> - Verify this_cpu_*() usage
> - Performance measurements
>
> Nicolas Saenz Julienne (2):
> mm/page_alloc: Access lists in 'struct per_cpu_pages' indirectly
> mm/page_alloc: Add remote draining support to per-cpu lists
>
> include/linux/mmzone.h | 28 +++++-
> mm/page_alloc.c | 212 ++++++++++++++++++++++++++---------------
> mm/vmstat.c | 6 +-
> 3 files changed, 162 insertions(+), 84 deletions(-)
>
>
> -------------------------Performance results-----------------------------
>
> I'm focusing on mmtests' Page Fault Test (pft), as it's page allocator
> intensive.
>
> - AMD Daytona Reference System, 2 sockets, AMD EPYC 7742, Zen 2, 64-Core,
> 4 NUMA nodes, x86_64
>
> pft timings:
> vanilla rcu
> Amean system-1 58.52 ( 0.00%) 58.92 * -0.68%*
> Amean system-4 61.00 ( 0.00%) 61.41 * -0.67%*
> Amean system-7 61.55 ( 0.00%) 61.74 * -0.30%*
> Amean system-12 64.91 ( 0.00%) 64.94 * -0.05%*
> Amean system-21 98.80 ( 0.00%) 99.92 * -1.13%*
> Amean system-30 147.68 ( 0.00%) 145.83 * 1.25%*
> Amean system-48 237.04 ( 0.00%) 241.29 * -1.79%*
> Amean system-79 286.61 ( 0.00%) 283.72 * 1.01%*
> Amean system-110 303.40 ( 0.00%) 299.91 * 1.15%*
> Amean system-128 345.07 ( 0.00%) 342.10 * 0.86%*
> Amean elapsed-1 61.21 ( 0.00%) 61.65 * -0.71%*
> Amean elapsed-4 15.94 ( 0.00%) 16.05 * -0.69%*
> Amean elapsed-7 9.24 ( 0.00%) 9.28 * -0.47%*
> Amean elapsed-12 5.70 ( 0.00%) 5.70 * -0.07%*
> Amean elapsed-21 5.11 ( 0.00%) 5.06 * 1.13%*
> Amean elapsed-30 5.28 ( 0.00%) 5.14 * 2.73%*
> Amean elapsed-48 5.28 ( 0.00%) 5.24 * 0.74%*
> Amean elapsed-79 4.41 ( 0.00%) 4.31 * 2.17%*
> Amean elapsed-110 3.45 ( 0.00%) 3.44 * 0.40%*
> Amean elapsed-128 2.75 ( 0.00%) 2.75 * -0.28%*
>
> - AMD Speedway Reference System, 2 sockets, AMD EPYC 7601, Zen 1, 64-core, 8
> NUMA nodes, x86_64. Lots of variance between tests on this platform. It'll
> easily swing -+5% on each result.
>
> pft timings:
> vanilla rcu
> Amean system-1 69.20 ( 0.00%) 66.21 * 4.32%*
> Amean system-4 70.79 ( 0.00%) 69.01 * 2.52%*
> Amean system-7 71.34 ( 0.00%) 69.16 * 3.05%*
> Amean system-12 74.00 ( 0.00%) 72.74 * 1.70%*
> Amean system-21 86.01 ( 0.00%) 85.70 * 0.36%*
> Amean system-30 89.21 ( 0.00%) 89.93 * -0.80%*
> Amean system-48 92.39 ( 0.00%) 92.43 * -0.04%*
> Amean system-79 120.19 ( 0.00%) 121.30 * -0.92%*
> Amean system-110 172.79 ( 0.00%) 179.37 * -3.81%*
> Amean system-128 201.70 ( 0.00%) 212.57 * -5.39%*
> Amean elapsed-1 72.23 ( 0.00%) 69.29 * 4.08%*
> Amean elapsed-4 18.69 ( 0.00%) 18.28 * 2.20%*
> Amean elapsed-7 10.80 ( 0.00%) 10.54 * 2.41%*
> Amean elapsed-12 6.62 ( 0.00%) 6.53 * 1.30%*
> Amean elapsed-21 4.68 ( 0.00%) 4.69 * -0.14%*
> Amean elapsed-30 3.44 ( 0.00%) 3.50 * -1.66%*
> Amean elapsed-48 2.40 ( 0.00%) 2.42 * -1.00%*
> Amean elapsed-79 2.05 ( 0.00%) 2.09 * -1.90%*
> Amean elapsed-110 1.83 ( 0.00%) 1.91 * -4.60%*
> Amean elapsed-128 1.75 ( 0.00%) 1.85 * -5.99%*
>
> - IBM 9006-22C system, 2 sockets, POWER9, 64-Core, 1 NUMA node per cpu,
> pppc64le.
>
> pft timings:
> vanilla rcu
> Amean system-1 1.82 ( 0.00%) 1.85 * -1.43%*
> Amean system-4 2.18 ( 0.00%) 2.22 * -2.02%*
> Amean system-7 3.27 ( 0.00%) 3.28 * -0.15%*
> Amean system-12 5.22 ( 0.00%) 5.20 * 0.26%*
> Amean system-21 10.10 ( 0.00%) 10.20 * -1.00%*
> Amean system-30 15.00 ( 0.00%) 14.52 * 3.20%*
> Amean system-48 26.41 ( 0.00%) 25.96 * 1.71%*
> Amean system-79 29.35 ( 0.00%) 29.70 * -1.21%*
> Amean system-110 24.01 ( 0.00%) 23.40 * 2.54%*
> Amean system-128 24.57 ( 0.00%) 25.32 * -3.06%*
> Amean elapsed-1 1.85 ( 0.00%) 1.87 * -1.28%*
> Amean elapsed-4 0.56 ( 0.00%) 0.57 * -1.72%*
> Amean elapsed-7 0.51 ( 0.00%) 0.50 * 0.07%*
> Amean elapsed-12 0.51 ( 0.00%) 0.51 * 0.06%*
> Amean elapsed-21 0.54 ( 0.00%) 0.54 * 0.06%*
> Amean elapsed-30 0.54 ( 0.00%) 0.53 * 2.22%*
> Amean elapsed-48 0.58 ( 0.00%) 0.57 * 1.73%*
> Amean elapsed-79 0.49 ( 0.00%) 0.48 * 0.89%*
> Amean elapsed-110 0.37 ( 0.00%) 0.37 * -1.08%*
> Amean elapsed-128 0.33 ( 0.00%) 0.33 * 0.00%*
>
> - Ampere MtSnow, 1 socket, Neoverse-N1, 80-Cores, 1 NUMA node, arm64.
>
> pft timings:
> vanilla rcu
> Amean system-1 11.92 ( 0.00%) 11.99 * -0.61%*
> Amean system-4 13.13 ( 0.00%) 13.09 * 0.31%*
> Amean system-7 13.91 ( 0.00%) 13.94 * -0.20%*
> Amean system-12 15.77 ( 0.00%) 15.69 * 0.48%*
> Amean system-21 21.32 ( 0.00%) 21.42 * -0.46%*
> Amean system-30 28.58 ( 0.00%) 29.12 * -1.90%*
> Amean system-48 47.41 ( 0.00%) 46.91 * 1.04%*
> Amean system-79 76.76 ( 0.00%) 77.16 * -0.52%*
> Amean system-80 77.98 ( 0.00%) 78.23 * -0.32%*
> Amean elapsed-1 12.46 ( 0.00%) 12.53 * -0.58%*
> Amean elapsed-4 3.47 ( 0.00%) 3.46 * 0.34%*
> Amean elapsed-7 2.18 ( 0.00%) 2.21 * -1.58%*
> Amean elapsed-12 1.41 ( 0.00%) 1.42 * -0.80%*
> Amean elapsed-21 1.09 ( 0.00%) 1.12 * -2.60%*
> Amean elapsed-30 0.98 ( 0.00%) 1.01 * -3.08%*
> Amean elapsed-48 1.08 ( 0.00%) 1.10 * -1.78%*
> Amean elapsed-79 1.32 ( 0.00%) 1.28 * 2.71%*
> Amean elapsed-80 1.32 ( 0.00%) 1.28 * 3.23%*
>
> - Dell R430, 2 sockets, Intel Xeon E5-2640 v3, Sandy Bridge, 16-Cores, 2 NUMA
> nodes, x86_64.
>
> pft timings:
> vanilla rcu
> Amean system-1 11.10 ( 0.00%) 11.07 * 0.24%*
> Amean system-3 11.14 ( 0.00%) 11.10 * 0.34%*
> Amean system-5 11.18 ( 0.00%) 11.13 * 0.47%*
> Amean system-7 11.21 ( 0.00%) 11.17 * 0.38%*
> Amean system-12 11.28 ( 0.00%) 11.28 ( -0.03%)
> Amean system-18 13.24 ( 0.00%) 13.25 * -0.11%*
> Amean system-24 17.12 ( 0.00%) 17.14 ( -0.13%)
> Amean system-30 21.10 ( 0.00%) 21.23 * -0.60%*
> Amean system-32 22.31 ( 0.00%) 22.47 * -0.71%*
> Amean elapsed-1 11.76 ( 0.00%) 11.73 * 0.29%*
> Amean elapsed-3 3.93 ( 0.00%) 3.93 * 0.17%*
> Amean elapsed-5 2.39 ( 0.00%) 2.37 * 0.74%*
> Amean elapsed-7 1.72 ( 0.00%) 1.71 * 0.81%*
> Amean elapsed-12 1.02 ( 0.00%) 1.03 ( -0.42%)
> Amean elapsed-18 1.13 ( 0.00%) 1.14 ( -0.18%)
> Amean elapsed-24 0.87 ( 0.00%) 0.88 * -0.65%*
> Amean elapsed-30 0.77 ( 0.00%) 0.78 * -0.86%*
> Amean elapsed-32 0.74 ( 0.00%) 0.74 ( 0.00%)
>
> - HPE Apollo 70, 2 sockets, Cavium ThunderX2, 128-Cores, 2 NUMA nodes, arm64.
> NOTE: The test here only goes up to 128 for some reason, although there are
> 256 CPUs. Maybe a mmtests issue? I didn't investigate.
>
> pft timings:
> vanilla rcu
> Amean system-1 4.42 ( 0.00%) 4.36 * 1.29%*
> Amean system-4 4.56 ( 0.00%) 4.51 * 1.05%*
> Amean system-7 4.63 ( 0.00%) 4.65 * -0.42%*
> Amean system-12 5.96 ( 0.00%) 6.02 * -1.00%*
> Amean system-21 10.97 ( 0.00%) 11.01 * -0.32%*
> Amean system-30 16.01 ( 0.00%) 16.04 * -0.19%*
> Amean system-48 26.81 ( 0.00%) 26.78 * 0.09%*
> Amean system-79 30.80 ( 0.00%) 30.85 * -0.16%*
> Amean system-110 31.87 ( 0.00%) 31.93 * -0.19%*
> Amean system-128 36.27 ( 0.00%) 36.31 * -0.10%*
> Amean elapsed-1 4.88 ( 0.00%) 4.85 * 0.60%*
> Amean elapsed-4 1.27 ( 0.00%) 1.26 * 1.00%*
> Amean elapsed-7 0.73 ( 0.00%) 0.74 * -0.46%*
> Amean elapsed-12 0.55 ( 0.00%) 0.55 * 1.09%*
> Amean elapsed-21 0.59 ( 0.00%) 0.60 * -0.96%*
> Amean elapsed-30 0.60 ( 0.00%) 0.60 * 0.28%*
> Amean elapsed-48 0.60 ( 0.00%) 0.60 * 0.44%*
> Amean elapsed-79 0.49 ( 0.00%) 0.49 * -0.07%*
> Amean elapsed-110 0.36 ( 0.00%) 0.36 * 0.28%*
> Amean elapsed-128 0.31 ( 0.00%) 0.31 * -0.43%*
>
> - Raspberry Pi 4, 1 socket, bcm2711, Cortex-A72, 4-Cores, 1 NUMA node, arm64.
>
> pft timings:
> vanilla rcu
> Amean system-1 0.67 ( 0.00%) 0.67 * -1.25%*
> Amean system-3 1.30 ( 0.00%) 1.29 * 0.62%*
> Amean system-4 1.61 ( 0.00%) 1.59 * 0.95%*
> Amean elapsed-1 0.71 ( 0.00%) 0.72 * -1.17%*
> Amean elapsed-3 0.45 ( 0.00%) 0.45 * 0.88%*
> Amean elapsed-4 0.42 ( 0.00%) 0.42 * 1.19%*
>
>

2022-03-07 15:24:47

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Mel,
Thanks for having a look at this.

On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> On Tue, Feb 08, 2022 at 11:07:48AM +0100, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> > one that allows accessing the lists remotely. Currently, only a local CPU is
> > permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> > whenever a process demands it by means of queueing a drain task on the local
> > CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> > take any sort of interruption and to some lesser extent inconveniences idle and
> > virtualised systems.
> >
>
> I know this has been sitting here for a long while. Last few weeks have
> not been fun.

No problem.

> > Note that this is not the first attempt at fixing this per-cpu page lists:
> > - The first attempt[1] tried to conditionally change the pagesets locking
> > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > for NOHZ_FULL setups, which isn't ideal.
> > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > performance degradation was too big.
> >
>
> For unrelated reasons I looked at using llist to avoid locks entirely. It
> turns out it's not possible and needs a lock. We know "local_locks to
> per-cpu spinlocks" took a large penalty so I considered alternatives on
> how a lock could be used. I found it's possible to both remote drain
> the lists and avoid the disable/enable of IRQs entirely as long as a
> preempting IRQ is willing to take the zone lock instead (should be very
> rare). The IRQ part is a bit hairy though as softirqs are also a problem
> and preempt-rt needs different rules and the llist has to sort PCP
> refills which might be a loss in total. However, the remote draining may
> still be interesting. The full series is at
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2

I'll have a proper look at it soon.

Regards,

--
Nicolás Sáenz

2022-03-09 00:35:48

by Marcelo Tosatti

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Thu, Mar 03, 2022 at 11:45:50AM +0000, Mel Gorman wrote:
> On Tue, Feb 08, 2022 at 11:07:48AM +0100, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> > one that allows accessing the lists remotely. Currently, only a local CPU is
> > permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> > whenever a process demands it by means of queueing a drain task on the local
> > CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> > take any sort of interruption and to some lesser extent inconveniences idle and
> > virtualised systems.
> >
>
> I know this has been sitting here for a long while. Last few weeks have
> not been fun.
>
> > Note that this is not the first attempt at fixing this per-cpu page lists:
> > - The first attempt[1] tried to conditionally change the pagesets locking
> > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > for NOHZ_FULL setups, which isn't ideal.
> > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > performance degradation was too big.
> >
>
> For unrelated reasons I looked at using llist to avoid locks entirely. It
> turns out it's not possible and needs a lock. We know "local_locks to
> per-cpu spinlocks" took a large penalty so I considered alternatives on
> how a lock could be used. I found it's possible to both remote drain
> the lists and avoid the disable/enable of IRQs entirely as long as a
> preempting IRQ is willing to take the zone lock instead (should be very
> rare). The IRQ part is a bit hairy though as softirqs are also a problem
> and preempt-rt needs different rules and the llist has to sort PCP
> refills which might be a loss in total. However, the remote draining may
> still be interesting. The full series is at
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
>
> It's still waiting on tests to complete and not all the changelogs are
> complete which is why it's not posted.
>
> This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> for the page faulting microbench I originally complained about. The test
> machine is a 2-socket CascadeLake machine.
>
> pft timings
> 5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
> vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
> Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
> Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
> Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
> Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
> Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
> Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
> Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
> Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
> Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)
>
> Note the local_lock conversion took a 1-17% penalty while the git tree
> takes a negligible penalty while still allowing remote drains. It might
> have some potential while being less complex than the RCU approach.

Nice!

Hopefully a spinlock can be added to "struct lru_pvecs" without
degradation in performance, similarly to what is done here.

2022-03-11 00:12:19

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Mon, Mar 07, 2022 at 02:57:47PM +0100, Nicolas Saenz Julienne wrote:
> > > Note that this is not the first attempt at fixing this per-cpu page lists:
> > > - The first attempt[1] tried to conditionally change the pagesets locking
> > > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > > for NOHZ_FULL setups, which isn't ideal.
> > > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > > performance degradation was too big.
> > >
> >
> > For unrelated reasons I looked at using llist to avoid locks entirely. It
> > turns out it's not possible and needs a lock. We know "local_locks to
> > per-cpu spinlocks" took a large penalty so I considered alternatives on
> > how a lock could be used. I found it's possible to both remote drain
> > the lists and avoid the disable/enable of IRQs entirely as long as a
> > preempting IRQ is willing to take the zone lock instead (should be very
> > rare). The IRQ part is a bit hairy though as softirqs are also a problem
> > and preempt-rt needs different rules and the llist has to sort PCP
> > refills which might be a loss in total. However, the remote draining may
> > still be interesting. The full series is at
> > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
>
> I'll have a proper look at it soon.
>

Thanks. I'm still delayed actually finishing the series as most of my
time is dedicated to a separate issue. However, there is at least one
bug in there at patch "mm/page_alloc: Remotely drain per-cpu lists"
that causes a lockup under severe memory pressure. The fix is

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c9a6f2b5548e..11b54f383d04 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3065,10 +3065,8 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
{
- unsigned long flags;
int to_drain, batch;

- pcp_local_lock(&pagesets.lock, flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
if (to_drain > 0) {
@@ -3076,7 +3074,6 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
free_pcppages_bulk(zone, to_drain, pcp, 0);
spin_unlock(&pcp->lock);
}
- pcp_local_unlock(&pagesets.lock, flags);
}
#endif

@@ -3088,16 +3085,12 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
unsigned long flags;
struct per_cpu_pages *pcp;

- pcp_local_lock(&pagesets.lock, flags);
-
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
if (pcp->count) {
spin_lock(&pcp->lock);
free_pcppages_bulk(zone, pcp->count, pcp, 0);
spin_unlock(&pcp->lock);
}
-
- pcp_local_unlock(&pagesets.lock, flags);
}

/*

2022-03-25 19:40:43

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Thu, Mar 24, 2022 at 07:59:56PM +0100, Nicolas Saenz Julienne wrote:
> Hi Mel,
>
> On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> > For unrelated reasons I looked at using llist to avoid locks entirely. It
> > turns out it's not possible and needs a lock. We know "local_locks to
> > per-cpu spinlocks" took a large penalty so I considered alternatives on
> > how a lock could be used. I found it's possible to both remote drain
> > the lists and avoid the disable/enable of IRQs entirely as long as a
> > preempting IRQ is willing to take the zone lock instead (should be very
> > rare). The IRQ part is a bit hairy though as softirqs are also a problem
> > and preempt-rt needs different rules and the llist has to sort PCP
> > refills which might be a loss in total. However, the remote draining may
> > still be interesting. The full series is at
> > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
> >
> > It's still waiting on tests to complete and not all the changelogs are
> > complete which is why it's not posted.
> >
> > This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> > versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> > for the page faulting microbench I originally complained about. The test
> > machine is a 2-socket CascadeLake machine.
> >
> > pft timings
> > 5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
> > vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
> > Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
> > Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
> > Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
> > Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
> > Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
> > Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
> > Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
> > Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
> > Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)
> >
> > Note the local_lock conversion took a 1-17% penalty while the git tree
> > takes a negligible penalty while still allowing remote drains. It might
> > have some potential while being less complex than the RCU approach.
>
> I finally got some time to look at this and made some progress:
>
> First, I believe your 'mm-remotedrain-v2r1' results are wrong/inflated due to a
> bug in my series. Essentially, all 'this_cpu_ptr()' calls should've been
> 'raw_cpu_ptr()' and your build, which I bet enables CONFIG_DEBUG_PREEMPT,

I no longer have the logs but it could have been, and I didn't check the
dmesg at the time to see if there were warnings in it. It really should
add something to parse that log and automatically report if there are
unexpected warnings, oopses or prove-locking warnings if enabled.

> wasted time throwing warnings about per-cpu variable usage with preemption
> enabled. Making the overall performance look worse than it actually is. My
> build didn't enable it, which made me miss this whole issue. I'm sorry for the
> noise and time wasted on such a silly thing. Note that the local_lock to
> spin_lock conversion can handle the preemption alright, it is part of the
> design[1].
>
> As for your idea of not disabling interrupts in the hot paths, it seems to
> close the performance gap created by the lock conversion. That said, I'm not
> sure I understand why you find the need to keep the local_locks around, not
> only it causes problems for RT systems, but IIUC they aren't really protecting
> anything other than the 'this_cpu_ptr()' usage (which isn't really needed).

The local lock was preserved because something has to stabilise the per-cpu
pointer due to preemption and migration. On !RT, that's a preempt_disable
and on RT it's a spinlock, both of which prevent migration. The secondary
goal of using the local lock was to allow some allocations to be done without
disabling IRQs at all, with the penalty that an IRQ arriving at the
wrong time will have to allocate directly from the buddy lists, which
should be rare.
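
To contrast with the pattern quoted below, the shape my series uses is
roughly the following (an illustrative sketch only; do_stuff_with_pcp_lists()
is the same placeholder as in your example, and the real tree wraps the
local lock in pcp_local_lock() rather than using it bare):

	struct per_cpu_pages *pcp;

	/*
	 * Stabilise the CPU first: a preempt_disable() on !RT, a per-CPU
	 * spinlock on RT. Neither the pcp pointer nor the per-cpu vmstat
	 * deltas touched inside can move to another CPU underneath us.
	 */
	local_lock(&pagesets.lock);
	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	spin_lock(&pcp->lock);		/* remote drains take this too */
	do_stuff_with_pcp_lists(pcp);
	spin_unlock(&pcp->lock);
	local_unlock(&pagesets.lock);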

> I
> rewrote your patch on top of my lock conversion series and I'm in the process
> of testing it on multiple systems[2].
>
> Let me know what you think.
> Thanks!
>
> [1] It follows this pattern:
>
> struct per_cpu_pages *pcp;
>
> pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
> // <- Migration here is OK: spin_lock protects vs eventual pcplist
> // access from local CPU as long as all list access happens through the
> // pcp pointer.
> spin_lock(&pcp->lock);
> do_stuff_with_pcp_lists(pcp);
> spin_unlock(&pcp->lock);
>

And this was the part I am concerned with. We are accessing a PCP
structure that is not necessarily the one belonging to the CPU we
are currently running on. This type of pattern is warned about in
Documentation/locking/locktypes.rst

---8<---
A typical scenario is protection of per-CPU variables in thread context::

struct foo *p = get_cpu_ptr(&var1);

spin_lock(&p->lock);
p->count += this_cpu_read(var2);

This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
not allow to acquire p->lock because get_cpu_ptr() implicitly disables
preemption. The following substitution works on both kernels::
---8<---

Now we don't explicitly have this pattern because there isn't an
obvious this_cpu_read() for example but it can accidentally happen for
counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
an example although a harmless one.

Any of the mod_page_state ones are more problematic though because we
lock one PCP but potentially update the per-cpu pcp stats of another CPU
of a different PCP that we have not locked and those counters must be
accurate.

It *might* still be safe but it's subtle, it could be easily accidentally
broken in the future and it would be hard to detect because it would be
very slow corruption of VM counters like NR_FREE_PAGES that must be
accurate.

--
Mel Gorman
SUSE Labs

2022-03-25 19:46:59

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Mel,

On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> For unrelated reasons I looked at using llist to avoid locks entirely. It
> turns out it's not possible and needs a lock. We know "local_locks to
> per-cpu spinlocks" took a large penalty so I considered alternatives on
> how a lock could be used. I found it's possible to both remote drain
> the lists and avoid the disable/enable of IRQs entirely as long as a
> preempting IRQ is willing to take the zone lock instead (should be very
> rare). The IRQ part is a bit hairy though as softirqs are also a problem
> and preempt-rt needs different rules and the llist has to sort PCP
> refills which might be a loss in total. However, the remote draining may
> still be interesting. The full series is at
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
>
> It's still waiting on tests to complete and not all the changelogs are
> complete which is why it's not posted.
>
> This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> for the page faulting microbench I originally complained about. The test
> machine is a 2-socket CascadeLake machine.
>
> pft timings
> 5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
> vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
> Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
> Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
> Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
> Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
> Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
> Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
> Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
> Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
> Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)
>
> Note the local_lock conversion took a 1-17% penalty while the git tree
> takes a negligible penalty while still allowing remote drains. It might
> have some potential while being less complex than the RCU approach.

I finally got some time to look at this and made some progress:

First, I believe your 'mm-remotedrain-v2r1' results are wrong/inflated due to a
bug in my series. Essentially, all 'this_cpu_ptr()' calls should've been
'raw_cpu_ptr()', and your build, which I bet enables CONFIG_DEBUG_PREEMPT,
wasted time throwing warnings about per-cpu variable usage with preemption
enabled, making the overall performance look worse than it actually is. My
build didn't enable it, which made me miss this whole issue. I'm sorry for the
noise and time wasted on such a silly thing. Note that the local_lock to
spin_lock conversion can handle the preemption alright, it is part of the
design[1].

As for your idea of not disabling interrupts in the hot paths, it seems to
close the performance gap created by the lock conversion. That said, I'm not
sure I understand why you find the need to keep the local_locks around: not
only does it cause problems for RT systems, but IIUC they aren't really
protecting anything other than the 'this_cpu_ptr()' usage (which isn't really
needed). I rewrote your patch on top of my lock conversion series and I'm in
the process of testing it on multiple systems[2].

Let me know what you think.
Thanks!

[1] It follows this pattern:

struct per_cpu_pages *pcp;

pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
// <- Migration here is OK: spin_lock protects vs eventual pcplist
// access from local CPU as long as all list access happens through the
// pcp pointer.
spin_lock(&pcp->lock);
do_stuff_with_pcp_lists(pcp);
spin_unlock(&pcp->lock);

[2] See:

git://git.kernel.org/pub/scm/linux/kernel/git/nsaenz/linux-rpi.git pcpdrain-sl-v3r1

--
Nicolás Sáenz

2022-03-28 21:36:21

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Mel,

On Fri, 2022-03-25 at 10:48 +0000, Mel Gorman wrote:
> > [1] It follows this pattern:
> >
> > struct per_cpu_pages *pcp;
> >
> > pcp = raw_cpu_ptr(page_zone(page)->per_cpu_pageset);
> > // <- Migration here is OK: spin_lock protects vs eventual pcplist
> > // access from local CPU as long as all list access happens through the
> > // pcp pointer.
> > spin_lock(&pcp->lock);
> > do_stuff_with_pcp_lists(pcp);
> > spin_unlock(&pcp->lock);
> >
>
> And this was the part I am concerned with. We are accessing a PCP
> structure that is not necessarily the one belonging to the CPU we
> are currently running on. This type of pattern is warned about in
> Documentation/locking/locktypes.rst
>
> ---8<---
> A typical scenario is protection of per-CPU variables in thread context::
>
> struct foo *p = get_cpu_ptr(&var1);
>
> spin_lock(&p->lock);
> p->count += this_cpu_read(var2);
>
> This is correct code on a non-PREEMPT_RT kernel, but on a PREEMPT_RT kernel
> this breaks. The PREEMPT_RT-specific change of spinlock_t semantics does
> not allow to acquire p->lock because get_cpu_ptr() implicitly disables
> preemption. The following substitution works on both kernels::
> ---8<---
>
> Now we don't explicitly have this pattern because there isn't an
> obvious this_cpu_read() for example but it can accidentally happen for
> counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
> an example although a harmless one.
>
> Any of the mod_page_state ones are more problematic though because we
> lock one PCP but potentially update the per-cpu pcp stats of another CPU
> of a different PCP that we have not locked and those counters must be
> accurate.

But IIUC vmstats don't track pcplist usage (i.e. adding a page into the local
pcplist doesn't affect the count at all). It is only when interacting with the
buddy allocator that they get updated. It makes sense for the CPU that
adds/removes pages from the allocator to do the stat update, regardless of the
page's journey.

> It *might* still be safe but it's subtle, it could be easily accidentally
> broken in the future and it would be hard to detect because it would be
> very slow corruption of VM counters like NR_FREE_PAGES that must be
> accurate.

What does accurate mean here? vmstat consumers don't get accurate data, only
snapshots. And as I comment above you can't infer information about pcplist
usage from these stats. So, I see no real need for CPU locality when updating
them (which we're still retaining nonetheless, as per my comment above); the
only thing that is really needed is atomicity, achieved by disabling IRQs (and
preemption on RT). And this, even with your solution, is achieved through the
struct zone's spin_lock (plus a preempt_disable() in RT).

All in all, my point is that none of the stats are affected by the change, nor
do they have a dependency on the pcplists handling. And if we ever have the
need to pin vmstat updates to pcplist usage they should share the same pcp
structure. That said, I'm happy with either solution as long as we get remote
pcplist draining. So if you're still unconvinced, let me know how I can help.
I have access to all sorts of machines to validate perf results, time to
review, or even to move the series forward.

Thanks!

--
Nicolás Sáenz

2022-03-29 12:42:54

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Mon, Mar 28, 2022 at 03:51:43PM +0200, Nicolas Saenz Julienne wrote:
> > Now we don't explicitly have this pattern because there isn't an
> > obvious this_cpu_read() for example but it can accidentally happen for
> > counting. __count_zid_vm_events -> __count_vm_events -> raw_cpu_add is
> > an example although a harmless one.
> >
> > Any of the mod_page_state ones are more problematic though because we
> > lock one PCP but potentially update the per-cpu pcp stats of another CPU
> > of a different PCP that we have not locked and those counters must be
> > accurate.
>
> But IIUC vmstats don't track pcplist usage (i.e. adding a page into the local
> pcplist doesn't affect the count at all). It is only when interacting with the
> buddy allocator that they get updated. It makes sense for the CPU that
> adds/removes pages from the allocator to do the stat update, regardless of the
> page's journey.
>

It probably doesn't; I didn't audit it. As I said, it's subtle, which is
why I'm wary of relying on the accidental safety of getting a per-cpu pointer
that may not be stable. Even if it was ok *now*, I would worry that it
would break in the future. There have already been cases where patches
accidentally tried to move vmstats outside the appropriate locking.

> > It *might* still be safe but it's subtle, it could be easily accidentally
> > broken in the future and it would be hard to detect because it would be
> > very slow corruption of VM counters like NR_FREE_PAGES that must be
> > accurate.
>
> What does accurate mean here? vmstat consumers don't get accurate data, only
> snapshots.

They are accurate in the sense that they have "eventual consistency".
zone_page_state_snapshot exists to get a more precise count; there is always
some drift, but the counter is accurate eventually. There is a clear
distinction between VM event counters, which can be inaccurate as they only
assist debugging, and vmstats like NR_FREE_PAGES that the kernel uses to
make decisions. It potentially gets very problematic if a per-cpu pointer
acquired on one CPU is used after a migration to another CPU and the wrong
per-cpu vmstat is updated. It *might* still be ok, I haven't audited it, but
if it is possible for two CPUs to do a RMW on one per-cpu vmstat structure,
it will corrupt and it'll be difficult to detect.
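
For context, the eventual consistency comes from readers folding the
still-pending per-cpu deltas into the global counter; a simplified sketch of
zone_page_state_snapshot() (details vary by kernel version) looks like this:

/* Simplified sketch of zone_page_state_snapshot(). */
static unsigned long zone_page_state_snapshot(struct zone *zone,
					      enum zone_stat_item item)
{
	/* Global counter, only updated when a CPU's delta overflows
	 * its threshold. */
	long x = atomic_long_read(&zone->vm_stat[item]);
	int cpu;

	/* Fold in the deltas still sitting in each CPU's per-cpu
	 * structure. This is only meaningful if every delta is written
	 * exclusively by its owning CPU; a remote, non-atomic RMW on the
	 * same delta can silently lose updates. */
	for_each_online_cpu(cpu)
		x += per_cpu_ptr(zone->per_cpu_zonestats, cpu)->vm_stat_diff[item];

	return x < 0 ? 0 : x;
}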

> And, as I comment above, you can't infer information about pcplist
> usage from these stats. So I see no real need for CPU locality when updating
> them (which we're still retaining nonetheless, as per my comment above); the
> only thing that is really needed is atomicity, achieved by disabling IRQs (and
> preemption on RT). And this, even with your solution, is achieved through the
> struct zone's spin_lock (plus a preempt_disable() on RT).
>

Yes, but under the series I had, I was using local_lock to stabilise which
CPU is being used before acquiring the per-cpu pointer. Strictly speaking,
it doesn't need a local_lock, but the local_lock is clearer in terms of
what is being protected, and it works with PROVE_LOCKING, which already
caught a problematic softirq interaction for me while developing the series.
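
The pattern being described is roughly the one below (a sketch only: the
helper name is made up and the list handling is abbreviated; the local_lock
mirrors the 'pagesets' lock in mainline mm/page_alloc.c). The CPU is pinned
before the per-cpu pointer is taken, so the pointer and any vmstat deltas
updated under it are guaranteed to belong to the running CPU:

#include <linux/local_lock.h>

struct pagesets {
	local_lock_t lock;
};
static DEFINE_PER_CPU(struct pagesets, pagesets) = {
	.lock = INIT_LOCAL_LOCK(lock),
};

/* Hypothetical helper, abbreviated: frees one order-0 page to the local
 * per-cpu list. */
static void free_page_to_pcplist(struct zone *zone, struct page *page)
{
	struct per_cpu_pages *pcp;
	unsigned long flags;

	/* Stabilises the CPU (IRQs off, or a per-CPU lock with migration
	 * disabled on PREEMPT_RT) and is visible to PROVE_LOCKING,
	 * *before* the per-cpu pointer is taken. */
	local_lock_irqsave(&pagesets.lock, flags);

	pcp = this_cpu_ptr(zone->per_cpu_pageset);
	list_add(&page->lru, &pcp->lists[0]);
	pcp->count++;

	local_unlock_irqrestore(&pagesets.lock, flags);
}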

> All in all, my point is that none of the stats are affected by the change, nor
> do they have a dependency on the pcplists handling. And if we ever need to pin
> vmstat updates to pcplist usage, they should share the same pcp structure.
> That said, I'm happy with either solution as long as we get remote pcplist
> draining. So if you're still unconvinced, let me know how I can help. I have
> access to all sorts of machines to validate perf results, and time to review
> or even to move the series forward.
>

I also want the remote draining for PREEMPT_RT, to avoid interfering with
isolated CPUs due to workqueue activity, but whatever the solution, I would
be happier if the per-cpu lock were acquired with the CPU stabilised and
covered the scope of any vmstat delta updates stored in the per-cpu
structure. The earliest I will rebase my series is 5.18-rc1, as I see
limited value in basing it on 5.17 while aiming for a 5.19 merge window.

--
Mel Gorman
SUSE Labs

2022-03-31 04:43:20

by Nicolas Saenz Julienne

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

Hi Mel,

On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> On Tue, Feb 08, 2022 at 11:07:48AM +0100, Nicolas Saenz Julienne wrote:
> > This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> > one that allows accessing the lists remotely. Currently, only a local CPU is
> > permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> > whenever a process demands it by means of queueing a drain task on the local
> > CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> > take any sort of interruption and to some lesser extent inconveniences idle and
> > virtualised systems.
> >
>
> I know this has been sitting here for a long while. Last few weeks have
> not been fun.
>
> > Note that this is not the first attempt at fixing this per-cpu page lists:
> > - The first attempt[1] tried to conditionally change the pagesets locking
> > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > for NOHZ_FULL setups, which isn't ideal.
> > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > performance degradation was too big.
> >
>
> For unrelated reasons I looked at using llist to avoid locks entirely. It
> turns out it's not possible and needs a lock. We know "local_locks to
> per-cpu spinlocks" took a large penalty so I considered alternatives on
> how a lock could be used. I found it's possible to both remote drain
> the lists and avoid the disable/enable of IRQs entirely as long as a
> preempting IRQ is willing to take the zone lock instead (should be very
> rare). The IRQ part is a bit hairy though as softirqs are also a problem
> and preempt-rt needs different rules and the llist has to sort PCP
> refills which might be a loss in total. However, the remote draining may
> still be interesting. The full series is at
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
>
> It's still waiting on tests to complete and not all the changelogs are
> complete which is why it's not posted.
>
> This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> for the page faulting microbench I originally complained about. The test
> machine is a 2-socket CascadeLake machine.
>
> pft timings
> 5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
> vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
> Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
> Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
> Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
> Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
> Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
> Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
> Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
> Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
> Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)
>
> Note the local_lock conversion took a 1-17% penalty while the git tree
> takes a negligible penalty while still allowing remote drains. It might
> have some potential while being less complex than the RCU approach.

I've been made aware of a problem with the spin_trylock() approach. It doesn't
work for UP since in that context spin_lock() is a NOOP (well, it only disables
preemption). So nothing prevents a race with an IRQ.

It's kinda funny, breakages generally happen the other way around.
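
To illustrate (a sketch only: the spinlock in 'struct per_cpu_pages' is the
one from the approach above, not mainline code, and the list handling is
simplified): with !CONFIG_SMP, spin_lock() and spin_trylock() reduce to
preempt_disable() (see include/linux/spinlock_api_up.h), so an interrupt
arriving in the middle of the update can still free pages to the same list:

/* Sketch: assumes per_cpu_pages has grown a spinlock_t as in the
 * approach being discussed; not code from the posted tree. */
static void pcp_free_one(struct per_cpu_pages *pcp, struct page *page,
			 unsigned int order)
{
	spin_lock(&pcp->lock);		/* On UP: only preempt_disable() */
	list_add(&page->lru, &pcp->lists[order]);
	pcp->count += 1 << order;	/* An IRQ freeing to the same list
					 * can interleave here on UP and
					 * corrupt the list/count. */
	spin_unlock(&pcp->lock);
}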

Regards,

--
Nicolás Sáenz

2022-03-31 21:22:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 0/2] mm/page_alloc: Remote per-cpu lists drain support

On Wed, Mar 30, 2022 at 01:29:04PM +0200, Nicolas Saenz Julienne wrote:
> Hi Mel,
>
> On Thu, 2022-03-03 at 11:45 +0000, Mel Gorman wrote:
> > On Tue, Feb 08, 2022 at 11:07:48AM +0100, Nicolas Saenz Julienne wrote:
> > > This series replaces mm/page_alloc's per-cpu page lists drain mechanism with
> > > one that allows accessing the lists remotely. Currently, only a local CPU is
> > > permitted to change its per-cpu lists, and it's expected to do so, on-demand,
> > > whenever a process demands it by means of queueing a drain task on the local
> > > CPU. This causes problems for NOHZ_FULL CPUs and real-time systems that can't
> > > take any sort of interruption and to some lesser extent inconveniences idle and
> > > virtualised systems.
> > >
> >
> > I know this has been sitting here for a long while. Last few weeks have
> > not been fun.
> >
> > > Note that this is not the first attempt at fixing this per-cpu page lists:
> > > - The first attempt[1] tried to conditionally change the pagesets locking
> > > scheme based the NOHZ_FULL config. It was deemed hard to maintain as the
> > > NOHZ_FULL code path would be rarely tested. Also, this only solves the issue
> > > for NOHZ_FULL setups, which isn't ideal.
> > > - The second[2] unanimously switched the local_locks to per-cpu spinlocks. The
> > > performance degradation was too big.
> > >
> >
> > For unrelated reasons I looked at using llist to avoid locks entirely. It
> > turns out it's not possible and needs a lock. We know "local_locks to
> > per-cpu spinlocks" took a large penalty so I considered alternatives on
> > how a lock could be used. I found it's possible to both remote drain
> > the lists and avoid the disable/enable of IRQs entirely as long as a
> > preempting IRQ is willing to take the zone lock instead (should be very
> > rare). The IRQ part is a bit hairy though as softirqs are also a problem
> > and preempt-rt needs different rules and the llist has to sort PCP
> > refills which might be a loss in total. However, the remote draining may
> > still be interesting. The full series is at
> > https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git/ mm-pcpllist-v1r2
> >
> > It's still waiting on tests to complete and not all the changelogs are
> > complete which is why it's not posted.
> >
> > This is a comparison of vanilla vs "local_locks to per-cpu spinlocks"
> > versus the git series up to "mm/page_alloc: Remotely drain per-cpu lists"
> > for the page faulting microbench I originally complained about. The test
> > machine is a 2-socket CascadeLake machine.
> >
> > pft timings
> > 5.17.0-rc5 5.17.0-rc5 5.17.0-rc5
> > vanilla mm-remotedrain-v2r1 mm-pcpdrain-v1r1
> > Amean elapsed-1 32.54 ( 0.00%) 33.08 * -1.66%* 32.82 * -0.86%*
> > Amean elapsed-4 8.66 ( 0.00%) 9.24 * -6.72%* 8.69 * -0.38%*
> > Amean elapsed-7 5.02 ( 0.00%) 5.43 * -8.16%* 5.05 * -0.55%*
> > Amean elapsed-12 3.07 ( 0.00%) 3.38 * -10.00%* 3.09 * -0.72%*
> > Amean elapsed-21 2.36 ( 0.00%) 2.38 * -0.89%* 2.19 * 7.39%*
> > Amean elapsed-30 1.75 ( 0.00%) 1.87 * -6.50%* 1.62 * 7.59%*
> > Amean elapsed-48 1.71 ( 0.00%) 2.00 * -17.32%* 1.71 ( -0.08%)
> > Amean elapsed-79 1.56 ( 0.00%) 1.62 * -3.84%* 1.56 ( -0.02%)
> > Amean elapsed-80 1.57 ( 0.00%) 1.65 * -5.31%* 1.57 ( -0.04%)
> >
> > Note the local_lock conversion took a 1-17% penalty while the git tree
> > takes a negligible penalty while still allowing remote drains. It might
> > have some potential while being less complex than the RCU approach.
>
> I've been made aware of a problem with the spin_trylock() approach. It doesn't
> work for UP since in that context spin_lock() is a NOOP (well, it only disables
> preemption). So nothing prevents a race with an IRQ.
>

I didn't think of UP being a problem. I'm offline shortly until early next
week, but superficially the spin_[try]lock for the PCP would need pcp_lock
and pcp_trylock helpers. On SMP, they would map to the equivalent spinlock
operations. On UP, pcp_lock would map to spin_lock, but pcp_trylock would
likely need to map to spin_lock_irqsave. It means that UP would always
disable IRQs, but that would be no worse than the current allocator.

--
Mel Gorman
SUSE Labs