2021-05-21 12:02:59

by Mel Gorman

Subject: [RFC PATCH 0/6] Calculate pcp->high based on zone sizes and active CPUs

The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock but the sizing of batch and high is archaic and takes neither the zone
size nor the number of CPUs local to a zone into account. Furthermore, the
fact that vm.percpu_pagelist_fraction adjusts both the batch and high values
means that the sysctl can reduce zone lock contention but also increase
allocation latencies.

This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone with limited impact to
reclaim and accounting for active CPUs but leaves pcp->batch static.
It also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.

The motivation is partially to adjust to larger memory sizes but is
also driven by the fact that large batches of page freeing via
release_pages() often show zone lock contention as a major part of the
problem. Another motivation is a bug report based on an older kernel where
a multi-terabyte process can take several minutes to exit. A workaround
was to use vm.percpu_pagelist_fraction to increase the pcp->high value
but testing indicated that a production workload could not use the same
values because of an increase in allocation latencies. Unfortunately,
I cannot reproduce this test case myself as the multi-terabyte machines
are in active use, but this series should alleviate the problem.

The series aims to address both problems and partially acts as a
prerequisite for later work. The PCP currently handles only order-0 pages,
which makes it useless for SLUB (when using high-order allocations) and
THP (unconditionally). To store high-order pages on the PCP, the pcp->high
values need to be increased first.
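
As a rough sketch of how the sizing changes across the series (simplified
from the code in patches 1 and 2; the real calculation includes rounding
and a later CPU hotplug adjustment):

	/* Before: pcp->high tied to pcp->batch, independent of local CPU count */
	batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
	high  = 6 * batch;

	/* After: pcp->batch unchanged, pcp->high derived from the zone itself */
	high  = low_wmark_pages(zone) / nr_cpus_local_to_zone;

where nr_cpus_local_to_zone stands in for the
cpumask_weight(cpumask_of_node(zone_to_nid(zone))) calculation added in
patch 2.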

Documentation/admin-guide/sysctl/vm.rst | 19 +--
include/linux/cpuhotplug.h | 2 +-
include/linux/mmzone.h | 8 +-
kernel/sysctl.c | 8 +-
mm/internal.h | 2 +-
mm/memory_hotplug.c | 4 +-
mm/page_alloc.c | 166 +++++++++++++++++-------
mm/vmscan.c | 35 +++++
8 files changed, 179 insertions(+), 65 deletions(-)

--
2.26.2


2021-05-21 12:04:05

by Mel Gorman

Subject: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch

The pcp high watermark is based on the batch size but there is no real
relationship between them other than that it is a convenient value to
derive early in boot.

This patch takes the first step and bases pcp->high on the zone low
watermark split across the number of CPUs local to a zone while the batch
size remains the same to avoid increasing allocation latencies. The intent
behind the default pcp->high is "set the number of PCP pages such that
if they are all full that background reclaim is not started prematurely".

Note that in this patch the pcp->high values are adjusted after memory
hotplug events, min_free_kbytes adjustments and watermark scale factor
adjustments but not CPU hotplug events.

On a test KVM instance:

Before grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 378
batch: 63

After grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63
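
For reference, the old value is simply 6 * pcp->batch (6 * 63 = 378)
regardless of the zone size, while the new value is the zone's low
watermark divided by the number of online CPUs local to the zone.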

Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 53 ++++++++++++++++++++++++++++++++-----------------
1 file changed, 35 insertions(+), 18 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a48f305f0381..bf5cdc466e6c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
/* Block until all are initialised */
wait_for_completion(&pgdat_init_all_done_comp);

- /*
- * The number of managed pages has changed due to the initialisation
- * so the pcpu batch and high limits needs to be updated or the limits
- * will be artificially small.
- */
- for_each_populated_zone(zone)
- zone_pcp_update(zone);
-
/*
* We initialized the rest of the deferred pages. Permanently disable
* on-demand struct page initialization.
@@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
int batch;

/*
- * The per-cpu-pages pools are set to around 1000th of the
- * size of the zone.
+ * The number of pages to batch allocate is either 0.1%
+ * of the zone or 1MB, whichever is smaller. The batch
+ * size is striking a balance between allocation latency
+ * and zone lock contention.
*/
- batch = zone_managed_pages(zone) / 1024;
- /* But no more than a meg. */
- if (batch * PAGE_SIZE > 1024 * 1024)
- batch = (1024 * 1024) / PAGE_SIZE;
+ batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
batch /= 4; /* We effectively *= 4 below */
if (batch < 1)
batch = 1;
@@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
#endif
}

+static int zone_highsize(struct zone *zone)
+{
+#ifdef CONFIG_MMU
+ int high;
+ int nr_local_cpus;
+
+ /*
+ * The high value of the pcp is based on the zone low watermark
+ * when reclaim is potentially active spread across the online
+ * CPUs local to a zone. Note that early in boot that CPUs may
+ * not be online yet.
+ */
+ nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+ high = low_wmark_pages(zone) / nr_local_cpus;
+
+ return high;
+#else
+ return 0;
+#endif
+}
+
/*
* pcp->high and pcp->batch values are related and generally batch is lower
* than high. They are also related to pcp->count such that count is lower
@@ -6698,11 +6710,10 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
*/
static void zone_set_pageset_high_and_batch(struct zone *zone)
{
- unsigned long new_high, new_batch;
+ int new_high, new_batch;

- new_batch = zone_batchsize(zone);
- new_high = 6 * new_batch;
- new_batch = max(1UL, 1 * new_batch);
+ new_batch = max(1, zone_batchsize(zone));
+ new_high = zone_highsize(zone);

if (zone->pageset_high == new_high &&
zone->pageset_batch == new_batch)
@@ -8170,6 +8181,12 @@ static void __setup_per_zone_wmarks(void)
zone->_watermark[WMARK_LOW] = min_wmark_pages(zone) + tmp;
zone->_watermark[WMARK_HIGH] = min_wmark_pages(zone) + tmp * 2;

+ /*
+ * The watermark size have changed so update the pcpu batch
+ * and high limits or the limits may be inappropriate.
+ */
+ zone_set_pageset_high_and_batch(zone);
+
spin_unlock_irqrestore(&zone->lock, flags);
}

--
2.26.2

2021-05-21 12:04:35

by Mel Gorman

Subject: [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction

The vm.percpu_pagelist_fraction is used to increase the batch and high
limits for the per-cpu page allocator (PCP). The intent behind the sysctl
is to reduce zone lock acquisition when allocating/freeing pages but it has
a problem. While it can decrease contention, it can also increase latency
on the allocation side due to unreasonably large batch sizes. This leads
to games where an administrator adjusts percpu_pagelist_fraction on the
fly to work around contention and allocation latency problems.

This series aims to alleviate the problems with zone lock contention while
avoiding the allocation-side latency problems. For the purposes of review,
it's easier to remove this sysctl now and reintroduce a similar sysctl
later in the series that deals only with pcp->high.
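
As a rough illustration of how the removed code couples the two values
(hypothetical numbers): with vm.percpu_pagelist_fraction=8 on a 16GB zone
(4194304 4K pages), pcp->high becomes 4194304/8 = 524288 pages (2GB) for
every per-cpu list and pcp->batch is derived from it as high/4, capped at
PAGE_SHIFT * 8.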

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/admin-guide/sysctl/vm.rst | 19 ---------
include/linux/mmzone.h | 3 --
kernel/sysctl.c | 8 ----
mm/page_alloc.c | 55 ++-----------------------
4 files changed, 4 insertions(+), 81 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 586cd4b86428..2fcafccb53a8 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -64,7 +64,6 @@ files can be found in mm/swap.c.
- overcommit_ratio
- page-cluster
- panic_on_oom
-- percpu_pagelist_fraction
- stat_interval
- stat_refresh
- numa_stat
@@ -790,24 +789,6 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.


-percpu_pagelist_fraction
-========================
-
-This is the fraction of pages at most (high mark pcp->high) in each zone that
-are allocated for each per cpu page list. The min value for this is 8. It
-means that we don't allow more than 1/8th of pages in each zone to be
-allocated in any single per_cpu_pagelist. This entry only changes the value
-of hot per cpu pagelists. User can specify a number like 100 to allocate
-1/100th of each zone to each per cpu page list.
-
-The batch value of each per cpu pagelist is also updated as a result. It is
-set to pcp->high/4. The upper limit of batch is (PAGE_SHIFT * 8)
-
-The initial value is zero. Kernel does not use this value at boot time to set
-the high water marks for each per cpu page list. If the user writes '0' to this
-sysctl, it will revert to this default behavior.
-
-
stat_interval
=============

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d7740c97b87e..b449151745d7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1038,15 +1038,12 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
size_t *, loff_t *);
-int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *, int,
- void *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
int numa_zonelist_order_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
-extern int percpu_pagelist_fraction;
extern char numa_zonelist_order[];
#define NUMA_ZONELIST_ORDER_LEN 16

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 14edf84cc571..4e5ac50a1af0 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2889,14 +2889,6 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = &one_thousand,
},
- {
- .procname = "percpu_pagelist_fraction",
- .data = &percpu_pagelist_fraction,
- .maxlen = sizeof(percpu_pagelist_fraction),
- .mode = 0644,
- .proc_handler = percpu_pagelist_fraction_sysctl_handler,
- .extra1 = SYSCTL_ZERO,
- },
{
.procname = "page_lock_unfairness",
.data = &sysctl_page_lock_unfairness,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index ff8f706839ea..a48f305f0381 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -120,7 +120,6 @@ typedef int __bitwise fpi_t;

/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
-#define MIN_PERCPU_PAGELIST_FRACTION (8)

struct pagesets {
local_lock_t lock;
@@ -182,7 +181,6 @@ EXPORT_SYMBOL(_totalram_pages);
unsigned long totalreserve_pages __read_mostly;
unsigned long totalcma_pages __read_mostly;

-int percpu_pagelist_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
EXPORT_SYMBOL(init_on_alloc);
@@ -6696,22 +6694,15 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h

/*
* Calculate and set new high and batch values for all per-cpu pagesets of a
- * zone, based on the zone's size and the percpu_pagelist_fraction sysctl.
+ * zone based on the zone's size.
*/
static void zone_set_pageset_high_and_batch(struct zone *zone)
{
unsigned long new_high, new_batch;

- if (percpu_pagelist_fraction) {
- new_high = zone_managed_pages(zone) / percpu_pagelist_fraction;
- new_batch = max(1UL, new_high / 4);
- if ((new_high / 4) > (PAGE_SHIFT * 8))
- new_batch = PAGE_SHIFT * 8;
- } else {
- new_batch = zone_batchsize(zone);
- new_high = 6 * new_batch;
- new_batch = max(1UL, 1 * new_batch);
- }
+ new_batch = zone_batchsize(zone);
+ new_high = 6 * new_batch;
+ new_batch = max(1UL, 1 * new_batch);

if (zone->pageset_high == new_high &&
zone->pageset_batch == new_batch)
@@ -8377,44 +8368,6 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
return 0;
}

-/*
- * percpu_pagelist_fraction - changes the pcp->high for each zone on each
- * cpu. It is the fraction of total pages in each zone that a hot per cpu
- * pagelist can have before it gets flushed back to buddy allocator.
- */
-int percpu_pagelist_fraction_sysctl_handler(struct ctl_table *table, int write,
- void *buffer, size_t *length, loff_t *ppos)
-{
- struct zone *zone;
- int old_percpu_pagelist_fraction;
- int ret;
-
- mutex_lock(&pcp_batch_high_lock);
- old_percpu_pagelist_fraction = percpu_pagelist_fraction;
-
- ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
- if (!write || ret < 0)
- goto out;
-
- /* Sanity checking to avoid pcp imbalance */
- if (percpu_pagelist_fraction &&
- percpu_pagelist_fraction < MIN_PERCPU_PAGELIST_FRACTION) {
- percpu_pagelist_fraction = old_percpu_pagelist_fraction;
- ret = -EINVAL;
- goto out;
- }
-
- /* No change? */
- if (percpu_pagelist_fraction == old_percpu_pagelist_fraction)
- goto out;
-
- for_each_populated_zone(zone)
- zone_set_pageset_high_and_batch(zone);
-out:
- mutex_unlock(&pcp_batch_high_lock);
- return ret;
-}
-
#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
/*
* Returns the number of pages that arch has reserved but
--
2.26.2

2021-05-21 12:05:16

by Mel Gorman

Subject: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction

This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
similar to the old vm.percpu_pagelist_fraction except it only adjusts
pcp->high to potentially reduce zone->lock contention while preserving
allocation latency when PCP lists have to be refilled.

# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 649
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=8
# grep -E "high:|batch" /proc/zoneinfo | tail -2
high: 35071
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=64
high: 4383
batch: 63

# sysctl vm.percpu_pagelist_high_fraction=0
high: 649
batch: 63
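
Note how pcp->high scales with the fraction while pcp->batch stays at 63:
moving the fraction from 8 to 64 reduces pcp->high by roughly a factor of
8 (35071 -> 4383), consistent with the managed_pages / fraction calculation,
and writing 0 restores the default based on the zone's low watermark.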

Signed-off-by: Mel Gorman <[email protected]>
---
Documentation/admin-guide/sysctl/vm.rst | 20 +++++++++
include/linux/mmzone.h | 3 ++
kernel/sysctl.c | 8 ++++
mm/page_alloc.c | 56 +++++++++++++++++++++++--
4 files changed, 83 insertions(+), 4 deletions(-)

diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst
index 2fcafccb53a8..415f2aebf59b 100644
--- a/Documentation/admin-guide/sysctl/vm.rst
+++ b/Documentation/admin-guide/sysctl/vm.rst
@@ -64,6 +64,7 @@ files can be found in mm/swap.c.
- overcommit_ratio
- page-cluster
- panic_on_oom
+- percpu_pagelist_high_fraction
- stat_interval
- stat_refresh
- numa_stat
@@ -789,6 +790,25 @@ panic_on_oom=2+kdump gives you very strong tool to investigate
why oom happens. You can get snapshot.


+percpu_pagelist_high_fraction
+=============================
+
+This is the fraction of pages at most (high mark pcp->high) in each zone that
+are allocated for each per cpu page list. The min value for this is 8. It
+means that we do not allow more than 1/8th of pages in each zone to be
+allocated in any single per_cpu_pagelist. This entry only changes the value
+of hot per cpu pagelists. User can specify a number like 100 to allocate
+1/100th of each zone to each per cpu page list.
+
+The batch value of each per cpu pagelist remains the same regardless of the
+value of the high fraction so allocation latencies are unaffected.
+
+The initial value is zero. Kernel uses this value to set the high pcp->high
+mark based on the low watermark for the zone and the number of local
+online CPUs. If the user writes '0' to this sysctl, it will revert to
+this default behavior.
+
+
stat_interval
=============

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a0606239a167..e20d98c62beb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1040,12 +1040,15 @@ int watermark_scale_factor_sysctl_handler(struct ctl_table *, int, void *,
extern int sysctl_lowmem_reserve_ratio[MAX_NR_ZONES];
int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *, int, void *,
size_t *, loff_t *);
+int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *, int,
+ void *, size_t *, loff_t *);
int sysctl_min_unmapped_ratio_sysctl_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
int sysctl_min_slab_ratio_sysctl_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
int numa_zonelist_order_handler(struct ctl_table *, int,
void *, size_t *, loff_t *);
+extern int percpu_pagelist_high_fraction;
extern char numa_zonelist_order[];
#define NUMA_ZONELIST_ORDER_LEN 16

diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 4e5ac50a1af0..9eb9d1f987d9 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -2889,6 +2889,14 @@ static struct ctl_table vm_table[] = {
.extra1 = SYSCTL_ONE,
.extra2 = &one_thousand,
},
+ {
+ .procname = "percpu_pagelist_high_fraction",
+ .data = &percpu_pagelist_high_fraction,
+ .maxlen = sizeof(percpu_pagelist_high_fraction),
+ .mode = 0644,
+ .proc_handler = percpu_pagelist_high_fraction_sysctl_handler,
+ .extra1 = SYSCTL_ZERO,
+ },
{
.procname = "page_lock_unfairness",
.data = &sysctl_page_lock_unfairness,
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d8f8044781c4..08f9e5027ed4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -120,6 +120,7 @@ typedef int __bitwise fpi_t;

/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
+#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)

struct pagesets {
local_lock_t lock;
@@ -181,6 +182,7 @@ EXPORT_SYMBOL(_totalram_pages);
unsigned long totalreserve_pages __read_mostly;
unsigned long totalcma_pages __read_mostly;

+int percpu_pagelist_high_fraction;
gfp_t gfp_allowed_mask __read_mostly = GFP_BOOT_MASK;
DEFINE_STATIC_KEY_MAYBE(CONFIG_INIT_ON_ALLOC_DEFAULT_ON, init_on_alloc);
EXPORT_SYMBOL(init_on_alloc);
@@ -6670,7 +6672,8 @@ static int zone_batchsize(struct zone *zone)
#endif
}

-static int zone_highsize(struct zone *zone, int cpu_online)
+static int
+zone_highsize(struct zone *zone, unsigned long total_pages, int cpu_online)
{
#ifdef CONFIG_MMU
int high;
@@ -6683,7 +6686,7 @@ static int zone_highsize(struct zone *zone, int cpu_online)
* not be online yet.
*/
nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
- high = low_wmark_pages(zone) / nr_local_cpus;
+ high = total_pages / nr_local_cpus;

return high;
#else
@@ -6749,14 +6752,21 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h

/*
* Calculate and set new high and batch values for all per-cpu pagesets of a
- * zone based on the zone's size.
+ * zone based on the zone's size and the percpu_pagelist_high_fraction sysctl.
*/
static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
int new_high, new_batch;

+ if (!percpu_pagelist_high_fraction) {
+ new_high = zone_highsize(zone, low_wmark_pages(zone), cpu_online);
+ } else {
+ new_high = zone_highsize(zone,
+ zone_managed_pages(zone) / percpu_pagelist_high_fraction,
+ cpu_online);
+ }
+
new_batch = max(1, zone_batchsize(zone));
- new_high = zone_highsize(zone, cpu_online);

if (zone->pageset_high == new_high &&
zone->pageset_batch == new_batch)
@@ -8443,6 +8453,44 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
return 0;
}

+/*
+ * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each
+ * cpu. It is the fraction of total pages in each zone that a hot per cpu
+ * pagelist can have before it gets flushed back to buddy allocator.
+ */
+int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
+ int write, void *buffer, size_t *length, loff_t *ppos)
+{
+ struct zone *zone;
+ int old_percpu_pagelist_high_fraction;
+ int ret;
+
+ mutex_lock(&pcp_batch_high_lock);
+ old_percpu_pagelist_high_fraction = percpu_pagelist_high_fraction;
+
+ ret = proc_dointvec_minmax(table, write, buffer, length, ppos);
+ if (!write || ret < 0)
+ goto out;
+
+ /* Sanity checking to avoid pcp imbalance */
+ if (percpu_pagelist_high_fraction &&
+ percpu_pagelist_high_fraction < MIN_PERCPU_PAGELIST_HIGH_FRACTION) {
+ percpu_pagelist_high_fraction = old_percpu_pagelist_high_fraction;
+ ret = -EINVAL;
+ goto out;
+ }
+
+ /* No change? */
+ if (percpu_pagelist_high_fraction == old_percpu_pagelist_high_fraction)
+ goto out;
+
+ for_each_populated_zone(zone)
+ zone_set_pageset_high_and_batch(zone, 0);
+out:
+ mutex_unlock(&pcp_batch_high_lock);
+ return ret;
+}
+
#ifndef __HAVE_ARCH_RESERVED_KERNEL_PAGES
/*
* Returns the number of pages that arch has reserved but
--
2.26.2

2021-05-21 20:13:20

by Mel Gorman

Subject: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

The PCP high watermark is based on the number of online CPUs so the
watermarks must be adjusted during CPU hotplug. At the time of
hot-remove, the number of online CPUs is already adjusted but during
hot-add, a delta needs to be applied to update PCP to the correct
value. After this patch is applied, the high watermarks are adjusted
correctly.

# grep high: /proc/zoneinfo | tail -1
high: 649
# echo 0 > /sys/devices/system/cpu/cpu4/online
# grep high: /proc/zoneinfo | tail -1
high: 664
# echo 1 > /sys/devices/system/cpu/cpu4/online
# grep high: /proc/zoneinfo | tail -1
high: 649
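
The jump from 649 to 664 when a CPU is offlined is simply the zone's low
watermark being split across one fewer online local CPU; onlining the CPU
again restores the original division.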

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/cpuhotplug.h | 2 +-
mm/internal.h | 2 +-
mm/memory_hotplug.c | 4 ++--
mm/page_alloc.c | 35 +++++++++++++++++++++++++----------
4 files changed, 29 insertions(+), 14 deletions(-)

diff --git a/include/linux/cpuhotplug.h b/include/linux/cpuhotplug.h
index 4a62b3980642..47e13582d9fc 100644
--- a/include/linux/cpuhotplug.h
+++ b/include/linux/cpuhotplug.h
@@ -54,7 +54,7 @@ enum cpuhp_state {
CPUHP_MM_MEMCQ_DEAD,
CPUHP_PERCPU_CNT_DEAD,
CPUHP_RADIX_DEAD,
- CPUHP_PAGE_ALLOC_DEAD,
+ CPUHP_PAGE_ALLOC,
CPUHP_NET_DEV_DEAD,
CPUHP_PCI_XGENE_DEAD,
CPUHP_IOMMU_IOVA_DEAD,
diff --git a/mm/internal.h b/mm/internal.h
index 54bd0dc2c23c..651250e59ef5 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -221,7 +221,7 @@ extern int user_min_free_kbytes;
extern void free_unref_page(struct page *page);
extern void free_unref_page_list(struct list_head *list);

-extern void zone_pcp_update(struct zone *zone);
+extern void zone_pcp_update(struct zone *zone, int cpu_online);
extern void zone_pcp_reset(struct zone *zone);
extern void zone_pcp_disable(struct zone *zone);
extern void zone_pcp_enable(struct zone *zone);
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 70620d0dd923..bebb3cead810 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -961,7 +961,7 @@ int __ref online_pages(unsigned long pfn, unsigned long nr_pages, struct zone *z
node_states_set_node(nid, &arg);
if (need_zonelists_rebuild)
build_all_zonelists(NULL);
- zone_pcp_update(zone);
+ zone_pcp_update(zone, 0);

/* Basic onlining is complete, allow allocation of onlined pages. */
undo_isolate_page_range(pfn, pfn + nr_pages, MIGRATE_MOVABLE);
@@ -1835,7 +1835,7 @@ int __ref offline_pages(unsigned long start_pfn, unsigned long nr_pages)
zone_pcp_reset(zone);
build_all_zonelists(NULL);
} else
- zone_pcp_update(zone);
+ zone_pcp_update(zone, 0);

node_states_clear_node(node, &arg);
if (arg.status_change_nid >= 0) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bf5cdc466e6c..2761b03b3a44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
#endif
}

-static int zone_highsize(struct zone *zone)
+static int zone_highsize(struct zone *zone, int cpu_online)
{
#ifdef CONFIG_MMU
int high;
@@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone)
* CPUs local to a zone. Note that early in boot that CPUs may
* not be online yet.
*/
- nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
+ nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
high = low_wmark_pages(zone) / nr_local_cpus;

return high;
@@ -6708,12 +6708,12 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
* Calculate and set new high and batch values for all per-cpu pagesets of a
* zone based on the zone's size.
*/
-static void zone_set_pageset_high_and_batch(struct zone *zone)
+static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
int new_high, new_batch;

new_batch = max(1, zone_batchsize(zone));
- new_high = zone_highsize(zone);
+ new_high = zone_highsize(zone, cpu_online);

if (zone->pageset_high == new_high &&
zone->pageset_batch == new_batch)
@@ -6743,7 +6743,7 @@ void __meminit setup_zone_pageset(struct zone *zone)
per_cpu_pages_init(pcp, pzstats);
}

- zone_set_pageset_high_and_batch(zone);
+ zone_set_pageset_high_and_batch(zone, 0);
}

/*
@@ -8001,6 +8001,7 @@ void __init set_dma_reserve(unsigned long new_dma_reserve)

static int page_alloc_cpu_dead(unsigned int cpu)
{
+ struct zone *zone;

lru_add_drain_cpu(cpu);
drain_pages(cpu);
@@ -8021,6 +8022,19 @@ static int page_alloc_cpu_dead(unsigned int cpu)
* race with what we are doing.
*/
cpu_vm_stats_fold(cpu);
+
+ for_each_populated_zone(zone)
+ zone_pcp_update(zone, 0);
+
+ return 0;
+}
+
+static int page_alloc_cpu_online(unsigned int cpu)
+{
+ struct zone *zone;
+
+ for_each_populated_zone(zone)
+ zone_pcp_update(zone, 1);
return 0;
}

@@ -8046,8 +8060,9 @@ void __init page_alloc_init(void)
hashdist = 0;
#endif

- ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC_DEAD,
- "mm/page_alloc:dead", NULL,
+ ret = cpuhp_setup_state_nocalls(CPUHP_PAGE_ALLOC,
+ "mm/page_alloc:pcp",
+ page_alloc_cpu_online,
page_alloc_cpu_dead);
WARN_ON(ret < 0);
}
@@ -8185,7 +8200,7 @@ static void __setup_per_zone_wmarks(void)
* The watermark size have changed so update the pcpu batch
* and high limits or the limits may be inappropriate.
*/
- zone_set_pageset_high_and_batch(zone);
+ zone_set_pageset_high_and_batch(zone, 0);

spin_unlock_irqrestore(&zone->lock, flags);
}
@@ -9007,10 +9022,10 @@ EXPORT_SYMBOL(free_contig_range);
* The zone indicated has a new number of managed_pages; batch sizes and percpu
* page high values need to be recalculated.
*/
-void __meminit zone_pcp_update(struct zone *zone)
+void zone_pcp_update(struct zone *zone, int cpu_online)
{
mutex_lock(&pcp_batch_high_lock);
- zone_set_pageset_high_and_batch(zone);
+ zone_set_pageset_high_and_batch(zone, cpu_online);
mutex_unlock(&pcp_batch_high_lock);
}

--
2.26.2

2021-05-21 20:13:21

by Mel Gorman

Subject: [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed

When a task is freeing a large number of order-0 pages, it may acquire
the zone->lock multiple times, freeing pages in batches. This may
unnecessarily contend on the zone lock when freeing a very large number
of pages. This patch adapts the size of the batch based on the recent
freeing pattern, scaling the batch size for subsequent frees.

As the machines I used to test this are not large enough to illustrate
a problem, a debugging patch shows patterns like the following (slightly
edited for clarity)

Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378

With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
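
With high=814 and batch=63 in this trace, each successive flush doubles the
freed batch (126, 252, 504) until it is clamped to high - batch = 751 so
that at least one batch worth of pages remains on the pcp list.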

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 30 ++++++++++++++++++++++++++++--
2 files changed, 30 insertions(+), 3 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b449151745d7..92182e0299b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -343,8 +343,9 @@ struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
+ short free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
- int expire; /* When 0, remote pagesets are drained */
+ short expire; /* When 0, remote pagesets are drained */
#endif

/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 2761b03b3a44..c3da6401f138 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3267,18 +3267,42 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
return true;
}

+static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
+{
+ int min_nr_free, max_nr_free;
+
+ /* Check for PCP disabled or boot pageset */
+ if (unlikely(high < batch))
+ return 1;
+
+ min_nr_free = batch;
+ max_nr_free = high - batch;
+
+ batch <<= pcp->free_factor;
+ if (batch < max_nr_free)
+ pcp->free_factor++;
+ batch = clamp(batch, min_nr_free, max_nr_free);
+
+ return batch;
+}
+
static void free_unref_page_commit(struct page *page, unsigned long pfn,
int migratetype)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
+ int high;

__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
- if (pcp->count >= READ_ONCE(pcp->high))
- free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
+ high = READ_ONCE(pcp->high);
+ if (pcp->count >= high) {
+ int batch = READ_ONCE(pcp->batch);
+
+ free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
+ }
}

/*
@@ -3531,6 +3555,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,

local_lock_irqsave(&pagesets.lock, flags);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ pcp->free_factor >>= 1;
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
local_unlock_irqrestore(&pagesets.lock, flags);
@@ -6690,6 +6715,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
*/
pcp->high = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
+ pcp->free_factor = 0;
}

static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
--
2.26.2

2021-05-21 20:13:45

by Mel Gorman

Subject: [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active

When kswapd is active, direct reclaim is potentially active too. In
either case, it is possible that a zone would be balanced if pages were
not trapped on PCP lists. Instead of draining remote pages, simply limit
the size of the PCP lists while kswapd is active.
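
With the defaults from earlier in the series, that means a pcp list is
capped at pcp->batch << 2 pages (63 << 2 = 252) while ZONE_RECLAIM_ACTIVE
is set on the zone, instead of the usual pcp->high (649 in the earlier KVM
example).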

Signed-off-by: Mel Gorman <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 19 ++++++++++++++++++-
mm/vmscan.c | 35 +++++++++++++++++++++++++++++++++++
3 files changed, 54 insertions(+), 1 deletion(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 92182e0299b2..a0606239a167 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -647,6 +647,7 @@ enum zone_flags {
ZONE_BOOSTED_WATERMARK, /* zone recently boosted watermarks.
* Cleared when kswapd is woken.
*/
+ ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
};

static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c3da6401f138..d8f8044781c4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3286,6 +3286,23 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
return batch;
}

+static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
+{
+ int high = READ_ONCE(pcp->high);
+
+ if (unlikely(!high))
+ return 0;
+
+ if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+ return high;
+
+ /*
+ * If reclaim is active, limit the number of pages that can be
+ * stored on pcp lists
+ */
+ return READ_ONCE(pcp->batch) << 2;
+}
+
static void free_unref_page_commit(struct page *page, unsigned long pfn,
int migratetype)
{
@@ -3297,7 +3314,7 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn,
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
- high = READ_ONCE(pcp->high);
+ high = nr_pcp_high(pcp, zone);
if (pcp->count >= high) {
int batch = READ_ONCE(pcp->batch);

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 5199b9696bab..c3c2100a80b8 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -3722,6 +3722,38 @@ static bool kswapd_shrink_node(pg_data_t *pgdat,
return sc->nr_scanned >= sc->nr_to_reclaim;
}

+/* Page allocator PCP high watermark is lowered if reclaim is active. */
+static inline void
+update_reclaim_active(pg_data_t *pgdat, int highest_zoneidx, bool active)
+{
+ int i;
+ struct zone *zone;
+
+ for (i = 0; i <= highest_zoneidx; i++) {
+ zone = pgdat->node_zones + i;
+
+ if (!managed_zone(zone))
+ continue;
+
+ if (active)
+ set_bit(ZONE_RECLAIM_ACTIVE, &zone->flags);
+ else
+ clear_bit(ZONE_RECLAIM_ACTIVE, &zone->flags);
+ }
+}
+
+static inline void
+set_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
+{
+ update_reclaim_active(pgdat, highest_zoneidx, true);
+}
+
+static inline void
+clear_reclaim_active(pg_data_t *pgdat, int highest_zoneidx)
+{
+ update_reclaim_active(pgdat, highest_zoneidx, false);
+}
+
/*
* For kswapd, balance_pgdat() will reclaim pages across a node from zones
* that are eligible for use by the caller until at least one zone is
@@ -3774,6 +3806,7 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
boosted = nr_boost_reclaim;

restart:
+ set_reclaim_active(pgdat, highest_zoneidx);
sc.priority = DEF_PRIORITY;
do {
unsigned long nr_reclaimed = sc.nr_reclaimed;
@@ -3907,6 +3940,8 @@ static int balance_pgdat(pg_data_t *pgdat, int order, int highest_zoneidx)
pgdat->kswapd_failures++;

out:
+ clear_reclaim_active(pgdat, highest_zoneidx);
+
/* If reclaim was boosted, account for the reclaim done in this pass */
if (boosted) {
unsigned long flags;
--
2.26.2

2021-05-21 21:05:53

by Dave Hansen

Subject: Re: [PATCH 1/6] mm/page_alloc: Delete vm.percpu_pagelist_fraction

On 5/21/21 3:28 AM, Mel Gorman wrote:
> The vm.percpu_pagelist_fraction is used to increase the batch and high
> limits for the per-cpu page allocator (PCP). The intent behind the sysctl
> is to reduce zone lock acquisition when allocating/freeing pages but it has
> a problem. While it can decrease contention, it can also increase latency
> on the allocation side due to unreasonably large batch sizes. This leads
> to games where an administrator adjusts percpu_pagelist_fraction on the
> fly to work around contention and allocation latency problems.
>
> This series aims to alleviate the problems with zone lock contention while
> avoiding the allocation-side latency problems. For the purposes of review,
> it's easier to remove this sysctl now and reintroduce a similar sysctl
> later in the series that deals only with pcp->high.
>
> Signed-off-by: Mel Gorman <[email protected]>

I despise working with percpu_pagelist_fraction. I usually just end up
poking random numbers into it and then checking what the fallout in
/proc/zoneinfo looks like.

Good riddance.

Acked-by: Dave Hansen <[email protected]>

2021-05-21 21:54:09

by Dave Hansen

Subject: Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch

On 5/21/21 3:28 AM, Mel Gorman wrote:
> Note that in this patch the pcp->high values are adjusted after memory
> hotplug events, min_free_kbytes adjustments and watermark scale factor
> adjustments but not CPU hotplug events.

Not that it was a long wait to figure it out, but I'd probably say:

"CPU hotplug events are handled later in the series".

instead of just saying they're not handled.

> Before grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 378
> batch: 63
>
> After grep -E "high:|batch" /proc/zoneinfo | tail -2
> high: 649
> batch: 63

You noted the relationship between pcp->high and zone lock contention.
Larger ->high values mean less contention. It's probably also worth
noting the trend of having more logical CPUs per NUMA node.

I have the feeling when this was put in place it wasn't uncommon to have
somewhere between 1 and 8 CPUs in a node pounding on a zone.

Today, having ~60 is common. I've occasionally resorted to recommending
that folks enable hardware features like Sub-NUMA-Clustering [1] since
it increases the number of zones and decreases the number of CPUs
pounding on each zone lock.

1.
https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html

> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index a48f305f0381..bf5cdc466e6c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
> /* Block until all are initialised */
> wait_for_completion(&pgdat_init_all_done_comp);
>
> - /*
> - * The number of managed pages has changed due to the initialisation
> - * so the pcpu batch and high limits needs to be updated or the limits
> - * will be artificially small.
> - */
> - for_each_populated_zone(zone)
> - zone_pcp_update(zone);
> -
> /*
> * We initialized the rest of the deferred pages. Permanently disable
> * on-demand struct page initialization.
> @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
> int batch;
>
> /*
> - * The per-cpu-pages pools are set to around 1000th of the
> - * size of the zone.
> + * The number of pages to batch allocate is either 0.1%

Probably worth making that "~0.1%" just in case someone goes looking for
the /1000 and can't find it.

> + * of the zone or 1MB, whichever is smaller. The batch
> + * size is striking a balance between allocation latency
> + * and zone lock contention.
> */
> - batch = zone_managed_pages(zone) / 1024;
> - /* But no more than a meg. */
> - if (batch * PAGE_SIZE > 1024 * 1024)
> - batch = (1024 * 1024) / PAGE_SIZE;
> + batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
> batch /= 4; /* We effectively *= 4 below */
> if (batch < 1)
> batch = 1;
> @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
> #endif
> }
>
> +static int zone_highsize(struct zone *zone)
> +{
> +#ifdef CONFIG_MMU
> + int high;
> + int nr_local_cpus;
> +
> + /*
> + * The high value of the pcp is based on the zone low watermark
> + * when reclaim is potentially active spread across the online
> + * CPUs local to a zone. Note that early in boot that CPUs may
> + * not be online yet.
> + */

FWIW, I like the way the changelog talked about this a bit better, with
the goal of avoiding background reclaim even in the face of a bunch of
full pcp's.

> + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> + high = low_wmark_pages(zone) / nr_local_cpus;

I'm a little concerned that this might get out of hand on really big
nodes with no CPUs. For persistent memory (which we *do* toss into the
page allocator for volatile use), we can have multi-terabyte zones with
no CPUs in the node.

Also, while the CPUs which are on the node are the ones *most* likely to
be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
So, the amount of memory which can actually be sequestered is
num_online_cpus()*high. Right?

*That* might really get out of hand if we have nr_local_cpus=1.

We might want some overall cap on 'high', or even to scale it
differently for the zone-local cpus' pcps versus remote.
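
(Putting hypothetical numbers on that: with high=649 4KiB pages and, say,
256 possible CPUs, that is roughly 256 * 649 * 4KiB ~= 650MiB that could
transiently sit on pcp lists for a single zone.)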

2021-05-21 22:17:28

by Dave Hansen

Subject: Re: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

On 5/21/21 3:28 AM, Mel Gorman wrote:
> The PCP high watermark is based on the number of online CPUs so the
> watermarks must be adjusted during CPU hotplug. At the time of
> hot-remove, the number of online CPUs is already adjusted but during
> hot-add, a delta needs to be applied to update PCP to the correct
> value. After this patch is applied, the high watermarks are adjusted
> correctly.
>
> # grep high: /proc/zoneinfo | tail -1
> high: 649
> # echo 0 > /sys/devices/system/cpu/cpu4/online
> # grep high: /proc/zoneinfo | tail -1
> high: 664
> # echo 1 > /sys/devices/system/cpu/cpu4/online
> # grep high: /proc/zoneinfo | tail -1
> high: 649

This is actually a comment more about the previous patch, but it doesn't
really become apparent until the example above.

In your example, you mentioned increased exit() performance by using
"vm.percpu_pagelist_fraction to increase the pcp->high value". That's
presumably because of the increased batching effects and fewer lock
acquisitions.

But, logically, doesn't that mean that, the more CPUs you have in a
node, the *higher* you want pcp->high to be? If we took this to the
extreme and had an absurd number of CPUs in a node, we could end up with
a too-small pcp->high value.

Also, do you worry at all about a zone with a low min_free_kbytes seeing
increased zone lock contention?

...
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index bf5cdc466e6c..2761b03b3a44 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
> #endif
> }
>
> -static int zone_highsize(struct zone *zone)
> +static int zone_highsize(struct zone *zone, int cpu_online)
> {
> #ifdef CONFIG_MMU
> int high;
> @@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone)
> * CPUs local to a zone. Note that early in boot that CPUs may
> * not be online yet.
> */
> - nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> high = low_wmark_pages(zone) / nr_local_cpus;

Is this "+ cpu_online" bias because the CPU isn't in cpumask_of_node()
when the CPU hotplug callback occurs? If so, it might be nice to mention.

2021-05-21 22:38:02

by Dave Hansen

Subject: Re: [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed

...
> +static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
> +{
> + int min_nr_free, max_nr_free;
> +
> + /* Check for PCP disabled or boot pageset */
> + if (unlikely(high < batch))
> + return 1;
> +
> + min_nr_free = batch;
> + max_nr_free = high - batch;

I puzzled over this for a minute. I *think* it means to say: "Leave at
least one batch worth of pages in the pcp at all times so that the next
allocation can still be satisfied from this pcp."

> + batch <<= pcp->free_factor;
> + if (batch < max_nr_free)
> + pcp->free_factor++;
> + batch = clamp(batch, min_nr_free, max_nr_free);
> +
> + return batch;
> +}
> +
> static void free_unref_page_commit(struct page *page, unsigned long pfn,
> int migratetype)
> {
> struct zone *zone = page_zone(page);
> struct per_cpu_pages *pcp;
> + int high;
>
> __count_vm_event(PGFREE);
> pcp = this_cpu_ptr(zone->per_cpu_pageset);
> list_add(&page->lru, &pcp->lists[migratetype]);
> pcp->count++;
> - if (pcp->count >= READ_ONCE(pcp->high))
> - free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
> + high = READ_ONCE(pcp->high);
> + if (pcp->count >= high) {
> + int batch = READ_ONCE(pcp->batch);
> +
> + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
> + }
> }
>
> /*
> @@ -3531,6 +3555,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
>
> local_lock_irqsave(&pagesets.lock, flags);
> pcp = this_cpu_ptr(zone->per_cpu_pageset);
> + pcp->free_factor >>= 1;
> list = &pcp->lists[migratetype];
> page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
> local_unlock_irqrestore(&pagesets.lock, flags);

A high-level description of the algorithm in the changelog would also be
nice. I *think* it's basically:

After hitting the high pcp mark, free one pcp->batch at a time. But, as
subsequent pcp free operations occur, keep doubling the size of the
freed batches. Cap them so that they always leave at least one
pcp->batch worth of pages. Scale the size back down by half whenever an
allocation that consumes a page from the pcp occurs.

While I'd appreciate another comment or two, I do think this is worth
doing, and the approach seems sound:

Acked-by: Dave Hansen <[email protected]>

2021-05-21 22:46:09

by Dave Hansen

Subject: Re: [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active

On 5/21/21 3:28 AM, Mel Gorman wrote:
> +static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> +{
> + int high = READ_ONCE(pcp->high);
> +
> + if (unlikely(!high))
> + return 0;
> +
> + if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> + return high;
> +
> + /*
> + * If reclaim is active, limit the number of pages that can be
> + * stored on pcp lists
> + */
> + return READ_ONCE(pcp->batch) << 2;
> +}

Should there be a sanity check on this? Let's say we had one of those
weirdo zones with tons of CPUs and a small low_wmark_pages(). Could we
have a case where:

pcp->high < pcp->batch<<2

and this effectively *raises* nr_pcp_high()?

It's not possible with the current pcp->high calculation, but does
anything prevent it now?

2021-05-21 22:58:48

by Dave Hansen

Subject: Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction

On 5/21/21 3:28 AM, Mel Gorman wrote:
> This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
> similar to the old vm.percpu_pagelist_fraction except it only adjusts
> pcp->high to potentially reduce zone->lock contention while preserving
> allocation latency when PCP lists have to be refilled.

Look at me... Five patches later and I already forgot what the old one
did and why it stinks. I wonder if you might do a wee bit of compare
and contrast. Something like:

The old vm.percpu_pagelist_fraction increased both the batch and
high limits for the per-cpu page allocator. Its worst feature
was that it led to absurdly large batch sizes that incurred
nasty worst-case allocation latency.

This new sysctl in comparison...

Anyway, the approach looks sound to me. The batch size isn't important
now, especially given the auto-scaling in patch 4.

Acked-by: Dave Hansen <[email protected]>

2021-05-24 08:35:11

by Mel Gorman

Subject: Re: [PATCH 2/6] mm/page_alloc: Disassociate the pcp->high from pcp->batch

On Fri, May 21, 2021 at 02:52:39PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > Note that in this patch the pcp->high values are adjusted after memory
> > hotplug events, min_free_kbytes adjustments and watermark scale factor
> > adjustments but not CPU hotplug events.
>
> Not that it was a long wait to figure it out, but I'd probably say:
>
> "CPU hotplug events are handled later in the series".
>
> instead of just saying they're not handled.
>
> > Before grep -E "high:|batch" /proc/zoneinfo | tail -2
> > high: 378
> > batch: 63
> >
> > After grep -E "high:|batch" /proc/zoneinfo | tail -2
> > high: 649
> > batch: 63
>
> You noted the relationship between pcp->high and zone lock contention.
> Larger ->high values mean less contention. It's probably also worth
> noting the trend of having more logical CPUs per NUMA node.
>

It's noted in the leader with "takes neither the zone size nor the number
of CPUs local to a zone into account". Your point is valid but I'm not
doing much about the number of CPUs sharing a lock.

> I have the feeling when this was put in place it wasn't uncommon to have
> somewhere between 1 and 8 CPUs in a node pounding on a zone.
>

True.

> Today, having ~60 is common. I've occasionally resorted to recommending
> that folks enable hardware features like Sub-NUMA-Clustering [1] since
> it increases the number of zones and decreases the number of CPUs
> pounding on each zone lock.
>

Enabling SNC to reduce contention is very unfortunate. It potentially
causes page age inversion issues as it's similar to specifying numa=fake=N

> 1.
> https://software.intel.com/content/www/us/en/develop/articles/intel-xeon-processor-scalable-family-technical-overview.html
>
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index a48f305f0381..bf5cdc466e6c 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -2163,14 +2163,6 @@ void __init page_alloc_init_late(void)
> > /* Block until all are initialised */
> > wait_for_completion(&pgdat_init_all_done_comp);
> >
> > - /*
> > - * The number of managed pages has changed due to the initialisation
> > - * so the pcpu batch and high limits needs to be updated or the limits
> > - * will be artificially small.
> > - */
> > - for_each_populated_zone(zone)
> > - zone_pcp_update(zone);
> > -
> > /*
> > * We initialized the rest of the deferred pages. Permanently disable
> > * on-demand struct page initialization.
> > @@ -6594,13 +6586,12 @@ static int zone_batchsize(struct zone *zone)
> > int batch;
> >
> > /*
> > - * The per-cpu-pages pools are set to around 1000th of the
> > - * size of the zone.
> > + * The number of pages to batch allocate is either 0.1%
>
> Probably worth making that "~0.1%" just in case someone goes looking for
> the /1000 and can't find it.
>

Done

> > + * of the zone or 1MB, whichever is smaller. The batch
> > + * size is striking a balance between allocation latency
> > + * and zone lock contention.
> > */
> > - batch = zone_managed_pages(zone) / 1024;
> > - /* But no more than a meg. */
> > - if (batch * PAGE_SIZE > 1024 * 1024)
> > - batch = (1024 * 1024) / PAGE_SIZE;
> > + batch = min(zone_managed_pages(zone) >> 10, (1024 * 1024) / PAGE_SIZE);
> > batch /= 4; /* We effectively *= 4 below */
> > if (batch < 1)
> > batch = 1;
> > @@ -6637,6 +6628,27 @@ static int zone_batchsize(struct zone *zone)
> > #endif
> > }
> >
> > +static int zone_highsize(struct zone *zone)
> > +{
> > +#ifdef CONFIG_MMU
> > + int high;
> > + int nr_local_cpus;
> > +
> > + /*
> > + * The high value of the pcp is based on the zone low watermark
> > + * when reclaim is potentially active spread across the online
> > + * CPUs local to a zone. Note that early in boot that CPUs may
> > + * not be online yet.
> > + */
>
> FWIW, I like the way the changelog talked about this a bit better, with
> the goal of avoiding background reclaim even in the face of a bunch of
> full pcp's.
>
> > + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));

Done.

> > + high = low_wmark_pages(zone) / nr_local_cpus;
>
> I'm a little concerned that this might get out of hand on really big
> nodes with no CPUs. For persistent memory (which we *do* toss into the
> page allocator for volatile use), we can have multi-terabyte zones with
> no CPUs in the node.
>

It should not get out of hand given that it's based on the low watermark,
at least for local CPUs.

> Also, while the CPUs which are on the node are the ones *most* likely to
> be hitting the ->high limit, we do *keep* a pcp for each possible CPU.
> So, the amount of memory which can actually be sequestered is
> num_online_cpus()*high. Right?
>

Potentially yes for short durations but remote CPUs are drained every
few seconds by refresh_cpu_vm_stats so it's a transient problem.

> *That* might really get out of hand if we have nr_local_cpus=1.
>
> We might want some overall cap on 'high', or even to scale it
> differently for the zone-local cpus' pcps versus remote.

I'm reluctant to prematurely set this because I don't have a test case
and machine where this has been demonstrated to be a problem but I would
not be opposed to a patch added on top which demonstrated a reasonable
case where too many pages are pinned on remote CPUs for too long. I would
imagine this involves a machine with large amounts of persistent memory
onlined as a memory-like device and using an interleave policy showing
that reclaim on the persistent memory node is triggered prematurely.

My initial thinking is that the pcp->high for remote CPUs should simply
be clamped to a static value such as pcp->batch << 1 but I would prefer
this was based on a real test. I expect the check for a CPU being local
to a zone would be done in __zone_set_pageset_high_and_batch.
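
A minimal sketch of what that clamp might look like, assuming the per-cpu
loop that __zone_set_pageset_high_and_batch already performs and treating
pcp->batch << 1 as the arbitrary remote limit (untested, illustrative only):

	/* Hypothetical: limit pcp->high on CPUs that are remote to the zone */
	for_each_possible_cpu(cpu) {
		struct per_cpu_pages *pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
		unsigned long limit = high;

		/* Remote CPUs rarely hit ->high but can still pin pages */
		if (cpu_to_node(cpu) != zone_to_nid(zone))
			limit = min(limit, (unsigned long)batch << 1);

		WRITE_ONCE(pcp->high, limit);
		WRITE_ONCE(pcp->batch, batch);
	}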

--
Mel Gorman
SUSE Labs

2021-05-24 09:08:44

by Mel Gorman

Subject: Re: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

On Fri, May 21, 2021 at 03:13:35PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > The PCP high watermark is based on the number of online CPUs so the
> > watermarks must be adjusted during CPU hotplug. At the time of
> > hot-remove, the number of online CPUs is already adjusted but during
> > hot-add, a delta needs to be applied to update PCP to the correct
> > value. After this patch is applied, the high watermarks are adjusted
> > correctly.
> >
> > # grep high: /proc/zoneinfo | tail -1
> > high: 649
> > # echo 0 > /sys/devices/system/cpu/cpu4/online
> > # grep high: /proc/zoneinfo | tail -1
> > high: 664
> > # echo 1 > /sys/devices/system/cpu/cpu4/online
> > # grep high: /proc/zoneinfo | tail -1
> > high: 649
>
> This is actually a comment more about the previous patch, but it doesn't
> really become apparent until the example above.
>
> In your example, you mentioned increased exit() performance by using
> "vm.percpu_pagelist_fraction to increase the pcp->high value". That's
> presumably because of the increased batching effects and fewer lock
> acquisitions.
>

Yes

> But, logically, doesn't that mean that, the more CPUs you have in a
> node, the *higher* you want pcp->high to be? If we took this to the
> extreme and had an absurd number of CPUs in a node, we could end up with
> a too-small pcp->high value.
>

I see your point but I don't think increasing pcp->high for larger
numbers of CPUs is the right answer because then reclaim can be
triggered simply because too many PCPs have pages.

To address your point requires much deeper surgery. zone->lock would have
to be split to being a metadata lock and a free page lock. Then the free
areas would have to be split based on some factor -- number of CPUs or
memory size. That gets complex because then the page allocator loop needs
to walk multiple arenas as well as multiple zones as well as consider which
arena should be examined first. Fragmentation should also be considered
because a decision would need to be made on whether a pageblock should
fragment or whether other local arenas should be examined. Anything that
walks PFNs such as compaction would also need to be aware of arenas and
their associated locks. Finally every acquisition of zone->lock would
have to be audited to determine exactly what it is protecting. Even with
all that, it still makes sense to disassociate pcp->high from pcp->batch
as this series does.

There is value to doing something like this but it's beyond what this
series is trying to do and doing the work without introducing regressions
would be very difficult.

> Also, do you worry at all about a zone with a low min_free_kbytes seeing
> increased zone lock contention?
>
> ...
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index bf5cdc466e6c..2761b03b3a44 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -6628,7 +6628,7 @@ static int zone_batchsize(struct zone *zone)
> > #endif
> > }
> >
> > -static int zone_highsize(struct zone *zone)
> > +static int zone_highsize(struct zone *zone, int cpu_online)
> > {
> > #ifdef CONFIG_MMU
> > int high;
> > @@ -6640,7 +6640,7 @@ static int zone_highsize(struct zone *zone)
> > * CPUs local to a zone. Note that early in boot that CPUs may
> > * not be online yet.
> > */
> > - nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
> > + nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> > high = low_wmark_pages(zone) / nr_local_cpus;
>
> Is this "+ cpu_online" bias because the CPU isn't in cpumask_of_node()
> when the CPU hotplug callback occurs? If so, it might be nice to mention.

Fixed.

--
Mel Gorman
SUSE Labs

2021-05-24 09:13:12

by Mel Gorman

Subject: Re: [PATCH 4/6] mm/page_alloc: Scale the number of pages that are batch freed

On Fri, May 21, 2021 at 03:36:05PM -0700, Dave Hansen wrote:
> ...
> > +static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
> > +{
> > + int min_nr_free, max_nr_free;
> > +
> > + /* Check for PCP disabled or boot pageset */
> > + if (unlikely(high < batch))
> > + return 1;
> > +
> > + min_nr_free = batch;
> > + max_nr_free = high - batch;
>
> I puzzled over this for a minute. I *think* it means to say: "Leave at
> least one batch worth of pages in the pcp at all times so that the next
> allocation can still be satisfied from this pcp."
>

Yes, I added a comment.

> > + batch <<= pcp->free_factor;
> > + if (batch < max_nr_free)
> > + pcp->free_factor++;
> > + batch = clamp(batch, min_nr_free, max_nr_free);
> > +
> > + return batch;
> > +}
> > +
> > static void free_unref_page_commit(struct page *page, unsigned long pfn,
> > int migratetype)
> > {
> > struct zone *zone = page_zone(page);
> > struct per_cpu_pages *pcp;
> > + int high;
> >
> > __count_vm_event(PGFREE);
> > pcp = this_cpu_ptr(zone->per_cpu_pageset);
> > list_add(&page->lru, &pcp->lists[migratetype]);
> > pcp->count++;
> > - if (pcp->count >= READ_ONCE(pcp->high))
> > - free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
> > + high = READ_ONCE(pcp->high);
> > + if (pcp->count >= high) {
> > + int batch = READ_ONCE(pcp->batch);
> > +
> > + free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
> > + }
> > }
> >
> > /*
> > @@ -3531,6 +3555,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
> >
> > local_lock_irqsave(&pagesets.lock, flags);
> > pcp = this_cpu_ptr(zone->per_cpu_pageset);
> > + pcp->free_factor >>= 1;
> > list = &pcp->lists[migratetype];
> > page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
> > local_unlock_irqrestore(&pagesets.lock, flags);
>
> A high-level description of the algorithm in the changelog would also be
> nice. I *think* it's basically:
>
> After hitting the high pcp mark, free one pcp->batch at a time. But, as
> subsequent pcp free operations occur, keep doubling the size of the
> freed batches. Cap them so that they always leave at least one
> pcp->batch worth of pages. Scale the size back down by half whenever an
> allocation that consumes a page from the pcp occurs.
>
> While I'd appreciate another comment or two, I do think this is worth
> doing, and the approach seems sound:
>
> Acked-by: Dave Hansen <[email protected]>

Thanks, I added a few additional comments.
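
To make the scaling concrete, below is a small userspace model of
nr_pcp_free() using the batch=63/high=649 values from the zoneinfo
output earlier in the thread. It only illustrates the doubling and
clamping arithmetic and is not the kernel code itself; the helper names
are purely for the illustration.

#include <stdio.h>

/* Minimal stand-in for the kernel's clamp() */
static int clamp_int(int val, int lo, int hi)
{
	return val < lo ? lo : (val > hi ? hi : val);
}

/* Model of nr_pcp_free(): how many pages one flush frees */
static int nr_pcp_free_model(int *free_factor, int high, int batch)
{
	int min_nr_free, max_nr_free;

	/* PCP disabled or boot pageset */
	if (high < batch)
		return 1;

	/* Free at least one batch but always leave one batch behind */
	min_nr_free = batch;
	max_nr_free = high - batch;

	/* Keep doubling while frees arrive without allocations */
	batch <<= *free_factor;
	if (batch < max_nr_free)
		(*free_factor)++;

	return clamp_int(batch, min_nr_free, max_nr_free);
}

int main(void)
{
	int free_factor = 0;
	int high = 649, batch = 63;
	int i;

	/* Successive frees: 63, 126, 252, 504, 586, 586 */
	for (i = 0; i < 6; i++)
		printf("flush %d frees %d pages\n", i,
		       nr_pcp_free_model(&free_factor, high, batch));

	/* An allocation halves the factor, scaling back down */
	free_factor >>= 1;
	printf("after an allocation: %d pages\n",
	       nr_pcp_free_model(&free_factor, high, batch));

	return 0;
}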

--
Mel Gorman
SUSE Labs

2021-05-24 09:23:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 5/6] mm/page_alloc: Limit the number of pages on PCP lists when reclaim is active

On Fri, May 21, 2021 at 03:44:49PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > +static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone)
> > +{
> > + int high = READ_ONCE(pcp->high);
> > +
> > + if (unlikely(!high))
> > + return 0;
> > +
> > + if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> > + return high;
> > +
> > + /*
> > + * If reclaim is active, limit the number of pages that can be
> > + * stored on pcp lists
> > + */
> > + return READ_ONCE(pcp->batch) << 2;
> > +}
>
> Should there be a sanity check on this? Let's say we had one of those
> weirdo zones with tons of CPUs and a small low_wmark_pages(). Could we
> have a case where:
>
> pcp->high < pcp->batch<<2
>
> and this effectively *raises* nr_pcp_high()?
>
> It's not possible with the current pcp->high calculation, but does
> anything prevent it now?

I don't think it would happen as pcp->batch is reduced for small zones,
but a sanity check does not hurt, so I added one.
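
Something along these lines (sketch only; the exact form may differ in
the next version):

	/*
	 * If reclaim is active, limit the number of pages that can be
	 * stored on pcp lists
	 */
	return min(READ_ONCE(pcp->batch) << 2, high);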

--
Mel Gorman
SUSE Labs

2021-05-24 09:26:46

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 6/6] mm/page_alloc: Introduce vm.percpu_pagelist_high_fraction

On Fri, May 21, 2021 at 03:57:20PM -0700, Dave Hansen wrote:
> On 5/21/21 3:28 AM, Mel Gorman wrote:
> > This introduces a new sysctl vm.percpu_pagelist_high_fraction. It is
> > similar to the old vm.percpu_pagelist_fraction except it only adjusts
> > pcp->high to potentially reduce zone->lock contention while preserving
> > allocation latency when PCP lists have to be refilled.
>
> Look at me... Five patches later and I already forgot what the old one
> did and why it stinks. I wonder if you might do a wee bit of compare
> and contrast. Something like:
>
> The old vm.percpu_pagelist_fraction increased both the batch and
> high limits for the per-cpu page allocator. Its worst feature
> was that it led to absurdly large batch sizes that incurred
> nasty worst-case allocation latency.
>
> This new sysctl in comparison...
>
> Anyway, the approach looks sound to me. The batch size isn't important
> now, especially given the auto-scaling in patch 4.
>
> Acked-by: Dave Hansen <[email protected]>

Thanks, I updated the changelog and hopefully it is better.

--
Mel Gorman
SUSE Labs

2021-05-24 15:59:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

On 5/24/21 2:07 AM, Mel Gorman wrote:
> On Fri, May 21, 2021 at 03:13:35PM -0700, Dave Hansen wrote:
>> On 5/21/21 3:28 AM, Mel Gorman wrote:
>>> The PCP high watermark is based on the number of online CPUs so the
>>> watermarks must be adjusted during CPU hotplug. At the time of
>>> hot-remove, the number of online CPUs is already adjusted but during
>>> hot-add, a delta needs to be applied to update PCP to the correct
>>> value. After this patch is applied, the high watermarks are adjusted
>>> correctly.
>>>
>>> # grep high: /proc/zoneinfo | tail -1
>>> high: 649
>>> # echo 0 > /sys/devices/system/cpu/cpu4/online
>>> # grep high: /proc/zoneinfo | tail -1
>>> high: 664
>>> # echo 1 > /sys/devices/system/cpu/cpu4/online
>>> # grep high: /proc/zoneinfo | tail -1
>>> high: 649
>> This is actually a comment more about the previous patch, but it doesn't
>> really become apparent until the example above.
>>
>> In your example, you mentioned increased exit() performance by using
>> "vm.percpu_pagelist_fraction to increase the pcp->high value". That's
>> presumably because of the increased batching effects and fewer lock
>> acquisitions.
>>
> Yes
>
>> But, logically, doesn't that mean that, the more CPUs you have in a
>> node, the *higher* you want pcp->high to be? If we took this to the
>> extreme and had an absurd number of CPUs in a node, we could end up with
>> a too-small pcp->high value.
>>
> I see your point but I don't think increasing pcp->high for larger
> numbers of CPUs is the right answer because then reclaim can be
> triggered simply because too many PCPs have pages.
>
> To address your point requires much deeper surgery.
...
> There is value to doing something like this but it's beyond what this
> series is trying to do and doing the work without introducing regressions
> would be very difficult.

Agreed, such a solution is outside of the scope of what this set is
trying to do.

It would be nice to touch on this counter-intuitive property in the
changelog, and *maybe* add a WARN_ON_ONCE() if we hit an edge case.
Maybe WARN_ON_ONCE() if pcp->high gets below pcp->batch*SOMETHING.

2021-05-24 16:10:28

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 3/6] mm/page_alloc: Adjust pcp->high after CPU hotplug events

On Mon, May 24, 2021 at 08:52:02AM -0700, Dave Hansen wrote:
> > To address your point requires much deeper surgery.
> ...
> > There is value to doing something like this but it's beyond what this
> > series is trying to do and doing the work without introducing regressions
> > would be very difficult.
>
> Agreed, such a solution is outside of the scope of what this set is
> trying to do.
>
> It would be nice to touch on this counter-intuitive property in the
> changelog, and *maybe* add a WARN_ON_ONCE() if we hit an edge case.
> Maybe WARN_ON_ONCE() if pcp->high gets below pcp->batch*SOMETHING.
>

I think it's reasonable to ensure pcp->batch is never above pcp->high,
so I have this in zone_highsize now:

+ /*
+ * Ensure high is at least batch*4. The multiple is based on the
+ * historical relationship between high and batch.
+ */
+ high = max(high, batch << 2);
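
With the default batch of 63 from earlier in the thread, that floor
works out to 63 << 2 = 252 pages. It should also keep the
reclaim-active cap of pcp->batch << 2 in nr_pcp_high() from ending up
above pcp->high, even for a small zone.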

Performance tests are running and I'll post a v2 assuming they pass.

--
Mel Gorman
SUSE Labs