2023-09-20 06:19:54

by Huang, Ying

Subject: [PATCH 00/10] mm: PCP high auto-tuning

The page allocation performance requirements of different workloads
are often different. So, we need to tune the PCP (Per-CPU Pageset)
high on each CPU automatically to optimize the page allocation
performance.

The list of patches in series is as follows,

1 mm, pcp: avoid to drain PCP when process exit
2 cacheinfo: calculate per-CPU data cache size
3 mm, pcp: reduce lock contention for draining high-order pages
4 mm: restrict the pcp batch scale factor to avoid too long latency
5 mm, page_alloc: scale the number of pages that are batch allocated
6 mm: add framework for PCP high auto-tuning
7 mm: tune PCP high automatically
8 mm, pcp: decrease PCP high if free pages < high watermark
9 mm, pcp: avoid to reduce PCP high unnecessarily
10 mm, pcp: reduce detecting time of consecutive high order page freeing

Patches 1/2/3 optimize the PCP draining for consecutive high-order page
freeing.

Patches 4/5 optimize batch freeing and allocating.

Patches 6/7/8/9 implement and optimize a PCP high auto-tuning method.

Patch 10 optimizes the PCP draining for consecutive high-order page
freeing based on PCP high auto-tuning.

The test results for patches with performance impact are as follows,

kbuild
======

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.

        build time  zone lock%  free_high  alloc_zone
        ----------  ----------  ---------  ----------
base         100.0        43.6      100.0       100.0
patch1        96.6        40.3       49.2        95.2
patch3        96.4        40.5       11.3        95.1
patch5        96.1        37.9       13.3        96.8
patch7        86.4         9.8        6.2        22.0
patch9        85.9         9.4        4.8        16.3
patch10       87.7        12.6       29.0        32.3

The PCP draining optimization (patches 1/3) improves performance a
little.  The PCP batch allocation optimization (patch 5) reduces zone
lock contention a little.  The PCP high auto-tuning (patches 7/9)
improves performance much: the tuning target, the number of pages
allocated from the zone, is reduced greatly, so the zone lock
contention cycles% is reduced greatly too.  The further PCP draining
optimization (patch 10) based on PCP tuning reduces the performance a
little, but it will benefit network workloads as shown below.

With the PCP tuning patches (patches 7/9/10), the maximum memory used
during the test increases by up to 50.6% because more pages are cached
in PCP.  But in the end, the amount of used memory decreases to the
same level as that of the base kernel.  That is, the pages cached in
PCP will be released to the zone after they are no longer used
actively.

netperf SCTP_STREAM_MANY
========================

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.

         score  zone lock%  free_high  alloc_zone  cache miss rate%
         -----  ----------  ---------  ----------  ----------------
base     100.0         2.0      100.0       100.0               1.3
patch1    99.7         2.0       99.7        99.7               1.3
patch3   105.5         1.2       13.2       105.4               1.2
patch5   106.9         1.2       13.4       106.9               1.3
patch7   103.5         1.8        6.8        90.8               7.6
patch9   103.7         1.8        6.6        89.8               7.7
patch10  106.9         1.2       13.5       106.9               1.2

The PCP draining optimization (patches 1/3) improves performance.  The
PCP high auto-tuning (patches 7/9) reduces performance a little
because PCP draining sometimes cannot be triggered in time, so the
cache miss rate% increases.  The further PCP draining optimization
(patch 10) based on PCP tuning restores the performance.

lmbench3 UNIX (AF_UNIX)
=======================

On a 2-socket Intel server with 128 logical CPUs, we tested the UNIX
(AF_UNIX socket) test case of the lmbench3 test suite with 16-pair
processes.

         score  zone lock%  free_high  alloc_zone  cache miss rate%
         -----  ----------  ---------  ----------  ----------------
base     100.0        50.0      100.0       100.0               0.3
patch1   117.1        45.8       72.6       108.9               0.2
patch3   201.6        21.2        7.4       111.5               0.2
patch5   201.9        20.9        7.5       112.7               0.3
patch7   194.2        19.3        7.3       111.5               2.9
patch9   193.1        19.2        7.2       110.4               2.9
patch10  196.8        21.0        7.4       111.2               2.1

The PCP draining optimization (patches 1/3) improves performance much.
The PCP tuning (patches 7/9) reduces performance a little because PCP
draining sometimes cannot be triggered in time.  The further PCP
draining optimization (patch 10) based on PCP tuning partly restores
the performance.

The patchset adds several fields to struct per_cpu_pages.  The struct
layout before/after the patchset is as follows,

base
====

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        high;                 /*     8     4 */
        int                        batch;                /*    12     4 */
        short int                  free_factor;          /*    16     2 */
        short int                  expire;               /*    18     2 */

        /* XXX 4 bytes hole, try to pack */

        struct list_head           lists[13];            /*    24   208 */

        /* size: 256, cachelines: 4, members: 7 */
        /* sum members: 228, holes: 1, sum holes: 4 */
        /* padding: 24 */
} __attribute__((__aligned__(64)));

patched
=======

struct per_cpu_pages {
        spinlock_t                 lock;                 /*     0     4 */
        int                        count;                /*     4     4 */
        int                        count_min;            /*     8     4 */
        int                        high;                 /*    12     4 */
        int                        high_min;             /*    16     4 */
        int                        high_max;             /*    20     4 */
        int                        batch;                /*    24     4 */
        u8                         flags;                /*    28     1 */
        u8                         alloc_factor;         /*    29     1 */
        u8                         expire;               /*    30     1 */

        /* XXX 1 byte hole, try to pack */

        short int                  free_count;           /*    32     2 */

        /* XXX 6 bytes hole, try to pack */

        struct list_head           lists[13];            /*    40   208 */

        /* size: 256, cachelines: 4, members: 12 */
        /* sum members: 241, holes: 2, sum holes: 7 */
        /* padding: 8 */
} __attribute__((__aligned__(64)));

The size of the struct doesn't change with the patchset.

Best Regards,
Huang, Ying


2023-09-20 07:22:30

by Huang, Ying

Subject: [PATCH 06/10] mm: add framework for PCP high auto-tuning

The page allocation performance requirements of different workloads
are usually different.  So, we need to tune PCP (Per-CPU Pageset) high
to optimize the workload page allocation performance.  Now, we have a
system-wide sysctl knob (percpu_pagelist_high_fraction) to tune PCP
high by hand.  But it's hard to find the best value by hand, and one
global configuration may not work best for the different workloads
that run on the same system.  One solution to these issues is to tune
the PCP high of each CPU automatically.

This patch adds the framework for PCP high auto-tuning.  With it,
pcp->high of each CPU will be changed automatically by the tuning
algorithm at runtime.  The minimal high (pcp->high_min) is the
original PCP high value calculated based on the low watermark pages,
while the maximal high (pcp->high_max) is the PCP high value when the
percpu_pagelist_high_fraction sysctl knob is set to
MIN_PERCPU_PAGELIST_HIGH_FRACTION.  That is, it is the maximal
pcp->high that can be set via the sysctl knob by hand.
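
As a rough, illustrative calculation (ignoring the clamping done in
zone_highsize(), and assuming a hypothetical zone with 64 GB of
managed memory, 4 KB pages and 112 local CPUs):

    high_max ~= zone_managed_pages / MIN_PERCPU_PAGELIST_HIGH_FRACTION
                / nr_local_cpus
             =  (64 GB / 4 KB) / 8 / 112
            ~=  18700 pages (~73 MB) per CPU

while pcp->high_min stays at the much smaller value derived from the
zone low watermark.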

It's possible that PCP high auto-tuning doesn't work well for some
workloads. So, when PCP high is tuned by hand via the sysctl knob,
the auto-tuning will be disabled. The PCP high set by hand will be
used instead.

This patch only adds the framework, so pcp->high will always be set
to pcp->high_min (the original default).  The actual auto-tuning
algorithm will be added in the following patches of the series.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 5 ++-
mm/page_alloc.c | 71 +++++++++++++++++++++++++++---------------
2 files changed, 50 insertions(+), 26 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4f7420e35fbb..d6cfb5023f3e 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -683,6 +683,8 @@ struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
+ int high_min; /* min high watermark */
+ int high_max; /* max high watermark */
int batch; /* chunk size for buddy add/remove */
u8 flags; /* protected by pcp->lock */
u8 alloc_factor; /* batch scaling factor during allocate */
@@ -842,7 +844,8 @@ struct zone {
* the high and batch values are copied to individual pagesets for
* faster access
*/
- int pageset_high;
+ int pageset_high_min;
+ int pageset_high_max;
int pageset_batch;

#ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 30bb05fa5353..38bfab562b44 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2353,7 +2353,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
bool free_high)
{
- int high = READ_ONCE(pcp->high);
+ int high = READ_ONCE(pcp->high_min);

if (unlikely(!high || free_high))
return 0;
@@ -2692,7 +2692,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
{
int high, batch, max_nr_alloc;

- high = READ_ONCE(pcp->high);
+ high = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);

/* Check for PCP disabled or boot pageset */
@@ -5298,14 +5298,15 @@ static int zone_batchsize(struct zone *zone)
}

static int percpu_pagelist_high_fraction;
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+ int high_fraction)
{
#ifdef CONFIG_MMU
int high;
int nr_split_cpus;
unsigned long total_pages;

- if (!percpu_pagelist_high_fraction) {
+ if (!high_fraction) {
/*
* By default, the high value of the pcp is based on the zone
* low watermark so that if they are full then background
@@ -5318,15 +5319,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* value is based on a fraction of the managed pages in the
* zone.
*/
- total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+ total_pages = zone_managed_pages(zone) / high_fraction;
}

/*
* Split the high value across all online CPUs local to the zone. Note
* that early in boot that CPUs may not be online yet and that during
* CPU hotplug that the cpumask is not yet updated when a CPU is being
- * onlined. For memory nodes that have no CPUs, split pcp->high across
- * all online CPUs to mitigate the risk that reclaim is triggered
+ * onlined. For memory nodes that have no CPUs, split the high value
+ * across all online CPUs to mitigate the risk that reclaim is triggered
* prematurely due to pages stored on pcp lists.
*/
nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -5354,19 +5355,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* However, guaranteeing these relations at all times would require e.g. write
* barriers here but also careful usage of read barriers at the read side, and
* thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_min and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
*
* mutex_is_locked(&pcp_batch_high_lock) required when calling this function
* outside of boot time (or some other assurance that no concurrent updaters
* exist).
*/
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
- unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_min,
+ unsigned long high_max, unsigned long batch)
{
WRITE_ONCE(pcp->batch, batch);
- WRITE_ONCE(pcp->high, high);
+ WRITE_ONCE(pcp->high_min, high_min);
+ WRITE_ONCE(pcp->high_max, high_max);
}

static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -5386,20 +5389,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
* need to be as careful as pageset_update() as nobody can access the
* pageset yet.
*/
- pcp->high = BOOT_PAGESET_HIGH;
+ pcp->high_min = BOOT_PAGESET_HIGH;
+ pcp->high_max = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
pcp->free_factor = 0;
}

-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
- unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
+ unsigned long high_max, unsigned long batch)
{
struct per_cpu_pages *pcp;
int cpu;

for_each_possible_cpu(cpu) {
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- pageset_update(pcp, high, batch);
+ pageset_update(pcp, high_min, high_max, batch);
}
}

@@ -5409,19 +5413,34 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
*/
static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
- int new_high, new_batch;
+ int new_high_min, new_high_max, new_batch;

new_batch = max(1, zone_batchsize(zone));
- new_high = zone_highsize(zone, new_batch, cpu_online);
+ if (percpu_pagelist_high_fraction) {
+ new_high_min = zone_highsize(zone, new_batch, cpu_online,
+ percpu_pagelist_high_fraction);
+ /*
+ * PCP high is tuned manually, disable auto-tuning via
+ * setting high_min and high_max to the manual value.
+ */
+ new_high_max = new_high_min;
+ } else {
+ new_high_min = zone_highsize(zone, new_batch, cpu_online, 0);
+ new_high_max = zone_highsize(zone, new_batch, cpu_online,
+ MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+ }

- if (zone->pageset_high == new_high &&
+ if (zone->pageset_high_min == new_high_min &&
+ zone->pageset_high_max == new_high_max &&
zone->pageset_batch == new_batch)
return;

- zone->pageset_high = new_high;
+ zone->pageset_high_min = new_high_min;
+ zone->pageset_high_max = new_high_max;
zone->pageset_batch = new_batch;

- __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+ __zone_set_pageset_high_and_batch(zone, new_high_min, new_high_max,
+ new_batch);
}

void __meminit setup_zone_pageset(struct zone *zone)
@@ -5529,7 +5548,8 @@ __meminit void zone_pcp_init(struct zone *zone)
*/
zone->per_cpu_pageset = &boot_pageset;
zone->per_cpu_zonestats = &boot_zonestats;
- zone->pageset_high = BOOT_PAGESET_HIGH;
+ zone->pageset_high_min = BOOT_PAGESET_HIGH;
+ zone->pageset_high_max = BOOT_PAGESET_HIGH;
zone->pageset_batch = BOOT_PAGESET_BATCH;

if (populated_zone(zone))
@@ -6431,13 +6451,14 @@ EXPORT_SYMBOL(free_contig_range);
void zone_pcp_disable(struct zone *zone)
{
mutex_lock(&pcp_batch_high_lock);
- __zone_set_pageset_high_and_batch(zone, 0, 1);
+ __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
__drain_all_pages(zone, true);
}

void zone_pcp_enable(struct zone *zone)
{
- __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high_min,
+ zone->pageset_high_max, zone->pageset_batch);
mutex_unlock(&pcp_batch_high_lock);
}

--
2.39.2

2023-09-20 07:28:12

by Huang, Ying

Subject: [PATCH 03/10] mm, pcp: reduce lock contention for draining high-order pages

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
drained when PCP is mostly used for high-order page freeing, to
improve cache-hot page reuse between the page allocating and freeing
CPUs.

On systems with a small per-CPU data cache, pages shouldn't be cached
before draining, to guarantee that they stay cache-hot.  But on
systems with a large per-CPU data cache, more pages can be cached
before draining to reduce zone lock contention.

So, in this patch, instead of draining without any caching, "batch"
pages will be cached in PCP before draining if the per-CPU data cache
size is more than "4 * batch".
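
As a concrete, illustrative example: with the default batch of 63
pages for a large zone, "4 * batch" is 252 pages, i.e. about 1 MB with
4 KB pages.  So a CPU whose per-CPU data cache share (as calculated in
the previous patch) is larger than roughly 1 MB will keep up to
"batch" high-order pages in PCP before draining, while CPUs with a
smaller per-CPU data cache keep the previous drain-immediately
behavior.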

On a 2-socket Intel server with 128 logical CPUs, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases by 72.2%.  The cycles% of the
spinlock contention (mostly for zone lock) decreases from 45.8% to
21.2%.  The number of PCP drainings for high-order page
freeing (free_high) decreases by 89.8%.  The cache miss rate stays at
0.3%.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
drivers/base/cacheinfo.c | 2 ++
include/linux/gfp.h | 1 +
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 37 ++++++++++++++++++++++++++++++++++++-
4 files changed, 40 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index 3e8951a3fbab..a55b2f83958b 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -943,6 +943,7 @@ static int cacheinfo_cpu_online(unsigned int cpu)
if (rc)
goto err;
update_data_cache_size(true, cpu);
+ setup_pcp_cacheinfo();
return 0;
err:
free_cache_attributes(cpu);
@@ -956,6 +957,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)

free_cache_attributes(cpu);
update_data_cache_size(false, cpu);
+ setup_pcp_cacheinfo();
return 0;
}

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665f06675c83..665edc11fb9f 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -325,6 +325,7 @@ void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);

void page_alloc_init_late(void);
+void setup_pcp_cacheinfo(void);

/*
* gfp_allowed_mask is set to GFP_BOOT_MASK during early boot to restrict what
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 64d5ed2bb724..4132e7490b49 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -677,6 +677,7 @@ enum zone_watermarks {
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

#define PCPF_PREV_FREE_HIGH_ORDER 0x01
+#define PCPF_FREE_HIGH_BATCH 0x02

struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 828dcc24b030..06aa9c5687e0 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -52,6 +52,7 @@
#include <linux/psi.h>
#include <linux/khugepaged.h>
#include <linux/delayacct.h>
+#include <linux/cacheinfo.h>
#include <asm/div64.h>
#include "internal.h"
#include "shuffle.h"
@@ -2385,7 +2386,9 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
*/
if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
free_high = (pcp->free_factor &&
- (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+ (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
+ (!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
+ pcp->count >= READ_ONCE(pcp->batch)));
pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
@@ -5418,6 +5421,38 @@ static void zone_pcp_update(struct zone *zone, int cpu_online)
mutex_unlock(&pcp_batch_high_lock);
}

+static void zone_pcp_update_cacheinfo(struct zone *zone)
+{
+ int cpu;
+ struct per_cpu_pages *pcp;
+ struct cpu_cacheinfo *cci;
+
+ for_each_online_cpu(cpu) {
+ pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
+ cci = get_cpu_cacheinfo(cpu);
+ /*
+ * If per-CPU data cache is large enough, up to
+ * "batch" high-order pages can be cached in PCP for
+ * consecutive freeing. This can reduce zone lock
+ * contention without hurting cache-hot pages sharing.
+ */
+ spin_lock(&pcp->lock);
+ if ((cci->size_data >> PAGE_SHIFT) > 4 * pcp->batch)
+ pcp->flags |= PCPF_FREE_HIGH_BATCH;
+ else
+ pcp->flags &= ~PCPF_FREE_HIGH_BATCH;
+ spin_unlock(&pcp->lock);
+ }
+}
+
+void setup_pcp_cacheinfo(void)
+{
+ struct zone *zone;
+
+ for_each_populated_zone(zone)
+ zone_pcp_update_cacheinfo(zone);
+}
+
/*
* Allocate per cpu pagesets and initialize them.
* Before this call only boot pagesets were available.
--
2.39.2

2023-09-20 08:11:54

by Huang, Ying

Subject: [PATCH 07/10] mm: tune PCP high automatically

The targets of tuning PCP high automatically are as follows,

- Minimize allocation/freeing from/to the shared zone

- Minimize idle pages in PCP

- Minimize pages in PCP if the system's free pages are too few

To reach these targets, a tuning algorithm is designed as follows,

- When we refill PCP via allocating from the zone, increase PCP high,
  because with a larger PCP we could avoid allocating from the zone.

- In the periodic vmstat updating kworker (via refresh_cpu_vm_stats()),
  decrease PCP high to try to free possible idle PCP pages.

- When page reclaiming is active for the zone, stop increasing PCP
  high in the allocating path, and decrease PCP high and free some
  pages in the freeing path.

So, the PCP high can eventually be tuned to match the page
allocating/freeing depth of the workload.
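
A condensed sketch of the resulting logic, simplified from the hunks
below (locking and the clamping of pcp->high into [high_min, high_max]
are omitted):

    /* Allocation path (nr_pcp_alloc()): grow pcp->high towards high_max
     * when PCP is refilled from the zone and reclaim is not active.
     */
    if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
            pcp->high = min(pcp->high + batch, high_max);

    /* Periodic vmstat kworker (decay_pcp_high()): shrink pcp->high and
     * drain the pages above the new high.
     */
    if (pcp->high > high_min)
            pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
                             pcp->high * 4 / 5, high_min);
    if (pcp->count > pcp->high)
            free_pcppages_bulk(zone, pcp->count - pcp->high, pcp, 0);

    /* Freeing path (nr_pcp_high()) while reclaim is active for the zone:
     * shrink pcp->high and cap the number of pages kept on the PCP.
     */
    if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
            pcp->high = max(pcp->high - (batch << pcp->free_factor), high_min);
            return min(batch << 2, pcp->high);
    }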

One issue with the algorithm is that if the number of pages allocated
is much larger than the number of pages freed on a CPU, the PCP high
may reach the maximal value even if the allocating/freeing depth is
small.  But this isn't a severe issue, because there are no idle pages
in this case.

One alternative choice is to increase PCP high when we drain PCP by
trying to free pages to the zone, but not during PCP refilling.  This
would avoid the issue above.  But if the number of pages allocated is
much smaller than the number of pages freed on a CPU, there will be
many idle pages in PCP and it may be hard to free these idle pages.

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the build time
decreases by 10.1%.  The cycles% of the spinlock contention (mostly
for zone lock) decreases from 37.9% to 9.8% (with PCP size == 361).
The number of PCP drainings for high-order page freeing (free_high)
decreases by 53.4%.  The number of pages allocated from the zone
(instead of from PCP) decreases by 77.3%.

Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Suggested-by: Michal Hocko <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/gfp.h | 1 +
mm/page_alloc.c | 118 ++++++++++++++++++++++++++++++++++----------
mm/vmstat.c | 8 +--
3 files changed, 98 insertions(+), 29 deletions(-)

diff --git a/include/linux/gfp.h b/include/linux/gfp.h
index 665edc11fb9f..5b917e5b9350 100644
--- a/include/linux/gfp.h
+++ b/include/linux/gfp.h
@@ -320,6 +320,7 @@ extern void page_frag_free(void *addr);
#define free_page(addr) free_pages((addr), 0)

void page_alloc_init_cpuhp(void);
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp);
void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp);
void drain_all_pages(struct zone *zone);
void drain_local_pages(struct zone *zone);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 38bfab562b44..225abe56752c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2160,6 +2160,40 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
return i;
}

+/*
+ * Called from the vmstat counter updater to decay the PCP high.
+ * Return whether there is additional work to do.
+ */
+int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
+{
+ int high_min, to_drain, batch;
+ int todo = 0;
+
+ high_min = READ_ONCE(pcp->high_min);
+ batch = READ_ONCE(pcp->batch);
+ /*
+ * Decrease pcp->high periodically to try to free possible
+ * idle PCP pages. And, avoid to free too many pages to
+ * control latency.
+ */
+ if (pcp->high > high_min) {
+ pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
+ pcp->high * 4 / 5, high_min);
+ if (pcp->high > high_min)
+ todo++;
+ }
+
+ to_drain = pcp->count - pcp->high;
+ if (to_drain > 0) {
+ spin_lock(&pcp->lock);
+ free_pcppages_bulk(zone, to_drain, pcp, 0);
+ spin_unlock(&pcp->lock);
+ todo++;
+ }
+
+ return todo;
+}
+
#ifdef CONFIG_NUMA
/*
* Called from the vmstat counter updater to drain pagesets of this
@@ -2321,14 +2355,13 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn,
return true;
}

-static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
+static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free_high)
{
int min_nr_free, max_nr_free;
- int batch = READ_ONCE(pcp->batch);

- /* Free everything if batch freeing high-order pages. */
+ /* Free as much as possible if batch freeing high-order pages. */
if (unlikely(free_high))
- return pcp->count;
+ return min(pcp->count, batch << PCP_BATCH_SCALE_MAX);

/* Check for PCP disabled or boot pageset */
if (unlikely(high < batch))
@@ -2343,7 +2376,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
* freeing of pages without any allocation.
*/
batch <<= pcp->free_factor;
- if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
+ if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
pcp->free_factor++;
batch = clamp(batch, min_nr_free, max_nr_free);

@@ -2351,28 +2384,47 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
}

static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
- bool free_high)
+ int batch, bool free_high)
{
- int high = READ_ONCE(pcp->high_min);
+ int high, high_min, high_max;

- if (unlikely(!high || free_high))
+ high_min = READ_ONCE(pcp->high_min);
+ high_max = READ_ONCE(pcp->high_max);
+ high = pcp->high = clamp(pcp->high, high_min, high_max);
+
+ if (unlikely(!high))
return 0;

- if (!test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
- return high;
+ if (unlikely(free_high)) {
+ pcp->high = max(high - (batch << PCP_BATCH_SCALE_MAX), high_min);
+ return 0;
+ }

/*
* If reclaim is active, limit the number of pages that can be
* stored on pcp lists
*/
- return min(READ_ONCE(pcp->batch) << 2, high);
+ if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
+ pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ return min(batch << 2, pcp->high);
+ }
+
+ if (pcp->count >= high && high_min != high_max) {
+ int need_high = (batch << pcp->free_factor) + batch;
+
+ /* pcp->high should be large enough to hold batch freed pages */
+ if (pcp->high < need_high)
+ pcp->high = clamp(need_high, high_min, high_max);
+ }
+
+ return high;
}

static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
struct page *page, int migratetype,
unsigned int order)
{
- int high;
+ int high, batch;
int pindex;
bool free_high = false;

@@ -2387,6 +2439,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
list_add(&page->pcp_list, &pcp->lists[pindex]);
pcp->count += 1 << order;

+ batch = READ_ONCE(pcp->batch);
/*
* As high-order pages other than THP's stored on PCP can contribute
* to fragmentation, limit the number stored when PCP is heavily
@@ -2397,14 +2450,15 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
free_high = (pcp->free_factor &&
(pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
(!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
- pcp->count >= READ_ONCE(pcp->batch)));
+ pcp->count >= READ_ONCE(batch)));
pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
- high = nr_pcp_high(pcp, zone, free_high);
+ high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
- free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
+ free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
+ pcp, pindex);
}
}

@@ -2688,24 +2742,38 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
return page;
}

-static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
{
- int high, batch, max_nr_alloc;
+ int high, base_batch, batch, max_nr_alloc;
+ int high_max, high_min;

- high = READ_ONCE(pcp->high_min);
- batch = READ_ONCE(pcp->batch);
+ base_batch = READ_ONCE(pcp->batch);
+ high_min = READ_ONCE(pcp->high_min);
+ high_max = READ_ONCE(pcp->high_max);
+ high = pcp->high = clamp(pcp->high, high_min, high_max);

/* Check for PCP disabled or boot pageset */
- if (unlikely(high < batch))
+ if (unlikely(high < base_batch))
return 1;

+ if (order)
+ batch = base_batch;
+ else
+ batch = (base_batch << pcp->alloc_factor);
+
/*
- * Double the number of pages allocated each time there is subsequent
- * refiling of order-0 pages without drain.
+ * If we had larger pcp->high, we could avoid to allocate from
+ * zone.
*/
+ if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+ high = pcp->high = min(high + batch, high_max);
+
if (!order) {
- max_nr_alloc = max(high - pcp->count - batch, batch);
- batch <<= pcp->alloc_factor;
+ max_nr_alloc = max(high - pcp->count - base_batch, base_batch);
+ /*
+ * Double the number of pages allocated each time there is
+ * subsequent refiling of order-0 pages without drain.
+ */
if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
pcp->alloc_factor++;
batch = min(batch, max_nr_alloc);
@@ -2735,7 +2803,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,

do {
if (list_empty(list)) {
- int batch = nr_pcp_alloc(pcp, order);
+ int batch = nr_pcp_alloc(pcp, zone, order);
int alloced;

alloced = rmqueue_bulk(zone, order,
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 00e81e99c6ee..2f716ad14168 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -814,9 +814,7 @@ static int refresh_cpu_vm_stats(bool do_pagesets)

for_each_populated_zone(zone) {
struct per_cpu_zonestat __percpu *pzstats = zone->per_cpu_zonestats;
-#ifdef CONFIG_NUMA
struct per_cpu_pages __percpu *pcp = zone->per_cpu_pageset;
-#endif

for (i = 0; i < NR_VM_ZONE_STAT_ITEMS; i++) {
int v;
@@ -832,10 +830,12 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
#endif
}
}
-#ifdef CONFIG_NUMA

if (do_pagesets) {
cond_resched();
+
+ changes += decay_pcp_high(zone, this_cpu_ptr(pcp));
+#ifdef CONFIG_NUMA
/*
* Deal with draining the remote pageset of this
* processor
@@ -862,8 +862,8 @@ static int refresh_cpu_vm_stats(bool do_pagesets)
drain_zone_pages(zone, this_cpu_ptr(pcp));
changes++;
}
- }
#endif
+ }
}

for_each_online_pgdat(pgdat) {
--
2.39.2

2023-09-20 08:28:11

by Huang, Ying

Subject: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily

In the PCP high auto-tuning algorithm, to minimize idle pages in PCP,
the periodic vmstat updating kworker (via refresh_cpu_vm_stats())
decreases PCP high to try to free possible idle PCP pages.  One issue
is that even if the page allocating/freeing depth is larger than the
maximal PCP high, we may reduce PCP high unnecessarily.

To avoid the above issue, in this patch, we track the minimal PCP page
count.  And the periodic PCP high decrement will be no more than the
recent minimal PCP page count.  So, only detected idle pages will be
freed.
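
For example (with illustrative numbers): if pcp->high is 1000 pages
but the PCP page count never dropped below 800 pages since the last
vmstat update, count_min is 800 and the periodic decrement is
min(800, 1000 / 5) = 200 pages.  If instead the PCP was nearly emptied
at some point (count_min close to 0), pcp->high is left almost
untouched, because the cached pages were actually used.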

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the number of pages
allocated from the zone (instead of from PCP) decreases by 25.8%.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 15 ++++++++++-----
2 files changed, 11 insertions(+), 5 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 8a19e2af89df..35b78c7522a7 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -682,6 +682,7 @@ enum zone_watermarks {
struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
int count; /* number of pages in the list */
+ int count_min; /* minimal number of pages in the list recently */
int high; /* high watermark, emptying needed */
int high_min; /* min high watermark */
int high_max; /* max high watermark */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3f8c7dfeed23..77e9b7b51688 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
*/
int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
{
- int high_min, to_drain, batch;
+ int high_min, decrease, to_drain, batch;
int todo = 0;

high_min = READ_ONCE(pcp->high_min);
batch = READ_ONCE(pcp->batch);
/*
- * Decrease pcp->high periodically to try to free possible
- * idle PCP pages. And, avoid to free too many pages to
- * control latency.
+ * Decrease pcp->high periodically to free idle PCP pages counted
+ * via pcp->count_min. And, avoid to free too many pages to
+ * control latency. This caps pcp->high decrement too.
*/
if (pcp->high > high_min) {
+ decrease = min(pcp->count_min, pcp->high / 5);
pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
- pcp->high * 4 / 5, high_min);
+ pcp->high - decrease, high_min);
if (pcp->high > high_min)
todo++;
}
@@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
todo++;
}

+ pcp->count_min = pcp->count;
+
return todo;
}

@@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, pcp_list);
list_del(&page->pcp_list);
pcp->count -= 1 << order;
+ if (pcp->count < pcp->count_min)
+ pcp->count_min = pcp->count;
} while (check_new_pages(page, order));

return page;
--
2.39.2

2023-09-20 08:42:08

by Huang, Ying

Subject: [PATCH 04/10] mm: restrict the pcp batch scale factor to avoid too long latency

In the page allocator, PCP (Per-CPU Pageset) is refilled and drained
in batches to increase page allocation throughput, reduce page
allocation/freeing latency per page, and reduce zone lock contention.
But a too-large batch size will cause a too-long maximal
allocation/freeing latency, which may punish arbitrary users.  So the
default batch size is chosen carefully (in zone_batchsize(), the value
is 63 for zones > 1GB) to avoid that.

In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
are batch freed"), the batch size is scaled when a large number of
pages are freed, to improve page freeing performance and reduce zone
lock contention.  A similar optimization can be used for allocating a
large number of pages too.

To find a suitable max batch scale factor (that is, max effective
batch size), some tests and measurements were done on several machines
as follows.

A set of debug patches is implemented as follows,

- Set PCP high to be 2 * batch to reduce the effect of PCP high

- Disable free batch size scaling to get the raw performance.

- The code with zone lock held is extracted from rmqueue_bulk() and
  free_pcppages_bulk() into 2 separate functions to make it easy to
  measure the function run time with the ftrace function_graph tracer.

- The batch size is hard coded to be 63 (default), 127, 255, 511,
  1023, 2047, 4095.

Then will-it-scale/page_fault1 is used to generate the page
allocation/freeing workload.  The page allocation/freeing throughput
(page/s) is measured via will-it-scale.  The page allocation/freeing
average latency (alloc/free latency avg, in us) and the
allocation/freeing latency at the 99th percentile (alloc/free latency
99%, in us) are measured with the ftrace function_graph tracer.

The test results are as follows,

Sapphire Rapids Server
======================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     513633.4          2.33          3.57           2.67           6.83
  127     517616.7          4.35          6.65           4.22          13.03
  255     520822.8          8.29         13.32           7.52          25.24
  511     524122.0         15.79         23.42          14.02          49.35
 1023     525980.5         30.25         44.19          25.36          94.88
 2047     526793.6         59.39         84.50          45.22         140.81

Ice Lake Server
===============
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     620210.3          2.21          3.68           2.02           4.35
  127     627003.0          4.09          6.86           3.51           8.28
  255     630777.5          7.70         13.50           6.17          15.97
  511     633651.5         14.85         22.62          11.66          31.08
 1023     637071.1         28.55         42.02          20.81          54.36
 2047     638089.7         56.54         84.06          39.28          91.68

Cascade Lake Server
===================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     404706.7          3.29          5.03           3.53           4.75
  127     422475.2          6.12          9.09           6.36           8.76
  255     411522.2         11.68         16.97          10.90          16.39
  511     428124.1         22.54         31.28          19.86          32.25
 1023     414718.4         43.39         62.52          40.00          66.33
 2047     429848.7         86.64        120.34          71.14         106.08

Comet Lake Desktop
==================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63    795183.13          2.18          3.55           2.03           3.05
  127    803067.85          3.91          6.56           3.85           5.52
  255    812771.10          7.35         10.80           7.14          10.20
  511    817723.48         14.17         27.54          13.43          30.31
 1023    818870.19         27.72         40.10          27.89          46.28

Coffee Lake Desktop
===================
Batch   throughput  free latency  free latency  alloc latency  alloc latency
            page/s      avg / us      99% / us       avg / us       99% / us
-----   ----------  ------------  ------------  -------------  -------------
   63     510542.8          3.13          4.40           2.48           3.43
  127     514288.6          5.97          7.89           4.65           6.04
  255     516889.7         11.86         15.58           8.96          12.55
  511     519802.4         23.10         28.81          16.95          26.19
 1023     520802.7         45.30         52.51          33.19          45.95
 2047     519997.1         90.63        104.00          65.26          81.74

From the above data, to keep the allocation/freeing latency below
100 us in most cases, the max batch scale factor needs to be less than
or equal to 5.

So, in this patch, the batch scale factor is restricted to be less
than or equal to 5.
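
That is, with the default batch of 63 pages, the maximal effective
batch size becomes 63 << 5 = 2016 pages, which corresponds to the
"2047" rows above, where the 99% free latency is still in the
neighborhood of 100 us on the tested machines.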

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/page_alloc.c | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 06aa9c5687e0..30554c674349 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -86,6 +86,9 @@ typedef int __bitwise fpi_t;
*/
#define FPI_TO_TAIL ((__force fpi_t)BIT(1))

+/* Maximum PCP batch scale factor to restrict max allocation/freeing latency */
+#define PCP_BATCH_SCALE_MAX 5
+
/* prevent >1 _updater_ of zone percpu pageset ->high and ->batch fields */
static DEFINE_MUTEX(pcp_batch_high_lock);
#define MIN_PERCPU_PAGELIST_HIGH_FRACTION (8)
@@ -2340,7 +2343,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, bool free_high)
* freeing of pages without any allocation.
*/
batch <<= pcp->free_factor;
- if (batch < max_nr_free)
+ if (batch < max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
pcp->free_factor++;
batch = clamp(batch, min_nr_free, max_nr_free);

--
2.39.2

2023-09-20 12:06:13

by Huang, Ying

Subject: [PATCH 08/10] mm, pcp: decrease PCP high if free pages < high watermark

One target of PCP tuning is to minimize the pages in PCP if the
system's free pages are too few.  To reach that target, when page
reclaiming is active for the zone (ZONE_RECLAIM_ACTIVE), we will stop
increasing PCP high in the allocating path, and decrease PCP high and
free some pages in the freeing path.  But this may be too late,
because the background page reclaiming may introduce latency for some
workloads.  So, in this patch, during page allocation we will detect
whether the number of free pages of the zone is below the high
watermark.  If so, we will stop increasing PCP high in the allocating
path, and decrease PCP high and free some pages in the freeing path.
With this, we can reduce the possibility of premature background page
reclaiming caused by a too-large PCP.

The high watermark checking is done in the allocating path to reduce
the overhead in the hotter freeing path.
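
A simplified sketch of the resulting control flow, condensed from the
hunks below (function arguments abbreviated, details omitted):

    /* Allocation path (get_page_from_freelist()): remember when the
     * zone falls below its high watermark.
     */
    if (!zone_watermark_fast(zone, order, high_wmark_pages(zone), ...))
            set_bit(ZONE_BELOW_HIGH, &zone->flags);

    /* Freeing path (nr_pcp_high()): while below the high watermark,
     * shrink pcp->high so that more freed pages go back to the buddy
     * allocator.
     */
    if (test_bit(ZONE_BELOW_HIGH, &zone->flags))
            pcp->high = max(pcp->high - (batch << pcp->free_factor), high_min);

    /* After draining (free_unref_page_commit()): clear the flag once
     * the zone is back above its high watermark.
     */
    if (zone_watermark_ok(zone, 0, high_wmark_pages(zone), ZONE_MOVABLE, 0))
            clear_bit(ZONE_BELOW_HIGH, &zone->flags);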

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 1 +
mm/page_alloc.c | 22 ++++++++++++++++++++--
2 files changed, 21 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index d6cfb5023f3e..8a19e2af89df 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -1006,6 +1006,7 @@ enum zone_flags {
* Cleared when kswapd is woken.
*/
ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
+ ZONE_BELOW_HIGH, /* zone is below high watermark. */
};

static inline unsigned long zone_managed_pages(struct zone *zone)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 225abe56752c..3f8c7dfeed23 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return min(batch << 2, pcp->high);
}

- if (pcp->count >= high && high_min != high_max) {
+ if (high_min == high_max)
+ return high;
+
+ if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
+ pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ high = max(pcp->count, high_min);
+ } else if (pcp->count >= high) {
int need_high = (batch << pcp->free_factor) + batch;

/* pcp->high should be large enough to hold batch freed pages */
@@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
pcp, pindex);
+ if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
+ zone_watermark_ok(zone, 0, high_wmark_pages(zone),
+ ZONE_MOVABLE, 0))
+ clear_bit(ZONE_BELOW_HIGH, &zone->flags);
}
}

@@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
* If we had larger pcp->high, we could avoid to allocate from
* zone.
*/
- if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
+ if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
high = pcp->high = min(high + batch, high_max);

if (!order) {
@@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
}
}

+ mark = high_wmark_pages(zone);
+ if (zone_watermark_fast(zone, order, mark,
+ ac->highest_zoneidx, alloc_flags,
+ gfp_mask))
+ goto try_this_zone;
+ else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
+ set_bit(ZONE_BELOW_HIGH, &zone->flags);
+
mark = wmark_pages(zone, alloc_flags & ALLOC_WMARK_MASK);
if (!zone_watermark_fast(zone, order, mark,
ac->highest_zoneidx, alloc_flags,
--
2.39.2

2023-09-20 13:36:21

by Huang, Ying

Subject: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

Per-CPU data cache size is useful information.  For example, it can be
used to determine how many pages can be cached in the PCP of each CPU
(see the next patch in the series).  So, in this patch, the per-CPU
data cache size is calculated by summing data_cache_size /
shared_cpu_weight over the data/unified cache levels of each CPU.

A brute-force algorithm that iterates over all online CPUs is used to
avoid allocating an extra cpumask, especially in the offline callback.
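
For example (with illustrative cache sizes): on a CPU with a 48 KB
private L1D, a 2 MB L2 shared by 2 SMT siblings and a 60 MB L3 shared
by 32 CPUs, the per-CPU data cache size is computed as
48 KB + 2 MB / 2 + 60 MB / 32, i.e. roughly 2.9 MB.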

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Sudeep Holla <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
include/linux/cacheinfo.h | 1 +
2 files changed, 42 insertions(+), 1 deletion(-)

diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
index cbae8be1fe52..3e8951a3fbab 100644
--- a/drivers/base/cacheinfo.c
+++ b/drivers/base/cacheinfo.c
@@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
return rc;
}

+static void update_data_cache_size_cpu(unsigned int cpu)
+{
+ struct cpu_cacheinfo *ci;
+ struct cacheinfo *leaf;
+ unsigned int i, nr_shared;
+ unsigned int size_data = 0;
+
+ if (!per_cpu_cacheinfo(cpu))
+ return;
+
+ ci = ci_cacheinfo(cpu);
+ for (i = 0; i < cache_leaves(cpu); i++) {
+ leaf = per_cpu_cacheinfo_idx(cpu, i);
+ if (leaf->type != CACHE_TYPE_DATA &&
+ leaf->type != CACHE_TYPE_UNIFIED)
+ continue;
+ nr_shared = cpumask_weight(&leaf->shared_cpu_map);
+ if (!nr_shared)
+ continue;
+ size_data += leaf->size / nr_shared;
+ }
+ ci->size_data = size_data;
+}
+
+static void update_data_cache_size(bool cpu_online, unsigned int cpu)
+{
+ unsigned int icpu;
+
+ for_each_online_cpu(icpu) {
+ if (!cpu_online && icpu == cpu)
+ continue;
+ update_data_cache_size_cpu(icpu);
+ }
+}
+
static int cacheinfo_cpu_online(unsigned int cpu)
{
int rc = detect_cache_attributes(cpu);
@@ -906,7 +941,11 @@ static int cacheinfo_cpu_online(unsigned int cpu)
return rc;
rc = cache_add_dev(cpu);
if (rc)
- free_cache_attributes(cpu);
+ goto err;
+ update_data_cache_size(true, cpu);
+ return 0;
+err:
+ free_cache_attributes(cpu);
return rc;
}

@@ -916,6 +955,7 @@ static int cacheinfo_cpu_pre_down(unsigned int cpu)
cpu_cache_sysfs_exit(cpu);

free_cache_attributes(cpu);
+ update_data_cache_size(false, cpu);
return 0;
}

diff --git a/include/linux/cacheinfo.h b/include/linux/cacheinfo.h
index a5cfd44fab45..4e7ccfa0c36d 100644
--- a/include/linux/cacheinfo.h
+++ b/include/linux/cacheinfo.h
@@ -73,6 +73,7 @@ struct cacheinfo {

struct cpu_cacheinfo {
struct cacheinfo *info_list;
+ unsigned int size_data;
unsigned int num_levels;
unsigned int num_leaves;
bool cpu_map_populated;
--
2.39.2

2023-09-20 13:45:36

by Huang, Ying

Subject: [PATCH 10/10] mm, pcp: reduce detecting time of consecutive high order page freeing

In the current PCP auto-tuning design, if the number of pages
allocated is much more than that of pages freed on a CPU, the PCP high
may become the maximal value even if the allocating/freeing depth is
small, for example, on the sender side of network workloads.  If a CPU
was originally used as a sender and is then used as a receiver after
context switching, we need to fill the whole PCP up to the maximal
high before PCP draining is triggered for consecutive high-order
freeing.  This will hurt the performance of some network workloads.

To solve the issue, in this patch, we track the consecutive page
freeing with a counter instead of relying on PCP draining.  So, we can
detect consecutive page freeing much earlier.
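
With the patch, free_high can be triggered as soon as pcp->free_count
reaches "batch" consecutively freed pages (63 by default for a large
zone), instead of only after a previous bulk free has set
pcp->free_factor, so a CPU that switches from the sender role to the
receiver role starts draining after roughly one batch of frees rather
than after filling a possibly very large PCP.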

On a 2-socket Intel server with 128 logical CPUs, we tested the
SCTP_STREAM_MANY test case of the netperf test suite with 64-pair
processes.  With the patch, the network bandwidth improves by 3.1%.
This restores the performance drop caused by PCP auto-tuning.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 2 +-
mm/page_alloc.c | 23 +++++++++++------------
2 files changed, 12 insertions(+), 13 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 35b78c7522a7..44f6dc3cdeeb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -689,10 +689,10 @@ struct per_cpu_pages {
int batch; /* chunk size for buddy add/remove */
u8 flags; /* protected by pcp->lock */
u8 alloc_factor; /* batch scaling factor during allocate */
- u8 free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
u8 expire; /* When 0, remote pagesets are drained */
#endif
+ short free_count; /* consecutive free count */

/* Lists of pages, one per migrate type stored on the pcp-lists */
struct list_head lists[NR_PCP_LISTS];
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 77e9b7b51688..6ae2a5ebf7a4 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2375,13 +2375,10 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int batch, int high, bool free
max_nr_free = high - batch;

/*
- * Double the number of pages freed each time there is subsequent
- * freeing of pages without any allocation.
+ * Increase the batch number to the number of the consecutive
+ * freed pages to reduce zone lock contention.
*/
- batch <<= pcp->free_factor;
- if (batch <= max_nr_free && pcp->free_factor < PCP_BATCH_SCALE_MAX)
- pcp->free_factor++;
- batch = clamp(batch, min_nr_free, max_nr_free);
+ batch = clamp_t(int, pcp->free_count, min_nr_free, max_nr_free);

return batch;
}
@@ -2408,7 +2405,7 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
* stored on pcp lists
*/
if (test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags)) {
- pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ pcp->high = max(high - pcp->free_count, high_min);
return min(batch << 2, pcp->high);
}

@@ -2416,10 +2413,10 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return high;

if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
- pcp->high = max(high - (batch << pcp->free_factor), high_min);
+ pcp->high = max(high - pcp->free_count, high_min);
high = max(pcp->count, high_min);
} else if (pcp->count >= high) {
- int need_high = (batch << pcp->free_factor) + batch;
+ int need_high = pcp->free_count + batch;

/* pcp->high should be large enough to hold batch freed pages */
if (pcp->high < need_high)
@@ -2456,7 +2453,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
* stops will be drained from vmstat refresh context.
*/
if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
- free_high = (pcp->free_factor &&
+ free_high = (pcp->free_count >= batch &&
(pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) &&
(!(pcp->flags & PCPF_FREE_HIGH_BATCH) ||
pcp->count >= READ_ONCE(batch)));
@@ -2464,6 +2461,8 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
} else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
}
+ if (pcp->free_count < (batch << PCP_BATCH_SCALE_MAX))
+ pcp->free_count += (1 << order);
high = nr_pcp_high(pcp, zone, batch, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
@@ -2861,7 +2860,7 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
* See nr_pcp_free() where free_factor is increased for subsequent
* frees.
*/
- pcp->free_factor >>= 1;
+ pcp->free_count >>= 1;
list = &pcp->lists[order_to_pindex(migratetype, order)];
page = __rmqueue_pcplist(zone, order, migratetype, alloc_flags, pcp, list);
pcp_spin_unlock(pcp);
@@ -5483,7 +5482,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
pcp->high_min = BOOT_PAGESET_HIGH;
pcp->high_max = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
- pcp->free_factor = 0;
+ pcp->free_count = 0;
}

static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_min,
--
2.39.2

2023-09-20 15:35:28

by Huang, Ying

Subject: [PATCH 05/10] mm, page_alloc: scale the number of pages that are batch allocated

When a task is allocating a large number of order-0 pages, it may
acquire the zone->lock multiple times, allocating pages in batches.
This may unnecessarily contend on the zone lock when allocating a very
large number of pages.  This patch adapts the batch size based on the
recent allocation pattern, scaling it up for subsequent allocations.
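
For example (with the default batch of 63 for a large zone): the first
PCP refill pulls about 63 order-0 pages from the zone under a single
zone->lock acquisition; repeated refills without intervening frees
roughly double that each time (126, 252, ...), bounded both by
PCP_BATCH_SCALE_MAX (introduced in the previous patch) and by the
remaining room below pcp->high.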

On a 2-socket Intel server with 224 logical CPUs, we tested kbuild on
one socket with `make -j 112`.  With the patch, the cycles% of the
spinlock contention (mostly for zone lock) decreases from 40.5% to
37.9% (with PCP size == 361).

Signed-off-by: "Huang, Ying" <[email protected]>
Suggested-by: Mel Gorman <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 52 ++++++++++++++++++++++++++++++++++--------
2 files changed, 44 insertions(+), 11 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4132e7490b49..4f7420e35fbb 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -685,9 +685,10 @@ struct per_cpu_pages {
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
u8 flags; /* protected by pcp->lock */
+ u8 alloc_factor; /* batch scaling factor during allocate */
u8 free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
- short expire; /* When 0, remote pagesets are drained */
+ u8 expire; /* When 0, remote pagesets are drained */
#endif

/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 30554c674349..30bb05fa5353 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2376,6 +2376,12 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
int pindex;
bool free_high = false;

+ /*
+ * On freeing, reduce the number of pages that are batch allocated.
+ * See nr_pcp_alloc() where alloc_factor is increased for subsequent
+ * allocations.
+ */
+ pcp->alloc_factor >>= 1;
__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -2682,6 +2688,41 @@ struct page *rmqueue_buddy(struct zone *preferred_zone, struct zone *zone,
return page;
}

+static int nr_pcp_alloc(struct per_cpu_pages *pcp, int order)
+{
+ int high, batch, max_nr_alloc;
+
+ high = READ_ONCE(pcp->high);
+ batch = READ_ONCE(pcp->batch);
+
+ /* Check for PCP disabled or boot pageset */
+ if (unlikely(high < batch))
+ return 1;
+
+ /*
+ * Double the number of pages allocated each time there is subsequent
+ * refiling of order-0 pages without drain.
+ */
+ if (!order) {
+ max_nr_alloc = max(high - pcp->count - batch, batch);
+ batch <<= pcp->alloc_factor;
+ if (batch <= max_nr_alloc && pcp->alloc_factor < PCP_BATCH_SCALE_MAX)
+ pcp->alloc_factor++;
+ batch = min(batch, max_nr_alloc);
+ }
+
+ /*
+ * Scale batch relative to order if batch implies free pages
+ * can be stored on the PCP. Batch can be 1 for small zones or
+ * for boot pagesets which should never store free pages as
+ * the pages may belong to arbitrary zones.
+ */
+ if (batch > 1)
+ batch = max(batch >> order, 2);
+
+ return batch;
+}
+
/* Remove page from the per-cpu list, caller must protect the list */
static inline
struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
@@ -2694,18 +2735,9 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,

do {
if (list_empty(list)) {
- int batch = READ_ONCE(pcp->batch);
+ int batch = nr_pcp_alloc(pcp, order);
int alloced;

- /*
- * Scale batch relative to order if batch implies
- * free pages can be stored on the PCP. Batch can
- * be 1 for small zones or for boot pagesets which
- * should never store free pages as the pages may
- * belong to arbitrary zones.
- */
- if (batch > 1)
- batch = max(batch >> order, 2);
alloced = rmqueue_bulk(zone, order,
batch, list,
migratetype, alloc_flags);
--
2.39.2

2023-09-20 16:51:27

by Andrew Morton

Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning

On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <[email protected]> wrote:

> The page allocation performance requirements of different workloads
> are often different. So, we need to tune the PCP (Per-CPU Pageset)
> high on each CPU automatically to optimize the page allocation
> performance.

Some of the performance changes here are downright scary.

I've never been very sure that percpu pages was very beneficial (and
hey, I invented the thing back in the Mesozoic era). But these numbers
make me think it's very important and we should have been paying more
attention.

> The list of patches in series is as follows,
>
> 1 mm, pcp: avoid to drain PCP when process exit
> 2 cacheinfo: calculate per-CPU data cache size
> 3 mm, pcp: reduce lock contention for draining high-order pages
> 4 mm: restrict the pcp batch scale factor to avoid too long latency
> 5 mm, page_alloc: scale the number of pages that are batch allocated
> 6 mm: add framework for PCP high auto-tuning
> 7 mm: tune PCP high automatically
> 8 mm, pcp: decrease PCP high if free pages < high watermark
> 9 mm, pcp: avoid to reduce PCP high unnecessarily
> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>
> Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> freeing.
>
> Patch 4/5 optimize batch freeing and allocating.
>
> Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>
> Patch 10 optimize the PCP draining for consecutive high order page
> freeing based on PCP high auto-tuning.
>
> The test results for patches with performance impact are as follows,
>
> kbuild
> ======
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`.
>
> build time zone lock% free_high alloc_zone
> ---------- ---------- --------- ----------
> base 100.0 43.6 100.0 100.0
> patch1 96.6 40.3 49.2 95.2
> patch3 96.4 40.5 11.3 95.1
> patch5 96.1 37.9 13.3 96.8
> patch7 86.4 9.8 6.2 22.0
> patch9 85.9 9.4 4.8 16.3
> patch10 87.7 12.6 29.0 32.3

You're seriously saying that kbuild got 12% faster?

I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Other thoughts:

- What if any facilities are provided to permit users/developers to
monitor the operation of the autotuning algorithm?

- I'm not seeing any Documentation/ updates. Surely there are things
we can tell users?

- This:

: It's possible that PCP high auto-tuning doesn't work well for some
: workloads. So, when PCP high is tuned by hand via the sysctl knob,
: the auto-tuning will be disabled. The PCP high set by hand will be
: used instead.

Is it a bit hacky to disable autotuning when the user alters
pcp-high? Would it be cleaner to have a separate on/off knob for
autotuning?

And how is the user to determine that "PCP high auto-tuning doesn't work
well" for their workload?

2023-09-20 19:38:41

by Huang, Ying

[permalink] [raw]
Subject: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
drained when PCP is mostly used for high-order pages freeing to
improve the cache-hot pages reusing between page allocation and
freeing CPUs.

But the PCP draining mechanism may be triggered unexpectedly when a
process exits. With some customized trace points, it was found that
PCP draining (free_high == true) was triggered by an order-1 page
freeing with the following call stack,

=> free_unref_page_commit
=> free_unref_page
=> __mmdrop
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64

Checking the source code, this is the page table PGD
freeing (mm_free_pgd()). It's an order-1 page freeing if
CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
security.

Just before that, page freeing with the following call stack was
found,

=> free_unref_page_commit
=> free_unref_page_list
=> release_pages
=> tlb_batch_pages_flush
=> tlb_finish_mmu
=> exit_mmap
=> __mmput
=> exit_mm
=> do_exit
=> do_group_exit
=> __x64_sys_exit_group
=> do_syscall_64

So, when a process exits,

- a large number of user pages of the process will be freed without
page allocation, so it's highly possible that pcp->free_factor becomes
> 0.

- after freeing all user pages, the PGD will be freed, which is an
order-1 page freeing, so the PCP will be drained.

All in all, when a process exits, it's highly possible that the PCP
will be drained. This is an unexpected behavior.

To avoid this, in this patch, the PCP draining will only be triggered
after 2 consecutive high-order page freeing operations.

On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
one socket with `make -j 112`. With the patch, the build time
decreases 3.4% (from 206s to 199s). The cycles% of the spinlock
contention (mostly for zone lock) decreases from 43.6% to 40.3% (with
PCP size == 361). The number of PCP draining for high order pages
freeing (free_high) decreases 50.8%.

This helps network workloads too via reduced zone lock contention. On
a 2-socket Intel server with 128 logical CPU, with the patch, the
network bandwidth of the UNIX (AF_UNIX) test case of the lmbench test
suite with 16-pair processes increases 17.1%. The cycles% of the
spinlock contention (mostly for zone lock) decreases from 50.0% to
45.8%. The number of PCP draining for high order pages
freeing (free_high) decreases 27.4%. The cache miss rate stays at 0.3%.

Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mmzone.h | 5 ++++-
mm/page_alloc.c | 11 ++++++++---
2 files changed, 12 insertions(+), 4 deletions(-)

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 4106fbc5b4b3..64d5ed2bb724 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -676,12 +676,15 @@ enum zone_watermarks {
#define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
#define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)

+#define PCPF_PREV_FREE_HIGH_ORDER 0x01
+
struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
- short free_factor; /* batch scaling factor during free */
+ u8 flags; /* protected by pcp->lock */
+ u8 free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
short expire; /* When 0, remote pagesets are drained */
#endif
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0c5be12f9336..828dcc24b030 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2370,7 +2370,7 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
{
int high;
int pindex;
- bool free_high;
+ bool free_high = false;

__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
@@ -2383,8 +2383,13 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
* freeing without allocation. The remainder after bulk freeing
* stops will be drained from vmstat refresh context.
*/
- free_high = (pcp->free_factor && order && order <= PAGE_ALLOC_COSTLY_ORDER);
-
+ if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
+ free_high = (pcp->free_factor &&
+ (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER));
+ pcp->flags |= PCPF_PREV_FREE_HIGH_ORDER;
+ } else if (pcp->flags & PCPF_PREV_FREE_HIGH_ORDER) {
+ pcp->flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
+ }
high = nr_pcp_high(pcp, zone, free_high);
if (pcp->count >= high) {
free_pcppages_bulk(zone, nr_pcp_free(pcp, high, free_high), pcp, pindex);
--
2.39.2

2023-09-22 01:55:11

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning

On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <[email protected]> wrote:

> > : It's possible that PCP high auto-tuning doesn't work well for some
> > : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> > : the auto-tuning will be disabled. The PCP high set by hand will be
> > : used instead.
> >
> > Is it a bit hacky to disable autotuning when the user alters
> > pcp-high? Would it be cleaner to have a separate on/off knob for
> > autotuning?
>
> This was suggested by Mel Gorman,
>
> https://lore.kernel.org/linux-mm/[email protected]/
>
> "
> I'm not opposed to having an adaptive pcp->high in concept. I think it would
> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
> though. I expect that users of that tunable are rare and that if it *is*
> used that there is a very good reason for it.
> "
>
> Do you think that this is reasonable?

I suppose so, if it's documented!

Documentation/admin-guide/sysctl/vm.rst describes
percpu_pagelist_high_fraction.

2023-09-22 02:49:12

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning

Hi, Andrew,

Andrew Morton <[email protected]> writes:

> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <[email protected]> wrote:
>
>> The page allocation performance requirements of different workloads
>> are often different. So, we need to tune the PCP (Per-CPU Pageset)
>> high on each CPU automatically to optimize the page allocation
>> performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era). But these numbers
> make me think it's very important and we should have been paying more
> attention.
>
>> The list of patches in series is as follows,
>>
>> 1 mm, pcp: avoid to drain PCP when process exit
>> 2 cacheinfo: calculate per-CPU data cache size
>> 3 mm, pcp: reduce lock contention for draining high-order pages
>> 4 mm: restrict the pcp batch scale factor to avoid too long latency
>> 5 mm, page_alloc: scale the number of pages that are batch allocated
>> 6 mm: add framework for PCP high auto-tuning
>> 7 mm: tune PCP high automatically
>> 8 mm, pcp: decrease PCP high if free pages < high watermark
>> 9 mm, pcp: avoid to reduce PCP high unnecessarily
>> 10 mm, pcp: reduce detecting time of consecutive high order page freeing
>>
>> Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
>> freeing.
>>
>> Patch 4/5 optimize batch freeing and allocating.
>>
>> Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
>>
>> Patch 10 optimize the PCP draining for consecutive high order page
>> freeing based on PCP high auto-tuning.
>>
>> The test results for patches with performance impact are as follows,
>>
>> kbuild
>> ======
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`.
>>
>> build time zone lock% free_high alloc_zone
>> ---------- ---------- --------- ----------
>> base 100.0 43.6 100.0 100.0
>> patch1 96.6 40.3 49.2 95.2
>> patch3 96.4 40.5 11.3 95.1
>> patch5 96.1 37.9 13.3 96.8
>> patch7 86.4 9.8 6.2 22.0
>> patch9 85.9 9.4 4.8 16.3
>> patch10 87.7 12.6 29.0 32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?

Thank you very much for questioning!

I double-checked my test results and configuration and found that I
used an uncommon configuration. So the description of the test should
have been,

On a 2-socket Intel server with 224 logical CPU, we tested kbuild with
`numactl -m 1 -- make -j 112`.

This makes processes running on socket 0 use the normal zone of
socket 1. The remote accesses to zone->lock cause heavy lock
contention.

I apologize for any confusion caused by the above test results.

If we test kbuild with `make -j 224` on the machine, the test results
become,

build time lock% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 16.8 100.0 100.0
patch5 99.2 13.9 9.5 97.0
patch7 98.5 5.4 4.8 19.2

Although the lock contention cycles%, the PCP draining for high-order
freeing, and the allocation from zone reduce greatly, the build time
almost doesn't change.

We also tested kbuild in another way: we created 8 cgroups and ran
`make -j 28` in each cgroup. That is, the total parallelism is the
same, but LRU lock contention can be eliminated via cgroups, and the
single-process link stage takes a smaller proportion of the run
relative to the parallel compiling stage. This isn't common for
personal usage, but it can be used by something like the 0-Day kbuild
service. The test result is as follows (a rough sketch of the cgroup
setup is given below),

build time lock% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 14.2 100.0 100.0
patch5 98.5 8.5 8.1 97.1
patch7 95.0 0.7 3.0 19.0

The lock contention cycles% reduces to nearly 0 because LRU lock
contention is eliminated too, and the build time reduction becomes
visible. We will continue to do a full test with this configuration.
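
For reference, a minimal sketch of the cgroup setup used above (only an
illustration, assuming cgroup v2 mounted at /sys/fs/cgroup, a root
shell, and per-instance source trees at the hypothetical paths
/path/to/linux-$i; the actual test scripts may differ):

  # enable the memory controller so each group gets its own LRU lists
  echo +memory > /sys/fs/cgroup/cgroup.subtree_control
  for i in $(seq 0 7); do
          mkdir -p /sys/fs/cgroup/kbuild$i
          # move a shell into the group, then run one kbuild instance in it
          bash -c "echo \$\$ > /sys/fs/cgroup/kbuild$i/cgroup.procs; \
                   exec make -C /path/to/linux-$i -j 28" &
  done
  wait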

> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
> monitor the operation of the autotuning algorithm?

/proc/zoneinfo can be used to observe PCP high and count for each CPU.
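
For example (the command is just an illustration and the numbers below
are made up; with this series, "high" varies per CPU over time as it is
tuned):

  $ grep -A3 'cpu:' /proc/zoneinfo
      cpu: 0
                count: 177
                high:  529
                batch: 63
      cpu: 1
                count: 35
                high:  236
                batch: 63
  ...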

> - I'm not seeing any Documentation/ updates. Surely there are things
> we can tell users?

I will think about that.

> - This:
>
> : It's possible that PCP high auto-tuning doesn't work well for some
> : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> : the auto-tuning will be disabled. The PCP high set by hand will be
> : used instead.
>
> Is it a bit hacky to disable autotuning when the user alters
> pcp-high? Would it be cleaner to have a separate on/off knob for
> autotuning?

This was suggested by Mel Gorman,

https://lore.kernel.org/linux-mm/[email protected]/

"
I'm not opposed to having an adaptive pcp->high in concept. I think it would
be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
though. I expect that users of that tunable are rare and that if it *is*
used that there is a very good reason for it.
"

Do you think that this is reasonable?

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

One way is to check the perf profiling results. If there is heavy zone
lock contention, the PCP high auto-tuning doesn't work well enough to
eliminate the zone lock contention. Users may try to tune PCP high by
hand.
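
For example, something like the following (only an illustration; the
relevant symbols depend on the architecture and kernel config):

  # profile the whole system while the workload is running
  perf record -a -g -- sleep 10
  perf report --no-children

  # Heavy cycles in native_queued_spin_lock_slowpath with call chains
  # through get_page_from_freelist / rmqueue_bulk (allocation) or
  # free_pcppages_bulk (freeing) point at zone->lock contention.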

--
Best Regards,
Huang, Ying

2023-09-22 04:56:31

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning

Andrew Morton <[email protected]> writes:

> On Thu, 21 Sep 2023 21:32:35 +0800 "Huang, Ying" <[email protected]> wrote:
>
>> > : It's possible that PCP high auto-tuning doesn't work well for some
>> > : workloads. So, when PCP high is tuned by hand via the sysctl knob,
>> > : the auto-tuning will be disabled. The PCP high set by hand will be
>> > : used instead.
>> >
>> > Is it a bit hacky to disable autotuning when the user alters
>> > pcp-high? Would it be cleaner to have a separate on/off knob for
>> > autotuning?
>>
>> This was suggested by Mel Gorman,
>>
>> https://lore.kernel.org/linux-mm/[email protected]/
>>
>> "
>> I'm not opposed to having an adaptive pcp->high in concept. I think it would
>> be best to disable adaptive tuning if percpu_pagelist_high_fraction is set
>> though. I expect that users of that tunable are rare and that if it *is*
>> used that there is a very good reason for it.
>> "
>>
>> Do you think that this is reasonable?
>
> I suppose so, if it's documented!
>
> Documentation/admin-guide/sysctl/vm.rst describes
> percpu_pagelist_high_fraction.

Sure. Will add documentation about the auto-tuning behavior to the
above document.

--
Best Regards,
Huang, Ying

2023-10-11 12:24:32

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> Per-CPU data cache size is useful information. For example, it can be
> used to determine per-CPU cache size. So, in this patch, the data
> cache size for each CPU is calculated via data_cache_size /
> shared_cpu_weight.
>
> A brute-force algorithm to iterate all online CPUs is used to avoid
> to allocate an extra cpumask, especially in offline callback.
>
> Signed-off-by: "Huang, Ying" <[email protected]>

It's not necessarily relevant to the patch, but at least the scheduler
also stores some per-cpu topology information such as sd_llc_size -- the
number of CPUs sharing the same last-level-cache as this CPU. It may be
worth unifying this at some point if it's common that per-cpu
information is too fine and per-zone or per-node information is too
coarse. This would be particularly true when considering locking
granularity.

> Cc: Sudeep Holla <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> ---
> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
> include/linux/cacheinfo.h | 1 +
> 2 files changed, 42 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> index cbae8be1fe52..3e8951a3fbab 100644
> --- a/drivers/base/cacheinfo.c
> +++ b/drivers/base/cacheinfo.c
> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> return rc;
> }
>
> +static void update_data_cache_size_cpu(unsigned int cpu)
> +{
> + struct cpu_cacheinfo *ci;
> + struct cacheinfo *leaf;
> + unsigned int i, nr_shared;
> + unsigned int size_data = 0;
> +
> + if (!per_cpu_cacheinfo(cpu))
> + return;
> +
> + ci = ci_cacheinfo(cpu);
> + for (i = 0; i < cache_leaves(cpu); i++) {
> + leaf = per_cpu_cacheinfo_idx(cpu, i);
> + if (leaf->type != CACHE_TYPE_DATA &&
> + leaf->type != CACHE_TYPE_UNIFIED)
> + continue;
> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> + if (!nr_shared)
> + continue;
> + size_data += leaf->size / nr_shared;
> + }
> + ci->size_data = size_data;
> +}

This needs comments.

It would be nice to add a comment on top describing the limitation of
CACHE_TYPE_UNIFIED here in the context of update_data_cache_size_cpu().
The L2 cache could be unified but much smaller than an L3 or other
last-level-cache. It's not clear from the code what level of cache is
being used, due to a lack of familiarity with the cpu_cacheinfo code,
but size_data is not the size of a cache; it appears to be the share of
a cache a CPU would have under ideal circumstances. However, as it
appears to also be iterating the hierarchy, this may not be accurate.
Caches may or may not allow data to be duplicated between levels, so
the value may be inaccurate.

A respin of the patch is not necessary but a follow-on patch adding
clarifying comments would be very welcome covering

o What levels of cache are being used
o Describe what size_data actually is and preferably rename the field
to be more explicit as "size" could be the total cache capacity, the
cache slice under ideal circumstances or even the number of CPUs sharing
that cache.

The cache details *may* need a follow-on patch if the size_data value is
misleading. If it is a hierarchy and the value does not always represent
the slice of cache a CPU could have under ideal circumstances then the
value should be based on the LLC only so that it is predictable across
architectures.

--
Mel Gorman
SUSE Labs

2023-10-11 12:47:23

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

On Wed, Sep 20, 2023 at 02:18:47PM +0800, Huang Ying wrote:
> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
> drained when PCP is mostly used for high-order pages freeing to
> improve the cache-hot pages reusing between page allocation and
> freeing CPUs.
>
> But, the PCP draining mechanism may be triggered unexpectedly when
> process exits. With some customized trace point, it was found that
> PCP draining (free_high == true) was triggered with the order-1 page
> freeing with the following call stack,
>
> => free_unref_page_commit
> => free_unref_page
> => __mmdrop
> => exit_mm
> => do_exit
> => do_group_exit
> => __x64_sys_exit_group
> => do_syscall_64
>
> Checking the source code, this is the page table PGD
> freeing (mm_free_pgd()). It's a order-1 page freeing if
> CONFIG_PAGE_TABLE_ISOLATION=y. Which is a common configuration for
> security.
>
> Just before that, page freeing with the following call stack was
> found,
>
> => free_unref_page_commit
> => free_unref_page_list
> => release_pages
> => tlb_batch_pages_flush
> => tlb_finish_mmu
> => exit_mmap
> => __mmput
> => exit_mm
> => do_exit
> => do_group_exit
> => __x64_sys_exit_group
> => do_syscall_64
>
> So, when a process exits,
>
> - a large number of user pages of the process will be freed without
> page allocation, it's highly possible that pcp->free_factor becomes
> > 0.
>
> - after freeing all user pages, the PGD will be freed, which is a
> order-1 page freeing, PCP will be drained.
>
> All in all, when a process exits, it's high possible that the PCP will
> be drained. This is an unexpected behavior.
>
> To avoid this, in the patch, the PCP draining will only be triggered
> for 2 consecutive high-order page freeing.
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`. With the patch, the build time
> decreases 3.4% (from 206s to 199s). The cycles% of the spinlock
> contention (mostly for zone lock) decreases from 43.6% to 40.3% (with
> PCP size == 361). The number of PCP draining for high order pages
> freeing (free_high) decreases 50.8%.
>
> This helps network workload too for reduced zone lock contention. On
> a 2-socket Intel server with 128 logical CPU, with the patch, the
> network bandwidth of the UNIX (AF_UNIX) test case of lmbench test
> suite with 16-pair processes increase 17.1%. The cycles% of the
> spinlock contention (mostly for zone lock) decreases from 50.0% to
> 45.8%. The number of PCP draining for high order pages
> freeing (free_high) decreases 27.4%. The cache miss rate keeps 0.3%.
>
> Signed-off-by: "Huang, Ying" <[email protected]>

Acked-by: Mel Gorman <[email protected]>

However, I want to note that batching on exit is not necessarily
unexpected. For processes that are multi-TB in size, the time to exit
can actually be quite large and batching is of benefit, but optimising
for exit is rarely a winning strategy. The pattern of "all allocs on
CPU A and all frees on CPU B" or "short-lived tasks triggering a
premature drain" is a bit more compelling but not worth a changelog
rewrite.
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 4106fbc5b4b3..64d5ed2bb724 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -676,12 +676,15 @@ enum zone_watermarks {
> #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
> #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
>
> +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> +

The meaning of the flag and its intent should have been documented.

--
Mel Gorman
SUSE Labs

2023-10-11 12:49:21

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm, pcp: reduce lock contention for draining high-order pages

On Wed, Sep 20, 2023 at 02:18:49PM +0800, Huang Ying wrote:
> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
> drained when PCP is mostly used for high-order pages freeing to
> improve the cache-hot pages reusing between page allocating and
> freeing CPUs.
>
> On system with small per-CPU data cache, pages shouldn't be cached
> before draining to guarantee cache-hot. But on a system with large
> per-CPU data cache, more pages can be cached before draining to reduce
> zone lock contention.
>
> So, in this patch, instead of draining without any caching, "batch"
> pages will be cached in PCP before draining if the per-CPU data cache
> size is more than "4 * batch".
>
> On a 2-socket Intel server with 128 logical CPU, with the patch, the
> network bandwidth of the UNIX (AF_UNIX) test case of lmbench test
> suite with 16-pair processes increase 72.2%. The cycles% of the
> spinlock contention (mostly for zone lock) decreases from 45.8% to
> 21.2%. The number of PCP draining for high order pages
> freeing (free_high) decreases 89.8%. The cache miss rate keeps 0.3%.
>
> Signed-off-by: "Huang, Ying" <[email protected]>

Acked-by: Mel Gorman <[email protected]>

However, the flag should also have been documented to make it clear that
it preserves some pages on the PCP if the cache is large enough. Similar
to the previous patch, it would have been easier to reason about in the
general case if the decision had only been based on the LLC without
having to worry if any intermediate layer has a meaningful impact that
varies across CPU implementations.

--
Mel Gorman
SUSE Labs

2023-10-11 12:52:39

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: restrict the pcp batch scale factor to avoid too long latency

On Wed, Sep 20, 2023 at 02:18:50PM +0800, Huang Ying wrote:
> In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
> batches to increase page allocation throughput, reduce page
> allocation/freeing latency per page, and reduce zone lock contention.
> But too large batch size will cause too long maximal
> allocation/freeing latency, which may punish arbitrary users. So the
> default batch size is chosen carefully (in zone_batchsize(), the value
> is 63 for zone > 1GB) to avoid that.
>
> In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
> are batch freed"), the batch size will be scaled for large number of
> page freeing to improve page freeing performance and reduce zone lock
> contention. Similar optimization can be used for large number of
> pages allocation too.
>
> To find out a suitable max batch scale factor (that is, max effective
> batch size), some tests and measurement on some machines were done as
> follows.
>
> A set of debug patches are implemented as follows,
>
> - Set PCP high to be 2 * batch to reduce the effect of PCP high
>
> - Disable free batch size scaling to get the raw performance.
>
> - The code with zone lock held is extracted from rmqueue_bulk() and
> free_pcppages_bulk() to 2 separate functions to make it easy to
> measure the function run time with ftrace function_graph tracer.
>
> - The batch size is hard coded to be 63 (default), 127, 255, 511,
> 1023, 2047, 4095.
>
> Then will-it-scale/page_fault1 is used to generate the page
> allocation/freeing workload. The page allocation/freeing throughput
> (page/s) is measured via will-it-scale. The page allocation/freeing
> average latency (alloc/free latency avg, in us) and allocation/freeing
> latency at 99 percentile (alloc/free latency 99%, in us) are measured
> with ftrace function_graph tracer.
>
> The test results are as follows,
>
> Sapphire Rapids Server
> ======================
> Batch throughput free latency free latency alloc latency alloc latency
> page/s avg / us 99% / us avg / us 99% / us
> ----- ---------- ------------ ------------ ------------- -------------
> 63 513633.4 2.33 3.57 2.67 6.83
> 127 517616.7 4.35 6.65 4.22 13.03
> 255 520822.8 8.29 13.32 7.52 25.24
> 511 524122.0 15.79 23.42 14.02 49.35
> 1023 525980.5 30.25 44.19 25.36 94.88
> 2047 526793.6 59.39 84.50 45.22 140.81
>
> Ice Lake Server
> ===============
> Batch throughput free latency free latency alloc latency alloc latency
> page/s avg / us 99% / us avg / us 99% / us
> ----- ---------- ------------ ------------ ------------- -------------
> 63 620210.3 2.21 3.68 2.02 4.35
> 127 627003.0 4.09 6.86 3.51 8.28
> 255 630777.5 7.70 13.50 6.17 15.97
> 511 633651.5 14.85 22.62 11.66 31.08
> 1023 637071.1 28.55 42.02 20.81 54.36
> 2047 638089.7 56.54 84.06 39.28 91.68
>
> Cascade Lake Server
> ===================
> Batch throughput free latency free latency alloc latency alloc latency
> page/s avg / us 99% / us avg / us 99% / us
> ----- ---------- ------------ ------------ ------------- -------------
> 63 404706.7 3.29 5.03 3.53 4.75
> 127 422475.2 6.12 9.09 6.36 8.76
> 255 411522.2 11.68 16.97 10.90 16.39
> 511 428124.1 22.54 31.28 19.86 32.25
> 1023 414718.4 43.39 62.52 40.00 66.33
> 2047 429848.7 86.64 120.34 71.14 106.08
>
> Comet Lake Desktop
> ==================
> Batch throughput free latency free latency alloc latency alloc latency
> page/s avg / us 99% / us avg / us 99% / us
> ----- ---------- ------------ ------------ ------------- -------------
> 63 795183.13 2.18 3.55 2.03 3.05
> 127 803067.85 3.91 6.56 3.85 5.52
> 255 812771.10 7.35 10.80 7.14 10.20
> 511 817723.48 14.17 27.54 13.43 30.31
> 1023 818870.19 27.72 40.10 27.89 46.28
>
> Coffee Lake Desktop
> ===================
> Batch throughput free latency free latency alloc latency alloc latency
> page/s avg / us 99% / us avg / us 99% / us
> ----- ---------- ------------ ------------ ------------- -------------
> 63 510542.8 3.13 4.40 2.48 3.43
> 127 514288.6 5.97 7.89 4.65 6.04
> 255 516889.7 11.86 15.58 8.96 12.55
> 511 519802.4 23.10 28.81 16.95 26.19
> 1023 520802.7 45.30 52.51 33.19 45.95
> 2047 519997.1 90.63 104.00 65.26 81.74
>
> From the above data, to restrict the allocation/freeing latency to be
> less than 100 us in most times, the max batch scale factor needs to be
> less than or equal to 5.
>
> So, in this patch, the batch scale factor is restricted to be less
> than or equal to 5.
>
> Signed-off-by: "Huang, Ying" <[email protected]>

Acked-by: Mel Gorman <[email protected]>

However, it's worth noting that the time to free depends on the CPU,
and while the CPUs you tested are reasonable, there are also slower
CPUs out there and I have at least one account where the time is
excessive. While this patch is fine, there may be a patch on top that
makes this runtime configurable, a Kconfig default, or both.

--
Mel Gorman
SUSE Labs

2023-10-11 12:54:49

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 05/10] mm, page_alloc: scale the number of pages that are batch allocated

On Wed, Sep 20, 2023 at 02:18:51PM +0800, Huang Ying wrote:
> When a task is allocating a large number of order-0 pages, it may
> acquire the zone->lock multiple times allocating pages in batches.
> This may unnecessarily contend on the zone lock when allocating very
> large number of pages. This patch adapts the size of the batch based
> on the recent pattern to scale the batch size for subsequent
> allocations.
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`. With the patch, the cycles% of the
> spinlock contention (mostly for zone lock) decreases from 40.5% to
> 37.9% (with PCP size == 361).
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Suggested-by: Mel Gorman <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2023-10-11 13:06:20

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 00/10] mm: PCP high auto-tuning

On Wed, Sep 20, 2023 at 09:41:18AM -0700, Andrew Morton wrote:
> On Wed, 20 Sep 2023 14:18:46 +0800 Huang Ying <[email protected]> wrote:
>
> > The page allocation performance requirements of different workloads
> > are often different. So, we need to tune the PCP (Per-CPU Pageset)
> > high on each CPU automatically to optimize the page allocation
> > performance.
>
> Some of the performance changes here are downright scary.
>
> I've never been very sure that percpu pages was very beneficial (and
> hey, I invented the thing back in the Mesozoic era). But these numbers
> make me think it's very important and we should have been paying more
> attention.
>

FWIW, it is because not only does it avoid lock contention issues, it
avoids excessive splitting/merging of buddies as well as the slower
paths of the allocator. It is not very satisfactory and frankly, the
whole page allocator needs a revisit to account for very large zones but
it is far from a trivial project. PCP just masks the worst of the issues
and replacing it is far harder than tweaking it.

> > The list of patches in series is as follows,
> >
> > 1 mm, pcp: avoid to drain PCP when process exit
> > 2 cacheinfo: calculate per-CPU data cache size
> > 3 mm, pcp: reduce lock contention for draining high-order pages
> > 4 mm: restrict the pcp batch scale factor to avoid too long latency
> > 5 mm, page_alloc: scale the number of pages that are batch allocated
> > 6 mm: add framework for PCP high auto-tuning
> > 7 mm: tune PCP high automatically
> > 8 mm, pcp: decrease PCP high if free pages < high watermark
> > 9 mm, pcp: avoid to reduce PCP high unnecessarily
> > 10 mm, pcp: reduce detecting time of consecutive high order page freeing
> >
> > Patch 1/2/3 optimize the PCP draining for consecutive high-order pages
> > freeing.
> >
> > Patch 4/5 optimize batch freeing and allocating.
> >
> > Patch 6/7/8/9 implement and optimize a PCP high auto-tuning method.
> >
> > Patch 10 optimize the PCP draining for consecutive high order page
> > freeing based on PCP high auto-tuning.
> >
> > The test results for patches with performance impact are as follows,
> >
> > kbuild
> > ======
> >
> > On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> > one socket with `make -j 112`.
> >
> > build time zone lock% free_high alloc_zone
> > ---------- ---------- --------- ----------
> > base 100.0 43.6 100.0 100.0
> > patch1 96.6 40.3 49.2 95.2
> > patch3 96.4 40.5 11.3 95.1
> > patch5 96.1 37.9 13.3 96.8
> > patch7 86.4 9.8 6.2 22.0
> > patch9 85.9 9.4 4.8 16.3
> > patch10 87.7 12.6 29.0 32.3
>
> You're seriously saying that kbuild got 12% faster?
>
> I see that [07/10] (autotuning) alone sped up kbuild by 10%?
>
> Other thoughts:
>
> - What if any facilities are provided to permit users/developers to
> monitor the operation of the autotuning algorithm?
>

Not that I've seen yet, but I'm still partway through the series. It
could be monitored with tracepoints, but it can also be inferred from
lock contention issues. I think it would only be meaningful for
developers to monitor this closely, at least that's what I think now.
Honestly, I'm more worried about potential changes in behaviour
depending on the exact CPU and cache implementation than I am about
being able to actively monitor it.

> - I'm not seeing any Documentation/ updates. Surely there are things
> we can tell users?
>
> - This:
>
> : It's possible that PCP high auto-tuning doesn't work well for some
> : workloads. So, when PCP high is tuned by hand via the sysctl knob,
> : the auto-tuning will be disabled. The PCP high set by hand will be
> : used instead.
>
> Is it a bit hacky to disable autotuning when the user alters
> pcp-high? Would it be cleaner to have a separate on/off knob for
> autotuning?
>

It might be but tuning the allocator is very specific and once we
introduce that tunable, we're probably stuck with it. I would prefer to
see it introduced if and only if we have to.

> And how is the user to determine that "PCP high auto-tuning doesn't work
> well" for their workload?

Not easily. It may manifest as variable lock contention issues when the
workload is at a steady state but that would increase the pressure to
split the allocator away from being zone-based entirely instead of tweaking
PCP further.

--
Mel Gorman
SUSE Labs

2023-10-11 13:08:37

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm, pcp: decrease PCP high if free pages < high watermark

On Wed, Sep 20, 2023 at 02:18:54PM +0800, Huang Ying wrote:
> One target of PCP is to minimize pages in PCP if the system free pages
> is too few. To reach that target, when page reclaiming is active for
> the zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in
> allocating path, decrease PCP high and free some pages in freeing
> path. But this may be too late because the background page reclaiming
> may introduce latency for some workloads. So, in this patch, during
> page allocation we will detect whether the number of free pages of the
> zone is below high watermark. If so, we will stop increasing PCP high
> in allocating path, decrease PCP high and free some pages in freeing
> path. With this, we can reduce the possibility of the premature
> background page reclaiming caused by too large PCP.
>
> The high watermark checking is done in allocating path to reduce the
> overhead in hotter freeing path.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 22 ++++++++++++++++++++--
> 2 files changed, 21 insertions(+), 2 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index d6cfb5023f3e..8a19e2af89df 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -1006,6 +1006,7 @@ enum zone_flags {
> * Cleared when kswapd is woken.
> */
> ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
> + ZONE_BELOW_HIGH, /* zone is below high watermark. */
> };
>
> static inline unsigned long zone_managed_pages(struct zone *zone)
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 225abe56752c..3f8c7dfeed23 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
> return min(batch << 2, pcp->high);
> }
>
> - if (pcp->count >= high && high_min != high_max) {
> + if (high_min == high_max)
> + return high;
> +
> + if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
> + pcp->high = max(high - (batch << pcp->free_factor), high_min);
> + high = max(pcp->count, high_min);
> + } else if (pcp->count >= high) {
> int need_high = (batch << pcp->free_factor) + batch;
>
> /* pcp->high should be large enough to hold batch freed pages */
> @@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
> if (pcp->count >= high) {
> free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
> pcp, pindex);
> + if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
> + zone_watermark_ok(zone, 0, high_wmark_pages(zone),
> + ZONE_MOVABLE, 0))
> + clear_bit(ZONE_BELOW_HIGH, &zone->flags);
> }
> }
>
> @@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
> * If we had larger pcp->high, we could avoid to allocate from
> * zone.
> */
> - if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
> + if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
> high = pcp->high = min(high + batch, high_max);
>
> if (!order) {
> @@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
> }
> }
>
> + mark = high_wmark_pages(zone);
> + if (zone_watermark_fast(zone, order, mark,
> + ac->highest_zoneidx, alloc_flags,
> + gfp_mask))
> + goto try_this_zone;
> + else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
> + set_bit(ZONE_BELOW_HIGH, &zone->flags);
> +

This absolutely needs a comment explaining why, because superficially a
consequence of this is that allocator performance is slightly degraded
when below the high watermark. Being below the high watermark is
completely harmless and can persist indefinitely until something wakes
kswapd.

--
Mel Gorman
SUSE Labs

2023-10-11 14:10:17

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily

On Wed, Sep 20, 2023 at 02:18:55PM +0800, Huang Ying wrote:
> In PCP high auto-tuning algorithm, to minimize idle pages in PCP, in
> periodic vmstat updating kworker (via refresh_cpu_vm_stats()), we will
> decrease PCP high to try to free possible idle PCP pages. One issue
> is that even if the page allocating/freeing depth is larger than
> maximal PCP high, we may reduce PCP high unnecessarily.
>
> To avoid the above issue, in this patch, we will track the minimal PCP
> page count. And, the periodic PCP high decrement will not more than
> the recent minimal PCP page count. So, only detected idle pages will
> be freed.
>
> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
> one socket with `make -j 112`. With the patch, The number of pages
> allocated from zone (instead of from PCP) decreases 25.8%.
>
> Signed-off-by: "Huang, Ying" <[email protected]>
> Cc: Andrew Morton <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: David Hildenbrand <[email protected]>
> Cc: Johannes Weiner <[email protected]>
> Cc: Dave Hansen <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Pavel Tatashin <[email protected]>
> Cc: Matthew Wilcox <[email protected]>
> Cc: Christoph Lameter <[email protected]>
> ---
> include/linux/mmzone.h | 1 +
> mm/page_alloc.c | 15 ++++++++++-----
> 2 files changed, 11 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
> index 8a19e2af89df..35b78c7522a7 100644
> --- a/include/linux/mmzone.h
> +++ b/include/linux/mmzone.h
> @@ -682,6 +682,7 @@ enum zone_watermarks {
> struct per_cpu_pages {
> spinlock_t lock; /* Protects lists field */
> int count; /* number of pages in the list */
> + int count_min; /* minimal number of pages in the list recently */
> int high; /* high watermark, emptying needed */
> int high_min; /* min high watermark */
> int high_max; /* max high watermark */
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3f8c7dfeed23..77e9b7b51688 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> */
> int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> {
> - int high_min, to_drain, batch;
> + int high_min, decrease, to_drain, batch;
> int todo = 0;
>
> high_min = READ_ONCE(pcp->high_min);
> batch = READ_ONCE(pcp->batch);
> /*
> - * Decrease pcp->high periodically to try to free possible
> - * idle PCP pages. And, avoid to free too many pages to
> - * control latency.
> + * Decrease pcp->high periodically to free idle PCP pages counted
> + * via pcp->count_min. And, avoid to free too many pages to
> + * control latency. This caps pcp->high decrement too.
> */
> if (pcp->high > high_min) {
> + decrease = min(pcp->count_min, pcp->high / 5);

Not directly related to this patch, but why 20%? It seems a bit
arbitrary. While this is not a fast path, using a divide rather than a
shift seems unnecessarily expensive.

> pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
> - pcp->high * 4 / 5, high_min);
> + pcp->high - decrease, high_min);
> if (pcp->high > high_min)
> todo++;
> }
> @@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
> todo++;
> }
>
> + pcp->count_min = pcp->count;
> +
> return todo;
> }
>
> @@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
> page = list_first_entry(list, struct page, pcp_list);
> list_del(&page->pcp_list);
> pcp->count -= 1 << order;
> + if (pcp->count < pcp->count_min)
> + pcp->count_min = pcp->count;

The accounting for this, though, is in a relatively fast path.

At the moment I don't have a better suggestion, but I'm not as keen on
this patch. It seems like it would have been more appropriate to decay
if there was no recent allocation activity tracked via pcp->flags. The
major caveat there is that tracking a bit and clearing it may very well
be in a fast path unless it was tied to refills, but that is subject to
timing issues and the allocation request stream :(

While you noted the difference in buddy allocations, which may tie into
lock contention issues, how much difference does it make to the actual
performance of the workload?

--
Mel Gorman
SUSE Labs

2023-10-11 17:16:55

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

On Wed, 11 Oct 2023 13:46:10 +0100 Mel Gorman <[email protected]> wrote:

> > --- a/include/linux/mmzone.h
> > +++ b/include/linux/mmzone.h
> > @@ -676,12 +676,15 @@ enum zone_watermarks {
> > #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
> > #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
> >
> > +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> > +
>
> The meaning of the flag and its intent should have been documented.

I need to rebase mm-stable for other reasons. So let's please
decide (soon) whether Mel's review comments can be addressed
via add-on patches or whether I should drop this version of this
series altogether, during that rebase.

2023-10-12 07:51:18

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:55PM +0800, Huang Ying wrote:
>> In PCP high auto-tuning algorithm, to minimize idle pages in PCP, in
>> periodic vmstat updating kworker (via refresh_cpu_vm_stats()), we will
>> decrease PCP high to try to free possible idle PCP pages. One issue
>> is that even if the page allocating/freeing depth is larger than
>> maximal PCP high, we may reduce PCP high unnecessarily.
>>
>> To avoid the above issue, in this patch, we will track the minimal PCP
>> page count. And, the periodic PCP high decrement will not more than
>> the recent minimal PCP page count. So, only detected idle pages will
>> be freed.
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`. With the patch, The number of pages
>> allocated from zone (instead of from PCP) decreases 25.8%.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Pavel Tatashin <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> ---
>> include/linux/mmzone.h | 1 +
>> mm/page_alloc.c | 15 ++++++++++-----
>> 2 files changed, 11 insertions(+), 5 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 8a19e2af89df..35b78c7522a7 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -682,6 +682,7 @@ enum zone_watermarks {
>> struct per_cpu_pages {
>> spinlock_t lock; /* Protects lists field */
>> int count; /* number of pages in the list */
>> + int count_min; /* minimal number of pages in the list recently */
>> int high; /* high watermark, emptying needed */
>> int high_min; /* min high watermark */
>> int high_max; /* max high watermark */
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 3f8c7dfeed23..77e9b7b51688 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2166,19 +2166,20 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
>> */
>> int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>> {
>> - int high_min, to_drain, batch;
>> + int high_min, decrease, to_drain, batch;
>> int todo = 0;
>>
>> high_min = READ_ONCE(pcp->high_min);
>> batch = READ_ONCE(pcp->batch);
>> /*
>> - * Decrease pcp->high periodically to try to free possible
>> - * idle PCP pages. And, avoid to free too many pages to
>> - * control latency.
>> + * Decrease pcp->high periodically to free idle PCP pages counted
>> + * via pcp->count_min. And, avoid to free too many pages to
>> + * control latency. This caps pcp->high decrement too.
>> */
>> if (pcp->high > high_min) {
>> + decrease = min(pcp->count_min, pcp->high / 5);
>
> Not directly related to this patch but why 20%, it seems a bit
> arbitrary. While this is not an fast path, using a divide rather than a
> shift seems unnecessarily expensive.

Yes. The number chosen is kind of arbitrary. Will use ">> 3" (/ 8).

>> pcp->high = max3(pcp->count - (batch << PCP_BATCH_SCALE_MAX),
>> - pcp->high * 4 / 5, high_min);
>> + pcp->high - decrease, high_min);
>> if (pcp->high > high_min)
>> todo++;
>> }
>> @@ -2191,6 +2192,8 @@ int decay_pcp_high(struct zone *zone, struct per_cpu_pages *pcp)
>> todo++;
>> }
>>
>> + pcp->count_min = pcp->count;
>> +
>> return todo;
>> }
>>
>> @@ -2828,6 +2831,8 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
>> page = list_first_entry(list, struct page, pcp_list);
>> list_del(&page->pcp_list);
>> pcp->count -= 1 << order;
>> + if (pcp->count < pcp->count_min)
>> + pcp->count_min = pcp->count;
>
> While the accounting for this is in a relatively fast path.
>
> At the moment I don't have a better suggestion but I'm not as keen on
> this patch. It seems like it would have been more appropriate to decay if
> there was no recent allocation activity tracked via pcp->flags. The major
> caveat there is tracking a bit and clearing it may very well be in a fast
> path unless it was tried to refills but that is subject to timing issues
> and the allocation request stream :(
>
> While you noted the difference in buddy allocations which may tie into
> lock contention issues, how much difference to it make to the actual
> performance of the workload?

Thanks to Andrew for his reminder about the test results. I found that
I had used an uncommon configuration to test kbuild in V1 of the
patchset. So, I sent out V2 of the patchset as follows, with only the
test results and documentation changed.

https://lore.kernel.org/linux-mm/[email protected]/

So, for performance data, please refer to V2 of the patchset. For this
patch, the performance data are,

"
On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
instances in parallel (each with `make -j 28`) in 8 cgroup. This
simulates the kbuild server that is used by 0-Day kbuild service.
With the patch, The number of pages allocated from zone (instead of
from PCP) decreases 21.4%.
"

I also showed the performance numbers for each step of the optimization
as follows (copied from the above patchset V2 link).

"
build time lock contend% free_high alloc_zone
---------- ---------- --------- ----------
base 100.0 13.5 100.0 100.0
patch1 99.2 10.6 19.2 95.6
patch3 99.2 11.7 7.1 95.6
patch5 98.4 10.0 8.2 97.1
patch7 94.9 0.7 3.0 19.0
patch9 94.9 0.6 2.7 15.0 <-- this patch
patch10 94.9 0.9 8.8 18.6
"

Although I think the patch is helpful in avoiding unnecessary
pcp->high decay and thus reducing the zone lock contention, there's no
visible benchmark score change for the patch.

--
Best Regards,
Huang, Ying

2023-10-12 12:11:49

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> Per-CPU data cache size is useful information. For example, it can be
>> used to determine per-CPU cache size. So, in this patch, the data
>> cache size for each CPU is calculated via data_cache_size /
>> shared_cpu_weight.
>>
>> A brute-force algorithm to iterate all online CPUs is used to avoid
>> to allocate an extra cpumask, especially in offline callback.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>
> It's not necessarily relevant to the patch, but at least the scheduler
> also stores some per-cpu topology information such as sd_llc_size -- the
> number of CPUs sharing the same last-level-cache as this CPU. It may be
> worth unifying this at some point if it's common that per-cpu
> information is too fine and per-zone or per-node information is too
> coarse. This would be particularly true when considering locking
> granularity,
>
>> Cc: Sudeep Holla <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Pavel Tatashin <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> ---
>> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
>> include/linux/cacheinfo.h | 1 +
>> 2 files changed, 42 insertions(+), 1 deletion(-)
>>
>> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> index cbae8be1fe52..3e8951a3fbab 100644
>> --- a/drivers/base/cacheinfo.c
>> +++ b/drivers/base/cacheinfo.c
>> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> return rc;
>> }
>>
>> +static void update_data_cache_size_cpu(unsigned int cpu)
>> +{
>> + struct cpu_cacheinfo *ci;
>> + struct cacheinfo *leaf;
>> + unsigned int i, nr_shared;
>> + unsigned int size_data = 0;
>> +
>> + if (!per_cpu_cacheinfo(cpu))
>> + return;
>> +
>> + ci = ci_cacheinfo(cpu);
>> + for (i = 0; i < cache_leaves(cpu); i++) {
>> + leaf = per_cpu_cacheinfo_idx(cpu, i);
>> + if (leaf->type != CACHE_TYPE_DATA &&
>> + leaf->type != CACHE_TYPE_UNIFIED)
>> + continue;
>> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> + if (!nr_shared)
>> + continue;
>> + size_data += leaf->size / nr_shared;
>> + }
>> + ci->size_data = size_data;
>> +}
>
> This needs comments.
>
> It would be nice to add a comment on top describing the limitation of
> CACHE_TYPE_UNIFIED here in the context of
> update_data_cache_size_cpu().

Sure. Will do that.

> The L2 cache could be unified but much smaller than a L3 or other
> last-level-cache. It's not clear from the code what level of cache is being
> used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> is not the size of a cache, it appears to be the share of a cache a CPU
> would have under ideal circumstances.

Yes. And it isn't for one specific level of cache; it's the sum of the
per-CPU shares of all levels of cache. But the calculation is
inaccurate. More details are in the reply below.

> However, as it appears to also be
> iterating hierarchy then this may not be accurate. Caches may or may not
> allow data to be duplicated between levels so the value may be inaccurate.

Thank you very much for pointing this out! A cache can be inclusive
or not, so we cannot calculate the per-CPU slice across all cache
levels by blindly adding them together. I will change this in a
follow-on patch.

> A respin of the patch is not necessary but a follow-on patch adding
> clarifing comments would be very welcome covering
>
> o What levels of cache are being used
> o Describe what size_data actually is and preferably rename the field
> to be more explicit as "size" could be the total cache capacity, the
> cache slice under ideal circumstances or even the number of CPUs sharing
> that cache.

Sure.

> The cache details *may* need a follow-on patch if the size_data value is
> misleading. If it is a hierarchy and the value does not always represent
> the slice of cache a CPU could have under ideal circumstances then the
> value should be based on the LLC only so that it is predictable across
> architectures.

Sure.

--
Best Regards,
Huang, Ying

2023-10-12 12:13:40

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 03/10] mm, pcp: reduce lock contention for draining high-order pages

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:49PM +0800, Huang Ying wrote:
>> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
>> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
>> drained when PCP is mostly used for high-order pages freeing to
>> improve the cache-hot pages reusing between page allocating and
>> freeing CPUs.
>>
>> On system with small per-CPU data cache, pages shouldn't be cached
>> before draining to guarantee cache-hot. But on a system with large
>> per-CPU data cache, more pages can be cached before draining to reduce
>> zone lock contention.
>>
>> So, in this patch, instead of draining without any caching, "batch"
>> pages will be cached in PCP before draining if the per-CPU data cache
>> size is more than "4 * batch".
>>
>> On a 2-socket Intel server with 128 logical CPU, with the patch, the
>> network bandwidth of the UNIX (AF_UNIX) test case of lmbench test
>> suite with 16-pair processes increase 72.2%. The cycles% of the
>> spinlock contention (mostly for zone lock) decreases from 45.8% to
>> 21.2%. The number of PCP draining for high order pages
>> freeing (free_high) decreases 89.8%. The cache miss rate keeps 0.3%.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
>
> However, the flag should also have been documented to make it clear that
> it preserves some pages on the PCP if the cache is large enough.

Sure. Will do this.

> Similar
> to the previous patch, it would have been easier to reason about in the
> general case if the decision had only been based on the LLC without
> having to worry if any intermediate layer has a meaningful impact that
> varies across CPU implementations.

Sure. Will do this.
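
For reference, here is a minimal sketch of the behaviour in question,
i.e. retaining some pages on the PCP when the per-CPU data cache is
large enough (hypothetical code, not the actual mm/page_alloc.c
implementation; the "4 * batch" threshold is the one from the changelog
above),

/*
 * Sketch only: when a free_high drain triggers, keep one batch of
 * pages on the PCP if the per-CPU data cache slice is large enough
 * for them to stay cache-hot, otherwise drain everything as before.
 */
static int sketch_nr_pcp_free_high(int count, int batch,
                                   unsigned long cache_slice_bytes)
{
        unsigned long batch_bytes = (unsigned long)batch * PAGE_SIZE;

        if (cache_slice_bytes > 4 * batch_bytes)
                return count - batch;   /* retain "batch" cache-hot pages */

        return count;                   /* small cache: drain all pages */
}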

--
Best Regards,
Huang, Ying

2023-10-12 12:18:13

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 04/10] mm: restrict the pcp batch scale factor to avoid too long latency

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:50PM +0800, Huang Ying wrote:
>> In page allocator, PCP (Per-CPU Pageset) is refilled and drained in
>> batches to increase page allocation throughput, reduce page
>> allocation/freeing latency per page, and reduce zone lock contention.
>> But too large batch size will cause too long maximal
>> allocation/freeing latency, which may punish arbitrary users. So the
>> default batch size is chosen carefully (in zone_batchsize(), the value
>> is 63 for zone > 1GB) to avoid that.
>>
>> In commit 3b12e7e97938 ("mm/page_alloc: scale the number of pages that
>> are batch freed"), the batch size will be scaled for large number of
>> page freeing to improve page freeing performance and reduce zone lock
>> contention. Similar optimization can be used for large number of
>> pages allocation too.
>>
>> To find out a suitable max batch scale factor (that is, max effective
>> batch size), some tests and measurement on some machines were done as
>> follows.
>>
>> A set of debug patches are implemented as follows,
>>
>> - Set PCP high to be 2 * batch to reduce the effect of PCP high
>>
>> - Disable free batch size scaling to get the raw performance.
>>
>> - The code with zone lock held is extracted from rmqueue_bulk() and
>> free_pcppages_bulk() to 2 separate functions to make it easy to
>> measure the function run time with ftrace function_graph tracer.
>>
>> - The batch size is hard coded to be 63 (default), 127, 255, 511,
>> 1023, 2047, 4095.
>>
>> Then will-it-scale/page_fault1 is used to generate the page
>> allocation/freeing workload. The page allocation/freeing throughput
>> (page/s) is measured via will-it-scale. The page allocation/freeing
>> average latency (alloc/free latency avg, in us) and allocation/freeing
>> latency at 99 percentile (alloc/free latency 99%, in us) are measured
>> with ftrace function_graph tracer.
>>
>> The test results are as follows,
>>
>> Sapphire Rapids Server
>> ======================
>> Batch throughput free latency free latency alloc latency alloc latency
>> page/s avg / us 99% / us avg / us 99% / us
>> ----- ---------- ------------ ------------ ------------- -------------
>> 63 513633.4 2.33 3.57 2.67 6.83
>> 127 517616.7 4.35 6.65 4.22 13.03
>> 255 520822.8 8.29 13.32 7.52 25.24
>> 511 524122.0 15.79 23.42 14.02 49.35
>> 1023 525980.5 30.25 44.19 25.36 94.88
>> 2047 526793.6 59.39 84.50 45.22 140.81
>>
>> Ice Lake Server
>> ===============
>> Batch throughput free latency free latency alloc latency alloc latency
>> page/s avg / us 99% / us avg / us 99% / us
>> ----- ---------- ------------ ------------ ------------- -------------
>> 63 620210.3 2.21 3.68 2.02 4.35
>> 127 627003.0 4.09 6.86 3.51 8.28
>> 255 630777.5 7.70 13.50 6.17 15.97
>> 511 633651.5 14.85 22.62 11.66 31.08
>> 1023 637071.1 28.55 42.02 20.81 54.36
>> 2047 638089.7 56.54 84.06 39.28 91.68
>>
>> Cascade Lake Server
>> ===================
>> Batch throughput free latency free latency alloc latency alloc latency
>> page/s avg / us 99% / us avg / us 99% / us
>> ----- ---------- ------------ ------------ ------------- -------------
>> 63 404706.7 3.29 5.03 3.53 4.75
>> 127 422475.2 6.12 9.09 6.36 8.76
>> 255 411522.2 11.68 16.97 10.90 16.39
>> 511 428124.1 22.54 31.28 19.86 32.25
>> 1023 414718.4 43.39 62.52 40.00 66.33
>> 2047 429848.7 86.64 120.34 71.14 106.08
>>
>> Comet Lake Desktop
>> ===================
>> Batch throughput free latency free latency alloc latency alloc latency
>> page/s avg / us 99% / us avg / us 99% / us
>> ----- ---------- ------------ ------------ ------------- -------------
>> 63 795183.13 2.18 3.55 2.03 3.05
>> 127 803067.85 3.91 6.56 3.85 5.52
>> 255 812771.10 7.35 10.80 7.14 10.20
>> 511 817723.48 14.17 27.54 13.43 30.31
>> 1023 818870.19 27.72 40.10 27.89 46.28
>>
>> Coffee Lake Desktop
>> ===================
>> Batch throughput free latency free latency alloc latency alloc latency
>> page/s avg / us 99% / us avg / us 99% / us
>> ----- ---------- ------------ ------------ ------------- -------------
>> 63 510542.8 3.13 4.40 2.48 3.43
>> 127 514288.6 5.97 7.89 4.65 6.04
>> 255 516889.7 11.86 15.58 8.96 12.55
>> 511 519802.4 23.10 28.81 16.95 26.19
>> 1023 520802.7 45.30 52.51 33.19 45.95
>> 2047 519997.1 90.63 104.00 65.26 81.74
>>
>> From the above data, to restrict the allocation/freeing latency to be
>> less than 100 us in most times, the max batch scale factor needs to be
>> less than or equal to 5.
>>
>> So, in this patch, the batch scale factor is restricted to be less
>> than or equal to 5.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
>
> However, it's worth noting that the time to free depends on the CPU and
> while the CPUs you tested are reasonable, there are also slower CPUs out
> there and I've at least one account that the time is excessive. While
> this patch is fine, there may be a patch on top that makes this runtime
> configurable, a Kconfig default or both.

Sure. Will add a Kconfig option first in a follow-on patch.
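
As a rough illustration (the Kconfig symbol name below is an
assumption, not necessarily the final one), the cap could look like,

/* Sketch only: cap the batch scale factor with a build-time default. */
#ifndef CONFIG_PCP_BATCH_SCALE_MAX
#define CONFIG_PCP_BATCH_SCALE_MAX      5       /* hypothetical default */
#endif

static unsigned int sketch_scale_batch(unsigned int batch,
                                       unsigned int scale_factor)
{
        if (scale_factor > CONFIG_PCP_BATCH_SCALE_MAX)
                scale_factor = CONFIG_PCP_BATCH_SCALE_MAX;

        return batch << scale_factor;   /* 63 << 5 == 2016 pages at most */
}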

--
Best Regards,
Huang, Ying

2023-10-12 12:21:59

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 08/10] mm, pcp: decrease PCP high if free pages < high watermark

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:54PM +0800, Huang Ying wrote:
>> One target of PCP is to minimize pages in PCP if the system free pages
>> is too few. To reach that target, when page reclaiming is active for
>> the zone (ZONE_RECLAIM_ACTIVE), we will stop increasing PCP high in
>> allocating path, decrease PCP high and free some pages in freeing
>> path. But this may be too late because the background page reclaiming
>> may introduce latency for some workloads. So, in this patch, during
>> page allocation we will detect whether the number of free pages of the
>> zone is below high watermark. If so, we will stop increasing PCP high
>> in allocating path, decrease PCP high and free some pages in freeing
>> path. With this, we can reduce the possibility of the premature
>> background page reclaiming caused by too large PCP.
>>
>> The high watermark checking is done in allocating path to reduce the
>> overhead in hotter freeing path.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>> Cc: Andrew Morton <[email protected]>
>> Cc: Mel Gorman <[email protected]>
>> Cc: Vlastimil Babka <[email protected]>
>> Cc: David Hildenbrand <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Dave Hansen <[email protected]>
>> Cc: Michal Hocko <[email protected]>
>> Cc: Pavel Tatashin <[email protected]>
>> Cc: Matthew Wilcox <[email protected]>
>> Cc: Christoph Lameter <[email protected]>
>> ---
>> include/linux/mmzone.h | 1 +
>> mm/page_alloc.c | 22 ++++++++++++++++++++--
>> 2 files changed, 21 insertions(+), 2 deletions(-)
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index d6cfb5023f3e..8a19e2af89df 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -1006,6 +1006,7 @@ enum zone_flags {
>> * Cleared when kswapd is woken.
>> */
>> ZONE_RECLAIM_ACTIVE, /* kswapd may be scanning the zone. */
>> + ZONE_BELOW_HIGH, /* zone is below high watermark. */
>> };
>>
>> static inline unsigned long zone_managed_pages(struct zone *zone)
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 225abe56752c..3f8c7dfeed23 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -2409,7 +2409,13 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
>> return min(batch << 2, pcp->high);
>> }
>>
>> - if (pcp->count >= high && high_min != high_max) {
>> + if (high_min == high_max)
>> + return high;
>> +
>> + if (test_bit(ZONE_BELOW_HIGH, &zone->flags)) {
>> + pcp->high = max(high - (batch << pcp->free_factor), high_min);
>> + high = max(pcp->count, high_min);
>> + } else if (pcp->count >= high) {
>> int need_high = (batch << pcp->free_factor) + batch;
>>
>> /* pcp->high should be large enough to hold batch freed pages */
>> @@ -2459,6 +2465,10 @@ static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
>> if (pcp->count >= high) {
>> free_pcppages_bulk(zone, nr_pcp_free(pcp, batch, high, free_high),
>> pcp, pindex);
>> + if (test_bit(ZONE_BELOW_HIGH, &zone->flags) &&
>> + zone_watermark_ok(zone, 0, high_wmark_pages(zone),
>> + ZONE_MOVABLE, 0))
>> + clear_bit(ZONE_BELOW_HIGH, &zone->flags);
>> }
>> }
>>
>> @@ -2765,7 +2775,7 @@ static int nr_pcp_alloc(struct per_cpu_pages *pcp, struct zone *zone, int order)
>> * If we had larger pcp->high, we could avoid to allocate from
>> * zone.
>> */
>> - if (high_min != high_max && !test_bit(ZONE_RECLAIM_ACTIVE, &zone->flags))
>> + if (high_min != high_max && !test_bit(ZONE_BELOW_HIGH, &zone->flags))
>> high = pcp->high = min(high + batch, high_max);
>>
>> if (!order) {
>> @@ -3226,6 +3236,14 @@ get_page_from_freelist(gfp_t gfp_mask, unsigned int order, int alloc_flags,
>> }
>> }
>>
>> + mark = high_wmark_pages(zone);
>> + if (zone_watermark_fast(zone, order, mark,
>> + ac->highest_zoneidx, alloc_flags,
>> + gfp_mask))
>> + goto try_this_zone;
>> + else if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
>> + set_bit(ZONE_BELOW_HIGH, &zone->flags);
>> +
>
> This absolutely needs a comment explaning why because superficially a
> consequence of this is that allocator performance is slightly degraded
> when below the high watermark. Being below the high watermark is
> completely harmless and can persist indefinitely until something wakes
> kswapd.

Sure. Will add some comments here.
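
Roughly, the comment needs to capture something like the following
sketch of the intent (the helper and its name are hypothetical, only
ZONE_BELOW_HIGH comes from this patch),

/*
 * Sketch: once free pages drop below the high watermark, mark the
 * zone so that PCP high stops growing in the allocation path and is
 * shrunk in the freeing path.  The extra check slightly slows down
 * allocation below the high watermark, but it reduces the chance
 * that an oversized PCP pushes the zone into premature reclaim.
 */
static void sketch_note_below_high(struct zone *zone, bool watermark_ok)
{
        if (watermark_ok)
                return;

        if (!test_bit(ZONE_BELOW_HIGH, &zone->flags))
                set_bit(ZONE_BELOW_HIGH, &zone->flags);
}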

--
Best Regards,
Huang, Ying

2023-10-12 12:23:31

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

Mel Gorman <[email protected]> writes:

> On Wed, Sep 20, 2023 at 02:18:47PM +0800, Huang Ying wrote:
>> In commit f26b3fa04611 ("mm/page_alloc: limit number of high-order
>> pages on PCP during bulk free"), the PCP (Per-CPU Pageset) will be
>> drained when PCP is mostly used for high-order pages freeing to
>> improve the cache-hot pages reusing between page allocation and
>> freeing CPUs.
>>
>> But, the PCP draining mechanism may be triggered unexpectedly when
>> process exits. With some customized trace point, it was found that
>> PCP draining (free_high == true) was triggered with the order-1 page
>> freeing with the following call stack,
>>
>> => free_unref_page_commit
>> => free_unref_page
>> => __mmdrop
>> => exit_mm
>> => do_exit
>> => do_group_exit
>> => __x64_sys_exit_group
>> => do_syscall_64
>>
>> Checking the source code, this is the page table PGD
>> freeing (mm_free_pgd()). It's an order-1 page freeing if
>> CONFIG_PAGE_TABLE_ISOLATION=y, which is a common configuration for
>> security.
>>
>> Just before that, page freeing with the following call stack was
>> found,
>>
>> => free_unref_page_commit
>> => free_unref_page_list
>> => release_pages
>> => tlb_batch_pages_flush
>> => tlb_finish_mmu
>> => exit_mmap
>> => __mmput
>> => exit_mm
>> => do_exit
>> => do_group_exit
>> => __x64_sys_exit_group
>> => do_syscall_64
>>
>> So, when a process exits,
>>
>> - a large number of user pages of the process will be freed without
>> page allocation, it's highly possible that pcp->free_factor becomes
>> > 0.
>>
>> - after freeing all user pages, the PGD will be freed, which is an
>> order-1 page freeing, so the PCP will be drained.
>>
>> All in all, when a process exits, it's highly possible that the PCP
>> will be drained. This is an unexpected behavior.
>>
>> To avoid this, in the patch, the PCP draining will only be triggered
>> for 2 consecutive high-order page freeing.
>>
>> On a 2-socket Intel server with 224 logical CPU, we tested kbuild on
>> one socket with `make -j 112`. With the patch, the build time
>> decreases 3.4% (from 206s to 199s). The cycles% of the spinlock
>> contention (mostly for zone lock) decreases from 43.6% to 40.3% (with
>> PCP size == 361). The number of PCP draining for high order pages
>> freeing (free_high) decreases 50.8%.
>>
>> This helps network workload too for reduced zone lock contention. On
>> a 2-socket Intel server with 128 logical CPU, with the patch, the
>> network bandwidth of the UNIX (AF_UNIX) test case of lmbench test
>> suite with 16-pair processes increase 17.1%. The cycles% of the
>> spinlock contention (mostly for zone lock) decreases from 50.0% to
>> 45.8%. The number of PCP draining for high order pages
>> freeing (free_high) decreases 27.4%. The cache miss rate keeps 0.3%.
>>
>> Signed-off-by: "Huang, Ying" <[email protected]>
>
> Acked-by: Mel Gorman <[email protected]>
>
> However, I want to note that batching on exit is not necessarily
> unexpected. For processes that are multi-TB in size, the time to exit
> can actually be quite large and batching is of benefit but optimising
> for exit is rarely a winning strategy. The pattern of "all allocs on CPU
> B and all frees on CPU B" or "short-lived tasks triggering a premature
> drain" is a bit more compelling but not worth a changelog rewrite.
>>
>> diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
>> index 4106fbc5b4b3..64d5ed2bb724 100644
>> --- a/include/linux/mmzone.h
>> +++ b/include/linux/mmzone.h
>> @@ -676,12 +676,15 @@ enum zone_watermarks {
>> #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
>> #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
>>
>> +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
>> +
>
> The meaning of the flag and its intent should have been documented.

Sure. Will add comments for the flags.
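
Something along these lines, perhaps (a sketch of the "2 consecutive
high-order freeing" rule from the changelog; the helper and its
pcp_flags parameter are assumptions, not the actual
free_unref_page_commit() code),

/*
 * Sketch: PCPF_PREV_FREE_HIGH_ORDER records that the previous page
 * freed on this PCP was high-order.  A free_high drain is only
 * considered when two consecutive frees are high-order, so the single
 * order-1 PGD free at process exit no longer drains a PCP that was
 * just filled by order-0 user page frees.
 */
static bool sketch_want_free_high(unsigned long *pcp_flags, unsigned int order)
{
        bool free_high = false;

        if (order && order <= PAGE_ALLOC_COSTLY_ORDER) {
                free_high = *pcp_flags & PCPF_PREV_FREE_HIGH_ORDER;
                *pcp_flags |= PCPF_PREV_FREE_HIGH_ORDER;
        } else {
                *pcp_flags &= ~PCPF_PREV_FREE_HIGH_ORDER;
        }

        return free_high;
}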

--
Best Regards,
Huang, Ying

2023-10-12 12:50:09

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily

On Thu, Oct 12, 2023 at 03:48:04PM +0800, Huang, Ying wrote:
> "
> On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
> instances in parallel (each with `make -j 28`) in 8 cgroup. This
> simulates the kbuild server that is used by 0-Day kbuild service.
> With the patch, The number of pages allocated from zone (instead of
> from PCP) decreases 21.4%.
> "
>
> I also showed the performance number for each step of optimization as
> follows (copied from the above patchset V2 link).
>
> "
> build time lock contend% free_high alloc_zone
> ---------- ---------- --------- ----------
> base 100.0 13.5 100.0 100.0
> patch1 99.2 10.6 19.2 95.6
> patch3 99.2 11.7 7.1 95.6
> patch5 98.4 10.0 8.2 97.1
> patch7 94.9 0.7 3.0 19.0
> patch9 94.9 0.6 2.7 15.0 <-- this patch
> patch10 94.9 0.9 8.8 18.6
> "
>
> Although I think the patch is helpful by avoiding the unnecessary
> pcp->high decaying and thus reducing the zone lock contention, there's
> no visible benchmark score change for the patch.
>

Thanks!

Given that it's another PCP field with an update in a relatively hot
path, I would suggest dropping this patch entirely if it does not affect
performance. It has the risk of being a magical heuristic that we forget
later whether it's even worthwhile.

--
Mel Gorman
SUSE Labs

2023-10-12 12:53:13

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> Per-CPU data cache size is useful information. For example, it can be
> >> used to determine per-CPU cache size. So, in this patch, the data
> >> cache size for each CPU is calculated via data_cache_size /
> >> shared_cpu_weight.
> >>
> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> to allocate an extra cpumask, especially in offline callback.
> >>
> >> Signed-off-by: "Huang, Ying" <[email protected]>
> >
> > It's not necessarily relevant to the patch, but at least the scheduler
> > also stores some per-cpu topology information such as sd_llc_size -- the
> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> > worth unifying this at some point if it's common that per-cpu
> > information is too fine and per-zone or per-node information is too
> > coarse. This would be particularly true when considering locking
> > granularity,
> >
> >> Cc: Sudeep Holla <[email protected]>
> >> Cc: Andrew Morton <[email protected]>
> >> Cc: Mel Gorman <[email protected]>
> >> Cc: Vlastimil Babka <[email protected]>
> >> Cc: David Hildenbrand <[email protected]>
> >> Cc: Johannes Weiner <[email protected]>
> >> Cc: Dave Hansen <[email protected]>
> >> Cc: Michal Hocko <[email protected]>
> >> Cc: Pavel Tatashin <[email protected]>
> >> Cc: Matthew Wilcox <[email protected]>
> >> Cc: Christoph Lameter <[email protected]>
> >> ---
> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
> >> include/linux/cacheinfo.h | 1 +
> >> 2 files changed, 42 insertions(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> --- a/drivers/base/cacheinfo.c
> >> +++ b/drivers/base/cacheinfo.c
> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >> return rc;
> >> }
> >>
> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> +{
> >> + struct cpu_cacheinfo *ci;
> >> + struct cacheinfo *leaf;
> >> + unsigned int i, nr_shared;
> >> + unsigned int size_data = 0;
> >> +
> >> + if (!per_cpu_cacheinfo(cpu))
> >> + return;
> >> +
> >> + ci = ci_cacheinfo(cpu);
> >> + for (i = 0; i < cache_leaves(cpu); i++) {
> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> + if (leaf->type != CACHE_TYPE_DATA &&
> >> + leaf->type != CACHE_TYPE_UNIFIED)
> >> + continue;
> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> + if (!nr_shared)
> >> + continue;
> >> + size_data += leaf->size / nr_shared;
> >> + }
> >> + ci->size_data = size_data;
> >> +}
> >
> > This needs comments.
> >
> > It would be nice to add a comment on top describing the limitation of
> > CACHE_TYPE_UNIFIED here in the context of
> > update_data_cache_size_cpu().
>
> Sure. Will do that.
>

Thanks.

> > The L2 cache could be unified but much smaller than a L3 or other
> > last-level-cache. It's not clear from the code what level of cache is being
> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> > is not the size of a cache, it appears to be the share of a cache a CPU
> > would have under ideal circumstances.
>
> Yes. And it isn't for one specific level of cache. It's sum of per-CPU
> shares of all levels of cache. But the calculation is inaccurate. More
> details are in the below reply.
>
> > However, as it appears to also be
> > iterating hierarchy then this may not be accurate. Caches may or may not
> > allow data to be duplicated between levels so the value may be inaccurate.
>
> Thank you very much for pointing this out! The cache can be inclusive
> or not. So, we cannot calculate the per-CPU slice of all-level caches
> via adding them together blindly. I will change this in a follow-on
> patch.
>

Please do, I would strongly suggest basing this on LLC only because it's
the only value you can be sure of. This change is the only change that may
warrant a respin of the series as the history will be somewhat confusing
otherwise.

--
Mel Gorman
SUSE Labs

2023-10-12 13:09:42

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

On Wed, Oct 11, 2023 at 10:16:17AM -0700, Andrew Morton wrote:
> On Wed, 11 Oct 2023 13:46:10 +0100 Mel Gorman <[email protected]> wrote:
>
> > > --- a/include/linux/mmzone.h
> > > +++ b/include/linux/mmzone.h
> > > @@ -676,12 +676,15 @@ enum zone_watermarks {
> > > #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
> > > #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
> > >
> > > +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
> > > +
> >
> > The meaning of the flag and its intent should have been documented.
>
> I need to rebase mm-stable for other reasons. So let's please
> decide (soon) whether Mel's review comments can be addressed
> via add-on patches or whether I should drop this version of this
> series altogether, during that rebase.

The cache slice calculation is the only change I think may deserve a
respin as it may have a material impact on the performance figures if the
"size_data" value changes by too much. Huang, what do you think and how
long do you think it would take to update the performance figures? As it
may require multiple tests for each patch in the series, I would also be ok
with a follow-on patch like "mm: page_alloc: Simplify cache size estimation
for PCP tuning" that documents the limitation of summing the unified caches
and the impact, if any, on performance. It makes for a messy history *but*
it would also record the reasons why summing hierarchies is not necessarily
the best approach which also has value.

I think patch 9 should be dropped as it has no impact on headline performance
while adding a relatively tricky heuristic that updates within a fast
path. Again, I'd like to give Huang a chance to respond and to evaluate
if it materially impacts patch 10 -- I don't think it does but I didn't
think very hard about it. Even if patch 9+10 had to be dropped, it would
not take much from the overall value of the series.

Comments and documentation alone are not grounds for pulling the series but
I hope they do get addressed in follow-on patches. I think requiring them
for accepting the series is unfair even if the only reason is I took too
long to review.

--
Mel Gorman
SUSE Labs

2023-10-12 13:14:30

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

Mel Gorman <[email protected]> writes:

> On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> Mel Gorman <[email protected]> writes:
>>
>> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> Per-CPU data cache size is useful information. For example, it can be
>> >> used to determine per-CPU cache size. So, in this patch, the data
>> >> cache size for each CPU is calculated via data_cache_size /
>> >> shared_cpu_weight.
>> >>
>> >> A brute-force algorithm to iterate all online CPUs is used to avoid
>> >> to allocate an extra cpumask, especially in offline callback.
>> >>
>> >> Signed-off-by: "Huang, Ying" <[email protected]>
>> >
>> > It's not necessarily relevant to the patch, but at least the scheduler
>> > also stores some per-cpu topology information such as sd_llc_size -- the
>> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> > worth unifying this at some point if it's common that per-cpu
>> > information is too fine and per-zone or per-node information is too
>> > coarse. This would be particularly true when considering locking
>> > granularity,
>> >
>> >> Cc: Sudeep Holla <[email protected]>
>> >> Cc: Andrew Morton <[email protected]>
>> >> Cc: Mel Gorman <[email protected]>
>> >> Cc: Vlastimil Babka <[email protected]>
>> >> Cc: David Hildenbrand <[email protected]>
>> >> Cc: Johannes Weiner <[email protected]>
>> >> Cc: Dave Hansen <[email protected]>
>> >> Cc: Michal Hocko <[email protected]>
>> >> Cc: Pavel Tatashin <[email protected]>
>> >> Cc: Matthew Wilcox <[email protected]>
>> >> Cc: Christoph Lameter <[email protected]>
>> >> ---
>> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
>> >> include/linux/cacheinfo.h | 1 +
>> >> 2 files changed, 42 insertions(+), 1 deletion(-)
>> >>
>> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> --- a/drivers/base/cacheinfo.c
>> >> +++ b/drivers/base/cacheinfo.c
>> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >> return rc;
>> >> }
>> >>
>> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> +{
>> >> + struct cpu_cacheinfo *ci;
>> >> + struct cacheinfo *leaf;
>> >> + unsigned int i, nr_shared;
>> >> + unsigned int size_data = 0;
>> >> +
>> >> + if (!per_cpu_cacheinfo(cpu))
>> >> + return;
>> >> +
>> >> + ci = ci_cacheinfo(cpu);
>> >> + for (i = 0; i < cache_leaves(cpu); i++) {
>> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> + if (leaf->type != CACHE_TYPE_DATA &&
>> >> + leaf->type != CACHE_TYPE_UNIFIED)
>> >> + continue;
>> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> + if (!nr_shared)
>> >> + continue;
>> >> + size_data += leaf->size / nr_shared;
>> >> + }
>> >> + ci->size_data = size_data;
>> >> +}
>> >
>> > This needs comments.
>> >
>> > It would be nice to add a comment on top describing the limitation of
>> > CACHE_TYPE_UNIFIED here in the context of
>> > update_data_cache_size_cpu().
>>
>> Sure. Will do that.
>>
>
> Thanks.
>
>> > The L2 cache could be unified but much smaller than a L3 or other
>> > last-level-cache. It's not clear from the code what level of cache is being
>> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> > is not the size of a cache, it appears to be the share of a cache a CPU
>> > would have under ideal circumstances.
>>
>> Yes. And it isn't for one specific level of cache. It's sum of per-CPU
>> shares of all levels of cache. But the calculation is inaccurate. More
>> details are in the below reply.
>>
>> > However, as it appears to also be
>> > iterating hierarchy then this may not be accurate. Caches may or may not
>> > allow data to be duplicated between levels so the value may be inaccurate.
>>
>> Thank you very much for pointing this out! The cache can be inclusive
>> or not. So, we cannot calculate the per-CPU slice of all-level caches
>> via adding them together blindly. I will change this in a follow-on
>> patch.
>>
>
> Please do, I would strongly suggest basing this on LLC only because it's
> the only value you can be sure of. This change is the only change that may
> warrant a respin of the series as the history will be somewhat confusing
> otherwise.

I am still checking whether it's possible to get cache inclusive
information via cpuid.

If there's no reliable way to do that, we can use the max value of the
per-CPU shares of each level of cache. For an inclusive cache, that
will be the value of the LLC. For a non-inclusive cache, the value will
be more accurate. For example, on Intel Sapphire Rapids, the L2 cache
is 2 MB per core, while the LLC is 1.875 MB per core according to [1].

[1] https://www.intel.com/content/www/us/en/developer/articles/technical/fourth-generation-xeon-scalable-family-overview.html
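
A minimal sketch of that alternative, reusing the helpers from the
quoted update_data_cache_size_cpu() (just the idea, not a posted
patch),

/* Sketch: take the max per-CPU share across data/unified leaves. */
static unsigned int sketch_max_data_cache_share(unsigned int cpu)
{
        struct cacheinfo *leaf;
        unsigned int i, nr_shared, share, max_share = 0;

        for (i = 0; i < cache_leaves(cpu); i++) {
                leaf = per_cpu_cacheinfo_idx(cpu, i);
                if (leaf->type != CACHE_TYPE_DATA &&
                    leaf->type != CACHE_TYPE_UNIFIED)
                        continue;
                nr_shared = cpumask_weight(&leaf->shared_cpu_map);
                if (!nr_shared)
                        continue;
                share = leaf->size / nr_shared;
                if (share > max_share)
                        max_share = share;      /* e.g. the L2 share on SPR */
        }

        return max_share;
}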

I will respin the series.

Thanks a lot for review!

--
Best Regards,
Huang, Ying

2023-10-12 13:24:46

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 09/10] mm, pcp: avoid to reduce PCP high unnecessarily

Mel Gorman <[email protected]> writes:

> On Thu, Oct 12, 2023 at 03:48:04PM +0800, Huang, Ying wrote:
>> "
>> On a 2-socket Intel server with 224 logical CPU, we run 8 kbuild
>> instances in parallel (each with `make -j 28`) in 8 cgroup. This
>> simulates the kbuild server that is used by 0-Day kbuild service.
>> With the patch, The number of pages allocated from zone (instead of
>> from PCP) decreases 21.4%.
>> "
>>
>> I also showed the performance number for each step of optimization as
>> follows (copied from the above patchset V2 link).
>>
>> "
>> build time lock contend% free_high alloc_zone
>> ---------- ---------- --------- ----------
>> base 100.0 13.5 100.0 100.0
>> patch1 99.2 10.6 19.2 95.6
>> patch3 99.2 11.7 7.1 95.6
>> patch5 98.4 10.0 8.2 97.1
>> patch7 94.9 0.7 3.0 19.0
>> patch9 94.9 0.6 2.7 15.0 <-- this patch
>> patch10 94.9 0.9 8.8 18.6
>> "
>>
>> Although I think the patch is helpful by avoiding the unnecessary
>> pcp->high decaying and thus reducing the zone lock contention, there's
>> no visible benchmark score change for the patch.
>>
>
> Thanks!
>
> Given that it's another PCP field with an update in a relatively hot
> path, I would suggest dropping this patch entirely if it does not affect
> performance. It has the risk of being a magical heuristic that we forget
> later whether it's even worthwhile.

OK. Hopefully we can find some workloads that benefit from the patch in
the future.

--
Best Regards,
Huang, Ying

2023-10-12 13:37:47

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 01/10] mm, pcp: avoid to drain PCP when process exit

Mel Gorman <[email protected]> writes:

> On Wed, Oct 11, 2023 at 10:16:17AM -0700, Andrew Morton wrote:
>> On Wed, 11 Oct 2023 13:46:10 +0100 Mel Gorman <[email protected]> wrote:
>>
>> > > --- a/include/linux/mmzone.h
>> > > +++ b/include/linux/mmzone.h
>> > > @@ -676,12 +676,15 @@ enum zone_watermarks {
>> > > #define high_wmark_pages(z) (z->_watermark[WMARK_HIGH] + z->watermark_boost)
>> > > #define wmark_pages(z, i) (z->_watermark[i] + z->watermark_boost)
>> > >
>> > > +#define PCPF_PREV_FREE_HIGH_ORDER 0x01
>> > > +
>> >
>> > The meaning of the flag and its intent should have been documented.
>>
>> I need to rebase mm-stable for other reasons. So let's please
>> decide (soon) whether Mel's review comments can be addressed
>> via add-on patches or whether I should drop this version of this
>> series altogether, during that rebase.
>
> The cache slice calculation is the only change I think may deserve a
> respin as it may have a material impact on the performance figures if the
> "size_data" value changes by too much. Huang, what do you think and how
> long do you think it would take to update the performance figures? As it
> may require multiple tests for each patch in the series, I would also be ok
> with a follow-on patch like "mm: page_alloc: Simplify cache size estimation
> for PCP tuning" that documents the limitation of summing the unified caches
> and the impact, if any, on performance. It makes for a messy history *but*
> it would also record the reasons why summing hierarchies is not necessarily
> the best approach which also has value.

I am OK to respin the series. It will take 3-4 days to update the
performance figures.

> I think patch 9 should be dropped as it has no impact on headline performance
> while adding a relatively tricky heuristic that updates within a fast
> path. Again, I'd like to give Huang a chance to respond and to evaluate
> if it materially impacts patch 10 -- I don't think it does but I didn't
> think very hard about it. Even if patch 9+10 had to be dropped, it would
> not take much from the overall value of the series.

I am OK with dropping patch 9, at least for now. In the future we may
revisit it when we find workloads that benefit more from it. It's not
too hard to rebase patch 10.

> Comments and documentation alone are not grounds for pulling the series but
> I hope they do get addressed in follow-on patches. I think requiring them
> for accepting the series is unfair even if the only reason is I took too
> long to review.

Never mind, your review is very valuable!

--
Best Regards,
Huang, Ying

2023-10-12 15:23:11

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> >> Mel Gorman <[email protected]> writes:
> >>
> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> >> Per-CPU data cache size is useful information. For example, it can be
> >> >> used to determine per-CPU cache size. So, in this patch, the data
> >> >> cache size for each CPU is calculated via data_cache_size /
> >> >> shared_cpu_weight.
> >> >>
> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> >> to allocate an extra cpumask, especially in offline callback.
> >> >>
> >> >> Signed-off-by: "Huang, Ying" <[email protected]>
> >> >
> >> > It's not necessarily relevant to the patch, but at least the scheduler
> >> > also stores some per-cpu topology information such as sd_llc_size -- the
> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> >> > worth unifying this at some point if it's common that per-cpu
> >> > information is too fine and per-zone or per-node information is too
> >> > coarse. This would be particularly true when considering locking
> >> > granularity,
> >> >
> >> >> Cc: Sudeep Holla <[email protected]>
> >> >> Cc: Andrew Morton <[email protected]>
> >> >> Cc: Mel Gorman <[email protected]>
> >> >> Cc: Vlastimil Babka <[email protected]>
> >> >> Cc: David Hildenbrand <[email protected]>
> >> >> Cc: Johannes Weiner <[email protected]>
> >> >> Cc: Dave Hansen <[email protected]>
> >> >> Cc: Michal Hocko <[email protected]>
> >> >> Cc: Pavel Tatashin <[email protected]>
> >> >> Cc: Matthew Wilcox <[email protected]>
> >> >> Cc: Christoph Lameter <[email protected]>
> >> >> ---
> >> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
> >> >> include/linux/cacheinfo.h | 1 +
> >> >> 2 files changed, 42 insertions(+), 1 deletion(-)
> >> >>
> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> >> --- a/drivers/base/cacheinfo.c
> >> >> +++ b/drivers/base/cacheinfo.c
> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >> >> return rc;
> >> >> }
> >> >>
> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> >> +{
> >> >> + struct cpu_cacheinfo *ci;
> >> >> + struct cacheinfo *leaf;
> >> >> + unsigned int i, nr_shared;
> >> >> + unsigned int size_data = 0;
> >> >> +
> >> >> + if (!per_cpu_cacheinfo(cpu))
> >> >> + return;
> >> >> +
> >> >> + ci = ci_cacheinfo(cpu);
> >> >> + for (i = 0; i < cache_leaves(cpu); i++) {
> >> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> >> + if (leaf->type != CACHE_TYPE_DATA &&
> >> >> + leaf->type != CACHE_TYPE_UNIFIED)
> >> >> + continue;
> >> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> >> + if (!nr_shared)
> >> >> + continue;
> >> >> + size_data += leaf->size / nr_shared;
> >> >> + }
> >> >> + ci->size_data = size_data;
> >> >> +}
> >> >
> >> > This needs comments.
> >> >
> >> > It would be nice to add a comment on top describing the limitation of
> >> > CACHE_TYPE_UNIFIED here in the context of
> >> > update_data_cache_size_cpu().
> >>
> >> Sure. Will do that.
> >>
> >
> > Thanks.
> >
> >> > The L2 cache could be unified but much smaller than a L3 or other
> >> > last-level-cache. It's not clear from the code what level of cache is being
> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> >> > is not the size of a cache, it appears to be the share of a cache a CPU
> >> > would have under ideal circumstances.
> >>
> >> Yes. And it isn't for one specific level of cache. It's sum of per-CPU
> >> shares of all levels of cache. But the calculation is inaccurate. More
> >> details are in the below reply.
> >>
> >> > However, as it appears to also be
> >> > iterating hierarchy then this may not be accurate. Caches may or may not
> >> > allow data to be duplicated between levels so the value may be inaccurate.
> >>
> >> Thank you very much for pointing this out! The cache can be inclusive
> >> or not. So, we cannot calculate the per-CPU slice of all-level caches
> >> via adding them together blindly. I will change this in a follow-on
> >> patch.
> >>
> >
> > Please do, I would strongly suggest basing this on LLC only because it's
> > the only value you can be sure of. This change is the only change that may
> > warrant a respin of the series as the history will be somewhat confusing
> > otherwise.
>
> I am still checking whether it's possible to get cache inclusive
> information via cpuid.
>

cpuid may be x86-specific so that potentially leads to different behaviours
on different architectures.

> If there's no reliable way to do that. We can use the max value of
> per-CPU share of each level of cache. For inclusive cache, that will be
> the value of LLC. For non-inclusive cache, the value will be more
> accurate. For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
> per core, while LLC is 1.875 MB per core according to [1].
>

Be that as it may, it still opens the possibility of significantly different
behaviour depending on the CPU family. I would strongly recommend that you
start with LLC only because LLC is also the topology level of interest used
by the scheduler and it's information that is generally available. Trying
to get accurate information on every level and the complexity of dealing
with inclusive vs exclusive cache or write-back vs write-through should
be a separate patch, with separate justification and notes on how it can
lead to behaviour specific to the CPU family or architecture.

--
Mel Gorman
SUSE Labs

2023-10-13 03:09:24

by Huang, Ying

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

Mel Gorman <[email protected]> writes:

> On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
>> Mel Gorman <[email protected]> writes:
>>
>> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
>> >> Mel Gorman <[email protected]> writes:
>> >>
>> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
>> >> >> Per-CPU data cache size is useful information. For example, it can be
>> >> >> used to determine per-CPU cache size. So, in this patch, the data
>> >> >> cache size for each CPU is calculated via data_cache_size /
>> >> >> shared_cpu_weight.
>> >> >>
>> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
>> >> >> to allocate an extra cpumask, especially in offline callback.
>> >> >>
>> >> >> Signed-off-by: "Huang, Ying" <[email protected]>
>> >> >
>> >> > It's not necessarily relevant to the patch, but at least the scheduler
>> >> > also stores some per-cpu topology information such as sd_llc_size -- the
>> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
>> >> > worth unifying this at some point if it's common that per-cpu
>> >> > information is too fine and per-zone or per-node information is too
>> >> > coarse. This would be particularly true when considering locking
>> >> > granularity,
>> >> >
>> >> >> Cc: Sudeep Holla <[email protected]>
>> >> >> Cc: Andrew Morton <[email protected]>
>> >> >> Cc: Mel Gorman <[email protected]>
>> >> >> Cc: Vlastimil Babka <[email protected]>
>> >> >> Cc: David Hildenbrand <[email protected]>
>> >> >> Cc: Johannes Weiner <[email protected]>
>> >> >> Cc: Dave Hansen <[email protected]>
>> >> >> Cc: Michal Hocko <[email protected]>
>> >> >> Cc: Pavel Tatashin <[email protected]>
>> >> >> Cc: Matthew Wilcox <[email protected]>
>> >> >> Cc: Christoph Lameter <[email protected]>
>> >> >> ---
>> >> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
>> >> >> include/linux/cacheinfo.h | 1 +
>> >> >> 2 files changed, 42 insertions(+), 1 deletion(-)
>> >> >>
>> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
>> >> >> index cbae8be1fe52..3e8951a3fbab 100644
>> >> >> --- a/drivers/base/cacheinfo.c
>> >> >> +++ b/drivers/base/cacheinfo.c
>> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
>> >> >> return rc;
>> >> >> }
>> >> >>
>> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
>> >> >> +{
>> >> >> + struct cpu_cacheinfo *ci;
>> >> >> + struct cacheinfo *leaf;
>> >> >> + unsigned int i, nr_shared;
>> >> >> + unsigned int size_data = 0;
>> >> >> +
>> >> >> + if (!per_cpu_cacheinfo(cpu))
>> >> >> + return;
>> >> >> +
>> >> >> + ci = ci_cacheinfo(cpu);
>> >> >> + for (i = 0; i < cache_leaves(cpu); i++) {
>> >> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
>> >> >> + if (leaf->type != CACHE_TYPE_DATA &&
>> >> >> + leaf->type != CACHE_TYPE_UNIFIED)
>> >> >> + continue;
>> >> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
>> >> >> + if (!nr_shared)
>> >> >> + continue;
>> >> >> + size_data += leaf->size / nr_shared;
>> >> >> + }
>> >> >> + ci->size_data = size_data;
>> >> >> +}
>> >> >
>> >> > This needs comments.
>> >> >
>> >> > It would be nice to add a comment on top describing the limitation of
>> >> > CACHE_TYPE_UNIFIED here in the context of
>> >> > update_data_cache_size_cpu().
>> >>
>> >> Sure. Will do that.
>> >>
>> >
>> > Thanks.
>> >
>> >> > The L2 cache could be unified but much smaller than a L3 or other
>> >> > last-level-cache. It's not clear from the code what level of cache is being
>> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
>> >> > is not the size of a cache, it appears to be the share of a cache a CPU
>> >> > would have under ideal circumstances.
>> >>
>> >> Yes. And it isn't for one specific level of cache. It's sum of per-CPU
>> >> shares of all levels of cache. But the calculation is inaccurate. More
>> >> details are in the below reply.
>> >>
>> >> > However, as it appears to also be
>> >> > iterating hierarchy then this may not be accurate. Caches may or may not
>> >> > allow data to be duplicated between levels so the value may be inaccurate.
>> >>
>> >> Thank you very much for pointing this out! The cache can be inclusive
>> >> or not. So, we cannot calculate the per-CPU slice of all-level caches
>> >> via adding them together blindly. I will change this in a follow-on
>> >> patch.
>> >>
>> >
>> > Please do, I would strongly suggest basing this on LLC only because it's
>> > the only value you can be sure of. This change is the only change that may
>> > warrant a respin of the series as the history will be somewhat confusing
>> > otherwise.
>>
>> I am still checking whether it's possible to get cache inclusive
>> information via cpuid.
>>
>
> cpuid may be x86-specific so that potentially leads to different behaviours
> on different architectures.
>
>> If there's no reliable way to do that. We can use the max value of
>> per-CPU share of each level of cache. For inclusive cache, that will be
>> the value of LLC. For non-inclusive cache, the value will be more
>> accurate. For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
>> per core, while LLC is 1.875 MB per core according to [1].
>>
>
> Be that as it may, it still opens the possibility of significantly different
> behaviour depending on the CPU family. I would strongly recommend that you
> start with LLC only because LLC is also the topology level of interest used
> by the scheduler and it's information that is generally available. Trying
> to get accurate information on every level and the complexity of dealing
> with inclusive vs exclusive cache or write-back vs write-through should
> be a separate patch, with separate justification and notes on how it can
> lead to behaviour specific to the CPU family or architecture.

IMHO, we should try to optimize for as many CPUs as possible. The
per-CPU (HW thread for SMT) slice of the LLC on the latest Intel server
CPUs is as follows,

Icelake: 0.75 MB
Sapphire Rapids: 0.9375 MB

While pcp->batch is 63 * 4 / 1024 = 0.2461 MB.

In [03/10], we only cache pcp->batch pages before draining the PCP if
"per_cpu_cache_slice > 4 * pcp->batch". This makes the optimization
unavailable for a significant portion of server CPUs.

In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
cache-hot pages between CPUs. So, if we change the condition to
"per_cpu_cache_slice > 3 * pcp->batch", I think that we are still safe.

As for other CPUs, according to [2], AMD CPUs have a larger per-CPU
LLC, so it's OK for them. ARM CPUs have a much smaller per-CPU LLC, so
some further optimization is needed.

[2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2

So, I suggest using "per_cpu_cache_slice > 3 * pcp->batch" in [03/10]
and using the LLC only in this patch [02/10]. Then, we can optimize the
per-CPU cache slice calculation in follow-up patches.
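
For reference, the arithmetic behind the suggestion (numbers taken from
above; a quick userspace check, not kernel code),

#include <stdio.h>

int main(void)
{
        double batch_mb = 63 * 4 / 1024.0;      /* ~0.2461 MB */
        double slice[] = { 0.75, 0.9375 };      /* per-CPU LLC slice, MB */
        const char *name[] = { "Icelake", "Sapphire Rapids" };

        for (int i = 0; i < 2; i++)
                printf("%s: > 4*batch? %s, > 3*batch? %s\n", name[i],
                       slice[i] > 4 * batch_mb ? "yes" : "no", /* both: no */
                       slice[i] > 3 * batch_mb ? "yes" : "no"); /* both: yes */

        return 0;
}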

--
Best Regards,
Huang, Ying

2023-10-16 15:43:55

by Mel Gorman

[permalink] [raw]
Subject: Re: [PATCH 02/10] cacheinfo: calculate per-CPU data cache size

On Fri, Oct 13, 2023 at 11:06:51AM +0800, Huang, Ying wrote:
> Mel Gorman <[email protected]> writes:
>
> > On Thu, Oct 12, 2023 at 09:12:00PM +0800, Huang, Ying wrote:
> >> Mel Gorman <[email protected]> writes:
> >>
> >> > On Thu, Oct 12, 2023 at 08:08:32PM +0800, Huang, Ying wrote:
> >> >> Mel Gorman <[email protected]> writes:
> >> >>
> >> >> > On Wed, Sep 20, 2023 at 02:18:48PM +0800, Huang Ying wrote:
> >> >> >> Per-CPU data cache size is useful information. For example, it can be
> >> >> >> used to determine per-CPU cache size. So, in this patch, the data
> >> >> >> cache size for each CPU is calculated via data_cache_size /
> >> >> >> shared_cpu_weight.
> >> >> >>
> >> >> >> A brute-force algorithm to iterate all online CPUs is used to avoid
> >> >> >> to allocate an extra cpumask, especially in offline callback.
> >> >> >>
> >> >> >> Signed-off-by: "Huang, Ying" <[email protected]>
> >> >> >
> >> >> > It's not necessarily relevant to the patch, but at least the scheduler
> >> >> > also stores some per-cpu topology information such as sd_llc_size -- the
> >> >> > number of CPUs sharing the same last-level-cache as this CPU. It may be
> >> >> > worth unifying this at some point if it's common that per-cpu
> >> >> > information is too fine and per-zone or per-node information is too
> >> >> > coarse. This would be particularly true when considering locking
> >> >> > granularity,
> >> >> >
> >> >> >> Cc: Sudeep Holla <[email protected]>
> >> >> >> Cc: Andrew Morton <[email protected]>
> >> >> >> Cc: Mel Gorman <[email protected]>
> >> >> >> Cc: Vlastimil Babka <[email protected]>
> >> >> >> Cc: David Hildenbrand <[email protected]>
> >> >> >> Cc: Johannes Weiner <[email protected]>
> >> >> >> Cc: Dave Hansen <[email protected]>
> >> >> >> Cc: Michal Hocko <[email protected]>
> >> >> >> Cc: Pavel Tatashin <[email protected]>
> >> >> >> Cc: Matthew Wilcox <[email protected]>
> >> >> >> Cc: Christoph Lameter <[email protected]>
> >> >> >> ---
> >> >> >> drivers/base/cacheinfo.c | 42 ++++++++++++++++++++++++++++++++++++++-
> >> >> >> include/linux/cacheinfo.h | 1 +
> >> >> >> 2 files changed, 42 insertions(+), 1 deletion(-)
> >> >> >>
> >> >> >> diff --git a/drivers/base/cacheinfo.c b/drivers/base/cacheinfo.c
> >> >> >> index cbae8be1fe52..3e8951a3fbab 100644
> >> >> >> --- a/drivers/base/cacheinfo.c
> >> >> >> +++ b/drivers/base/cacheinfo.c
> >> >> >> @@ -898,6 +898,41 @@ static int cache_add_dev(unsigned int cpu)
> >> >> >> return rc;
> >> >> >> }
> >> >> >>
> >> >> >> +static void update_data_cache_size_cpu(unsigned int cpu)
> >> >> >> +{
> >> >> >> + struct cpu_cacheinfo *ci;
> >> >> >> + struct cacheinfo *leaf;
> >> >> >> + unsigned int i, nr_shared;
> >> >> >> + unsigned int size_data = 0;
> >> >> >> +
> >> >> >> + if (!per_cpu_cacheinfo(cpu))
> >> >> >> + return;
> >> >> >> +
> >> >> >> + ci = ci_cacheinfo(cpu);
> >> >> >> + for (i = 0; i < cache_leaves(cpu); i++) {
> >> >> >> + leaf = per_cpu_cacheinfo_idx(cpu, i);
> >> >> >> + if (leaf->type != CACHE_TYPE_DATA &&
> >> >> >> + leaf->type != CACHE_TYPE_UNIFIED)
> >> >> >> + continue;
> >> >> >> + nr_shared = cpumask_weight(&leaf->shared_cpu_map);
> >> >> >> + if (!nr_shared)
> >> >> >> + continue;
> >> >> >> + size_data += leaf->size / nr_shared;
> >> >> >> + }
> >> >> >> + ci->size_data = size_data;
> >> >> >> +}
> >> >> >
> >> >> > This needs comments.
> >> >> >
> >> >> > It would be nice to add a comment on top describing the limitation of
> >> >> > CACHE_TYPE_UNIFIED here in the context of
> >> >> > update_data_cache_size_cpu().
> >> >>
> >> >> Sure. Will do that.
> >> >>
> >> >
> >> > Thanks.
> >> >
> >> >> > The L2 cache could be unified but much smaller than a L3 or other
> >> >> > last-level-cache. It's not clear from the code what level of cache is being
> >> >> > used due to a lack of familiarity of the cpu_cacheinfo code but size_data
> >> >> > is not the size of a cache, it appears to be the share of a cache a CPU
> >> >> > would have under ideal circumstances.
> >> >>
> >> >> Yes. And it isn't for one specific level of cache. It's sum of per-CPU
> >> >> shares of all levels of cache. But the calculation is inaccurate. More
> >> >> details are in the below reply.
> >> >>
> >> >> > However, as it appears to also be
> >> >> > iterating hierarchy then this may not be accurate. Caches may or may not
> >> >> > allow data to be duplicated between levels so the value may be inaccurate.
> >> >>
> >> >> Thank you very much for pointing this out! The cache can be inclusive
> >> >> or not. So, we cannot calculate the per-CPU slice of all-level caches
> >> >> via adding them together blindly. I will change this in a follow-on
> >> >> patch.
> >> >>
> >> >
> >> > Please do, I would strongly suggest basing this on LLC only because it's
> >> > the only value you can be sure of. This change is the only change that may
> >> > warrant a respin of the series as the history will be somewhat confusing
> >> > otherwise.
> >>
> >> I am still checking whether it's possible to get cache inclusive
> >> information via cpuid.
> >>
> >
> > cpuid may be x86-specific so that potentially leads to different behaviours
> > on different architectures.
> >
> >> If there's no reliable way to do that. We can use the max value of
> >> per-CPU share of each level of cache. For inclusive cache, that will be
> >> the value of LLC. For non-inclusive cache, the value will be more
> >> accurate. For example, on Intel Sapphire Rapids, the L2 cache is 2 MB
> >> per core, while LLC is 1.875 MB per core according to [1].
> >>
> >
> > Be that as it may, it still opens the possibility of significantly different
> > behaviour depending on the CPU family. I would strongly recommend that you
> > start with LLC only because LLC is also the topology level of interest used
> > by the scheduler and it's information that is generally available. Trying
> > to get accurate information on every level and the complexity of dealing
> > with inclusive vs exclusive cache or write-back vs write-through should
> > be a separate patch, with separate justification and notes on how it can
> > lead to behaviour specific to the CPU family or architecture.
>
> IMHO, we should try to optimize for as many CPUs as possible. The size
> of the per-CPU (HW thread for SMT) slice of LLC of latest Intel server
> CPUs is as follows,
>
> Icelake: 0.75 MB
> Sapphire Rapids: 0.9375 MB
>
> While pcp->batch is 63 * 4 / 1024 = 0.2461 MB.
>
> In [03/10], only if "per_cpu_cache_slice > 4 * pcp->batch", we will cache
> pcp->batch before draining the PCP. This makes the optimization
> unavailable for a significant portion of the server CPUs.
>
> In theory, if "per_cpu_cache_slice > 2 * pcp->batch", we can reuse
> cache-hot pages between CPUs. So, if we change the condition to
> "per_cpu_cache_slice > 3 * pcp->batch", I think that we are still safe.
>
> As for other CPUs, according to [2], AMD CPUs have larger per-CPU LLC.
> So, it's OK for them. ARM CPUs has much smaller per-CPU LLC, so some
> further optimization is needed.
>
> [2] https://www.anandtech.com/show/16594/intel-3rd-gen-xeon-scalable-review/2
>
> So, I suggest to use "per_cpu_cache_slice > 3 * pcp->batch" in [03/10],
> and use LLC in this patch [02/10]. Then, we can optimize the per-CPU
> slice of cache calculation in the follow-up patches.
>

I'm ok with adjusting the thresholds to adapt to using LLC only because at
least it'll be consistent across CPU architectures and families. Dealing
with the potentially different cache characteristics at each level or even
being able to discover them is just unnecessarily complicated. It gets
even worse if the mapping changes. For example, if L1 was direct mapped,
L2 index mapped and L3 fully associative, then it's not even meaningful
to say that a CPU has a well-defined slice size, as cache coloring
side-effects mess everything up.

--
Mel Gorman
SUSE Labs