The page allocation performance requirements of different workloads
are usually different, so PCP (per-CPU pageset) high often needs to
be tuned to optimize page allocation performance for a workload.
Today there is a system-wide sysctl knob (percpu_pagelist_high_fraction)
to tune PCP high by hand, but it is hard to find the best value by
hand, and one global configuration may not work best for the
different workloads that run on the same system. One solution to
these issues is to tune the PCP high of each CPU automatically.
This patch adds the framework for PCP high auto-tuning. With it,
pcp->high will be changed automatically by the tuning algorithm at
runtime. Its default value (pcp->high_def) is the original PCP high
value, calculated based on the low watermark pages or the
percpu_pagelist_high_fraction sysctl knob. To avoid putting too many
pages in the PCP, the minimum value allowed for the
percpu_pagelist_high_fraction sysctl knob,
MIN_PERCPU_PAGELIST_HIGH_FRACTION, is used to calculate the max PCP
high value (pcp->high_max).
This patch only adds the framework, so pcp->high is always set to
pcp->high_def for now. The actual auto-tuning algorithm will be added
in the next patch in the series.
Signed-off-by: "Huang, Ying" <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: David Hildenbrand <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Pavel Tatashin <[email protected]>
Cc: Matthew Wilcox <[email protected]>
---
include/linux/mmzone.h | 5 ++-
mm/page_alloc.c | 79 +++++++++++++++++++++++++++---------------
2 files changed, 55 insertions(+), 29 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index a4889c9d4055..7e2c1864a9ea 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -663,6 +663,8 @@ struct per_cpu_pages {
spinlock_t lock; /* Protects lists field */
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
+ int high_def; /* default high watermark */
+ int high_max; /* max high watermark */
int batch; /* chunk size for buddy add/remove */
short free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
@@ -820,7 +822,8 @@ struct zone {
* the high and batch values are copied to individual pagesets for
* faster access
*/
- int pageset_high;
+ int pageset_high_def;
+ int pageset_high_max;
int pageset_batch;
#ifndef CONFIG_SPARSEMEM
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 47421bedc12b..dd83c19f25c6 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2601,7 +2601,7 @@ static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch,
static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
bool free_high)
{
- int high = READ_ONCE(pcp->high);
+ int high = pcp->high;
if (unlikely(!high || free_high))
return 0;
@@ -2616,14 +2616,22 @@ static int nr_pcp_high(struct per_cpu_pages *pcp, struct zone *zone,
return min(READ_ONCE(pcp->batch) << 2, high);
}
+static void tune_pcp_high(struct per_cpu_pages *pcp, int high_def)
+{
+ pcp->high = high_def;
+}
+
static void free_unref_page_commit(struct zone *zone, struct per_cpu_pages *pcp,
struct page *page, int migratetype,
unsigned int order)
{
- int high;
+ int high, high_def;
int pindex;
bool free_high;
+ high_def = READ_ONCE(pcp->high_def);
+ tune_pcp_high(pcp, high_def);
+
__count_vm_events(PGFREE, 1 << order);
pindex = order_to_pindex(migratetype, order);
list_add(&page->pcp_list, &pcp->lists[pindex]);
@@ -5976,14 +5984,15 @@ static int zone_batchsize(struct zone *zone)
#endif
}
-static int zone_highsize(struct zone *zone, int batch, int cpu_online)
+static int zone_highsize(struct zone *zone, int batch, int cpu_online,
+ int high_fraction)
{
#ifdef CONFIG_MMU
int high;
int nr_split_cpus;
unsigned long total_pages;
- if (!percpu_pagelist_high_fraction) {
+ if (!high_fraction) {
/*
* By default, the high value of the pcp is based on the zone
* low watermark so that if they are full then background
@@ -5996,15 +6005,15 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* value is based on a fraction of the managed pages in the
* zone.
*/
- total_pages = zone_managed_pages(zone) / percpu_pagelist_high_fraction;
+ total_pages = zone_managed_pages(zone) / high_fraction;
}
/*
* Split the high value across all online CPUs local to the zone. Note
* that early in boot that CPUs may not be online yet and that during
* CPU hotplug that the cpumask is not yet updated when a CPU is being
- * onlined. For memory nodes that have no CPUs, split pcp->high across
- * all online CPUs to mitigate the risk that reclaim is triggered
+ * onlined. For memory nodes that have no CPUs, split the high value
+ * across all online CPUs to mitigate the risk that reclaim is triggered
* prematurely due to pages stored on pcp lists.
*/
nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
@@ -6032,19 +6041,21 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* However, guaranteeing these relations at all times would require e.g. write
* barriers here but also careful usage of read barriers at the read side, and
* thus be prone to error and bad for performance. Thus the update only prevents
- * store tearing. Any new users of pcp->batch and pcp->high should ensure they
- * can cope with those fields changing asynchronously, and fully trust only the
- * pcp->count field on the local CPU with interrupts disabled.
+ * store tearing. Any new users of pcp->batch, pcp->high_def and pcp->high_max
+ * should ensure they can cope with those fields changing asynchronously, and
+ * fully trust only the pcp->count field on the local CPU with interrupts
+ * disabled.
*
* mutex_is_locked(&pcp_batch_high_lock) required when calling this function
* outside of boot time (or some other assurance that no concurrent updaters
* exist).
*/
-static void pageset_update(struct per_cpu_pages *pcp, unsigned long high,
- unsigned long batch)
+static void pageset_update(struct per_cpu_pages *pcp, unsigned long high_def,
+ unsigned long high_max, unsigned long batch)
{
WRITE_ONCE(pcp->batch, batch);
- WRITE_ONCE(pcp->high, high);
+ WRITE_ONCE(pcp->high_def, high_def);
+ WRITE_ONCE(pcp->high_max, high_max);
}
static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonestat *pzstats)
@@ -6064,20 +6075,21 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
* need to be as careful as pageset_update() as nobody can access the
* pageset yet.
*/
- pcp->high = BOOT_PAGESET_HIGH;
+ pcp->high_def = BOOT_PAGESET_HIGH;
+ pcp->high_max = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
pcp->free_factor = 0;
}
-static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
- unsigned long batch)
+static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high_def,
+ unsigned long high_max, unsigned long batch)
{
struct per_cpu_pages *pcp;
int cpu;
for_each_possible_cpu(cpu) {
pcp = per_cpu_ptr(zone->per_cpu_pageset, cpu);
- pageset_update(pcp, high, batch);
+ pageset_update(pcp, high_def, high_max, batch);
}
}
@@ -6087,19 +6099,26 @@ static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long h
*/
static void zone_set_pageset_high_and_batch(struct zone *zone, int cpu_online)
{
- int new_high, new_batch;
+ int new_high_def, new_high_max, new_batch;
new_batch = max(1, zone_batchsize(zone));
- new_high = zone_highsize(zone, new_batch, cpu_online);
+ new_high_def = zone_highsize(zone, new_batch, cpu_online,
+ percpu_pagelist_high_fraction);
+ new_high_max = zone_highsize(zone, new_batch, cpu_online,
+ MIN_PERCPU_PAGELIST_HIGH_FRACTION);
+ new_high_def = min(new_high_def, new_high_max);
- if (zone->pageset_high == new_high &&
+ if (zone->pageset_high_def == new_high_def &&
+ zone->pageset_high_max == new_high_max &&
zone->pageset_batch == new_batch)
return;
- zone->pageset_high = new_high;
+ zone->pageset_high_def = new_high_def;
+ zone->pageset_high_max = new_high_max;
zone->pageset_batch = new_batch;
- __zone_set_pageset_high_and_batch(zone, new_high, new_batch);
+ __zone_set_pageset_high_and_batch(zone, new_high_def, new_high_max,
+ new_batch);
}
void __meminit setup_zone_pageset(struct zone *zone)
@@ -6175,7 +6194,8 @@ __meminit void zone_pcp_init(struct zone *zone)
*/
zone->per_cpu_pageset = &boot_pageset;
zone->per_cpu_zonestats = &boot_zonestats;
- zone->pageset_high = BOOT_PAGESET_HIGH;
+ zone->pageset_high_def = BOOT_PAGESET_HIGH;
+ zone->pageset_high_max = BOOT_PAGESET_HIGH;
zone->pageset_batch = BOOT_PAGESET_BATCH;
if (populated_zone(zone))
@@ -6619,9 +6639,11 @@ int lowmem_reserve_ratio_sysctl_handler(struct ctl_table *table, int write,
}
/*
- * percpu_pagelist_high_fraction - changes the pcp->high for each zone on each
- * cpu. It is the fraction of total pages in each zone that a hot per cpu
- * pagelist can have before it gets flushed back to buddy allocator.
+ * percpu_pagelist_high_fraction - changes the pcp->high_def for each zone on
+ * each cpu. It is the fraction of total pages in each zone that a hot per cpu
+ * pagelist can have before it gets flushed back to the buddy allocator. This
+ * only sets the default value; the actual value may be tuned automatically at
+ * runtime.
*/
int percpu_pagelist_high_fraction_sysctl_handler(struct ctl_table *table,
int write, void *buffer, size_t *length, loff_t *ppos)
@@ -7008,13 +7030,14 @@ EXPORT_SYMBOL(free_contig_range);
void zone_pcp_disable(struct zone *zone)
{
mutex_lock(&pcp_batch_high_lock);
- __zone_set_pageset_high_and_batch(zone, 0, 1);
+ __zone_set_pageset_high_and_batch(zone, 0, 0, 1);
__drain_all_pages(zone, true);
}
void zone_pcp_enable(struct zone *zone)
{
- __zone_set_pageset_high_and_batch(zone, zone->pageset_high, zone->pageset_batch);
+ __zone_set_pageset_high_and_batch(zone, zone->pageset_high_def,
+ zone->pageset_high_max, zone->pageset_batch);
mutex_unlock(&pcp_batch_high_lock);
}
--
2.39.2
On Mon 10-07-23 14:53:24, Huang Ying wrote:
> The page allocation performance requirements of different workloads
> are usually different. So, we often need to tune PCP (per-CPU
> pageset) high to optimize the workload page allocation performance.
> Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction)
> to tune PCP high by hand. But, it's hard to find out the best value
> by hand. And one global configuration may not work best for the
> different workloads that run on the same system. One solution to
> these issues is to tune PCP high of each CPU automatically.
>
> This patch adds the framework for PCP high auto-tuning. With it,
> pcp->high will be changed automatically by tuning algorithm at
> runtime. Its default value (pcp->high_def) is the original PCP high
> value calculated based on low watermark pages or
> percpu_pagelist_high_fraction sysctl knob. To avoid putting too many
> pages in PCP, the original limit of percpu_pagelist_high_fraction
> sysctl knob, MIN_PERCPU_PAGELIST_HIGH_FRACTION, is used to calculate
> the max PCP high value (pcp->high_max).
It would have been very helpful to describe the basic entry points to
the auto-tuning. AFAICS the central place of the tuning is tune_pcp_high
which is called from the freeing path. Why? Is this really a good place
considering this is a hot path? What about the allocation path? Isn't
that a good spot to watch for the allocation demand?
Also this framework seems to be enabled by default. Is this really
desirable? What about workloads tuning the pcp batch size manually?
Shouldn't they override any auto-tuning?
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Mon 10-07-23 14:53:24, Huang Ying wrote:
>> The page allocation performance requirements of different workloads
>> are usually different. So, we often need to tune PCP (per-CPU
>> pageset) high to optimize the workload page allocation performance.
>> Now, we have a system wide sysctl knob (percpu_pagelist_high_fraction)
>> to tune PCP high by hand. But, it's hard to find out the best value
>> by hand. And one global configuration may not work best for the
>> different workloads that run on the same system. One solution to
>> these issues is to tune PCP high of each CPU automatically.
>>
>> This patch adds the framework for PCP high auto-tuning. With it,
>> pcp->high will be changed automatically by tuning algorithm at
>> runtime. Its default value (pcp->high_def) is the original PCP high
>> value calculated based on low watermark pages or
>> percpu_pagelist_high_fraction sysctl knob. To avoid putting too many
>> pages in PCP, the original limit of percpu_pagelist_high_fraction
>> sysctl knob, MIN_PERCPU_PAGELIST_HIGH_FRACTION, is used to calculate
>> the max PCP high value (pcp->high_max).
>
> It would have been very helpful to describe the basic entry points to
> the auto-tuning. AFAICS the central place of the tuning is tune_pcp_high
> which is called from the freeing path. Why? Is this really a good place
> considering this is a hot path? What about the allocation path? Isn't
> that a good spot to watch for the allocation demand?
Yes. The main entry point to the auto-tuning is tune_pcp_high(),
which is called from the freeing path because pcp->high is only used
by page freeing. It would be possible to call it in the allocation
path instead. The drawback is that pcp->high may be updated a little
late in some situations, for example, if many pages are freed but
none are allocated for quite a long time. But I don't think this is
a serious problem.
> Also this framework seems to be enabled by default. Is this really
> desirable? What about workloads tuning the pcp batch size manually?
> Shouldn't they override any auto-tuning?
In the current implementation, pcp->high will be tuned between the
original PCP high (default or tuned manually) and the max PCP high
(via MIN_PERCPU_PAGELIST_HIGH_FRACTION). So the high value tuned
manually is respected to some degree.
So you think that it's better to disable auto-tuning if PCP high is
tuned manually?
Best Regards,
Huang, Ying
On Wed 12-07-23 15:45:58, Huang, Ying wrote:
> Michal Hocko <[email protected]> writes:
>
> > On Mon 10-07-23 14:53:24, Huang Ying wrote:
> >> [...]
> >
> > It would have been very helpful to describe the basic entry points to
> > the auto-tuning. AFAICS the central place of the tuning is tune_pcp_high
> > which is called from the freeing path. Why? Is this really a good place
> > considering this is a hot path? What about the allocation path? Isn't
> > that a good spot to watch for the allocation demand?
>
> Yes. The main entry point to the auto-tuning is tune_pcp_high(). Which
> is called from the freeing path because pcp->high is only used by page
> freeing. It's possible to call it in allocation path instead. The
> drawback is that the pcp->high may be updated a little later in some
> situations. For example, if there are many page freeing but no page
> allocation for quite long time. But I don't think this is a serious
> problem.
I consider it a serious flaw in the framework that it cannot cope
with transitions in the allocation pattern (e.g. increasing
allocation pressure).
> > Also this framework seems to be enabled by default. Is this really
> > desirable? What about workloads tuning the pcp batch size manually?
> > Shouldn't they override any auto-tuning?
>
> In the current implementation, the pcp->high will be tuned between
> original pcp high (default or tuned manually) and the max pcp high (via
> MIN_PERCPU_PAGELIST_HIGH_FRACTION). So the high value tuned manually is
> respected at some degree.
>
> So you think that it's better to disable auto-tuning if PCP high is
> tuned manually?
Yes, I think this is a much safer option, for two reasons: 1) it is
less surprising to setups that know what they are doing when
configuring the batching, and 2) the auto-tuning needs a way to be
disabled in case there are pathological behavior patterns.
--
Michal Hocko
SUSE Labs
Michal Hocko <[email protected]> writes:
> On Wed 12-07-23 15:45:58, Huang, Ying wrote:
>> Michal Hocko <[email protected]> writes:
>>
>> > On Mon 10-07-23 14:53:24, Huang Ying wrote:
>> >> [...]
>> >
>> > It would have been very helpful to describe the basic entry points to
>> > the auto-tuning. AFAICS the central place of the tuning is tune_pcp_high
>> > which is called from the freeing path. Why? Is this really a good place
>> > considering this is a hot path? What about the allocation path? Isn't
>> > that a good spot to watch for the allocation demand?
>>
>> Yes. The main entry point to the auto-tuning is tune_pcp_high(). Which
>> is called from the freeing path because pcp->high is only used by page
>> freeing. It's possible to call it in allocation path instead. The
>> drawback is that the pcp->high may be updated a little later in some
>> situations. For example, if there are many page freeing but no page
>> allocation for quite long time. But I don't think this is a serious
>> problem.
>
> I consider it a serious flaw in the framework as it cannot cope with the
> transition of the allocation pattern (e.g. increasing the allocation
> pressure).
Sorry, my previous words were misleading. What I really wanted to say
is that the problem may be only theoretical. Anyway, I will try to
avoid this problem in a future version.
>> > Also this framework seems to be enabled by default. Is this really
>> > desirable? What about workloads tuning the pcp batch size manually?
>> > Shouldn't they override any auto-tuning?
>>
>> In the current implementation, the pcp->high will be tuned between
>> original pcp high (default or tuned manually) and the max pcp high (via
>> MIN_PERCPU_PAGELIST_HIGH_FRACTION). So the high value tuned manually is
>> respected at some degree.
>>
>> So you think that it's better to disable auto-tuning if PCP high is
>> tuned manually?
>
> Yes, I think this is a much safer option. For two reasons 1) it is less
> surprising to setups which know what they are doing by configuring the
> batching and 2) the auto-tuning needs a way to get disabled in case
> there are pathological patterns in behavior.
OK.
Best Regards,
Huang, Ying