Changelog since v1
o Clarification comments
o Sanity check pcp->high during reclaim (dhansen)
o Handle vm.percpu_pagelist_high_fraction in zone_highsize (hdanton)
o Sanity check pcp->batch versus pcp->high
This series has prerequisites in mmotm, so for convenience it is also
available at
https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git mm-pcpburst-v2r3
The per-cpu page allocator (PCP) is meant to reduce contention on the zone
lock, but the sizing of batch and high is archaic and takes into account
neither the zone size nor the number of CPUs local to a zone. With larger zones
and more CPUs per node, the contention is getting worse. Furthermore,
the fact that vm.percpu_pagelist_fraction adjusts both batch and high
values means that the sysctl can reduce zone lock contention but also
increase allocation latencies.
This series disassociates pcp->high from pcp->batch and then scales
pcp->high based on the size of the local zone, with limited impact on
reclaim, while accounting for active CPUs but leaving pcp->batch static.
It also adapts the number of pages that can be on the pcp list based on
recent freeing patterns.
The motivation is partially to adjust to larger memory sizes but
is also driven by the fact that large batches of page freeing via
release_pages() often shows zone contention as a major part of the
problem. Another motivation is a bug report against an older kernel where a
multi-terabyte process could take several minutes to exit. A workaround
was to use vm.percpu_pagelist_fraction to increase the pcp->high value
but testing indicated that a production workload could not use the same
values because of an increase in allocation latencies. Unfortunately,
I cannot reproduce this test case myself as the multi-terabyte machines
are in active use, but this series should alleviate the problem.
The series aims to address both and partially acts as a prerequisite for
further work: the PCP currently only handles order-0 pages, which makes it
useless for SLUB (when using high orders) and THP (unconditionally). To
store high-order pages on the PCP, the pcp->high values need to be
increased first.
Documentation/admin-guide/sysctl/vm.rst | 29 ++--
include/linux/cpuhotplug.h | 2 +-
include/linux/mmzone.h | 8 +-
kernel/sysctl.c | 8 +-
mm/internal.h | 2 +-
mm/memory_hotplug.c | 4 +-
mm/page_alloc.c | 196 ++++++++++++++++++------
mm/vmscan.c | 35 +++++
8 files changed, 212 insertions(+), 72 deletions(-)
--
2.26.2
When a task is freeing a large number of order-0 pages, it may acquire
the zone->lock multiple times, freeing pages in batches. This may
unnecessarily contend on the zone lock when freeing a very large number
of pages. This patch adapts the batch size based on the recent freeing
pattern, scaling up the batch for subsequent frees.
As the machines I used to test this are not large enough to illustrate
the problem, a debugging patch shows patterns like the following (slightly
edited for clarity):
Baseline vanilla kernel
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
With patches
time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Dave Hansen <[email protected]>
---
include/linux/mmzone.h | 3 ++-
mm/page_alloc.c | 41 +++++++++++++++++++++++++++++++++++++++--
2 files changed, 41 insertions(+), 3 deletions(-)
diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index b449151745d7..92182e0299b2 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -343,8 +343,9 @@ struct per_cpu_pages {
int count; /* number of pages in the list */
int high; /* high watermark, emptying needed */
int batch; /* chunk size for buddy add/remove */
+ short free_factor; /* batch scaling factor during free */
#ifdef CONFIG_NUMA
- int expire; /* When 0, remote pagesets are drained */
+ short expire; /* When 0, remote pagesets are drained */
#endif
/* Lists of pages, one per migrate type stored on the pcp-lists */
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dc4ac309bc21..89e60005dd27 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3267,18 +3267,47 @@ static bool free_unref_page_prepare(struct page *page, unsigned long pfn)
return true;
}
+static int nr_pcp_free(struct per_cpu_pages *pcp, int high, int batch)
+{
+ int min_nr_free, max_nr_free;
+
+ /* Check for PCP disabled or boot pageset */
+ if (unlikely(high < batch))
+ return 1;
+
+ /* Leave at least pcp->batch pages on the list */
+ min_nr_free = batch;
+ max_nr_free = high - batch;
+
+ /*
+ * Double the number of pages freed each time there is subsequent
+ * freeing of pages without any allocation.
+ */
+ batch <<= pcp->free_factor;
+ if (batch < max_nr_free)
+ pcp->free_factor++;
+ batch = clamp(batch, min_nr_free, max_nr_free);
+
+ return batch;
+}
+
static void free_unref_page_commit(struct page *page, unsigned long pfn,
int migratetype)
{
struct zone *zone = page_zone(page);
struct per_cpu_pages *pcp;
+ int high;
__count_vm_event(PGFREE);
pcp = this_cpu_ptr(zone->per_cpu_pageset);
list_add(&page->lru, &pcp->lists[migratetype]);
pcp->count++;
- if (pcp->count >= READ_ONCE(pcp->high))
- free_pcppages_bulk(zone, READ_ONCE(pcp->batch), pcp);
+ high = READ_ONCE(pcp->high);
+ if (pcp->count >= high) {
+ int batch = READ_ONCE(pcp->batch);
+
+ free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
+ }
}
/*
@@ -3530,7 +3559,14 @@ static struct page *rmqueue_pcplist(struct zone *preferred_zone,
unsigned long flags;
local_lock_irqsave(&pagesets.lock, flags);
+
+ /*
+ * On allocation, reduce the number of pages that are batch freed.
+ * See nr_pcp_free() where free_factor is increased for subsequent
+ * frees.
+ */
pcp = this_cpu_ptr(zone->per_cpu_pageset);
+ pcp->free_factor >>= 1;
list = &pcp->lists[migratetype];
page = __rmqueue_pcplist(zone, migratetype, alloc_flags, pcp, list);
local_unlock_irqrestore(&pagesets.lock, flags);
@@ -6698,6 +6734,7 @@ static void per_cpu_pages_init(struct per_cpu_pages *pcp, struct per_cpu_zonesta
*/
pcp->high = BOOT_PAGESET_HIGH;
pcp->batch = BOOT_PAGESET_BATCH;
+ pcp->free_factor = 0;
}
static void __zone_set_pageset_high_and_batch(struct zone *zone, unsigned long high,
--
2.26.2
Hi Mel,
Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
use" mode and being managed via the buddy just like the normal RAM.
The PMEM zones are big ones:
present 65011712 = 248 G
high 134595 = 525 M
The PMEM nodes, of course, don't have any CPUs in them.
With your series, the pcp->high value per-cpu is 69584 pages or about
270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
worst-case memory in the pcps per zone, or roughly 10% of the size of
the zone.
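For reference, the arithmetic behind those figures (assuming 4KiB pages):
69584 pages * 4KiB ~= 272MB per CPU; 69584 * 96 CPUs ~= 6.7M pages ~= 25.5GB;
and 25.5GB out of 248GB is roughly 10% of the zone.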
I did see quite a few pcp->counts above 60,000, so it's definitely
possible in practice to see the pcps filled up. This was not observed
to cause any actual problems in practice. But, it's still a bit worrisome.
Maybe instead of pretending that cpu-less nodes have one cpu:
nr_local_cpus = max(1U,
cpumask_weight(cpumask_of_node(zone_to_nid(zone))));
we should just consider them to have *all* the CPUs in the system. Perhaps:
nr_local_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));
if (!nr_local_cpus)
nr_local_cpus = num_online_cpus();
Even if a system has a silly number of CPUs, the 'high' sizes will still
hit a floor at 4*batch:
high = max(high, batch << 2);
Which doesn't seem too bad, especially considering that CPU-less nodes
naturally have less contention.
On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> Hi Mel,
>
> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
> use" mode and being managed via the buddy just like the normal RAM.
>
> The PMEM zones are big ones:
>
> present 65011712 = 248 G
> high 134595 = 525 M
>
> The PMEM nodes, of course, don't have any CPUs in them.
>
> With your series, the pcp->high value per-cpu is 69584 pages or about
> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
> worst-case memory in the pcps per zone, or roughly 10% of the size of
> the zone.
>
> I did see quite a few pcp->counts above 60,000, so it's definitely
> possible in practice to see the pcps filled up. This was not observed
> to cause any actual problems in practice. But, it's still a bit worrisome.
>
Ok, it does have the potential to trigger early reclaim as pages are
stored on remote PCP lists. The problem would be transient because
vmstat would drain those pages over time but still, how about this patch
on top of the series?
--8<--
mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
Dave Hansen reported the following about Feng Tang's tests on a machine
with persistent memory onlined as a DRAM-like device.
Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
use" mode and being managed via the buddy just like the normal RAM.
The PMEM zones are big ones:
present 65011712 = 248 G
high 134595 = 525 M
The PMEM nodes, of course, don't have any CPUs in them.
With your series, the pcp->high value per-cpu is 69584 pages or about
270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
worst-case memory in the pcps per zone, or roughly 10% of the size of
the zone.
This should not cause a problem as such although it could trigger reclaim
due to pages being stored on per-cpu lists for CPUs remote to a node. It
is not possible to treat cpuless nodes exactly the same as normal nodes
but the worst-case scenario can be mitigated by splitting pcp->high across
all online CPUs for cpuless memory nodes.
Suggested-by: Dave Hansen <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 14 +++++++++-----
1 file changed, 9 insertions(+), 5 deletions(-)
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d708aa14f4ef..af566e97a0f8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6687,7 +6687,7 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
{
#ifdef CONFIG_MMU
int high;
- int nr_local_cpus;
+ int nr_split_cpus;
unsigned long total_pages;
if (!percpu_pagelist_high_fraction) {
@@ -6710,10 +6710,14 @@ static int zone_highsize(struct zone *zone, int batch, int cpu_online)
* Split the high value across all online CPUs local to the zone. Note
* that early in boot that CPUs may not be online yet and that during
* CPU hotplug that the cpumask is not yet updated when a CPU is being
- * onlined.
- */
- nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
- high = total_pages / nr_local_cpus;
+ * onlined. For memory nodes that have no CPUs, split pcp->high across
+ * all online CPUs to mitigate the risk that reclaim is triggered
+ * prematurely due to pages stored on pcp lists.
+ */
+ nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
+ if (!nr_split_cpus)
+ nr_split_cpus = num_online_cpus();
+ high = total_pages / nr_split_cpus;
/*
* Ensure high is at least batch*4. The multiple is based on the
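For reference, with this applied the resulting zone_highsize() looks roughly
like the following. This is a sketch assembled from the hunks above for
readability, not an additional change; the percpu_pagelist_high_fraction
branch and the !CONFIG_MMU fallback are filled in from the surrounding
function and comments are trimmed.

static int zone_highsize(struct zone *zone, int batch, int cpu_online)
{
#ifdef CONFIG_MMU
	int high;
	int nr_split_cpus;
	unsigned long total_pages;

	if (!percpu_pagelist_high_fraction) {
		/* Default: base the high value on the zone low watermark */
		total_pages = low_wmark_pages(zone);
	} else {
		/* Otherwise use a fraction of the zone's managed pages */
		total_pages = zone_managed_pages(zone) /
				percpu_pagelist_high_fraction;
	}

	/*
	 * Split the high value across CPUs local to the zone; for cpuless
	 * memory nodes, split across all online CPUs instead.
	 */
	nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) +
				cpu_online;
	if (!nr_split_cpus)
		nr_split_cpus = num_online_cpus();
	high = total_pages / nr_split_cpus;

	/* Ensure high is at least batch*4 */
	high = max(high, batch << 2);

	return high;
#else
	return 0;
#endif
}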
On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
> On 28.05.21 11:03, David Hildenbrand wrote:
> > On 28.05.21 10:55, Mel Gorman wrote:
> > > On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
> > > > Hi Mel,
> > > >
> > > > Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> > > > ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
> > > > use" mode and being managed via the buddy just like the normal RAM.
> > > >
> > > > The PMEM zones are big ones:
> > > >
> > > > present 65011712 = 248 G
> > > > high 134595 = 525 M
> > > >
> > > > The PMEM nodes, of course, don't have any CPUs in them.
> > > >
> > > > With your series, the pcp->high value per-cpu is 69584 pages or about
> > > > 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
> > > > worst-case memory in the pcps per zone, or roughly 10% of the size of
> > > > the zone.
> >
> > When I read about having such big amounts of free memory theoretically
> > stuck in PCP lists, I guess we really want to start draining the PCP in
> > alloc_contig_range(), just as we do with memory hotunplug when offlining.
> >
>
> Correction: we already drain the pcp, we just don't temporarily disable it,
> so a race as described in offline_pages() could apply:
>
> "Disable pcplists so that page isolation cannot race with freeing
> in a way that pages from isolated pageblock are left on pcplists."
>
> Guess we'd then want to move the draining before start_isolate_page_range()
> in alloc_contig_range().
>
Or instead of draining, validate the PFN range in alloc_contig_range
is within the same zone and if so, call zone_pcp_disable() before
start_isolate_page_range and enable after __alloc_contig_migrate_range.
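Roughly, the ordering would be something like this -- an untested sketch
only, with the error handling, the aligned-pfn helpers and the
compact_control setup being the existing ones in alloc_contig_range():

	struct zone *zone = page_zone(pfn_to_page(start));

	/*
	 * Untested sketch: the range is assumed (or validated) to lie within
	 * a single zone, so disable its pcplists instead of only draining.
	 */
	zone_pcp_disable(zone);

	ret = start_isolate_page_range(pfn_max_align_down(start),
				       pfn_max_align_up(end),
				       migratetype, 0);
	if (!ret)
		ret = __alloc_contig_migrate_range(&cc, start, end);

	zone_pcp_enable(zone);

	/* ... rest of alloc_contig_range() unchanged ... */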
--
Mel Gorman
SUSE Labs
On 28.05.21 11:49, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:08:01AM +0200, David Hildenbrand wrote:
>> On 28.05.21 11:03, David Hildenbrand wrote:
>>> On 28.05.21 10:55, Mel Gorman wrote:
>>>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>>>> Hi Mel,
>>>>>
>>>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>>>>> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
>>>>> use" mode and being managed via the buddy just like the normal RAM.
>>>>>
>>>>> The PMEM zones are big ones:
>>>>>
>>>>> present 65011712 = 248 G
>>>>> high 134595 = 525 M
>>>>>
>>>>> The PMEM nodes, of course, don't have any CPUs in them.
>>>>>
>>>>> With your series, the pcp->high value per-cpu is 69584 pages or about
>>>>> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
>>>>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>>>>> the zone.
>>>
>>> When I read about having such big amounts of free memory theoretically
>>> stuck in PCP lists, I guess we really want to start draining the PCP in
>>> alloc_contig_range(), just as we do with memory hotunplug when offlining.
>>>
>>
>> Correction: we already drain the pcp, we just don't temporarily disable it,
>> so a race as described in offline_pages() could apply:
>>
>> "Disable pcplists so that page isolation cannot race with freeing
>> in a way that pages from isolated pageblock are left on pcplists."
>>
>> Guess we'd then want to move the draining before start_isolate_page_range()
>> in alloc_contig_range().
>>
>
> Or instead of draining, validate the PFN range in alloc_contig_range
> is within the same zone and if so, call zone_pcp_disable() before
> start_isolate_page_range and enable after __alloc_contig_migrate_range.
>
We require the caller to only pass a range within a single zone, so that
should be fine.
The only ugly thing about zone_pcp_disable() is
mutex_lock(&pcp_batch_high_lock) which would serialize all
alloc_contig_range() and even with offline_pages().
--
Thanks,
David / dhildenb
On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
> > > "Disable pcplists so that page isolation cannot race with freeing
> > > in a way that pages from isolated pageblock are left on pcplists."
> > >
> > > Guess we'd then want to move the draining before start_isolate_page_range()
> > > in alloc_contig_range().
> > >
> >
> > Or instead of draining, validate the PFN range in alloc_contig_range
> > is within the same zone and if so, call zone_pcp_disable() before
> > start_isolate_page_range and enable after __alloc_contig_migrate_range.
> >
>
> We require the caller to only pass a range within a single zone, so that
> should be fine.
>
> The only ugly thing about zone_pcp_disable() is
> mutex_lock(&pcp_batch_high_lock) which would serialize all
> alloc_contig_range() and even with offline_pages().
>
True, so it would have to be assessed whether that is bad or not. If racing
against offline_pages, memory is potentially being offlined in the
target zone which may cause allocation failure. If racing with other
alloc_contig_range calls, the two callers are potentially racing to
isolate and allocate the same range. The argument could be made that
alloc_contig_ranges should be serialised within one zone to improve the
allocation success rate at the potential cost of allocation latency.
--
Mel Gorman
SUSE Labs
On 28.05.21 12:09, Mel Gorman wrote:
> On Fri, May 28, 2021 at 11:52:53AM +0200, David Hildenbrand wrote:
>>>> "Disable pcplists so that page isolation cannot race with freeing
>>>> in a way that pages from isolated pageblock are left on pcplists."
>>>>
>>>> Guess we'd then want to move the draining before start_isolate_page_range()
>>>> in alloc_contig_range().
>>>>
>>>
>>> Or instead of draining, validate the PFN range in alloc_contig_range
>>> is within the same zone and if so, call zone_pcp_disable() before
>>> start_isolate_page_range and enable after __alloc_contig_migrate_range.
>>>
>>
>> We require the caller to only pass a range within a single zone, so that
>> should be fine.
>>
>> The only ugly thing about zone_pcp_disable() is
>> mutex_lock(&pcp_batch_high_lock) which would serialize all
>> alloc_contig_range() and even with offline_pages().
>>
>
> True, so it would have to be assessed whether that is bad or not. If racing
> against offline_pages, memory is potentially being offlined in the
> target zone which may cause allocation failure. If racing with other
> alloc_contig_range calls, the two callers are potentially racing to
> isolate and allocate the same range. The argument could be made that
> alloc_contig_ranges should be serialised within one zone to improve the
> allocation success rate at the potential cost of allocation latency.
We have 3 main users of alloc_contig_range():
1. CMA
CMA synchronizes allocation within a CMA area via the allocation bitmap.
So parallel CMA is perfectly possible and avoids races by design.
2. alloc_contig_pages() / Gigantic pages
Gigantic page allocation could race with virtio-mem. CMA does not apply.
Possible but unlikely to matter in practice. virtio-mem will retry later
again.
3. virtio-mem
A virtio-mem device only operates on its assigned memory region, so we
cannot have alloc_contig_range() from different devices racing, even
when within a single zone. It could only race with gigantic pages as CMA
does not apply.
So serializing would mostly harm parallel CMA (possible and likely) and
parallel virtio-mem operation (e.g., unplug memory of two virtio-mem
devices -- unlikely but possible).
Memory offlining racing with CMA is not an issue (impossible).
virtio-mem synchronizes with memory offlining via memory notifiers;
there is only a tiny race window that usually doesn't matter, as
virtio-mem is usually expected to trigger offlining itself rather than
user space doing it randomly. Memory offlining can race with dynamic
gigantic page allocation, which is highly unreliable already.
I wonder if we could optimize locking in zone_pcp_disable() instead.
--
Thanks,
David / dhildenb
On 28.05.21 10:55, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>>
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
>> use" mode and being managed via the buddy just like the normal RAM.
>>
>> The PMEM zones are big ones:
>>
>> present 65011712 = 248 G
>> high 134595 = 525 M
>>
>> The PMEM nodes, of course, don't have any CPUs in them.
>>
>> With your series, the pcp->high value per-cpu is 69584 pages or about
>> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>> the zone.
When I read about having such big amounts of free memory theoretically
stuck in PCP lists, I guess we really want to start draining the PCP in
alloc_contig_range(), just as we do with memory hotunplug when offlining.
--
Thanks,
David / dhildenb
On 28.05.21 11:03, David Hildenbrand wrote:
> On 28.05.21 10:55, Mel Gorman wrote:
>> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>>> Hi Mel,
>>>
>>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>>> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
>>> use" mode and being managed via the buddy just like the normal RAM.
>>>
>>> The PMEM zones are big ones:
>>>
>>> present 65011712 = 248 G
>>> high 134595 = 525 M
>>>
>>> The PMEM nodes, of course, don't have any CPUs in them.
>>>
>>> With your series, the pcp->high value per-cpu is 69584 pages or about
>>> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
>>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>>> the zone.
>
> When I read about having such big amounts of free memory theoretically
> stuck in PCP lists, I guess we really want to start draining the PCP in
> alloc_contig_range(), just as we do with memory hotunplug when offlining.
>
Correction: we already drain the pcp, we just don't temporarily disable
it, so a race as described in offline_pages() could apply:
"Disable pcplists so that page isolation cannot race with freeing
in a way that pages from isolated pageblock are left on pcplists."
Guess we'd then want to move the draining before
start_isolate_page_range() in alloc_contig_range().
--
Thanks,
David / dhildenb
On 5/25/21 10:01 AM, Mel Gorman wrote:
> When a task is freeing a large number of order-0 pages, it may acquire
> the zone->lock multiple times freeing pages in batches. This may
> unnecessarily contend on the zone lock when freeing very large number
> of pages. This patch adapts the size of the batch based on the recent
> pattern to scale the batch size for subsequent frees.
>
> As the machines I used to test this are not large enough to illustrate
> the problem, a debugging patch shows patterns like the following (slightly
> edited for clarity):
>
> Baseline vanilla kernel
> time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
> time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
> time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
> time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
> time-unmap-14426 [...] free_pcppages_bulk: free 63 count 378 high 378
>
> With patches
> time-unmap-7724 [...] free_pcppages_bulk: free 126 count 814 high 814
> time-unmap-7724 [...] free_pcppages_bulk: free 252 count 814 high 814
> time-unmap-7724 [...] free_pcppages_bulk: free 504 count 814 high 814
> time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
> time-unmap-7724 [...] free_pcppages_bulk: free 751 count 814 high 814
>
> Signed-off-by: Mel Gorman <[email protected]>
> Acked-by: Dave Hansen <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
On 5/28/21 10:55 AM, Mel Gorman wrote:
> On Thu, May 27, 2021 at 12:36:21PM -0700, Dave Hansen wrote:
>> Hi Mel,
>>
>> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
>> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
>> use" mode and being managed via the buddy just like the normal RAM.
>>
>> The PMEM zones are big ones:
>>
>> present 65011712 = 248 G
>> high 134595 = 525 M
>>
>> The PMEM nodes, of course, don't have any CPUs in them.
>>
>> With your series, the pcp->high value per-cpu is 69584 pages or about
>> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
>> worst-case memory in the pcps per zone, or roughly 10% of the size of
>> the zone.
>>
>> I did see quite a few pcp->counts above 60,000, so it's definitely
>> possible in practice to see the pcps filled up. This was not observed
>> to cause any actual problems in practice. But, it's still a bit worrisome.
>>
>
> Ok, it does have the potential to trigger early reclaim as pages are
> stored on remote PCP lists. The problem would be transient because
> vmstat would drain those pages over time but still, how about this patch
> on top of the series?
>
> --8<--
> mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
>
> Dave Hansen reported the following about Feng Tang's tests on a machine
> with persistent memory onlined as a DRAM-like device.
>
> Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
> use" mode and being managed via the buddy just like the normal RAM.
>
> The PMEM zones are big ones:
>
> present 65011712 = 248 G
> high 134595 = 525 M
>
> The PMEM nodes, of course, don't have any CPUs in them.
>
> With your series, the pcp->high value per-cpu is 69584 pages or about
> 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
> worst-case memory in the pcps per zone, or roughly 10% of the size of
> the zone.
>
> This should not cause a problem as such although it could trigger reclaim
> due to pages being stored on per-cpu lists for CPUs remote to a node. It
> is not possible to treat cpuless nodes exactly the same as normal nodes
> but the worst-case scenario can be mitigated by splitting pcp->high across
> all online CPUs for cpuless memory nodes.
>
> Suggested-by: Dave Hansen <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Maybe we should even consider distinguishing high limits for local-to-cpu zones
vs remote, for example for the local-to-cpu zones we would divide by the number
of local cpus, for remote-to-cpu zones we would divide by all cpus.
Because we can expect CPUs to allocate mostly from local zones, leaving more
pages on the percpu lists for those zones can be beneficial.
But as the motivation here was to reduce lock contention on freeing, that's less
clear. We probably can't expect the cpu to be freeing mostly local pages (in
case of e.g. a large process exiting), because no mechanism works towards that,
or does it? In the case of a CPU freeing to a remote zone, the lower high
limit could hurt. So it would have to be evaluated whether that works in
practice. Out of scope here, just an idea to discuss.
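To illustrate what I mean (hypothetical sketch only; the helper name is made
up and it glosses over the fact that high is currently one value per zone,
not per cpu):

	/*
	 * Hypothetical sketch, not part of the series: scale pcp->high
	 * differently depending on whether the CPU owning the pageset is
	 * local to the zone's node.
	 */
	static int zone_highsize_for_cpu(struct zone *zone, int cpu, int batch,
					 unsigned long total_pages)
	{
		unsigned int nr_split_cpus;
		int high;

		if (cpu_to_node(cpu) == zone_to_nid(zone))
			/* local CPUs: split only across the CPUs of this node */
			nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone)));
		else
			/* remote CPUs: share the budget with all online CPUs */
			nr_split_cpus = num_online_cpus();

		if (!nr_split_cpus)
			nr_split_cpus = 1;

		high = total_pages / nr_split_cpus;
		return max(high, batch << 2);
	}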
On Fri, May 28, 2021 at 02:12:09PM +0200, Vlastimil Babka wrote:
> > mm/page_alloc: Split pcp->high across all online CPUs for cpuless nodes
> >
> > Dave Hansen reported the following about Feng Tang's tests on a machine
> > with persistent memory onlined as a DRAM-like device.
> >
> > Feng Tang tossed these on a "Cascade Lake" system with 96 threads and
> > ~512G of persistent memory and 128G of DRAM. The PMEM is in "volatile
> > use" mode and being managed via the buddy just like the normal RAM.
> >
> > The PMEM zones are big ones:
> >
> > present 65011712 = 248 G
> > high 134595 = 525 M
> >
> > The PMEM nodes, of course, don't have any CPUs in them.
> >
> > With your series, the pcp->high value per-cpu is 69584 pages or about
> > 270MB per CPU. Scaled up by the 96 CPU threads, that's ~26GB of
> > worst-case memory in the pcps per zone, or roughly 10% of the size of
> > the zone.
> >
> > This should not cause a problem as such although it could trigger reclaim
> > due to pages being stored on per-cpu lists for CPUs remote to a node. It
> > is not possible to treat cpuless nodes exactly the same as normal nodes
> > but the worst-case scenario can be mitigated by splitting pcp->high across
> > all online CPUs for cpuless memory nodes.
> >
> > Suggested-by: Dave Hansen <[email protected]>
> > Signed-off-by: Mel Gorman <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>
>
Thanks.
> Maybe we should even consider distinguishing high limits for local-to-cpu zones
> vs remote, for example for the local-to-cpu zones we would divide by the number
> of local cpus, for remote-to-cpu zones we would divide by all cpus.
>
> Because we can expect cpus to allocate mostly from local zones, so leaving more
> pages on percpu for those zones can be beneficial.
>
I did think about whether the ratios should be different but failed to
conclude that it was necessary or useful so I kept it simple.
> But as the motivation here was to reduce lock contention on freeing, that's less
> clear. We probably can't expect the cpu to be freeing mostly local pages (in
> case of e.g. a large process exiting), because no mechanism works towards that,
> or does it? In case of cpu freeing to remote zone, the lower high limit could hurt.
>
This is the major issue. Even if an application were NUMA-aware and heavily
threaded, the exiting process is potentially freeing remote memory and
there is nothing wrong with that. The remote memory will be partially
drained by pcp->high being reached and the remaining memory will be
cleaned up by vmstat. It's a similar problem if a process is truncating
a large file with page cache allocated on a remote node.
Hence I decided to do nothing fancy with the ratios until a practical
problem was identified that could be alleviated by adjusting pcp->high
based on whether the CPU is remote or local to memory.
--
Mel Gorman
SUSE Labs
On 5/28/21 1:55 AM, Mel Gorman wrote:
> - * onlined.
> - */
> - nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> - high = total_pages / nr_local_cpus;
> + * onlined. For memory nodes that have no CPUs, split pcp->high across
> + * all online CPUs to mitigate the risk that reclaim is triggered
> + * prematurely due to pages stored on pcp lists.
> + */
> + nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> + if (!nr_split_cpus)
> + nr_split_cpus = num_online_cpus();
> + high = total_pages / nr_split_cpus;
Updated version looks fine to me, thanks!
BTW, to do some of this testing, Feng was doing a plain old kernel
build. On the one system where this got run, he noted a ~2% regression
in build times. Nothing major, but you might want to be on the lookout
in case 0day or the other test harnesses find something similar once
this series gets to them.
Acked-by: Dave Hansen <[email protected]>
On Fri, May 28, 2021 at 07:39:29AM -0700, Dave Hansen wrote:
> On 5/28/21 1:55 AM, Mel Gorman wrote:
> > - * onlined.
> > - */
> > - nr_local_cpus = max(1U, cpumask_weight(cpumask_of_node(zone_to_nid(zone)))) + cpu_online;
> > - high = total_pages / nr_local_cpus;
> > + * onlined. For memory nodes that have no CPUs, split pcp->high across
> > + * all online CPUs to mitigate the risk that reclaim is triggered
> > + * prematurely due to pages stored on pcp lists.
> > + */
> > + nr_split_cpus = cpumask_weight(cpumask_of_node(zone_to_nid(zone))) + cpu_online;
> > + if (!nr_split_cpus)
> > + nr_split_cpus = num_online_cpus();
> > + high = total_pages / nr_split_cpus;
>
> Updated version looks fine to me, thanks!
>
> BTW, to do some of this testing, Feng was doing a plain old kernel
> build. On the one system where this got run, he noted a ~2% regression
> in build times. Nothing major, but you might want to be on the lookout
> in case 0day or the other test harnesses find something similar once
> this series gets to them.
>
What type of system was it?
I noticed minor differences for some thread counts on kernel compilations
but for CascadeLake at least, it was mostly neutral. Below is an old test
result based on a previous revision.
kernbench
5.13.0-rc2 5.13.0-rc2
vanilla mm-pcpburst-v2r3
Amean elsp-2 469.22 ( 0.00%) 470.03 * -0.17%*
Amean elsp-4 251.03 ( 0.00%) 250.83 ( 0.08%)
Amean elsp-8 131.39 ( 0.00%) 130.89 ( 0.38%)
Amean elsp-16 74.37 ( 0.00%) 75.11 ( -0.99%)
Amean elsp-32 42.10 ( 0.00%) 42.20 ( -0.24%)
Amean elsp-64 32.21 ( 0.00%) 32.14 ( 0.23%)
Amean elsp-128 31.59 ( 0.00%) 31.68 ( -0.27%)
Amean elsp-160 31.76 ( 0.00%) 31.69 ( 0.21%)
A Haswell machine showed the worst results for kernbench
Amean elsp-2 459.99 ( 0.00%) 465.27 * -1.15%*
Amean elsp-4 250.76 ( 0.00%) 253.17 * -0.96%*
Amean elsp-8 141.28 ( 0.00%) 141.78 ( -0.36%)
Amean elsp-16 77.71 ( 0.00%) 77.88 ( -0.22%)
Amean elsp-32 44.09 ( 0.00%) 44.40 ( -0.69%)
Amean elsp-64 33.79 ( 0.00%) 33.46 ( 0.96%)
Amean elsp-128 33.14 ( 0.00%) 33.26 ( -0.37%)
Amean elsp-160 33.26 ( 0.00%) 33.36 * -0.30%*
The series with review feedback and dealing with cpuless nodes is queued
and should complete over the weekend.
> Acked-by: Dave Hansen <[email protected]>
Thanks!
--
Mel Gorman
SUSE Labs
On 5/28/21 8:18 AM, Mel Gorman wrote:
>> BTW, to do some of this testing, Feng was doing a plain old kernel
>> build. On the one system where this got run, he noted a ~2% regression
>> in build times. Nothing major, but you might want to be on the lookout
>> in case 0day or the other test harnesses find something similar once
>> this series gets to them.
>>
> What type of system was it?
>
> I noticed minor differences for some thread counts on kernel compilations
> but for CascadeLake at least, it was mostly neutral. Below is an old test
> result based on a previous revision.
It's a Cascade Lake as well. But, I never trust hardware at a hardware
company. These could be preproduction CPUs or BIOS or both, or have
some bonkers configuration knob flipped.
It's also got a bunch of PMEM plugged and onlined, including the
_possibility_ of kernel data structures ending up on PMEM. They *mostly*
don't end up there, but it does happen on occasion.
Anyway, I'll see if we can do some more runs with your latest version.
It looks like it's been picked up for -mm so 0day should be pounding on
it soon enough.
On Fri, May 28, 2021 at 09:17:41AM -0700, Dave Hansen wrote:
> On 5/28/21 8:18 AM, Mel Gorman wrote:
> >> BTW, to do some of this testing, Feng was doing a plain old kernel
> >> build. On the one system where this got run, he noted a ~2% regression
> >> in build times. Nothing major, but you might want to be on the lookout
> >> in case 0day or the other test harnesses find something similar once
> >> this series gets to them.
> >>
> > What type of system was it?
> >
> > I noticed minor differences for some thread counts on kernel compilations
> > but for CascadeLake at least, it was mostly neutral. Below is an old test
> > result based on a previous revision.
>
> It's a Cascade Lake as well. But, I never trust hardware at a hardware
> company. These could be preproduction CPUs or BIOS or both, or have
> some bonkers configuration knob flipped.
>
> It's also got a bunch of PMEM plugged and onlined, including the
> _possibility_ of kernel data structures ended up on PMEM. They *mostly*
> don't end up there, but it does happen on occasion.
>
> Anyway, I'll see if we can do some more runs with your latest version.
> It looks like it's been picked up for -mm so 0day should be pounding on
> it soon enough.
Yes, 0day usually has more benchmark tests covering the -mm tree.
As for the kbuild test run for v2, after more runs, the previous 2%
longer kbuild time turned into a 1% shorter time, which seems to be
within the normal deviation range.
Also, I checked Mel's v3 branch which has the fix for cpuless nodes,
and the pcp 'high' looks normal on the PMEM node:
pagesets
cpu: 0
count: 67
high: 724
batch: 63
vm stats threshold: 125
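That matches the earlier figures: the 69584 pages per CPU reported before,
split across all 96 online CPUs, gives 69584 / 96 ~= 724.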
Thanks,
Feng