Johannes Weiner <[email protected]> writes:
> The idea behind the cache is to save get_pageblock_migratetype()
> lookups during bulk freeing. A microbenchmark suggests this isn't
> helping, though. The pcp migratetype can get stale, which means that
> bulk freeing has an extra branch to check if the pageblock was
> isolated while on the pcp.
>
> While the variance overlaps, the cache write and the branch seem to
> make this a net negative. The following test allocates and frees
> batches of 10,000 pages (~3x the pcp high marks to trigger flushing):
>
> Before:
>        8,668.48 msec task-clock   # 99.735 CPUs utilized   ( +- 2.90% )
>              19 context-switches  # 4.341 /sec             ( +- 3.24% )
>               0 cpu-migrations    # 0.000 /sec
>          17,440 page-faults       # 3.984 K/sec            ( +- 2.90% )
>  41,758,692,473 cycles            # 9.541 GHz              ( +- 2.90% )
> 126,201,294,231 instructions      # 5.98 insn per cycle    ( +- 2.90% )
>  25,348,098,335 branches          # 5.791 G/sec            ( +- 2.90% )
>      33,436,921 branch-misses     # 0.26% of all branches  ( +- 2.90% )
>
>       0.0869148 +- 0.0000302 seconds time elapsed  ( +- 0.03% )
>
> After:
>        8,444.81 msec task-clock   # 99.726 CPUs utilized   ( +- 2.90% )
>              22 context-switches  # 5.160 /sec             ( +- 3.23% )
>               0 cpu-migrations    # 0.000 /sec
>          17,443 page-faults       # 4.091 K/sec            ( +- 2.90% )
>  40,616,738,355 cycles            # 9.527 GHz              ( +- 2.90% )
> 126,383,351,792 instructions      # 6.16 insn per cycle    ( +- 2.90% )
>  25,224,985,153 branches          # 5.917 G/sec            ( +- 2.90% )
>      32,236,793 branch-misses     # 0.25% of all branches  ( +- 2.90% )
>
>       0.0846799 +- 0.0000412 seconds time elapsed  ( +- 0.05% )
>
> A side effect is that this also ensures that pages whose pageblock
> gets stolen while on the pcplist end up on the right freelist and we
> don't perform potentially type-incompatible buddy merges (or skip
> merges when we shouldn't), which is likely beneficial to long-term
> fragmentation management, although the effects would be harder to
> measure. Settle for simpler and faster code as justification here.
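The staleness described in the quoted text can be illustrated with a toy userspace model. This is only a sketch under stated assumptions: every name here (toy_page, toy_block_type, toy_steal_block, and so on) is hypothetical and is not the kernel's data structure or API. It shows why a type cached when the page goes onto the pcp list can disagree with a fresh pageblock lookup made later, at flush time.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins; not the kernel's migratetype definitions. */
enum toy_migratetype { TOY_UNMOVABLE, TOY_MOVABLE };

#define TOY_PAGES_PER_BLOCK 512UL
#define TOY_NR_BLOCKS 4

/* One type per pageblock, playing the role of the pageblock flags. */
static enum toy_migratetype toy_block_type[TOY_NR_BLOCKS];

struct toy_page {
	unsigned long pfn;
	enum toy_migratetype cached_type;	/* snapshot taken at pcp-free time */
};

/* Fresh lookup: always reflects the block's current type. */
static enum toy_migratetype toy_lookup_type(unsigned long pfn)
{
	return toy_block_type[pfn / TOY_PAGES_PER_BLOCK];
}

/* Freeing to the pcp list records the type as it is right now. */
static void toy_free_to_pcp(struct toy_page *page)
{
	page->cached_type = toy_lookup_type(page->pfn);
}

/* The block changes type while the page still sits on the pcp list. */
static void toy_steal_block(unsigned long pfn, enum toy_migratetype new_type)
{
	toy_block_type[pfn / TOY_PAGES_PER_BLOCK] = new_type;
}
```

In this model, a flush that trusts cached_type would file the page under the old type, while a flush that calls toy_lookup_type() picks the block's current type, at the cost of one lookup per page.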
I suspected that the PCP allocating/freeing path may be affected (that
is, when the allocating/freeing batch is smaller than PCP high). So I
tested one-process will-it-scale/page_fault1 with sysctl
percpu_pagelist_high_fraction=8, so that pages are allocated from and
freed to the PCP only. The test results are as follows:
Before:
  will-it-scale.1.processes                      618364.3 (+- 0.075%)
  perf-profile.children.get_pfnblock_flags_mask      0.13 (+- 9.350%)
After:
  will-it-scale.1.processes                      616512.0 (+- 0.057%)
  perf-profile.children.get_pfnblock_flags_mask      0.41 (+- 22.44%)
The change isn't large: -0.3%. Perf profiling shows the cycles% of
get_pfnblock_flags_mask() increases.
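For reference, the -0.3% figure follows directly from the two will-it-scale scores above. A minimal sketch of the arithmetic in C (the helper name relative_change_percent is hypothetical, introduced only for illustration):

```c
#include <assert.h>

/*
 * Relative change from 'before' to 'after', in percent.
 * Negative values mean the score dropped.
 */
static double relative_change_percent(double before, double after)
{
	return (after - before) / before * 100.0;
}
```

Calling relative_change_percent(618364.3, 616512.0) gives roughly -0.3, which is larger than the reported run-to-run variation (+- 0.075% and +- 0.057%) but still small.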
--
Best Regards,
Huang, Ying
On Wed, Sep 27, 2023 at 01:42:25PM +0800, Huang, Ying wrote:
> The change isn't large: -0.3%. Perf profiling shows the cycles% of
> get_pfnblock_flags_mask() increases.
Ah, this is going through the free_unref_page_list() path that
Vlastimil had pointed out as well. I made another change on top that
eliminates the second lookup. After that, both pcp fast paths have the
same number of lookups as before: 1. This fixes the regression for me.
Would you mind confirming this as well?
--
From f5d032019ed832a1a50454347a33b00ca6abeb30 Mon Sep 17 00:00:00 2001
From: Johannes Weiner <[email protected]>
Date: Fri, 15 Sep 2023 16:03:24 -0400
Subject: [PATCH] mm: page_alloc: optimize free_unref_page_list()

Move direct freeing of isolated pages to the lock-breaking block in
the second loop. This saves an unnecessary migratetype reassessment.

Minor comment and local variable scoping cleanups.

Suggested-by: Vlastimil Babka <[email protected]>
Signed-off-by: Johannes Weiner <[email protected]>
---
 mm/page_alloc.c | 44 ++++++++++++++++++--------------------------
 1 file changed, 18 insertions(+), 26 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bfffc1af94cd..665930ffe22a 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2466,48 +2466,40 @@ void free_unref_page_list(struct list_head *list)
 	struct per_cpu_pages *pcp = NULL;
 	struct zone *locked_zone = NULL;
 	int batch_count = 0;
-	int migratetype;
-
-	/* Prepare pages for freeing */
-	list_for_each_entry_safe(page, next, list, lru) {
-		unsigned long pfn = page_to_pfn(page);
-
-		if (!free_pages_prepare(page, 0, FPI_NONE)) {
-			list_del(&page->lru);
-			continue;
-		}
-		/*
-		 * Free isolated pages directly to the allocator, see
-		 * comment in free_unref_page.
-		 */
-		migratetype = get_pfnblock_migratetype(page, pfn);
-		if (unlikely(is_migrate_isolate(migratetype))) {
+	list_for_each_entry_safe(page, next, list, lru)
+		if (!free_pages_prepare(page, 0, FPI_NONE))
 			list_del(&page->lru);
-			free_one_page(page_zone(page), page, pfn, 0, FPI_NONE);
-			continue;
-		}
-	}
 	list_for_each_entry_safe(page, next, list, lru) {
 		unsigned long pfn = page_to_pfn(page);
 		struct zone *zone = page_zone(page);
+		int migratetype;
 		list_del(&page->lru);
 		migratetype = get_pfnblock_migratetype(page, pfn);
 		/*
-		 * Either different zone requiring a different pcp lock or
-		 * excessive lock hold times when freeing a large list of
-		 * pages.
+		 * Zone switch, batch complete, or non-pcp freeing?
+		 * Drop the pcp lock and evaluate.
 		 */
-		if (zone != locked_zone || batch_count == SWAP_CLUSTER_MAX) {
+		if (unlikely(zone != locked_zone ||
+			     batch_count == SWAP_CLUSTER_MAX ||
+			     is_migrate_isolate(migratetype))) {
 			if (pcp) {
 				pcp_spin_unlock(pcp);
 				pcp_trylock_finish(UP_flags);
+				locked_zone = NULL;
 			}
-			batch_count = 0;
+			/*
+			 * Free isolated pages directly to the
+			 * allocator, see comment in free_unref_page.
+			 */
+			if (is_migrate_isolate(migratetype)) {
+				free_one_page(zone, page, pfn, 0, FPI_NONE);
+				continue;
+			}
 			/*
 			 * trylock is necessary as pages may be getting freed
@@ -2518,10 +2510,10 @@ void free_unref_page_list(struct list_head *list)
 			if (unlikely(!pcp)) {
 				pcp_trylock_finish(UP_flags);
 				free_one_page(zone, page, pfn, 0, FPI_NONE);
-				locked_zone = NULL;
 				continue;
 			}
 			locked_zone = zone;
+			batch_count = 0;
 		}
 		/*
--
2.42.0
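The control flow of the reworked second loop can be sketched in userspace as follows. This is a toy model under stated assumptions, not the kernel code: the toy_* names are hypothetical, the isolation flag on each page stands in for the single get_pfnblock_migratetype() lookup, TOY_BATCH_MAX stands in for SWAP_CLUSTER_MAX, and taking the pcp lock is assumed to always succeed (the trylock failure path in the patch is left out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_BATCH_MAX 32	/* stand-in for SWAP_CLUSTER_MAX */
#define TOY_NO_ZONE (-1)	/* no pcp lock held */

struct toy_page {
	int zone;
	bool isolated;		/* result of the per-page type lookup */
};

struct toy_stats {
	int lookups;		/* migratetype lookups performed */
	int lock_acquires;	/* times the pcp lock was taken */
	int direct_frees;	/* isolated pages freed straight to the allocator */
	int pcp_frees;		/* pages queued on the pcp list */
};

static struct toy_stats toy_free_list(const struct toy_page *pages, size_t n)
{
	struct toy_stats st = { 0, 0, 0, 0 };
	int locked_zone = TOY_NO_ZONE;
	int batch_count = 0;
	size_t i;

	for (i = 0; i < n; i++) {
		int zone = pages[i].zone;
		bool isolated = pages[i].isolated;

		st.lookups++;	/* exactly one lookup per page */

		/* Zone switch, batch complete, or non-pcp freeing? */
		if (zone != locked_zone || batch_count == TOY_BATCH_MAX ||
		    isolated) {
			locked_zone = TOY_NO_ZONE;	/* drop the lock */

			if (isolated) {
				st.direct_frees++;
				continue;
			}

			locked_zone = zone;		/* retake for this zone */
			st.lock_acquires++;
			batch_count = 0;
		}

		st.pcp_frees++;
		batch_count++;
	}
	return st;
}
```

In this model an isolated page costs a lock drop and a reacquire for the page that follows it, which matches the expectation expressed by the unlikely() annotation in the patch that isolated pages are rare on this path.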
Johannes Weiner <[email protected]> writes:
> Ah, this is going through the free_unref_page_list() path that
> Vlastimil had pointed out as well. I made another change on top that
> eliminates the second lookup. After that, both pcp fast paths have the
> same number of lookups as before: 1. This fixes the regression for me.
>
> Would you mind confirming this as well?
I have run more tests on the series and the add-on patches. The test
results are as follows:
base
  perf-profile.children.get_pfnblock_flags_mask      0.15 (+- 32.62%)
  will-it-scale.1.processes                      618621.7 (+- 0.18%)

mm: page_alloc: remove pcppage migratetype caching
  perf-profile.children.get_pfnblock_flags_mask      0.40 (+- 21.55%)
  will-it-scale.1.processes                      616350.3 (+- 0.27%)

mm: page_alloc: fix up block types when merging compatible blocks
  perf-profile.children.get_pfnblock_flags_mask      0.36 (+- 8.36%)
  will-it-scale.1.processes                      617121.0 (+- 0.17%)

mm: page_alloc: move free pages when converting block during isolation
  perf-profile.children.get_pfnblock_flags_mask      0.36 (+- 15.10%)
  will-it-scale.1.processes                      615578.0 (+- 0.18%)

mm: page_alloc: fix move_freepages_block() range error
  perf-profile.children.get_pfnblock_flags_mask      0.36 (+- 12.78%)
  will-it-scale.1.processes                      615364.7 (+- 0.27%)

mm: page_alloc: fix freelist movement during block conversion
  perf-profile.children.get_pfnblock_flags_mask      0.36 (+- 10.52%)
  will-it-scale.1.processes                      617834.8 (+- 0.52%)

mm: page_alloc: consolidate free page accounting
  perf-profile.children.get_pfnblock_flags_mask      0.39 (+- 8.27%)
  will-it-scale.1.processes                      621000.0 (+- 0.13%)

mm: page_alloc: close migratetype race between freeing and stealing
  perf-profile.children.get_pfnblock_flags_mask      0.37 (+- 5.87%)
  will-it-scale.1.processes                      618378.8 (+- 0.17%)

mm: page_alloc: optimize free_unref_page_list()
  perf-profile.children.get_pfnblock_flags_mask      0.20 (+- 14.96%)
  will-it-scale.1.processes                      618136.3 (+- 0.16%)
It seems that the will-it-scale score is influenced by some other
factors too. But anyway, the series + add-on patches restore the
will-it-scale score, and the cycles% of get_pfnblock_flags_mask() is
almost restored by the final patch (mm: page_alloc: optimize
free_unref_page_list()).
Feel free to add my "Tested-by" for these patches.
--
Best Regards,
Huang, Ying
On Sat, Sep 30, 2023 at 12:26:01PM +0800, Huang, Ying wrote:
> Feel free to add my "Tested-by" for these patches.
Thanks, I'll add those!