2022-03-04 20:19:45

by Eric Dumazet

Subject: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

From: Eric Dumazet <[email protected]>

For high order pages not using pcp, rmqueue() is currently calling
the costly check_new_pages() while zone spinlock is held,
and hard irqs masked.

This is not needed, we can release the spinlock sooner to reduce
zone spinlock contention.

Note that after this patch, we call __mod_zone_freepage_state()
before deciding to leak the page because it is in bad state.

v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()

Signed-off-by: Eric Dumazet <[email protected]>
Cc: Mel Gorman <[email protected]>
Cc: Vlastimil Babka <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Shakeel Butt <[email protected]>
Cc: Wei Xu <[email protected]>
Cc: Greg Thelen <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
---
mm/page_alloc.c | 18 +++++++++---------
1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31928f850ebe5a4015ddc40e0469f3..1804287c1b792b8aa0e964b17eb002b6b1115258 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3706,10 +3706,10 @@ struct page *rmqueue(struct zone *preferred_zone,
* allocate greater than order-1 page units with __GFP_NOFAIL.
*/
WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
- spin_lock_irqsave(&zone->lock, flags);

do {
page = NULL;
+ spin_lock_irqsave(&zone->lock, flags);
/*
* order-0 request can reach here when the pcplist is skipped
* due to non-CMA allocation context. HIGHATOMIC area is
@@ -3721,15 +3721,15 @@ struct page *rmqueue(struct zone *preferred_zone,
if (page)
trace_mm_page_alloc_zone_locked(page, order, migratetype);
}
- if (!page)
+ if (!page) {
page = __rmqueue(zone, order, migratetype, alloc_flags);
- } while (page && check_new_pages(page, order));
- if (!page)
- goto failed;
-
- __mod_zone_freepage_state(zone, -(1 << order),
- get_pcppage_migratetype(page));
- spin_unlock_irqrestore(&zone->lock, flags);
+ if (!page)
+ goto failed;
+ }
+ __mod_zone_freepage_state(zone, -(1 << order),
+ get_pcppage_migratetype(page));
+ spin_unlock_irqrestore(&zone->lock, flags);
+ } while (check_new_pages(page, order));

__count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
zone_statistics(preferred_zone, zone, 1);
--
2.35.1.616.g0bdcbb4464-goog
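
For readers following the diff above, this is approximately what the rmqueue()
slow path looks like once the patch is applied. It is a sketch reconstructed
from the hunks, not a verbatim copy of mm/page_alloc.c: the guard around the
HIGHATOMIC fallback and the failed: label are surrounding kernel context
recalled from v5.17 and may not match the exact source.

	do {
		page = NULL;
		spin_lock_irqsave(&zone->lock, flags);

		/* Guard condition recalled from v5.17, not part of the diff. */
		if (order > 0 && alloc_flags & ALLOC_HARDER) {
			page = __rmqueue_smallest(zone, order, MIGRATE_HIGHATOMIC);
			if (page)
				trace_mm_page_alloc_zone_locked(page, order, migratetype);
		}
		if (!page) {
			page = __rmqueue(zone, order, migratetype, alloc_flags);
			if (!page)
				goto failed;
		}
		/* Account the removal before dropping the lock ... */
		__mod_zone_freepage_state(zone, -(1 << order),
					  get_pcppage_migratetype(page));
		spin_unlock_irqrestore(&zone->lock, flags);
		/*
		 * ... and only then run the costly checks, with zone->lock
		 * released. A bad page is leaked and the loop retries,
		 * re-taking the lock.
		 */
	} while (check_new_pages(page, order));

The point is that check_new_pages() now runs outside the zone->lock critical
section, and __mod_zone_freepage_state() is charged even for a page that the
check later rejects, which is what the note in the changelog refers to.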


2022-03-04 20:56:07

by Shakeel Butt

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Fri, Mar 04, 2022 at 09:02:15AM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>

> For high order pages not using pcp, rmqueue() is currently calling
> the costly check_new_pages() while zone spinlock is held,
> and hard irqs masked.

> This is not needed, we can release the spinlock sooner to reduce
> zone spinlock contention.

> Note that after this patch, we call __mod_zone_freepage_state()
> before deciding to leak the page because it is in bad state.

> v2: We need to keep interrupts disabled to call
> __mod_zone_freepage_state()

> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Wei Xu <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>

Reviewed-by: Shakeel Butt <[email protected]>

2022-03-07 03:11:24

by David Rientjes

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Fri, 4 Mar 2022, Eric Dumazet wrote:

> From: Eric Dumazet <[email protected]>
>
> For high order pages not using pcp, rmqueue() is currently calling
> the costly check_new_pages() while zone spinlock is held,
> and hard irqs masked.
>
> This is not needed, we can release the spinlock sooner to reduce
> zone spinlock contention.
>
> Note that after this patch, we call __mod_zone_freepage_state()
> before deciding to leak the page because it is in bad state.
>
> v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()
>
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Wei Xu <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>

Acked-by: David Rientjes <[email protected]>

2022-03-07 09:51:50

by Mel Gorman

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Fri, Mar 04, 2022 at 09:02:15AM -0800, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> For high order pages not using pcp, rmqueue() is currently calling
> the costly check_new_pages() while zone spinlock is held,
> and hard irqs masked.
>
> This is not needed, we can release the spinlock sooner to reduce
> zone spinlock contention.
>
> Note that after this patch, we call __mod_zone_freepage_state()
> before deciding to leak the page because it is in bad state.
>
> v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()
>
> Signed-off-by: Eric Dumazet <[email protected]>
> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Wei Xu <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>

Ok, this is only more expensive in the event pages on the free list have
been corrupted, which is already very unlikely, so thanks!

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2022-03-07 10:11:31

by Vlastimil Babka

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On 3/4/22 18:02, Eric Dumazet wrote:
> From: Eric Dumazet <[email protected]>
>
> For high order pages not using pcp, rmqueue() is currently calling
> the costly check_new_pages() while zone spinlock is held,
> and hard irqs masked.
>
> This is not needed, we can release the spinlock sooner to reduce
> zone spinlock contention.
>
> Note that after this patch, we call __mod_zone_freepage_state()
> before deciding to leak the page because it is in bad state.

Which is arguably an accounting fix on its own, because when we remove a page
from the free list, we should decrease the respective counter(s) even if we
find the page is in a bad state and discard (effectively leak) it.

>
> v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()
>
> Signed-off-by: Eric Dumazet <[email protected]>

Reviewed-by: Vlastimil Babka <[email protected]>

> Cc: Mel Gorman <[email protected]>
> Cc: Vlastimil Babka <[email protected]>
> Cc: Michal Hocko <[email protected]>
> Cc: Shakeel Butt <[email protected]>
> Cc: Wei Xu <[email protected]>
> Cc: Greg Thelen <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>
> ---
> mm/page_alloc.c | 18 +++++++++---------
> 1 file changed, 9 insertions(+), 9 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 3589febc6d31928f850ebe5a4015ddc40e0469f3..1804287c1b792b8aa0e964b17eb002b6b1115258 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3706,10 +3706,10 @@ struct page *rmqueue(struct zone *preferred_zone,
> * allocate greater than order-1 page units with __GFP_NOFAIL.
> */
> WARN_ON_ONCE((gfp_flags & __GFP_NOFAIL) && (order > 1));
> - spin_lock_irqsave(&zone->lock, flags);
>
> do {
> page = NULL;
> + spin_lock_irqsave(&zone->lock, flags);
> /*
> * order-0 request can reach here when the pcplist is skipped
> * due to non-CMA allocation context. HIGHATOMIC area is
> @@ -3721,15 +3721,15 @@ struct page *rmqueue(struct zone *preferred_zone,
> if (page)
> trace_mm_page_alloc_zone_locked(page, order, migratetype);
> }
> - if (!page)
> + if (!page) {
> page = __rmqueue(zone, order, migratetype, alloc_flags);
> - } while (page && check_new_pages(page, order));
> - if (!page)
> - goto failed;
> -
> - __mod_zone_freepage_state(zone, -(1 << order),
> - get_pcppage_migratetype(page));
> - spin_unlock_irqrestore(&zone->lock, flags);
> + if (!page)
> + goto failed;
> + }
> + __mod_zone_freepage_state(zone, -(1 << order),
> + get_pcppage_migratetype(page));
> + spin_unlock_irqrestore(&zone->lock, flags);
> + } while (check_new_pages(page, order));
>
> __count_zid_vm_events(PGALLOC, page_zonenum(page), 1 << order);
> zone_statistics(preferred_zone, zone, 1);

2022-03-09 02:15:37

by Eric Dumazet

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Mon, Mar 7, 2022 at 1:15 AM Mel Gorman <[email protected]> wrote:
>
> On Fri, Mar 04, 2022 at 09:02:15AM -0800, Eric Dumazet wrote:
> > From: Eric Dumazet <[email protected]>
> >
> > For high order pages not using pcp, rmqueue() is currently calling
> > the costly check_new_pages() while zone spinlock is held,
> > and hard irqs masked.
> >
> > This is not needed, we can release the spinlock sooner to reduce
> > zone spinlock contention.
> >
> > Note that after this patch, we call __mod_zone_freepage_state()
> > before deciding to leak the page because it is in bad state.
> >
> > v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()
> >
> > Signed-off-by: Eric Dumazet <[email protected]>
> > Cc: Mel Gorman <[email protected]>
> > Cc: Vlastimil Babka <[email protected]>
> > Cc: Michal Hocko <[email protected]>
> > Cc: Shakeel Butt <[email protected]>
> > Cc: Wei Xu <[email protected]>
> > Cc: Greg Thelen <[email protected]>
> > Cc: Hugh Dickins <[email protected]>
> > Cc: David Rientjes <[email protected]>
>
> Ok, this is only more expensive in the event pages on the free list have
> been corrupted, which is already very unlikely, so thanks!
>
> Acked-by: Mel Gorman <[email protected]>
>

One remaining question:

After your patch ("mm/page_alloc: allow high-order pages to be stored
on the per-cpu lists"), do we want to change check_pcp_refill()/check_new_pcp()
to check all pages, and not only the head?

Or was it a conscious choice of yours?
(I presume part of the performance gains came from not having to bring in
~7 cache lines per 32KB chunk on x86.)

Thanks!

2022-03-09 14:28:57

by Mel Gorman

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Tue, Mar 08, 2022 at 03:49:48PM -0800, Eric Dumazet wrote:
> On Mon, Mar 7, 2022 at 1:15 AM Mel Gorman <[email protected]> wrote:
> >
> > On Fri, Mar 04, 2022 at 09:02:15AM -0800, Eric Dumazet wrote:
> > > From: Eric Dumazet <[email protected]>
> > >
> > > For high order pages not using pcp, rmqueue() is currently calling
> > > the costly check_new_pages() while zone spinlock is held,
> > > and hard irqs masked.
> > >
> > > This is not needed, we can release the spinlock sooner to reduce
> > > zone spinlock contention.
> > >
> > > Note that after this patch, we call __mod_zone_freepage_state()
> > > before deciding to leak the page because it is in bad state.
> > >
> > > v2: We need to keep interrupts disabled to call __mod_zone_freepage_state()
> > >
> > > Signed-off-by: Eric Dumazet <[email protected]>
> > > Cc: Mel Gorman <[email protected]>
> > > Cc: Vlastimil Babka <[email protected]>
> > > Cc: Michal Hocko <[email protected]>
> > > Cc: Shakeel Butt <[email protected]>
> > > Cc: Wei Xu <[email protected]>
> > > Cc: Greg Thelen <[email protected]>
> > > Cc: Hugh Dickins <[email protected]>
> > > Cc: David Rientjes <[email protected]>
> >
> > Ok, this is only more expensive in the event pages on the free list have
> > been corrupted, which is already very unlikely, so thanks!
> >
> > Acked-by: Mel Gorman <[email protected]>
> >
>
> One remaining question is:
>
> After your patch ("mm/page_alloc: allow high-order pages to be stored
> on the per-cpu lists"),
> do we want to change check_pcp_refill()/check_new_pcp() to check all pages,
> and not only the head ?
>

We should because it was an oversight. Thanks for pointing that out.

> Or was it a conscious choice of yours ?
> (I presume part of the performance gains came from
> not having to bring ~7 cache lines per 32KB chunk on x86)
>

There will be a performance penalty due to the check but it's a correctness
vs performance issue.

This? It's boot tested only.

--8<--
mm/page_alloc: check high-order pages for corruption during PCP operations

Eric Dumazet pointed out that commit 44042b449872 ("mm/page_alloc: allow
high-order pages to be stored on the per-cpu lists") only checks the head
page during PCP refill and allocation operations. This was an oversight
and all pages should be checked. This will incur a small performance
penalty but it's necessary for correctness.

Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
Reported-by: Eric Dumazet <[email protected]>
Signed-off-by: Mel Gorman <[email protected]>
---
mm/page_alloc.c | 46 +++++++++++++++++++++++-----------------------
1 file changed, 23 insertions(+), 23 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 3589febc6d31..2920344fa887 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2342,23 +2342,36 @@ static inline int check_new_page(struct page *page)
return 1;
}

+static bool check_new_pages(struct page *page, unsigned int order)
+{
+ int i;
+ for (i = 0; i < (1 << order); i++) {
+ struct page *p = page + i;
+
+ if (unlikely(check_new_page(p)))
+ return true;
+ }
+
+ return false;
+}
+
#ifdef CONFIG_DEBUG_VM
/*
* With DEBUG_VM enabled, order-0 pages are checked for expected state when
* being allocated from pcp lists. With debug_pagealloc also enabled, they are
* also checked when pcp lists are refilled from the free lists.
*/
-static inline bool check_pcp_refill(struct page *page)
+static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
if (debug_pagealloc_enabled_static())
- return check_new_page(page);
+ return check_new_pages(page, order);
else
return false;
}

-static inline bool check_new_pcp(struct page *page)
+static inline bool check_new_pcp(struct page *page, unsigned int order)
{
- return check_new_page(page);
+ return check_new_pages(page, order);
}
#else
/*
@@ -2366,32 +2379,19 @@ static inline bool check_new_pcp(struct page *page)
* when pcp lists are being refilled from the free lists. With debug_pagealloc
* enabled, they are also checked when being allocated from the pcp lists.
*/
-static inline bool check_pcp_refill(struct page *page)
+static inline bool check_pcp_refill(struct page *page, unsigned int order)
{
- return check_new_page(page);
+ return check_new_pages(page, order);
}
-static inline bool check_new_pcp(struct page *page)
+static inline bool check_new_pcp(struct page *page, unsigned int order)
{
if (debug_pagealloc_enabled_static())
- return check_new_page(page);
+ return check_new_pages(page, order);
else
return false;
}
#endif /* CONFIG_DEBUG_VM */

-static bool check_new_pages(struct page *page, unsigned int order)
-{
- int i;
- for (i = 0; i < (1 << order); i++) {
- struct page *p = page + i;
-
- if (unlikely(check_new_page(p)))
- return true;
- }
-
- return false;
-}
-
inline void post_alloc_hook(struct page *page, unsigned int order,
gfp_t gfp_flags)
{
@@ -3037,7 +3037,7 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
if (unlikely(page == NULL))
break;

- if (unlikely(check_pcp_refill(page)))
+ if (unlikely(check_pcp_refill(page, order)))
continue;

/*
@@ -3641,7 +3641,7 @@ struct page *__rmqueue_pcplist(struct zone *zone, unsigned int order,
page = list_first_entry(list, struct page, lru);
list_del(&page->lru);
pcp->count -= 1 << order;
- } while (check_new_pcp(page));
+ } while (check_new_pcp(page, order));

return page;
}

2022-03-09 19:34:54

by Eric Dumazet

Subject: Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held

On Wed, Mar 9, 2022 at 4:32 AM Mel Gorman <[email protected]> wrote:

> We should because it was an oversight. Thanks for pointing that out.
>
> > Or was it a conscious choice of yours ?
> > (I presume part of the performance gains came from
> > not having to bring ~7 cache lines per 32KB chunk on x86)
> >
>
> There will be a performance penalty due to the check but it's a correctness
> vs performance issue.
>
> This? It's boot tested only.
>
> --8<--
> mm/page_alloc: check high-order pages for corruption during PCP operations
>
> Eric Dumazet pointed out that commit 44042b449872 ("mm/page_alloc: allow
> high-order pages to be stored on the per-cpu lists") only checks the head
> page during PCP refill and allocation operations. This was an oversight
> and all pages should be checked. This will incur a small performance
> penalty but it's necessary for correctness.
>
> Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists")
> Reported-by: Eric Dumazet <[email protected]>
> Signed-off-by: Mel Gorman <[email protected]>
> ---

SGTM, thanks Mel !

Acked-by: Eric Dumazet <[email protected]>

2022-03-12 15:52:55

by kernel test robot

Subject: [mm/page_alloc] 8212a964ee: vm-scalability.throughput 30.5% improvement



Greeting,

FYI, we noticed a 30.5% improvement of vm-scalability.throughput due to commit:


commit: 8212a964ee020471104e34dce7029dec33c218a9 ("Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held")
url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Re-PATCH-v2-mm-page_alloc-call-check_new_pages-while-zone-spinlock-is-not-held/20220309-203504
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: vm-scalability
on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
with following parameters:

runtime: 300s
size: 512G
test: anon-w-rand-hugetlb
cpufreq_governor: performance
ucode: 0xd000331

test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/





Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/512G/lkp-icl-2sp5/anon-w-rand-hugetlb/vm-scalability/0xd000331

commit:
v5.17-rc7
8212a964ee ("mm/page_alloc: call check_new_pages() while zone spinlock is not held")

v5.17-rc7 8212a964ee020471104e34dce70
---------------- ---------------------------
%stddev %change %stddev
\ | \
0.00 ± 5% -7.4% 0.00 ± 4% vm-scalability.free_time
47190 ± 2% +25.5% 59208 ± 2% vm-scalability.median
6352467 ± 2% +30.5% 8293110 ± 2% vm-scalability.throughput
218.97 ± 2% -18.7% 177.98 ± 3% vm-scalability.time.elapsed_time
218.97 ± 2% -18.7% 177.98 ± 3% vm-scalability.time.elapsed_time.max
121357 ± 7% -24.9% 91162 ± 10% vm-scalability.time.involuntary_context_switches
11226 -5.2% 10641 vm-scalability.time.percent_of_cpu_this_job_got
2311 ± 3% -35.2% 1496 ± 6% vm-scalability.time.system_time
22275 ± 2% -21.7% 17443 ± 3% vm-scalability.time.user_time
9358 ± 3% -13.1% 8130 vm-scalability.time.voluntary_context_switches
255.23 -16.1% 214.10 ± 2% uptime.boot
2593 +6.8% 2771 ± 5% vmstat.system.cs
11.51 ± 7% +4.5 16.05 ± 8% mpstat.cpu.all.idle%
8.48 ± 2% -1.6 6.84 ± 3% mpstat.cpu.all.sys%
727581 ± 12% -17.2% 602238 ± 6% numa-numastat.node1.local_node
798037 ± 8% -13.3% 691955 ± 6% numa-numastat.node1.numa_hit
5806206 ± 17% +26.7% 7356010 ± 10% turbostat.C1E
9.55 ± 26% +5.9 15.48 ± 9% turbostat.C1E%
59854751 ± 2% -17.8% 49202950 ± 3% turbostat.IRQ
42804 ± 6% -54.9% 19301 ± 21% meminfo.Active
41832 ± 7% -56.2% 18325 ± 23% meminfo.Active(anon)
63386 ± 6% -26.6% 46542 ± 3% meminfo.Mapped
137758 -25.5% 102591 ± 3% meminfo.Shmem
36980 ± 5% -62.6% 13823 ± 29% numa-meminfo.node1.Active
36495 ± 5% -63.9% 13173 ± 30% numa-meminfo.node1.Active(anon)
19454 ± 26% -57.7% 8233 ± 33% numa-meminfo.node1.Mapped
65896 ± 38% -67.8% 21189 ± 13% numa-meminfo.node1.Shmem
9185 ± 6% -64.7% 3246 ± 31% numa-vmstat.node1.nr_active_anon
4769 ± 26% -54.5% 2171 ± 32% numa-vmstat.node1.nr_mapped
16462 ± 37% -68.1% 5258 ± 14% numa-vmstat.node1.nr_shmem
9185 ± 6% -64.7% 3246 ± 31% numa-vmstat.node1.nr_zone_active_anon
10436 ± 5% -56.2% 4570 ± 23% proc-vmstat.nr_active_anon
69290 +1.3% 70203 proc-vmstat.nr_anon_pages
1717695 +4.5% 1794462 proc-vmstat.nr_dirty_background_threshold
3439592 +4.5% 3593312 proc-vmstat.nr_dirty_threshold
640952 -1.4% 632171 proc-vmstat.nr_file_pages
17356030 +4.4% 18125242 proc-vmstat.nr_free_pages
93258 -2.4% 91059 proc-vmstat.nr_inactive_anon
16187 ± 5% -26.4% 11911 ± 2% proc-vmstat.nr_mapped
34477 ± 2% -25.6% 25663 ± 4% proc-vmstat.nr_shmem
10436 ± 5% -56.2% 4570 ± 23% proc-vmstat.nr_zone_active_anon
93258 -2.4% 91059 proc-vmstat.nr_zone_inactive_anon
32151 ± 16% -61.0% 12542 ± 13% proc-vmstat.numa_hint_faults
21214 ± 22% -86.0% 2964 ± 45% proc-vmstat.numa_hint_faults_local
1598135 -10.9% 1423466 proc-vmstat.numa_hit
1481881 -11.8% 1307551 proc-vmstat.numa_local
117279 -1.2% 115916 proc-vmstat.numa_other
555445 ± 16% -53.2% 260178 ± 53% proc-vmstat.numa_pte_updates
93889 ± 4% -74.3% 24113 ± 7% proc-vmstat.pgactivate
1599893 -11.0% 1424527 proc-vmstat.pgalloc_normal
1594626 -14.2% 1368920 proc-vmstat.pgfault
1609987 -20.8% 1275284 proc-vmstat.pgfree
49893 -14.8% 42496 ± 5% proc-vmstat.pgreuse
15.23 ± 2% -7.8% 14.04 perf-stat.i.MPKI
1.348e+10 +22.0% 1.645e+10 ± 3% perf-stat.i.branch-instructions
6.957e+08 ± 2% +22.4% 8.517e+08 ± 3% perf-stat.i.cache-misses
7.117e+08 ± 2% +22.4% 8.71e+08 ± 3% perf-stat.i.cache-references
7.86 ± 2% -29.0% 5.58 ± 6% perf-stat.i.cpi
3.739e+11 -5.1% 3.549e+11 perf-stat.i.cpu-cycles
550.18 ± 3% -22.2% 427.87 ± 5% perf-stat.i.cycles-between-cache-misses
1.605e+10 +22.1% 1.959e+10 ± 3% perf-stat.i.dTLB-loads
0.02 ± 3% -0.0 0.01 ± 4% perf-stat.i.dTLB-store-miss-rate%
921125 ± 2% -4.6% 878569 perf-stat.i.dTLB-store-misses
5.803e+09 +22.0% 7.078e+09 ± 3% perf-stat.i.dTLB-stores
5.665e+10 +22.0% 6.911e+10 ± 3% perf-stat.i.instructions
0.16 ± 3% +26.1% 0.20 ± 3% perf-stat.i.ipc
2.92 -5.1% 2.77 perf-stat.i.metric.GHz
123.32 ± 16% +158.4% 318.61 ± 22% perf-stat.i.metric.K/sec
286.92 +21.8% 349.59 ± 3% perf-stat.i.metric.M/sec
6641 +4.8% 6957 ± 2% perf-stat.i.minor-faults
586608 ± 12% +36.4% 800024 ± 7% perf-stat.i.node-loads
26.79 ± 4% -10.5 16.31 ± 12% perf-stat.i.node-store-miss-rate%
1.785e+08 ± 2% -27.7% 1.291e+08 ± 7% perf-stat.i.node-store-misses
5.131e+08 ± 3% +39.8% 7.172e+08 ± 5% perf-stat.i.node-stores
6643 +4.8% 6959 ± 2% perf-stat.i.page-faults
0.02 ± 18% -0.0 0.01 ± 4% perf-stat.overall.branch-miss-rate%
6.66 ± 2% -22.5% 5.16 ± 3% perf-stat.overall.cpi
539.35 ± 2% -22.7% 416.69 ± 3% perf-stat.overall.cycles-between-cache-misses
0.02 ± 3% -0.0 0.01 ± 3% perf-stat.overall.dTLB-store-miss-rate%
0.15 ± 2% +29.1% 0.19 ± 3% perf-stat.overall.ipc
25.88 ± 4% -10.6 15.28 ± 10% perf-stat.overall.node-store-miss-rate%
1.325e+10 ± 2% +22.3% 1.622e+10 ± 3% perf-stat.ps.branch-instructions
6.88e+08 ± 2% +22.7% 8.444e+08 ± 3% perf-stat.ps.cache-misses
7.043e+08 ± 2% +22.7% 8.638e+08 ± 3% perf-stat.ps.cache-references
3.708e+11 -5.2% 3.515e+11 perf-stat.ps.cpu-cycles
1.577e+10 ± 2% +22.4% 1.931e+10 ± 3% perf-stat.ps.dTLB-loads
910623 ± 2% -4.6% 868700 perf-stat.ps.dTLB-store-misses
5.701e+09 ± 2% +22.3% 6.975e+09 ± 3% perf-stat.ps.dTLB-stores
5.569e+10 ± 2% +22.3% 6.813e+10 ± 3% perf-stat.ps.instructions
6716 +4.8% 7038 perf-stat.ps.minor-faults
595302 ± 11% +37.2% 816710 ± 8% perf-stat.ps.node-loads
1.769e+08 ± 2% -27.8% 1.277e+08 ± 7% perf-stat.ps.node-store-misses
5.071e+08 ± 3% +40.3% 7.113e+08 ± 5% perf-stat.ps.node-stores
6717 +4.8% 7039 perf-stat.ps.page-faults
0.00 +0.8 0.80 ± 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.get_page_from_freelist.__alloc_pages
0.00 +0.8 0.80 ± 8% perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page
0.00 +0.8 0.83 ± 8% perf-profile.calltrace.cycles-pp.rmqueue_bulk.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page
0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory
0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages
0.00 +0.9 0.85 ± 8% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.__mmap
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlbfs_file_mmap.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region.do_mmap.vm_mmap_pgoff
0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region.do_mmap
60.28 ± 5% +4.7 64.98 ± 2% perf-profile.calltrace.cycles-pp.do_rw_once
0.09 ± 8% +0.0 0.11 ± 9% perf-profile.children.cycles-pp.task_tick_fair
0.14 ± 7% +0.0 0.17 ± 5% perf-profile.children.cycles-pp.scheduler_tick
0.20 ± 9% +0.0 0.24 ± 3% perf-profile.children.cycles-pp.tick_sched_timer
0.19 ± 9% +0.0 0.24 ± 4% perf-profile.children.cycles-pp.tick_sched_handle
0.19 ± 9% +0.0 0.23 ± 4% perf-profile.children.cycles-pp.update_process_times
0.24 ± 8% +0.0 0.29 ± 3% perf-profile.children.cycles-pp.__hrtimer_run_queues
0.40 ± 8% +0.1 0.45 ± 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
0.39 ± 7% +0.1 0.45 ± 3% perf-profile.children.cycles-pp.hrtimer_interrupt
0.26 ± 71% +0.6 0.86 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.__mmap
0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.ksys_mmap_pgoff
0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlbfs_file_mmap
0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlb_reserve_pages
0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlb_acct_memory
0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.alloc_surplus_huge_page
0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.vm_mmap_pgoff
0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.do_mmap
0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.mmap_region
0.55 ± 44% +0.6 1.16 ± 9% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
0.55 ± 44% +0.6 1.16 ± 9% perf-profile.children.cycles-pp.do_syscall_64
0.12 ± 71% +0.7 0.85 ± 8% perf-profile.children.cycles-pp.alloc_fresh_huge_page
0.03 ± 70% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.alloc_buddy_huge_page
0.04 ± 71% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.get_page_from_freelist
0.04 ± 71% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.__alloc_pages
0.00 +0.8 0.82 ± 8% perf-profile.children.cycles-pp._raw_spin_lock
0.00 +0.8 0.83 ± 8% perf-profile.children.cycles-pp.rmqueue_bulk
0.26 ± 71% +0.6 0.86 ± 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


---
0-DAY CI Kernel Test Service
https://lists.01.org/hyperkitty/list/[email protected]

Thanks,
Oliver Sang



2022-03-13 10:55:25

by Vlastimil Babka

Subject: Re: [mm/page_alloc] 8212a964ee: vm-scalability.throughput 30.5% improvement

On 3/12/22 16:43, kernel test robot wrote:
>
>
> Greeting,
>
> FYI, we noticed a 30.5% improvement of vm-scalability.throughput due to commit:
>
>
> commit: 8212a964ee020471104e34dce7029dec33c218a9 ("Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held")
> url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Re-PATCH-v2-mm-page_alloc-call-check_new_pages-while-zone-spinlock-is-not-held/20220309-203504
> patch link: https://lore.kernel.org/lkml/[email protected]

Heh, that's weird. I would expect some improvement from Eric's patch,
but per the github url above this is actually Mel's "mm/page_alloc: check
high-order pages for corruption during PCP operations" applied directly
on 5.17-rc7. That patch was rather expected to make performance worse, if
anything, so maybe the improvement is due to some unexpected side effect of
different inlining decisions or cache alignment...

> in testcase: vm-scalability
> on test machine: 128 threads 2 sockets Intel(R) Xeon(R) Platinum 8358 CPU @ 2.60GHz with 128G memory
> with following parameters:
>
> runtime: 300s
> size: 512G
> test: anon-w-rand-hugetlb
> cpufreq_governor: performance
> ucode: 0xd000331
>
> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>
>
>
>
>
> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> sudo bin/lkp install job.yaml # job file is attached in this email
> bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
> sudo bin/lkp run generated-yaml-file
>
> # if come across any failure that blocks the test,
> # please remove ~/.lkp and /lkp dir to run from a clean state.
>
> =========================================================================================
> compiler/cpufreq_governor/kconfig/rootfs/runtime/size/tbox_group/test/testcase/ucode:
> gcc-9/performance/x86_64-rhel-8.3/debian-10.4-x86_64-20200603.cgz/300s/512G/lkp-icl-2sp5/anon-w-rand-hugetlb/vm-scalability/0xd000331
>
> commit:
> v5.17-rc7
> 8212a964ee ("mm/page_alloc: call check_new_pages() while zone spinlock is not held")
>
> v5.17-rc7 8212a964ee020471104e34dce70
> ---------------- ---------------------------
> %stddev %change %stddev
> \ | \
> 0.00 ± 5% -7.4% 0.00 ± 4% vm-scalability.free_time
> 47190 ± 2% +25.5% 59208 ± 2% vm-scalability.median
> 6352467 ± 2% +30.5% 8293110 ± 2% vm-scalability.throughput
> 218.97 ± 2% -18.7% 177.98 ± 3% vm-scalability.time.elapsed_time
> 218.97 ± 2% -18.7% 177.98 ± 3% vm-scalability.time.elapsed_time.max
> 121357 ± 7% -24.9% 91162 ± 10% vm-scalability.time.involuntary_context_switches
> 11226 -5.2% 10641 vm-scalability.time.percent_of_cpu_this_job_got
> 2311 ± 3% -35.2% 1496 ± 6% vm-scalability.time.system_time
> 22275 ± 2% -21.7% 17443 ± 3% vm-scalability.time.user_time
> 9358 ± 3% -13.1% 8130 vm-scalability.time.voluntary_context_switches
> 255.23 -16.1% 214.10 ± 2% uptime.boot
> 2593 +6.8% 2771 ± 5% vmstat.system.cs
> 11.51 ± 7% +4.5 16.05 ± 8% mpstat.cpu.all.idle%
> 8.48 ± 2% -1.6 6.84 ± 3% mpstat.cpu.all.sys%
> 727581 ± 12% -17.2% 602238 ± 6% numa-numastat.node1.local_node
> 798037 ± 8% -13.3% 691955 ± 6% numa-numastat.node1.numa_hit
> 5806206 ± 17% +26.7% 7356010 ± 10% turbostat.C1E
> 9.55 ± 26% +5.9 15.48 ± 9% turbostat.C1E%
> 59854751 ± 2% -17.8% 49202950 ± 3% turbostat.IRQ
> 42804 ± 6% -54.9% 19301 ± 21% meminfo.Active
> 41832 ± 7% -56.2% 18325 ± 23% meminfo.Active(anon)
> 63386 ± 6% -26.6% 46542 ± 3% meminfo.Mapped
> 137758 -25.5% 102591 ± 3% meminfo.Shmem
> 36980 ± 5% -62.6% 13823 ± 29% numa-meminfo.node1.Active
> 36495 ± 5% -63.9% 13173 ± 30% numa-meminfo.node1.Active(anon)
> 19454 ± 26% -57.7% 8233 ± 33% numa-meminfo.node1.Mapped
> 65896 ± 38% -67.8% 21189 ± 13% numa-meminfo.node1.Shmem
> 9185 ± 6% -64.7% 3246 ± 31% numa-vmstat.node1.nr_active_anon
> 4769 ± 26% -54.5% 2171 ± 32% numa-vmstat.node1.nr_mapped
> 16462 ± 37% -68.1% 5258 ± 14% numa-vmstat.node1.nr_shmem
> 9185 ± 6% -64.7% 3246 ± 31% numa-vmstat.node1.nr_zone_active_anon
> 10436 ± 5% -56.2% 4570 ± 23% proc-vmstat.nr_active_anon
> 69290 +1.3% 70203 proc-vmstat.nr_anon_pages
> 1717695 +4.5% 1794462 proc-vmstat.nr_dirty_background_threshold
> 3439592 +4.5% 3593312 proc-vmstat.nr_dirty_threshold
> 640952 -1.4% 632171 proc-vmstat.nr_file_pages
> 17356030 +4.4% 18125242 proc-vmstat.nr_free_pages
> 93258 -2.4% 91059 proc-vmstat.nr_inactive_anon
> 16187 ± 5% -26.4% 11911 ± 2% proc-vmstat.nr_mapped
> 34477 ± 2% -25.6% 25663 ± 4% proc-vmstat.nr_shmem
> 10436 ± 5% -56.2% 4570 ± 23% proc-vmstat.nr_zone_active_anon
> 93258 -2.4% 91059 proc-vmstat.nr_zone_inactive_anon
> 32151 ± 16% -61.0% 12542 ± 13% proc-vmstat.numa_hint_faults
> 21214 ± 22% -86.0% 2964 ± 45% proc-vmstat.numa_hint_faults_local
> 1598135 -10.9% 1423466 proc-vmstat.numa_hit
> 1481881 -11.8% 1307551 proc-vmstat.numa_local
> 117279 -1.2% 115916 proc-vmstat.numa_other
> 555445 ± 16% -53.2% 260178 ± 53% proc-vmstat.numa_pte_updates
> 93889 ± 4% -74.3% 24113 ± 7% proc-vmstat.pgactivate
> 1599893 -11.0% 1424527 proc-vmstat.pgalloc_normal
> 1594626 -14.2% 1368920 proc-vmstat.pgfault
> 1609987 -20.8% 1275284 proc-vmstat.pgfree
> 49893 -14.8% 42496 ± 5% proc-vmstat.pgreuse
> 15.23 ± 2% -7.8% 14.04 perf-stat.i.MPKI
> 1.348e+10 +22.0% 1.645e+10 ± 3% perf-stat.i.branch-instructions
> 6.957e+08 ± 2% +22.4% 8.517e+08 ± 3% perf-stat.i.cache-misses
> 7.117e+08 ± 2% +22.4% 8.71e+08 ± 3% perf-stat.i.cache-references
> 7.86 ± 2% -29.0% 5.58 ± 6% perf-stat.i.cpi
> 3.739e+11 -5.1% 3.549e+11 perf-stat.i.cpu-cycles
> 550.18 ± 3% -22.2% 427.87 ± 5% perf-stat.i.cycles-between-cache-misses
> 1.605e+10 +22.1% 1.959e+10 ± 3% perf-stat.i.dTLB-loads
> 0.02 ± 3% -0.0 0.01 ± 4% perf-stat.i.dTLB-store-miss-rate%
> 921125 ± 2% -4.6% 878569 perf-stat.i.dTLB-store-misses
> 5.803e+09 +22.0% 7.078e+09 ± 3% perf-stat.i.dTLB-stores
> 5.665e+10 +22.0% 6.911e+10 ± 3% perf-stat.i.instructions
> 0.16 ± 3% +26.1% 0.20 ± 3% perf-stat.i.ipc
> 2.92 -5.1% 2.77 perf-stat.i.metric.GHz
> 123.32 ± 16% +158.4% 318.61 ± 22% perf-stat.i.metric.K/sec
> 286.92 +21.8% 349.59 ± 3% perf-stat.i.metric.M/sec
> 6641 +4.8% 6957 ± 2% perf-stat.i.minor-faults
> 586608 ± 12% +36.4% 800024 ± 7% perf-stat.i.node-loads
> 26.79 ± 4% -10.5 16.31 ± 12% perf-stat.i.node-store-miss-rate%
> 1.785e+08 ± 2% -27.7% 1.291e+08 ± 7% perf-stat.i.node-store-misses
> 5.131e+08 ± 3% +39.8% 7.172e+08 ± 5% perf-stat.i.node-stores
> 6643 +4.8% 6959 ± 2% perf-stat.i.page-faults
> 0.02 ± 18% -0.0 0.01 ± 4% perf-stat.overall.branch-miss-rate%
> 6.66 ± 2% -22.5% 5.16 ± 3% perf-stat.overall.cpi
> 539.35 ± 2% -22.7% 416.69 ± 3% perf-stat.overall.cycles-between-cache-misses
> 0.02 ± 3% -0.0 0.01 ± 3% perf-stat.overall.dTLB-store-miss-rate%
> 0.15 ± 2% +29.1% 0.19 ± 3% perf-stat.overall.ipc
> 25.88 ± 4% -10.6 15.28 ± 10% perf-stat.overall.node-store-miss-rate%
> 1.325e+10 ± 2% +22.3% 1.622e+10 ± 3% perf-stat.ps.branch-instructions
> 6.88e+08 ± 2% +22.7% 8.444e+08 ± 3% perf-stat.ps.cache-misses
> 7.043e+08 ± 2% +22.7% 8.638e+08 ± 3% perf-stat.ps.cache-references
> 3.708e+11 -5.2% 3.515e+11 perf-stat.ps.cpu-cycles
> 1.577e+10 ± 2% +22.4% 1.931e+10 ± 3% perf-stat.ps.dTLB-loads
> 910623 ± 2% -4.6% 868700 perf-stat.ps.dTLB-store-misses
> 5.701e+09 ± 2% +22.3% 6.975e+09 ± 3% perf-stat.ps.dTLB-stores
> 5.569e+10 ± 2% +22.3% 6.813e+10 ± 3% perf-stat.ps.instructions
> 6716 +4.8% 7038 perf-stat.ps.minor-faults
> 595302 ± 11% +37.2% 816710 ± 8% perf-stat.ps.node-loads
> 1.769e+08 ± 2% -27.8% 1.277e+08 ± 7% perf-stat.ps.node-store-misses
> 5.071e+08 ± 3% +40.3% 7.113e+08 ± 5% perf-stat.ps.node-stores
> 6717 +4.8% 7039 perf-stat.ps.page-faults
> 0.00 +0.8 0.80 ± 8% perf-profile.calltrace.cycles-pp.native_queued_spin_lock_slowpath._raw_spin_lock.rmqueue_bulk.get_page_from_freelist.__alloc_pages
> 0.00 +0.8 0.80 ± 8% perf-profile.calltrace.cycles-pp._raw_spin_lock.rmqueue_bulk.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page
> 0.00 +0.8 0.83 ± 8% perf-profile.calltrace.cycles-pp.rmqueue_bulk.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page
> 0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory
> 0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page
> 0.00 +0.8 0.84 ± 8% perf-profile.calltrace.cycles-pp.alloc_buddy_huge_page.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages
> 0.00 +0.9 0.85 ± 8% perf-profile.calltrace.cycles-pp.alloc_fresh_huge_page.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.alloc_surplus_huge_page.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe.__mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff.do_syscall_64
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.ksys_mmap_pgoff.do_syscall_64.entry_SYSCALL_64_after_hwframe.__mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.__mmap
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlbfs_file_mmap.mmap_region.do_mmap.vm_mmap_pgoff.ksys_mmap_pgoff
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region.do_mmap.vm_mmap_pgoff
> 0.00 +0.9 0.88 ± 8% perf-profile.calltrace.cycles-pp.hugetlb_acct_memory.hugetlb_reserve_pages.hugetlbfs_file_mmap.mmap_region.do_mmap
> 60.28 ± 5% +4.7 64.98 ± 2% perf-profile.calltrace.cycles-pp.do_rw_once
> 0.09 ± 8% +0.0 0.11 ± 9% perf-profile.children.cycles-pp.task_tick_fair
> 0.14 ± 7% +0.0 0.17 ± 5% perf-profile.children.cycles-pp.scheduler_tick
> 0.20 ± 9% +0.0 0.24 ± 3% perf-profile.children.cycles-pp.tick_sched_timer
> 0.19 ± 9% +0.0 0.24 ± 4% perf-profile.children.cycles-pp.tick_sched_handle
> 0.19 ± 9% +0.0 0.23 ± 4% perf-profile.children.cycles-pp.update_process_times
> 0.24 ± 8% +0.0 0.29 ± 3% perf-profile.children.cycles-pp.__hrtimer_run_queues
> 0.40 ± 8% +0.1 0.45 ± 3% perf-profile.children.cycles-pp.__sysvec_apic_timer_interrupt
> 0.39 ± 7% +0.1 0.45 ± 3% perf-profile.children.cycles-pp.hrtimer_interrupt
> 0.26 ± 71% +0.6 0.86 ± 8% perf-profile.children.cycles-pp.native_queued_spin_lock_slowpath
> 0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.__mmap
> 0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.ksys_mmap_pgoff
> 0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlbfs_file_mmap
> 0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlb_reserve_pages
> 0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.hugetlb_acct_memory
> 0.27 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.alloc_surplus_huge_page
> 0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.vm_mmap_pgoff
> 0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.do_mmap
> 0.28 ± 71% +0.6 0.88 ± 8% perf-profile.children.cycles-pp.mmap_region
> 0.55 ± 44% +0.6 1.16 ± 9% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
> 0.55 ± 44% +0.6 1.16 ± 9% perf-profile.children.cycles-pp.do_syscall_64
> 0.12 ± 71% +0.7 0.85 ± 8% perf-profile.children.cycles-pp.alloc_fresh_huge_page
> 0.03 ± 70% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.alloc_buddy_huge_page
> 0.04 ± 71% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.get_page_from_freelist
> 0.04 ± 71% +0.8 0.84 ± 8% perf-profile.children.cycles-pp.__alloc_pages
> 0.00 +0.8 0.82 ± 8% perf-profile.children.cycles-pp._raw_spin_lock
> 0.00 +0.8 0.83 ± 8% perf-profile.children.cycles-pp.rmqueue_bulk
> 0.26 ± 71% +0.6 0.86 ± 8% perf-profile.self.cycles-pp.native_queued_spin_lock_slowpath
>
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> ---
> 0-DAY CI Kernel Test Service
> https://lists.01.org/hyperkitty/list/[email protected]
>
> Thanks,
> Oliver Sang
>

2022-03-14 12:35:57

by Eric Dumazet

Subject: Re: [mm/page_alloc] 8212a964ee: vm-scalability.throughput 30.5% improvement

On Sat, Mar 12, 2022 at 10:59 AM Vlastimil Babka <[email protected]> wrote:
>
> On 3/12/22 16:43, kernel test robot wrote:
> >
> >
> > Greeting,
> >
> > FYI, we noticed a 30.5% improvement of vm-scalability.throughput due to commit:
> >
> >
> > commit: 8212a964ee020471104e34dce7029dec33c218a9 ("Re: [PATCH v2] mm/page_alloc: call check_new_pages() while zone spinlock is not held")
> > url: https://github.com/0day-ci/linux/commits/Mel-Gorman/Re-PATCH-v2-mm-page_alloc-call-check_new_pages-while-zone-spinlock-is-not-held/20220309-203504
> > patch link: https://lore.kernel.org/lkml/[email protected]
>
> Heh, that's weird. I would expect some improvement from Eric's patch,
> but this seems to be actually about Mel's "mm/page_alloc: check
> high-order pages for corruption during PCP operations" applied directly
> on 5.17-rc7 per the github url above. This was rather expected to make
> performance worse if anything, so maybe the improvement is due to some
> unexpected side-effect of different inlining decisions or cache alignment...
>

I doubt this has anything to do with inlining or cache alignment.

I am not familiar with the benchmark, but its name
(anon-w-rand-hugetlb) hints at hugetlb?

After Mel's fix, we go over 512 'struct page' structures to perform the sanity
checks, thus loading the corresponding 512 cache lines into the CPU caches.

This caching is done while no lock is held.

If, after this huge page allocation, some mm operation needs to access
these 512 struct pages while holding a lock, then sure, there is a huge gain.
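
To make the cache-line arithmetic in this thread concrete, here is a quick
back-of-the-envelope check. The 4KB base page, 64-byte struct page and 64-byte
cache line sizes are assumptions (typical for x86_64), not figures taken from
the report above; the program is a standalone illustration, not kernel code.

	#include <stdio.h>

	int main(void)
	{
		/* Assumed sizes, typical for x86_64. */
		const unsigned long struct_page_sz = 64;	/* bytes per struct page */
		const unsigned long cache_line_sz  = 64;	/* bytes per cache line  */

		/* order-9, i.e. a 2MB hugepage made of 512 4KB base pages */
		unsigned long pages = 1UL << 9;
		unsigned long lines = pages * struct_page_sz / cache_line_sz;
		printf("2MB hugepage: %lu struct pages -> %lu cache lines of metadata\n",
		       pages, lines);			/* 512 -> 512 */

		/* order-3, i.e. the 32KB chunk from the "~7 cache lines" estimate */
		pages = 1UL << 3;
		lines = pages * struct_page_sz / cache_line_sz;
		printf("32KB chunk: %lu struct pages -> %lu cache lines (~%lu beyond the head)\n",
		       pages, lines, lines - 1);	/* 8 -> 8, so ~7 extra */

		return 0;
	}

So checking every tail page of a 2MB hugetlb allocation touches roughly 32KB of
struct page metadata; the argument above is that this metadata is now warmed in
the CPU caches while no lock is held, which can speed up later operations that
access the same struct pages under a lock.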