2018-03-01 06:30:02

by Aaron Lu

Subject: [PATCH v4 0/3] mm: improve zone->lock scalability

Patch 1/3 is a small cleanup suggested by Matthew Wilcox that doesn't
affect scalability or performance;
Patch 2/3 moves some code in free_pcppages_bulk() outside of zone->lock
and has Mel Gorman's ack;
Patch 3/3 uses prefetch in free_pcppages_bulk() outside of zone->lock to
speed up page merging under zone->lock, though Mel Gorman has some concerns.

For details, please see their changelogs.

v4:
Address David Rientjes' comments to not update pcp->count in front of
free_pcppages_bulk() in patch 1/3.
Reword code comments in patch 2/3 as suggested by David Rientjes.

v3:
Added patch 1/3 to update pcp->count inside of free_pcppages_bulk();
Rebase to v4.16-rc2.

v2 & v1:
https://lkml.org/lkml/2018/1/23/879

Aaron Lu (3):
mm/free_pcppages_bulk: update pcp->count inside
mm/free_pcppages_bulk: do not hold lock when picking pages to free
mm/free_pcppages_bulk: prefetch buddy while not holding lock

mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++--------------------
1 file changed, 40 insertions(+), 22 deletions(-)

--
2.14.3



2018-03-01 06:29:05

by Aaron Lu

Subject: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy will be checked
to see if a merge is possible. This requires accessing the buddy's page
structure, and that access can take a long time if it is cache cold.

This patch adds a prefetch of the to-be-freed page's buddy outside of
zone->lock, in the hope that accessing the buddy's page structure later
under zone->lock will be faster. Since we *always* do buddy merging and
check an order-0 page's buddy in order to merge it when it goes into the
main allocator, the cacheline will always come in, i.e. the prefetched
data will never be unused.

In the meantime, there are two concerns:
1 the prefetch could potentially evict existing cachelines, especially
  from the L1D cache since it is not huge;
2 there is some additional instruction overhead, namely calculating
  the buddy pfn twice.

For 1, it's hard to say in general; this microbenchmark shows a good
result, but the actual benefit of this patch will be workload/CPU
dependent.
For 2, since the calculation is a XOR on two local variables, it is
expected that in many cases the cycles spent will be offset by the
reduced memory latency later. This is especially true for NUMA machines
where multiple CPUs are contending on zone->lock and the most time
consuming part under zone->lock is waiting for the 'struct page'
cachelines of the to-be-freed pages and their buddies.
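
To illustrate how cheap that calculation is: __find_buddy_pfn() boils
down to a single XOR. A sketch of the helper, for reference only (the
real one lives in mm/internal.h, IIRC):

    static inline unsigned long __find_buddy_pfn(unsigned long page_pfn,
                                                 unsigned int order)
    {
            /* for order-0 this just flips bit 0 of the pfn */
            return page_pfn ^ (1 << order);
    }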

Test with will-it-scale/page_fault1 full load:

kernel        Broadwell(2S)    Skylake(2S)      Broadwell(4S)    Skylake(4S)
v4.16-rc2+    9034215          7971818          13667135         15677465
patch2/3      9536374 +5.6%    8314710 +4.3%    14070408 +3.0%   16675866 +6.4%
this patch    10338868 +8.4%   8544477 +2.8%    14839808 +5.5%   17155464 +2.9%
Note: this patch's performance improvement percent is against patch2/3.

[changelog stolen from Dave Hansen's and Mel Gorman's comments]
https://lkml.org/lkml/2018/1/24/551
Suggested-by: Ying Huang <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 15 +++++++++++++++
1 file changed, 15 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dafdcdec9c1f..1d838041931e 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free = count;

do {
+ unsigned long pfn, buddy_pfn;
+ struct page *buddy;
+
page = list_last_entry(list, struct page, lru);
/* must delete to avoid corrupting pcp list */
list_del(&page->lru);
@@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
continue;

list_add_tail(&page->lru, &head);
+
+ /*
+ * We are going to put the page back to the global
+ * pool, prefetch its buddy to speed up later access
+ * under zone->lock. It is believed the overhead of
+ * calculating buddy_pfn here can be offset by reduced
+ * memory latency later.
+ */
+ pfn = page_to_pfn(page);
+ buddy_pfn = __find_buddy_pfn(pfn, 0);
+ buddy = page + (buddy_pfn - pfn);
+ prefetch(buddy);
} while (--count && --batch_free && !list_empty(list));
}

--
2.14.3


2018-03-01 06:29:18

by Aaron Lu

Subject: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

When freeing a batch of pages from the Per-CPU-Pages (PCP) lists back to
buddy, zone->lock is held while pages are chosen from the PCP's
migratetype lists. There is actually no need to do this 'choose' part
under the lock: these are PCP pages, the only CPU that can touch them is
us, and irqs are also disabled.

Moving this part outside the lock reduces the lock hold time and
improves performance. Test with will-it-scale/page_fault1 at full load:

kernel        Broadwell(2S)    Skylake(2S)      Broadwell(4S)    Skylake(4S)
v4.16-rc2+    9034215          7971818          13667135         15677465
this patch    9536374 +5.6%    8314710 +4.3%    14070408 +3.0%   16675866 +6.4%

What the test does is: start $nr_cpu processes, each of which will
repeatedly do the following for 5 minutes:
1 mmap 128M of anonymous space;
2 write access to that space;
3 munmap.
The score is the aggregated iteration (a minimal sketch of this loop
follows the test URL below).

https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
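
A minimal sketch of the per-process loop described above (simplified,
names made up for illustration; see the URL above for the real test):

    #include <assert.h>
    #include <string.h>
    #include <sys/mman.h>

    #define MEMSIZE (128UL * 1024 * 1024)

    /* one iteration: map 128M anonymous memory, write-fault it in, unmap */
    static void one_iteration(void)
    {
            char *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            assert(p != MAP_FAILED);
            memset(p, 1, MEMSIZE);  /* the writes trigger the page faults */
            munmap(p, MEMSIZE);
    }

The score is the total number of such iterations completed by all
processes within the test duration.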

Acked-by: Mel Gorman <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
1 file changed, 23 insertions(+), 16 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index faa33eac1635..dafdcdec9c1f 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
int migratetype = 0;
int batch_free = 0;
bool isolated_pageblocks;
-
- spin_lock(&zone->lock);
- isolated_pageblocks = has_isolate_pageblock(zone);
+ struct page *page, *tmp;
+ LIST_HEAD(head);

while (count) {
- struct page *page;
struct list_head *list;

/*
@@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free = count;

do {
- int mt; /* migratetype of the to-be-freed page */
-
page = list_last_entry(list, struct page, lru);
- /* must delete as __free_one_page list manipulates */
+ /* must delete to avoid corrupting pcp list */
list_del(&page->lru);
pcp->count--;

- mt = get_pcppage_migratetype(page);
- /* MIGRATE_ISOLATE page should not go to pcplists */
- VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
- /* Pageblock could have been isolated meanwhile */
- if (unlikely(isolated_pageblocks))
- mt = get_pageblock_migratetype(page);
-
if (bulkfree_pcp_prepare(page))
continue;

- __free_one_page(page, page_to_pfn(page), zone, 0, mt);
- trace_mm_page_pcpu_drain(page, 0, mt);
+ list_add_tail(&page->lru, &head);
} while (--count && --batch_free && !list_empty(list));
}
+
+ spin_lock(&zone->lock);
+ isolated_pageblocks = has_isolate_pageblock(zone);
+
+ /*
+ * Use safe version since after __free_one_page(),
+ * page->lru.next will not point to original list.
+ */
+ list_for_each_entry_safe(page, tmp, &head, lru) {
+ int mt = get_pcppage_migratetype(page);
+ /* MIGRATE_ISOLATE page should not go to pcplists */
+ VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
+ /* Pageblock could have been isolated meanwhile */
+ if (unlikely(isolated_pageblocks))
+ mt = get_pageblock_migratetype(page);
+
+ __free_one_page(page, page_to_pfn(page), zone, 0, mt);
+ trace_mm_page_pcpu_drain(page, 0, mt);
+ }
spin_unlock(&zone->lock);
}

--
2.14.3


2018-03-01 06:29:36

by Aaron Lu

Subject: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside

Matthew Wilcox found that all callers of free_pcppages_bulk() currently
update pcp->count immediately after calling it, so it's natural to do
the update inside free_pcppages_bulk() itself.

No functionality or performance change is expected from this patch.

Suggested-by: Matthew Wilcox <[email protected]>
Signed-off-by: Aaron Lu <[email protected]>
---
mm/page_alloc.c | 10 +++-------
1 file changed, 3 insertions(+), 7 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index cb416723538f..faa33eac1635 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
page = list_last_entry(list, struct page, lru);
/* must delete as __free_one_page list manipulates */
list_del(&page->lru);
+ pcp->count--;

mt = get_pcppage_migratetype(page);
/* MIGRATE_ISOLATE page should not go to pcplists */
@@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
local_irq_save(flags);
batch = READ_ONCE(pcp->batch);
to_drain = min(pcp->count, batch);
- if (to_drain > 0) {
+ if (to_drain > 0)
free_pcppages_bulk(zone, to_drain, pcp);
- pcp->count -= to_drain;
- }
local_irq_restore(flags);
}
#endif
@@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
pset = per_cpu_ptr(zone->pageset, cpu);

pcp = &pset->pcp;
- if (pcp->count) {
+ if (pcp->count)
free_pcppages_bulk(zone, pcp->count, pcp);
- pcp->count = 0;
- }
local_irq_restore(flags);
}

@@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
if (pcp->count >= pcp->high) {
unsigned long batch = READ_ONCE(pcp->batch);
free_pcppages_bulk(zone, batch, pcp);
- pcp->count -= batch;
}
}

--
2.14.3


2018-03-01 12:12:34

by David Rientjes

Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside

On Thu, 1 Mar 2018, Aaron Lu wrote:

> Matthew Wilcox found that all callers of free_pcppages_bulk() currently
> update pcp->count immediately after so it's natural to do it inside
> free_pcppages_bulk().
>
> No functionality or performance change is expected from this patch.
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>

Acked-by: David Rientjes <[email protected]>

2018-03-01 13:46:47

by Michal Hocko

Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside

On Thu 01-03-18 14:28:43, Aaron Lu wrote:
> Matthew Wilcox found that all callers of free_pcppages_bulk() currently
> update pcp->count immediately after so it's natural to do it inside
> free_pcppages_bulk().
>
> No functionality or performance change is expected from this patch.
>
> Suggested-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>

Makes a lot of sense to me.
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cb416723538f..faa33eac1635 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> page = list_last_entry(list, struct page, lru);
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> + pcp->count--;
>
> mt = get_pcppage_migratetype(page);
> /* MIGRATE_ISOLATE page should not go to pcplists */
> @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> local_irq_save(flags);
> batch = READ_ONCE(pcp->batch);
> to_drain = min(pcp->count, batch);
> - if (to_drain > 0) {
> + if (to_drain > 0)
> free_pcppages_bulk(zone, to_drain, pcp);
> - pcp->count -= to_drain;
> - }
> local_irq_restore(flags);
> }
> #endif
> @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> pset = per_cpu_ptr(zone->pageset, cpu);
>
> pcp = &pset->pcp;
> - if (pcp->count) {
> + if (pcp->count)
> free_pcppages_bulk(zone, pcp->count, pcp);
> - pcp->count = 0;
> - }
> local_irq_restore(flags);
> }
>
> @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> if (pcp->count >= pcp->high) {
> unsigned long batch = READ_ONCE(pcp->batch);
> free_pcppages_bulk(zone, batch, pcp);
> - pcp->count -= batch;
> }
> }
>
> --
> 2.14.3
>

--
Michal Hocko
SUSE Labs

2018-03-01 13:56:19

by Michal Hocko

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Thu 01-03-18 14:28:44, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
>
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.

Iteration count I assume. I am still quite surprised that this would
have such a large impact.

> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
>
> Acked-by: Mel Gorman <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>

The patch makes sense to me
Acked-by: Michal Hocko <[email protected]>

> ---
> mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
> 1 file changed, 23 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index faa33eac1635..dafdcdec9c1f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> int migratetype = 0;
> int batch_free = 0;
> bool isolated_pageblocks;
> -
> - spin_lock(&zone->lock);
> - isolated_pageblocks = has_isolate_pageblock(zone);
> + struct page *page, *tmp;
> + LIST_HEAD(head);
>
> while (count) {
> - struct page *page;
> struct list_head *list;
>
> /*
> @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> batch_free = count;
>
> do {
> - int mt; /* migratetype of the to-be-freed page */
> -
> page = list_last_entry(list, struct page, lru);
> - /* must delete as __free_one_page list manipulates */
> + /* must delete to avoid corrupting pcp list */
> list_del(&page->lru);
> pcp->count--;
>
> - mt = get_pcppage_migratetype(page);
> - /* MIGRATE_ISOLATE page should not go to pcplists */
> - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> - /* Pageblock could have been isolated meanwhile */
> - if (unlikely(isolated_pageblocks))
> - mt = get_pageblock_migratetype(page);
> -
> if (bulkfree_pcp_prepare(page))
> continue;
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> + list_add_tail(&page->lru, &head);
> } while (--count && --batch_free && !list_empty(list));
> }
> +
> + spin_lock(&zone->lock);
> + isolated_pageblocks = has_isolate_pageblock(zone);
> +
> + /*
> + * Use safe version since after __free_one_page(),
> + * page->lru.next will not point to original list.
> + */
> + list_for_each_entry_safe(page, tmp, &head, lru) {
> + int mt = get_pcppage_migratetype(page);
> + /* MIGRATE_ISOLATE page should not go to pcplists */
> + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> + /* Pageblock could have been isolated meanwhile */
> + if (unlikely(isolated_pageblocks))
> + mt = get_pageblock_migratetype(page);
> +
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> + trace_mm_page_pcpu_drain(page, 0, mt);
> + }
> spin_unlock(&zone->lock);
> }
>
> --
> 2.14.3
>

--
Michal Hocko
SUSE Labs

2018-03-01 14:02:55

by Michal Hocko

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing buddy's
> page structure and that access could take a long time if it's cache cold.
>
> This patch adds a prefetch to the to-be-freed page's buddy outside of
> zone->lock in hope of accessing buddy's page structure later under
> zone->lock will be faster. Since we *always* do buddy merging and check
> an order-0 page's buddy to try to merge it when it goes into the main
> allocator, the cacheline will always come in, i.e. the prefetched data
> will never be unused.
>
> In the meantime, there are two concerns:
> 1 the prefetch could potentially evict existing cachelines, especially
> for L1D cache since it is not huge;
> 2 there is some additional instruction overhead, namely calculating
> buddy pfn twice.
>
> For 1, it's hard to say, this microbenchmark though shows good result but
> the actual benefit of this patch will be workload/CPU dependant;
> For 2, since the calculation is a XOR on two local variables, it's expected
> in many cases that cycles spent will be offset by reduced memory latency
> later. This is especially true for NUMA machines where multiple CPUs are
> contending on zone->lock and the most time consuming part under zone->lock
> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> buddies.
>
> Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> Note: this patch's performance improvement percent is against patch2/3.

I am really surprised that this has such a big impact. Is this a win on
other architectures as well?

> [changelog stole from Dave Hansen and Mel Gorman's comments]
> https://lkml.org/lkml/2018/1/24/551

Please use http://lkml.kernel.org/r/<msg-id> for references because
lkml.org is quite unstable. It would be
http://lkml.kernel.org/r/[email protected]
here.
--
Michal Hocko
SUSE Labs

2018-03-02 00:03:19

by Andrew Morton

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu <[email protected]> wrote:

> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
>
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.

But it's a loss for uniprocessor systems: it adds more code and adds an
additional pass across a list.

2018-03-02 01:35:34

by Andrew Morton

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu <[email protected]> wrote:

> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing buddy's
> page structure and that access could take a long time if it's cache cold.
>
> This patch adds a prefetch to the to-be-freed page's buddy outside of
> zone->lock in hope of accessing buddy's page structure later under
> zone->lock will be faster. Since we *always* do buddy merging and check
> an order-0 page's buddy to try to merge it when it goes into the main
> allocator, the cacheline will always come in, i.e. the prefetched data
> will never be unused.
>
> In the meantime, there are two concerns:
> 1 the prefetch could potentially evict existing cachelines, especially
> for L1D cache since it is not huge;
> 2 there is some additional instruction overhead, namely calculating
> buddy pfn twice.
>
> For 1, it's hard to say, this microbenchmark though shows good result but
> the actual benefit of this patch will be workload/CPU dependant;
> For 2, since the calculation is a XOR on two local variables, it's expected
> in many cases that cycles spent will be offset by reduced memory latency
> later. This is especially true for NUMA machines where multiple CPUs are
> contending on zone->lock and the most time consuming part under zone->lock
> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> buddies.
>
> Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> Note: this patch's performance improvement percent is against patch2/3.
>
> ...
>
> @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> continue;
>
> list_add_tail(&page->lru, &head);
> +
> + /*
> + * We are going to put the page back to the global
> + * pool, prefetch its buddy to speed up later access
> + * under zone->lock. It is believed the overhead of
> + * calculating buddy_pfn here can be offset by reduced
> + * memory latency later.
> + */
> + pfn = page_to_pfn(page);
> + buddy_pfn = __find_buddy_pfn(pfn, 0);
> + buddy = page + (buddy_pfn - pfn);
> + prefetch(buddy);

What is the typical list length here? Maybe it's approximately the pcp
batch size which is typically 128 pages?

If so, I'm a bit surprised that it is effective to prefetch 128 page
frames before using any of them for real. I guess they'll fit in the L2
cache. Thoughts?

2018-03-02 07:16:35

by Aaron Lu

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Thu, Mar 01, 2018 at 02:55:18PM +0100, Michal Hocko wrote:
> On Thu 01-03-18 14:28:44, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> >
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> >
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
>
> Iteration count I assume.

Correct.

> I am still quite surprised that this would have such a large impact.

Most likely because the cachelines for these page structures are warmed
up outside of zone->lock.

2018-03-02 07:43:41

by Huang, Ying

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

Michal Hocko <[email protected]> writes:

> On Thu 01-03-18 14:28:44, Aaron Lu wrote:
>> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
>> the zone->lock is held and then pages are chosen from PCP's migratetype
>> list. While there is actually no need to do this 'choose part' under
>> lock since it's PCP pages, the only CPU that can touch them is us and
>> irq is also disabled.
>>
>> Moving this part outside could reduce lock held time and improve
>> performance. Test with will-it-scale/page_fault1 full load:
>>
>> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
>> v4.16-rc2+ 9034215 7971818 13667135 15677465
>> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>>
>> What the test does is: starts $nr_cpu processes and each will repeatedly
>> do the following for 5 minutes:
>> 1 mmap 128M anonymouse space;
>> 2 write access to that space;
>> 3 munmap.
>> The score is the aggregated iteration.
>
> Iteration count I assume. I am still quite surprised that this would
> have such a large impact.

The test is run at full load, which means close to or more than 100
processes allocate memory in parallel. According to Amdahl's law, the
performance of a parallel program is dominated by its serial part,
which in this case is the part protected by zone->lock. So small
changes to the code under zone->lock can make a big difference to the
overall score.

Best Regards,
Huang, Ying

2018-03-02 08:46:25

by Aaron Lu

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu <[email protected]> wrote:
>
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> >
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> >
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
>
> But it's a loss for uniprocessor systems: it adds more code and adds an
> additional pass across a list.

Performance wise, I assume the loss is pretty small and cannot
be measured.

On my Sandybridge desktop, with a will-it-scale/page_fault1/single-process
run to emulate a uniprocessor system, the score is (average of 3 runs):

base(patch 1/3): 649710
this patch: 653554 +0.6%
prefetch(patch 3/3): 650336 (in noise range compared to base)

On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single
process run:

base(patch 1/3): 498649
this patch: 504171 +1.1%
prefetch(patch 3/3): 506334 +1.5% (compared to base)

It looks like we don't need to worry too much about performance for
uniprocessor systems.

2018-03-02 12:03:52

by Aaron Lu

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Thu, Mar 01, 2018 at 04:09:50PM -0800, Andrew Morton wrote:
> On Thu, 1 Mar 2018 14:28:45 +0800 Aaron Lu <[email protected]> wrote:
>
> > When a page is freed back to the global pool, its buddy will be checked
> > to see if it's possible to do a merge. This requires accessing buddy's
> > page structure and that access could take a long time if it's cache cold.
> >
> > This patch adds a prefetch to the to-be-freed page's buddy outside of
> > zone->lock in hope of accessing buddy's page structure later under
> > zone->lock will be faster. Since we *always* do buddy merging and check
> > an order-0 page's buddy to try to merge it when it goes into the main
> > allocator, the cacheline will always come in, i.e. the prefetched data
> > will never be unused.
> >
> > In the meantime, there are two concerns:
> > 1 the prefetch could potentially evict existing cachelines, especially
> > for L1D cache since it is not huge;
> > 2 there is some additional instruction overhead, namely calculating
> > buddy pfn twice.
> >
> > For 1, it's hard to say, this microbenchmark though shows good result but
> > the actual benefit of this patch will be workload/CPU dependant;
> > For 2, since the calculation is a XOR on two local variables, it's expected
> > in many cases that cycles spent will be offset by reduced memory latency
> > later. This is especially true for NUMA machines where multiple CPUs are
> > contending on zone->lock and the most time consuming part under zone->lock
> > is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > buddies.
> >
> > Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > Note: this patch's performance improvement percent is against patch2/3.
> >
> > ...
> >
> > @@ -1150,6 +1153,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > continue;
> >
> > list_add_tail(&page->lru, &head);
> > +
> > + /*
> > + * We are going to put the page back to the global
> > + * pool, prefetch its buddy to speed up later access
> > + * under zone->lock. It is believed the overhead of
> > + * calculating buddy_pfn here can be offset by reduced
> > + * memory latency later.
> > + */
> > + pfn = page_to_pfn(page);
> > + buddy_pfn = __find_buddy_pfn(pfn, 0);
> > + buddy = page + (buddy_pfn - pfn);
> > + prefetch(buddy);
>
> What is the typical list length here? Maybe it's approximately the pcp
> batch size which is typically 128 pages?

Most of the time it is pcp->batch, except when all of the pcp's pages
need to be drained, as in drain_local_pages(zone).

pcp->batch has a default value of 31 and its upper limit is 96 on
x86_64. For this test it is 31; I didn't manipulate
/proc/sys/vm/percpu_pagelist_fraction to change it.

With this said, the count here could be pcp->count when all of the pcp's
pages need to be drained, and though pcp->count's default value is
(6*pcp->batch)=186, the user can increase that value through the above
mentioned procfs interface, so the resulting pcp->count could be too
big for prefetch. Ying also mentioned this today and suggested adding
an upper limit here to avoid prefetching too much. Perhaps just prefetch
the last pcp->batch pages if the count here is > pcp->batch? Since
pcp->batch has an upper limit, we wouldn't need to worry about
prefetching too much.
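
A rough sketch of such a cap (hypothetical, not the posted patch, and
the prefetch_nr counter is made up for the sketch; it bounds the number
of prefetches at pcp->batch rather than literally picking "the last"
batch):

            list_add_tail(&page->lru, &head);

            /*
             * Hypothetical cap: only prefetch buddies for up to
             * pcp->batch of the picked pages, so a full drain of an
             * enlarged pcp list stays bounded. Assumes
             * "unsigned int prefetch_nr = READ_ONCE(pcp->batch);"
             * declared earlier in free_pcppages_bulk().
             */
            if (prefetch_nr) {
                    unsigned long pfn = page_to_pfn(page);
                    unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);

                    prefetch(page + (buddy_pfn - pfn));
                    prefetch_nr--;
            }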

>
> If so, I'm a bit surprised that it is effective to prefetch 128 page
> frames before using any them for real. I guess they'll fit in the L2
> cache. Thoughts?

2018-03-02 12:03:56

by Aaron Lu

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Thu, Mar 01, 2018 at 03:00:44PM +0100, Michal Hocko wrote:
> On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> > When a page is freed back to the global pool, its buddy will be checked
> > to see if it's possible to do a merge. This requires accessing buddy's
> > page structure and that access could take a long time if it's cache cold.
> >
> > This patch adds a prefetch to the to-be-freed page's buddy outside of
> > zone->lock in hope of accessing buddy's page structure later under
> > zone->lock will be faster. Since we *always* do buddy merging and check
> > an order-0 page's buddy to try to merge it when it goes into the main
> > allocator, the cacheline will always come in, i.e. the prefetched data
> > will never be unused.
> >
> > In the meantime, there are two concerns:
> > 1 the prefetch could potentially evict existing cachelines, especially
> > for L1D cache since it is not huge;
> > 2 there is some additional instruction overhead, namely calculating
> > buddy pfn twice.
> >
> > For 1, it's hard to say, this microbenchmark though shows good result but
> > the actual benefit of this patch will be workload/CPU dependant;
> > For 2, since the calculation is a XOR on two local variables, it's expected
> > in many cases that cycles spent will be offset by reduced memory latency
> > later. This is especially true for NUMA machines where multiple CPUs are
> > contending on zone->lock and the most time consuming part under zone->lock
> > is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > buddies.
> >
> > Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > Note: this patch's performance improvement percent is against patch2/3.
>
> I am really surprised that this has such a big impact. Is this a win on
> other architectures as well?

For NUMA machines, I guess so. But I didn't test other archs so can't
say for sure.

>
> > [changelog stole from Dave Hansen and Mel Gorman's comments]
> > https://lkml.org/lkml/2018/1/24/551
>
> Please use http://lkml.kernel.org/r/<msg-id> for references because
> lkml.org is quite unstable. It would be
> http://lkml.kernel.org/r/[email protected]
> here.

Good to know this, thanks!

2018-03-02 16:17:49

by Dave Hansen

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On 03/01/2018 11:15 PM, Aaron Lu wrote:
>
>> I am still quite surprised that this would have such a large impact.
> Most likely due to the cachelines for these page structures are warmed
> up outside of zone->lock.

The workload here is a pretty tight microbenchmark and the single
biggest bottleneck is cache misses on 'struct page'. It's not memory
bandwidth bound. So, anything you can do to keep the CPU fed and not
waiting on cache misses will be a win.

There's never going to be a real-world workload that sees this kind of
increase, but the change in the micro isn't super-surprising because it
so directly targets the bottleneck.

2018-03-02 20:35:23

by Vlastimil Babka

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/01/2018 03:00 PM, Michal Hocko wrote:
> On Thu 01-03-18 14:28:45, Aaron Lu wrote:
>> When a page is freed back to the global pool, its buddy will be checked
>> to see if it's possible to do a merge. This requires accessing buddy's
>> page structure and that access could take a long time if it's cache cold.
>>
>> This patch adds a prefetch to the to-be-freed page's buddy outside of
>> zone->lock in hope of accessing buddy's page structure later under
>> zone->lock will be faster. Since we *always* do buddy merging and check
>> an order-0 page's buddy to try to merge it when it goes into the main
>> allocator, the cacheline will always come in, i.e. the prefetched data
>> will never be unused.
>>
>> In the meantime, there are two concerns:
>> 1 the prefetch could potentially evict existing cachelines, especially
>> for L1D cache since it is not huge;
>> 2 there is some additional instruction overhead, namely calculating
>> buddy pfn twice.
>>
>> For 1, it's hard to say, this microbenchmark though shows good result but
>> the actual benefit of this patch will be workload/CPU dependant;
>> For 2, since the calculation is a XOR on two local variables, it's expected
>> in many cases that cycles spent will be offset by reduced memory latency
>> later. This is especially true for NUMA machines where multiple CPUs are
>> contending on zone->lock and the most time consuming part under zone->lock
>> is the wait of 'struct page' cacheline of the to-be-freed pages and their
>> buddies.
>>
>> Test with will-it-scale/page_fault1 full load:
>>
>> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
>> v4.16-rc2+ 9034215 7971818 13667135 15677465
>> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
>> Note: this patch's performance improvement percent is against patch2/3.
>
> I am really surprised that this has such a big impact.

It's even stranger to me. Struct page is 64 bytes these days, exactly a
cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
line (the one that forms an aligned 128-byte block with the one we
touch), which is exactly an order-0 buddy struct page! Maybe that
implicit prefetching stopped at L2 while an explicit prefetch goes all
the way to L1, I can't remember. Would that make such a difference? It
would be nice to do some perf tests with cache counters to see what is
really going on...
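
One way to gather such counters system-wide while the benchmark runs
(just a starting point; the exact event names depend on the CPU and the
perf build):

    perf stat -a \
            -e cache-references,cache-misses \
            -e L1-dcache-loads,L1-dcache-load-misses \
            -e LLC-loads,LLC-load-misses \
            -- sleep 60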

Vlastimil

> Is this a win on
> other architectures as well?
>
>> [changelog stole from Dave Hansen and Mel Gorman's comments]
>> https://lkml.org/lkml/2018/1/24/551
>
> Please use http://lkml.kernel.org/r/<msg-id> for references because
> lkml.org is quite unstable. It would be
> http://lkml.kernel.org/r/[email protected]
> here.
>


2018-03-02 20:46:56

by Dave Hansen

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/02/2018 09:55 AM, Vlastimil Babka wrote:
> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> line (that forms an aligned 128 bytes block with the one we touch).

I believe that was a behavior that was specific to the Pentium 4
"Netburst" era. I don't think the 128-byte line behavior exists on
modern Intel CPUs.

2018-03-02 20:57:05

by Vlastimil Babka

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/02/2018 07:00 PM, Dave Hansen wrote:
> On 03/02/2018 09:55 AM, Vlastimil Babka wrote:
>> It's even stranger to me. Struct page is 64 bytes these days, exactly a
>> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
>> line (that forms an aligned 128 bytes block with the one we touch).
>
> I believe that was a behavior that was specific to the Pentium 4
> "Netburst" era. I don't think the 128-byte line behavior exists on
> modern Intel cpus.

I remember it on Core 2 something (Nehalem IIRC). And this page suggests
up to Broadwell, and it can be disabled. And it's an L2 prefetcher indeed.
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors



2018-03-02 22:03:19

by Andrew Morton

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Fri, 2 Mar 2018 16:01:25 +0800 Aaron Lu <[email protected]> wrote:

> On Thu, Mar 01, 2018 at 04:01:05PM -0800, Andrew Morton wrote:
> > On Thu, 1 Mar 2018 14:28:44 +0800 Aaron Lu <[email protected]> wrote:
> >
> > > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > > the zone->lock is held and then pages are chosen from PCP's migratetype
> > > list. While there is actually no need to do this 'choose part' under
> > > lock since it's PCP pages, the only CPU that can touch them is us and
> > > irq is also disabled.
> > >
> > > Moving this part outside could reduce lock held time and improve
> > > performance. Test with will-it-scale/page_fault1 full load:
> > >
> > > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > >
> > > What the test does is: starts $nr_cpu processes and each will repeatedly
> > > do the following for 5 minutes:
> > > 1 mmap 128M anonymouse space;
> > > 2 write access to that space;
> > > 3 munmap.
> > > The score is the aggregated iteration.
> >
> > But it's a loss for uniprocessor systems: it adds more code and adds an
> > additional pass across a list.
>
> Performance wise, I assume the loss is pretty small and can not
> be measured.
>
> On my Sandybridge desktop, with will-it-scale/page_fault1/single process
> run to emulate uniprocessor system, the score is(average of 3 runs):
>
> base(patch 1/3): 649710
> this patch: 653554 +0.6%

Does that mean we got faster or slower?

> prefetch(patch 3/3): 650336 (in noise range compared to base)
>
> On 4 sockets Intel Broadwell with will-it-scale/page_fault1/single
> process run:
>
> base(patch 1/3): 498649
> this patch: 504171 +1.1%
> prefetch(patch 3/3): 506334 +1.5% (compared to base)
>
> It looks like we don't need to worry too much about performance for
> uniprocessor system.

Well. We can say that of hundreds of patches. And we end up with a
fatter and slower kernel than we otherwise would.

Please take a look, see if there's a tidy way of avoiding this.
Probably there isn't, in which case oh well. But let's at least try.


2018-03-03 03:32:37

by Dave Hansen

Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On 03/02/2018 01:23 PM, Andrew Morton wrote:
>> On my Sandybridge desktop, with will-it-scale/page_fault1/single process
>> run to emulate uniprocessor system, the score is(average of 3 runs):
>>
>> base(patch 1/3): 649710
>> this patch: 653554 +0.6%
> Does that mean we got faster or slower?

Faster. The unit of measurement here is iterations through a loop.
More iterations are better.


2018-03-05 14:11:49

by Aaron Lu

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 03:00 PM, Michal Hocko wrote:
> > On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> >> When a page is freed back to the global pool, its buddy will be checked
> >> to see if it's possible to do a merge. This requires accessing buddy's
> >> page structure and that access could take a long time if it's cache cold.
> >>
> >> This patch adds a prefetch to the to-be-freed page's buddy outside of
> >> zone->lock in hope of accessing buddy's page structure later under
> >> zone->lock will be faster. Since we *always* do buddy merging and check
> >> an order-0 page's buddy to try to merge it when it goes into the main
> >> allocator, the cacheline will always come in, i.e. the prefetched data
> >> will never be unused.
> >>
> >> In the meantime, there are two concerns:
> >> 1 the prefetch could potentially evict existing cachelines, especially
> >> for L1D cache since it is not huge;
> >> 2 there is some additional instruction overhead, namely calculating
> >> buddy pfn twice.
> >>
> >> For 1, it's hard to say, this microbenchmark though shows good result but
> >> the actual benefit of this patch will be workload/CPU dependant;
> >> For 2, since the calculation is a XOR on two local variables, it's expected
> >> in many cases that cycles spent will be offset by reduced memory latency
> >> later. This is especially true for NUMA machines where multiple CPUs are
> >> contending on zone->lock and the most time consuming part under zone->lock
> >> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> >> buddies.
> >>
> >> Test with will-it-scale/page_fault1 full load:
> >>
> >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> >> v4.16-rc2+ 9034215 7971818 13667135 15677465
> >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> >> Note: this patch's performance improvement percent is against patch2/3.
> >
> > I am really surprised that this has such a big impact.
>
> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> line (that forms an aligned 128 bytes block with the one we touch).
> Which is exactly a order-0 buddy struct page! Maybe that implicit
> prefetching stopped at L2 and explicit goes all the way to L1, can't

The Intel Architecture Optimization Manual section 7.3.2 says:

prefetchT0 - fetch data into all cache levels
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: 1st, 2nd and 3rd level cache.

prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
prefetchT1)
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: 2nd and 3rd level cache.

prefetchNTA - fetch data into non-temporal cache close to the processor,
minimizing cache pollution
Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
microarchitectures: must fetch into 3rd level cache with fast replacement.

I tried 'prefetcht0' and 'prefetcht2' instead of the default
'prefetchNTA' on a 2 sockets Intel Skylake; the two ended up with about
the same performance number as prefetchNTA. I had expected prefetcht0 to
deliver a better score if it was indeed due to L1D, since prefetcht2 will
not place data into L1 while prefetcht0 will, but it looks like that is
not the case here.

It feels more like the buddy cacheline isn't in any level of the caches
without prefetch for some reason.
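
In case anyone wants to redo that comparison: one way to switch between
those hints (my assumption about the method, not something stated here)
is gcc's __builtin_prefetch(), whose locality argument maps on x86 to
the instructions above:

    /*
     * With gcc on x86, locality 0 emits prefetchnta, 1 prefetcht2,
     * 2 prefetcht1 and 3 prefetcht0 (second argument 0 = read).
     */
    static inline void prefetch_nta(const void *p) { __builtin_prefetch(p, 0, 0); }
    static inline void prefetch_t2(const void *p)  { __builtin_prefetch(p, 0, 1); }
    static inline void prefetch_t0(const void *p)  { __builtin_prefetch(p, 0, 3); }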

> remember. Would that make such a difference? It would be nice to do some
> perf tests with cache counters to see what is really going on...

Comparing prefetchT2 to no-prefetch, I saw these metrics change
(the ±3% annotations are the stddev):

no-prefetch      change    prefetchT2        metric
------------------------------------------------------------------------
0.18             +0.0      0.18              perf-stat.branch-miss-rate%
8.268e+09        +3.8%     8.585e+09         perf-stat.branch-misses
2.333e+10        +4.7%     2.443e+10         perf-stat.cache-misses
2.402e+11        +5.0%     2.522e+11         perf-stat.cache-references
3.52             -1.1%     3.48              perf-stat.cpi
0.02             -0.0      0.01 ±3%          perf-stat.dTLB-load-miss-rate%
8.677e+08        -7.3%     8.048e+08 ±3%     perf-stat.dTLB-load-misses
1.18             +0.0      1.19              perf-stat.dTLB-store-miss-rate%
2.359e+10        +6.0%     2.502e+10         perf-stat.dTLB-store-misses
1.979e+12        +5.0%     2.078e+12         perf-stat.dTLB-stores
6.126e+09        +10.1%    6.745e+09 ±3%     perf-stat.iTLB-load-misses
3464             -8.4%     3172 ±3%          perf-stat.instructions-per-iTLB-miss
0.28             +1.1%     0.29              perf-stat.ipc
2.929e+09        +5.1%     3.077e+09         perf-stat.minor-faults
9.244e+09        +4.7%     9.681e+09         perf-stat.node-loads
2.491e+08        +5.8%     2.634e+08         perf-stat.node-store-misses
6.472e+09        +6.1%     6.869e+09         perf-stat.node-stores
2.929e+09        +5.1%     3.077e+09         perf-stat.page-faults
2182469          -4.2%     2090977           perf-stat.path-length

Not sure if this is useful though...

2018-03-05 14:13:07

by Aaron Lu

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Mon, Mar 05, 2018 at 07:41:59PM +0800, Aaron Lu wrote:
> On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> > On 03/01/2018 03:00 PM, Michal Hocko wrote:
> > > On Thu 01-03-18 14:28:45, Aaron Lu wrote:
> > >> When a page is freed back to the global pool, its buddy will be checked
> > >> to see if it's possible to do a merge. This requires accessing buddy's
> > >> page structure and that access could take a long time if it's cache cold.
> > >>
> > >> This patch adds a prefetch to the to-be-freed page's buddy outside of
> > >> zone->lock in hope of accessing buddy's page structure later under
> > >> zone->lock will be faster. Since we *always* do buddy merging and check
> > >> an order-0 page's buddy to try to merge it when it goes into the main
> > >> allocator, the cacheline will always come in, i.e. the prefetched data
> > >> will never be unused.
> > >>
> > >> In the meantime, there are two concerns:
> > >> 1 the prefetch could potentially evict existing cachelines, especially
> > >> for L1D cache since it is not huge;
> > >> 2 there is some additional instruction overhead, namely calculating
> > >> buddy pfn twice.
> > >>
> > >> For 1, it's hard to say, this microbenchmark though shows good result but
> > >> the actual benefit of this patch will be workload/CPU dependant;
> > >> For 2, since the calculation is a XOR on two local variables, it's expected
> > >> in many cases that cycles spent will be offset by reduced memory latency
> > >> later. This is especially true for NUMA machines where multiple CPUs are
> > >> contending on zone->lock and the most time consuming part under zone->lock
> > >> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > >> buddies.
> > >>
> > >> Test with will-it-scale/page_fault1 full load:
> > >>
> > >> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > >> v4.16-rc2+ 9034215 7971818 13667135 15677465
> > >> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > >> this patch 10338868 +8.4% 8544477 +2.8% 14839808 +5.5% 17155464 +2.9%
> > >> Note: this patch's performance improvement percent is against patch2/3.
> > >
> > > I am really surprised that this has such a big impact.
> >
> > It's even stranger to me. Struct page is 64 bytes these days, exactly a
> > a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> > line (that forms an aligned 128 bytes block with the one we touch).
> > Which is exactly a order-0 buddy struct page! Maybe that implicit
> > prefetching stopped at L2 and explicit goes all the way to L1, can't
>
> The Intel Architecture Optimization Manual section 7.3.2 says:
>
> prefetchT0 - fetch data into all cache levels
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 1st, 2nd and 3rd level cache.
>
> prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> prefetchT1)
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 2nd and 3rd level cache.
>
> prefetchNTA - fetch data into non-temporal cache close to the processor,
> minimizing cache pollution
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: must fetch into 3rd level cache with fast replacement.
>
> I tried 'prefetcht0' and 'prefetcht2' instead of the default
> 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
~~~~~~~
Correction: should be Broadwell here.

> the same performance number as prefetchNTA. I had expected prefetchT0 to
> deliver a better score if it was indeed due to L1D since prefetchT2 will
> not place data into L1 while prefetchT0 will, but looks like it is not
> the case here.
>
> It feels more like the buddy cacheline isn't in any level of the caches
> without prefetch for some reason.

2018-03-06 07:57:06

by Vlastimil Babka

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/05/2018 12:41 PM, Aaron Lu wrote:
> On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
>> On 03/01/2018 03:00 PM, Michal Hocko wrote:
>>>
>>> I am really surprised that this has such a big impact.
>>
>> It's even stranger to me. Struct page is 64 bytes these days, exactly a
>> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
>> line (that forms an aligned 128 bytes block with the one we touch).
>> Which is exactly a order-0 buddy struct page! Maybe that implicit
>> prefetching stopped at L2 and explicit goes all the way to L1, can't
>
> The Intel Architecture Optimization Manual section 7.3.2 says:
>
> prefetchT0 - fetch data into all cache levels
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 1st, 2nd and 3rd level cache.
>
> prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> prefetchT1)
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: 2nd and 3rd level cache.
>
> prefetchNTA - fetch data into non-temporal cache close to the processor,
> minimizing cache pollution
> Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> microarchitectures: must fetch into 3rd level cache with fast replacement.
>
> I tried 'prefetcht0' and 'prefetcht2' instead of the default
> 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
> the same performance number as prefetchNTA. I had expected prefetchT0 to
> deliver a better score if it was indeed due to L1D since prefetchT2 will
> not place data into L1 while prefetchT0 will, but looks like it is not
> the case here.
>
> It feels more like the buddy cacheline isn't in any level of the caches
> without prefetch for some reason.

So the adjacent line prefetch might be disabled? Could you check the
BIOS or the MSR mentioned in
https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
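
For reference, a quick way to check (and, per my reading of that
article, temporarily toggle) the prefetchers with msr-tools; the
article documents bit 1 of MSR 0x1a4 as the L2 adjacent cache line
prefetcher, with a set bit meaning disabled:

    modprobe msr
    rdmsr -a 0x1a4        # 0 = all documented prefetchers enabled
    # wrmsr -a 0x1a4 0x2  # would disable only the adjacent line prefetcher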

>> remember. Would that make such a difference? It would be nice to do some
>> perf tests with cache counters to see what is really going on...
>
> Compare prefetchT2 to no-prefetch, I saw these metrics change:
>
> no-prefetch change prefetchT2 metrics
> \ \
> stddev stddev
> ------------------------------------------------------------------------
> 0.18 +0.0 0.18 perf-stat.branch-miss-rate%
> 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses
> 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses
> 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references
> 3.52 -1.1% 3.48 perf-stat.cpi
> 0.02 -0.0 0.01 ±3% perf-stat.dTLB-load-miss-rate%
> 8.677e+08 -7.3% 8.048e+08 ±3% perf-stat.dTLB-load-misses
> 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate%
> 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses
> 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores
> 6.126e+09 +10.1% 6.745e+09 ±3% perf-stat.iTLB-load-misses
> 3464 -8.4% 3172 ±3% perf-stat.instructions-per-iTLB-miss
> 0.28 +1.1% 0.29 perf-stat.ipc
> 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults
> 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads
> 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses
> 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores
> 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults
> 2182469 -4.2% 2090977 perf-stat.path-length
>
> Not sure if this is useful though...

Looks like most stats increased in absolute values as the work done
increased and this is a time-limited benchmark? Although number of
instructions (calculated from itlb misses and insns-per-itlb-miss) shows
less than 1% increase, so dunno. And the improvement comes from reduced
dTLB-load-misses? That makes no sense for order-0 buddy struct pages
which always share a page. And the memmap mapping should use huge pages.
BTW what is path-length?


2018-03-06 12:28:21

by Aaron Lu

Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote:
> On 03/05/2018 12:41 PM, Aaron Lu wrote:
> > On Fri, Mar 02, 2018 at 06:55:25PM +0100, Vlastimil Babka wrote:
> >> On 03/01/2018 03:00 PM, Michal Hocko wrote:
> >>>
> >>> I am really surprised that this has such a big impact.
> >>
> >> It's even stranger to me. Struct page is 64 bytes these days, exactly a
> >> a cache line. Unless that changed, Intel CPUs prefetched a "buddy" cache
> >> line (that forms an aligned 128 bytes block with the one we touch).
> >> Which is exactly a order-0 buddy struct page! Maybe that implicit
> >> prefetching stopped at L2 and explicit goes all the way to L1, can't
> >
> > The Intel Architecture Optimization Manual section 7.3.2 says:
> >
> > prefetchT0 - fetch data into all cache levels
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 1st, 2nd and 3rd level cache.
> >
> > prefetchT2 - fetch data into 2nd and 3rd level caches (identical to
> > prefetchT1)
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: 2nd and 3rd level cache.
> >
> > prefetchNTA - fetch data into non-temporal cache close to the processor,
> > minimizing cache pollution
> > Intel Xeon Processors based on Nehalem, Westmere, Sandy Bridge and newer
> > microarchitectures: must fetch into 3rd level cache with fast replacement.
> >
> > I tried 'prefetcht0' and 'prefetcht2' instead of the default
> > 'prefetchNTA' on a 2 sockets Intel Skylake, the two ended up with about
> > the same performance number as prefetchNTA. I had expected prefetchT0 to
> > deliver a better score if it was indeed due to L1D since prefetchT2 will
> > not place data into L1 while prefetchT0 will, but looks like it is not
> > the case here.
> >
> > It feels more like the buddy cacheline isn't in any level of the caches
> > without prefetch for some reason.
>
> So the adjacent line prefetch might be disabled? Could you check bios or
> the MSR mentioned in
> https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors

root@lkp-bdw-ep2 ~# rdmsr 0x1a4
0

Looks like this feature isn't disabled (the doc you linked says a value
of 1 means disabled).

> >> remember. Would that make such a difference? It would be nice to do some
> >> perf tests with cache counters to see what is really going on...
> >
> > Compare prefetchT2 to no-prefetch, I saw these metrics change:
> >
> > no-prefetch change prefetchT2 metrics
> > \ \
> > stddev stddev
> > ------------------------------------------------------------------------
> > 0.18 +0.0 0.18 perf-stat.branch-miss-rate%
> > 8.268e+09 +3.8% 8.585e+09 perf-stat.branch-misses
> > 2.333e+10 +4.7% 2.443e+10 perf-stat.cache-misses
> > 2.402e+11 +5.0% 2.522e+11 perf-stat.cache-references
> > 3.52 -1.1% 3.48 perf-stat.cpi
> > 0.02 -0.0 0.01 ±3% perf-stat.dTLB-load-miss-rate%
> > 8.677e+08 -7.3% 8.048e+08 ±3% perf-stat.dTLB-load-misses
> > 1.18 +0.0 1.19 perf-stat.dTLB-store-miss-rate%
> > 2.359e+10 +6.0% 2.502e+10 perf-stat.dTLB-store-misses
> > 1.979e+12 +5.0% 2.078e+12 perf-stat.dTLB-stores
> > 6.126e+09 +10.1% 6.745e+09 ±3% perf-stat.iTLB-load-misses
> > 3464 -8.4% 3172 ±3% perf-stat.instructions-per-iTLB-miss
> > 0.28 +1.1% 0.29 perf-stat.ipc
> > 2.929e+09 +5.1% 3.077e+09 perf-stat.minor-faults
> > 9.244e+09 +4.7% 9.681e+09 perf-stat.node-loads
> > 2.491e+08 +5.8% 2.634e+08 perf-stat.node-store-misses
> > 6.472e+09 +6.1% 6.869e+09 perf-stat.node-stores
> > 2.929e+09 +5.1% 3.077e+09 perf-stat.page-faults
> > 2182469 -4.2% 2090977 perf-stat.path-length
> >
> > Not sure if this is useful though...
>
> Looks like most stats increased in absolute values as the work done
> increased and this is a time-limited benchmark? Although number of

Yes it is.

> instructions (calculated from itlb misses and insns-per-itlb-miss) shows
> less than 1% increase, so dunno. And the improvement comes from reduced
> dTLB-load-misses? That makes no sense for order-0 buddy struct pages
> which always share a page. And the memmap mapping should use huge pages.

THP is disabled to stress order-0 pages (I should have mentioned this in
the patch's description, sorry about that).

> BTW what is path-length?

It's the instruction path length: the number of machine code instructions
required to execute a section of a computer program.

2018-03-06 12:54:16

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 3/3] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Tue, Mar 06, 2018 at 08:27:33PM +0800, Aaron Lu wrote:
> On Tue, Mar 06, 2018 at 08:55:57AM +0100, Vlastimil Babka wrote:
> > So the adjacent line prefetch might be disabled? Could you check bios or
> > the MSR mentioned in
> > https://software.intel.com/en-us/articles/disclosure-of-hw-prefetcher-control-on-some-intel-processors
>
> root@lkp-bdw-ep2 ~# rdmsr 0x1a4
> 0

Technically 0x1a4 is per-core, so you should run rdmsr -a 0x1a4 in order to
check all the cores. But I can't imagine they're being set differently on
each core.
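
For anyone who wants to script that check, here is a minimal userspace
sketch of my own (not from this thread) that reads MSR 0x1a4 on each CPU
via the msr driver and decodes bit 1, which the Intel article above
describes as the L2 adjacent cache line prefetcher disable bit. It assumes
'modprobe msr' has been done and root privileges:

/* check_prefetch.c - build with: gcc -O2 -o check_prefetch check_prefetch.c */
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
	char path[64];
	uint64_t val;
	int cpu, fd;

	for (cpu = 0; ; cpu++) {
		snprintf(path, sizeof(path), "/dev/cpu/%d/msr", cpu);
		fd = open(path, O_RDONLY);
		if (fd < 0)
			break;	/* no more CPUs (or msr module not loaded) */
		if (pread(fd, &val, sizeof(val), 0x1a4) == sizeof(val))
			printf("cpu%d: msr 0x1a4=%#llx, adjacent line prefetch %s\n",
			       cpu, (unsigned long long)val,
			       (val & 0x2) ? "disabled" : "enabled");
		close(fd);
	}
	return 0;
}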

> > instructions (calculated from itlb misses and insns-per-itlb-miss) shows
> > less than 1% increase, so dunno. And the improvement comes from reduced
> > dTLB-load-misses? That makes no sense for order-0 buddy struct pages
> > which always share a page. And the memmap mapping should use huge pages.
>
> THP is disabled to stress order 0 pages(should have mentioned this in
> patch's description, sorry about this).

THP isn't related to memmap; the kernel uses huge pages (usually the 1G
pages) in order to map its own memory.


2018-03-09 08:24:47

by Aaron Lu

[permalink] [raw]
Subject: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Fri, Mar 02, 2018 at 04:27:56PM +0800, Aaron Lu wrote:
> With this said, the count here could be pcp->count when pcp's pages
> need to be all drained and though pcp->count's default value is
> (6*pcp->batch)=186, user can increase that value through the above
> mentioned procfs interface and the resulting pcp->count could be too
> big for prefetch. Ying also mentioned this today and suggested adding
> an upper limit here to avoid prefetching too much. Perhaps just prefetch
> the last pcp->batch pages if count here > pcp->batch? Since pcp->batch
> has an upper limit, we won't need to worry prefetching too much.

The following patch adds an upper limit on prefetching. The limit is set
to pcp->batch since 1) it is the most likely value of the input parameter
'count' in free_pcppages_bulk() and 2) it has an upper limit itself.

From: Aaron Lu <[email protected]>
Subject: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

Normally, the number of to-be-freed pages (i.e. count) equals
pcp->batch (default=31, with an upper limit of (PAGE_SHIFT * 8)=96 on
x86_64), but when the pcp's pages are all drained it will be
pcp->count, which is capped by pcp->high. pcp->high, although it has a
default value of 186 (pcp->batch=31 * 6), can be changed by the user
through /proc/sys/vm/percpu_pagelist_fraction and has no software upper
limit, so it could be large, like several thousand. For this reason,
only the buddy structures of the last pcp->batch pages are prefetched
to avoid excessive prefetching. pcp->batch is used because:
1 most often, count == pcp->batch;
2 it has an upper limit itself so we won't prefetch excessively.

Considering the possibly large value of pcp->high, it also makes
sense to free the last added page first for cache hotness reasons.
That's where the change of list_add_tail() to list_add() comes in,
as we will free the pages from head to tail one by one.

In the meantime, there are two concerns:
1 the prefetch could potentially evict existing cachelines, especially
for L1D cache since it is not huge;
2 there is some additional instruction overhead, namely calculating
buddy pfn twice.

For 1, it's hard to say, this microbenchmark though shows good result but
the actual benefit of this patch will be workload/CPU dependant;
For 2, since the calculation is a XOR on two local variables, it's expected
in many cases that cycles spent will be offset by reduced memory latency
later. This is especially true for NUMA machines where multiple CPUs are
contending on zone->lock and the most time consuming part under zone->lock
is the wait of 'struct page' cacheline of the to-be-freed pages and their
buddies.

Test with will-it-scale/page_fault1 full load:

kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
v4.16-rc2+ 9034215 7971818 13667135 15677465
patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
Note: this patch's performance improvement percent is against patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/[email protected])

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aaron Lu <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
---
mm/page_alloc.c | 21 ++++++++++++++++++++-
1 file changed, 20 insertions(+), 1 deletion(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dafdcdec9c1f..5f31f7bab583 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
batch_free = count;

do {
+ unsigned long pfn, buddy_pfn;
+ struct page *buddy;
+
page = list_last_entry(list, struct page, lru);
/* must delete to avoid corrupting pcp list */
list_del(&page->lru);
@@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
if (bulkfree_pcp_prepare(page))
continue;

- list_add_tail(&page->lru, &head);
+ list_add(&page->lru, &head);
+
+ /*
+ * We are going to put the page back to the global
+ * pool, prefetch its buddy to speed up later access
+ * under zone->lock. It is believed the overhead of
+ * an additional test and calculating buddy_pfn here
+ * can be offset by reduced memory latency later. To
+ * avoid excessive prefetching due to large count, only
+ * prefetch buddy for the last pcp->batch nr of pages.
+ */
+ if (count > pcp->batch)
+ continue;
+ pfn = page_to_pfn(page);
+ buddy_pfn = __find_buddy_pfn(pfn, 0);
+ buddy = page + (buddy_pfn - pfn);
+ prefetch(buddy);
} while (--count && --batch_free && !list_empty(list));
}

--
2.14.3
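
As a side note on the "XOR on two local variables" remark in the changelog
above: __find_buddy_pfn() really is just that. A paraphrased sketch of the
helper as it appears in mm/internal.h around this kernel version (quoted
from memory, so double-check the tree you apply this to):

static inline unsigned long
__find_buddy_pfn(unsigned long page_pfn, unsigned int order)
{
	/* order-0 buddies differ only in bit 0 of the pfn */
	return page_pfn ^ (1 << order);
}

So the per-page overhead added here is roughly one comparison, one
page_to_pfn(), one XOR, a little pointer arithmetic to get back to the
buddy's struct page, and the prefetch itself.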


2018-03-09 22:01:04

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

>
> When a page is freed back to the global pool, its buddy will be checked
> to see if it's possible to do a merge. This requires accessing buddy's
> page structure and that access could take a long time if it's cache cold.
>
> This patch adds a prefetch to the to-be-freed page's buddy outside of
> zone->lock in hope of accessing buddy's page structure later under
> zone->lock will be faster. Since we *always* do buddy merging and check
> an order-0 page's buddy to try to merge it when it goes into the main
> allocator, the cacheline will always come in, i.e. the prefetched data
> will never be unused.
>
> Normally, the number of to-be-freed pages(i.e. count) equals to
> pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on
> x86_64) but in the case of pcp's pages getting all drained, it will be
> pcp->count which has an upper limit of pcp->high. pcp->high, although
> has a default value of 186 (pcp->batch=31 * 6), can be changed by user
> through /proc/sys/vm/percpu_pagelist_fraction and there is no software
> upper limit so could be large, like several thousand. For this reason,
> only the last pcp->batch number of page's buddy structure is prefetched
> to avoid excessive prefetching. pcp-batch is used because:
> 1 most often, count == pcp->batch;
> 2 it has an upper limit itself so we won't prefetch excessively.
>
> Considering the possible large value of pcp->high, it also makes
> sense to free the last added page first for cache hot's reason.
> That's where the change of list_add_tail() to list_add() comes in
> as we will free them from head to tail one by one.
>
> In the meantime, there are two concerns:
> 1 the prefetch could potentially evict existing cachelines, especially
> for L1D cache since it is not huge;
> 2 there is some additional instruction overhead, namely calculating
> buddy pfn twice.
>
> For 1, it's hard to say, this microbenchmark though shows good result but
> the actual benefit of this patch will be workload/CPU dependant;
> For 2, since the calculation is a XOR on two local variables, it's expected
> in many cases that cycles spent will be offset by reduced memory latency
> later. This is especially true for NUMA machines where multiple CPUs are
> contending on zone->lock and the most time consuming part under zone->lock
> is the wait of 'struct page' cacheline of the to-be-freed pages and their
> buddies.
>
> Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
> Note: this patch's performance improvement percent is against patch2/3.
>
> (Changelog stolen from Dave Hansen and Mel Gorman's comments at
> http://lkml.kernel.org/r/[email protected])
>
> Link: http://lkml.kernel.org/r/[email protected]
>
> ...
>
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> batch_free = count;
>
> do {
> + unsigned long pfn, buddy_pfn;
> + struct page *buddy;
> +
> page = list_last_entry(list, struct page, lru);
> /* must delete to avoid corrupting pcp list */
> list_del(&page->lru);
> @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> if (bulkfree_pcp_prepare(page))
> continue;
>
> - list_add_tail(&page->lru, &head);
> + list_add(&page->lru, &head);

The result here will be that free_pcppages_bulk() frees the pages in
the reverse order?

I don't immediately see a downside to that. In the (distant) past we
had issues when successive alloc_page() calls would return pages in
descending address order - that totally screwed up scatter-gather page
merging. But this is the page-freeing path. Still, something to be
thought about and monitored.

> +
> + /*
> + * We are going to put the page back to the global
> + * pool, prefetch its buddy to speed up later access
> + * under zone->lock. It is believed the overhead of
> + * an additional test and calculating buddy_pfn here
> + * can be offset by reduced memory latency later. To
> + * avoid excessive prefetching due to large count, only
> + * prefetch buddy for the last pcp->batch nr of pages.
> + */
> + if (count > pcp->batch)
> + continue;
> + pfn = page_to_pfn(page);
> + buddy_pfn = __find_buddy_pfn(pfn, 0);
> + buddy = page + (buddy_pfn - pfn);
> + prefetch(buddy);
> } while (--count && --batch_free && !list_empty(list));

This loop hurts my brain, mainly the handling of `count':

while (count) {
do {
batch_free++;
} while (list_empty(list));

/* This is the only non-empty list. Free them all. */
if (batch_free == MIGRATE_PCPTYPES)
batch_free = count;

do {
} while (--count && --batch_free && !list_empty(list));
}

I guess it kinda makes sense - both loops terminate on count==0. But
still. Can it be clarified?


2018-03-10 14:47:13

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Fri, Mar 09, 2018 at 01:58:32PM -0800, Andrew Morton wrote:
> >
> > When a page is freed back to the global pool, its buddy will be checked
> > to see if it's possible to do a merge. This requires accessing buddy's
> > page structure and that access could take a long time if it's cache cold.
> >
> > This patch adds a prefetch to the to-be-freed page's buddy outside of
> > zone->lock in hope of accessing buddy's page structure later under
> > zone->lock will be faster. Since we *always* do buddy merging and check
> > an order-0 page's buddy to try to merge it when it goes into the main
> > allocator, the cacheline will always come in, i.e. the prefetched data
> > will never be unused.
> >
> > Normally, the number of to-be-freed pages(i.e. count) equals to
> > pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on
> > x86_64) but in the case of pcp's pages getting all drained, it will be
> > pcp->count which has an upper limit of pcp->high. pcp->high, although
> > has a default value of 186 (pcp->batch=31 * 6), can be changed by user
> > through /proc/sys/vm/percpu_pagelist_fraction and there is no software
> > upper limit so could be large, like several thousand. For this reason,
> > only the last pcp->batch number of page's buddy structure is prefetched
> > to avoid excessive prefetching. pcp-batch is used because:
> > 1 most often, count == pcp->batch;
> > 2 it has an upper limit itself so we won't prefetch excessively.
> >
> > Considering the possible large value of pcp->high, it also makes
> > sense to free the last added page first for cache hot's reason.
> > That's where the change of list_add_tail() to list_add() comes in
> > as we will free them from head to tail one by one.
> >
> > In the meantime, there are two concerns:
> > 1 the prefetch could potentially evict existing cachelines, especially
> > for L1D cache since it is not huge;
> > 2 there is some additional instruction overhead, namely calculating
> > buddy pfn twice.
> >
> > For 1, it's hard to say, this microbenchmark though shows good result but
> > the actual benefit of this patch will be workload/CPU dependant;
> > For 2, since the calculation is a XOR on two local variables, it's expected
> > in many cases that cycles spent will be offset by reduced memory latency
> > later. This is especially true for NUMA machines where multiple CPUs are
> > contending on zone->lock and the most time consuming part under zone->lock
> > is the wait of 'struct page' cacheline of the to-be-freed pages and their
> > buddies.
> >
> > Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> > this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
> > Note: this patch's performance improvement percent is against patch2/3.
> >
> > (Changelog stolen from Dave Hansen and Mel Gorman's comments at
> > http://lkml.kernel.org/r/[email protected])
> >
> > Link: http://lkml.kernel.org/r/[email protected]
> >
> > ...
> >
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > batch_free = count;
> >
> > do {
> > + unsigned long pfn, buddy_pfn;
> > + struct page *buddy;
> > +
> > page = list_last_entry(list, struct page, lru);
> > /* must delete to avoid corrupting pcp list */
> > list_del(&page->lru);
> > @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > if (bulkfree_pcp_prepare(page))
> > continue;
> >
> > - list_add_tail(&page->lru, &head);
> > + list_add(&page->lru, &head);
>
> The result here will be that free_pcppages_bulk() frees the pages in
> the reverse order?

Yes, so that the last touched page will be freed first, as the list of
pages will be freed from head to tail later.

This change is for the case when count is large (which is rare):
since the last touched pages and their buddies are more likely to
still be cache hot when these pages are later freed under the lock, it
seems natural to free them first.

We can revert this change if it causes any trouble; doing so should not
affect performance much since count is not large most of the time.

>
> I don't immediately see a downside to that. In the (distant) past we
> had issues when successive alloc_page() calls would return pages in
> descending address order - that totally screwed up scatter-gather page
> merging. But this is the page-freeing path. Still, something to be
> thought about and monitored.

OK.

>
> > +
> > + /*
> > + * We are going to put the page back to the global
> > + * pool, prefetch its buddy to speed up later access
> > + * under zone->lock. It is believed the overhead of
> > + * an additional test and calculating buddy_pfn here
> > + * can be offset by reduced memory latency later. To
> > + * avoid excessive prefetching due to large count, only
> > + * prefetch buddy for the last pcp->batch nr of pages.
> > + */
> > + if (count > pcp->batch)
> > + continue;
> > + pfn = page_to_pfn(page);
> > + buddy_pfn = __find_buddy_pfn(pfn, 0);
> > + buddy = page + (buddy_pfn - pfn);
> > + prefetch(buddy);
> > } while (--count && --batch_free && !list_empty(list));
>
> This loop hurts my brain, mainly the handling of `count':
>
> while (count) {
> do {
> batch_free++;
> } while (list_empty(list));
>
> /* This is the only non-empty list. Free them all. */
> if (batch_free == MIGRATE_PCPTYPES)
> batch_free = count;
>
> do {
> } while (--count && --batch_free && !list_empty(list));
> }
>
> I guess it kinda makes sense - both loops terminate on count==0. But
> still. Can it be clarified?

That's right, count == 0 is the final termination condition.
count is decremented each time a page is picked to be freed, so:

if (count > pcp->batch)
continue;
prefetch();

means we only prefetch for the last pcp->batch pages.
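
To make that concrete, here is a tiny standalone model of my own (not
kernel code; it ignores batch_free and the per-migratetype lists) showing
that with a full drain of pcp->count = 186 pages and pcp->batch = 31 the
prefetch fires only for the final 31 iterations:

#include <stdio.h>

int main(void)
{
	int count = 186;	/* e.g. pcp->count when draining everything */
	int batch = 31;		/* pcp->batch default */
	int prefetched = 0;

	do {
		/* page selection elided */
		if (count > batch)
			continue;	/* more than 'batch' pages still to go: skip */
		prefetched++;		/* stands in for prefetch_buddy(page) */
	} while (--count);

	printf("prefetched buddies for %d of 186 pages\n", prefetched);	/* prints 31 */
	return 0;
}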

2018-03-12 13:23:48

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside

On 03/01/2018 07:28 AM, Aaron Lu wrote:
> Matthew Wilcox found that all callers of free_pcppages_bulk() currently
> update pcp->count immediately after so it's natural to do it inside
> free_pcppages_bulk().
>
> No functionality or performance change is expected from this patch.

Well, it's N decrements instead of one decrement by N / assignment of
zero. But I assume the difference is negligible anyway, right?

> Suggested-by: Matthew Wilcox <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>

Acked-by: Vlastimil Babka <[email protected]>

> ---
> mm/page_alloc.c | 10 +++-------
> 1 file changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index cb416723538f..faa33eac1635 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> page = list_last_entry(list, struct page, lru);
> /* must delete as __free_one_page list manipulates */
> list_del(&page->lru);
> + pcp->count--;
>
> mt = get_pcppage_migratetype(page);
> /* MIGRATE_ISOLATE page should not go to pcplists */
> @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> local_irq_save(flags);
> batch = READ_ONCE(pcp->batch);
> to_drain = min(pcp->count, batch);
> - if (to_drain > 0) {
> + if (to_drain > 0)
> free_pcppages_bulk(zone, to_drain, pcp);
> - pcp->count -= to_drain;
> - }
> local_irq_restore(flags);
> }
> #endif
> @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> pset = per_cpu_ptr(zone->pageset, cpu);
>
> pcp = &pset->pcp;
> - if (pcp->count) {
> + if (pcp->count)
> free_pcppages_bulk(zone, pcp->count, pcp);
> - pcp->count = 0;
> - }
> local_irq_restore(flags);
> }
>
> @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> if (pcp->count >= pcp->high) {
> unsigned long batch = READ_ONCE(pcp->batch);
> free_pcppages_bulk(zone, batch, pcp);
> - pcp->count -= batch;
> }
> }
>
>


2018-03-12 14:24:31

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On 03/01/2018 07:28 AM, Aaron Lu wrote:
> When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> the zone->lock is held and then pages are chosen from PCP's migratetype
> list. While there is actually no need to do this 'choose part' under
> lock since it's PCP pages, the only CPU that can touch them is us and
> irq is also disabled.
>
> Moving this part outside could reduce lock held time and improve
> performance. Test with will-it-scale/page_fault1 full load:
>
> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> v4.16-rc2+ 9034215 7971818 13667135 15677465
> this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>
> What the test does is: starts $nr_cpu processes and each will repeatedly
> do the following for 5 minutes:
> 1 mmap 128M anonymouse space;
> 2 write access to that space;
> 3 munmap.
> The score is the aggregated iteration.
>
> https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
>
> Acked-by: Mel Gorman <[email protected]>
> Signed-off-by: Aaron Lu <[email protected]>
> ---
> mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
> 1 file changed, 23 insertions(+), 16 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index faa33eac1635..dafdcdec9c1f 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> int migratetype = 0;
> int batch_free = 0;
> bool isolated_pageblocks;
> -
> - spin_lock(&zone->lock);
> - isolated_pageblocks = has_isolate_pageblock(zone);
> + struct page *page, *tmp;
> + LIST_HEAD(head);
>
> while (count) {
> - struct page *page;
> struct list_head *list;
>
> /*
> @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> batch_free = count;
>
> do {
> - int mt; /* migratetype of the to-be-freed page */
> -
> page = list_last_entry(list, struct page, lru);
> - /* must delete as __free_one_page list manipulates */
> + /* must delete to avoid corrupting pcp list */
> list_del(&page->lru);

Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you
could maybe use list_move_tail() instead of list_del() +
list_add_tail()? That avoids temporarily writing poison values.

Hm actually, you are reversing the list in the process, because page is
obtained by list_last_entry and you use list_add_tail. That could have
unintended performance consequences?

Also maybe list_cut_position() could be faster than shuffling pages one
by one? I guess not really, because batch_free will be generally low?

> pcp->count--;
>
> - mt = get_pcppage_migratetype(page);
> - /* MIGRATE_ISOLATE page should not go to pcplists */
> - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> - /* Pageblock could have been isolated meanwhile */
> - if (unlikely(isolated_pageblocks))
> - mt = get_pageblock_migratetype(page);
> -
> if (bulkfree_pcp_prepare(page))
> continue;
>
> - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> - trace_mm_page_pcpu_drain(page, 0, mt);
> + list_add_tail(&page->lru, &head);
> } while (--count && --batch_free && !list_empty(list));
> }
> +
> + spin_lock(&zone->lock);
> + isolated_pageblocks = has_isolate_pageblock(zone);
> +
> + /*
> + * Use safe version since after __free_one_page(),
> + * page->lru.next will not point to original list.
> + */
> + list_for_each_entry_safe(page, tmp, &head, lru) {
> + int mt = get_pcppage_migratetype(page);
> + /* MIGRATE_ISOLATE page should not go to pcplists */
> + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> + /* Pageblock could have been isolated meanwhile */
> + if (unlikely(isolated_pageblocks))
> + mt = get_pageblock_migratetype(page);
> +
> + __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> + trace_mm_page_pcpu_drain(page, 0, mt);
> + }
> spin_unlock(&zone->lock);
> }
>
>


2018-03-12 15:07:44

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/09/2018 10:58 PM, Andrew Morton wrote:
>>
>> When a page is freed back to the global pool, its buddy will be checked
>> to see if it's possible to do a merge. This requires accessing buddy's
>> page structure and that access could take a long time if it's cache cold.
>>
>> This patch adds a prefetch to the to-be-freed page's buddy outside of
>> zone->lock in hope of accessing buddy's page structure later under
>> zone->lock will be faster. Since we *always* do buddy merging and check
>> an order-0 page's buddy to try to merge it when it goes into the main
>> allocator, the cacheline will always come in, i.e. the prefetched data
>> will never be unused.
>>
>> Normally, the number of to-be-freed pages(i.e. count) equals to
>> pcp->batch (default=31 and has an upper limit of (PAGE_SHIFT * 8)=96 on
>> x86_64) but in the case of pcp's pages getting all drained, it will be
>> pcp->count which has an upper limit of pcp->high. pcp->high, although
>> has a default value of 186 (pcp->batch=31 * 6), can be changed by user
>> through /proc/sys/vm/percpu_pagelist_fraction and there is no software
>> upper limit so could be large, like several thousand. For this reason,
>> only the last pcp->batch number of page's buddy structure is prefetched
>> to avoid excessive prefetching. pcp-batch is used because:
>> 1 most often, count == pcp->batch;
>> 2 it has an upper limit itself so we won't prefetch excessively.
>>
>> Considering the possible large value of pcp->high, it also makes
>> sense to free the last added page first for cache hot's reason.
>> That's where the change of list_add_tail() to list_add() comes in
>> as we will free them from head to tail one by one.
>>
>> In the meantime, there are two concerns:
>> 1 the prefetch could potentially evict existing cachelines, especially
>> for L1D cache since it is not huge;
>> 2 there is some additional instruction overhead, namely calculating
>> buddy pfn twice.
>>
>> For 1, it's hard to say, this microbenchmark though shows good result but
>> the actual benefit of this patch will be workload/CPU dependant;
>> For 2, since the calculation is a XOR on two local variables, it's expected
>> in many cases that cycles spent will be offset by reduced memory latency
>> later. This is especially true for NUMA machines where multiple CPUs are
>> contending on zone->lock and the most time consuming part under zone->lock
>> is the wait of 'struct page' cacheline of the to-be-freed pages and their
>> buddies.
>>
>> Test with will-it-scale/page_fault1 full load:
>>
>> kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
>> v4.16-rc2+ 9034215 7971818 13667135 15677465
>> patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
>> this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
>> Note: this patch's performance improvement percent is against patch2/3.
>>
>> (Changelog stolen from Dave Hansen and Mel Gorman's comments at
>> http://lkml.kernel.org/r/[email protected])
>>
>> Link: http://lkml.kernel.org/r/[email protected]
>>
>> ...
>>
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1141,6 +1141,9 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>> batch_free = count;
>>
>> do {
>> + unsigned long pfn, buddy_pfn;
>> + struct page *buddy;
>> +
>> page = list_last_entry(list, struct page, lru);
>> /* must delete to avoid corrupting pcp list */
>> list_del(&page->lru);
>> @@ -1149,7 +1152,23 @@ static void free_pcppages_bulk(struct zone *zone, int count,
>> if (bulkfree_pcp_prepare(page))
>> continue;
>>
>> - list_add_tail(&page->lru, &head);
>> + list_add(&page->lru, &head);
>
> The result here will be that free_pcppages_bulk() frees the pages in
> the reverse order?

I actually think it restores the order, compared to the previous version
(see my earlier reply).

> I don't immediately see a downside to that. In the (distant) past we
> had issues when successive alloc_page() calls would return pages in
> descending address order - that totally screwed up scatter-gather page
> merging. But this is the page-freeing path. Still, something to be
> thought about and monitored.
>
>> +
>> + /*
>> + * We are going to put the page back to the global
>> + * pool, prefetch its buddy to speed up later access
>> + * under zone->lock. It is believed the overhead of
>> + * an additional test and calculating buddy_pfn here
>> + * can be offset by reduced memory latency later. To
>> + * avoid excessive prefetching due to large count, only
>> + * prefetch buddy for the last pcp->batch nr of pages.
>> + */
>> + if (count > pcp->batch)
>> + continue;

You could also go to the locked part after pcp->batch pages and then
return back, but maybe let's not complicate it further for corner cases :)

>> + pfn = page_to_pfn(page);
>> + buddy_pfn = __find_buddy_pfn(pfn, 0);
>> + buddy = page + (buddy_pfn - pfn);
>> + prefetch(buddy);
>> } while (--count && --batch_free && !list_empty(list));
>
> This loop hurts my brain, mainly the handling of `count':
>
> while (count) {
> do {
> batch_free++;
> } while (list_empty(list));
>
> /* This is the only non-empty list. Free them all. */
> if (batch_free == MIGRATE_PCPTYPES)
> batch_free = count;
>
> do {
> } while (--count && --batch_free && !list_empty(list));
> }
>
> I guess it kinda makes sense - both loops terminate on count==0. But
> still. Can it be clarified?

Yeah this is rather far from straightforward :(


2018-03-12 17:33:57

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/09/2018 12:24 AM, Aaron Lu wrote:
> + /*
> + * We are going to put the page back to the global
> + * pool, prefetch its buddy to speed up later access
> + * under zone->lock. It is believed the overhead of
> + * an additional test and calculating buddy_pfn here
> + * can be offset by reduced memory latency later. To
> + * avoid excessive prefetching due to large count, only
> + * prefetch buddy for the last pcp->batch nr of pages.
> + */
> + if (count > pcp->batch)
> + continue;
> + pfn = page_to_pfn(page);
> + buddy_pfn = __find_buddy_pfn(pfn, 0);
> + buddy = page + (buddy_pfn - pfn);
> + prefetch(buddy);

FWIW, I think this needs to go into a helper function. Is that possible?

There's too much logic happening here. Also, 'count' going from
batch_size->0 is totally non-obvious from the patch context. It makes
this hunk look totally wrong by itself.

2018-03-13 02:11:45

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 1/3] mm/free_pcppages_bulk: update pcp->count inside

On Mon, Mar 12, 2018 at 02:22:28PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 07:28 AM, Aaron Lu wrote:
> > Matthew Wilcox found that all callers of free_pcppages_bulk() currently
> > update pcp->count immediately after so it's natural to do it inside
> > free_pcppages_bulk().
> >
> > No functionality or performance change is expected from this patch.
>
> Well, it's N decrements instead of one decrement by N / assignment of
> zero. But I assume the difference is negligible anyway, right?

Yes.

>
> > Suggested-by: Matthew Wilcox <[email protected]>
> > Signed-off-by: Aaron Lu <[email protected]>
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks!

> > ---
> > mm/page_alloc.c | 10 +++-------
> > 1 file changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index cb416723538f..faa33eac1635 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1148,6 +1148,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > page = list_last_entry(list, struct page, lru);
> > /* must delete as __free_one_page list manipulates */
> > list_del(&page->lru);
> > + pcp->count--;
> >
> > mt = get_pcppage_migratetype(page);
> > /* MIGRATE_ISOLATE page should not go to pcplists */
> > @@ -2416,10 +2417,8 @@ void drain_zone_pages(struct zone *zone, struct per_cpu_pages *pcp)
> > local_irq_save(flags);
> > batch = READ_ONCE(pcp->batch);
> > to_drain = min(pcp->count, batch);
> > - if (to_drain > 0) {
> > + if (to_drain > 0)
> > free_pcppages_bulk(zone, to_drain, pcp);
> > - pcp->count -= to_drain;
> > - }
> > local_irq_restore(flags);
> > }
> > #endif
> > @@ -2441,10 +2440,8 @@ static void drain_pages_zone(unsigned int cpu, struct zone *zone)
> > pset = per_cpu_ptr(zone->pageset, cpu);
> >
> > pcp = &pset->pcp;
> > - if (pcp->count) {
> > + if (pcp->count)
> > free_pcppages_bulk(zone, pcp->count, pcp);
> > - pcp->count = 0;
> > - }
> > local_irq_restore(flags);
> > }
> >
> > @@ -2668,7 +2665,6 @@ static void free_unref_page_commit(struct page *page, unsigned long pfn)
> > if (pcp->count >= pcp->high) {
> > unsigned long batch = READ_ONCE(pcp->batch);
> > free_pcppages_bulk(zone, batch, pcp);
> > - pcp->count -= batch;
> > }
> > }
> >
> >
>

2018-03-13 03:35:40

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Mon, Mar 12, 2018 at 03:22:53PM +0100, Vlastimil Babka wrote:
> On 03/01/2018 07:28 AM, Aaron Lu wrote:
> > When freeing a batch of pages from Per-CPU-Pages(PCP) back to buddy,
> > the zone->lock is held and then pages are chosen from PCP's migratetype
> > list. While there is actually no need to do this 'choose part' under
> > lock since it's PCP pages, the only CPU that can touch them is us and
> > irq is also disabled.
> >
> > Moving this part outside could reduce lock held time and improve
> > performance. Test with will-it-scale/page_fault1 full load:
> >
> > kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
> > v4.16-rc2+ 9034215 7971818 13667135 15677465
> > this patch 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
> >
> > What the test does is: starts $nr_cpu processes and each will repeatedly
> > do the following for 5 minutes:
> > 1 mmap 128M anonymouse space;
> > 2 write access to that space;
> > 3 munmap.
> > The score is the aggregated iteration.
> >
> > https://github.com/antonblanchard/will-it-scale/blob/master/tests/page_fault1.c
> >
> > Acked-by: Mel Gorman <[email protected]>
> > Signed-off-by: Aaron Lu <[email protected]>
> > ---
> > mm/page_alloc.c | 39 +++++++++++++++++++++++----------------
> > 1 file changed, 23 insertions(+), 16 deletions(-)
> >
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index faa33eac1635..dafdcdec9c1f 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1116,12 +1116,10 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > int migratetype = 0;
> > int batch_free = 0;
> > bool isolated_pageblocks;
> > -
> > - spin_lock(&zone->lock);
> > - isolated_pageblocks = has_isolate_pageblock(zone);
> > + struct page *page, *tmp;
> > + LIST_HEAD(head);
> >
> > while (count) {
> > - struct page *page;
> > struct list_head *list;
> >
> > /*
> > @@ -1143,27 +1141,36 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> > batch_free = count;
> >
> > do {
> > - int mt; /* migratetype of the to-be-freed page */
> > -
> > page = list_last_entry(list, struct page, lru);
> > - /* must delete as __free_one_page list manipulates */
> > + /* must delete to avoid corrupting pcp list */
> > list_del(&page->lru);
>
> Well, since bulkfree_pcp_prepare() doesn't care about page->lru, you
> could maybe use list_move_tail() instead of list_del() +
> list_add_tail()? That avoids temporarily writing poison values.

Good point, except that bulkfree_pcp_prepare() could return an error, and
then the page would need to be removed from the to-be-freed list, like this:

do {
page = list_last_entry(list, struct page, lru);
list_move_tail(&page->lru, &head);
pcp->count--;

if (bulkfree_pcp_prepare(page))
list_del(&page->lru);
} while (--count && --batch_free && !list_empty(list));

Considering that bulkfree_pcp_prepare() returning an error is the rare
case, this list_del() should rarely happen. At the same time, this part is
outside of zone->lock and can hardly impact performance... so I'm not
sure.

> Hm actually, you are reversing the list in the process, because page is
> obtained by list_last_entry and you use list_add_tail. That could have
> unintended performance consequences?

True, the order is changed while these to-be-freed pages sit on the
temporary list, but they are then iterated and freed one by one from
head to tail, so the order in which they land in the free_list is the
same as before the patch (and the same as their order in the pcp list).

>
> Also maybe list_cut_position() could be faster than shuffling pages one
> by one? I guess not really, because batch_free will be generally low?

We would need to know where to cut if list_cut_position() is to be used,
and to find that out the list would need to be iterated first. I guess
that's too much trouble.

Since this part of the code is per-cpu (outside of zone->lock) and these
pages are in the pcp (meaning their cachelines are unlikely to be remote),
I didn't worry too much about iterating instead of using
list_cut_position(). On the allocation side though, when manipulating the
global free_list under zone->lock, this is a big problem since pages there
are freed from different CPUs and the cache could be cold for the
allocating CPU. That is why I proposed clustered allocation some time ago
as an RFC patch, where list_cut_position() is so good that it could
eliminate the cacheline miss issue since we do not need to iterate cold
pages one by one.

I wish there were a data structure that has the flexibility of a list
while also letting us locate the Nth element without having to iterate.
That's what I was looking for when developing clustered allocation for
order-0 pages. In the end, I had to use another place to record where
the Nth element is. I hope to send out v2 of that RFC series soon but
I'm still collecting data for it. I would appreciate it if people could
take a look then :-)

batch_free's value depends on what the system is doing. When a user
application is making use of memory, the common case is that only the
MIGRATE_MOVABLE migratetype has pages to free, so batch_free will be 1
in the first round and (pcp->batch - 1) in the 2nd round.

Here is some data I collected recently on how often only the
MIGRATE_MOVABLE list has pages to free in free_pcppages_bulk():

On my desktop, after boot:

free_pcppages_bulk: 6268
single_mt_movable: 2566 (41%)

free_pcppages_bulk is the number of times this function gets called.
single_mt_movable is the number of times only the MIGRATE_MOVABLE list
has pages to free.

After kbuild with a distro kconfig:

free_pcppages_bulk: 9100508
single_mt_movable: 8435483 (92.75%)

If we change the initial value of migratetype in free_pcppages_bulk()
from 0 (MIGRATE_UNMOVABLE) to 1 (MIGRATE_MOVABLE), then batch_free will be
pcp->batch in the 1st round and we can save something, but the saving is
negligible when running a workload, so I didn't send a patch for it yet.

> > pcp->count--;
> >
> > - mt = get_pcppage_migratetype(page);
> > - /* MIGRATE_ISOLATE page should not go to pcplists */
> > - VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > - /* Pageblock could have been isolated meanwhile */
> > - if (unlikely(isolated_pageblocks))
> > - mt = get_pageblock_migratetype(page);
> > -
> > if (bulkfree_pcp_prepare(page))
> > continue;
> >
> > - __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > - trace_mm_page_pcpu_drain(page, 0, mt);
> > + list_add_tail(&page->lru, &head);
> > } while (--count && --batch_free && !list_empty(list));
> > }
> > +
> > + spin_lock(&zone->lock);
> > + isolated_pageblocks = has_isolate_pageblock(zone);
> > +
> > + /*
> > + * Use safe version since after __free_one_page(),
> > + * page->lru.next will not point to original list.
> > + */
> > + list_for_each_entry_safe(page, tmp, &head, lru) {
> > + int mt = get_pcppage_migratetype(page);
> > + /* MIGRATE_ISOLATE page should not go to pcplists */
> > + VM_BUG_ON_PAGE(is_migrate_isolate(mt), page);
> > + /* Pageblock could have been isolated meanwhile */
> > + if (unlikely(isolated_pageblocks))
> > + mt = get_pageblock_migratetype(page);
> > +
> > + __free_one_page(page, page_to_pfn(page), zone, 0, mt);
> > + trace_mm_page_pcpu_drain(page, 0, mt);
> > + }
> > spin_unlock(&zone->lock);
> > }
> >
> >
>

2018-03-13 03:38:46

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote:
> On 03/09/2018 12:24 AM, Aaron Lu wrote:
> > + /*
> > + * We are going to put the page back to the global
> > + * pool, prefetch its buddy to speed up later access
> > + * under zone->lock. It is believed the overhead of
> > + * an additional test and calculating buddy_pfn here
> > + * can be offset by reduced memory latency later. To
> > + * avoid excessive prefetching due to large count, only
> > + * prefetch buddy for the last pcp->batch nr of pages.
> > + */
> > + if (count > pcp->batch)
> > + continue;
> > + pfn = page_to_pfn(page);
> > + buddy_pfn = __find_buddy_pfn(pfn, 0);
> > + buddy = page + (buddy_pfn - pfn);
> > + prefetch(buddy);
>
> FWIW, I think this needs to go into a helper function. Is that possible?

I'll give it a try.

>
> There's too much logic happening here. Also, 'count' going from
> batch_size->0 is totally non-obvious from the patch context. It makes
> this hunk look totally wrong by itself.

2018-03-13 07:04:52

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Tue, Mar 13, 2018 at 11:35:19AM +0800, Aaron Lu wrote:
> On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote:
> > On 03/09/2018 12:24 AM, Aaron Lu wrote:
> > > + /*
> > > + * We are going to put the page back to the global
> > > + * pool, prefetch its buddy to speed up later access
> > > + * under zone->lock. It is believed the overhead of
> > > + * an additional test and calculating buddy_pfn here
> > > + * can be offset by reduced memory latency later. To
> > > + * avoid excessive prefetching due to large count, only
> > > + * prefetch buddy for the last pcp->batch nr of pages.
> > > + */
> > > + if (count > pcp->batch)
> > > + continue;
> > > + pfn = page_to_pfn(page);
> > > + buddy_pfn = __find_buddy_pfn(pfn, 0);
> > > + buddy = page + (buddy_pfn - pfn);
> > > + prefetch(buddy);
> >
> > FWIW, I think this needs to go into a helper function. Is that possible?
>
> I'll give it a try.
>
> >
> > There's too much logic happening here. Also, 'count' going from
> > batch_size->0 is totally non-obvious from the patch context. It makes
> > this hunk look totally wrong by itself.

I tried to avoid adding one more local variable, but it looks like that
caused a lot of pain. What about the following? It doesn't use count any
more; instead, prefetch_nr indicates how many prefetches have happened.

Also, I think it's not worth the risk of disordering pages in the
free_list by changing list_add_tail() to list_add(), as Andrew pointed
out, so I dropped that change too.


diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dafdcdec9c1f..00ea4483f679 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page)
}
#endif /* CONFIG_DEBUG_VM */

+static inline void prefetch_buddy(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
+ struct page *buddy = page + (buddy_pfn - pfn);
+
+ prefetch(buddy);
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
{
int migratetype = 0;
int batch_free = 0;
+ int prefetch_nr = 0;
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
continue;

list_add_tail(&page->lru, &head);
+
+ /*
+ * We are going to put the page back to the global
+ * pool, prefetch its buddy to speed up later access
+ * under zone->lock. It is believed the overhead of
+ * an additional test and calculating buddy_pfn here
+ * can be offset by reduced memory latency later. To
+ * avoid excessive prefetching due to large count, only
+ * prefetch buddy for the first pcp->batch nr of pages.
+ */
+ if (prefetch_nr++ < pcp->batch)
+ prefetch_buddy(page);
} while (--count && --batch_free && !list_empty(list));
}

--
2.14.3


2018-03-20 09:53:59

by Vlastimil Babka

[permalink] [raw]
Subject: Re: [PATCH v4 3/3 update] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On 03/13/2018 08:04 AM, Aaron Lu wrote:
> On Tue, Mar 13, 2018 at 11:35:19AM +0800, Aaron Lu wrote:
>> On Mon, Mar 12, 2018 at 10:32:32AM -0700, Dave Hansen wrote:
>>> On 03/09/2018 12:24 AM, Aaron Lu wrote:
>>>> + /*
>>>> + * We are going to put the page back to the global
>>>> + * pool, prefetch its buddy to speed up later access
>>>> + * under zone->lock. It is believed the overhead of
>>>> + * an additional test and calculating buddy_pfn here
>>>> + * can be offset by reduced memory latency later. To
>>>> + * avoid excessive prefetching due to large count, only
>>>> + * prefetch buddy for the last pcp->batch nr of pages.
>>>> + */
>>>> + if (count > pcp->batch)
>>>> + continue;
>>>> + pfn = page_to_pfn(page);
>>>> + buddy_pfn = __find_buddy_pfn(pfn, 0);
>>>> + buddy = page + (buddy_pfn - pfn);
>>>> + prefetch(buddy);
>>>
>>> FWIW, I think this needs to go into a helper function. Is that possible?
>>
>> I'll give it a try.
>>
>>>
>>> There's too much logic happening here. Also, 'count' going from
>>> batch_size->0 is totally non-obvious from the patch context. It makes
>>> this hunk look totally wrong by itself.
>
> I tried to avoid adding one more local variable but looks like it caused
> a lot of pain. What about the following? It doesn't use count any more
> but prefetch_nr to indicate how many prefetches have happened.
>
> Also, I think it's not worth the risk of disordering pages in free_list
> by changing list_add_tail() to list_add() as Andrew reminded so I
> dropped that change too.

Looks fine, you can add

Acked-by: Vlastimil Babka <[email protected]>

>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index dafdcdec9c1f..00ea4483f679 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page)
> }
> #endif /* CONFIG_DEBUG_VM */
>
> +static inline void prefetch_buddy(struct page *page)
> +{
> + unsigned long pfn = page_to_pfn(page);
> + unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
> + struct page *buddy = page + (buddy_pfn - pfn);
> +
> + prefetch(buddy);
> +}
> +
> /*
> * Frees a number of pages from the PCP lists
> * Assumes all pages on list are in same zone, and of same order.
> @@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> {
> int migratetype = 0;
> int batch_free = 0;
> + int prefetch_nr = 0;
> bool isolated_pageblocks;
> struct page *page, *tmp;
> LIST_HEAD(head);
> @@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
> continue;
>
> list_add_tail(&page->lru, &head);
> +
> + /*
> + * We are going to put the page back to the global
> + * pool, prefetch its buddy to speed up later access
> + * under zone->lock. It is believed the overhead of
> + * an additional test and calculating buddy_pfn here
> + * can be offset by reduced memory latency later. To
> + * avoid excessive prefetching due to large count, only
> + * prefetch buddy for the first pcp->batch nr of pages.
> + */
> + if (prefetch_nr++ < pcp->batch)
> + prefetch_buddy(page);
> } while (--count && --batch_free && !list_empty(list));
> }
>
>


2018-03-20 11:32:07

by Aaron Lu

[permalink] [raw]
Subject: [PATCH v4 3/3 update2] mm/free_pcppages_bulk: prefetch buddy while not holding lock

On Tue, Mar 20, 2018 at 10:50:18AM +0100, Vlastimil Babka wrote:
> On 03/13/2018 08:04 AM, Aaron Lu wrote:
> > I tried to avoid adding one more local variable but looks like it caused
> > a lot of pain. What about the following? It doesn't use count any more
> > but prefetch_nr to indicate how many prefetches have happened.
> >
> > Also, I think it's not worth the risk of disordering pages in free_list
> > by changing list_add_tail() to list_add() as Andrew reminded so I
> > dropped that change too.
>
> Looks fine, you can add
>
> Acked-by: Vlastimil Babka <[email protected]>

Thanks, here is the updated patch.

From: Aaron Lu <[email protected]>
Date: Thu, 18 Jan 2018 11:19:59 +0800
Subject: [PATCH v4 3/3 update2] mm/free_pcppages_bulk: prefetch buddy while not holding lock

When a page is freed back to the global pool, its buddy will be checked
to see if it's possible to do a merge. This requires accessing buddy's
page structure and that access could take a long time if it's cache cold.

This patch adds a prefetch to the to-be-freed page's buddy outside of
zone->lock in hope of accessing buddy's page structure later under
zone->lock will be faster. Since we *always* do buddy merging and check
an order-0 page's buddy to try to merge it when it goes into the main
allocator, the cacheline will always come in, i.e. the prefetched data
will never be unused.

Normally, the number of prefetches will be pcp->batch (default=31, with
an upper limit of (PAGE_SHIFT * 8)=96 on x86_64), but when the pcp's
pages are all drained it will be pcp->count, which is capped by
pcp->high. pcp->high, although it has a default value of 186
(pcp->batch=31 * 6), can be changed by the user through
/proc/sys/vm/percpu_pagelist_fraction and has no software upper limit,
so it could be large, like several thousand. For this reason, only
the buddy structures of the first pcp->batch pages are prefetched to
avoid excessive prefetching.

In the meantime, there are two concerns:
1 the prefetch could potentially evict existing cachelines, especially
for L1D cache since it is not huge;
2 there is some additional instruction overhead, namely calculating
buddy pfn twice.

For 1, it's hard to say, this microbenchmark though shows good result but
the actual benefit of this patch will be workload/CPU dependant;
For 2, since the calculation is a XOR on two local variables, it's expected
in many cases that cycles spent will be offset by reduced memory latency
later. This is especially true for NUMA machines where multiple CPUs are
contending on zone->lock and the most time consuming part under zone->lock
is the wait of 'struct page' cacheline of the to-be-freed pages and their
buddies.

Test with will-it-scale/page_fault1 full load:

kernel Broadwell(2S) Skylake(2S) Broadwell(4S) Skylake(4S)
v4.16-rc2+ 9034215 7971818 13667135 15677465
patch2/3 9536374 +5.6% 8314710 +4.3% 14070408 +3.0% 16675866 +6.4%
this patch 10180856 +6.8% 8506369 +2.3% 14756865 +4.9% 17325324 +3.9%
Note: this patch's performance improvement percent is against patch2/3.

(Changelog stolen from Dave Hansen and Mel Gorman's comments at
http://lkml.kernel.org/r/[email protected])

Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Aaron Lu <[email protected]>
Suggested-by: Ying Huang <[email protected]>
Acked-by: Vlastimil Babka <[email protected]>
Reviewed-by: Andrew Morton <[email protected]>
---
update2:
Use a helper function to prefetch buddy as suggested by Dave Hansen.
Drop the change from list_add_tail() to list_add() to avoid disordering pages.

mm/page_alloc.c | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index dafdcdec9c1f..00ea4483f679 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1099,6 +1099,15 @@ static bool bulkfree_pcp_prepare(struct page *page)
}
#endif /* CONFIG_DEBUG_VM */

+static inline void prefetch_buddy(struct page *page)
+{
+ unsigned long pfn = page_to_pfn(page);
+ unsigned long buddy_pfn = __find_buddy_pfn(pfn, 0);
+ struct page *buddy = page + (buddy_pfn - pfn);
+
+ prefetch(buddy);
+}
+
/*
* Frees a number of pages from the PCP lists
* Assumes all pages on list are in same zone, and of same order.
@@ -1115,6 +1124,7 @@ static void free_pcppages_bulk(struct zone *zone, int count,
{
int migratetype = 0;
int batch_free = 0;
+ int prefetch_nr = 0;
bool isolated_pageblocks;
struct page *page, *tmp;
LIST_HEAD(head);
@@ -1150,6 +1160,18 @@ static void free_pcppages_bulk(struct zone *zone, int count,
continue;

list_add_tail(&page->lru, &head);
+
+ /*
+ * We are going to put the page back to the global
+ * pool, prefetch its buddy to speed up later access
+ * under zone->lock. It is believed the overhead of
+ * an additional test and calculating buddy_pfn here
+ * can be offset by reduced memory latency later. To
+ * avoid excessive prefetching due to large count, only
+ * prefetch buddy for the first pcp->batch nr of pages.
+ */
+ if (prefetch_nr++ < pcp->batch)
+ prefetch_buddy(page);
} while (--count && --batch_free && !list_empty(list));
}

--
2.14.3


2018-03-22 15:43:49

by Matthew Wilcox

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote:
> I wish there is a data structure that has the flexibility of list while
> at the same time we can locate the Nth element in the list without the
> need to iterate. That's what I'm looking for when developing clustered
> allocation for order 0 pages. In the end, I had to use another place to
> record where the Nth element is. I hope to send out v2 of that RFC
> series soon but I'm still collecting data for it. I would appreciate if
> people could take a look then :-)

Sorry, I missed this. There is such a data structure -- the IDR, or
possibly a bare radix tree, or we can build a better data structure on
top of the radix tree (I talked about one called the XQueue a while ago).

The IDR will automatically grow to whatever needed size, it stores
pointers, you can find out quickly where the last allocated index is,
you can remove from the middle of the array. Disadvantage is that it
requires memory allocation to store the array of pointers, *but* it
can always hold at least one entry. So if you have no memory, you can
always return the one element in your IDR to the free pool and allocate
from that page.
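
For reference, a minimal sketch of the IDR interface Matthew is referring
to, written against the 4.16-era API as I remember it (the helper names
stash_page()/pop_page() are made up for illustration):

#include <linux/idr.h>
#include <linux/gfp.h>
#include <linux/mm_types.h>

static DEFINE_IDR(page_stash);

/* Store a pointer at the lowest free index; returns the index or -errno. */
static int stash_page(struct page *page)
{
	return idr_alloc(&page_stash, page, 0, 0, GFP_KERNEL);
}

/* Look up the entry at 'id' and drop it from the IDR. */
static struct page *pop_page(int id)
{
	struct page *page = idr_find(&page_stash, id);

	if (page)
		idr_remove(&page_stash, id);
	return page;
}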

2018-03-26 03:06:58

by Aaron Lu

[permalink] [raw]
Subject: Re: [PATCH v4 2/3] mm/free_pcppages_bulk: do not hold lock when picking pages to free

On Thu, Mar 22, 2018 at 08:17:19AM -0700, Matthew Wilcox wrote:
> On Tue, Mar 13, 2018 at 11:34:53AM +0800, Aaron Lu wrote:
> > I wish there is a data structure that has the flexibility of list while
> > at the same time we can locate the Nth element in the list without the
> > need to iterate. That's what I'm looking for when developing clustered
> > allocation for order 0 pages. In the end, I had to use another place to
> > record where the Nth element is. I hope to send out v2 of that RFC
> > series soon but I'm still collecting data for it. I would appreciate if
> > people could take a look then :-)
>
> Sorry, I missed this. There is such a data structure -- the IDR, or
> possibly a bare radix tree, or we can build a better data structure on
> top of the radix tree (I talked about one called the XQueue a while ago).
>
> The IDR will automatically grow to whatever needed size, it stores
> pointers, you can find out quickly where the last allocated index is,
> you can remove from the middle of the array. Disadvantage is that it
> requires memory allocation to store the array of pointers, *but* it
> can always hold at least one entry. So if you have no memory, you can
> always return the one element in your IDR to the free pool and allocate
> from that page.

Thanks for the pointer, I will take a look later.
Currently I'm focusing on finding real workloads that have zone->lock
contention issues.