2023-02-17 12:05:17

by Alexander Halbuer

Subject: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

As proposed by Vlastimil Babka [1] this is an extension to the previous patch
"mm: reduce lock contention of pcp buffer refill". This version also moves the
is_migrate_cma() check from the critical section to the loop below.

The optimization has several advantages:
- Less time in critical section
- Batching update of NR_FREE_CMA_PAGES into a single call to
mod_zone_page_state()
- Utilizing cache locality of the related struct page when doing the cma check
is_migrate_cma() and the sanity check check_pcp_refill() in the same loop

However, this change only affects performance with CONFIG_CMA=y, which may
not be the common case. Another possibly negative effect is that the actual
update of NR_FREE_CMA_PAGES is delayed beyond the release of the zone lock,
resulting in a short window of inaccuracy between the counter and the actual
number of cma pages in the zone.

The tables below compare this patch with the initial one using a parallel
allocation benchmark. The kernel config used is based on the default Debian
config, but with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=n and CONFIG_CMA=y. The
benchmarks were performed with the default sanity checks enabled, as the
patch "mm, page_alloc: reduce page alloc/free sanity checks" [2] was not
applied to my test branch.
The times shown are average allocation times. The improvement is not
significant, but the general trend is visible.

Hopefully, with the sanity checks and CMA disabled, the compiler will be able
to optimize away the second loop entirely.
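
For reference, with CONFIG_CMA disabled, is_migrate_cma() is already a
compile-time constant (roughly, from include/linux/mmzone.h):

#ifdef CONFIG_CMA
#  define is_migrate_cma(migratetype) unlikely((migratetype) == MIGRATE_CMA)
#else
#  define is_migrate_cma(migratetype) false
#endif

so the cma_pages accounting in the second loop folds away; whether the loop
itself can disappear then only depends on what remains of check_pcp_refill().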

[1] https://lore.kernel.org/lkml/[email protected]/
[2] https://lore.kernel.org/linux-mm/[email protected]/

Normal Pages
+-------+---------+---------+---------+
| Cores | Patch 1 | Patch 2 | Patch 2 |
| | (ns) | (ns) | Diff |
+-------+---------+---------+---------+
| 1 | 132 | 129 | (-2.3%) |
| 2 | 147 | 145 | (-1.4%) |
| 3 | 148 | 147 | (-0.7%) |
| 4 | 175 | 173 | (-1.1%) |
| 6 | 263 | 255 | (-3.0%) |
| 8 | 347 | 337 | (-2.9%) |
| 10 | 432 | 421 | (-2.5%) |
| 12 | 516 | 505 | (-2.1%) |
| 14 | 604 | 590 | (-2.3%) |
| 16 | 695 | 680 | (-2.2%) |
| 20 | 869 | 844 | (-2.9%) |
| 24 | 1043 | 1015 | (-2.7%) |
+-------+---------+---------+---------+

Huge Pages
+-------+---------+---------+---------+
| Cores | Patch 1 | Patch 2 | Patch 2 |
| | (ns) | (ns) | Diff |
+-------+---------+---------+---------+
| 1 | 3177 | 3133 | (-1.4%) |
| 2 | 3486 | 3471 | (-0.4%) |
| 3 | 3644 | 3634 | (-0.3%) |
| 4 | 3669 | 3643 | (-0.7%) |
| 6 | 3587 | 3578 | (-0.3%) |
| 8 | 3635 | 3621 | (-0.4%) |
| 10 | 4015 | 3960 | (-1.4%) |
| 12 | 4681 | 4510 | (-3.7%) |
| 14 | 5398 | 5180 | (-4.0%) |
| 16 | 6239 | 5891 | (-5.6%) |
| 20 | 7864 | 7435 | (-5.5%) |
| 24 | 9011 | 8971 | (-0.4%) |
+-------+---------+---------+---------+

Reported-by: Vlastimil Babka <[email protected]>
Signed-off-by: Alexander Halbuer <[email protected]>
---
mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++++++------------
1 file changed, 36 insertions(+), 12 deletions(-)

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0745aedebb37..f82a59eeb4fe 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3119,17 +3119,17 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
{
unsigned long flags;
int i, allocated = 0;
+ int cma_pages = 0;
+ struct list_head *prev_tail = list->prev;
+ struct page *pos, *n;

spin_lock_irqsave(&zone->lock, flags);
for (i = 0; i < count; ++i) {
- struct page *page = __rmqueue(zone, order, migratetype,
- alloc_flags);
+ struct page *page =
+ __rmqueue(zone, order, migratetype, alloc_flags);
if (unlikely(page == NULL))
break;

- if (unlikely(check_pcp_refill(page, order)))
- continue;
-
/*
* Split buddy pages returned by expand() are received here in
* physical page order. The page is added to the tail of
@@ -3141,20 +3141,44 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
* pages are ordered properly.
*/
list_add_tail(&page->pcp_list, list);
- allocated++;
- if (is_migrate_cma(get_pcppage_migratetype(page)))
- __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
- -(1 << order));
}

/*
- * i pages were removed from the buddy list even if some leak due
- * to check_pcp_refill failing so adjust NR_FREE_PAGES based
- * on i. Do not confuse with 'allocated' which is the number of
+ * i pages were removed from the buddy list so adjust NR_FREE_PAGES
+ * based on i. Do not confuse with 'allocated' which is the number of
* pages added to the pcp list.
*/
__mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
+
spin_unlock_irqrestore(&zone->lock, flags);
+
+ /*
+ * Pages are appended to the pcp list without checking to reduce the
+ * time holding the zone lock. Checking the appended pages happens right
+ * after the critical section while still holding the pcp lock.
+ */
+ pos = list_first_entry(prev_tail, struct page, pcp_list);
+ list_for_each_entry_safe_from(pos, n, list, pcp_list) {
+ /*
+ * Count number of cma pages to batch update of
+ * NR_FREE_CMA_PAGES into a single function call.
+ */
+ if (is_migrate_cma(get_pcppage_migratetype(pos)))
+ cma_pages++;
+
+ if (unlikely(check_pcp_refill(pos, order))) {
+ list_del(&pos->pcp_list);
+ continue;
+ }
+
+ allocated++;
+ }
+
+ if (cma_pages > 0) {
+ mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
+ -(cma_pages << order));
+ }
+
return allocated;
}

--
2.39.2



2023-02-21 10:27:22

by Vlastimil Babka

Subject: Re: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

On 2/17/23 13:05, Alexander Halbuer wrote:
> As proposed by Vlastimil Babka [1] this is an extension to the previous patch
> "mm: reduce lock contention of pcp buffer refill". This version also moves the
> is_migrate_cma() check from the critical section to the loop below.

Hi, thanks for trying it out!

> The optimization has several advantages:
> - Less time in critical section
> - Batching update of NR_FREE_CMA_PAGES into a single call to
> mod_zone_page_state()
> - Utilizing cache locality of the related struct page when doing the cma check
> is_migrate_cma() and the sanity check check_pcp_refill() in the same loop

The last point probably doesn't apply, as we are already reading/modifying
page->lru in __rmqueue()/list_add_tail(), which brings the struct page into
the cache anyway.
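
For reference, get_pcppage_migratetype() is essentially just a field read
from that same struct page (roughly, from mm/page_alloc.c):

static inline int get_pcppage_migratetype(struct page *page)
{
	return page->index;
}

so there is no extra cache line to pull in for the cma check.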

>
> However, this change only affects performance with CONFIG_CMA=y, which may
> not be the common case. Another possibly negative effect is that the actual
> update of NR_FREE_CMA_PAGES is delayed beyond the release of the zone lock,
> resulting in a short window of inaccuracy between the counter and the actual
> number of cma pages in the zone.

Hmm, I didn't realize that; it might perhaps be a problem.

> The tables below compare this patch with the initial one using a parallel
> allocation benchmark. The kernel config used is based on the default Debian
> config, but with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=n and CONFIG_CMA=y. The
> benchmarks were performed with the default sanity checks enabled, as the
> patch "mm, page_alloc: reduce page alloc/free sanity checks" [2] was not
> applied to my test branch.
> The times shown are average allocation times. The improvement is not
> significant, but the general trend is visible.

Yeah there's some improvement, but if [2] is accepted, then keeping two
loops there just for the cma update (as there will be no more checking in
the second loop) will almost certainly stop being a win. And with the risk
of inaccuracy you pointed out, on top.
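
I.e. once the checks are gone, the second pass would boil down to just this
(untested sketch):

	/* nothing left to do here but counting */
	list_for_each_entry_safe_from(pos, n, list, pcp_list) {
		if (is_migrate_cma(get_pcppage_migratetype(pos)))
			cma_pages++;
		allocated++;
	}

and at that point counting cma_pages directly in the allocation loop, while
keeping the single batched mod_zone_page_state() call, looks like the simpler
option.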

Incidentally, did you observe any improvements by [2] with your test,
especially as the batch freeing side also no longer does checking under zone
lock?

Thanks!

> Hopefully, with the sanity checks and CMA disabled, the compiler will be
> able to optimize away the second loop entirely.
>
> [1] https://lore.kernel.org/lkml/[email protected]/
> [2] https://lore.kernel.org/linux-mm/[email protected]/
>
> Normal Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> | | (ns) | (ns) | Diff |
> +-------+---------+---------+---------+
> | 1 | 132 | 129 | (-2.3%) |
> | 2 | 147 | 145 | (-1.4%) |
> | 3 | 148 | 147 | (-0.7%) |
> | 4 | 175 | 173 | (-1.1%) |
> | 6 | 263 | 255 | (-3.0%) |
> | 8 | 347 | 337 | (-2.9%) |
> | 10 | 432 | 421 | (-2.5%) |
> | 12 | 516 | 505 | (-2.1%) |
> | 14 | 604 | 590 | (-2.3%) |
> | 16 | 695 | 680 | (-2.2%) |
> | 20 | 869 | 844 | (-2.9%) |
> | 24 | 1043 | 1015 | (-2.7%) |
> +-------+---------+---------+---------+
>
> Huge Pages
> +-------+---------+---------+---------+
> | Cores | Patch 1 | Patch 2 | Patch 2 |
> | | (ns) | (ns) | Diff |
> +-------+---------+---------+---------+
> | 1 | 3177 | 3133 | (-1.4%) |
> | 2 | 3486 | 3471 | (-0.4%) |
> | 3 | 3644 | 3634 | (-0.3%) |
> | 4 | 3669 | 3643 | (-0.7%) |
> | 6 | 3587 | 3578 | (-0.3%) |
> | 8 | 3635 | 3621 | (-0.4%) |
> | 10 | 4015 | 3960 | (-1.4%) |
> | 12 | 4681 | 4510 | (-3.7%) |
> | 14 | 5398 | 5180 | (-4.0%) |
> | 16 | 6239 | 5891 | (-5.6%) |
> | 20 | 7864 | 7435 | (-5.5%) |
> | 24 | 9011 | 8971 | (-0.4%) |
> +-------+---------+---------+---------+
>
> Reported-by: Vlastimil Babka <[email protected]>
> Signed-off-by: Alexander Halbuer <[email protected]>
> ---
> mm/page_alloc.c | 48 ++++++++++++++++++++++++++++++++++++------------
> 1 file changed, 36 insertions(+), 12 deletions(-)
>
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0745aedebb37..f82a59eeb4fe 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -3119,17 +3119,17 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> {
> unsigned long flags;
> int i, allocated = 0;
> + int cma_pages = 0;
> + struct list_head *prev_tail = list->prev;
> + struct page *pos, *n;
>
> spin_lock_irqsave(&zone->lock, flags);
> for (i = 0; i < count; ++i) {
> - struct page *page = __rmqueue(zone, order, migratetype,
> - alloc_flags);
> + struct page *page =
> + __rmqueue(zone, order, migratetype, alloc_flags);
> if (unlikely(page == NULL))
> break;
>
> - if (unlikely(check_pcp_refill(page, order)))
> - continue;
> -
> /*
> * Split buddy pages returned by expand() are received here in
> * physical page order. The page is added to the tail of
> @@ -3141,20 +3141,44 @@ static int rmqueue_bulk(struct zone *zone, unsigned int order,
> * pages are ordered properly.
> */
> list_add_tail(&page->pcp_list, list);
> - allocated++;
> - if (is_migrate_cma(get_pcppage_migratetype(page)))
> - __mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> - -(1 << order));
> }
>
> /*
> - * i pages were removed from the buddy list even if some leak due
> - * to check_pcp_refill failing so adjust NR_FREE_PAGES based
> - * on i. Do not confuse with 'allocated' which is the number of
> + * i pages were removed from the buddy list so adjust NR_FREE_PAGES
> + * based on i. Do not confuse with 'allocated' which is the number of
> * pages added to the pcp list.
> */
> __mod_zone_page_state(zone, NR_FREE_PAGES, -(i << order));
> +
> spin_unlock_irqrestore(&zone->lock, flags);
> +
> + /*
> + * Pages are appended to the pcp list without checking to reduce the
> + * time holding the zone lock. Checking the appended pages happens right
> + * after the critical section while still holding the pcp lock.
> + */
> + pos = list_first_entry(prev_tail, struct page, pcp_list);
> + list_for_each_entry_safe_from(pos, n, list, pcp_list) {
> + /*
> + * Count number of cma pages to batch update of
> + * NR_FREE_CMA_PAGES into a single function call.
> + */
> + if (is_migrate_cma(get_pcppage_migratetype(pos)))
> + cma_pages++;
> +
> + if (unlikely(check_pcp_refill(pos, order))) {
> + list_del(&pos->pcp_list);
> + continue;
> + }
> +
> + allocated++;
> + }
> +
> + if (cma_pages > 0) {
> + mod_zone_page_state(zone, NR_FREE_CMA_PAGES,
> + -(cma_pages << order));
> + }
> +
> return allocated;
> }
>


2023-02-23 21:48:31

by Andrew Morton

Subject: Re: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

On Tue, 21 Feb 2023 11:27:14 +0100 Vlastimil Babka <[email protected]> wrote:

>
> > The tables below compare this patch with the initial one using a parallel
> > allocation benchmark. The kernel config used is based on the default Debian
> > config, but with CONFIG_INIT_ON_ALLOC_DEFAULT_ON=n and CONFIG_CMA=y. The
> > benchmarks were performed with the default sanity checks enabled, as the
> > patch "mm, page_alloc: reduce page alloc/free sanity checks" [2] was not
> > applied to my test branch.
> > The times shown are average allocation times. The improvement is not
> > significant, but the general trend is visible.
>
> Yeah there's some improvement, but if [2] is accepted, then keeping two
> loops there just for the cma update (as there will be no more checking in
> the second loop) will almost certainly stop being a win. And with the risk
> of inaccuracy you pointed out, on top.
>
> ...
>
> > [2] https://lore.kernel.org/linux-mm/[email protected]/

I've just added "mm, page_alloc: reduce page alloc/free sanity checks"
to mm-unstable.

2023-03-03 12:53:09

by Alexander Halbuer

Subject: Re: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

On 2/21/23 11:27, Vlastimil Babka wrote:
> Incidentally, did you observe any improvements by [2] with your test,
> especially as the batch freeing side also no longer does checking under zone
> lock?
>
> Thanks!
>

I finally managed to repeat the benchmark to see the effects of
disabling alloc and free sanity checks by default ("mm, page_alloc:
reduce page alloc/free sanity checks"; results below).

Average huge page allocation latency drops drastically, by over 90%, in the
single-core case. However, I can't see any real improvement for the free
operation.

---

Measurement results for the batch allocation benchmark with different core
counts. Internally, the benchmark uses alloc_pages_node() for allocation (get)
and __free_pages() for freeing (put).
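
The core of the measurement is roughly the following simplified sketch (not
the actual benchmark code; NR_PAGES and the order are placeholders):

#include <linux/gfp.h>
#include <linux/ktime.h>
#include <linux/mm.h>
#include <linux/topology.h>

#define NR_PAGES 512	/* placeholder batch size */

/* One measurement round on one core; 'pages' is a per-core array. */
static void measure_round(unsigned int order, struct page **pages,
			  u64 *get_ns, u64 *put_ns)
{
	ktime_t t0;
	int i;

	t0 = ktime_get();
	for (i = 0; i < NR_PAGES; i++)
		pages[i] = alloc_pages_node(numa_node_id(), GFP_KERNEL, order);
	*get_ns = ktime_to_ns(ktime_sub(ktime_get(), t0)) / NR_PAGES;

	t0 = ktime_get();
	for (i = 0; i < NR_PAGES; i++)
		if (pages[i])
			__free_pages(pages[i], order);
	*put_ns = ktime_to_ns(ktime_sub(ktime_get(), t0)) / NR_PAGES;
}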

Compared kernel versions are from the mm-unstable branch:
- Reference version (without the mentioned patch):
daf4bcbf2b72 ("mm: cma: make kobj_type structure constant")
- Patched version:
60114678f165 ("mm-page_alloc-reduce-page-alloc-free-sanity-checks-fix")

Normal Pages
+-------+------+-------+---------+------+-------+---------+
| cores | base | patch | diff | base | patch | diff |
| | get | get | get | put | put | put |
| | (ns) | (ns) | | (ns) | (ns) | |
+-------+------+-------+---------+------+-------+---------+
| 1 | 122 | 118 | (-3.3%) | 118 | 116 | (-1.7%) |
| 2 | 133 | 130 | (-2.3%) | 130 | 123 | (-5.4%) |
| 3 | 136 | 132 | (-2.9%) | 175 | 162 | (-7.4%) |
| 4 | 161 | 149 | (-7.5%) | 241 | 226 | (-6.2%) |
| 6 | 247 | 228 | (-7.7%) | 366 | 344 | (-6.0%) |
| 8 | 331 | 304 | (-8.2%) | 484 | 456 | (-5.8%) |
| 10 | 416 | 390 | (-6.2%) | 615 | 578 | (-6.0%) |
| 12 | 502 | 472 | (-6.0%) | 727 | 687 | (-5.5%) |
| 14 | 584 | 552 | (-5.5%) | 862 | 816 | (-5.3%) |
| 16 | 669 | 632 | (-5.5%) | 967 | 923 | (-4.6%) |
| 20 | 833 | 787 | (-5.5%) | 1232 | 1164 | (-5.5%) |
| 24 | 999 | 944 | (-5.5%) | 1462 | 1384 | (-5.3%) |
+-------+------+-------+---------+------+-------+---------+

Huge Pages
+-------+------+-------+----------+-------+-------+---------+
| cores | base | patch | diff | base | patch | diff |
| | get | get | get | put | put | put |
| | (ns) | (ns) | | (ns) | (ns) | |
+-------+------+-------+----------+-------+-------+---------+
| 1 | 3148 | 177 | (-94.4%) | 2946 | 2872 | (-2.5%) |
| 2 | 3404 | 596 | (-82.5%) | 3318 | 3306 | (-0.4%) |
| 3 | 3581 | 950 | (-73.5%) | 3401 | 3358 | (-1.3%) |
| 4 | 3651 | 1284 | (-64.8%) | 3562 | 3616 | (1.5%) |
| 6 | 3568 | 1929 | (-45.9%) | 4478 | 4564 | (1.9%) |
| 8 | 3605 | 2328 | (-35.4%) | 5658 | 5546 | (-2.0%) |
| 10 | 4093 | 2935 | (-28.3%) | 6758 | 6457 | (-4.5%) |
| 12 | 4778 | 3540 | (-25.9%) | 7698 | 7565 | (-1.7%) |
| 14 | 5565 | 4097 | (-26.4%) | 8748 | 8810 | (0.7%) |
| 16 | 6364 | 4725 | (-25.8%) | 9942 | 10103 | (1.6%) |
| 20 | 8014 | 5915 | (-26.2%) | 12509 | 12772 | (2.1%) |
| 24 | 8732 | 7138 | (-18.3%) | 15217 | 15433 | (1.4%) |
+-------+------+-------+----------+-------+-------+---------+

>>
>> [1] https://lore.kernel.org/lkml/[email protected]/
>> [2] https://lore.kernel.org/linux-mm/[email protected]/



2023-03-03 13:36:42

by Vlastimil Babka

Subject: Re: [PATCH] mm, page_alloc: batch cma update on pcp buffer refill

On 3/3/23 13:52, Alexander Halbuer wrote:
> On 2/21/23 11:27, Vlastimil Babka wrote:
>> Incidentally, did you observe any improvements by [2] with your test,
>> especially as the batch freeing side also no longer does checking under zone
>> lock?
>>
>> Thanks!
>>
>
> I finally managed to repeat the benchmark to see the effects of
> disabling alloc and free sanity checks by default ("mm, page_alloc:
> reduce page alloc/free sanity checks"; results below).

Thanks for the measurements!

> Average huge page allocation latency drops drastically, by over 90%, in the
> single-core case. However, I can't see any real improvement for the free
> operation.

Ah, I see now why huge page freeing didn't improve. It's because
bulkfree_pcp_prepare() was calling free_page_is_bad(page) thus checking just
the head page. So the tail pages were not actually checked. Just another
detail of how the checking wasn't thorough anyway.
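
A variant that covered the tail pages as well would have had to do something
like this (hypothetical sketch, no such helper exists):

/* check every page of an order-N block, not just the head page */
static bool free_block_is_bad(struct page *page, unsigned int order)
{
	unsigned int i;

	for (i = 0; i < (1U << order); i++) {
		if (free_page_is_bad(page + i))
			return true;
	}
	return false;
}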

> ---
>
> Measurement results for the batch allocation benchmark with different core
> counts. Internally, the benchmark uses alloc_pages_node() for allocation
> (get) and __free_pages() for freeing (put).
>
> Compared kernel versions are from the mm-unstable branch:
> - Reference version (without the mentioned patch):
> daf4bcbf2b72 ("mm: cma: make kobj_type structure constant")
> - Patched version:
> 60114678f165 ("mm-page_alloc-reduce-page-alloc-free-sanity-checks-fix")
>
> Normal Pages
> +-------+------+-------+---------+------+-------+---------+
> | cores | base | patch | diff | base | patch | diff |
> | | get | get | get | put | put | put |
> | | (ns) | (ns) | | (ns) | (ns) | |
> +-------+------+-------+---------+------+-------+---------+
> | 1 | 122 | 118 | (-3.3%) | 118 | 116 | (-1.7%) |
> | 2 | 133 | 130 | (-2.3%) | 130 | 123 | (-5.4%) |
> | 3 | 136 | 132 | (-2.9%) | 175 | 162 | (-7.4%) |
> | 4 | 161 | 149 | (-7.5%) | 241 | 226 | (-6.2%) |
> | 6 | 247 | 228 | (-7.7%) | 366 | 344 | (-6.0%) |
> | 8 | 331 | 304 | (-8.2%) | 484 | 456 | (-5.8%) |
> | 10 | 416 | 390 | (-6.2%) | 615 | 578 | (-6.0%) |
> | 12 | 502 | 472 | (-6.0%) | 727 | 687 | (-5.5%) |
> | 14 | 584 | 552 | (-5.5%) | 862 | 816 | (-5.3%) |
> | 16 | 669 | 632 | (-5.5%) | 967 | 923 | (-4.6%) |
> | 20 | 833 | 787 | (-5.5%) | 1232 | 1164 | (-5.5%) |
> | 24 | 999 | 944 | (-5.5%) | 1462 | 1384 | (-5.3%) |
> +-------+------+-------+---------+------+-------+---------+
>
> Huge Pages
> +-------+------+-------+----------+-------+-------+---------+
> | cores | base | patch | diff | base | patch | diff |
> | | get | get | get | put | put | put |
> | | (ns) | (ns) | | (ns) | (ns) | |
> +-------+------+-------+----------+-------+-------+---------+
> | 1 | 3148 | 177 | (-94.4%) | 2946 | 2872 | (-2.5%) |
> | 2 | 3404 | 596 | (-82.5%) | 3318 | 3306 | (-0.4%) |
> | 3 | 3581 | 950 | (-73.5%) | 3401 | 3358 | (-1.3%) |
> | 4 | 3651 | 1284 | (-64.8%) | 3562 | 3616 | (1.5%) |
> | 6 | 3568 | 1929 | (-45.9%) | 4478 | 4564 | (1.9%) |
> | 8 | 3605 | 2328 | (-35.4%) | 5658 | 5546 | (-2.0%) |
> | 10 | 4093 | 2935 | (-28.3%) | 6758 | 6457 | (-4.5%) |
> | 12 | 4778 | 3540 | (-25.9%) | 7698 | 7565 | (-1.7%) |
> | 14 | 5565 | 4097 | (-26.4%) | 8748 | 8810 | (0.7%) |
> | 16 | 6364 | 4725 | (-25.8%) | 9942 | 10103 | (1.6%) |
> | 20 | 8014 | 5915 | (-26.2%) | 12509 | 12772 | (2.1%) |
> | 24 | 8732 | 7138 | (-18.3%) | 15217 | 15433 | (1.4%) |
> +-------+------+-------+----------+-------+-------+---------+
>
>>>
>>> [1] https://lore.kernel.org/lkml/[email protected]/
>>> [2] https://lore.kernel.org/linux-mm/[email protected]/