2023-04-05 16:20:51

by Qi Zheng

Subject: [PATCH v2 1/2] mm: swap: fix performance regression on sparsetruncate-tiny

The ->percpu_pvec_drained flag was originally introduced by
commit d9ed0d08b6c6 ("mm: only drain per-cpu pagevecs once per
pagevec usage") to drain per-cpu pagevecs only once per pagevec
usage. But when the swap code was converted to be more
folio-based, commit c2bc16817aa0 ("mm/swap: add
folio_batch_move_lru()") broke this logic: ->percpu_pvec_drained
is now reset to false on every use, so per-cpu pagevecs are
drained multiple times per pagevec usage.
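
(For reference, the relevant difference between the two helpers, as a
simplified sketch of include/linux/pagevec.h -- details may differ
slightly from the exact tree this applies to:)

struct folio_batch {
	unsigned char nr;
	bool percpu_pvec_drained;
	struct folio *folios[PAGEVEC_SIZE];
};

/* Full init: also clears the drain marker, so the next user will
 * drain the per-cpu pagevecs again. */
static inline void folio_batch_init(struct folio_batch *fbatch)
{
	fbatch->nr = 0;
	fbatch->percpu_pvec_drained = false;
}

/* Only empties the batch; percpu_pvec_drained is left untouched. */
static inline void folio_batch_reinit(struct folio_batch *fbatch)
{
	fbatch->nr = 0;
}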

In theory, there should be no functional changes when converting
code to be more folio-based. We should call folio_batch_reinit()
in folio_batch_move_lru() instead of folio_batch_init(). And to
verify that we still need ->percpu_pvec_drained, I ran
mmtests/sparsetruncate-tiny and got the following data:

                            baseline                   with
                           baseline/                 patch/
Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)

The results above are similar to those reported when
->percpu_pvec_drained was introduced, so the flag is still needed.
Let's call folio_batch_reinit() in folio_batch_move_lru() to
restore the original logic.

Fixes: c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()")
Signed-off-by: Qi Zheng <[email protected]>
---
Changelog from v1 to v2:
- revise commit message and add test data

mm/swap.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/swap.c b/mm/swap.c
index 57cb01b042f6..423199ee8478 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -222,7 +222,7 @@ static void folio_batch_move_lru(struct folio_batch *fbatch, move_fn_t move_fn)
 	if (lruvec)
 		unlock_page_lruvec_irqrestore(lruvec, flags);
 	folios_put(fbatch->folios, folio_batch_count(fbatch));
-	folio_batch_init(fbatch);
+	folio_batch_reinit(fbatch);
 }

 static void folio_batch_add_and_move(struct folio_batch *fbatch,
--
2.20.1


2023-04-05 16:21:01

by Qi Zheng

Subject: [PATCH v2 2/2] mm: mlock: use folios_put() in mlock_folio_batch()

Since we have updated mlock to use folios, it's better
to call folios_put() instead of calling release_pages()
directly.
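
(For context, folios_put() is essentially a type-safe wrapper around
release_pages(); a rough sketch of the helper in include/linux/mm.h,
which may differ in detail from the exact tree this applies to:)

/* Sketch: drop the references on a batch of folios. */
static inline void folios_put(struct folio **folios, unsigned int nr)
{
	release_pages(folios, nr);
}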

Signed-off-by: Qi Zheng <[email protected]>
---
mm/mlock.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/mlock.c b/mm/mlock.c
index 617469fce96d..40b43f8740df 100644
--- a/mm/mlock.c
+++ b/mm/mlock.c
@@ -206,7 +206,7 @@ static void mlock_folio_batch(struct folio_batch *fbatch)

 	if (lruvec)
 		unlock_page_lruvec_irq(lruvec);
-	release_pages(fbatch->folios, fbatch->nr);
+	folios_put(fbatch->folios, folio_batch_count(fbatch));
 	folio_batch_reinit(fbatch);
 }

--
2.20.1

2023-04-05 16:48:01

by Matthew Wilcox

Subject: Re: [PATCH v2 1/2] mm: swap: fix performance regression on sparsetruncate-tiny

On Thu, Apr 06, 2023 at 12:18:53AM +0800, Qi Zheng wrote:
> The ->percpu_pvec_drained flag was originally introduced by
> commit d9ed0d08b6c6 ("mm: only drain per-cpu pagevecs once per
> pagevec usage") to drain per-cpu pagevecs only once per pagevec
> usage. But when the swap code was converted to be more
> folio-based, commit c2bc16817aa0 ("mm/swap: add
> folio_batch_move_lru()") broke this logic: ->percpu_pvec_drained
> is now reset to false on every use, so per-cpu pagevecs are
> drained multiple times per pagevec usage.

My mistake. I didn't realise that we'd need a folio_batch_reinit(),
and indeed we didn't have one until 811561288397 (January 2023).
I thought this usage of percpu_pvec_drained was going to be fine
with being set to false each time. Thanks for showing I was wrong.

> Fixes: c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()")
> Signed-off-by: Qi Zheng <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

2023-04-06 10:36:16

by Mel Gorman

Subject: Re: [PATCH v2 2/2] mm: mlock: use folios_put() in mlock_folio_batch()

On Thu, Apr 06, 2023 at 12:18:54AM +0800, Qi Zheng wrote:
> Since we have updated mlock to use folios, it's better
> to call folios_put() instead of calling release_pages()
> directly.
>
> Signed-off-by: Qi Zheng <[email protected]>

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs

2023-04-06 10:36:26

by Mel Gorman

Subject: Re: [PATCH v2 1/2] mm: swap: fix performance regression on sparsetruncate-tiny

On Thu, Apr 06, 2023 at 12:18:53AM +0800, Qi Zheng wrote:
> The ->percpu_pvec_drained flag was originally introduced by
> commit d9ed0d08b6c6 ("mm: only drain per-cpu pagevecs once per
> pagevec usage") to drain per-cpu pagevecs only once per pagevec
> usage. But when the swap code was converted to be more
> folio-based, commit c2bc16817aa0 ("mm/swap: add
> folio_batch_move_lru()") broke this logic: ->percpu_pvec_drained
> is now reset to false on every use, so per-cpu pagevecs are
> drained multiple times per pagevec usage.
>
> In theory, there should be no functional changes when converting
> code to be more folio-based. We should call folio_batch_reinit()
> in folio_batch_move_lru() instead of folio_batch_init(). And to
> verify that we still need ->percpu_pvec_drained, I ran
> mmtests/sparsetruncate-tiny and got the following data:
>
>                             baseline                   with
>                            baseline/                 patch/
> Min       Time      326.00 (   0.00%)      328.00 (  -0.61%)
> 1st-qrtle Time      334.00 (   0.00%)      336.00 (  -0.60%)
> 2nd-qrtle Time      338.00 (   0.00%)      341.00 (  -0.89%)
> 3rd-qrtle Time      343.00 (   0.00%)      347.00 (  -1.17%)
> Max-1     Time      326.00 (   0.00%)      328.00 (  -0.61%)
> Max-5     Time      327.00 (   0.00%)      330.00 (  -0.92%)
> Max-10    Time      328.00 (   0.00%)      331.00 (  -0.91%)
> Max-90    Time      350.00 (   0.00%)      357.00 (  -2.00%)
> Max-95    Time      395.00 (   0.00%)      390.00 (   1.27%)
> Max-99    Time      508.00 (   0.00%)      434.00 (  14.57%)
> Max       Time      547.00 (   0.00%)      476.00 (  12.98%)
> Amean     Time      344.61 (   0.00%)      345.56 *  -0.28%*
> Stddev    Time       30.34 (   0.00%)       19.51 (  35.69%)
> CoeffVar  Time        8.81 (   0.00%)        5.65 (  35.87%)
> BAmean-99 Time      342.38 (   0.00%)      344.27 (  -0.55%)
> BAmean-95 Time      338.58 (   0.00%)      341.87 (  -0.97%)
> BAmean-90 Time      336.89 (   0.00%)      340.26 (  -1.00%)
> BAmean-75 Time      335.18 (   0.00%)      338.40 (  -0.96%)
> BAmean-50 Time      332.54 (   0.00%)      335.42 (  -0.87%)
> BAmean-25 Time      329.30 (   0.00%)      332.00 (  -0.82%)
>
> The results above are similar to those reported when
> ->percpu_pvec_drained was introduced, so the flag is still needed.
> Let's call folio_batch_reinit() in folio_batch_move_lru() to
> restore the original logic.
>
> Fixes: c2bc16817aa0 ("mm/swap: add folio_batch_move_lru()")
> Signed-off-by: Qi Zheng <[email protected]>

Well spotted,

Acked-by: Mel Gorman <[email protected]>

--
Mel Gorman
SUSE Labs