2024-02-09 12:00:57

by Chengming Zhou

Subject: [PATCH RFC 0/1] mm/zswap: fix LRU reclaim for zswap writeback folios

From: Chengming Zhou <[email protected]>

The zswap memory shrinker is special: it first has to allocate more
memory (a folio) in order to free less memory (the compressed copy);
that folio can only be freed after it has been written back. So it's
better to evict these folios before trying to reclaim other folios.

This raises some problems for LRU management:

1. zswap_writeback_entry() reuses __read_swap_cache_async(), which is
normally only used in the swapin context. This may cause the refault
accounting and the active/workingset information of the folio to be wrong.

For example, the zswap shrinker writeback tries to reclaim the folio, but
workingset_refault() marks it active and puts it at the head of the active list.

2. folio_end_writeback() will check the PG_reclaim flag, which we set
in zswap_writeback_entry(), and try to rotate the folio to the tail
of the inactive list to speed up its reclaim.

But folio_rotate_reclaimable() doesn't work as we expected: in fact none
of the LRU move interfaces work when the folio is isolated from the LRU.
(Being in a per-cpu add batch is effectively the same as being isolated.)

So when folio_end_writeback() calls folio_rotate_reclaimable(), it does
nothing but clear the PG_reclaim flag if the folio is isolated (in a
per-cpu add batch or isolated by the vmscan shrinker).

3. So the final result is a folio that has been written back and is
expected to be evicted, but is not at the tail of the inactive list.

Meanwhile the vmscan shrinker may evict other folios instead, causing
more refaults. There is a report [1] of this problem.

We should handle these cases better. First of all, we should consider
when and where to put these folios on the LRU.

1. after alloc: we currently use folio_add_lru() to put the folio on a
local batch, so it will be put at the head of the inactive/active list
when the batch is drained.

2. after writeback: clear PG_reclaim and call folio_rotate_reclaimable().
- after the add batch has drained: rotates successfully to the tail of the inactive list
- before the add batch has drained: does nothing, since the folio is not on an LRU list

So these are the two main points in time we care about. The first is
arguably correct as-is, since the folio is under writeback and has
PG_reclaim set; putting such folios at the tail of the inactive list too
early may confuse the shrinker. If we really want to put them at the
tail, we could easily introduce a folio_add_lru_tail() that puts folios
on a new local lru_add_tail batch.

The second point is where we need to improve: we should rotate the folio
to the tail even when it is still in the local batch. Since we can't
manipulate a folio's LRU status while it is isolated in a local batch,
an obvious fix is to use a folio flag to tell the later lru_add_fn()
where the folio should be put: active or inactive, head or tail.

But the problem is that PG_readahead is an alias of PG_reclaim, and we
obviously can't put readahead folios at the tail of the inactive list.

So this patch changes folio_rotate_reclaimable() to queue the folio to
the local rotate batch even when !PG_lru at first, hoping that:

- if folio_end_writeback() runs on the same cpu as the lru_add batch,
the rotate batch must be handled after the lru_add batch; by then the
folio has PG_lru set and is successfully rotated to the tail of the
inactive list.

- even if folio_end_writeback() runs on a different cpu, there is a
good chance that the rotate batch is handled after the add batch.

Testing: kernel build in tmpfs with memory.max = 2GB.
(zswap shrinker and writeback enabled, with one 50GB swapfile)

                         mm-unstable-hot  zswap-lru-reclaim
real                               63.34              62.72
user                             1063.20            1060.30
sys                               272.04             256.14
workingset_refault_anon       2103297.00         1788155.80
workingset_refault_file         28638.20           39249.40
workingset_activate_anon       746134.00          695435.40
workingset_activate_file         4344.60            4255.80
workingset_restore_anon        653163.80          605315.60
workingset_restore_file          1079.00             883.00
workingset_nodereclaim              0.00               0.00
pgscan                       12971305.60        12730331.20
pgscan_kswapd                       0.00               0.00
pgscan_direct                12971305.60        12730331.20
pgscan_khugepaged                   0.00               0.00

We can see some improvement in refaults and sys CPU time.

As for the workingset_refault() accounting caused by zswap writeback,
maybe we should skip it in the zswap writeback case, but doing so showed
more pgscan and some regression. I don't know why, so I left it as it is.

This is RFC, any comment or discussion is welcome! Thanks!

[1] https://lore.kernel.org/all/[email protected]/

Chengming Zhou (1):
mm/swap: queue reclaimable folio to local rotate batch when
!folio_test_lru()

mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

--
2.40.1



2024-02-09 12:01:06

by Chengming Zhou

Subject: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

From: Chengming Zhou <[email protected]>

All LRU move interfaces share a problem: they have no effect if the
folio is isolated from the LRU (in a cpu batch or isolated by the
shrinker). Since they can't move/change the folio's LRU status while it
is isolated, they mostly just clear the folio flag and do nothing.

In our case, a written-back and reclaimable folio won't be rotated to
the tail of the inactive list, since it is still in the cpu lru_add
batch. This may delay the reclaim of that folio and evict other folios.

This patch changes folio_rotate_reclaimable() to queue the reclaimable
folio to the cpu rotate batch even when !folio_test_lru(), hoping it
will likely be handled after the lru_add batch, which puts the folio on
the LRU list first, so the rotation to the tail succeeds when the
rotate batch is handled.

Signed-off-by: Chengming Zhou <[email protected]>
---
mm/swap.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/swap.c b/mm/swap.c
index cd8f0150ba3a..d304731e47cf 100644
--- a/mm/swap.c
+++ b/mm/swap.c
@@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,

static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
{
- if (!folio_test_unevictable(folio)) {
+ if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
+ !folio_test_unevictable(folio) && !folio_test_active(folio)) {
lruvec_del_folio(lruvec, folio);
folio_clear_active(folio);
lruvec_add_folio_tail(lruvec, folio);
@@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
void folio_rotate_reclaimable(struct folio *folio)
{
if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
- !folio_test_unevictable(folio) && folio_test_lru(folio)) {
+ !folio_test_unevictable(folio) && !folio_test_active(folio)) {
struct folio_batch *fbatch;
unsigned long flags;

--
2.40.1


2024-02-13 09:02:01

by Yosry Ahmed

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On Fri, Feb 9, 2024 at 4:00 AM <[email protected]> wrote:
>
> From: Chengming Zhou <[email protected]>
>
> All LRU move interfaces have a problem that it has no effect if the
> folio is isolated from LRU (in cpu batch or isolated by shrinker).
> Since it can't move/change folio LRU status when it's isolated, mostly
> just clear the folio flag and do nothing in this case.
>
> In our case, a written back and reclaimable folio won't be rotated to
> the tail of inactive list, since it's still in cpu lru_add batch. It
> may cause the delayed reclaim of this folio and evict other folios.
>
> This patch changes to queue the reclaimable folio to cpu rotate batch
> even when !folio_test_lru(), hoping it will likely be handled after
> the lru_add batch which will put folio on the LRU list first, so
> will be rotated to the tail successfully when handle rotate batch.

It seems to me that it is totally up to chance whether the lru_add
batch is handled first, especially since there may be problems if it
isn't.

>
> Signed-off-by: Chengming Zhou <[email protected]>
> ---
> mm/swap.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/mm/swap.c b/mm/swap.c
> index cd8f0150ba3a..d304731e47cf 100644
> --- a/mm/swap.c
> +++ b/mm/swap.c
> @@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
>
> static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
> {
> - if (!folio_test_unevictable(folio)) {
> + if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {

What are these conditions based on? I assume we want to check if the
folio is locked because we no longer check that it is on the LRUs, so
we want to check that no one else is operating on it, but I am not
sure that's enough.

> lruvec_del_folio(lruvec, folio);
> folio_clear_active(folio);
> lruvec_add_folio_tail(lruvec, folio);
> @@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
> void folio_rotate_reclaimable(struct folio *folio)
> {
> if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> - !folio_test_unevictable(folio) && folio_test_lru(folio)) {
> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {

I am not sure it is safe to continue with a folio that is not on the
LRUs. It could be isolated for other purposes, and we end up adding it
to an LRU nonetheless. Also, folio_batch_move_lru() will do
folio_test_clear_lru() and skip such folios anyway. There may also be
messed up accounting, for example lru_move_tail_fn() calls
lruvec_del_folio(), which does some bookkeeping, at least for the
MGLRU case.

> struct folio_batch *fbatch;
> unsigned long flags;
>
> --
> 2.40.1
>

2024-02-14 09:26:29

by Chengming Zhou

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On 2024/2/14 15:13, Yu Zhao wrote:
> On Fri, Feb 9, 2024 at 6:00 AM <[email protected]> wrote:
>>
>> From: Chengming Zhou <[email protected]>
>>
>> All LRU move interfaces have a problem that it has no effect if the
>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
>> Since it can't move/change folio LRU status when it's isolated, mostly
>> just clear the folio flag and do nothing in this case.
>>
>> In our case, a written back and reclaimable folio won't be rotated to
>> the tail of inactive list, since it's still in cpu lru_add batch. It
>> may cause the delayed reclaim of this folio and evict other folios.
>>
>> This patch changes to queue the reclaimable folio to cpu rotate batch
>> even when !folio_test_lru(), hoping it will likely be handled after
>> the lru_add batch which will put folio on the LRU list first, so
>> will be rotated to the tail successfully when handle rotate batch.
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>
> I don't think the analysis is correct. IIRC, writeback from non
> reclaim paths doesn't require isolation and the reclaim path doesn't
> use struct folio_batch lru_add.

Ah, my bad, I forgot to mention some important context in the message:

This is not the normal reclaim context; it's the zswap writeback
reclaim context, which first sets the PG_reclaim flag, then submits the
async writeback io.

If the writeback io completes fast enough, folio_rotate_reclaimable()
will be called before the folio is put on the LRU list (it is still in
the local lru_add batch, so it's somewhat like being isolated too).

>
> Did you see any performance improvements with this patch? In general,
> this kind of patches should have performance numbers to show it really
> helps (not just in theory).

Right, there are some improvements; the numbers are in the cover letter.
But this solution is not good enough, it's just an RFC for discussion. :)

mm-unstable-hot zswap-lru-reclaim
real 63.34 62.72
user 1063.20 1060.30
sys 272.04 256.14
workingset_refault_anon 2103297.00 1788155.80
workingset_refault_file 28638.20 39249.40
workingset_activate_anon 746134.00 695435.40
workingset_activate_file 4344.60 4255.80
workingset_restore_anon 653163.80 605315.60
workingset_restore_file 1079.00 883.00
workingset_nodereclaim 0.00 0.00
pgscan 12971305.60 12730331.20
pgscan_kswapd 0.00 0.00
pgscan_direct 12971305.60 12730331.20
pgscan_khugepaged 0.00 0.00

>
> My guess is that you are hitting this problem [1].
>
> [1] https://lore.kernel.org/linux-mm/[email protected]/

Right, I just saw it; it's the same problem. The only difference is that
in your case the folio is isolated by the shrinker, while in my case the
folio is in the cpu lru_add batch. Either way, the result is the same:
the folio can't be rotated successfully when writeback completes.

Thanks.

2024-02-14 09:55:14

by Chengming Zhou

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On 2024/2/13 16:49, Yosry Ahmed wrote:
> On Fri, Feb 9, 2024 at 4:00 AM <[email protected]> wrote:
>>
>> From: Chengming Zhou <[email protected]>
>>
>> All LRU move interfaces have a problem that it has no effect if the
>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
>> Since it can't move/change folio LRU status when it's isolated, mostly
>> just clear the folio flag and do nothing in this case.
>>
>> In our case, a written back and reclaimable folio won't be rotated to
>> the tail of inactive list, since it's still in cpu lru_add batch. It
>> may cause the delayed reclaim of this folio and evict other folios.
>>
>> This patch changes to queue the reclaimable folio to cpu rotate batch
>> even when !folio_test_lru(), hoping it will likely be handled after
>> the lru_add batch which will put folio on the LRU list first, so
>> will be rotated to the tail successfully when handle rotate batch.
>
> It seems to me that it is totally up to chance whether the lru_add
> batch is handled first, especially that there may be problems if it
> isn't.

You're right, I just don't know a better solution :)

>
>>
>> Signed-off-by: Chengming Zhou <[email protected]>
>> ---
>> mm/swap.c | 5 +++--
>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/swap.c b/mm/swap.c
>> index cd8f0150ba3a..d304731e47cf 100644
>> --- a/mm/swap.c
>> +++ b/mm/swap.c
>> @@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
>>
>> static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>> {
>> - if (!folio_test_unevictable(folio)) {
>> + if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
>> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
>
> What are these conditions based on? I assume we want to check if the
> folio is locked because we no longer check that it is on the LRUs, so
> we want to check that no one else is operating on it, but I am not
> sure that's enough.

These conditions check whether the folio should be reclaimed/rotated at
this point. For example, we shouldn't reclaim it if it has been dirtied
or activated.

lru_move_tail_fn() will only be called after we isolate this folio
successfully in folio_batch_move_lru(), so if another path has isolated
this folio (cpu batch or reclaim shrinker), this function will not be
called.

>
>> lruvec_del_folio(lruvec, folio);
>> folio_clear_active(folio);
>> lruvec_add_folio_tail(lruvec, folio);
>> @@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>> void folio_rotate_reclaimable(struct folio *folio)
>> {
>> if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
>> - !folio_test_unevictable(folio) && folio_test_lru(folio)) {
>> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
>
> I am not sure it is safe to continue with a folio that is not on the
> LRUs. It could be isolated for other purposes, and we end up adding it
> to an LRU nonetheless. Also, folio_batch_move_lru() will do

This shouldn't happen, since lru_move_tail_fn() will only be called if
folio_test_clear_lru() succeeds in folio_batch_move_lru().

Thanks.

2024-02-14 18:59:24

by Yosry Ahmed

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On Wed, Feb 14, 2024 at 05:54:56PM +0800, Chengming Zhou wrote:
> On 2024/2/13 16:49, Yosry Ahmed wrote:
> > On Fri, Feb 9, 2024 at 4:00 AM <[email protected]> wrote:
> >>
> >> From: Chengming Zhou <[email protected]>
> >>
> >> All LRU move interfaces have a problem that it has no effect if the
> >> folio is isolated from LRU (in cpu batch or isolated by shrinker).
> >> Since it can't move/change folio LRU status when it's isolated, mostly
> >> just clear the folio flag and do nothing in this case.
> >>
> >> In our case, a written back and reclaimable folio won't be rotated to
> >> the tail of inactive list, since it's still in cpu lru_add batch. It
> >> may cause the delayed reclaim of this folio and evict other folios.
> >>
> >> This patch changes to queue the reclaimable folio to cpu rotate batch
> >> even when !folio_test_lru(), hoping it will likely be handled after
> >> the lru_add batch which will put folio on the LRU list first, so
> >> will be rotated to the tail successfully when handle rotate batch.
> >
> > It seems to me that it is totally up to chance whether the lru_add
> > batch is handled first, especially that there may be problems if it
> > isn't.
>
> You're right, I just don't know better solution :)
>
> >
> >>
> >> Signed-off-by: Chengming Zhou <[email protected]>
> >> ---
> >> mm/swap.c | 5 +++--
> >> 1 file changed, 3 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/mm/swap.c b/mm/swap.c
> >> index cd8f0150ba3a..d304731e47cf 100644
> >> --- a/mm/swap.c
> >> +++ b/mm/swap.c
> >> @@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
> >>
> >> static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
> >> {
> >> - if (!folio_test_unevictable(folio)) {
> >> + if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> >> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
> >
> > What are these conditions based on? I assume we want to check if the
> > folio is locked because we no longer check that it is on the LRUs, so
> > we want to check that no one else is operating on it, but I am not
> > sure that's enough.
>
> These conditions are used for checking whether the folio should be reclaimed/rotated
> at this point. Like we shouldn't reclaim it if it has been dirtied or actived.

This should be explained somewhere, a comment or in the commit message.

> lru_move_tail_fn() will only be called after we isolate this folio successfully
> in folio_batch_move_lru(), so if other path has isolated this folio (cpu batch
> or reclaim shrinker), this function will not be called.

Interesting, why are we checking if the folio is locked here then?

>
> >
> >> lruvec_del_folio(lruvec, folio);
> >> folio_clear_active(folio);
> >> lruvec_add_folio_tail(lruvec, folio);
> >> @@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
> >> void folio_rotate_reclaimable(struct folio *folio)
> >> {
> >> if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
> >> - !folio_test_unevictable(folio) && folio_test_lru(folio)) {
> >> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
> >
> > I am not sure it is safe to continue with a folio that is not on the
> > LRUs. It could be isolated for other purposes, and we end up adding it
> > to an LRU nonetheless. Also, folio_batch_move_lru() will do
>
> This shouldn't happen since lru_move_tail_fn() will only be called if
> folio_test_clear_lru() successfully in folio_batch_move_lru().

I see, so this is where we hope lru_add batch gets handled first. I need
to think about this some more, let's also see what others like Yu say.

Thanks!

2024-02-15 07:12:59

by Yu Zhao

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On Wed, Feb 14, 2024 at 4:18 AM Chengming Zhou <[email protected]> wrote:
>
> On 2024/2/14 15:13, Yu Zhao wrote:
> > On Fri, Feb 9, 2024 at 6:00 AM <[email protected]> wrote:
> >>
> >> From: Chengming Zhou <[email protected]>
> >>
> >> All LRU move interfaces have a problem that it has no effect if the
> >> folio is isolated from LRU (in cpu batch or isolated by shrinker).
> >> Since it can't move/change folio LRU status when it's isolated, mostly
> >> just clear the folio flag and do nothing in this case.
> >>
> >> In our case, a written back and reclaimable folio won't be rotated to
> >> the tail of inactive list, since it's still in cpu lru_add batch. It
> >> may cause the delayed reclaim of this folio and evict other folios.
> >>
> >> This patch changes to queue the reclaimable folio to cpu rotate batch
> >> even when !folio_test_lru(), hoping it will likely be handled after
> >> the lru_add batch which will put folio on the LRU list first, so
> >> will be rotated to the tail successfully when handle rotate batch.
> >>
> >> Signed-off-by: Chengming Zhou <[email protected]>
> >
> > I don't think the analysis is correct. IIRC, writeback from non
> > reclaim paths doesn't require isolation and the reclaim path doesn't
> > use struct folio_batch lru_add.
>
> Ah, my bad, I forgot to mention the important context in the message:
>
> This is not from the normal reclaim context, it's from zswap writeback
> reclaim context, which will first set PG_reclaim flag, then submit the
> async writeback io.
>
> If the writeback io complete fast enough, folio_rotate_reclaimable()
> will be called before that folio put on LRU list (it still in the local
> lru_add batch, so it's somewhat like isolated too)
>
> >
> > Did you see any performance improvements with this patch? In general,
> > this kind of patches should have performance numbers to show it really
> > helps (not just in theory).
>
> Right, there are some improvements, the numbers are put in cover letter.
> But this solution is not good enough, just RFC for discussion. :)
>
> mm-unstable-hot zswap-lru-reclaim
> real 63.34 62.72
> user 1063.20 1060.30
> sys 272.04 256.14
> workingset_refault_anon 2103297.00 1788155.80
> workingset_refault_file 28638.20 39249.40
> workingset_activate_anon 746134.00 695435.40
> workingset_activate_file 4344.60 4255.80
> workingset_restore_anon 653163.80 605315.60
> workingset_restore_file 1079.00 883.00
> workingset_nodereclaim 0.00 0.00
> pgscan 12971305.60 12730331.20
> pgscan_kswapd 0.00 0.00
> pgscan_direct 12971305.60 12730331.20
> pgscan_khugepaged 0.00 0.00
>
> >
> > My guess is that you are hitting this problem [1].
> >
> > [1] https://lore.kernel.org/linux-mm/[email protected]/
>
> Right, I just see it, it's the same problem. The only difference is that
> in your case the folio is isolated by shrinker, in my case, the folio is
> in cpu lru_add batch. Anyway, the result is the same, that folio can't be
> rotated successfully when writeback complete.

In that case, a better solution would be to make lru_add add
(_reclaim() && !_dirty() && !_writeback()) folios at the tail.
(_rotate() needs to leave _reclaim() set if it fails to rotate.)

2024-02-18 02:47:23

by Chengming Zhou

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On 2024/2/15 02:59, Yosry Ahmed wrote:
> On Wed, Feb 14, 2024 at 05:54:56PM +0800, Chengming Zhou wrote:
>> On 2024/2/13 16:49, Yosry Ahmed wrote:
>>> On Fri, Feb 9, 2024 at 4:00 AM <[email protected]> wrote:
>>>>
>>>> From: Chengming Zhou <[email protected]>
>>>>
>>>> All LRU move interfaces have a problem that it has no effect if the
>>>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
>>>> Since it can't move/change folio LRU status when it's isolated, mostly
>>>> just clear the folio flag and do nothing in this case.
>>>>
>>>> In our case, a written back and reclaimable folio won't be rotated to
>>>> the tail of inactive list, since it's still in cpu lru_add batch. It
>>>> may cause the delayed reclaim of this folio and evict other folios.
>>>>
>>>> This patch changes to queue the reclaimable folio to cpu rotate batch
>>>> even when !folio_test_lru(), hoping it will likely be handled after
>>>> the lru_add batch which will put folio on the LRU list first, so
>>>> will be rotated to the tail successfully when handle rotate batch.
>>>
>>> It seems to me that it is totally up to chance whether the lru_add
>>> batch is handled first, especially that there may be problems if it
>>> isn't.
>>
>> You're right, I just don't know better solution :)
>>
>>>
>>>>
>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>> ---
>>>> mm/swap.c | 5 +++--
>>>> 1 file changed, 3 insertions(+), 2 deletions(-)
>>>>
>>>> diff --git a/mm/swap.c b/mm/swap.c
>>>> index cd8f0150ba3a..d304731e47cf 100644
>>>> --- a/mm/swap.c
>>>> +++ b/mm/swap.c
>>>> @@ -236,7 +236,8 @@ static void folio_batch_add_and_move(struct folio_batch *fbatch,
>>>>
>>>> static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>>>> {
>>>> - if (!folio_test_unevictable(folio)) {
>>>> + if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
>>>> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
>>>
>>> What are these conditions based on? I assume we want to check if the
>>> folio is locked because we no longer check that it is on the LRUs, so
>>> we want to check that no one else is operating on it, but I am not
>>> sure that's enough.
>>
>> These conditions are used for checking whether the folio should be reclaimed/rotated
>> at this point. Like we shouldn't reclaim it if it has been dirtied or actived.
>
> This should be explained somewhere, a comment or in the commit message.
>
>> lru_move_tail_fn() will only be called after we isolate this folio successfully
>> in folio_batch_move_lru(), so if other path has isolated this folio (cpu batch
>> or reclaim shrinker), this function will not be called.
>
> Interesting, why are we checking if the folio is locked here then?

I think it means the folio is in use by others, since reclaim needs to
lock the folio. I'm not very sure.

>
>>
>>>
>>>> lruvec_del_folio(lruvec, folio);
>>>> folio_clear_active(folio);
>>>> lruvec_add_folio_tail(lruvec, folio);
>>>> @@ -254,7 +255,7 @@ static void lru_move_tail_fn(struct lruvec *lruvec, struct folio *folio)
>>>> void folio_rotate_reclaimable(struct folio *folio)
>>>> {
>>>> if (!folio_test_locked(folio) && !folio_test_dirty(folio) &&
>>>> - !folio_test_unevictable(folio) && folio_test_lru(folio)) {
>>>> + !folio_test_unevictable(folio) && !folio_test_active(folio)) {
>>>
>>> I am not sure it is safe to continue with a folio that is not on the
>>> LRUs. It could be isolated for other purposes, and we end up adding it
>>> to an LRU nonetheless. Also, folio_batch_move_lru() will do
>>
>> This shouldn't happen since lru_move_tail_fn() will only be called if
>> folio_test_clear_lru() successfully in folio_batch_move_lru().
>
> I see, so this is where we hope lru_add batch gets handled first. I need
> to think about this some more, let's also see what others like Yu say.

Right.


2024-02-18 02:52:33

by Chengming Zhou

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On 2024/2/15 15:06, Yu Zhao wrote:
> On Wed, Feb 14, 2024 at 4:18 AM Chengming Zhou <[email protected]> wrote:
>>
>> On 2024/2/14 15:13, Yu Zhao wrote:
>>> On Fri, Feb 9, 2024 at 6:00 AM <[email protected]> wrote:
>>>>
>>>> From: Chengming Zhou <[email protected]>
>>>>
>>>> All LRU move interfaces have a problem that it has no effect if the
>>>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
>>>> Since it can't move/change folio LRU status when it's isolated, mostly
>>>> just clear the folio flag and do nothing in this case.
>>>>
>>>> In our case, a written back and reclaimable folio won't be rotated to
>>>> the tail of inactive list, since it's still in cpu lru_add batch. It
>>>> may cause the delayed reclaim of this folio and evict other folios.
>>>>
>>>> This patch changes to queue the reclaimable folio to cpu rotate batch
>>>> even when !folio_test_lru(), hoping it will likely be handled after
>>>> the lru_add batch which will put folio on the LRU list first, so
>>>> will be rotated to the tail successfully when handle rotate batch.
>>>>
>>>> Signed-off-by: Chengming Zhou <[email protected]>
>>>
>>> I don't think the analysis is correct. IIRC, writeback from non
>>> reclaim paths doesn't require isolation and the reclaim path doesn't
>>> use struct folio_batch lru_add.
>>
>> Ah, my bad, I forgot to mention the important context in the message:
>>
>> This is not from the normal reclaim context, it's from zswap writeback
>> reclaim context, which will first set PG_reclaim flag, then submit the
>> async writeback io.
>>
>> If the writeback io complete fast enough, folio_rotate_reclaimable()
>> will be called before that folio put on LRU list (it still in the local
>> lru_add batch, so it's somewhat like isolated too)
>>
>>>
>>> Did you see any performance improvements with this patch? In general,
>>> this kind of patches should have performance numbers to show it really
>>> helps (not just in theory).
>>
>> Right, there are some improvements, the numbers are put in cover letter.
>> But this solution is not good enough, just RFC for discussion. :)
>>
>> mm-unstable-hot zswap-lru-reclaim
>> real 63.34 62.72
>> user 1063.20 1060.30
>> sys 272.04 256.14
>> workingset_refault_anon 2103297.00 1788155.80
>> workingset_refault_file 28638.20 39249.40
>> workingset_activate_anon 746134.00 695435.40
>> workingset_activate_file 4344.60 4255.80
>> workingset_restore_anon 653163.80 605315.60
>> workingset_restore_file 1079.00 883.00
>> workingset_nodereclaim 0.00 0.00
>> pgscan 12971305.60 12730331.20
>> pgscan_kswapd 0.00 0.00
>> pgscan_direct 12971305.60 12730331.20
>> pgscan_khugepaged 0.00 0.00
>>
>>>
>>> My guess is that you are hitting this problem [1].
>>>
>>> [1] https://lore.kernel.org/linux-mm/[email protected]/
>>
>> Right, I just see it, it's the same problem. The only difference is that
>> in your case the folio is isolated by shrinker, in my case, the folio is
>> in cpu lru_add batch. Anyway, the result is the same, that folio can't be
>> rotated successfully when writeback complete.
>
> In that case, a better solution would be to make lru_add add
> (_reclaim() && !_dirty() && !_writeback()) folios at the tail.
> (_rotate() needs to leave _reclaim() set if it fails to rotate.)

Right, this is a solution. But PG_readahead is an alias of PG_reclaim,
so I'm afraid this would also rotate readahead folios to the tail of
the inactive list.


2024-02-18 08:09:16

by Yu Zhao

Subject: Re: [PATCH RFC 1/1] mm/swap: queue reclaimable folio to local rotate batch when !folio_test_lru()

On Sat, Feb 17, 2024 at 9:52 PM Chengming Zhou <[email protected]> wrote:
>
> On 2024/2/15 15:06, Yu Zhao wrote:
> > On Wed, Feb 14, 2024 at 4:18 AM Chengming Zhou <[email protected]> wrote:
> >>
> >> On 2024/2/14 15:13, Yu Zhao wrote:
> >>> On Fri, Feb 9, 2024 at 6:00 AM <[email protected]> wrote:
> >>>>
> >>>> From: Chengming Zhou <[email protected]>
> >>>>
> >>>> All LRU move interfaces have a problem that it has no effect if the
> >>>> folio is isolated from LRU (in cpu batch or isolated by shrinker).
> >>>> Since it can't move/change folio LRU status when it's isolated, mostly
> >>>> just clear the folio flag and do nothing in this case.
> >>>>
> >>>> In our case, a written back and reclaimable folio won't be rotated to
> >>>> the tail of inactive list, since it's still in cpu lru_add batch. It
> >>>> may cause the delayed reclaim of this folio and evict other folios.
> >>>>
> >>>> This patch changes to queue the reclaimable folio to cpu rotate batch
> >>>> even when !folio_test_lru(), hoping it will likely be handled after
> >>>> the lru_add batch which will put folio on the LRU list first, so
> >>>> will be rotated to the tail successfully when handle rotate batch.
> >>>>
> >>>> Signed-off-by: Chengming Zhou <[email protected]>
> >>>
> >>> I don't think the analysis is correct. IIRC, writeback from non
> >>> reclaim paths doesn't require isolation and the reclaim path doesn't
> >>> use struct folio_batch lru_add.
> >>
> >> Ah, my bad, I forgot to mention the important context in the message:
> >>
> >> This is not from the normal reclaim context, it's from zswap writeback
> >> reclaim context, which will first set PG_reclaim flag, then submit the
> >> async writeback io.
> >>
> >> If the writeback io complete fast enough, folio_rotate_reclaimable()
> >> will be called before that folio put on LRU list (it still in the local
> >> lru_add batch, so it's somewhat like isolated too)
> >>
> >>>
> >>> Did you see any performance improvements with this patch? In general,
> >>> this kind of patches should have performance numbers to show it really
> >>> helps (not just in theory).
> >>
> >> Right, there are some improvements, the numbers are put in cover letter.
> >> But this solution is not good enough, just RFC for discussion. :)
> >>
> >> mm-unstable-hot zswap-lru-reclaim
> >> real 63.34 62.72
> >> user 1063.20 1060.30
> >> sys 272.04 256.14
> >> workingset_refault_anon 2103297.00 1788155.80
> >> workingset_refault_file 28638.20 39249.40
> >> workingset_activate_anon 746134.00 695435.40
> >> workingset_activate_file 4344.60 4255.80
> >> workingset_restore_anon 653163.80 605315.60
> >> workingset_restore_file 1079.00 883.00
> >> workingset_nodereclaim 0.00 0.00
> >> pgscan 12971305.60 12730331.20
> >> pgscan_kswapd 0.00 0.00
> >> pgscan_direct 12971305.60 12730331.20
> >> pgscan_khugepaged 0.00 0.00
> >>
> >>>
> >>> My guess is that you are hitting this problem [1].
> >>>
> >>> [1] https://lore.kernel.org/linux-mm/[email protected]/
> >>
> >> Right, I just see it, it's the same problem. The only difference is that
> >> in your case the folio is isolated by shrinker, in my case, the folio is
> >> in cpu lru_add batch. Anyway, the result is the same, that folio can't be
> >> rotated successfully when writeback complete.
> >
> > In that case, a better solution would be to make lru_add add
> > (_reclaim() && !_dirty() && !_writeback()) folios at the tail.
> > (_rotate() needs to leave _reclaim() set if it fails to rotate.)
>
> Right, this is a solution. But PG_readahead is alias of PG_reclaim,
> I'm afraid this would rotate readahead folio to the inactive tail.

Then drain before setting readahead, since readahead isn't set on every folio.