LinuxLists.cc - Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Thu, Nov 16, 2023 at 12:12 PM Chris Li <[email protected]> wrote:
>
> Hi Yosry,
>
> On Tue, Nov 14, 2023 at 9:16 AM Yosry Ahmed <[email protected]> wrote:
> > > 1)The swap entry has been freed, but cached in swap_slots_cache,
> > > no swap cache and swapcount=0.
> > > 2)When the option zswap_exclusive_loads_enabled disabled and
> > > zswap_load completed(page in swap_cache and swapcount = 0).
> >
> > For case (1), I think a cleaner solution would be to move the
> > zswap_invalidate() call from swap_range_free() (which is called after
> > the cached slots are freed) to __swap_entry_free_locked() if the usage
> > goes to 0. I actually think conceptually this makes not just for
> > zswap_invalidate(), but also for the arch call, memcg uncharging, etc.
> > Slots caching is a swapfile optimization that should be internal to
> > swapfile code. Once a swap entry is freed (i.e. swap count is 0 AND
>
> Do you mean moving all swap slots free to bypass the swap slot cache, even it
> is not from zswap? That might have unwanted side effects. The swap
> slot cache is not just for swap files on disk. The batching has the
> effect that on average lower cost of freeing per entry.

Not bypassing the swap slot cache, just make the callbacks to
invalidate the zswap entry, do memg uncharging, etc when the slot is
no longer used and is entering the swap slot cache (i.e. when
free_swap_slot() is called), instead of when draining the swap slot
cache (i.e. when swap_range_free() is called). For all parts of MM
outside of swap, the swap entry is freed when free_swap_slot() is
called. We don't free it immediately because of caching, but this
should be transparent to other parts of MM (e.g. zswap, memcg, etc).

>
> > not in the swap cache), all the hooks should be called (memcg, zswap,
> > arch, ..) as the swap entry is effectively freed. The fact that
> > swapfile code internally batches and caches slots should be
> > transparent to other parts of MM. I am not sure if the calls can just
> > be moved or if there are underlying assumptions in the implementation
> > that would be broken, but it feels like the right thing to do.
>
> There is also the behavior that if the page gets swapped in but hasn't
> changed, when swap out again, it is possible to avoid writing the
> page again to the disk. For disk there is no overhead keeping the old
> date on the disk not to touch it. For zpool it might have memory
> overhead holding the compressed pool. The trade off might be
> different.
>
> Chris

2023-11-16 20:31:24

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>
> Not bypassing the swap slot cache, just make the callbacks to
> invalidate the zswap entry, do memg uncharging, etc when the slot is
> no longer used and is entering the swap slot cache (i.e. when
> free_swap_slot() is called), instead of when draining the swap slot
> cache (i.e. when swap_range_free() is called). For all parts of MM
> outside of swap, the swap entry is freed when free_swap_slot() is
> called. We don't free it immediately because of caching, but this
> should be transparent to other parts of MM (e.g. zswap, memcg, etc).

That will cancel the batching effect on the swap slot free, making the
common case for swapping faults take longer to complete, righ?
If I recall correctly, the uncharge is the expensive part of the swap
slot free operation.
I just want to figure out what we are trading off against. This is not
one side wins all situations.

Chris

2023-11-16 20:47:14

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Thu, Nov 16, 2023 at 12:30 PM Chris Li <[email protected]> wrote:
>
> On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> >
> > Not bypassing the swap slot cache, just make the callbacks to
> > invalidate the zswap entry, do memg uncharging, etc when the slot is
> > no longer used and is entering the swap slot cache (i.e. when
> > free_swap_slot() is called), instead of when draining the swap slot
> > cache (i.e. when swap_range_free() is called). For all parts of MM
> > outside of swap, the swap entry is freed when free_swap_slot() is
> > called. We don't free it immediately because of caching, but this
> > should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>
> That will cancel the batching effect on the swap slot free, making the
> common case for swapping faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.

Interesting. Maybe we can just move the zswap_invalidate() call to
save memory early, and leave the memcg uncharge call to be batched?
IIUC we already do some version of this internally at Google.

>
>
> Chris

2023-11-17 09:57:01

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

>
> That will cancel the batching effect on the swap slot free, making the
> common case for swapping faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.
>

Hi Chris, thanks for your feedback. I have the same concerns,
maybe we should just move the zswap_invalidate() out of batches,
as Yosry mentioned above.

2023-11-17 23:31:18

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Thu, Nov 16, 2023 at 12:46 PM Yosry Ahmed <[email protected]> wrote:
>
> On Thu, Nov 16, 2023 at 12:30 PM Chris Li <[email protected]> wrote:
> >
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> > >
> > > Not bypassing the swap slot cache, just make the callbacks to
> > > invalidate the zswap entry, do memg uncharging, etc when the slot is
> > > no longer used and is entering the swap slot cache (i.e. when
> > > free_swap_slot() is called), instead of when draining the swap slot
> > > cache (i.e. when swap_range_free() is called). For all parts of MM
> > > outside of swap, the swap entry is freed when free_swap_slot() is
> > > called. We don't free it immediately because of caching, but this
> > > should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >
> > That will cancel the batching effect on the swap slot free, making the
> > common case for swapping faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Interesting. Maybe we can just move the zswap_invalidate() call to
> save memory early, and leave the memcg uncharge call to be batched?
> IIUC we already do some version of this internally at Google.

I would like to see the patch so I can reason about it better.
Do you have the link for the patch you are talking about?
I can also look up the Google internal one if you have a commit hash.

One thing I would like to find out is whether the behavior of reusing
swap slots without page writing has been changed.
e.g. Previously if the swap slot can be page out again without
writing/compression again, if the page is not dirty. If we change that
behavior, I would like to understand the trade off better.

Chris

2023-11-17 23:48:15

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
<[email protected]> wrote:
> Hi Chris, thanks for your feedback. I have the same concerns,
> maybe we should just move the zswap_invalidate() out of batches,
> as Yosry mentioned above.

As I replied in the previous email, I just want to understand the
other side effects of the change better.

To me, this patching is actually freeing the memory that does not
require actual page IO write from zswap. Which means the memory is
from some kind of cache. It would be interesting if we can not
complicate the write back path further. Instead, we can drop those
memories from the different cache if needed. I assume those caches are
doing something useful in the common case. If not, we should have a
patch to remove these caches instead. Not sure how big a mess it will
be to implement separate the write and drop caches.

While you are here, I have some questions for you.

Can you help me understand how much memory you can free from this
patch? For example, are we talking about a few pages or a few GB?

Where does the freed memory come from?
If the memory comes from zswap entry struct. Due to the slab allocator
fragmentation. It would take a lot of zswap entries to have meaningful
memory reclaimed from the slab allocator.

If the memory comes from the swap cached pages, that would be much
more meaningful. But that is not what this patch is doing, right?

Chris

2023-11-18 01:46:41

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Hi Chris, thanks for your time.

>
> On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
> <[email protected]> wrote:
> > Hi Chris, thanks for your feedback. I have the same concerns,
> > maybe we should just move the zswap_invalidate() out of batches,
> > as Yosry mentioned above.
>
> As I replied in the previous email, I just want to understand the
> other side effects of the change better.
>
> To me, this patching is actually freeing the memory that does not
> require actual page IO write from zswap. Which means the memory is
> from some kind of cache. It would be interesting if we can not
> complicate the write back path further. Instead, we can drop those
> memories from the different cache if needed. I assume those caches are
> doing something useful in the common case. If not, we should have a
> patch to remove these caches instead. Not sure how big a mess it will
> be to implement separate the write and drop caches.
>
> While you are here, I have some questions for you.
>
> Can you help me understand how much memory you can free from this
> patch? For example, are we talking about a few pages or a few GB?
>
> Where does the freed memory come from?
> If the memory comes from zswap entry struct. Due to the slab allocator
> fragmentation. It would take a lot of zswap entries to have meaningful
> memory reclaimed from the slab allocator.
>
> If the memory comes from the swap cached pages, that would be much
> more meaningful. But that is not what this patch is doing, right?
>
> Chris

It's my bad for putting two cases together. The memory released in both
cases comes from zswap entry struct and zswap compressed page.

The original intention of this patch is to solve the problem that
shrink_work() fails to reclaim memory in two situations.

For case (1), the zswap_writeback_entry() will failed for the
__read_swap_cache_async return NULL because the swap has been
freed but cached in swap_slots_cache, so the memory come from
the zswap entry struct and compressed page.
Count = SWAP_BATCH * ncpu.
Solution: move the zswap_invalidate() out of batches, free it once the swap
count equal to 0.

For case (2), the zswap_writeback_entry() will failed for !page_was_allocated
because zswap_load will have two copies of the same page in memory
(compressed and uncompressed) after faulting in a page from zswap when
zswap_exclusive_loads disabled. The amount of memory is greater but depends
on the usage.

Why do we need to release them?
Consider this scenario,there is a lot of data cached in memory and zswap,
hit the limit，and shrink_worker will fail. The new coming data will be written
directly to swap due to zswap_store failure. Should we free the last one
to store the latest one in zswap.

According to the previous discussion, the writeback is inevitable.
So I want to make zswap_exclusive_loads_enabled the default behavior
or make it the only way to do zswap loads. It only makes sense when
the page is read and no longer dirty. If the page is read frequently, it
should stay in cache rather than zswap. The benefit of doing this is
very small, i.e. two copies of the same page in memory.

Thanks again.

2023-11-18 18:44:14

by Nhat Pham

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Fri, Nov 17, 2023 at 8:46 PM Zhongkun He
<[email protected]> wrote:
>
> Hi Chris, thanks for your time.
>
> >
> > On Fri, Nov 17, 2023 at 1:56 AM Zhongkun He
> > <[email protected]> wrote:
> > > Hi Chris, thanks for your feedback. I have the same concerns,
> > > maybe we should just move the zswap_invalidate() out of batches,
> > > as Yosry mentioned above.
> >
> > As I replied in the previous email, I just want to understand the
> > other side effects of the change better.
> >
> > To me, this patching is actually freeing the memory that does not
> > require actual page IO write from zswap. Which means the memory is
> > from some kind of cache. It would be interesting if we can not
> > complicate the write back path further. Instead, we can drop those
> > memories from the different cache if needed. I assume those caches are
> > doing something useful in the common case. If not, we should have a
> > patch to remove these caches instead. Not sure how big a mess it will
> > be to implement separate the write and drop caches.
> >
> > While you are here, I have some questions for you.
> >
> > Can you help me understand how much memory you can free from this
> > patch? For example, are we talking about a few pages or a few GB?
> >
> > Where does the freed memory come from?
> > If the memory comes from zswap entry struct. Due to the slab allocator
> > fragmentation. It would take a lot of zswap entries to have meaningful
> > memory reclaimed from the slab allocator.
> >
> > If the memory comes from the swap cached pages, that would be much
> > more meaningful. But that is not what this patch is doing, right?
> >
> > Chris
>
> It's my bad for putting two cases together. The memory released in both
> cases comes from zswap entry struct and zswap compressed page.
>
> The original intention of this patch is to solve the problem that
> shrink_work() fails to reclaim memory in two situations.
>
> For case (1), the zswap_writeback_entry() will failed for the
> __read_swap_cache_async return NULL because the swap has been
> freed but cached in swap_slots_cache, so the memory come from
> the zswap entry struct and compressed page.
> Count = SWAP_BATCH * ncpu.
> Solution: move the zswap_invalidate() out of batches, free it once the swap
> count equal to 0.
>
> For case (2), the zswap_writeback_entry() will failed for !page_was_allocated
> because zswap_load will have two copies of the same page in memory
> (compressed and uncompressed) after faulting in a page from zswap when
> zswap_exclusive_loads disabled. The amount of memory is greater but depends
> on the usage.
>
> Why do we need to release them?
> Consider this scenario,there is a lot of data cached in memory and zswap,
> hit the limit，and shrink_worker will fail. The new coming data will be written
> directly to swap due to zswap_store failure. Should we free the last one
> to store the latest one in zswap.

Shameless plug: zswap will much less likely hit the limit (global or
cgroup) with the shrinker enabled ;) It will proactively reclaim the
objects way ahead of the limit.

It comes with its own can of worms, of course - it's unlikely to work
for all workloads in its current form, but perhaps worth experimenting
with/improved upon?

>
> According to the previous discussion, the writeback is inevitable.
> So I want to make zswap_exclusive_loads_enabled the default behavior
> or make it the only way to do zswap loads. It only makes sense when
> the page is read and no longer dirty. If the page is read frequently, it
> should stay in cache rather than zswap. The benefit of doing this is
> very small, i.e. two copies of the same page in memory.
>
> Thanks again.

2023-11-19 08:24:21

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Hi Zhongkun,

On Fri, Nov 17, 2023 at 5:46 PM Zhongkun He
<[email protected]> wrote:
> > Can you help me understand how much memory you can free from this
> > patch? For example, are we talking about a few pages or a few GB?
> >
> > Where does the freed memory come from?
> > If the memory comes from zswap entry struct. Due to the slab allocator
> > fragmentation. It would take a lot of zswap entries to have meaningful
> > memory reclaimed from the slab allocator.
> >
> > If the memory comes from the swap cached pages, that would be much
> > more meaningful. But that is not what this patch is doing, right?
> >
> > Chris
>
> It's my bad for putting two cases together. The memory released in both
> cases comes from zswap entry struct and zswap compressed page.

Thanks for the clarification. Keep in mind that memory freeing from
and zswap entry and zpool does not directly translate into page free.
If the page has other none freed zswap entry or zsmalloc usage, those
pages will not be free to the system. That is the fragmentation cost I
was talking about.

With this consideration, do you know many extra pages it can release
back to the system by this patch in your usage case? If the difference
is very small, it might not be worth the extra complexity to release
those.

> The original intention of this patch is to solve the problem that
> shrink_work() fails to reclaim memory in two situations.
>
> For case (1), the zswap_writeback_entry() will failed for the
> __read_swap_cache_async return NULL because the swap has been
> freed but cached in swap_slots_cache, so the memory come from
> the zswap entry struct and compressed page.
In those cases, if we drop the swap_slots_cache, it will also free
those zswap entries and compressed pages (zpool), right?

> Count = SWAP_BATCH * ncpu.

That is the upper limit. Not all CPUs have swap batches fully loaded.

> Solution: move the zswap_invalidate() out of batches, free it once the swap
> count equal to 0.
Per previous discussion, this will have an impact on the
swap_slot_cache behavior.
We need some data points for cost benefit analysis.

> For case (2), the zswap_writeback_entry() will failed for !page_was_allocated
> because zswap_load will have two copies of the same page in memory
> (compressed and uncompressed) after faulting in a page from zswap when
> zswap_exclusive_loads disabled. The amount of memory is greater but depends
> on the usage.

That is basically disable the future swap out page IO write
optimization that skip the write if the page hasn't changed. If the
system are low on memory, that is justifiable. Again, it seems we can
have a pass to drop the compressed memory if the swap count is zero
(and mark page dirty).

>
> Why do we need to release them?
> Consider this scenario,there is a lot of data cached in memory and zswap,
> hit the limit，and shrink_worker will fail. The new coming data will be written

Yes, the shrink_worker will need to allocate a page to store
uncompressed data for write back.

> directly to swap due to zswap_store failure. Should we free the last one
> to store the latest one in zswap.

The "last one" you mean the most recent zswap entry written into zswap?
Well, you need to allocate a page to write it out, that is an async process.
Shrink the zpool now is kind of too late already.

> According to the previous discussion, the writeback is inevitable.
> So I want to make zswap_exclusive_loads_enabled the default behavior
> or make it the only way to do zswap loads. It only makes sense when

We need some data point for how often we swap it out to zswap again,
where the zswap out write can be saved by using the existing compressed data.

It is totally possible this page IO write out optimization is not
worthwhile for zswap.
We need some data to support that claim.

> the page is read and no longer dirty. If the page is read frequently, it
> should stay in cache rather than zswap. The benefit of doing this is
> very small, i.e. two copies of the same page in memory.

If the benefit of doing this is very small, that seems to be the
argument against this patch?
Again we need some data points for cost and benefit analysis.

Chris

2023-11-19 08:29:40

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Sat, Nov 18, 2023 at 10:44 AM Nhat Pham <[email protected]> wrote:

> > Why do we need to release them?
> > Consider this scenario,there is a lot of data cached in memory and zswap,
> > hit the limit，and shrink_worker will fail. The new coming data will be written
> > directly to swap due to zswap_store failure. Should we free the last one
> > to store the latest one in zswap.
>
> Shameless plug: zswap will much less likely hit the limit (global or
> cgroup) with the shrinker enabled ;) It will proactively reclaim the
> objects way ahead of the limit.

I think that is actually the proper path, by the time it hits the
limit of zpool. That is already too late to shrink zpool to make room.
The shrinker should have guaranteed the amount of pages for write back
purposes to make forward progress.

> It comes with its own can of worms, of course - it's unlikely to work
> for all workloads in its current form, but perhaps worth experimenting
> with/improved upon?

Agree.

Chris

2023-11-20 02:43:06

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

>
> Shameless plug: zswap will much less likely hit the limit (global or
> cgroup) with the shrinker enabled ;) It will proactively reclaim the
> objects way ahead of the limit.

Hi Nhat，glad to hear from you.
Back to the beginning, the original intention of this patch is to solve
the problem that shrink_work() fails to reclaim memory in two situations.
The zswap_writeback_entry() will failed for !page_was_allocated
because zswap_load will have two copies of the same page in memory
(compressed and uncompressed) after faulting in a page from zswap when
zswap_exclusive_loads disabled.

A simple test:
1): Turn off zswap_exclusive_loads_enabled.
2): Run a read-only program and allocate more memory than the limit,
so the limit will be reached and shrinker will fail.

>
> It comes with its own can of worms, of course - it's unlikely to work
> for all workloads in its current form, but perhaps worth experimenting
> with/improved upon?
>

2023-11-20 03:16:41

[permalink] [raw]

Subject: Re: [External] Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

>
> Thanks for the clarification. Keep in mind that memory freeing from
> and zswap entry and zpool does not directly translate into page free.
> If the page has other none freed zswap entry or zsmalloc usage, those
> pages will not be free to the system. That is the fragmentation cost I
> was talking about.

Yes, it may need to be compacted.

>
> With this consideration, do you know many extra pages it can release
> back to the system by this patch in your usage case? If the difference
> is very small, it might not be worth the extra complexity to release
> those.
>

The original intention of this patch is to make shrink work properly，
not to release cache and related memory.

> > The original intention of this patch is to solve the problem that
> > shrink_work() fails to reclaim memory in two situations.
> >
> > For case (1), the zswap_writeback_entry() will failed for the
> > __read_swap_cache_async return NULL because the swap has been
> > freed but cached in swap_slots_cache, so the memory come from
> > the zswap entry struct and compressed page.
> In those cases, if we drop the swap_slots_cache, it will also free
> those zswap entries and compressed pages (zpool), right?
>
> > Count = SWAP_BATCH * ncpu.
>
> That is the upper limit. Not all CPUs have swap batches fully loaded.

Yes.

>
> > Solution: move the zswap_invalidate() out of batches, free it once the swap
> > count equal to 0.
> Per previous discussion, this will have an impact on the
> swap_slot_cache behavior.
> We need some data points for cost benefit analysis.
>
> > For case (2), the zswap_writeback_entry() will failed for !page_was_allocated
> > because zswap_load will have two copies of the same page in memory
> > (compressed and uncompressed) after faulting in a page from zswap when
> > zswap_exclusive_loads disabled. The amount of memory is greater but depends
> > on the usage.
>
> That is basically disable the future swap out page IO write
> optimization that skip the write if the page hasn't changed. If the
> system are low on memory, that is justifiable. Again, it seems we can
> have a pass to drop the compressed memory if the swap count is zero
> (and mark page dirty).
>

OK.

> >
> > Why do we need to release them?
> > Consider this scenario,there is a lot of data cached in memory and zswap,
> > hit the limit，and shrink_worker will fail. The new coming data will be written
>
> Yes, the shrink_worker will need to allocate a page to store
> uncompressed data for write back.
>
> > directly to swap due to zswap_store failure. Should we free the last one
> > to store the latest one in zswap.
>
> The "last one" you mean the most recent zswap entry written into zswap?
> Well, you need to allocate a page to write it out, that is an async process.
> Shrink the zpool now is kind of too late already.
>

The last zswap_entry in zswap_pool->lru.

> > According to the previous discussion, the writeback is inevitable.
> > So I want to make zswap_exclusive_loads_enabled the default behavior
> > or make it the only way to do zswap loads. It only makes sense when
>
> We need some data point for how often we swap it out to zswap again,
> where the zswap out write can be saved by using the existing compressed data.
>
> It is totally possible this page IO write out optimization is not
> worthwhile for zswap.
> We need some data to support that claim.
>

Got it. I will find it.

> > the page is read and no longer dirty. If the page is read frequently, it
> > should stay in cache rather than zswap. The benefit of doing this is
> > very small, i.e. two copies of the same page in memory.
>
> If the benefit of doing this is very small, that seems to be the
> argument against this patch?
> Again we need some data points for cost and benefit analysis.
>

Yes, it is the new idea to make zswap_exclusive_loads_enabled the
default behavior or make it the only way to do zswap loads.

> Chris

2023-11-20 03:21:33

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Chris Li <[email protected]> writes:

> On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>>
>> Not bypassing the swap slot cache, just make the callbacks to
>> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> no longer used and is entering the swap slot cache (i.e. when
>> free_swap_slot() is called), instead of when draining the swap slot
>> cache (i.e. when swap_range_free() is called). For all parts of MM
>> outside of swap, the swap entry is freed when free_swap_slot() is
>> called. We don't free it immediately because of caching, but this
>> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>
> That will cancel the batching effect on the swap slot free, making the
> common case for swapping faults take longer to complete, righ?
> If I recall correctly, the uncharge is the expensive part of the swap
> slot free operation.
> I just want to figure out what we are trading off against. This is not
> one side wins all situations.

Per my understanding, we don't batch memcg uncharging in
swap_entry_free() now. Although it's possible and may improve
performance.

--
Best Regards,
Huang, Ying

2023-11-20 05:32:33

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> > That will cancel the batching effect on the swap slot free, making the
> > common case for swapping faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Per my understanding, we don't batch memcg uncharging in
> swap_entry_free() now. Although it's possible and may improve
> performance.

swap_entry_free() does not do batching, it is at the caller level.

I just checked. The batching is done in free_swap_slot() is still
using swap slot cache and batching.
It uses swapcache_free_entries() to batch free the swap_slots. That is
where the uncharge happens per my understanding.

Chris

2023-11-20 05:42:32

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Chris Li <[email protected]> writes:

> On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>> > That will cancel the batching effect on the swap slot free, making the
>> > common case for swapping faults take longer to complete, righ?
>> > If I recall correctly, the uncharge is the expensive part of the swap
>> > slot free operation.
>> > I just want to figure out what we are trading off against. This is not
>> > one side wins all situations.
>>
>> Per my understanding, we don't batch memcg uncharging in
>> swap_entry_free() now. Although it's possible and may improve
>> performance.
>
> swap_entry_free() does not do batching, it is at the caller level.
>
> I just checked. The batching is done in free_swap_slot() is still
> using swap slot cache and batching.
> It uses swapcache_free_entries() to batch free the swap_slots. That is
> where the uncharge happens per my understanding.

Per my understanding, memcg uncharging happens in

swapcache_free_entries()
swap_entry_free()
mem_cgroup_uncharge_swap()

The swap entries are uncharged one-by-one, not

acquire lock in memcg
uncharge all entries
release lock in memcg

--
Best Regards,
Huang, Ying

2023-11-20 05:52:43

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Sun, Nov 19, 2023 at 9:41 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>
> Per my understanding, memcg uncharging happens in
>
> swapcache_free_entries()
> swap_entry_free()
> mem_cgroup_uncharge_swap()
>
> The swap entries are uncharged one-by-one, not

Yes. That matches my understanding as well.
I think I am using the term "batching" very loosely. My bad and thanks
for the clarification.

I am referring to the fact that in most cases, the free_swap_slot()
does not perform uncharge.
It is grouped together with other entries to uncharge together inside
swapcache_free_entries(). Yes, the uncharge itself is done by a for
loop page by page. No batching in the for loop. BTW, Not all uncharges
can be batched, because they can come from different memcg.

Chris

2023-11-20 18:53:34

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>
> Chris Li <[email protected]> writes:
>
> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> >>
> >> Not bypassing the swap slot cache, just make the callbacks to
> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> no longer used and is entering the swap slot cache (i.e. when
> >> free_swap_slot() is called), instead of when draining the swap slot
> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> called. We don't free it immediately because of caching, but this
> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >
> > That will cancel the batching effect on the swap slot free, making the
> > common case for swapping faults take longer to complete, righ?
> > If I recall correctly, the uncharge is the expensive part of the swap
> > slot free operation.
> > I just want to figure out what we are trading off against. This is not
> > one side wins all situations.
>
> Per my understanding, we don't batch memcg uncharging in
> swap_entry_free() now. Although it's possible and may improve
> performance.

Yes. It actually causes a long tail in swapin fault latency as Chris
discovered in our prod. I am wondering if doing the memcg uncharging
outside the slots cache will actually amortize the cost instead.

Regardless of memcg charging, which is more complicated, I think we
should at least move the call to zswap_invalidate() before the slots
cache. I would prefer that we move everything non-swapfile specific
outside the slots cache layer (zswap_invalidate(),
arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
mem_cgroup_uncharge_swap(), ..). However, if some of those are
controversial, we can move some of them for now.

When draining free swap slots from the cache, swap_range_free() is
called with nr_entries == 1 anyway, so I can't see how any batching is
going on. If anything it should help amortize the cost.

>
> --
> Best Regards,
> Huang, Ying

2023-11-21 00:57:20

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Yosry Ahmed <[email protected]> writes:

> On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>>
>> Chris Li <[email protected]> writes:
>>
>> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>> >>
>> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> no longer used and is entering the swap slot cache (i.e. when
>> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> called. We don't free it immediately because of caching, but this
>> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >
>> > That will cancel the batching effect on the swap slot free, making the
>> > common case for swapping faults take longer to complete, righ?
>> > If I recall correctly, the uncharge is the expensive part of the swap
>> > slot free operation.
>> > I just want to figure out what we are trading off against. This is not
>> > one side wins all situations.
>>
>> Per my understanding, we don't batch memcg uncharging in
>> swap_entry_free() now. Although it's possible and may improve
>> performance.
>
> Yes. It actually causes a long tail in swapin fault latency as Chris
> discovered in our prod. I am wondering if doing the memcg uncharging
> outside the slots cache will actually amortize the cost instead.
>
> Regardless of memcg charging, which is more complicated, I think we
> should at least move the call to zswap_invalidate() before the slots
> cache. I would prefer that we move everything non-swapfile specific
> outside the slots cache layer (zswap_invalidate(),
> arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> mem_cgroup_uncharge_swap(), ..). However, if some of those are
> controversial, we can move some of them for now.

That makes sense for me.

> When draining free swap slots from the cache, swap_range_free() is
> called with nr_entries == 1 anyway, so I can't see how any batching is
> going on. If anything it should help amortize the cost.

In swapcache_free_entries(), the sis->lock will be held to free multiple
swap slots via swap_info_get_cont() if possible. This can reduce
sis->lock contention.

--
Best Regards,
Huang, Ying

2023-11-21 01:17:53

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <[email protected]> wrote:
>
> Yosry Ahmed <[email protected]> writes:
>
> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Chris Li <[email protected]> writes:
> >>
> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> >> >>
> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> called. We don't free it immediately because of caching, but this
> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >
> >> > That will cancel the batching effect on the swap slot free, making the
> >> > common case for swapping faults take longer to complete, righ?
> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> > slot free operation.
> >> > I just want to figure out what we are trading off against. This is not
> >> > one side wins all situations.
> >>
> >> Per my understanding, we don't batch memcg uncharging in
> >> swap_entry_free() now. Although it's possible and may improve
> >> performance.
> >
> > Yes. It actually causes a long tail in swapin fault latency as Chris
> > discovered in our prod. I am wondering if doing the memcg uncharging
> > outside the slots cache will actually amortize the cost instead.
> >
> > Regardless of memcg charging, which is more complicated, I think we
> > should at least move the call to zswap_invalidate() before the slots
> > cache. I would prefer that we move everything non-swapfile specific
> > outside the slots cache layer (zswap_invalidate(),
> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> > mem_cgroup_uncharge_swap(), ..). However, if some of those are
> > controversial, we can move some of them for now.
>
> That makes sense for me.
>
> > When draining free swap slots from the cache, swap_range_free() is
> > called with nr_entries == 1 anyway, so I can't see how any batching is
> > going on. If anything it should help amortize the cost.
>
> In swapcache_free_entries(), the sis->lock will be held to free multiple
> swap slots via swap_info_get_cont() if possible. This can reduce
> sis->lock contention.

Ah yes that's a good point. Since most of these callbacks don't
actually access sis, but use the swap entry value itself, I am
guessing the reason we need to hold the lock for all these callbacks
is to prevent swapoff and swapon reusing the same swap entry on a
different swap device, right?

2023-11-21 01:56:34

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Yosry Ahmed <[email protected]> writes:

> On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <[email protected]> wrote:
>>
>> Yosry Ahmed <[email protected]> writes:
>>
>> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Chris Li <[email protected]> writes:
>> >>
>> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>> >> >>
>> >> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> >> no longer used and is entering the swap slot cache (i.e. when
>> >> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> >> called. We don't free it immediately because of caching, but this
>> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >> >
>> >> > That will cancel the batching effect on the swap slot free, making the
>> >> > common case for swapping faults take longer to complete, righ?
>> >> > If I recall correctly, the uncharge is the expensive part of the swap
>> >> > slot free operation.
>> >> > I just want to figure out what we are trading off against. This is not
>> >> > one side wins all situations.
>> >>
>> >> Per my understanding, we don't batch memcg uncharging in
>> >> swap_entry_free() now. Although it's possible and may improve
>> >> performance.
>> >
>> > Yes. It actually causes a long tail in swapin fault latency as Chris
>> > discovered in our prod. I am wondering if doing the memcg uncharging
>> > outside the slots cache will actually amortize the cost instead.
>> >
>> > Regardless of memcg charging, which is more complicated, I think we
>> > should at least move the call to zswap_invalidate() before the slots
>> > cache. I would prefer that we move everything non-swapfile specific
>> > outside the slots cache layer (zswap_invalidate(),
>> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
>> > mem_cgroup_uncharge_swap(), ..). However, if some of those are
>> > controversial, we can move some of them for now.
>>
>> That makes sense for me.
>>
>> > When draining free swap slots from the cache, swap_range_free() is
>> > called with nr_entries == 1 anyway, so I can't see how any batching is
>> > going on. If anything it should help amortize the cost.
>>
>> In swapcache_free_entries(), the sis->lock will be held to free multiple
>> swap slots via swap_info_get_cont() if possible. This can reduce
>> sis->lock contention.
>
> Ah yes that's a good point. Since most of these callbacks don't
> actually access sis, but use the swap entry value itself, I am
> guessing the reason we need to hold the lock for all these callbacks
> is to prevent swapoff and swapon reusing the same swap entry on a
> different swap device, right?

In,

swapcache_free_entries()
swap_entry_free()
swap_range_free()

Quite some sis fields will be accessed.

--
Best Regards,
Huang, Ying

2023-11-21 02:48:27

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <[email protected]> wrote:
>
> Yosry Ahmed <[email protected]> writes:
>
> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <[email protected]> wrote:
> >>
> >> Yosry Ahmed <[email protected]> writes:
> >>
> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
> >> >>
> >> >> Chris Li <[email protected]> writes:
> >> >>
> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
> >> >> >>
> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
> >> >> >> no longer used and is entering the swap slot cache (i.e. when
> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
> >> >> >> called. We don't free it immediately because of caching, but this
> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
> >> >> >
> >> >> > That will cancel the batching effect on the swap slot free, making the
> >> >> > common case for swapping faults take longer to complete, righ?
> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
> >> >> > slot free operation.
> >> >> > I just want to figure out what we are trading off against. This is not
> >> >> > one side wins all situations.
> >> >>
> >> >> Per my understanding, we don't batch memcg uncharging in
> >> >> swap_entry_free() now. Although it's possible and may improve
> >> >> performance.
> >> >
> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
> >> > discovered in our prod. I am wondering if doing the memcg uncharging
> >> > outside the slots cache will actually amortize the cost instead.
> >> >
> >> > Regardless of memcg charging, which is more complicated, I think we
> >> > should at least move the call to zswap_invalidate() before the slots
> >> > cache. I would prefer that we move everything non-swapfile specific
> >> > outside the slots cache layer (zswap_invalidate(),
> >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are
> >> > controversial, we can move some of them for now.
> >>
> >> That makes sense for me.
> >>
> >> > When draining free swap slots from the cache, swap_range_free() is
> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
> >> > going on. If anything it should help amortize the cost.
> >>
> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
> >> swap slots via swap_info_get_cont() if possible. This can reduce
> >> sis->lock contention.
> >
> > Ah yes that's a good point. Since most of these callbacks don't
> > actually access sis, but use the swap entry value itself, I am
> > guessing the reason we need to hold the lock for all these callbacks
> > is to prevent swapoff and swapon reusing the same swap entry on a
> > different swap device, right?
>
> In,
>
> swapcache_free_entries()
> swap_entry_free()
> swap_range_free()
>
> Quite some sis fields will be accessed.

I wasn't referring to this code. I was what's preventing us from
moving the callbacks I mentioned outside the lock (zswap_invalidate(),
arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
really access sis, but perhaps they need the lock to ensure the swap
entry value does not get reused?

>
> --
> Best Regards,
> Huang, Ying

2023-11-21 03:35:54

[permalink] [raw]

Subject: Re: [PATCH] mm:zswap: fix zswap entry reclamation failure in two scenarios

Yosry Ahmed <[email protected]> writes:

> On Mon, Nov 20, 2023 at 5:55 PM Huang, Ying <[email protected]> wrote:
>>
>> Yosry Ahmed <[email protected]> writes:
>>
>> > On Mon, Nov 20, 2023 at 4:57 PM Huang, Ying <[email protected]> wrote:
>> >>
>> >> Yosry Ahmed <[email protected]> writes:
>> >>
>> >> > On Sun, Nov 19, 2023 at 7:20 PM Huang, Ying <[email protected]> wrote:
>> >> >>
>> >> >> Chris Li <[email protected]> writes:
>> >> >>
>> >> >> > On Thu, Nov 16, 2023 at 12:19 PM Yosry Ahmed <[email protected]> wrote:
>> >> >> >>
>> >> >> >> Not bypassing the swap slot cache, just make the callbacks to
>> >> >> >> invalidate the zswap entry, do memg uncharging, etc when the slot is
>> >> >> >> no longer used and is entering the swap slot cache (i.e. when
>> >> >> >> free_swap_slot() is called), instead of when draining the swap slot
>> >> >> >> cache (i.e. when swap_range_free() is called). For all parts of MM
>> >> >> >> outside of swap, the swap entry is freed when free_swap_slot() is
>> >> >> >> called. We don't free it immediately because of caching, but this
>> >> >> >> should be transparent to other parts of MM (e.g. zswap, memcg, etc).
>> >> >> >
>> >> >> > That will cancel the batching effect on the swap slot free, making the
>> >> >> > common case for swapping faults take longer to complete, righ?
>> >> >> > If I recall correctly, the uncharge is the expensive part of the swap
>> >> >> > slot free operation.
>> >> >> > I just want to figure out what we are trading off against. This is not
>> >> >> > one side wins all situations.
>> >> >>
>> >> >> Per my understanding, we don't batch memcg uncharging in
>> >> >> swap_entry_free() now. Although it's possible and may improve
>> >> >> performance.
>> >> >
>> >> > Yes. It actually causes a long tail in swapin fault latency as Chris
>> >> > discovered in our prod. I am wondering if doing the memcg uncharging
>> >> > outside the slots cache will actually amortize the cost instead.
>> >> >
>> >> > Regardless of memcg charging, which is more complicated, I think we
>> >> > should at least move the call to zswap_invalidate() before the slots
>> >> > cache. I would prefer that we move everything non-swapfile specific
>> >> > outside the slots cache layer (zswap_invalidate(),
>> >> > arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
>> >> > mem_cgroup_uncharge_swap(), ..). However, if some of those are
>> >> > controversial, we can move some of them for now.
>> >>
>> >> That makes sense for me.
>> >>
>> >> > When draining free swap slots from the cache, swap_range_free() is
>> >> > called with nr_entries == 1 anyway, so I can't see how any batching is
>> >> > going on. If anything it should help amortize the cost.
>> >>
>> >> In swapcache_free_entries(), the sis->lock will be held to free multiple
>> >> swap slots via swap_info_get_cont() if possible. This can reduce
>> >> sis->lock contention.
>> >
>> > Ah yes that's a good point. Since most of these callbacks don't
>> > actually access sis, but use the swap entry value itself, I am
>> > guessing the reason we need to hold the lock for all these callbacks
>> > is to prevent swapoff and swapon reusing the same swap entry on a
>> > different swap device, right?
>>
>> In,
>>
>> swapcache_free_entries()
>> swap_entry_free()
>> swap_range_free()
>>
>> Quite some sis fields will be accessed.
>
> I wasn't referring to this code. I was what's preventing us from
> moving the callbacks I mentioned outside the lock (zswap_invalidate(),
> arch_swap_invalidate_page(), clear_shadow_from_swap_cache(),
> mem_cgroup_uncharge_swap(), ..). I think most or all of them don't
> really access sis, but perhaps they need the lock to ensure the swap
> entry value does not get reused?

In fact, the swap entries to be freed by swapcache_free_entries() is in
a state that can not be freed by other path (including swapoff()). It's
swap_map value is SWAP_HAS_CACHE, but we can not find folio in
swap_address_space().

To be honest, I don't know whether there are dependencies on sis->lock
in these callbacks. You need to investigate them one by one.

--
Best Regards,
Huang, Ying

2023-11-21 03:38:02