+ Chris Li
Chris, I vaguely remember from our last conversation that you have
some concurrent efforts to use xarray here right?
On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
<[email protected]> wrote:
>
> Hi everyone,
>
> This patch series is based on the linux-next 20231205, which depends on
> the "workload-specific and memory pressure-driven zswap writeback" series
> from Nhat Pham.
>
> When testing the zswap performance by using kernel build -j32 in a tmpfs
> directory, I found the scalability of zswap rb-tree is not good, which
> is protected by the only spinlock. That would cause heavy lock contention
> if multiple tasks zswap_store/load concurrently.
>
> So a simple solution is to split the only one zswap rb-tree into multiple
> rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
> from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of that contention.
>
> Another problem when testing the zswap using our default zsmalloc is that
> zswap_load() and zswap_writeback_entry() have to malloc a temporary memory
> to support !zpool_can_sleep_mapped().
>
> Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also
> used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.
>
> Thanks for review and comment!
>
> To: Andrew Morton <[email protected]>
> To: Seth Jennings <[email protected]>
> To: Dan Streetman <[email protected]>
> To: Vitaly Wool <[email protected]>
> To: Nhat Pham <[email protected]>
> To: Johannes Weiner <[email protected]>
> To: Yosry Ahmed <[email protected]>
> To: Michal Hocko <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Signed-off-by: Chengming Zhou <[email protected]>
>
> ---
> Chengming Zhou (7):
> mm/zswap: make sure each swapfile always have zswap rb-tree
> mm/zswap: split zswap rb-tree
> mm/zswap: reuse dstmem when decompress
> mm/zswap: change dstmem size to one page
> mm/zswap: refactor out __zswap_load()
> mm/zswap: cleanup zswap_load()
> mm/zswap: cleanup zswap_reclaim_entry()
>
> include/linux/zswap.h | 4 +-
> mm/swapfile.c | 10 ++-
> mm/zswap.c | 233 +++++++++++++++++++++-----------------------------
> 3 files changed, 106 insertions(+), 141 deletions(-)
> ---
> base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
> change-id: 20231206-zswap-lock-optimize-06f45683b02b
>
> Best regards,
> --
> Chengming Zhou <[email protected]>
On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <[email protected]> wrote:
>
> + Chris Li
>
> Chris, I vaguely remember from our last conversation that you have
> some concurrent efforts to use xarray here right?
If I recall correctly, the xarray already reduces the lock contention
as lookups are lockless, but Chris knows more here. As you mentioned
in a different email, it would be nice to get some data so that we can
compare different solutions.
>
> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> <[email protected]> wrote:
> >
> > Hi everyone,
> >
> > This patch series is based on the linux-next 20231205, which depends on
> > the "workload-specific and memory pressure-driven zswap writeback" series
> > from Nhat Pham.
> >
> > When testing the zswap performance by using kernel build -j32 in a tmpfs
> > directory, I found the scalability of zswap rb-tree is not good, which
> > is protected by the only spinlock. That would cause heavy lock contention
> > if multiple tasks zswap_store/load concurrently.
> >
> > So a simple solution is to split the only one zswap rb-tree into multiple
> > rb-trees, each corresponds to SWAP_ADDRESS_SPACE_PAGES (64M). This idea is
> > from the commit 4b3ef9daa4fc ("mm/swap: split swap cache into 64MB trunks").
> >
> > Although this method can't solve the spinlock contention completely, it
> > can mitigate much of that contention.
> >
> > Another problem when testing the zswap using our default zsmalloc is that
> > zswap_load() and zswap_writeback_entry() have to malloc a temporary memory
> > to support !zpool_can_sleep_mapped().
> >
> > Optimize it by reusing the percpu crypto_acomp_ctx->dstmem, which is also
> > used by zswap_store() and protected by the same percpu crypto_acomp_ctx->mutex.
> >
> > Thanks for review and comment!
> >
> > To: Andrew Morton <[email protected]>
> > To: Seth Jennings <[email protected]>
> > To: Dan Streetman <[email protected]>
> > To: Vitaly Wool <[email protected]>
> > To: Nhat Pham <[email protected]>
> > To: Johannes Weiner <[email protected]>
> > To: Yosry Ahmed <[email protected]>
> > To: Michal Hocko <[email protected]>
> > Cc: [email protected]
> > Cc: [email protected]
> > Signed-off-by: Chengming Zhou <[email protected]>
> >
> > ---
> > Chengming Zhou (7):
> > mm/zswap: make sure each swapfile always have zswap rb-tree
> > mm/zswap: split zswap rb-tree
> > mm/zswap: reuse dstmem when decompress
> > mm/zswap: change dstmem size to one page
> > mm/zswap: refactor out __zswap_load()
> > mm/zswap: cleanup zswap_load()
> > mm/zswap: cleanup zswap_reclaim_entry()
> >
> > include/linux/zswap.h | 4 +-
> > mm/swapfile.c | 10 ++-
> > mm/zswap.c | 233 +++++++++++++++++++++-----------------------------
> > 3 files changed, 106 insertions(+), 141 deletions(-)
> > ---
> > base-commit: 0f5f12ac05f36f117e793656c3f560625e927f1b
> > change-id: 20231206-zswap-lock-optimize-06f45683b02b
> >
> > Best regards,
> > --
> > Chengming Zhou <[email protected]>
Hi Nhat and Yosry,
On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <[email protected]> wrote:
>
> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <[email protected]> wrote:
> >
> > + Chris Li
> >
> > Chris, I vaguely remember from our last conversation that you have
> > some concurrent efforts to use xarray here right?
Yes, I do have the zswap xarray for older versions of the kernel. The
recent mm-unstable tree has a lot of zswap related updates. Give me 2
days to refresh and post it. The zswap invalid entry and the reference
count change is causing a good portion of the code to be updated. That
is the price to pay keeping out of tree patches. My fault is not
getting to it sooner.
>
> If I recall correctly, the xarray already reduces the lock contention
> as lookups are lockless, but Chris knows more here. As you mentioned
Yes. To be exact, xarray can use spin lock (same as current RB tree)
or take RCU read lock on the lookup path (depending on how you call
the xarray API). Not completely lockless but the RCU read lock should
have less lock contention than normal spinlock. +Matthew
> in a different email, it would be nice to get some data so that we can
> compare different solutions.
Yes, it is certainly welcome to see more data points. If I recall
correctly, the zswap xarray array makes the lookup similar to the swap
cache lookup. It has a noticeable difference in the long tail end.
Stay tuned.
Chris
On 2023/12/7 08:43, Chris Li wrote:
> Hi Nhat and Yosry,
>
> On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <[email protected]> wrote:
>>
>> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <[email protected]> wrote:
>>>
>>> + Chris Li
>>>
>>> Chris, I vaguely remember from our last conversation that you have
>>> some concurrent efforts to use xarray here right?
>
> Yes, I do have the zswap xarray for older versions of the kernel. The
> recent mm-unstable tree has a lot of zswap related updates. Give me 2
> days to refresh and post it. The zswap invalid entry and the reference
> count change is causing a good portion of the code to be updated. That
> is the price to pay keeping out of tree patches. My fault is not
> getting to it sooner.
>
>>
>> If I recall correctly, the xarray already reduces the lock contention
>> as lookups are lockless, but Chris knows more here. As you mentioned
>
> Yes. To be exact, xarray can use spin lock (same as current RB tree)
> or take RCU read lock on the lookup path (depending on how you call
> the xarray API). Not completely lockless but the RCU read lock should
> have less lock contention than normal spinlock. +Matthew
>
Great! Lockless lookup in zswap_load() should reduce spinlock contention.
And multiple trees (multiple xarrays) can further reduce the contention
on the concurrent zswap_store() side. So it's complementary IMHO.
>> in a different email, it would be nice to get some data so that we can
>> compare different solutions.
>
> Yes, it is certainly welcome to see more data points. If I recall
> correctly, the zswap xarray array makes the lookup similar to the swap
> cache lookup. It has a noticeable difference in the long tail end.
>
Right, I post some data from yesterday in another reply.
Will test again and update the data since Nhat's zswap shrinker fix patch
has been merged into mm-unstable today.
Thanks!
On Wed, Dec 6, 2023 at 7:25 PM Chengming Zhou
<[email protected]> wrote:
>
> On 2023/12/7 08:43, Chris Li wrote:
> > Hi Nhat and Yosry,
> >
> > On Wed, Dec 6, 2023 at 12:42 PM Yosry Ahmed <[email protected]> wrote:
> >>
> >> On Wed, Dec 6, 2023 at 9:24 AM Nhat Pham <[email protected]> wrote:
> >>>
> >>> + Chris Li
> >>>
> >>> Chris, I vaguely remember from our last conversation that you have
> >>> some concurrent efforts to use xarray here right?
> >
> > Yes, I do have the zswap xarray for older versions of the kernel. The
> > recent mm-unstable tree has a lot of zswap related updates. Give me 2
> > days to refresh and post it. The zswap invalid entry and the reference
> > count change is causing a good portion of the code to be updated. That
> > is the price to pay keeping out of tree patches. My fault is not
> > getting to it sooner.
> >
> >>
> >> If I recall correctly, the xarray already reduces the lock contention
> >> as lookups are lockless, but Chris knows more here. As you mentioned
> >
> > Yes. To be exact, xarray can use spin lock (same as current RB tree)
> > or take RCU read lock on the lookup path (depending on how you call
> > the xarray API). Not completely lockless but the RCU read lock should
> > have less lock contention than normal spinlock. +Matthew
> >
>
> Great! Lockless lookup in zswap_load() should reduce spinlock contention.
> And multiple trees (multiple xarrays) can further reduce the contention
> on the concurrent zswap_store() side. So it's complementary IMHO.
>
> >> in a different email, it would be nice to get some data so that we can
> >> compare different solutions.
> >
> > Yes, it is certainly welcome to see more data points. If I recall
> > correctly, the zswap xarray array makes the lookup similar to the swap
> > cache lookup. It has a noticeable difference in the long tail end.
> >
>
> Right, I post some data from yesterday in another reply.
> Will test again and update the data since Nhat's zswap shrinker fix patch
> has been merged into mm-unstable today.
>
> Thanks!
Let's split the rbtree breakdown into a separate series. This series
has irrelevant (and very nice) cleanups and optimizations, let's get
them separately and defer the rbtree breakdown part until we get data
about the xarray implementation. Perhaps the tree breakdown is not
needed as much with an xarray, or at the very least the implementation
would look different on top of an xarray.
On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <[email protected]> wrote:
>
> Let's split the rbtree breakdown into a separate series. This series
> has irrelevant (and very nice) cleanups and optimizations, let's get
> them separately and defer the rbtree breakdown part until we get data
> about the xarray implementation. Perhaps the tree breakdown is not
> needed as much with an xarray, or at the very least the implementation
> would look different on top of an xarray.
Actually, kinda agree - I quite like the cleanup/optimization done
w.r.t dstmem reuse :)
On 2023/12/13 07:33, Nhat Pham wrote:
> On Tue, Dec 12, 2023 at 3:27 PM Yosry Ahmed <[email protected]> wrote:
>>
>> Let's split the rbtree breakdown into a separate series. This series
>> has irrelevant (and very nice) cleanups and optimizations, let's get
>> them separately and defer the rbtree breakdown part until we get data
Ok, will split and just send the cleanups/optimizations with dstmem reuse.
>> about the xarray implementation. Perhaps the tree breakdown is not
>> needed as much with an xarray, or at the very least the implementation
>> would look different on top of an xarray.
Yeah, will retest on the xarray version of Chris, the implementation is
easy anyway.
>
> Actually, kinda agree - I quite like the cleanup/optimization done
> w.r.t dstmem reuse :)
Thanks!