On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
<[email protected]> wrote:
> When testing zswap performance with a kernel build (-j32) in a tmpfs
> directory, I found that the scalability of the zswap rb-tree, which is
> protected by a single spinlock, is not good. That causes heavy lock
> contention when multiple tasks call zswap_store()/zswap_load()
> concurrently.
>
> So a simple solution is to split the single zswap rb-tree into multiple
> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
> 64MB trunks").
>
> Although this method can't solve the spinlock contention completely, it
> can mitigate much of that contention.
By how much? Do you have any stats to estimate the amount of
contention and the reduction by this patch?
I do think lock contention could be a problem here, and it will be
even worse with the zswap shrinker enabled (which introduces a
theoretically unbounded number of concurrent reclaimers hammering on
the zswap rbtree and its lock). I am generally a bit wary of
architectural change though, especially if it is just a band-aid. We
have tried to reduce the lock contention somewhere else (multiple
zpools), and as predicted it just shifted the contention point
elsewhere. Maybe we need a deeper architectural re-think.
Not an outright NACK of course - just food for thought.
>
> Another problem found when testing zswap with our default zsmalloc is
> that zswap_load() and zswap_writeback_entry() have to allocate a
> temporary buffer to support !zpool_can_sleep_mapped().
>
> Optimize this by reusing the percpu crypto_acomp_ctx->dstmem, which is
> also used by zswap_store() and protected by the same percpu
> crypto_acomp_ctx->mutex.
It'd be nice to reduce the (temporary) memory allocation on these
paths, but would this introduce contention on the per-cpu dstmem and
the mutex that protects it, if there are too many concurrent
store/load/writeback requests?
On 2023/12/7 04:08, Nhat Pham wrote:
> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> <[email protected]> wrote:
>> When testing zswap performance with a kernel build (-j32) in a tmpfs
>> directory, I found that the scalability of the zswap rb-tree, which is
>> protected by a single spinlock, is not good. That causes heavy lock
>> contention when multiple tasks call zswap_store()/zswap_load()
>> concurrently.
>>
>> So a simple solution is to split the single zswap rb-tree into multiple
>> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
>> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
>> 64MB trunks").
>>
>> Although this method can't solve the spinlock contention completely, it
>> can mitigate much of that contention.
>
> By how much? Do you have any stats to estimate the amount of
> contention and the reduction by this patch?
Actually, I did some testing with linux-next 20231205 yesterday.

Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.

                       20231205   +patchset
1. !shrinker_enabled:  156s       126s
2. shrinker_enabled:   79s        70s

I think your zswap shrinker fix patch can address the !shrinker_enabled case.

So I will test again today using the new mm-unstable branch.
>
> I do think lock contention could be a problem here, and it will be
> even worse with the zswap shrinker enabled (which introduces a
> theoretically unbounded number of concurrent reclaimers hammering on
> the zswap rbtree and its lock). I am generally a bit wary of
> architectural change though, especially if it is just a band-aid. We
> have tried to reduce the lock contention somewhere else (multiple
> zpools), and as predicted it just shifted the contention point
> elsewhere. Maybe we need a deeper architectural re-think.
>
> Not an outright NACK of course - just food for thought.
>
Right, I think an xarray is good for the lockless read side, and
multiple trees are complementary to that, since they can also reduce
the lock contention on the write side.
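Just to illustrate the multiple-trees part, here is a rough sketch of my
idea (not the actual patch; the zswap_trees[] layout and the
swap_zswap_tree() helper below are only for illustration, and the
per-swapon allocation is elided): each swap entry picks its tree, and
therefore its lock, from the swap offset.

#include <linux/rbtree.h>
#include <linux/spinlock.h>
#include <linux/swap.h>		/* SWAP_ADDRESS_SPACE_SHIFT, MAX_SWAPFILES */
#include <linux/swapops.h>	/* swp_type(), swp_offset() */

struct zswap_tree {
	struct rb_root rbroot;
	spinlock_t lock;
};

/*
 * One array of trees per swap type, allocated at swapon time with one
 * tree per SWAP_ADDRESS_SPACE_PAGES (64M) of swap slots.
 */
static struct zswap_tree *zswap_trees[MAX_SWAPFILES];

static struct zswap_tree *swap_zswap_tree(swp_entry_t swp)
{
	return &zswap_trees[swp_type(swp)][swp_offset(swp)
					   >> SWAP_ADDRESS_SPACE_SHIFT];
}

Entries in different 64M ranges then take different spinlocks, so
concurrent zswap_store()/zswap_load() callers only contend when they
touch the same range.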
>>
>> Another problem found when testing zswap with our default zsmalloc is
>> that zswap_load() and zswap_writeback_entry() have to allocate a
>> temporary buffer to support !zpool_can_sleep_mapped().
>>
>> Optimize this by reusing the percpu crypto_acomp_ctx->dstmem, which is
>> also used by zswap_store() and protected by the same percpu
>> crypto_acomp_ctx->mutex.
>
> It'd be nice to reduce the (temporary) memory allocation on these
> paths, but would this introduce contention on the per-cpu dstmem and
> the mutex that protects it, if there are too many concurrent
> store/load/writeback requests?
I think the mutex hold time is not changed, right? So the contention
on the per-cpu mutex should be the same. We just reuse the percpu dstmem more.
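Roughly like this in the load/writeback path (just a sketch of the idea,
not the actual patch: error handling and the decompress setup are
elided, and the zswap.c names used here should be treated as
approximate):

	struct crypto_acomp_ctx *acomp_ctx;
	struct zpool *zpool = zswap_find_zpool(entry);
	u8 *src;

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);

	src = zpool_map_handle(zpool, entry->handle, ZPOOL_MM_RO);
	if (!zpool_can_sleep_mapped(zpool)) {
		/*
		 * Instead of kmalloc()ing a bounce buffer, copy into the
		 * per-CPU dstmem that zswap_store() already uses; it is
		 * protected by the acomp_ctx->mutex held here.
		 */
		memcpy(acomp_ctx->dstmem, src, entry->length);
		src = acomp_ctx->dstmem;
		zpool_unmap_handle(zpool, entry->handle);
	}

	/* ... sg list setup and crypto_acomp_decompress() into the page ... */

	if (zpool_can_sleep_mapped(zpool))
		zpool_unmap_handle(zpool, entry->handle);
	mutex_unlock(acomp_ctx->mutex);

The mutex is held for about the same span as before; only the
kmalloc()/kfree() of the temporary buffer goes away.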
Thanks!
On 2023/12/7 11:13, Chengming Zhou wrote:
> On 2023/12/7 04:08, Nhat Pham wrote:
>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
>> <[email protected]> wrote:
>>> When testing zswap performance with a kernel build (-j32) in a tmpfs
>>> directory, I found that the scalability of the zswap rb-tree, which is
>>> protected by a single spinlock, is not good. That causes heavy lock
>>> contention when multiple tasks call zswap_store()/zswap_load()
>>> concurrently.
>>>
>>> So a simple solution is to split the single zswap rb-tree into multiple
>>> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
>>> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
>>> 64MB trunks").
>>>
>>> Although this method can't solve the spinlock contention completely, it
>>> can mitigate much of that contention.
>>
>> By how much? Do you have any stats to estimate the amount of
>> contention and the reduction by this patch?
>
> Actually, I did some testing with linux-next 20231205 yesterday.
>
> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
>
>                        20231205   +patchset
> 1. !shrinker_enabled:  156s       126s
> 2. shrinker_enabled:   79s        70s
>
> I think your zswap shrinker fix patch can address the !shrinker_enabled case.
>
> So I will test again today using the new mm-unstable branch.
>
Updated test data based on today's mm-unstable branch:

                       mm-unstable   +patchset
1. !shrinker_enabled:  86s           74s
2. shrinker_enabled:   63s           61s

This shows a much smaller improvement for the shrinker_enabled case, but
still a large improvement for the !shrinker_enabled case.
Thanks!
On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
<[email protected]> wrote:
>
> On 2023/12/7 11:13, Chengming Zhou wrote:
> > On 2023/12/7 04:08, Nhat Pham wrote:
> >> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
> >> <[email protected]> wrote:
> >>> When testing zswap performance with a kernel build (-j32) in a tmpfs
> >>> directory, I found that the scalability of the zswap rb-tree, which is
> >>> protected by a single spinlock, is not good. That causes heavy lock
> >>> contention when multiple tasks call zswap_store()/zswap_load()
> >>> concurrently.
> >>>
> >>> So a simple solution is to split the single zswap rb-tree into multiple
> >>> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
> >>> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
> >>> 64MB trunks").
> >>>
> >>> Although this method can't solve the spinlock contention completely, it
> >>> can mitigate much of that contention.
> >>
> >> By how much? Do you have any stats to estimate the amount of
> >> contention and the reduction by this patch?
> >
> > Actually, I did some testing with linux-next 20231205 yesterday.
> >
> > Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
> >
> >                        20231205   +patchset
> > 1. !shrinker_enabled:  156s       126s
> > 2. shrinker_enabled:   79s        70s
> >
> > I think your zswap shrinker fix patch can address the !shrinker_enabled case.
> >
> > So I will test again today using the new mm-unstable branch.
> >
>
> Updated test data based on today's mm-unstable branch:
>
>                        mm-unstable   +patchset
> 1. !shrinker_enabled:  86s           74s
> 2. shrinker_enabled:   63s           61s
>
> This shows a much smaller improvement for the shrinker_enabled case, but
> still a large improvement for the !shrinker_enabled case.
>
> Thanks!
I'm gonna assume this is build time since it makes the zswap shrinker
look pretty good :)
I think this just means some of the gains from this patchset and
the zswap shrinker overlap. But on the positive note:

a) Both are complementary, i.e. enabling both (bottom-right cell) gives
us the best result.
b) Each individual change improves the runtime. If you disable the
shrinker, then this patch helps tremendously, so we're onto something.
c) The !shrinker_enabled case is no longer *too* bad - once again, thanks
for noticing the regression and helping me fix it! In fact, every cell
improves compared to the last run. Woohoo!
On Thu, Dec 7, 2023 at 10:15 AM Nhat Pham <[email protected]> wrote:
>
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
> <[email protected]> wrote:
> >
> > Updated test data based on today's mm-unstable branch:
> >
> >                        mm-unstable   +patchset
> > 1. !shrinker_enabled:  86s           74s
> > 2. shrinker_enabled:   63s           61s
> >
> > This shows a much smaller improvement for the shrinker_enabled case, but
> > still a large improvement for the !shrinker_enabled case.
> >
> > Thanks!
>
> I'm gonna assume this is build time since it makes the zswap shrinker
> look pretty good :)
> I think this just means some of the gains from this patchset and
> the zswap shrinker overlap. But on the positive note:
>
> a) Both are complementary, i.e. enabling both (bottom-right cell) gives
> us the best result.
> b) Each individual change improves the runtime. If you disable the
> shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled case is no longer *too* bad - once again, thanks
> for noticing the regression and helping me fix it! In fact, every cell
> improves compared to the last run. Woohoo!
Oh and, another thing that might be helpful for observing the reduction
in lock contention (and comparing approaches if necessary) is the kind
of analysis that Yosry performed for the multiple zpools change:
https://lore.kernel.org/lkml/[email protected]/
We could look at the various paths that use the rb-tree and see how
long we're spinning on the lock(s), etc.
On 2023/12/8 02:15, Nhat Pham wrote:
> On Thu, Dec 7, 2023 at 7:18 AM Chengming Zhou
> <[email protected]> wrote:
>>
>> On 2023/12/7 11:13, Chengming Zhou wrote:
>>> On 2023/12/7 04:08, Nhat Pham wrote:
>>>> On Wed, Dec 6, 2023 at 1:46 AM Chengming Zhou
>>>> <[email protected]> wrote:
>>>>> When testing zswap performance with a kernel build (-j32) in a tmpfs
>>>>> directory, I found that the scalability of the zswap rb-tree, which is
>>>>> protected by a single spinlock, is not good. That causes heavy lock
>>>>> contention when multiple tasks call zswap_store()/zswap_load()
>>>>> concurrently.
>>>>>
>>>>> So a simple solution is to split the single zswap rb-tree into multiple
>>>>> rb-trees, each corresponding to SWAP_ADDRESS_SPACE_PAGES (64M). This
>>>>> idea comes from commit 4b3ef9daa4fc ("mm/swap: split swap cache into
>>>>> 64MB trunks").
>>>>>
>>>>> Although this method can't solve the spinlock contention completely, it
>>>>> can mitigate much of that contention.
>>>>
>>>> By how much? Do you have any stats to estimate the amount of
>>>> contention and the reduction by this patch?
>>>
>>> Actually, I did some testing with linux-next 20231205 yesterday.
>>>
>>> Testcase: memory.max = 2G, zswap enabled, make -j32 in tmpfs.
>>>
>>>                        20231205   +patchset
>>> 1. !shrinker_enabled:  156s       126s
>>> 2. shrinker_enabled:   79s        70s
>>>
>>> I think your zswap shrinker fix patch can address the !shrinker_enabled case.
>>>
>>> So I will test again today using the new mm-unstable branch.
>>>
>>
>> Updated test data based on today's mm-unstable branch:
>>
>>                        mm-unstable   +patchset
>> 1. !shrinker_enabled:  86s           74s
>> 2. shrinker_enabled:   63s           61s
>>
>> This shows a much smaller improvement for the shrinker_enabled case, but
>> still a large improvement for the !shrinker_enabled case.
>>
>> Thanks!
>
> I'm gonna assume this is build time since it makes the zswap shrinker
> look pretty good :)
> I think this just means some of the gains from this patchset and
> the zswap shrinker overlap. But on the positive note:
>
> a) Both are complementary, i.e. enabling both (bottom-right cell) gives
> us the best result.
Right, both optimizations are complementary and together make zswap perform better :)
> b) Each individual change improves the runtime. If you disable the
> shrinker, then this patch helps tremendously, so we're onto something.
> c) The !shrinker_enabled case is no longer *too* bad - once again, thanks
> for noticing the regression and helping me fix it! In fact, every cell
> improves compared to the last run. Woohoo!
It's my pleasure! Thanks!