2020-12-19 10:09:34

by Mike Galbraith

Subject: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

Greetings,

Beating on zswap+zsmalloc leads to the splat below; a possible patch for it
follows.

[ 139.794413] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:935
[ 139.794585] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 907, name: avahi-daemon
[ 139.794608] 2 locks held by avahi-daemon/907:
[ 139.794621] #0: ffff8881010766d8 (&mm->mmap_lock#2){++++}-{3:3}, at: exc_page_fault+0x15b/0x6e0
[ 139.794659] #1: ffff8881b6b13110 (&zspage->lock){.+.+}-{2:2}, at: zs_map_object+0x89/0x350
[ 139.794691] Preemption disabled at:
[ 139.794693] [<ffffffff812763a8>] zs_map_object+0x38/0x350
[ 139.794719] CPU: 0 PID: 907 Comm: avahi-daemon Kdump: loaded Tainted: G E 5.10.0.g3644e2d-preempt #4
[ 139.794746] Hardware name: MEDION MS-7848/MS-7848, BIOS M7848W08.20C 09/23/2013
[ 139.794766] Call Trace:
[ 139.794775] dump_stack+0x77/0x97
[ 139.794788] ___might_sleep+0x17d/0x260
[ 139.794806] __mutex_lock+0x47/0x9d0
[ 139.794823] ? zswap_frontswap_load+0xcb/0x240
[ 139.794841] ? zs_map_object+0x248/0x350
[ 139.794858] ? zswap_frontswap_load+0xcb/0x240
[ 139.794871] zswap_frontswap_load+0xcb/0x240
[ 139.794886] ? lru_cache_add+0xe9/0x1b0
[ 139.794904] __frontswap_load+0x6e/0xd0
[ 139.794921] swap_readpage+0x70/0x240
[ 139.794939] swap_cluster_readahead+0x1d9/0x280
[ 139.794964] ? swapin_readahead+0x140/0x3e0
[ 139.794979] swapin_readahead+0x140/0x3e0
[ 139.794997] ? lookup_swap_cache+0x5c/0x190
[ 139.795016] ? do_swap_page+0x2f7/0x900
[ 139.795030] do_swap_page+0x2f7/0x900
[ 139.795046] handle_mm_fault+0xa06/0x13b0
[ 139.795083] exc_page_fault+0x2a3/0x6e0
[ 139.795110] ? asm_exc_page_fault+0x1e/0x30
[ 139.795139] ? asm_exc_page_fault+0x8/0x30
[ 139.795166] asm_exc_page_fault+0x1e/0x30
[ 139.795191] RIP: 0033:0x560caa4074d0
[ 139.795215] Code: 89 05 e4 74 21 00 0f 84 28 06 00 00 e8 d9 13 00 00 e8 b4 12 00 00 44 8b 25 b9 75 21 00 45 85 e4 0f 85 ac 04 00 00 41 83 cf ff <48> 8b 3d b1 74 21 00 44 89 fe e8 d1 f6 ff ff 85 c0 89 c3 0f 88 d9
[ 139.795299] RSP: 002b:00007ffeb4ec5ce0 EFLAGS: 00010246
[ 139.795316] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 00007ff1616e4db4
[ 139.795336] RDX: 0000000000000001 RSI: 00007ffeb4ec5c5f RDI: 0000000000000006
[ 139.795356] RBP: 0000560cab63be60 R08: 0000560cab656220 R09: 0000560cab670750
[ 139.795376] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[ 139.795395] R13: 0000560cab63f550 R14: 0000560cab63c000 R15: 00000000ffffffff

(CC Sebastian because RT has the ~same problem at the same spot, curable
the same way, or by fixing the upstream buglet, then taking backports+fix
and nuking the associated RT patch)

mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

zsmalloc map/unmap methods use preemption disabling bit spinlocks. Take the
mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
in zswap_frontswap_store().

Signed-off-by: Mike Galbraith <[email protected]>
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned

/* decompress */
dlen = PAGE_SIZE;
+ acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ mutex_lock(acomp_ctx->mutex);
src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
if (zpool_evictable(entry->pool->zpool))
src += sizeof(struct zswap_header);

- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
- mutex_lock(acomp_ctx->mutex);
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
sg_set_page(&output, page, PAGE_SIZE, 0);
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
- mutex_unlock(acomp_ctx->mutex);

zpool_unmap_handle(entry->pool->zpool, entry->handle);
+ mutex_unlock(acomp_ctx->mutex);
BUG_ON(ret);

freeentry:


2020-12-19 10:15:44

by Mike Galbraith

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

(mailer partially munged formatting? resend)

mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

zsmalloc map/unmap methods use preemption disabling bit spinlocks. Take the
mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
in zswap_frontswap_store().

Signed-off-by: Mike Galbraith <[email protected]>
Fixes: 1ec3b5fe6eec ("mm/zswap: move to use crypto_acomp API for hardware acceleration")
---
mm/zswap.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned

/* decompress */
dlen = PAGE_SIZE;
+ acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ mutex_lock(acomp_ctx->mutex);
src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
if (zpool_evictable(entry->pool->zpool))
src += sizeof(struct zswap_header);

- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
- mutex_lock(acomp_ctx->mutex);
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
sg_set_page(&output, page, PAGE_SIZE, 0);
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
- mutex_unlock(acomp_ctx->mutex);

zpool_unmap_handle(entry->pool->zpool, entry->handle);
+ mutex_unlock(acomp_ctx->mutex);
BUG_ON(ret);

freeentry:

2020-12-19 10:23:19

by Vitaly Wool

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

Hi Mike,

On Sat, Dec 19, 2020 at 11:12 AM Mike Galbraith <[email protected]> wrote:
>
> (mailer partially munged formatting? resend)
>
> mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat
>
> zsmalloc map/unmap methods use preemption disabling bit spinlocks. Take the
> mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
> in zswap_frontswap_store().

oh wait... So is zsmalloc taking a spin lock in its map callback and
releasing it only in unmap? In this case, I would rather keep zswap as
is, mark zsmalloc as RT unsafe and have zsmalloc maintainer fix it.

Best regards,
Vitaly

> Signed-off-by: Mike Galbraith <[email protected]>
> Fixes: 1ec3b5fe6eec "mm/zswap: move to use crypto_acomp API for hardware acceleration"
> ---
> mm/zswap.c | 6 +++---
> 1 file changed, 3 insertions(+), 3 deletions(-)
>
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned
>
> /* decompress */
> dlen = PAGE_SIZE;
> + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> + mutex_lock(acomp_ctx->mutex);
> src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> if (zpool_evictable(entry->pool->zpool))
> src += sizeof(struct zswap_header);
>
> - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> - mutex_lock(acomp_ctx->mutex);
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_page(&output, page, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> - mutex_unlock(acomp_ctx->mutex);
>
> zpool_unmap_handle(entry->pool->zpool, entry->handle);
> + mutex_unlock(acomp_ctx->mutex);
> BUG_ON(ret);
>
> freeentry:
>

2020-12-19 10:29:45

by Mike Galbraith

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

On Sat, 2020-12-19 at 11:20 +0100, Vitaly Wool wrote:
> Hi Mike,
>
> On Sat, Dec 19, 2020 at 11:12 AM Mike Galbraith <[email protected]> wrote:
> >
> > (mailer partially munged formatting? resend)
> >
> > mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat
> >
> > zsmalloc map/unmap methods use preemption disabling bit spinlocks. Take the
> > mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
> > in zswap_frontswap_store().
>
> oh wait... So is zsmalloc taking a spin lock in its map callback and
> releasing it only in unmap? In this case, I would rather keep zswap as
> is, mark zsmalloc as RT unsafe and have zsmalloc maintainer fix it.

The kernel that generated that splat was NOT an RT kernel, it was plain
master.today with a PREEMPT config.

-Mike

2020-12-19 10:49:47

by Vitaly Wool

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

On Sat, 19 Dec 2020, 11:27 Mike Galbraith, <[email protected]> wrote:
>
> On Sat, 2020-12-19 at 11:20 +0100, Vitaly Wool wrote:
> > Hi Mike,
> >
> > On Sat, Dec 19, 2020 at 11:12 AM Mike Galbraith <[email protected]> wrote:
> > >
> > > (mailer partially munged formatting? resend)
> > >
> > > mm/zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat
> > >
> > > zsmalloc map/unmap methods use preemption disabling bit spinlocks. Take the
> > > mutex outside of pool map/unmap methods in zswap_frontswap_load() as is done
> > > in zswap_frontswap_store().
> >
> > oh wait... So is zsmalloc taking a spin lock in its map callback and
> > releasing it only in unmap? In this case, I would rather keep zswap as
> > is, mark zsmalloc as RT unsafe and have zsmalloc maintainer fix it.
>
> The kernel that generated that splat was NOT an RT kernel, it was plain
> master.today with a PREEMPT config.


I see, thanks. I don't think it makes things better for zsmalloc
though. From what I can see, the offending code is this:

> /* From now on, migration cannot move the object */
> pin_tag(handle);

Bit spinlock is taken in pin_tag(). I find the comment above somewhat
misleading, why is it necessary to take a spinlock to prevent
migration? I would guess an atomic flag should normally be enough.

zswap is not broken here, it is zsmalloc that needs to be fixed.

Best regards,
Vitaly

2020-12-19 11:02:24

by Mike Galbraith

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

On Sat, 2020-12-19 at 11:46 +0100, Vitaly Wool wrote:
> On Sat, 19 Dec 2020, 11:27 Mike Galbraith, <[email protected]> wrote:
>
> > The kernel that generated that splat was NOT an RT kernel, it was plain
> > master.today with a PREEMPT config.
>
>
> I see, thanks. I don't think it makes things better for zsmalloc
> though. From what I can see, the offending code is this:
>
> > /* From now on, migration cannot move the object */
> > pin_tag(handle);
>
> Bit spinlock is taken in pin_tag(). I find the comment above somewhat
> misleading, why is it necessary to take a spinlock to prevent
> migration? I would guess an atomic flag should normally be enough.
>
> zswap is not broken here, it is zsmalloc that needs to be fixed.

Cool, those damn bit spinlocks going away would be a case of happiness
for RT as well :)

-Mike

2020-12-19 11:06:25

by Mike Galbraith

Subject: Re: [patch] zswap: fix zswap_frontswap_load() vs zsmalloc::map/unmap() might_sleep() splat

(CC zsmalloc maintainers)

On Sat, 2020-12-19 at 11:59 +0100, Mike Galbraith wrote:
> On Sat, 2020-12-19 at 11:46 +0100, Vitaly Wool wrote:
> > On Sat, 19 Dec 2020, 11:27 Mike Galbraith, <[email protected]> wrote:
> >
> > > The kernel that generated that splat was NOT an RT kernel, it was plain
> > > master.today with a PREEMPT config.
> >
> >
> > I see, thanks. I don't think it makes things better for zsmalloc
> > though. From what I can see, the offending code is this:
> >
> > > /* From now on, migration cannot move the object */
> > > pin_tag(handle);
> >
> > Bit spinlock is taken in pin_tag(). I find the comment above somewhat
> > misleading, why is it necessary to take a spinlock to prevent
> > migration? I would guess an atomic flag should normally be enough.
> >
> > zswap is not broken here, it is zsmalloc that needs to be fixed.
>
> Cool, those damn bit spinlocks going away would be a case of happiness
> for RT as well :)
>
> -Mike

2020-12-20 00:25:32

by Vitaly Wool

Subject: [PATCH] zsmalloc: do not use bit_spin_lock

zsmalloc takes bit spinlock in its _map() callback and releases it
only in unmap() which is unsafe and leads to zswap complaining
about scheduling in atomic context.

To fix that and to improve RT properties of zsmalloc, remove that
bit spinlock completely and use a bit flag instead.

Signed-off-by: Vitaly Wool <[email protected]>
---
mm/zsmalloc.c | 13 ++++++++-----
1 file changed, 8 insertions(+), 5 deletions(-)

diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
index 7289f502ffac..ff26546a7fed 100644
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)

static inline int testpin_tag(unsigned long handle)
{
- return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
+ return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
}

static inline int trypin_tag(unsigned long handle)
{
- return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
+ return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
}

-static void pin_tag(unsigned long handle) __acquires(bitlock)
+static void pin_tag(unsigned long handle)
{
- bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
+ preempt_disable();
+ while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
+ cpu_relax();
+ preempt_enable();
}

static void unpin_tag(unsigned long handle) __releases(bitlock)
{
- bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
+ clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
}

static void reset_page(struct page *page)
--
2.20.1

2020-12-20 01:21:35

by Matthew Wilcox

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

Isn't this just "I open coded bit spinlock to make the lockdep
warnings go away"?

2020-12-20 01:27:43

by Mike Galbraith

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

It also does get_cpu_var() in map(), put_cpu_var() in unmap().

-Mike

2020-12-20 01:59:20

by Mike Galbraith

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.


> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
> {
> - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> + preempt_disable();
> + while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> + cpu_relax();
> + preempt_enable();
> }

If try doesn't need to disable preemption, neither does pin.

-Mike


2020-12-20 04:14:42

by Mike Galbraith

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

Bah, I forgot to mention the config dependent rwlock, it's held across
map()/unmap() as well, so there are two more hurdles, not one.

-Mike

2020-12-20 07:23:22

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, Dec 20, 2020 at 2:18 AM Matthew Wilcox <[email protected]> wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> Isn't this just "I open coded bit spinlock to make the lockdep
> warnings go away"?

Not really because bit spinlock leaves preemption disabled.

2020-12-20 07:50:58

by Mike Galbraith

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> It also does get_cpu_var() in map(), put_cpu_var() in unmap().

That aside, the bit spinlock removal seems to hold up to beating in RT.
I stripped out the RT changes that replace the bit spinlocks, applied the
still-needed (atm) might_sleep() fix, and the ltp zram and zswap tests are
running in a loop with no sign that it's a bad idea, so I hope it makes
it in (minus the preempt-disabled spin, which I whacked), as it makes
zsmalloc markedly more RT friendly.

RT changes go from:
1 file changed, 79 insertions(+), 6 deletions(-)
to:
1 file changed, 8 insertions(+), 3 deletions(-)

-Mike

by Song Bao Hua (Barry Song)

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Mike Galbraith [mailto:[email protected]]
> Sent: Sunday, December 20, 2020 8:48 PM
> To: Vitaly Wool <[email protected]>; LKML
> <[email protected]>; linux-mm <[email protected]>
> Cc: Song Bao Hua (Barry Song) <[email protected]>; Sebastian Andrzej
> Siewior <[email protected]>; Minchan Kim <[email protected]>; NitinGupta
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> > On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > It also does get_cpu_var() in map(), put_cpu_var() in unmap().
>
> That aside, the bit spinlock removal seems to hold up to beating in RT.
> I stripped out the RT changes to replace the bit spinlocks, applied the
> still needed atm might_sleep() fix, and ltp zram and zswap test are
> running in a loop with no signs that it's a bad idea, so I hope that
> makes it in (minus the preempt disabled spin which I whacked), as it
> makes zsmalloc markedly more RT friendly.
>
> RT changes go from:
> 1 file changed, 79 insertions(+), 6 deletions(-)
> to:
> 1 file changed, 8 insertions(+), 3 deletions(-)
>

Sorry, could you show the change for
"8 insertions(+), 3 deletions(-)"?

BTW, your original patch doesn't look right, as
crypto_wait_req(crypto_acomp_decompress()...)
can sleep too.

[copy from your original patch with comment]
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned

/* decompress */
dlen = PAGE_SIZE;
+ acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
+ mutex_lock(acomp_ctx->mutex);
src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
if (zpool_evictable(entry->pool->zpool))
src += sizeof(struct zswap_header);

- acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
- mutex_lock(acomp_ctx->mutex);
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
sg_set_page(&output, page, PAGE_SIZE, 0);
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);

/*!!!!!!!!!!!!!!!!
* here crypto could sleep
!!!!!!!!!!!!!!*/

ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
- mutex_unlock(acomp_ctx->mutex);

zpool_unmap_handle(entry->pool->zpool, entry->handle);
+ mutex_unlock(acomp_ctx->mutex);
BUG_ON(ret);

freeentry:

[end]

so I guess we have to fix zsmalloc.


> -Mike

Thanks
Barry

2020-12-20 22:13:49

by Mike Galbraith

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, 2020-12-20 at 21:20 +0000, Song Bao Hua (Barry Song) wrote:
>
> > -----Original Message-----
> > From: Mike Galbraith [mailto:[email protected]]
> > Sent: Sunday, December 20, 2020 8:48 PM
> > To: Vitaly Wool <[email protected]>; LKML
> > <[email protected]>; linux-mm <[email protected]>
> > Cc: Song Bao Hua (Barry Song) <[email protected]>; Sebastian Andrzej
> > Siewior <[email protected]>; Minchan Kim <[email protected]>; NitinGupta
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Sun, 2020-12-20 at 02:23 +0100, Mike Galbraith wrote:
> > > On Sun, 2020-12-20 at 02:22 +0200, Vitaly Wool wrote:
> > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > about scheduling in atomic context.
> > > >
> > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > bit spinlock completely and use a bit flag instead.
> > >
> > > It also does get_cpu_var() in map(), put_cpu_var() in unmap().
> >
> > That aside, the bit spinlock removal seems to hold up to beating in RT.
> > I stripped out the RT changes to replace the bit spinlocks, applied the
> > still needed atm might_sleep() fix, and ltp zram and zswap test are
> > running in a loop with no signs that it's a bad idea, so I hope that
> > makes it in (minus the preempt disabled spin which I whacked), as it
> > makes zsmalloc markedly more RT friendly.
> >
> > RT changes go from:
> > 1 file changed, 79 insertions(+), 6 deletions(-)
> > to:
> > 1 file changed, 8 insertions(+), 3 deletions(-)
> >
>
> Sorry, would you like to show the change for
> "8 insertions(+), 3 deletions(-)"?

Sure.
---
mm/zsmalloc.c | 11 ++++++++---
1 file changed, 8 insertions(+), 3 deletions(-)

--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -57,6 +57,7 @@
#include <linux/wait.h>
#include <linux/pagemap.h>
#include <linux/fs.h>
+#include <linux/local_lock.h>

#define ZSPAGE_MAGIC 0x58

@@ -293,6 +294,7 @@ struct zspage {
};

struct mapping_area {
+ local_lock_t lock;
char *vm_buf; /* copy buffer for objects that span pages */
char *vm_addr; /* address of kmap_atomic()'ed pages */
enum zs_mapmode vm_mm; /* mapping mode */
@@ -455,7 +457,9 @@ MODULE_ALIAS("zpool-zsmalloc");
#endif /* CONFIG_ZPOOL */

/* per-cpu VM mapping areas for zspage accesses that cross page boundaries */
-static DEFINE_PER_CPU(struct mapping_area, zs_map_area);
+static DEFINE_PER_CPU(struct mapping_area, zs_map_area) = {
+ .lock = INIT_LOCAL_LOCK(lock),
+};

static bool is_zspage_isolated(struct zspage *zspage)
{
@@ -1276,7 +1280,8 @@ void *zs_map_object(struct zs_pool *pool
class = pool->size_class[class_idx];
off = (class->size * obj_idx) & ~PAGE_MASK;

- area = &get_cpu_var(zs_map_area);
+ local_lock(&zs_map_area.lock);
+ area = this_cpu_ptr(&zs_map_area);
area->vm_mm = mm;
if (off + class->size <= PAGE_SIZE) {
/* this object is contained entirely within a page */
@@ -1330,7 +1335,7 @@ void zs_unmap_object(struct zs_pool *poo

__zs_unmap_object(area, pages, off, class->size);
}
- put_cpu_var(zs_map_area);
+ local_unlock(&zs_map_area.lock);

migrate_read_unlock(zspage);
unpin_tag(handle);


> BTW, your original patch looks not right as
> crypto_wait_req(crypto_acomp_decompress()...)
> can sleep too.
>
> [copy from your original patch with comment]
> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1258,20 +1258,20 @@ static int zswap_frontswap_load(unsigned
>
> /* decompress */
> dlen = PAGE_SIZE;
> + acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> + mutex_lock(acomp_ctx->mutex);
> src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> if (zpool_evictable(entry->pool->zpool))
> src += sizeof(struct zswap_header);
>
> - acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> - mutex_lock(acomp_ctx->mutex);
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_page(&output, page, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
>
> /*!!!!!!!!!!!!!!!!
> * here crypto could sleep
> !!!!!!!!!!!!!!*/

Hohum, another one for my Bitmaster-9000 patch shredder.

-Mike


2020-12-21 18:32:25

by Minchan Kim

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> zsmalloc takes bit spinlock in its _map() callback and releases it
> only in unmap() which is unsafe and leads to zswap complaining
> about scheduling in atomic context.
>
> To fix that and to improve RT properties of zsmalloc, remove that
> bit spinlock completely and use a bit flag instead.

I don't want to use such open-coded locking here.

I see from Mike's patch that a recent zswap change introduced the lockdep
splat, and you want to improve zsmalloc to fix that zswap bug, so you
introduced this patch to allow preemption.

https://lore.kernel.org/linux-mm/[email protected]/

zs_[un/map]_object is designed to be used in the fast path (i.e.,
zs_map_object / 4K page copy / zs_unmap_object), so the spinlock is
perfectly fine from an API point of view. However, zswap started
using the API around mutex_lock/crypto_wait_req, where preemption
is allowed, which was wrong.

Furthermore, zs_map_object already has a few more places that
disable preemption (migrate_read_lock, get_cpu_var and kmap_atomic).

Without making those locks preemptible all at once, zswap will still
see the lockdep warning.

>
> Signed-off-by: Vitaly Wool <[email protected]>
> ---
> mm/zsmalloc.c | 13 ++++++++-----
> 1 file changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> index 7289f502ffac..ff26546a7fed 100644
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)
>
> static inline int testpin_tag(unsigned long handle)
> {
> - return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> + return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> }
>
> static inline int trypin_tag(unsigned long handle)
> {
> - return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> + return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> }
>
> -static void pin_tag(unsigned long handle) __acquires(bitlock)
> +static void pin_tag(unsigned long handle)
> {
> - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> + preempt_disable();
> + while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> + cpu_relax();
> + preempt_enable();
> }
>
> static void unpin_tag(unsigned long handle) __releases(bitlock)
> {
> - bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> + clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> }
>
> static void reset_page(struct page *page)
> --
> 2.20.1
>

2020-12-21 19:23:01

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
>
> On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > zsmalloc takes bit spinlock in its _map() callback and releases it
> > only in unmap() which is unsafe and leads to zswap complaining
> > about scheduling in atomic context.
> >
> > To fix that and to improve RT properties of zsmalloc, remove that
> > bit spinlock completely and use a bit flag instead.
>
> I don't want to use such open code for the lock.
>
> I see from Mike's patch, recent zswap change introduced the lockdep
> splat bug and you want to improve zsmalloc to fix the zswap bug and
> introduce this patch with allowing preemption enabling.

This understanding is upside down. The code in zswap you are referring
to is not buggy. You may claim that it is suboptimal but there is
nothing wrong in taking a mutex.

> https://lore.kernel.org/linux-mm/[email protected]/
>
> zs_[un/map]_object is designed to be used in fast path(i.e.,
> zs_map_object/4K page copy/zs_unmap_object) so the spinlock is
> perfectly fine for API point of view. However, zswap introduced
> using the API with mutex_lock/crypto_wait_req where allowing
> preemption, which was wrong.

Taking a spinlock in one callback and releasing it in another is
unsafe and error prone. What if unmap was called on completion of a
DMA-like transfer from another context, like a threaded IRQ handler?
In that case this spinlock might never be released.

Anyway I can come up with a zswap patch explicitly stating that
zsmalloc is not fully compliant with zswap / zpool API to avoid
confusion for the time being. Would that be ok with you?

Best regards,
Vitaly

> Furthermore, zs_map_object already has a few more places that
> disable preemption (migrate_read_lock, get_cpu_var and kmap_atomic).
>
> Without making those locks preemptible all at once, zswap will still
> see the lockdep warning.
>
> >
> > Signed-off-by: Vitaly Wool <[email protected]>
> > ---
> > mm/zsmalloc.c | 13 ++++++++-----
> > 1 file changed, 8 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > index 7289f502ffac..ff26546a7fed 100644
> > --- a/mm/zsmalloc.c
> > +++ b/mm/zsmalloc.c
> > @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)
> >
> > static inline int testpin_tag(unsigned long handle)
> > {
> > - return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > }
> >
> > static inline int trypin_tag(unsigned long handle)
> > {
> > - return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > }
> >
> > -static void pin_tag(unsigned long handle) __acquires(bitlock)
> > +static void pin_tag(unsigned long handle)
> > {
> > - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + preempt_disable();
> > + while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> > + cpu_relax();
> > + preempt_enable();
> > }
> >
> > static void unpin_tag(unsigned long handle) __releases(bitlock)
> > {
> > - bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > + clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > }
> >
> > static void reset_page(struct page *page)
> > --
> > 2.20.1
> >

2020-12-21 19:51:56

by Shakeel Butt

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]> wrote:
>
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> >
> > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > I don't want to use such open code for the lock.
> >
> > I see from Mike's patch, recent zswap change introduced the lockdep
> > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > introduce this patch with allowing preemption enabling.
>
> This understanding is upside down. The code in zswap you are referring
> to is not buggy. You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.
>

Is this suboptimal for all or just the hardware accelerators? Sorry, I
am not very familiar with the crypto API. If I select lzo or lz4 as a
zswap compressor will the [de]compression be async or sync?

> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > zs_[un/map]_object is designed to be used in fast path(i.e.,
> > zs_map_object/4K page copy/zs_unmap_object) so the spinlock is
> > perfectly fine for API point of view. However, zswap introduced
> > using the API with mutex_lock/crypto_wait_req where allowing
> > preemption, which was wrong.
>
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
>
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with zswap / zpool API

The documentation of zpool_map_handle() clearly states "This may hold
locks, disable interrupts, and/or preemption, ...", so how come
zsmalloc is not fully compliant?

> to avoid
> confusion for the time being. Would that be ok with you?
>
> Best regards,
> Vitaly
>

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Shakeel Butt [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 8:50 AM
> To: Vitaly Wool <[email protected]>
> Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>; LKML
> <[email protected]>; linux-mm <[email protected]>; Song Bao Hua
> (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> <[email protected]>; NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]> wrote:
> >
> > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > >
> > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > about scheduling in atomic context.
> > > >
> > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > bit spinlock completely and use a bit flag instead.
> > >
> > > I don't want to use such open code for the lock.
> > >
> > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > introduce this patch with allowing preemption enabling.
> >
> > This understanding is upside down. The code in zswap you are referring
> > to is not buggy. You may claim that it is suboptimal but there is
> > nothing wrong in taking a mutex.
> >
>
> Is this suboptimal for all or just the hardware accelerators? Sorry, I
> am not very familiar with the crypto API. If I select lzo or lz4 as a
> zswap compressor will the [de]compression be async or sync?

Right now, in the crypto subsystem, new drivers are required to be written
against the async APIs. The old sync API can't be used in new accelerator
drivers as it is not supported at all.

Old drivers are sync, but they have async wrappers to support the async
APIs, e.g.:
crypto: acomp - add support for lz4 via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d

crypto: acomp - add support for lzo via scomp
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e

So they support the async APIs, but they still work in sync mode as those
old drivers don't sleep.

>
> > >
> > > https://lore.kernel.org/linux-mm/fae85e4440a8ef6f13192476bd33a4826416fc58.[email protected]/
> > >
> > > zs_[un/map]_object is designed to be used in fast path(i.e.,
> > > zs_map_object/4K page copy/zs_unmap_object) so the spinlock is
> > > perfectly fine for API point of view. However, zswap introduced
> > > using the API with mutex_lock/crypto_wait_req where allowing
> > > preemption, which was wrong.
> >
> > Taking a spinlock in one callback and releasing it in another is
> > unsafe and error prone. What if unmap was called on completion of a
> > DMA-like transfer from another context, like a threaded IRQ handler?
> > In that case this spinlock might never be released.
> >
> > Anyway I can come up with a zswap patch explicitly stating that
> > zsmalloc is not fully compliant with zswap / zpool API
>
> The documentation of zpool_map_handle() clearly states "This may hold
> locks, disable interrupts, and/or preemption, ...", so how come
> zsmalloc is not fully compliant?

zbud and z3fold haven't really done this. If we held a spinlock before
entering zswap and released it after calling back into zswap, that would
put zswap in an atomic context, which isn't necessarily needed.

>
> > to avoid
> > confusion for the time being. Would that be ok with you?
> >
> > Best regards,
> > Vitaly
> >

Thanks
Barry

2020-12-21 20:24:29

by Minchan Kim

[permalink] [raw]
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 08:20:26PM +0100, Vitaly Wool wrote:
> On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> >
> > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > only in unmap() which is unsafe and leads to zswap complaining
> > > about scheduling in atomic context.
> > >
> > > To fix that and to improve RT properties of zsmalloc, remove that
> > > bit spinlock completely and use a bit flag instead.
> >
> > I don't want to use such open code for the lock.
> >
> > I see from Mike's patch, recent zswap change introduced the lockdep
> > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > introduce this patch with allowing preemption enabling.
>
> This understanding is upside down. The code in zswap you are referring
> to is not buggy. You may claim that it is suboptimal but there is
> nothing wrong in taking a mutex.

No, it surely breaks zswap, since zpool/zsmalloc has worked like
this and now you are saying "nothing wrong" even though it breaks
the rule.

>
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > zs_[un/map]_object is designed to be used in fast path(i.e.,
> > zs_map_object/4K page copy/zs_unmap_object) so the spinlock is
> > perfectly fine for API point of view. However, zswap introduced
> > using the API with mutex_lock/crypto_wait_req where allowing
> > preemption, which was wrong.
>
> Taking a spinlock in one callback and releasing it in another is
> unsafe and error prone. What if unmap was called on completion of a
> DMA-like transfer from another context, like a threaded IRQ handler?
> In that case this spinlock might never be released.
>
> Anyway I can come up with a zswap patch explicitly stating that
> zsmalloc is not fully compliant with zswap / zpool API to avoid
> confusion for the time being. Would that be ok with you?

It's your call since you are the maintainer of zswap now, and you are
breaking the rule we have kept for a long time.


>
> Best regards,
> Vitaly
>
> > Furthermore, zs_map_object already has a few more places where
> > preemption is disabled (migrate_read_lock, get_cpu_var and kmap_atomic).
> >
> > Without making those locks preemptible all at once, zswap will still
> > see the lockdep warning.
> >
> > >
> > > Signed-off-by: Vitaly Wool <[email protected]>
> > > ---
> > > mm/zsmalloc.c | 13 ++++++++-----
> > > 1 file changed, 8 insertions(+), 5 deletions(-)
> > >
> > > diff --git a/mm/zsmalloc.c b/mm/zsmalloc.c
> > > index 7289f502ffac..ff26546a7fed 100644
> > > --- a/mm/zsmalloc.c
> > > +++ b/mm/zsmalloc.c
> > > @@ -876,22 +876,25 @@ static unsigned long obj_to_head(struct page *page, void *obj)
> > >
> > > static inline int testpin_tag(unsigned long handle)
> > > {
> > > - return bit_spin_is_locked(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > + return test_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > }
> > >
> > > static inline int trypin_tag(unsigned long handle)
> > > {
> > > - return bit_spin_trylock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > + return !test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > }
> > >
> > > -static void pin_tag(unsigned long handle) __acquires(bitlock)
> > > +static void pin_tag(unsigned long handle)
> > > {
> > > - bit_spin_lock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > + preempt_disable();
> > > + while(test_and_set_bit(HANDLE_PIN_BIT, (unsigned long *)handle))
> > > + cpu_relax();
> > > + preempt_enable();
> > > }
> > >
> > > static void unpin_tag(unsigned long handle) __releases(bitlock)
> > > {
> > > - bit_spin_unlock(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > + clear_bit(HANDLE_PIN_BIT, (unsigned long *)handle);
> > > }
> > >
> > > static void reset_page(struct page *page)
> > > --
> > > 2.20.1
> > >

2020-12-21 21:04:30

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Shakeel Butt [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 8:50 AM
> > To: Vitaly Wool <[email protected]>
> > Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>; LKML
> > <[email protected]>; linux-mm <[email protected]>; Song Bao Hua
> > (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> > <[email protected]>; NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]> wrote:
> > >
> > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > >
> > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > about scheduling in atomic context.
> > > > >
> > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > bit spinlock completely and use a bit flag instead.
> > > >
> > > > I don't want to use such open code for the lock.
> > > >
> > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > introduce this patch with allowing preemption enabling.
> > >
> > > This understanding is upside down. The code in zswap you are referring
> > > to is not buggy. You may claim that it is suboptimal but there is
> > > nothing wrong in taking a mutex.
> > >
> >
> > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > zswap compressor will the [de]compression be async or sync?
>
> Right now, in crypto subsystem, new drivers are required to write based on
> async APIs. The old sync API can't work in new accelerator drivers as they
> are not supported at all.
>
> Old drivers are used to sync, but they've got async wrappers to support async
> APIs. Eg.
> crypto: acomp - add support for lz4 via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
>
> crypto: acomp - add support for lzo via scomp
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
>
> so they are supporting async APIs but they are still working in sync mode as
> those old drivers don't sleep.
>

Good to know that those are sync because I want them to be sync.
Please note that zswap is a cache in front of a real swap and the load
operation is latency sensitive as it comes in the page fault path and
directly impacts the applications. I doubt decompressing synchronously
a 4k page on a cpu will be costlier than asynchronously decompressing
the same page from hardware accelerators.

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Shakeel Butt [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 10:03 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Vitaly Wool <[email protected]>; Minchan Kim <[email protected]>;
> Mike Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Shakeel Butt [mailto:[email protected]]
> > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > To: Vitaly Wool <[email protected]>
> > > Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>; LKML
> > > <[email protected]>; linux-mm <[email protected]>; Song Bao
> Hua
> > > (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> Senozhatsky
> > > <[email protected]>; Andrew Morton
> > > <[email protected]>
> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> wrote:
> > > >
> > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > > >
> > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > about scheduling in atomic context.
> > > > > >
> > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > bit spinlock completely and use a bit flag instead.
> > > > >
> > > > > I don't want to use such open code for the lock.
> > > > >
> > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > > introduce this patch with allowing preemption enabling.
> > > >
> > > > This understanding is upside down. The code in zswap you are referring
> > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > nothing wrong in taking a mutex.
> > > >
> > >
> > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > zswap compressor will the [de]compression be async or sync?
> >
> > Right now, in crypto subsystem, new drivers are required to write based on
> > async APIs. The old sync API can't work in new accelerator drivers as they
> > are not supported at all.
> >
> > Old drivers are used to sync, but they've got async wrappers to support async
> > APIs. Eg.
> > crypto: acomp - add support for lz4 via scomp
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> >
> > crypto: acomp - add support for lzo via scomp
> >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> >
> > so they are supporting async APIs but they are still working in sync mode
> as
> > those old drivers don't sleep.
> >
>
> Good to know that those are sync because I want them to be sync.
> Please note that zswap is a cache in front of a real swap and the load
> operation is latency sensitive as it comes in the page fault path and
> directly impacts the applications. I doubt decompressing synchronously
> a 4k page on a cpu will be costlier than asynchronously decompressing
> the same page from hardware accelerators.

If you read the old paper:
https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
"Because the hardware accelerator speeds up compression, looking at the zswap
metrics we observed that there were more store and load requests in a given
amount of time, which filled up the zswap pool faster than a software
compression run. Because of this behavior, we set the max_pool_percent
parameter to 30 for the hardware compression runs - this means that zswap
can use up to 30% of the 10GB of total memory."

So using hardware accelerators, we get a chance to speed up compression
while decreasing cpu utilization.

BTW, if it is not easy to change zsmalloc, one quick workaround we might do
in zswap is adding the below after applying Mike's original patch:

	if (in_atomic()) /* for zsmalloc */
		while (!try_wait_for_completion(&req->done))
			;
	else /* for zbud, z3fold */
		crypto_wait_req(....);

crypto_wait_req() is actually doing wait_for_completion():
static inline int crypto_wait_req(int err, struct crypto_wait *wait)
{
	switch (err) {
	case -EINPROGRESS:
	case -EBUSY:
		wait_for_completion(&wait->completion);
		reinit_completion(&wait->completion);
		err = wait->err;
		break;
	}

	return err;
}

Thanks
Barry

2020-12-21 22:14:57

by Vitaly Wool

[permalink] [raw]
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Shakeel Butt [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 10:03 AM
> > To: Song Bao Hua (Barry Song) <[email protected]>
> > Cc: Vitaly Wool <[email protected]>; Minchan Kim <[email protected]>;
> > Mike Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Shakeel Butt [mailto:[email protected]]
> > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > To: Vitaly Wool <[email protected]>
> > > > Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>; LKML
> > > > <[email protected]>; linux-mm <[email protected]>; Song Bao
> > Hua
> > > > (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> > > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> > Senozhatsky
> > > > <[email protected]>; Andrew Morton
> > > > <[email protected]>
> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > wrote:
> > > > >
> > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > about scheduling in atomic context.
> > > > > > >
> > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > >
> > > > > > I don't want to use such open code for the lock.
> > > > > >
> > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > > > introduce this patch with allowing preemption enabling.
> > > > >
> > > > > This understanding is upside down. The code in zswap you are referring
> > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > nothing wrong in taking a mutex.
> > > > >
> > > >
> > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > zswap compressor will the [de]compression be async or sync?
> > >
> > > Right now, in crypto subsystem, new drivers are required to write based on
> > > async APIs. The old sync API can't work in new accelerator drivers as they
> > > are not supported at all.
> > >
> > > Old drivers are used to sync, but they've got async wrappers to support async
> > > APIs. Eg.
> > > crypto: acomp - add support for lz4 via scomp
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > >
> > > crypto: acomp - add support for lzo via scomp
> > >
> > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > >
> > > so they are supporting async APIs but they are still working in sync mode
> > as
> > > those old drivers don't sleep.
> > >
> >
> > Good to know that those are sync because I want them to be sync.
> > Please note that zswap is a cache in front of a real swap and the load
> > operation is latency sensitive as it comes in the page fault path and
> > directly impacts the applications. I doubt decompressing synchronously
> > a 4k page on a cpu will be costlier than asynchronously decompressing
> > the same page from hardware accelerators.
>
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> Because the hardware accelerator speeds up compression, looking at the zswap
> metrics we observed that there were more store and load requests in a given
> amount of time, which filled up the zswap pool faster than a software
> compression run. Because of this behavior, we set the max_pool_percent
> parameter to 30 for the hardware compression runs - this means that zswap
> can use up to 30% of the 10GB of total memory.
>
> So using hardware accelerators, we get a chance to speed up compression
> while decreasing cpu utilization.
>
> BTW, If it is not easy to change zsmalloc, one quick workaround we might do
> in zswap is adding the below after applying Mike's original patch:
>
> 	if (in_atomic()) /* for zsmalloc */
> 		while (!try_wait_for_completion(&req->done))
> 			;
> 	else /* for zbud, z3fold */
> 		crypto_wait_req(....);

I don't think I'm going to ack this, sorry.

Best regards,
Vitaly

> crypto_wait_req() is actually doing wait_for_completion():
> static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> {
> 	switch (err) {
> 	case -EINPROGRESS:
> 	case -EBUSY:
> 		wait_for_completion(&wait->completion);
> 		reinit_completion(&wait->completion);
> 		err = wait->err;
> 		break;
> 	}
>
> 	return err;
> }
>
> Thanks
> Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Vitaly Wool [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 11:12 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Shakeel Butt [mailto:[email protected]]
> > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > To: Song Bao Hua (Barry Song) <[email protected]>
> > > Cc: Vitaly Wool <[email protected]>; Minchan Kim
> <[email protected]>;
> > > Mike Galbraith <[email protected]>; LKML <[email protected]>;
> linux-mm
> > > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > <[email protected]>; Andrew Morton
> > > <[email protected]>
> > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Shakeel Butt [mailto:[email protected]]
> > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > To: Vitaly Wool <[email protected]>
> > > > > Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>;
> LKML
> > > > > <[email protected]>; linux-mm <[email protected]>; Song
> Bao
> > > Hua
> > > > > (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> > > > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> > > Senozhatsky
> > > > > <[email protected]>; Andrew Morton
> > > > > <[email protected]>
> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > >
> > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > > wrote:
> > > > > >
> > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases
> it
> > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > about scheduling in atomic context.
> > > > > > > >
> > > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > >
> > > > > > > I don't want to use such open code for the lock.
> > > > > > >
> > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug
> and
> > > > > > > introduce this patch with allowing preemption enabling.
> > > > > >
> > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > nothing wrong in taking a mutex.
> > > > > >
> > > > >
> > > > > Is this suboptimal for all or just the hardware accelerators? Sorry,
> I
> > > > > am not very familiar with the crypto API. If I select lzo or lz4 as
> a
> > > > > zswap compressor will the [de]compression be async or sync?
> > > >
> > > > Right now, in crypto subsystem, new drivers are required to write based
> on
> > > > async APIs. The old sync API can't work in new accelerator drivers as
> they
> > > > are not supported at all.
> > > >
> > > > Old drivers are used to sync, but they've got async wrappers to support
> async
> > > > APIs. Eg.
> > > > crypto: acomp - add support for lz4 via scomp
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > >
> > > > crypto: acomp - add support for lzo via scomp
> > > >
> > > > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > >
> > > > so they are supporting async APIs but they are still working in sync mode
> > > as
> > > > those old drivers don't sleep.
> > > >
> > >
> > > Good to know that those are sync because I want them to be sync.
> > > Please note that zswap is a cache in front of a real swap and the load
> > > operation is latency sensitive as it comes in the page fault path and
> > > directly impacts the applications. I doubt decompressing synchronously
> > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > the same page from hardware accelerators.
> >
> > If you read the old paper:
> >
> > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> > Because the hardware accelerator speeds up compression, looking at the zswap
> > metrics we observed that there were more store and load requests in a given
> > amount of time, which filled up the zswap pool faster than a software
> > compression run. Because of this behavior, we set the max_pool_percent
> > parameter to 30 for the hardware compression runs - this means that zswap
> > can use up to 30% of the 10GB of total memory.
> >
> > So using hardware accelerators, we get a chance to speed up compression
> > while decreasing cpu utilization.
> >
> > BTW, If it is not easy to change zsmalloc, one quick workaround we might do
> > in zswap is adding the below after applying Mike's original patch:
> >
> > 	if (in_atomic()) /* for zsmalloc */
> > 		while (!try_wait_for_completion(&req->done))
> > 			;
> > 	else /* for zbud, z3fold */
> > 		crypto_wait_req(....);
>
> I don't think I'm going to ack this, sorry.
>

Fair enough. I am also wondering if we can move zpool_unmap_handle()
up, right after zpool_map_handle(), as below:

	dlen = PAGE_SIZE;
	src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
	if (zpool_evictable(entry->pool->zpool))
		src += sizeof(struct zswap_header);
+	zpool_unmap_handle(entry->pool->zpool, entry->handle);

	acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
	mutex_lock(acomp_ctx->mutex);
	sg_init_one(&input, src, entry->length);
	sg_init_table(&output, 1);
	sg_set_page(&output, page, PAGE_SIZE, 0);
	acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
	ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
	mutex_unlock(acomp_ctx->mutex);

-	zpool_unmap_handle(entry->pool->zpool, entry->handle);

src is always low memory, and we only need its virtual address to get
the page of src in sg_init_one(); the CPU never actually reads it
anywhere.

> Best regards,
> Vitaly
>
> > crypto_wait_req() is actually doing wait_for_completion():
> > static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> > {
> > 	switch (err) {
> > 	case -EINPROGRESS:
> > 	case -EBUSY:
> > 		wait_for_completion(&wait->completion);
> > 		reinit_completion(&wait->completion);
> > 		err = wait->err;
> > 		break;
> > 	}
> >
> > 	return err;
> > }
> >
Thanks
Barry

2020-12-21 22:49:10

by Shakeel Butt

[permalink] [raw]
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Shakeel Butt [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 10:03 AM
> > To: Song Bao Hua (Barry Song) <[email protected]>
> > Cc: Vitaly Wool <[email protected]>; Minchan Kim <[email protected]>;
> > Mike Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Shakeel Butt [mailto:[email protected]]
> > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > To: Vitaly Wool <[email protected]>
> > > > Cc: Minchan Kim <[email protected]>; Mike Galbraith <[email protected]>; LKML
> > > > <[email protected]>; linux-mm <[email protected]>; Song Bao
> > Hua
> > > > (Barry Song) <[email protected]>; Sebastian Andrzej Siewior
> > > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> > Senozhatsky
> > > > <[email protected]>; Andrew Morton
> > > > <[email protected]>
> > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > wrote:
> > > > >
> > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > > > >
> > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases it
> > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > about scheduling in atomic context.
> > > > > > >
> > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > >
> > > > > > I don't want to use such open code for the lock.
> > > > > >
> > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug and
> > > > > > introduce this patch with allowing preemption enabling.
> > > > >
> > > > > This understanding is upside down. The code in zswap you are referring
> > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > nothing wrong in taking a mutex.
> > > > >
> > > >
> > > > Is this suboptimal for all or just the hardware accelerators? Sorry, I
> > > > am not very familiar with the crypto API. If I select lzo or lz4 as a
> > > > zswap compressor will the [de]compression be async or sync?
> > >
> > > Right now, in the crypto subsystem, new drivers are required to be written
> > > against the async APIs. The old sync API can't be used by new accelerator
> > > drivers, as it is not supported there at all.
> > >
> > > Old drivers are sync, but they've got async wrappers to support the async
> > > APIs, e.g.:
> > > crypto: acomp - add support for lz4 via scomp
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > >
> > > crypto: acomp - add support for lzo via scomp
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > >
> > > so they support the async APIs but still work in sync mode, as
> > > those old drivers don't sleep.
> > >
> >
> > Good to know that those are sync because I want them to be sync.
> > Please note that zswap is a cache in front of a real swap and the load
> > operation is latency sensitive as it comes in the page fault path and
> > directly impacts the applications. I doubt decompressing synchronously
> > a 4k page on a cpu will be costlier than asynchronously decompressing
> > the same page from hardware accelerators.
>
> If you read the old paper:
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionality
> Because the hardware accelerator speeds up compression, looking at the zswap
> metrics we observed that there were more store and load requests in a given
> amount of time, which filled up the zswap pool faster than a software
> compression run. Because of this behavior, we set the max_pool_percent
> parameter to 30 for the hardware compression runs - this means that zswap
> can use up to 30% of the 10GB of total memory.
>
> So using hardware accelerators, we get a chance to speed up compression
> while decreasing cpu utilization.
>

I don't care much about the compression. It's the decompression or
more specifically the latency of decompression I really care about.

Compression happens on reclaim, so latency is not really an issue.
Reclaim can be pressure-based or proactive. I think async batched
compression by accelerators makes a lot of sense. Though I doubt zswap
is the right layer for that. To me adding "async batched compression
support by accelerators" in zram looks more natural as the kernel
already has async block I/O support.

For decompression, I would like as low a latency as possible, which I
think is only achievable by doing the decompression on a CPU synchronously.

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Shakeel Butt [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 11:46 AM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Vitaly Wool <[email protected]>; Minchan Kim <[email protected]>;
> Mike Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Mon, Dec 21, 2020 at 1:30 PM Song Bao Hua (Barry Song)
> <[email protected]> wrote:
> >
> >
> >
> > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > > wrote:
> > > > > >
> > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]> wrote:
> > > > > > >
> > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases
> it
> > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > about scheduling in atomic context.
> > > > > > > >
> > > > > > > > To fix that and to improve RT properties of zsmalloc, remove that
> > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > >
> > > > > > > I don't want to use such open code for the lock.
> > > > > > >
> > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug
> and
> > > > > > > introduce this patch with allowing preemption enabling.
> > > > > >
> > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > nothing wrong in taking a mutex.
> > > > > >
> > > > >
> > > > > Is this suboptimal for all or just the hardware accelerators? Sorry,
> I
> > > > > am not very familiar with the crypto API. If I select lzo or lz4 as
> a
> > > > > zswap compressor will the [de]compression be async or sync?
> > > >
> > > > Right now, in crypto subsystem, new drivers are required to write based
> on
> > > > async APIs. The old sync API can't work in new accelerator drivers as
> they
> > > > are not supported at all.
> > > >
> > > > Old drivers are used to sync, but they've got async wrappers to support
> async
> > > > APIs. Eg.
> > > > crypto: acomp - add support for lz4 via scomp
> > > >
> > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > >
> > > > crypto: acomp - add support for lzo via scomp
> > > >
> > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > >
> > > > so they are supporting async APIs but they are still working in sync mode
> > > as
> > > > those old drivers don't sleep.
> > > >
> > >
> > > Good to know that those are sync because I want them to be sync.
> > > Please note that zswap is a cache in front of a real swap and the load
> > > operation is latency sensitive as it comes in the page fault path and
> > > directly impacts the applications. I doubt decompressing synchronously
> > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > the same page from hardware accelerators.
> >
> > If you read the old paper:
> >
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> y
> > Because the hardware accelerator speeds up compression, looking at the zswap
> > metrics we observed that there were more store and load requests in a given
> > amount of time, which filled up the zswap pool faster than a software
> > compression run. Because of this behavior, we set the max_pool_percent
> > parameter to 30 for the hardware compression runs - this means that zswap
> > can use up to 30% of the 10GB of total memory.
> >
> > So using hardware accelerators, we get a chance to speed up compression
> > while decreasing cpu utilization.
> >
>
> I don't care much about the compression. It's the decompression or
> more specifically the latency of decompression I really care about.
>
> Compression happens on reclaim, so latency is not really an issue.
> Reclaim can be pressure-based or proactive. I think async batched
> compression by accelerators makes a lot of sense. Though I doubt zswap
> is the right layer for that. To me adding "async batched compression
> support by accelerators" in zram looks more natural as the kernel
> already has async block I/O support.

Yep.
zram is one of the targets I have been thinking about for acomp support.

>
> For decompression, I would like as low latency as possible which I
> think is only possible by doing decompression on a cpu synchronously.

One possibility is to change the HW accelerator drivers to do sync
polling for decompression. They would still be built on the async API,
as that is the framework nowadays; the difference is that the driver
wouldn't really block, so crypto_wait_req() would return without
actually sleeping.

Thanks
Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 11:38 AM
> To: 'Vitaly Wool' <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
>
>
> > -----Original Message-----
> > From: Vitaly Wool [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 11:12 AM
> > To: Song Bao Hua (Barry Song) <[email protected]>
> > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>;
> Mike
> > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > <[email protected]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > > > wrote:
> > > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]>
> wrote:
> > > > > > > >
> > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases
> > it
> > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > > about scheduling in atomic context.
> > > > > > > > >
> > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove
> that
> > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > >
> > > > > > > > I don't want to use such open code for the lock.
> > > > > > > >
> > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug
> > and
> > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > >
> > > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > > nothing wrong in taking a mutex.
> > > > > > >
> > > > > >
> > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry,
> > I
> > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as
> > a
> > > > > > zswap compressor will the [de]compression be async or sync?
> > > > >
> > > > > Right now, in crypto subsystem, new drivers are required to write based
> > on
> > > > > async APIs. The old sync API can't work in new accelerator drivers as
> > they
> > > > > are not supported at all.
> > > > >
> > > > > Old drivers are used to sync, but they've got async wrappers to support
> > async
> > > > > APIs. Eg.
> > > > > crypto: acomp - add support for lz4 via scomp
> > > > >
> > > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > >
> > > > > crypto: acomp - add support for lzo via scomp
> > > > >
> > > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > >
> > > > > so they are supporting async APIs but they are still working in sync
> mode
> > > > as
> > > > > those old drivers don't sleep.
> > > > >
> > > >
> > > > Good to know that those are sync because I want them to be sync.
> > > > Please note that zswap is a cache in front of a real swap and the load
> > > > operation is latency sensitive as it comes in the page fault path and
> > > > directly impacts the applications. I doubt decompressing synchronously
> > > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > > the same page from hardware accelerators.
> > >
> > > If you read the old paper:
> > >
> >
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> > y
> > > Because the hardware accelerator speeds up compression, looking at the zswap
> > > metrics we observed that there were more store and load requests in a given
> > > amount of time, which filled up the zswap pool faster than a software
> > > compression run. Because of this behavior, we set the max_pool_percent
> > > parameter to 30 for the hardware compression runs - this means that zswap
> > > can use up to 30% of the 10GB of total memory.
> > >
> > > So using hardware accelerators, we get a chance to speed up compression
> > > while decreasing cpu utilization.
> > >
> > > BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> > > in zswap is adding the below after applying Mike's original patch:
> > >
> > > if (in_atomic()) /* for zsmalloc */
> > > while (!try_wait_for_completion(&req->done));
> > > else /* for zbud, z3fold */
> > > crypto_wait_req(....);
> >
> > I don't think I'm going to ack this, sorry.
> >
>
> Fair enough. And I am also wondering if we can move zpool_unmap_handle()
> to right after zpool_map_handle(), as below:
>
> dlen = PAGE_SIZE;
> src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> if (zpool_evictable(entry->pool->zpool))
> src += sizeof(struct zswap_header);
> + zpool_unmap_handle(entry->pool->zpool, entry->handle);
>
> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> mutex_lock(acomp_ctx->mutex);
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_page(&output, page, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length,
> dlen);
> ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
> &acomp_ctx->wait);
> mutex_unlock(acomp_ctx->mutex);
>
> - zpool_unmap_handle(entry->pool->zpool, entry->handle);
>
> Since src is always in low memory, we only need its virtual address
> to get the page of src in sg_init_one(); we don't actually read it
> with the CPU anywhere.

The code below might be better:

dlen = PAGE_SIZE;
src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
if (zpool_evictable(entry->pool->zpool))
src += sizeof(struct zswap_header);

acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);

+ zpool_unmap_handle(entry->pool->zpool, entry->handle);

mutex_lock(acomp_ctx->mutex);
sg_init_one(&input, src, entry->length);
sg_init_table(&output, 1);
sg_set_page(&output, page, PAGE_SIZE, 0);
acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
mutex_unlock(acomp_ctx->mutex);

- zpool_unmap_handle(entry->pool->zpool, entry->handle);

>
> > Best regards,
> > Vitaly
> >
> > > crypto_wait_req() is actually doing wait_for_completion():
> > > static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> > > {
> > > switch (err) {
> > > case -EINPROGRESS:
> > > case -EBUSY:
> > > wait_for_completion(&wait->completion);
> > > reinit_completion(&wait->completion);
> > > err = wait->err;
> > > break;
> > > }
> > >
> > > return err;
> > > }

Thanks
Barry

2020-12-22 01:01:54

by Vitaly Wool

[permalink] [raw]
Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
<[email protected]> wrote:
>
>
>
> > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > <[email protected]> wrote:
> > > >
> > > >
> > > >
> > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool <[email protected]>
> > > > > wrote:
> > > > > > > >
> > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]>
> > wrote:
> > > > > > > > >
> > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases
> > > it
> > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > >
> > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove
> > that
> > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > >
> > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > >
> > > > > > > > > I see from Mike's patch, recent zswap change introduced the lockdep
> > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap bug
> > > and
> > > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > > >
> > > > > > > > This understanding is upside down. The code in zswap you are referring
> > > > > > > > to is not buggy. You may claim that it is suboptimal but there is
> > > > > > > > nothing wrong in taking a mutex.
> > > > > > > >
> > > > > > >
> > > > > > > Is this suboptimal for all or just the hardware accelerators? Sorry,
> > > I
> > > > > > > am not very familiar with the crypto API. If I select lzo or lz4 as
> > > a
> > > > > > > zswap compressor will the [de]compression be async or sync?
> > > > > >
> > > > > > Right now, in crypto subsystem, new drivers are required to write based
> > > on
> > > > > > async APIs. The old sync API can't work in new accelerator drivers as
> > > they
> > > > > > are not supported at all.
> > > > > >
> > > > > > Old drivers are used to sync, but they've got async wrappers to support
> > > async
> > > > > > APIs. Eg.
> > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > >
> > > > >
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > >
> > > > > > crypto: acomp - add support for lzo via scomp
> > > > > >
> > > > >
> > >
> > https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > >
> > > > > > so they are supporting async APIs but they are still working in sync
> > mode
> > > > > as
> > > > > > those old drivers don't sleep.
> > > > > >
> > > > >
> > > > > Good to know that those are sync because I want them to be sync.
> > > > > Please note that zswap is a cache in front of a real swap and the load
> > > > > operation is latency sensitive as it comes in the page fault path and
> > > > > directly impacts the applications. I doubt decompressing synchronously
> > > > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > > > the same page from hardware accelerators.
> > > >
> > > > If you read the old paper:
> > > >
> > >
> > https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> > > y
> > > > Because the hardware accelerator speeds up compression, looking at the zswap
> > > > metrics we observed that there were more store and load requests in a given
> > > > amount of time, which filled up the zswap pool faster than a software
> > > > compression run. Because of this behavior, we set the max_pool_percent
> > > > parameter to 30 for the hardware compression runs - this means that zswap
> > > > can use up to 30% of the 10GB of total memory.
> > > >
> > > > So using hardware accelerators, we get a chance to speed up compression
> > > > while decreasing cpu utilization.
> > > >
> > > > BTW, if it is not easy to change zsmalloc, one quick workaround we might do
> > > > in zswap is adding the below after applying Mike's original patch:
> > > >
> > > > if (in_atomic()) /* for zsmalloc */
> > > > while (!try_wait_for_completion(&req->done));
> > > > else /* for zbud, z3fold */
> > > > crypto_wait_req(....);
> > >
> > > I don't think I'm going to ack this, sorry.
> > >
> >
> > Fair enough. And I am also wondering if we can move zpool_unmap_handle()
> > to right after zpool_map_handle(), as below:
> >
> > dlen = PAGE_SIZE;
> > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > if (zpool_evictable(entry->pool->zpool))
> > src += sizeof(struct zswap_header);
> > + zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > mutex_lock(acomp_ctx->mutex);
> > sg_init_one(&input, src, entry->length);
> > sg_init_table(&output, 1);
> > sg_set_page(&output, page, PAGE_SIZE, 0);
> > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length,
> > dlen);
> > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
> > &acomp_ctx->wait);
> > mutex_unlock(acomp_ctx->mutex);
> >
> > - zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> > Since src is always in low memory, we only need its virtual address
> > to get the page of src in sg_init_one(); we don't actually read it
> > with the CPU anywhere.
>
> The below code might be better:
>
> dlen = PAGE_SIZE;
> src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> if (zpool_evictable(entry->pool->zpool))
> src += sizeof(struct zswap_header);
>
> acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>
> + zpool_unmap_handle(entry->pool->zpool, entry->handle);
>
> mutex_lock(acomp_ctx->mutex);
> sg_init_one(&input, src, entry->length);
> sg_init_table(&output, 1);
> sg_set_page(&output, page, PAGE_SIZE, 0);
> acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> mutex_unlock(acomp_ctx->mutex);
>
> - zpool_unmap_handle(entry->pool->zpool, entry->handle);

I don't see how this is going to work since we can't guarantee src
will be a valid pointer after the zpool_unmap_handle() call, can we?
Could you please elaborate?

~Vitaly

> >
> > > Best regards,
> > > Vitaly
> > >
> > > > crypto_wait_req() is actually doing wait_for_completion():
> > > > static inline int crypto_wait_req(int err, struct crypto_wait *wait)
> > > > {
> > > > switch (err) {
> > > > case -EINPROGRESS:
> > > > case -EBUSY:
> > > > wait_for_completion(&wait->completion);
> > > > reinit_completion(&wait->completion);
> > > > err = wait->err;
> > > > break;
> > > > }
> > > >
> > > > return err;
> > > > }
>
> Thanks
> Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Vitaly Wool [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 2:00 PM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> <[email protected]> wrote:
> >
> >
> >
> > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > > <[email protected]> wrote:
> > > > >
> > > > >
> > > > >
> > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > > <[email protected]> wrote:
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > > -----Original Message-----
> > > > > > > > From: Shakeel Butt [mailto:[email protected]]
> > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > > To: Vitaly Wool <[email protected]>
> > > > > > > > Cc: Minchan Kim <[email protected]>; Mike Galbraith
> <[email protected]>;
> > > > LKML
> > > > > > > > <[email protected]>; linux-mm <[email protected]>;
> Song
> > > > Bao
> > > > > > Hua
> > > > > > > > (Barry Song) <[email protected]>; Sebastian Andrzej
> Siewior
> > > > > > > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> > > > > > Senozhatsky
> > > > > > > > <[email protected]>; Andrew Morton
> > > > > > > > <[email protected]>
> > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > > >
> > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool
> <[email protected]>
> > > > > > wrote:
> > > > > > > > >
> > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]>
> > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and releases
> > > > it
> > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > > >
> > > > > > > > > > > To fix that and to improve RT properties of zsmalloc, remove
> > > that
> > > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > > >
> > > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > > >
> > > > > > > > > > I see from Mike's patch, recent zswap change introduced the
> lockdep
> > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap
> bug
> > > > and
> > > > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > > > >
> > > > > > > > > This understanding is upside down. The code in zswap you are
> referring
> > > > > > > > > to is not buggy. You may claim that it is suboptimal but there
> is
> > > > > > > > > nothing wrong in taking a mutex.
> > > > > > > > >
> > > > > > > >
> > > > > > > > Is this suboptimal for all or just the hardware accelerators?
> Sorry,
> > > > I
> > > > > > > > am not very familiar with the crypto API. If I select lzo or lz4
> as
> > > > a
> > > > > > > > zswap compressor will the [de]compression be async or sync?
> > > > > > >
> > > > > > > Right now, in crypto subsystem, new drivers are required to write
> based
> > > > on
> > > > > > > async APIs. The old sync API can't work in new accelerator drivers
> as
> > > > they
> > > > > > > are not supported at all.
> > > > > > >
> > > > > > > Old drivers are used to sync, but they've got async wrappers to
> support
> > > > async
> > > > > > > APIs. Eg.
> > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > >
> > > > > >
> > > >
> > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > >
> > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > >
> > > > > >
> > > >
> > >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > >
> > > > > > > so they are supporting async APIs but they are still working in
> sync
> > > mode
> > > > > > as
> > > > > > > those old drivers don't sleep.
> > > > > > >
> > > > > >
> > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > Please note that zswap is a cache in front of a real swap and the
> load
> > > > > > operation is latency sensitive as it comes in the page fault path
> and
> > > > > > directly impacts the applications. I doubt decompressing synchronously
> > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > > > > the same page from hardware accelerators.
> > > > >
> > > > > If you read the old paper:
> > > > >
> > > >
> > >
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> > > > y
> > > > > Because the hardware accelerator speeds up compression, looking at the
> zswap
> > > > > metrics we observed that there were more store and load requests in
> a given
> > > > > amount of time, which filled up the zswap pool faster than a software
> > > > > compression run. Because of this behavior, we set the max_pool_percent
> > > > > parameter to 30 for the hardware compression runs - this means that
> zswap
> > > > > can use up to 30% of the 10GB of total memory.
> > > > >
> > > > > So using hardware accelerators, we get a chance to speed up compression
> > > > > while decreasing cpu utilization.
> > > > >
> > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we might
> > > do
> > > > > in zswap is adding the below after applying Mike's original patch:
> > > > >
> > > > > if (in_atomic()) /* for zsmalloc */
> > > > > 	while (!try_wait_for_completion(&req->done))
> > > > > 		;
> > > > > else /* for zbud, z3fold */
> > > > > 	crypto_wait_req(....);
> > > >
> > > > I don't think I'm going to ack this, sorry.
> > > >
> > >
> > > Fair enough. And I am also wondering if we can move zpool_unmap_handle()
> > > to right after zpool_map_handle() as below:
> > >
> > > dlen = PAGE_SIZE;
> > > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > > if (zpool_evictable(entry->pool->zpool))
> > > 	src += sizeof(struct zswap_header);
> > > + zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > mutex_lock(acomp_ctx->mutex);
> > > sg_init_one(&input, src, entry->length);
> > > sg_init_table(&output, 1);
> > > sg_set_page(&output, page, PAGE_SIZE, 0);
> > > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > > mutex_unlock(acomp_ctx->mutex);
> > >
> > > - zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > Since src is always in low memory, we only need its virtual address to
> > > get the page struct of src in sg_init_one(); the CPU never actually
> > > reads the data through src anywhere.
> >
> > The below code might be better:
> >
> > dlen = PAGE_SIZE;
> > src = zpool_map_handle(entry->pool->zpool, entry->handle, ZPOOL_MM_RO);
> > if (zpool_evictable(entry->pool->zpool))
> > 	src += sizeof(struct zswap_header);
> >
> > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> >
> > + zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> > mutex_lock(acomp_ctx->mutex);
> > sg_init_one(&input, src, entry->length);
> > sg_init_table(&output, 1);
> > sg_set_page(&output, page, PAGE_SIZE, 0);
> > acomp_request_set_params(acomp_ctx->req, &input, &output, entry->length, dlen);
> > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req), &acomp_ctx->wait);
> > mutex_unlock(acomp_ctx->mutex);
> >
> > - zpool_unmap_handle(entry->pool->zpool, entry->handle);
>
> I don't see how this is going to work since we can't guarantee src
> will be a valid pointer after the zpool_unmap_handle() call, can we?
> Could you please elaborate?

A valid pointer only matters when the CPU reads or writes through it. Here the
CPU never dereferences src; we only need the address to look up its page struct.

void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
{
	sg_init_table(sg, 1);
	sg_set_buf(sg, buf, buflen);
}

static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
			      unsigned int buflen)
{
#ifdef CONFIG_DEBUG_SG
	BUG_ON(!virt_addr_valid(buf));
#endif
	sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
}

sg_init_one() always takes an address in the linear (direct) mapping of
physical memory, so once we have the value of src we can derive its page
struct directly with virt_to_page(). No page-table walk is needed, unlike
what vmalloc_to_page() requires.

The req only needs the page struct to initialize the sg table. If we end up
using CPU-based (de)compression, the crypto driver will kmap the page again
itself.

>
> ~Vitaly

Thanks
Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 2:06 PM
> To: 'Vitaly Wool' <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
>
>
> > -----Original Message-----
> > From: Vitaly Wool [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 2:00 PM
> > To: Song Bao Hua (Barry Song) <[email protected]>
> > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>;
> Mike
> > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Tue, Dec 22, 2020 at 12:37 AM Song Bao Hua (Barry Song)
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Song Bao Hua (Barry Song)
> > > > Sent: Tuesday, December 22, 2020 11:38 AM
> > > > To: 'Vitaly Wool' <[email protected]>
> > > > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>;
> > Mike
> > > > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > > > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > > <[email protected]>; Andrew Morton
> > > > <[email protected]>
> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > >
> > > >
> > > > > -----Original Message-----
> > > > > From: Vitaly Wool [mailto:[email protected]]
> > > > > Sent: Tuesday, December 22, 2020 11:12 AM
> > > > > To: Song Bao Hua (Barry Song) <[email protected]>
> > > > > Cc: Shakeel Butt <[email protected]>; Minchan Kim
> <[email protected]>;
> > > > Mike
> > > > > Galbraith <[email protected]>; LKML <[email protected]>;
> linux-mm
> > > > > <[email protected]>; Sebastian Andrzej Siewior
> <[email protected]>;
> > > > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > > > <[email protected]>; Andrew Morton
> > > > > <[email protected]>
> > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > >
> > > > > On Mon, Dec 21, 2020 at 10:30 PM Song Bao Hua (Barry Song)
> > > > > <[email protected]> wrote:
> > > > > >
> > > > > >
> > > > > >
> > > > > > > -----Original Message-----
> > > > > > > From: Shakeel Butt [mailto:[email protected]]
> > > > > > > Sent: Tuesday, December 22, 2020 10:03 AM
> > > > > > > To: Song Bao Hua (Barry Song) <[email protected]>
> > > > > > > Cc: Vitaly Wool <[email protected]>; Minchan Kim
> > > > > <[email protected]>;
> > > > > > > Mike Galbraith <[email protected]>; LKML <[email protected]>;
> > > > > linux-mm
> > > > > > > <[email protected]>; Sebastian Andrzej Siewior
> > <[email protected]>;
> > > > > > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > > > > > <[email protected]>; Andrew Morton
> > > > > > > <[email protected]>
> > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > >
> > > > > > > On Mon, Dec 21, 2020 at 12:06 PM Song Bao Hua (Barry Song)
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > > -----Original Message-----
> > > > > > > > > From: Shakeel Butt [mailto:[email protected]]
> > > > > > > > > Sent: Tuesday, December 22, 2020 8:50 AM
> > > > > > > > > To: Vitaly Wool <[email protected]>
> > > > > > > > > Cc: Minchan Kim <[email protected]>; Mike Galbraith
> > <[email protected]>;
> > > > > LKML
> > > > > > > > > <[email protected]>; linux-mm <[email protected]>;
> > Song
> > > > > Bao
> > > > > > > Hua
> > > > > > > > > (Barry Song) <[email protected]>; Sebastian Andrzej
> > Siewior
> > > > > > > > > <[email protected]>; NitinGupta <[email protected]>; Sergey
> > > > > > > Senozhatsky
> > > > > > > > > <[email protected]>; Andrew Morton
> > > > > > > > > <[email protected]>
> > > > > > > > > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> > > > > > > > >
> > > > > > > > > On Mon, Dec 21, 2020 at 11:20 AM Vitaly Wool
> > <[email protected]>
> > > > > > > wrote:
> > > > > > > > > >
> > > > > > > > > > On Mon, Dec 21, 2020 at 6:24 PM Minchan Kim <[email protected]>
> > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > > On Sun, Dec 20, 2020 at 02:22:28AM +0200, Vitaly Wool wrote:
> > > > > > > > > > > > zsmalloc takes bit spinlock in its _map() callback and
> releases
> > > > > it
> > > > > > > > > > > > only in unmap() which is unsafe and leads to zswap complaining
> > > > > > > > > > > > about scheduling in atomic context.
> > > > > > > > > > > >
> > > > > > > > > > > > To fix that and to improve RT properties of zsmalloc,
> remove
> > > > that
> > > > > > > > > > > > bit spinlock completely and use a bit flag instead.
> > > > > > > > > > >
> > > > > > > > > > > I don't want to use such open code for the lock.
> > > > > > > > > > >
> > > > > > > > > > > I see from Mike's patch, recent zswap change introduced
> the
> > lockdep
> > > > > > > > > > > splat bug and you want to improve zsmalloc to fix the zswap
> > bug
> > > > > and
> > > > > > > > > > > introduce this patch with allowing preemption enabling.
> > > > > > > > > >
> > > > > > > > > > This understanding is upside down. The code in zswap you are
> > referring
> > > > > > > > > > to is not buggy. You may claim that it is suboptimal but there
> > is
> > > > > > > > > > nothing wrong in taking a mutex.
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > > Is this suboptimal for all or just the hardware accelerators?
> > Sorry,
> > > > > I
> > > > > > > > > am not very familiar with the crypto API. If I select lzo or
> lz4
> > as
> > > > > a
> > > > > > > > > zswap compressor will the [de]compression be async or sync?
> > > > > > > >
> > > > > > > > Right now, in crypto subsystem, new drivers are required to write
> > based
> > > > > on
> > > > > > > > async APIs. The old sync API can't work in new accelerator drivers
> > as
> > > > > they
> > > > > > > > are not supported at all.
> > > > > > > >
> > > > > > > > Old drivers are used to sync, but they've got async wrappers to
> > support
> > > > > async
> > > > > > > > APIs. Eg.
> > > > > > > > crypto: acomp - add support for lz4 via scomp
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > > > crypto/lz4.c?id=8cd9330e0a615c931037d4def98b5ce0d540f08d
> > > > > > > >
> > > > > > > > crypto: acomp - add support for lzo via scomp
> > > > > > > >
> > > > > > >
> > > > >
> > > >
> >
> https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/
> > > > > > > crypto/lzo.c?id=ac9d2c4b39e022d2c61486bfc33b730cfd02898e
> > > > > > > >
> > > > > > > > so they are supporting async APIs but they are still working in
> > sync
> > > > mode
> > > > > > > as
> > > > > > > > those old drivers don't sleep.
> > > > > > > >
> > > > > > >
> > > > > > > Good to know that those are sync because I want them to be sync.
> > > > > > > Please note that zswap is a cache in front of a real swap and the
> > load
> > > > > > > operation is latency sensitive as it comes in the page fault path
> > and
> > > > > > > directly impacts the applications. I doubt decompressing synchronously
> > > > > > > a 4k page on a cpu will be costlier than asynchronously decompressing
> > > > > > > the same page from hardware accelerators.
> > > > > >
> > > > > > If you read the old paper:
> > > > > >
> > > > >
> > > >
> >
> https://www.ibm.com/support/pages/new-linux-zswap-compression-functionalit
> > > > > y
> > > > > > Because the hardware accelerator speeds up compression, looking at
> the
> > zswap
> > > > > > metrics we observed that there were more store and load requests in
> > a given
> > > > > > amount of time, which filled up the zswap pool faster than a software
> > > > > > compression run. Because of this behavior, we set the max_pool_percent
> > > > > > parameter to 30 for the hardware compression runs - this means that
> > zswap
> > > > > > can use up to 30% of the 10GB of total memory.
> > > > > >
> > > > > > So using hardware accelerators, we get a chance to speed up compression
> > > > > > while decreasing cpu utilization.
> > > > > >
> > > > > > BTW, If it is not easy to change zsmalloc, one quick workaround we
> might
> > > > do
> > > > > > in zswap is adding the below after applying Mike's original patch:
> > > > > >
> > > > > > if (in_atomic()) /* for zsmalloc */
> > > > > > 	while (!try_wait_for_completion(&req->done))
> > > > > > 		;
> > > > > > else /* for zbud, z3fold */
> > > > > > 	crypto_wait_req(....);
> > > > >
> > > > > I don't think I'm going to ack this, sorry.
> > > > >
> > > >
> > > > Fair enough. And I am also wondering if we can move zpool_unmap_handle()
> > > > to right after zpool_map_handle() as below:
> > > >
> > > > dlen = PAGE_SIZE;
> > > > src = zpool_map_handle(entry->pool->zpool, entry->handle,
> > ZPOOL_MM_RO);
> > > > if (zpool_evictable(entry->pool->zpool))
> > > > src += sizeof(struct zswap_header);
> > > > + zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > > > mutex_lock(acomp_ctx->mutex);
> > > > sg_init_one(&input, src, entry->length);
> > > > sg_init_table(&output, 1);
> > > > sg_set_page(&output, page, PAGE_SIZE, 0);
> > > > acomp_request_set_params(acomp_ctx->req, &input, &output,
> > entry->length,
> > > > dlen);
> > > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
> > > > &acomp_ctx->wait);
> > > > mutex_unlock(acomp_ctx->mutex);
> > > >
> > > > - zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > > >
> > > > Since src is always low memory and we only need its virtual address
> > > > to get the page of src in sg_init_one(). We don't actually read it
> > > > by CPU anywhere.
> > >
> > > The below code might be better:
> > >
> > > dlen = PAGE_SIZE;
> > > src = zpool_map_handle(entry->pool->zpool, entry->handle,
> > ZPOOL_MM_RO);
> > > if (zpool_evictable(entry->pool->zpool))
> > > src += sizeof(struct zswap_header);
> > >
> > > acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
> > >
> > > + zpool_unmap_handle(entry->pool->zpool, entry->handle);
> > >
> > > mutex_lock(acomp_ctx->mutex);
> > > sg_init_one(&input, src, entry->length);
> > > sg_init_table(&output, 1);
> > > sg_set_page(&output, page, PAGE_SIZE, 0);
> > > acomp_request_set_params(acomp_ctx->req, &input, &output,
> > entry->length, dlen);
> > > ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
> > &acomp_ctx->wait);
> > > mutex_unlock(acomp_ctx->mutex);
> > >
> > > - zpool_unmap_handle(entry->pool->zpool, entry->handle);
> >
> > I don't see how this is going to work since we can't guarantee src
> > will be a valid pointer after the zpool_unmap_handle() call, can we?
> > Could you please elaborate?
>
> A valid pointer is for cpu to read and write. Here, cpu doesn't read
> and write it, we only need to get page struct from the address.
>
> void sg_init_one(struct scatterlist *sg, const void *buf, unsigned int buflen)
> {
> sg_init_table(sg, 1);
> sg_set_buf(sg, buf, buflen);
> }
>
> static inline void sg_set_buf(struct scatterlist *sg, const void *buf,
> unsigned int buflen)
> {
> #ifdef CONFIG_DEBUG_SG
> BUG_ON(!virt_addr_valid(buf));
> #endif
> sg_set_page(sg, virt_to_page(buf), buflen, offset_in_page(buf));
> }
>
> sg_init_one() is always using an address which has a linear mapping
> with physical address.
> So once we get the value of src, we can get the page struct.
>
> src has a linear mapping with physical address. It doesn't require
> page table walk which vmalloc_to_page() wants.
>
> The req only requires page to initialize sg table, I think if
> we are going to use a cpu-based (de)compression, the crypto
> driver will kmap it again.

Probably I introduced another bug here: for zsmalloc, the zpool allocation can
come from highmem since its malloc_support_movable = true.

	if (zpool_malloc_support_movable(entry->pool->zpool))
		gfp |= __GFP_HIGHMEM | __GFP_MOVABLE;
	ret = zpool_malloc(entry->pool->zpool, hlen + dlen, gfp, &handle);

On a 64-bit system there is no highmem, but on a 32-bit system we may trigger
this bug.

So zswap should actually have used kmap_to_page(), which handles both linear
and non-linear mappings; sg_init_one() only supports linear mappings.
That doesn't change the main point, though: once req is initialized with the
page struct, we can unmap src. If we use a HW accelerator, the access is a
DMA; if we use CPU decompression, the crypto driver will kmap() the page
again.

>
> >
> > ~Vitaly
>

Thanks
Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock


> I'm still not convinced. Will kmap what, src? At this point src might become just a bogus pointer.

As long as the object is still there, we can kmap it via its page struct; but
if it has been moved away, we have no way to reach it.

> Why couldn't the object have been moved somewhere else (due to the compaction mechanism for instance)
> at the time DMA kicks in?

So zs_map_object() guarantees that src won't be moved, by holding those preemption-disabled locks?
If so, it seems we have to drop the MOVABLE gfp in zswap for the zsmalloc case?

>
> >
> > ~Vitaly
>

Thanks
Barry

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Song Bao Hua (Barry Song)
> Sent: Tuesday, December 22, 2020 3:03 PM
> To: 'Vitaly Wool' <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>; tiantao (H) <[email protected]>
> Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
>
>
> > I'm still not convinced. Will kmap what, src? At this point src might become
> just a bogus pointer.
>
> As long as the memory is still there, we can kmap it by its page struct. But
> if
> it is not there anymore, we have no way.
>
> > Why couldn't the object have been moved somewhere else (due to the compaction
> mechanism for instance)
> > at the time DMA kicks in?
>
> So zs_map_object() will guarantee the src won't be moved by holding those
> preemption-disabled lock?
> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
>

Or we can do get_page() to avoid the movement of the page.

> >
> > >
> > > ~Vitaly
> >
>
> Thanks
> Barry


2020-12-22 09:22:44

by David Laight

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock

From: Song Bao Hua
> Sent: 21 December 2020 23:02
...
> > For decompression, I would like as low latency as possible which I
> > think is only possible by doing decompression on a cpu synchronously.
>
> One possibility is to change the HW accelerator driver to do synchronous
> polling for decompression. This would still be built on the async API, since
> that is the framework nowadays; the difference is that the driver wouldn't
> really block, and crypto_wait_req() would return without actually sleeping.

How long does the HW-accelerated compress/decompress need to take before it
is actually worth putting the process to sleep?
While the HW version might be faster than the SW one, it may not be faster
enough to make up for the cost of the hardware interrupt and the process
sleep. So it may be worth just spinning (polling the hardware register)
until the request completes.

If decompression is done that way but compression is left async, then
decompression might need to fall back to SW when the HW is busy.

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)

2020-12-22 09:46:41

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Song Bao Hua (Barry Song)
> > Sent: Tuesday, December 22, 2020 3:03 PM
> > To: 'Vitaly Wool' <[email protected]>
> > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>; tiantao (H) <[email protected]>
> > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> >
> > > I'm still not convinced. Will kmap what, src? At this point src might become
> > just a bogus pointer.
> >
> > As long as the memory is still there, we can kmap it by its page struct. But
> > if
> > it is not there anymore, we have no way.
> >
> > > Why couldn't the object have been moved somewhere else (due to the compaction
> > mechanism for instance)
> > > at the time DMA kicks in?
> >
> > So zs_map_object() will guarantee the src won't be moved by holding those
> > preemption-disabled lock?
> > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> >
>
> Or we can do get_page() to avoid the movement of the page.


I would like to discuss this more in the zswap context than in zsmalloc's.
Since zsmalloc does not implement a reclaim callback, using it in zswap
is a corner case anyway.

zswap, on the other hand, may be dealing with some new backends in
future which have more chances to become mainstream. Imagine typical
NUMA-like cases, i.e. a zswap pool allocated in some kind of SRAM, or in
unused video memory. In such a case if you try to use a pointer to an
invalidated zpool mapping, you are on the way to thrash the system.
So: no assumptions that the zswap pool is in regular linear RAM should
be made.

~Vitaly
>
>
> > >
> > > >
> > > > ~Vitaly
> > >
> >
> > Thanks
> > Barry
>
>

Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock



> -----Original Message-----
> From: Vitaly Wool [mailto:[email protected]]
> Sent: Tuesday, December 22, 2020 10:44 PM
> To: Song Bao Hua (Barry Song) <[email protected]>
> Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> NitinGupta <[email protected]>; Sergey Senozhatsky
> <[email protected]>; Andrew Morton
> <[email protected]>; tiantao (H) <[email protected]>
> Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
>
> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> <[email protected]> wrote:
> >
> >
> >
> > > -----Original Message-----
> > > From: Song Bao Hua (Barry Song)
> > > Sent: Tuesday, December 22, 2020 3:03 PM
> > > To: 'Vitaly Wool' <[email protected]>
> > > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>;
> Mike
> > > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > <[email protected]>; Andrew Morton
> > > <[email protected]>; tiantao (H) <[email protected]>
> > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > >
> > >
> > > > I'm still not convinced. Will kmap what, src? At this point src might
> become
> > > just a bogus pointer.
> > >
> > > As long as the memory is still there, we can kmap it by its page struct.
> But
> > > if
> > > it is not there anymore, we have no way.
> > >
> > > > Why couldn't the object have been moved somewhere else (due to the compaction
> > > mechanism for instance)
> > > > at the time DMA kicks in?
> > >
> > > So zs_map_object() will guarantee the src won't be moved by holding those
> > > preemption-disabled lock?
> > > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> > >
> >
> > Or we can do get_page() to avoid the movement of the page.
>
>
> I would like to discuss this more in zswap context than zsmalloc's.
> Since zsmalloc does not implement reclaim callback, using it in zswap
> is a corner case anyway.

I see. But it seems we still need a solution to keep zsmalloc and zswap
compatible, and that will require a change in either zsmalloc or zswap.
Or do you want to make zswap depend on !ZSMALLOC?

>
> zswap, on the other hand, may be dealing with some new backends in
> future which have more chances to become mainstream. Imagine typical
> NUMA-like cases, i.e. a zswap pool allocated in some kind of SRAM, or in
> unused video memory. In such a case if you try to use a pointer to an
> invalidated zpool mapping, you are on the way to thrash the system.
> So: no assumptions that the zswap pool is in regular linear RAM should
> be made.
>
> ~Vitaly

Thanks
Barry

2020-12-23 00:15:11

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
<[email protected]> wrote:
>
>
>
> > -----Original Message-----
> > From: Vitaly Wool [mailto:[email protected]]
> > Sent: Tuesday, December 22, 2020 10:44 PM
> > To: Song Bao Hua (Barry Song) <[email protected]>
> > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>; Mike
> > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > NitinGupta <[email protected]>; Sergey Senozhatsky
> > <[email protected]>; Andrew Morton
> > <[email protected]>; tiantao (H) <[email protected]>
> > Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock
> >
> > On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> > <[email protected]> wrote:
> > >
> > >
> > >
> > > > -----Original Message-----
> > > > From: Song Bao Hua (Barry Song)
> > > > Sent: Tuesday, December 22, 2020 3:03 PM
> > > > To: 'Vitaly Wool' <[email protected]>
> > > > Cc: Shakeel Butt <[email protected]>; Minchan Kim <[email protected]>;
> > Mike
> > > > Galbraith <[email protected]>; LKML <[email protected]>; linux-mm
> > > > <[email protected]>; Sebastian Andrzej Siewior <[email protected]>;
> > > > NitinGupta <[email protected]>; Sergey Senozhatsky
> > > > <[email protected]>; Andrew Morton
> > > > <[email protected]>; tiantao (H) <[email protected]>
> > > > Subject: RE: [PATCH] zsmalloc: do not use bit_spin_lock
> > > >
> > > >
> > > > > I'm still not convinced. Will kmap what, src? At this point src might
> > become
> > > > just a bogus pointer.
> > > >
> > > > As long as the memory is still there, we can kmap it by its page struct.
> > But
> > > > if
> > > > it is not there anymore, we have no way.
> > > >
> > > > > Why couldn't the object have been moved somewhere else (due to the compaction
> > > > mechanism for instance)
> > > > > at the time DMA kicks in?
> > > >
> > > > So zs_map_object() will guarantee the src won't be moved by holding those
> > > > preemption-disabled lock?
> > > > If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> > > >
> > >
> > > Or we can do get_page() to avoid the movement of the page.
> >
> >
> > I would like to discuss this more in zswap context than zsmalloc's.
> > Since zsmalloc does not implement reclaim callback, using it in zswap
> > is a corner case anyway.
>
> I see. But it seems we still need a solution for the compatibility
> of zsmalloc and zswap? This will require a change in either zsmalloc
> or zswap. Or do you want to make zswap depend on !ZSMALLOC?

No, I really don't think we should go that far. What if we add a flag
to zpool, named like "can_sleep_mapped", and have it set for
zbud/z3fold?
Then zswap could go the current path if the flag is set; and if it's
not set, and mutex_trylock fails, copy data from src to a temporary
buffer, then unmap the handle, take the mutex, process the buffer
instead of src. Not the nicest thing to do but at least it won't break
anything.

~Vitaly

> > zswap, on the other hand, may be dealing with some new backends in
> > the future which have more chances to become mainstream. Imagine typical
> > NUMA-like cases, i.e. a zswap pool allocated in some kind of SRAM, or in
> > unused video memory. In such a case if you try to use a pointer to an
> > invalidated zpool mapping, you are on the way to thrash the system.
> > So: no assumptions that the zswap pool is in regular linear RAM should
> > be made.
> >
> > ~Vitaly
>
> Thanks
> Barry

2020-12-23 12:46:52

by tiantao (H)

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock


On 2020/12/23 8:11, Vitaly Wool wrote:
> On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> <[email protected]> wrote:
>>
>>
>>> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>>> I'm still not convinced. Will kmap what, src? At this point src might become just a bogus pointer.
>>>>>
>>>>> As long as the memory is still there, we can kmap it by its page struct. But if it is not there anymore, we have no way.
>>>>>
>>>>>> Why couldn't the object have been moved somewhere else (due to the compaction mechanism for instance) at the time DMA kicks in?
>>>>> So zs_map_object() will guarantee the src won't be moved by holding those
>>>>> preemption-disabled lock?
>>>>> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
>>>>>
>>>> Or we can do get_page() to avoid the movement of the page.
>>>
>>> I would like to discuss this more in zswap context than zsmalloc's.
>>> Since zsmalloc does not implement reclaim callback, using it in zswap
>>> is a corner case anyway.
>> I see. But it seems we still need a solution for the compatibility
>> of zsmalloc and zswap? this will require change in either zsmalloc
>> or zswap.
>> or do you want to make zswap depend on !ZSMALLOC?
> No, I really don't think we should go that far. What if we add a flag
> to zpool, named like "can_sleep_mapped", and have it set for
> zbud/z3fold?
> Then zswap could go the current path if the flag is set; and if it's
> not set, and mutex_trylock fails, copy data from src to a temporary
> buffer, then unmap the handle, take the mutex, process the buffer
> instead of src. Not the nicest thing to do but at least it won't break
> anything.

I wrote the following patch according to your idea; what do you think?

--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
        struct zswap_entry *entry;
        struct scatterlist input, output;
        struct crypto_acomp_ctx *acomp_ctx;
-       u8 *src, *dst;
+       u8 *src, *dst, *tmp;
        unsigned int dlen;
        int ret;

@@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
        if (zpool_evictable(entry->pool->zpool))
                src += sizeof(struct zswap_header);

+       if (!zpool_can_sleep_mapped(entry->pool->zpool) &&
+           !mutex_trylock(acomp_ctx->mutex)) {
+               tmp = kmemdup(src, entry->length, GFP_ATOMIC);
+               zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
+               if (!tmp)
+                       goto freeentry;
+       }
        acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
        mutex_lock(acomp_ctx->mutex);
-       sg_init_one(&input, src, entry->length);
+       sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ?
+                   src : tmp, entry->length);
        sg_init_table(&output, 1);
        sg_set_page(&output, page, PAGE_SIZE, 0);
        acomp_request_set_params(acomp_ctx->req, &input, &output,
                                 entry->length, dlen);
        ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
                              &acomp_ctx->wait);
        mutex_unlock(acomp_ctx->mutex);

-       zpool_unmap_handle(entry->pool->zpool, entry->handle);
+       if (zpool_can_sleep_mapped(entry->pool->zpool))
+               zpool_unmap_handle(entry->pool->zpool, entry->handle);
+       else
+               kfree(tmp);
+
--- a/mm/zsmalloc.c
+++ b/mm/zsmalloc.c
@@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)

 static struct zpool_driver zs_zpool_driver = {
        .type =                   "zsmalloc",
+       .sleep_mapped =           false,
        .owner =                  THIS_MODULE,
        .create =                 zs_zpool_create,
        .destroy =                zs_zpool_destroy,

>
> ~Vitaly
>
>>> zswap, on the other hand, may be dealing with some new backends in
>>> future which have more chances to become mainstream. Imagine typical
>>> NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
>>> unused video memory. In such a case if you try to use a pointer to an
>>> invalidated zpool mapping, you are on the way to thrash the system.
>>> So: no assumptions that the zswap pool is in regular linear RAM should
>>> be made.
>>>
>>> ~Vitaly
>> Thanks
>> Barry

2020-12-23 18:26:47

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Wed, Dec 23, 2020 at 1:44 PM tiantao (H) <[email protected]> wrote:
>
>
> On 2020/12/23 8:11, Vitaly Wool wrote:
> > On Tue, 22 Dec 2020, 22:06 Song Bao Hua (Barry Song),
> > <[email protected]> wrote:
> >>
> >>
> >>> On Tue, 22 Dec 2020, 03:11 Song Bao Hua (Barry Song),
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>>>> I'm still not convinced. Will kmap what, src? At this point src might become just a bogus pointer.
> >>>>>
> >>>>> As long as the memory is still there, we can kmap it by its page struct. But if it is not there anymore, we have no way.
> >>>>>
> >>>>>> Why couldn't the object have been moved somewhere else (due to the compaction mechanism for instance) at the time DMA kicks in?
> >>>>> So zs_map_object() will guarantee the src won't be moved by holding those
> >>>>> preemption-disabled lock?
> >>>>> If so, it seems we have to drop the MOVABLE gfp in zswap for zsmalloc case?
> >>>>>
> >>>> Or we can do get_page() to avoid the movement of the page.
> >>>
> >>> I would like to discuss this more in zswap context than zsmalloc's.
> >>> Since zsmalloc does not implement reclaim callback, using it in zswap
> >>> is a corner case anyway.
> >> I see. But it seems we still need a solution for the compatibility
> >> of zsmalloc and zswap? this will require change in either zsmalloc
> >> or zswap.
> >> or do you want to make zswap depend on !ZSMALLOC?
> > No, I really don't think we should go that far. What if we add a flag
> > to zpool, named like "can_sleep_mapped", and have it set for
> > zbud/z3fold?
> > Then zswap could go the current path if the flag is set; and if it's
> > not set, and mutex_trylock fails, copy data from src to a temporary
> > buffer, then unmap the handle, take the mutex, process the buffer
> > instead of src. Not the nicest thing to do but at least it won't break
> > anything.
>
> write the following patch according to your idea, what do you think ?

Yep, that is basically what I was thinking of. Some nitpicks below:

> --- a/mm/zswap.c
> +++ b/mm/zswap.c
> @@ -1235,7 +1235,7 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>         struct zswap_entry *entry;
>         struct scatterlist input, output;
>         struct crypto_acomp_ctx *acomp_ctx;
> -       u8 *src, *dst;
> +       u8 *src, *dst, *tmp;
>         unsigned int dlen;
>         int ret;
>
> @@ -1262,16 +1262,26 @@ static int zswap_frontswap_load(unsigned type, pgoff_t offset,
>         if (zpool_evictable(entry->pool->zpool))
>                 src += sizeof(struct zswap_header);
>
> +       if (!zpool_can_sleep_mapped(entry->pool->zpool) &&
> +           !mutex_trylock(acomp_ctx->mutex)) {
> +               tmp = kmemdup(src, entry->length, GFP_ATOMIC);

kmemdup? just use memcpy :)

> +               zpool_unmap_handle(entry->pool->zpool, entry->handle); ???
> +               if (!tmp)
> +                       goto freeentry;

Jumping to freeentry results in returning success, which isn't
appropriate here. You should return -ENOMEM in this case.

> +       }
>         acomp_ctx = raw_cpu_ptr(entry->pool->acomp_ctx);
>         mutex_lock(acomp_ctx->mutex);
> -       sg_init_one(&input, src, entry->length);
> +       sg_init_one(&input, zpool_can_sleep_mapped(entry->pool->zpool) ?
> +                   src : tmp, entry->length);

This is kind of hard to read; I would rather assign src to tmp after
the memcpy, and then no condition check would be needed here.

>         sg_init_table(&output, 1);
>         sg_set_page(&output, page, PAGE_SIZE, 0);
>         acomp_request_set_params(acomp_ctx->req, &input, &output,
>                                  entry->length, dlen);
>         ret = crypto_wait_req(crypto_acomp_decompress(acomp_ctx->req),
>                               &acomp_ctx->wait);
>         mutex_unlock(acomp_ctx->mutex);
>
> -       zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +       if (zpool_can_sleep_mapped(entry->pool->zpool))
> +               zpool_unmap_handle(entry->pool->zpool, entry->handle);
> +       else
> +               kfree(tmp);
> +
> --- a/mm/zsmalloc.c
> +++ b/mm/zsmalloc.c
> @@ -440,6 +440,7 @@ static u64 zs_zpool_total_size(void *pool)
>
>  static struct zpool_driver zs_zpool_driver = {
>         .type =                   "zsmalloc",
> +       .sleep_mapped =           false,
>         .owner =                  THIS_MODULE,
>         .create =                 zs_zpool_create,
>         .destroy =                zs_zpool_destroy,

Best regards,
Vitaly

> >
> > ~Vitaly
> >
> >>> zswap, on the other hand, may be dealing with some new backends in
> >>> future which have more chances to become mainstream. Imagine typical
> >>> NUMA-like cases, i. e. a zswap pool allocated in some kind SRAM, or in
> >>> unused video memory. In such a case if you try to use a pointer to an
> >>> invalidated zpool mapping, you are on the way to thrash the system.
> >>> So: no assumptions that the zswap pool is in regular linear RAM should
> >>> be made.
> >>>
> >>> ~Vitaly
> >> Thanks
> >> Barry
>

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On 2020-12-20 08:21:37 [+0100], Vitaly Wool wrote:
> Not really because bit spinlock leaves preemption disabled.

It leaves it disabled for a reason. Now you just spin until the original
context gets back on the CPU. On UP with preemption, if the "lock owner"
gets preempted, the next lock attempt will lock up the system.

Sebastian

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > write the following patch according to your idea, what do you think ?
>
> Yep, that is basically what I was thinking of. Some nitpicks below:

Did this go somewhere? The thread just ends here on my end.
Mike, is this patch fixing / helping your case in anyway?

> Best regards,
> Vitaly

Sebastian

2021-01-14 16:32:23

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
<[email protected]> wrote:
>
> On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > write the following patch according to your idea, what do you think ?
> >
> > Yep, that is basically what I was thinking of. Some nitpicks below:
>
> Did this go somewhere? The thread just ends here on my end.
> Mike, is this patch fixing / helping your case in anyway?

Please see
* https://marc.info/?l=linux-mm&m=160889419514019&w=2
* https://marc.info/?l=linux-mm&m=160889418114011&w=2
* https://marc.info/?l=linux-mm&m=160889448814057&w=2

Haven't had time to test these yet, but they seem to be alright.

Best regards,
Vitaly

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> <[email protected]> wrote:
> >
> > On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > > write the following patch according to your idea, what do you think ?
> > >
> > > Yep, that is basically what I was thinking of. Some nitpicks below:
> >
> > Did this go somewhere? The thread just ends here on my end.
> > Mike, is this patch fixing / helping your case in anyway?
>
> Please see
> * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> * https://marc.info/?l=linux-mm&m=160889448814057&w=2

Thank you, that would be
[email protected]

for b4 compatibility :)

> Haven't had time to test these yet but seem to be alright.

So zs_map_object() still disables preemption but the mutex part is
avoided by the patch?

> Best regards,
> Vitaly

Sebastian

2021-01-14 17:18:08

by Vitaly Wool

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On Thu, Jan 14, 2021 at 5:56 PM Sebastian Andrzej Siewior
<[email protected]> wrote:
>
> On 2021-01-14 17:29:37 [+0100], Vitaly Wool wrote:
> > On Thu, 14 Jan 2021, 17:18 Sebastian Andrzej Siewior,
> > <[email protected]> wrote:
> > >
> > > On 2020-12-23 19:25:02 [+0100], Vitaly Wool wrote:
> > > > > write the following patch according to your idea, what do you think ?
> > > >
> > > > Yep, that is basically what I was thinking of. Some nitpicks below:
> > >
> > > Did this go somewhere? The thread just ends here on my end.
> > > Mike, is this patch fixing / helping your case in anyway?
> >
> > Please see
> > * https://marc.info/?l=linux-mm&m=160889419514019&w=2
> > * https://marc.info/?l=linux-mm&m=160889418114011&w=2
> > * https://marc.info/?l=linux-mm&m=160889448814057&w=2
>
> Thank you, that would be
> [email protected]
>
> for b4 compatibility :)
>
> > Haven't had time to test these yet but seem to be alright.
>
> So zs_map_object() still disables preemption but the mutex part is
> avoided by the patch?

Basically, yes. Minchan was very clear that he didn't want to remove
that inter-function locking, so be it.
I wouldn't really advise using zsmalloc with zswap because zsmalloc
has no support for reclaim; nevertheless, I wouldn't like this
configuration to stop working for those who are already using it.

Would you or Mike be up for testing Tian Tao's patchset?

Best regards,
Vitaly

Subject: Re: [PATCH] zsmalloc: do not use bit_spin_lock

On 2021-01-14 18:15:08 [+0100], Vitaly Wool wrote:
>
> Basically, yes. Minchan was very clear that he didn't want to remove
> that inter-function locking, so be it.
> I wouldn't really advise to use zsmalloc with zswap because zsmalloc
> has no support for reclaim, nevertheless I wouldn't like this
> configuration to stop working for those who are already using it.
>
> Would you or Mike be up for testing Tian Taos's patchset?

I will try to reproduce Mike's original report and the fix early next
week.

> Best regards,
> Vitaly

Sebastian