2016-03-31 08:45:24

by Sergey Senozhatsky

Subject: Re: [PATCH] zsmalloc: use workqueue to destroy pool in zpool callback

On (03/30/16 08:59), Minchan Kim wrote:
> On Tue, Mar 29, 2016 at 03:02:57PM -0700, Yu Zhao wrote:
> > zs_destroy_pool() might sleep so it shouldn't be used in zpool
> > destroy callback which can be invoked in softirq context when
> > zsmalloc is configured to work with zswap.
>
> I think it's a limitation of zswap design, not zsmalloc.
> Could you handle it in zswap?

agree. hm, looking at this backtrace

> [<ffffffffaea0224b>] mutex_lock+0x1b/0x2f
> [<ffffffffaebca4f0>] kmem_cache_destroy+0x50/0x130
> [<ffffffffaec10405>] zs_destroy_pool+0x85/0xe0
> [<ffffffffaec1046e>] zs_zpool_destroy+0xe/0x10
> [<ffffffffaec101a4>] zpool_destroy_pool+0x54/0x70
> [<ffffffffaebedac2>] __zswap_pool_release+0x62/0x90
> [<ffffffffaeb1037e>] rcu_process_callbacks+0x22e/0x640
> [<ffffffffaeb15a3e>] ? run_timer_softirq+0x3e/0x280
> [<ffffffffaeabe13b>] __do_softirq+0xcb/0x250
> [<ffffffffaeabe4dc>] irq_exit+0x9c/0xb0
> [<ffffffffaea03e7a>] smp_apic_timer_interrupt+0x6a/0x80
> [<ffffffffaf0a394f>] apic_timer_interrupt+0x7f/0x90

it also can hit the following path

rcu_process_callbacks()
  __zswap_pool_release()
    zswap_pool_destroy()
      zswap_cpu_comp_destroy()
        cpu_notifier_register_begin()
          mutex_lock(&cpu_add_remove_lock); <<<

can't it?

-ss


2016-03-31 21:47:00

by Yu Zhao

Subject: Re: [PATCH] zsmalloc: use workqueue to destroy pool in zpool callback

On Thu, Mar 31, 2016 at 05:46:39PM +0900, Sergey Senozhatsky wrote:
> On (03/30/16 08:59), Minchan Kim wrote:
> > On Tue, Mar 29, 2016 at 03:02:57PM -0700, Yu Zhao wrote:
> > > zs_destroy_pool() might sleep so it shouldn't be used in zpool
> > > destroy callback which can be invoked in softirq context when
> > > zsmalloc is configured to work with zswap.
> >
> > I think it's a limitation of zswap design, not zsmalloc.
> > Could you handle it in zswap?
>
> agree. hm, looking at this backtrace
>
> > [<ffffffffaea0224b>] mutex_lock+0x1b/0x2f
> > [<ffffffffaebca4f0>] kmem_cache_destroy+0x50/0x130
> > [<ffffffffaec10405>] zs_destroy_pool+0x85/0xe0
> > [<ffffffffaec1046e>] zs_zpool_destroy+0xe/0x10
> > [<ffffffffaec101a4>] zpool_destroy_pool+0x54/0x70
> > [<ffffffffaebedac2>] __zswap_pool_release+0x62/0x90
> > [<ffffffffaeb1037e>] rcu_process_callbacks+0x22e/0x640
> > [<ffffffffaeb15a3e>] ? run_timer_softirq+0x3e/0x280
> > [<ffffffffaeabe13b>] __do_softirq+0xcb/0x250
> > [<ffffffffaeabe4dc>] irq_exit+0x9c/0xb0
> > [<ffffffffaea03e7a>] smp_apic_timer_interrupt+0x6a/0x80
> > [<ffffffffaf0a394f>] apic_timer_interrupt+0x7f/0x90
>
> it also can hit the following path
>
> rcu_process_callbacks()
> __zswap_pool_release()
> zswap_pool_destroy()
> zswap_cpu_comp_destroy()
> cpu_notifier_register_begin()
> mutex_lock(&cpu_add_remove_lock); <<<
>
> can't it?
>
> -ss

Thanks, Sergey. Now I'm convinced the problem should be fixed in
zswap. Since the rcu callback is already executed asynchronously,
using a workqueue to defer the callback even further doesn't seem
to introduce any additional race conditions, at least.
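The deferral Yu Zhao describes can be illustrated outside the kernel. The sketch below is a userspace analogue only (a pthread stands in for the kernel workqueue; struct pool, pool_destroy_deferred() and the other names are hypothetical, not zswap/zsmalloc API): the context that must not block merely enqueues the teardown, and a worker thread, which is allowed to sleep, performs it.

```c
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical pool whose teardown may block (like zs_destroy_pool()). */
struct pool {
    char name[16];
    int destroyed;
};

static struct pool *deferred;           /* single-slot "work queue" */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;

/* Worker thread: runs in process context, so it is allowed to sleep. */
static void *destroy_worker(void *arg)
{
    (void)arg;
    pthread_mutex_lock(&lock);
    while (!deferred)
        pthread_cond_wait(&cond, &lock);
    struct pool *p = deferred;
    deferred = NULL;
    pthread_mutex_unlock(&lock);

    usleep(1000);      /* stands in for the sleeping teardown work */
    p->destroyed = 1;  /* mark instead of freeing, so callers can inspect */
    return p;
}

/*
 * Called from a context that must not block (the analogue of the RCU
 * softirq callback): it only enqueues the destruction and returns.
 */
static void pool_destroy_deferred(struct pool *p)
{
    pthread_mutex_lock(&lock);
    deferred = p;
    pthread_cond_signal(&cond);
    pthread_mutex_unlock(&lock);
}
```

The point of the pattern is that pool_destroy_deferred() itself never takes a sleeping lock, which is exactly what schedule_work() buys the kernel code.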

2016-03-31 22:06:17

by Dan Streetman

Subject: Re: [PATCH] zsmalloc: use workqueue to destroy pool in zpool callback

On Thu, Mar 31, 2016 at 5:46 PM, Yu Zhao <[email protected]> wrote:
> On Thu, Mar 31, 2016 at 05:46:39PM +0900, Sergey Senozhatsky wrote:
>> On (03/30/16 08:59), Minchan Kim wrote:
>> > On Tue, Mar 29, 2016 at 03:02:57PM -0700, Yu Zhao wrote:
>> > > zs_destroy_pool() might sleep so it shouldn't be used in zpool
>> > > destroy callback which can be invoked in softirq context when
>> > > zsmalloc is configured to work with zswap.
>> >
>> > I think it's a limitation of zswap design, not zsmalloc.
>> > Could you handle it in zswap?
>>
>> agree. hm, looking at this backtrace
>>
>> > [<ffffffffaea0224b>] mutex_lock+0x1b/0x2f
>> > [<ffffffffaebca4f0>] kmem_cache_destroy+0x50/0x130
>> > [<ffffffffaec10405>] zs_destroy_pool+0x85/0xe0
>> > [<ffffffffaec1046e>] zs_zpool_destroy+0xe/0x10
>> > [<ffffffffaec101a4>] zpool_destroy_pool+0x54/0x70
>> > [<ffffffffaebedac2>] __zswap_pool_release+0x62/0x90
>> > [<ffffffffaeb1037e>] rcu_process_callbacks+0x22e/0x640
>> > [<ffffffffaeb15a3e>] ? run_timer_softirq+0x3e/0x280
>> > [<ffffffffaeabe13b>] __do_softirq+0xcb/0x250
>> > [<ffffffffaeabe4dc>] irq_exit+0x9c/0xb0
>> > [<ffffffffaea03e7a>] smp_apic_timer_interrupt+0x6a/0x80
>> > [<ffffffffaf0a394f>] apic_timer_interrupt+0x7f/0x90
>>
>> it also can hit the following path
>>
>> rcu_process_callbacks()
>> __zswap_pool_release()
>> zswap_pool_destroy()
>> zswap_cpu_comp_destroy()
>> cpu_notifier_register_begin()
>> mutex_lock(&cpu_add_remove_lock); <<<
>>
>> can't it?
>>
>> -ss
>
> Thanks, Sergey. Now I'm convinced the problem should be fixed in
> zswap. Since the rcu callback is already executed asynchronously,
> using a workqueue to defer the callback even further doesn't seem
> to introduce any additional race conditions, at least.

certainly seems appropriate to fix it in zswap; I'll work on a patch
unless Seth or anyone else is already working on it.

2016-04-25 21:22:39

by Dan Streetman

Subject: [PATCH] mm/zpool: use workqueue for zpool_destroy

Add a work_struct to struct zpool, and change zpool_destroy_pool to
defer calling the pool implementation destroy.

The zsmalloc pool destroy function, which is one of the zpool
implementations, may sleep during destruction of the pool. However
zswap, which uses zpool, may call zpool_destroy_pool from atomic
context. So we need to defer the call to the zpool implementation
to destroy the pool.

This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
but moved to zpool.

Reported-by: Yu Zhao <[email protected]>
Signed-off-by: Dan Streetman <[email protected]>
Cc: Dan Streetman <[email protected]>
---
mm/zpool.c | 18 ++++++++++++++----
1 file changed, 14 insertions(+), 4 deletions(-)

diff --git a/mm/zpool.c b/mm/zpool.c
index fd3ff71..ea12069 100644
--- a/mm/zpool.c
+++ b/mm/zpool.c
@@ -23,6 +23,7 @@ struct zpool {
const struct zpool_ops *ops;

struct list_head list;
+ struct work_struct work;
};

static LIST_HEAD(drivers_head);
@@ -197,6 +198,15 @@ struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp,
return zpool;
}

+static void zpool_destroy_pool_work(struct work_struct *work)
+{
+ struct zpool *zpool = container_of(work, struct zpool, work);
+
+ zpool->driver->destroy(zpool->pool);
+ zpool_put_driver(zpool->driver);
+ kfree(zpool);
+}
+
/**
* zpool_destroy_pool() - Destroy a zpool
* @pool The zpool to destroy.
@@ -204,7 +214,8 @@ struct zpool *zpool_create_pool(const char *type, const char *name, gfp_t gfp,
* Implementations must guarantee this to be thread-safe,
* however only when destroying different pools. The same
* pool should only be destroyed once, and should not be used
- * after it is destroyed.
+ * after it is destroyed. This defers calling the implementation
+ * to a workqueue, so the implementation may sleep.
*
* This destroys an existing zpool. The zpool should not be in use.
*/
@@ -215,9 +226,8 @@ void zpool_destroy_pool(struct zpool *zpool)
spin_lock(&pools_lock);
list_del(&zpool->list);
spin_unlock(&pools_lock);
- zpool->driver->destroy(zpool->pool);
- zpool_put_driver(zpool->driver);
- kfree(zpool);
+ INIT_WORK(&zpool->work, zpool_destroy_pool_work);
+ schedule_work(&zpool->work);
}

/**
--
2.7.4

2016-04-25 21:46:25

by Andrew Morton

Subject: Re: [PATCH] mm/zpool: use workqueue for zpool_destroy

On Mon, 25 Apr 2016 17:20:10 -0400 Dan Streetman <[email protected]> wrote:

> Add a work_struct to struct zpool, and change zpool_destroy_pool to
> defer calling the pool implementation destroy.
>
> The zsmalloc pool destroy function, which is one of the zpool
> implementations, may sleep during destruction of the pool. However
> zswap, which uses zpool, may call zpool_destroy_pool from atomic
> context. So we need to defer the call to the zpool implementation
> to destroy the pool.
>
> This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
> but moved to zpool.

OK, but the refrain remains the same: what are the runtime effects of
the change? Are real people in real worlds seeing scary kernel
warnings? Deadlocks?

This info is needed so that I and others can decide which kernel
version(s) should be patched.

Thanks.

2016-04-25 22:18:22

by Yu Zhao

Subject: Re: [PATCH] mm/zpool: use workqueue for zpool_destroy

On Mon, Apr 25, 2016 at 05:20:10PM -0400, Dan Streetman wrote:
> Add a work_struct to struct zpool, and change zpool_destroy_pool to
> defer calling the pool implementation destroy.
>
> The zsmalloc pool destroy function, which is one of the zpool
> implementations, may sleep during destruction of the pool. However
> zswap, which uses zpool, may call zpool_destroy_pool from atomic
> context. So we need to defer the call to the zpool implementation
> to destroy the pool.
>
> This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
> but moved to zpool.

Thanks, Dan. Sergey also mentioned another call path that triggers the
same problem (BUG: scheduling while atomic):
rcu_process_callbacks()
  __zswap_pool_release()
    zswap_pool_destroy()
      zswap_cpu_comp_destroy()
        cpu_notifier_register_begin()
          mutex_lock(&cpu_add_remove_lock);
So I was thinking zswap_pool_destroy() might be done in a workqueue in zswap.c.
This way we fix both call paths.

Or you have another patch to fix the second call path?

2016-04-26 00:57:50

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zpool: use workqueue for zpool_destroy

On (04/25/16 15:18), Yu Zhao wrote:
> On Mon, Apr 25, 2016 at 05:20:10PM -0400, Dan Streetman wrote:
> > Add a work_struct to struct zpool, and change zpool_destroy_pool to
> > defer calling the pool implementation destroy.
> >
> > The zsmalloc pool destroy function, which is one of the zpool
> > implementations, may sleep during destruction of the pool. However
> > zswap, which uses zpool, may call zpool_destroy_pool from atomic
> > context. So we need to defer the call to the zpool implementation
> > to destroy the pool.
> >
> > This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
> > but moved to zpool.
>
> Thanks, Dan. Sergey also mentioned another call path that triggers the
> same problem (BUG: scheduling while atomic):
> rcu_process_callbacks()
> __zswap_pool_release()
> zswap_pool_destroy()
> zswap_cpu_comp_destroy()
> cpu_notifier_register_begin()
> mutex_lock(&cpu_add_remove_lock);
> So I was thinking zswap_pool_destroy() might be done in workqueue in zswap.c.
> This way we fix both call paths.

right, I'm not sure the patch addressed all of the issues.

-ss

2016-04-26 11:08:27

by Dan Streetman

Subject: Re: [PATCH] mm/zpool: use workqueue for zpool_destroy

On Mon, Apr 25, 2016 at 6:18 PM, Yu Zhao <[email protected]> wrote:
> On Mon, Apr 25, 2016 at 05:20:10PM -0400, Dan Streetman wrote:
>> Add a work_struct to struct zpool, and change zpool_destroy_pool to
>> defer calling the pool implementation destroy.
>>
>> The zsmalloc pool destroy function, which is one of the zpool
>> implementations, may sleep during destruction of the pool. However
>> zswap, which uses zpool, may call zpool_destroy_pool from atomic
>> context. So we need to defer the call to the zpool implementation
>> to destroy the pool.
>>
>> This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
>> but moved to zpool.
>
> Thanks, Dan. Sergey also mentioned another call path that triggers the
> same problem (BUG: scheduling while atomic):
> rcu_process_callbacks()
> __zswap_pool_release()
> zswap_pool_destroy()
> zswap_cpu_comp_destroy()
> cpu_notifier_register_begin()
> mutex_lock(&cpu_add_remove_lock);
> So I was thinking zswap_pool_destroy() might be done in workqueue in zswap.c.
> This way we fix both call paths.

Yes, you're right, I took so long to get around to this I forgot the details :-)

I'll send a new patch to zswap.

>
> Or you have another patch to fix the second call path?

2016-04-26 21:10:25

by Dan Streetman

Subject: [PATCH] mm/zswap: use workqueue to destroy pool

Add a work_struct to struct zswap_pool, and change __zswap_pool_empty
to use the workqueue instead of using call_rcu().

When zswap destroys a pool no longer in use, it uses call_rcu() to
perform the destruction/freeing. Since that executes in softirq
context, it must not sleep. However, actually destroying the pool
involves freeing the per-cpu compressors (which requires locking the
cpu_add_remove_lock mutex) and freeing the zpool, for which the
implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy,
which locks the slab_mutex). So if either mutex is currently taken,
or any other part of the compressor or zpool implementation sleeps, it
will result in a BUG().

It's not easy to reproduce this when changing zswap's params normally.
In testing with a loaded system, this does not fail:

$ cd /sys/module/zswap/parameters
$ echo lz4 > compressor ; echo zsmalloc > zpool

nor does this:

$ while true ; do
> echo lzo > compressor ; echo zbud > zpool
> sleep 1
> echo lz4 > compressor ; echo zsmalloc > zpool
> sleep 1
> done

although it's still possible either of those might fail, depending on
whether anything else besides zswap has locked the mutexes.

However, changing a parameter with no delay immediately causes the
schedule while atomic BUG:

$ while true ; do
> echo lzo > compressor ; echo lz4 > compressor
> done

This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
but moved to zswap, to cover compressor and zpool freeing.

Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
Reported-by: Yu Zhao <[email protected]>
Signed-off-by: Dan Streetman <[email protected]>
Cc: Dan Streetman <[email protected]>
---
mm/zswap.c | 12 ++++++++----
1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index 91dad80..f207da7 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -117,7 +117,7 @@ struct zswap_pool {
struct crypto_comp * __percpu *tfm;
struct kref kref;
struct list_head list;
- struct rcu_head rcu_head;
+ struct work_struct work;
struct notifier_block notifier;
char tfm_name[CRYPTO_MAX_ALG_NAME];
};
@@ -652,9 +652,11 @@ static int __must_check zswap_pool_get(struct zswap_pool *pool)
return kref_get_unless_zero(&pool->kref);
}

-static void __zswap_pool_release(struct rcu_head *head)
+static void __zswap_pool_release(struct work_struct *work)
{
- struct zswap_pool *pool = container_of(head, typeof(*pool), rcu_head);
+ struct zswap_pool *pool = container_of(work, typeof(*pool), work);
+
+ synchronize_rcu();

/* nobody should have been able to get a kref... */
WARN_ON(kref_get_unless_zero(&pool->kref));
@@ -674,7 +676,9 @@ static void __zswap_pool_empty(struct kref *kref)
WARN_ON(pool == zswap_pool_current());

list_del_rcu(&pool->list);
- call_rcu(&pool->rcu_head, __zswap_pool_release);
+
+ INIT_WORK(&pool->work, __zswap_pool_release);
+ schedule_work(&pool->work);

spin_unlock(&zswap_pools_lock);
}
--
2.7.4

2016-04-27 00:57:24

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

Hello,

On (04/26/16 17:08), Dan Streetman wrote:
[..]
> -static void __zswap_pool_release(struct rcu_head *head)
> +static void __zswap_pool_release(struct work_struct *work)
> {
> - struct zswap_pool *pool = container_of(head, typeof(*pool), rcu_head);
> + struct zswap_pool *pool = container_of(work, typeof(*pool), work);
> +
> + synchronize_rcu();
>
> /* nobody should have been able to get a kref... */
> WARN_ON(kref_get_unless_zero(&pool->kref));
> @@ -674,7 +676,9 @@ static void __zswap_pool_empty(struct kref *kref)
> WARN_ON(pool == zswap_pool_current());
>
> list_del_rcu(&pool->list);
> - call_rcu(&pool->rcu_head, __zswap_pool_release);
> +
> + INIT_WORK(&pool->work, __zswap_pool_release);
> + schedule_work(&pool->work);

so in general the patch looks good to me.

it's either I didn't have enough coffee yet (which is true) or
_IN THEORY_ it creates a tiny race condition; one which is hard (and
unlikely) to hit, but still. and the problem is
CONFIG_ZSMALLOC_STAT.

zsmalloc stats are exported via debugfs, which is initialized during
pool setup in zs_pool_stat_create() -> debugfs_create_dir() as zsmalloc<ID>.

so, once again, in theory, since zswap uses the same <ID>, the debugfs
dir will have the same name for different pools, so a series of zpool
changes via the user space knob

zsmalloc > zpool
zbud > zpool
zsmalloc > zpool

can result in

release zsmalloc0           switch to zbud        switch to zsmalloc

__zswap_pool_release()
  schedule_work()
                            ...
                                                  zs_create_pool()
                                                    zs_pool_stat_create()
                                                    << zsmalloc0 still exists >>

work is finally scheduled
  zs_destroy_pool()
    zs_pool_stat_destroy()
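The window in this timeline can be simulated with a tiny userspace model, assuming a flat namespace of stat directories (stat_dir_create() and stat_dir_destroy() are hypothetical stand-ins for zs_pool_stat_create()/debugfs_create_dir() and their teardown, not kernel API):

```c
#include <string.h>

/* Tiny stand-in for the debugfs "zsmalloc<ID>" directory namespace. */
#define MAX_DIRS 4
static char dirs[MAX_DIRS][16];

/* Mirrors debugfs_create_dir() failing on a name collision. */
static int stat_dir_create(const char *name)
{
    for (int i = 0; i < MAX_DIRS; i++)
        if (dirs[i][0] && strcmp(dirs[i], name) == 0)
            return -1;                  /* -EEXIST, reported only as failure */
    for (int i = 0; i < MAX_DIRS; i++)
        if (!dirs[i][0]) {
            strcpy(dirs[i], name);
            return 0;
        }
    return -1;                          /* namespace full */
}

/* Runs when the deferred destroy work finally executes. */
static void stat_dir_destroy(const char *name)
{
    for (int i = 0; i < MAX_DIRS; i++)
        if (strcmp(dirs[i], name) == 0)
            dirs[i][0] = '\0';
}
```

While the old pool's destruction is still parked in the workqueue, its directory remains registered, so creating a new pool with the same name fails.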

-ss

2016-04-27 17:19:45

by Dan Streetman

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

On Tue, Apr 26, 2016 at 8:58 PM, Sergey Senozhatsky
<[email protected]> wrote:
> Hello,
>
> On (04/26/16 17:08), Dan Streetman wrote:
> [..]
>> -static void __zswap_pool_release(struct rcu_head *head)
>> +static void __zswap_pool_release(struct work_struct *work)
>> {
>> - struct zswap_pool *pool = container_of(head, typeof(*pool), rcu_head);
>> + struct zswap_pool *pool = container_of(work, typeof(*pool), work);
>> +
>> + synchronize_rcu();
>>
>> /* nobody should have been able to get a kref... */
>> WARN_ON(kref_get_unless_zero(&pool->kref));
>> @@ -674,7 +676,9 @@ static void __zswap_pool_empty(struct kref *kref)
>> WARN_ON(pool == zswap_pool_current());
>>
>> list_del_rcu(&pool->list);
>> - call_rcu(&pool->rcu_head, __zswap_pool_release);
>> +
>> + INIT_WORK(&pool->work, __zswap_pool_release);
>> + schedule_work(&pool->work);
>
> so in general the patch look good to me.
>
> it's either I didn't have enough coffee yet (which is true) or
> _IN THEORY_ it creates a tiny race condition; which is hard (and
> unlikely) to hit, but still. and the problem being is
> CONFIG_ZSMALLOC_STAT.

Aha, thanks, I hadn't tested with that param enabled. However, the
patch doesn't create the race condition; that existed already.

>
> zsmalloc stats are exported via debugfs which is getting init
> during pool set up in zs_pool_stat_create() -> debugfs_create_dir() zsmalloc<ID>.
>
> so, once again, in theory, since zswap has the same <ID>, debugfs
> dir will have the same for different pool, so a series of zpool
> changes via user space knob
>
> zsmalloc > zpool
> zbud > zpool
> zsmalloc > zpool
>
> can result in
>
> release zsmalloc0           switch to zbud        switch to zsmalloc
>
> __zswap_pool_release()
>   schedule_work()
>                             ...
>                                                   zs_create_pool()
>                                                     zs_pool_stat_create()
>                                                     << zsmalloc0 still exists >>
>
> work is finally scheduled
>   zs_destroy_pool()
>     zs_pool_stat_destroy()

zsmalloc uses the pool 'name' provided, without any checking, and in
this case it will always be 'zswap'. So this is easy to reproduce:

1. make sure kernel is compiled with CONFIG_ZSMALLOC_STAT=y
2. enable zswap, change zpool to zsmalloc
3. put some pages into zswap
4. try to change the compressor -> failure

It fails because the new zswap pool creates a new zpool using
zsmalloc, but it can't create the zsmalloc pool because there is
already one named 'zswap' so the stat dir can't be created.

So...either zswap needs to provide a unique 'name' to each of its
zpools, or zsmalloc needs to modify its provided pool name in some way
(add a unique suffix maybe). Or both.

It seems like zsmalloc should do the checking/modification - or, at
the very least, it should have consistent behavior regardless of the
CONFIG_ZSMALLOC_STAT setting. However, it's easy to change zswap to
provide a unique name for each zpool creation, and zsmalloc's primary
user (zram) guarantees to provide a unique name for each pool created.
So updating zswap is probably best.
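The unique-name approach suggested above can be sketched in userspace C. zswap_pool_name() is a hypothetical helper (the kernel side would use atomic_inc_return() rather than C11 atomics), but the shape of the fix is the same: a shared counter folded into the pool name so no two pools ever collide.

```c
#include <stdio.h>
#include <string.h>
#include <stdatomic.h>

/* Counter shared by all pool creations, as atomic_inc_return() would be. */
static atomic_uint pool_count;

/* Writes a unique "zswap<hex>" name into buf; 16 bytes is plenty here. */
static void zswap_pool_name(char *buf, size_t len)
{
    /* fetch_add returns the old value, so +1 mimics atomic_inc_return() */
    snprintf(buf, len, "zswap%x",
             atomic_fetch_add(&pool_count, 1) + 1);
}
```

Each creation gets a distinct name, so a new pool can be created even while an old one with the previous name is still awaiting deferred destruction.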


>
> -ss

2016-04-28 01:38:59

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

Hello Dan,

On (04/27/16 13:19), Dan Streetman wrote:
[..]
> > so in general the patch look good to me.
> >
> > it's either I didn't have enough coffee yet (which is true) or
> > _IN THEORY_ it creates a tiny race condition; which is hard (and
> > unlikely) to hit, but still. and the problem being is
> > CONFIG_ZSMALLOC_STAT.
>
> Aha, thanks, I hadn't tested with that param enabled. However, the
> patch doesn't create the race condition, that existed already.

well, agreed. it's not a zsmalloc race condition as such, but comes from
the way zsmalloc is managed (deferred destruction either via rcu or
scheduled work).

> It fails because the new zswap pool creates a new zpool using
> zsmalloc, but it can't create the zsmalloc pool because there is
> already one named 'zswap' so the stat dir can't be created.
>
> So...either zswap needs to provide a unique 'name' to each of its
> zpools, or zsmalloc needs to modify its provided pool name in some way
> (add a unique suffix maybe). Or both.
>
> It seems like zsmalloc should do the checking/modification - or, at
> the very least, it should have consistent behavior regardless of the
> CONFIG_ZSMALLOC_STAT setting.

yes, zram guarantees that there won't be any name collisions, and the
way it works for zram, zram<ID> corresponds to zsmalloc<ID>.


the bigger issue here (and I was thinking at some point of fixing it,
but then I grepped to see how many API users are in there, and I gave
up) is that it seems we have no way to check if the dir exists in debugfs.

we call this function

struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
{
struct dentry *dentry = start_creating(name, parent);
struct inode *inode;

if (IS_ERR(dentry))
return NULL;

inode = debugfs_get_inode(dentry->d_sb);
if (unlikely(!inode))
return failed_creating(dentry);

inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
inode->i_op = &simple_dir_inode_operations;
inode->i_fop = &simple_dir_operations;

/* directory inodes start off with i_nlink == 2 (for "." entry) */
inc_nlink(inode);
d_instantiate(dentry, inode);
inc_nlink(d_inode(dentry->d_parent));
fsnotify_mkdir(d_inode(dentry->d_parent), dentry);
return end_creating(dentry);
}

and debugfs _does know_ that the directory exists: ERR_PTR(-EEXIST) is
what start_creating()->lookup_one_len() returns

static struct dentry *start_creating(const char *name, struct dentry *parent)
{
struct dentry *dentry;
int error;

pr_debug("debugfs: creating file '%s'\n",name);

if (IS_ERR(parent))
return parent;

error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
&debugfs_mount_count);
if (error)
return ERR_PTR(error);

/* If the parent is not specified, we create it in the root.
* We need the root dentry to do this, which is in the super
* block. A pointer to that is in the struct vfsmount that we
* have around.
*/
if (!parent)
parent = debugfs_mount->mnt_root;

inode_lock(d_inode(parent));
dentry = lookup_one_len(name, parent, strlen(name));
if (!IS_ERR(dentry) && d_really_is_positive(dentry)) {
dput(dentry);
dentry = ERR_PTR(-EEXIST);
}

if (IS_ERR(dentry)) {
inode_unlock(d_inode(parent));
simple_release_fs(&debugfs_mount, &debugfs_mount_count);
}

return dentry;
}

but debugfs_create_dir(), instead of propagating this error, swallows it
and simply returns NULL, so we can't tell the difference between -EEXIST,
OOM, or anything else. so doing this check in zsmalloc is not so easy.
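The error-swallowing behavior can be demonstrated with a small userspace model. ERR_PTR()/PTR_ERR()/IS_ERR() below are userspace copies of the kernel helpers, and start_creating_sim()/create_dir_sim() are hypothetical stand-ins for the debugfs functions quoted above, not the real API:

```c
#include <errno.h>
#include <stddef.h>

/* Userspace copies of the kernel's err.h helpers. */
static inline void *ERR_PTR(long error) { return (void *)error; }
static inline long PTR_ERR(const void *ptr) { return (long)ptr; }
static inline int IS_ERR(const void *ptr)
{
    return (unsigned long)ptr >= (unsigned long)-4095;
}

/* Stand-in for start_creating(): it reports *why* creation failed. */
static void *start_creating_sim(int fail_errno)
{
    return ERR_PTR(-(long)fail_errno);
}

/*
 * Stand-in for debugfs_create_dir(): it swallows the error and returns
 * NULL, so -EEXIST and -ENOMEM become indistinguishable to the caller.
 */
static void *create_dir_sim(int fail_errno)
{
    void *dentry = start_creating_sim(fail_errno);

    if (IS_ERR(dentry))
        return NULL;   /* the specific errno is lost right here */
    return dentry;
}
```

This is exactly why zsmalloc cannot tell a name collision from an allocation failure through the public debugfs API.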

/* well, I may be wrong here */

> However, it's easy to change zswap to provide a unique name for each
> zpool creation, and zsmalloc's primary user (zram) guarantees to
> provide a unique name for each pool created. So updating zswap is
> probably best.

if you can do it in zswap, then please do.

-ss

2016-04-28 04:08:19

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

On (04/28/16 10:40), Sergey Senozhatsky wrote:
[..]
> the bigger issue here (and I was thinking at some point of fixing it,
> but then I grepped to see how many API users are in there, and I gave
> up) is that it seems we have no way to check if the dir exists in debugfs.

well, unless we want to do something like the patch below. but I don't
think Greg will buy it, and the basic rule is to be careful in the driver
code and avoid any collisions.

---

fs/debugfs/inode.c | 48 ++++++++++++++++++++++++++++++++++++++++++++++++
include/linux/debugfs.h | 7 +++++++
2 files changed, 55 insertions(+)

diff --git a/fs/debugfs/inode.c b/fs/debugfs/inode.c
index 8580831..76cf851 100644
--- a/fs/debugfs/inode.c
+++ b/fs/debugfs/inode.c
@@ -709,6 +709,54 @@ exit:
EXPORT_SYMBOL_GPL(debugfs_rename);

/**
+ * debugfs_entry_exists - lookup file/directory name
+ *
+ * @name: a pointer to a string containing the name of the file/directory
+ * to lookup.
+ * @parent: a pointer to the parent dentry. This should be a directory
+ * dentry if set. If this parameter is NULL, then the root of the
+ * debugfs filesystem will be used.
+ *
+ * This function looks up a file/directory name in debugfs. If the
+ * name corresponds to a positive dentry, the function will return %0.
+ *
+ * If debugfs is not enabled in the kernel, the value -%ENODEV will be
+ * returned.
+ */
+int debugfs_entry_exists(const char *name, struct dentry *parent)
+{
+ struct dentry *dentry;
+ int error;
+
+ if (IS_ERR(parent))
+ return PTR_ERR(parent);
+
+ error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
+ &debugfs_mount_count);
+ if (error)
+ return error;
+
+ if (!parent)
+ parent = debugfs_mount->mnt_root;
+
+ error = -EINVAL;
+ inode_lock(d_inode(parent));
+ dentry = lookup_one_len(name, parent, strlen(name));
+ if (IS_ERR(dentry)) {
+ error = PTR_ERR(dentry);
+ } else {
+ if (d_really_is_positive(dentry))
+ error = 0;
+ dput(dentry);
+ }
+
+ inode_unlock(d_inode(parent));
+ simple_release_fs(&debugfs_mount, &debugfs_mount_count);
+ return error;
+}
+EXPORT_SYMBOL_GPL(debugfs_entry_exists);
+
+/**
* debugfs_initialized - Tells whether debugfs has been registered
*/
bool debugfs_initialized(void)
diff --git a/include/linux/debugfs.h b/include/linux/debugfs.h
index 981e53a..5b6321e 100644
--- a/include/linux/debugfs.h
+++ b/include/linux/debugfs.h
@@ -124,6 +124,8 @@ ssize_t debugfs_read_file_bool(struct file *file, char __user *user_buf,
ssize_t debugfs_write_file_bool(struct file *file, const char __user *user_buf,
size_t count, loff_t *ppos);

+int debugfs_entry_exists(const char *name, struct dentry *parent);
+
#else

#include <linux/err.h>
@@ -312,6 +314,11 @@ static inline ssize_t debugfs_write_file_bool(struct file *file,
return -ENODEV;
}

+static inline int debugfs_entry_exists(const char *name, struct dentry *parent)
+{
+ return -ENODEV;
+}
+
#endif

#endif

2016-04-28 08:22:38

by Dan Streetman

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

On Wed, Apr 27, 2016 at 9:40 PM, Sergey Senozhatsky
<[email protected]> wrote:
> Hello Dan,
>
> On (04/27/16 13:19), Dan Streetman wrote:
> [..]
>> > so in general the patch look good to me.
>> >
>> > it's either I didn't have enough coffee yet (which is true) or
>> > _IN THEORY_ it creates a tiny race condition; which is hard (and
>> > unlikely) to hit, but still. and the problem being is
>> > CONFIG_ZSMALLOC_STAT.
>>
>> Aha, thanks, I hadn't tested with that param enabled. However, the
>> patch doesn't create the race condition, that existed already.
>
> well, agree. it's not like zsmalloc race condition, but the way zsmalloc
> is managed (deferred destruction either via rcu or scheduled work).
>
>> It fails because the new zswap pool creates a new zpool using
>> zsmalloc, but it can't create the zsmalloc pool because there is
>> already one named 'zswap' so the stat dir can't be created.
>>
>> So...either zswap needs to provide a unique 'name' to each of its
>> zpools, or zsmalloc needs to modify its provided pool name in some way
>> (add a unique suffix maybe). Or both.
>>
>> It seems like zsmalloc should do the checking/modification - or, at
>> the very least, it should have consistent behavior regardless of the
>> CONFIG_ZSMALLOC_STAT setting.
>
> yes, zram guarantees that there won't be any name collisions. and the
> way it's working for zram, zram<ID> corresponds to zsmalloc<ID>.
>
>
> the bigger issue here (and I was thinking at some point of fixing it,
> but then I grepped to see how many API users are in there, and I gave
> up) is that it seems we have no way to check if the dir exists in debugfs.
>
> we call this function
>
> struct dentry *debugfs_create_dir(const char *name, struct dentry *parent)
> {
> struct dentry *dentry = start_creating(name, parent);
> struct inode *inode;
>
> if (IS_ERR(dentry))
> return NULL;
>
> inode = debugfs_get_inode(dentry->d_sb);
> if (unlikely(!inode))
> return failed_creating(dentry);
>
> inode->i_mode = S_IFDIR | S_IRWXU | S_IRUGO | S_IXUGO;
> inode->i_op = &simple_dir_inode_operations;
> inode->i_fop = &simple_dir_operations;
>
> /* directory inodes start off with i_nlink == 2 (for "." entry) */
> inc_nlink(inode);
> d_instantiate(dentry, inode);
> inc_nlink(d_inode(dentry->d_parent));
> fsnotify_mkdir(d_inode(dentry->d_parent), dentry);
> return end_creating(dentry);
> }
>
> and debugfs _does know_ that the directory ERR_PTR(-EEXIST), that's what
> start_creating()->lookup_one_len() return
>
> static struct dentry *start_creating(const char *name, struct dentry *parent)
> {
> struct dentry *dentry;
> int error;
>
> pr_debug("debugfs: creating file '%s'\n",name);
>
> if (IS_ERR(parent))
> return parent;
>
> error = simple_pin_fs(&debug_fs_type, &debugfs_mount,
> &debugfs_mount_count);
> if (error)
> return ERR_PTR(error);
>
> /* If the parent is not specified, we create it in the root.
> * We need the root dentry to do this, which is in the super
> * block. A pointer to that is in the struct vfsmount that we
> * have around.
> */
> if (!parent)
> parent = debugfs_mount->mnt_root;
>
> inode_lock(d_inode(parent));
> dentry = lookup_one_len(name, parent, strlen(name));
> if (!IS_ERR(dentry) && d_really_is_positive(dentry)) {
> dput(dentry);
> dentry = ERR_PTR(-EEXIST);
> }
>
> if (IS_ERR(dentry)) {
> inode_unlock(d_inode(parent));
> simple_release_fs(&debugfs_mount, &debugfs_mount_count);
> }
>
> return dentry;
> }
>
> but debugfs_create_dir() instead of propagating this error, it swallows it
> and simply return NULL, so we can't tell the difference between -EEXIST, OOM,
> or anything else. so doing this check in zsmalloc() is not so easy.

yeah, Greg intentionally made the debugfs api opaque, so there's only
a binary created/failed indication.

While I agree zswap should provide unique names, I also think zsmalloc
should not abort if its debugfs content fails to be created - the
intention of debugfs is not to be a critical part of drivers, but only
to provide debug information. I'll send a separate patch to zsmalloc
that allows zsmalloc pool creation to continue even if the
debugfs dir/file fails to be created.
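The "warn and continue" behavior Dan proposes can be sketched as a userspace model (struct zs_pool_sim and the *_sim functions are hypothetical, not the actual zsmalloc code): the stats directory becomes optional state, and its absence no longer aborts pool creation.

```c
#include <stdio.h>
#include <stdlib.h>

struct zs_pool_sim {
    void *stat_dentry;    /* NULL when debugfs setup failed */
};

/* Simulated debugfs dir creation; 'available' is just a test knob. */
static void *stat_create_sim(int available)
{
    return available ? malloc(1) : NULL;
}

/*
 * Pool creation that treats the stats directory as optional: on
 * failure it warns and continues, instead of failing the whole pool
 * creation the way the CONFIG_ZSMALLOC_STAT error path did.
 */
static struct zs_pool_sim *pool_create_sim(int debugfs_ok)
{
    struct zs_pool_sim *pool = calloc(1, sizeof(*pool));

    if (!pool)
        return NULL;
    pool->stat_dentry = stat_create_sim(debugfs_ok);
    if (!pool->stat_dentry)
        fprintf(stderr, "stat dir creation failed, continuing\n");
    /* no "goto err" here: a missing stats dir is not fatal */
    return pool;
}
```

The teardown side would likewise skip the stats removal when stat_dentry is NULL.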


>
> /* well, I may be wrong here */
>
>> However, it's easy to change zswap to provide a unique name for each
>> zpool creation, and zsmalloc's primary user (zram) guarantees to
>> provide a unique name for each pool created. So updating zswap is
>> probably best.
>
> if you can do it in zswap, then please do.
>
> -ss

2016-04-28 09:16:17

by Dan Streetman

Subject: [PATCH] mm/zswap: provide unique zpool name

Instead of using "zswap" as the name for all zpools created, add
an atomic counter and use "zswap%x" with the counter number for each
zpool created, to provide a unique name for each new zpool.

As zsmalloc, one of the zpool implementations, requires/expects a
unique name for each pool created, zswap should provide a unique name.
The zsmalloc pool creation does not fail if a new pool with a
conflicting name is created, unless CONFIG_ZSMALLOC_STAT is enabled;
in that case, zsmalloc pool creation fails with -ENOMEM. Then zswap
will be unable to change its compressor parameter if its zpool is
zsmalloc; it also will be unable to change its zpool parameter back
to zsmalloc, if it has any existing old zpool using zsmalloc with
page(s) in it. Attempts to change the parameters will result in
failure to create the zpool. This changes zswap to provide a
unique name for each zpool creation.

Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
Reported-by: Sergey Senozhatsky <[email protected]>
Cc: Dan Streetman <[email protected]>
Signed-off-by: Dan Streetman <[email protected]>
---
mm/zswap.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/mm/zswap.c b/mm/zswap.c
index f207da7..275b22c 100644
--- a/mm/zswap.c
+++ b/mm/zswap.c
@@ -170,6 +170,8 @@ static struct zswap_tree *zswap_trees[MAX_SWAPFILES];
static LIST_HEAD(zswap_pools);
/* protects zswap_pools list modification */
static DEFINE_SPINLOCK(zswap_pools_lock);
+/* pool counter to provide unique names to zpool */
+static atomic_t zswap_pools_count = ATOMIC_INIT(0);

/* used by param callback function */
static bool zswap_init_started;
@@ -565,6 +567,7 @@ static struct zswap_pool *zswap_pool_find_get(char *type, char *compressor)
static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
{
struct zswap_pool *pool;
+ char name[38]; /* 'zswap' + 32 char (max) num + \0 */
gfp_t gfp = __GFP_NORETRY | __GFP_NOWARN | __GFP_KSWAPD_RECLAIM;

pool = kzalloc(sizeof(*pool), GFP_KERNEL);
@@ -573,7 +576,10 @@ static struct zswap_pool *zswap_pool_create(char *type, char *compressor)
return NULL;
}

- pool->zpool = zpool_create_pool(type, "zswap", gfp, &zswap_zpool_ops);
+ /* unique name for each pool specifically required by zsmalloc */
+ snprintf(name, 38, "zswap%x", atomic_inc_return(&zswap_pools_count));
+
+ pool->zpool = zpool_create_pool(type, name, gfp, &zswap_zpool_ops);
if (!pool->zpool) {
pr_err("%s zpool not available\n", type);
goto error;
--
2.7.4

2016-04-28 22:16:50

by Andrew Morton

Subject: Re: [PATCH] mm/zswap: provide unique zpool name

On Thu, 28 Apr 2016 05:13:23 -0400 Dan Streetman <[email protected]> wrote:

> Instead of using "zswap" as the name for all zpools created, add
> an atomic counter and use "zswap%x" with the counter number for each
> zpool created, to provide a unique name for each new zpool.
>
> As zsmalloc, one of the zpool implementations, requires/expects a
> unique name for each pool created, zswap should provide a unique name.
> The zsmalloc pool creation does not fail if a new pool with a
> conflicting name is created, unless CONFIG_ZSMALLOC_STAT is enabled;
> in that case, zsmalloc pool creation fails with -ENOMEM. Then zswap
> will be unable to change its compressor parameter if its zpool is
> zsmalloc; it also will be unable to change its zpool parameter back
> to zsmalloc, if it has any existing old zpool using zsmalloc with
> page(s) in it. Attempts to change the parameters will result in
> failure to create the zpool. This changes zswap to provide a
> unique name for each zpool creation.
>
> Fixes: f1c54846ee45 ("zswap: dynamic pool creation")

September 2015. I added a cc:stable to this one.

> Reported-by: Sergey Senozhatsky <[email protected]>
> Cc: Dan Streetman <[email protected]>
> Signed-off-by: Dan Streetman <[email protected]>

2016-04-29 00:23:35

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zswap: use workqueue to destroy pool

On (04/26/16 17:08), Dan Streetman wrote:
> Add a work_struct to struct zswap_pool, and change __zswap_pool_empty
> to use the workqueue instead of using call_rcu().
>
> When zswap destroys a pool no longer in use, it uses call_rcu() to
> perform the destruction/freeing. Since that executes in softirq
> context, it must not sleep. However, actually destroying the pool
> involves freeing the per-cpu compressors (which requires locking the
> cpu_add_remove_lock mutex) and freeing the zpool, for which the
> implementation may sleep (e.g. zsmalloc calls kmem_cache_destroy,
> which locks the slab_mutex). So if either mutex is currently taken,
> or any other part of the compressor or zpool implementation sleeps, it
> will result in a BUG().
>
> It's not easy to reproduce this when changing zswap's params normally.
> In testing with a loaded system, this does not fail:
>
> $ cd /sys/module/zswap/parameters
> $ echo lz4 > compressor ; echo zsmalloc > zpool
>
> nor does this:
>
> $ while true ; do
> > echo lzo > compressor ; echo zbud > zpool
> > sleep 1
> > echo lz4 > compressor ; echo zsmalloc > zpool
> > sleep 1
> > done
>
> although it's still possible either of those might fail, depending on
> whether anything else besides zswap has locked the mutexes.
>
> However, changing a parameter with no delay immediately causes the
> schedule while atomic BUG:
>
> $ while true ; do
> > echo lzo > compressor ; echo lz4 > compressor
> > done
>
> This is essentially the same as Yu Zhao's proposed patch to zsmalloc,
> but moved to zswap, to cover compressor and zpool freeing.
>
> Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
> Reported-by: Yu Zhao <[email protected]>
> Signed-off-by: Dan Streetman <[email protected]>
> Cc: Dan Streetman <[email protected]>

Reviewed-by: Sergey Senozhatsky <[email protected]>

-ss

2016-04-29 00:24:16

by Sergey Senozhatsky

Subject: Re: [PATCH] mm/zswap: provide unique zpool name

On (04/28/16 05:13), Dan Streetman wrote:
> Instead of using "zswap" as the name for all zpools created, add
> an atomic counter and use "zswap%x" with the counter number for each
> zpool created, to provide a unique name for each new zpool.
>
> As zsmalloc, one of the zpool implementations, requires/expects a
> unique name for each pool created, zswap should provide a unique name.
> The zsmalloc pool creation does not fail if a new pool with a
> conflicting name is created, unless CONFIG_ZSMALLOC_STAT is enabled;
> in that case, zsmalloc pool creation fails with -ENOMEM. Then zswap
> will be unable to change its compressor parameter if its zpool is
> zsmalloc; it also will be unable to change its zpool parameter back
> to zsmalloc, if it has any existing old zpool using zsmalloc with
> page(s) in it. Attempts to change the parameters will result in
> failure to create the zpool. This changes zswap to provide a
> unique name for each zpool creation.
>
> Fixes: f1c54846ee45 ("zswap: dynamic pool creation")
> Reported-by: Sergey Senozhatsky <[email protected]>
> Cc: Dan Streetman <[email protected]>
> Signed-off-by: Dan Streetman <[email protected]>

Reviewed-by: Sergey Senozhatsky <[email protected]>

-ss