2022-05-25 20:04:25

by Michal Koutný

Subject: [PATCH 2/2] cgroup: Use separate work structs on css release path

The cgroup_subsys_state objects of cgroup subsystems (not cgroup->self)
use both the kill and release callbacks on their release path (see the
comment for css_free_rwork_fn()).

When the last reference is also the base reference, we run into issues
when the active work_struct (1) is re-initialized from css_release() (2).

// ref=1: only base reference
kill_css()
css_get() // fuse, ref+=1 == 2
percpu_ref_kill_and_confirm
// ref -= 1 == 1: kill base references
[via rcu]
css_killed_ref_fn == refcnt.confirm_switch
queue_work(css->destroy_work) (1)
[via css->destroy_work]
css_killed_work_fn == wq.func
offline_css() // needs fuse
css_put // ref -= 1 == 0: de-fuse, was last
...
percpu_ref_put_many
css_release
queue_work(css->destroy_work) (2)
[via css->destroy_work]
css_release_work_fn == wq.func

Although we take a fuse reference in css_killed_work_fn(), it pins the
css only until just after offline_css().

We could check inside css_release() whether destroy_work is active
(WORK_STRUCT_PENDING_BIT) and daisy-chain css_release_work_fn() from
css_release(). To avoid clashes with the various stages of work item
processing, we instead spend some space in the css (my config's css
grows to 232B + 32B) and create a separate work entry for each user.

Reported-by: [email protected]
Reported-by: Tadeusz Struk <[email protected]>
Link: https://lore.kernel.org/r/[email protected]/
Signed-off-by: Tadeusz Struk <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
---
include/linux/cgroup-defs.h | 5 +++--
kernel/cgroup/cgroup.c | 14 +++++++-------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..16b99aa04305 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
*/
atomic_t online_cnt;

- /* percpu_ref killing and RCU release */
- struct work_struct destroy_work;
+ /* percpu_ref killing, css release, and RCU release work structs */
+ struct work_struct killed_ref_work;
+ struct work_struct release_work;
struct rcu_work destroy_rwork;

/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index a5b0d5d54fbc..33b3a44391d7 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5102,7 +5102,7 @@ static struct cftype cgroup_base_files[] = {
* css_free_work_fn().
*
* It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
* steps to the already complex sequence.
*/
static void css_free_rwork_fn(struct work_struct *work)
@@ -5157,7 +5157,7 @@ static void css_free_rwork_fn(struct work_struct *work)
static void css_release_work_fn(struct work_struct *work)
{
struct cgroup_subsys_state *css =
- container_of(work, struct cgroup_subsys_state, destroy_work);
+ container_of(work, struct cgroup_subsys_state, release_work);
struct cgroup_subsys *ss = css->ss;
struct cgroup *cgrp = css->cgroup;

@@ -5213,8 +5213,8 @@ static void css_release(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);

- INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ INIT_WORK(&css->release_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css->release_work);
}

static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5549,7 +5549,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
static void css_killed_work_fn(struct work_struct *work)
{
struct cgroup_subsys_state *css =
- container_of(work, struct cgroup_subsys_state, destroy_work);
+ container_of(work, struct cgroup_subsys_state, killed_ref_work);

mutex_lock(&cgroup_mutex);

@@ -5570,8 +5570,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
container_of(ref, struct cgroup_subsys_state, refcnt);

if (atomic_dec_and_test(&css->online_cnt)) {
- INIT_WORK(&css->destroy_work, css_killed_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+ queue_work(cgroup_destroy_wq, &css->killed_ref_work);
}
}

--
2.35.3



2022-05-26 00:03:50

by Michal Koutný

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, May 25, 2022 at 05:15:17PM +0200, Michal Koutný <[email protected]> wrote:
> // ref=1: only base reference
> kill_css()
> css_get() // fuse, ref+=1 == 2
> percpu_ref_kill_and_confirm
> // ref -= 1 == 1: kill base references
> [via rcu]
> css_killed_ref_fn == refcnt.confirm_switch
> queue_work(css->destroy_work) (1)
> [via css->destroy_work]
> css_killed_work_fn == wq.func
> offline_css() // needs fuse
> css_put // ref -= 1 == 0: de-fuse, was last
> ...
> percpu_ref_put_many
> css_release
> queue_work(css->destroy_work) (2)
> [via css->destroy_work]
> css_release_work_fn == wq.func

Apologies, this is a wrong explanation. (I thought it explained why
Tadeusz's patch with a double get/put didn't fix it, i.e. under the
explanation above no number of extra references would help.)

But the above is not correct. I've looked at the stack trace [1] and the
offending percpu_ref_put_many is called from an RCU callback
percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
to zero there...

Regards,
Michal

2022-05-26 20:45:41

by Michal Koutný

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, May 25, 2022 at 06:14:55PM +0200, Michal Koutný <[email protected]> wrote:
> But the above is not correct. I've looked at the stack trace [1] and the
> offending percpu_ref_put_many is called from an RCU callback
> percpu_ref_switch_to_atomic_rcu(), so I can't actually see why it drops
> to zero there...

The link [1] should have been [1].
After some more thought, the following is a possible sequencing of the
involved functions.

// ref=A: initial state
kill_css()
css_get // ref+=F == A+F: fuse
percpu_ref_kill_and_confirm
__percpu_ref_switch_to_atomic
percpu_ref_get
// ref += 1 == A+F+1: atomic mode, self-protection
percpu_ref_put
// ref -= 1 == A+F: kill the base reference
[via rcu]
percpu_ref_switch_to_atomic_rcu
percpu_ref_call_confirm_rcu
css_killed_ref_fn == refcnt.confirm_switch
queue_work(css->destroy_work) (1)
[via css->destroy_work]
css_killed_work_fn == wq.func
offline_css() // needs fuse
css_put // ref -= F == A: de-fuse
percpu_ref_put
// ref -= 1 == A-1: remove self-protection
css_release // A <= 1 -> 2nd queue_work explodes!
queue_work(css->destroy_work) (2)
[via css->destroy_work]
css_release_work_fn == wq.func

Another CPU would have to dispatch and run the css_killed_work_fn
callback in parallel with percpu_ref_switch_to_atomic_rcu. It's a more
correct explanation; however, its likelihood seems very low. Perhaps
some debug prints of percpu_ref_data.data in percpu_ref_call_confirm_rcu
could shed more light on this [2].

HTH,
Michal

[1] https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000
[2] I tried notifying syzbot about [3] moments ago.
[3] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-syzbot


2022-05-27 12:28:55

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

Hello, Michal.

On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> // ref=A: initial state
> kill_css()
> css_get // ref+=F == A+F: fuse
> percpu_ref_kill_and_confirm
> __percpu_ref_switch_to_atomic
> percpu_ref_get
> // ref += 1 == A+F+1: atomic mode, self-protection
> percpu_ref_put
> // ref -= 1 == A+F: kill the base reference
> [via rcu]
> percpu_ref_switch_to_atomic_rcu
> percpu_ref_call_confirm_rcu
> css_killed_ref_fn == refcnt.confirm_switch
> queue_work(css->destroy_work) (1)
> [via css->destroy_work]
> css_killed_work_fn == wq.func
> offline_css() // needs fuse
> css_put // ref -= F == A: de-fuse
> percpu_ref_put
> // ref -= 1 == A-1: remove self-protection
> css_release // A <= 1 -> 2nd queue_work explodes!

I'm not sure I'm following it but it's perfectly fine to re-use the work
item at this point. The work item actually can be re-cycled from the very
beginning of the work function. The only thing we need to make sure is that
we don't css_put() prematurely to avoid it being freed while we're using it.

For the sharing to be a problem, we should be queueing the release work item
while the destroy instance is still pending, and if that is the case, it
doesn't really matter whether we use two separate work items or not. We're
already broken and would just be shifting the problem to explode elsewhere.

The only possibility that I can think of is that somehow we're ending up
with an extra css_put() somewhere thus triggering the release path
prematurely. If that's the case, we'll prolly need to trace get/puts to find
out who's causing the ref imbalance.

Thanks.

--
tejun

2022-05-27 20:06:33

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Fri, May 27, 2022 at 06:54:29PM +0200, Michal Koutn? wrote:
> Hello Tadeusz.
>
> On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <[email protected]> wrote:
> > As far as I can see we are trying to test the same thing suggested by Tejun.
> > I just sent a test request to try this:
> > https://github.com/tstruk/linux/commit/master
>
> Yup, I've added a few more prints to get more fine-grained resolution.
> Also, I decided to use ftrace printk so as not to interfere with timing
> too much (due to the original race hypothesis).
>
> > Let me know if you have any more tests to run and I will hold off until
> > you are done.
>
> My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea).
>
> So far I'm not convinced I'm extracting the complete ftrace buffer from
> the syzbot runs, so I'm not drawing any conclusions from the traces
> I've got. I'm not going to continue today. You may have more luck with
> your plain printk (if it's just an imbalance and it avoids printk
> locking sensitive paths).

At least for triaging, it may be easier to use bpftrace to attach kprobes to
the get/put functions with the right filter and record the kstack counts.

Thanks.

--
tejun

2022-05-28 19:58:03

by Tadeusz Struk

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>> // ref=A: initial state
>> kill_css()
>> css_get // ref+=F == A+F: fuse
>> percpu_ref_kill_and_confirm
>> __percpu_ref_switch_to_atomic
>> percpu_ref_get
>> // ref += 1 == A+F+1: atomic mode, self-protection
>> percpu_ref_put
>> // ref -= 1 == A+F: kill the base reference
>> [via rcu]
>> percpu_ref_switch_to_atomic_rcu
>> percpu_ref_call_confirm_rcu
>> css_killed_ref_fn == refcnt.confirm_switch
>> queue_work(css->destroy_work) (1)
>> [via css->destroy_work]
>> css_killed_work_fn == wq.func
>> offline_css() // needs fuse
>> css_put // ref -= F == A: de-fuse
>> percpu_ref_put
>> // ref -= 1 == A-1: remove self-protection
>> css_release // A <= 1 -> 2nd queue_work explodes!
>
> I'm not sure I'm following it but it's perfectly fine to re-use the work
> item at this point. The work item actually can be re-cycled from the very
> beginning of the work function. The only thing we need to make sure is that
> we don't css_put() prematurely to avoid it being freed while we're using it.
>
> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
>
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere thus triggering the release path
> prematurely. If that's the case, we'll prolly need to trace get/puts to find
> out who's causing the ref imbalance.

Hi Michal,
As far as I can see we are trying to test the same thing suggested by Tejun.
I just sent a test request to try this:
https://github.com/tstruk/linux/commit/master

Let me know if you have any more tests to run and I will hold off until
you are done.

--
Thanks,
Tadeusz

2022-05-28 19:59:56

by Michal Koutný

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

Hello Tadeusz.

On Fri, May 27, 2022 at 09:39:20AM -0700, Tadeusz Struk <[email protected]> wrote:
> As far as I can see we are trying to test the same thing suggested by Tejun.
> I just sent a test request to try this:
> https://github.com/tstruk/linux/commit/master

Yup, I've added a few more prints to get more fine-grained resolution.
Also, I decided to use ftrace printk so as not to interfere with timing
too much (due to the original race hypothesis).

> Let me know if you have any more tests to run and I will hold off until
> you are done.

My latest attempt is [1] (tip 5500e05d82fd5b5db2203eedb3f786857d3ccbea).

So far I'm not convinced I'm extracting the complete ftrace buffer from
the syzbot runs, so I'm not drawing any conclusions from the traces I've
got. I'm not going to continue today. You may have more luck with your
plain printk (if it's just an imbalance and it avoids printk locking
sensitive paths).

HTH,
Michal

[1] https://github.com/Werkov/linux/tree/cgroup-ml/css-lifecycle-b2

2022-06-01 23:15:26

by Tadeusz Struk

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 5/26/22 11:15, Tejun Heo wrote:
> Hello, Michal.
>
> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>> // ref=A: initial state
>> kill_css()
>> css_get // ref+=F == A+F: fuse
>> percpu_ref_kill_and_confirm
>> __percpu_ref_switch_to_atomic
>> percpu_ref_get
>> // ref += 1 == A+F+1: atomic mode, self-protection
>> percpu_ref_put
>> // ref -= 1 == A+F: kill the base reference
>> [via rcu]
>> percpu_ref_switch_to_atomic_rcu
>> percpu_ref_call_confirm_rcu
>> css_killed_ref_fn == refcnt.confirm_switch
>> queue_work(css->destroy_work) (1)
>> [via css->destroy_work]
>> css_killed_work_fn == wq.func
>> offline_css() // needs fuse
>> css_put // ref -= F == A: de-fuse
>> percpu_ref_put
>> // ref -= 1 == A-1: remove self-protection
>> css_release // A <= 1 -> 2nd queue_work explodes!
>
> I'm not sure I'm following it but it's perfectly fine to re-use the work
> item at this point. The work item actually can be re-cycled from the very
> beginning of the work function. The only thing we need to make sure is that
> we don't css_put() prematurely to avoid it being freed while we're using it.

Yes, it is ok to reuse a work struct, but it's not ok to have the same
work struct enqueued twice on the same WQ when list debug is enabled.
That's why we are getting this "BUG: corrupted list.."

> For the sharing to be a problem, we should be queueing the release work item
> while the destroy instance is still pending, and if that is the case, it
> doesn't really matter whether we use two separate work items or not. We're
> already broken and would just be shifting the problem to explode elsewhere.
>
> The only possibility that I can think of is that somehow we're ending up
> with an extra css_put() somewhere thus triggering the release path
> prematurely. If that's the case, we'll prolly need to trace get/puts to find
> out who's causing the ref imbalance.

That's right. Michal was on the right track for the kill_css() part.
What I think is going on is that once css_create() fails then
cgroup_subtree_control_write() ends up calling first kill_css() and
then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
The each_live_descendant_post() also iterates on the root.
Here is the call flow (sorry for long lines):

cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
|
|-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
|
| css_killed_ref_fn() <- first css->destroy_work enqueue
| |
| |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
|
|
|-> goto out_unlock;
| |
| |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
|
|
css_release(percpu_ref) <- second css->destroy_work enqueue
|
|-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.


What seems to work for me as the simplest fix is to prevent enqueuing a dying
css in css_release() as below. Please let me know if that makes sense to you.

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..5618211487cc 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);

- INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ if (!(css->flags & CSS_DYING)) {
+ INIT_WORK(&css->destroy_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
+ }
}

static void init_and_link_css(struct cgroup_subsys_state *css,

--
Thanks,
Tadeusz

2022-06-01 23:21:09

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
> > On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
> > > // ref=A: initial state
> > > kill_css()
> > > css_get // ref+=F == A+F: fuse
> > > percpu_ref_kill_and_confirm
> > > __percpu_ref_switch_to_atomic
> > > percpu_ref_get
> > > // ref += 1 == A+F+1: atomic mode, self-protection
> > > percpu_ref_put
> > > // ref -= 1 == A+F: kill the base reference
> > > [via rcu]
> > > percpu_ref_switch_to_atomic_rcu
> > > percpu_ref_call_confirm_rcu
> > > css_killed_ref_fn == refcnt.confirm_switch
> > > queue_work(css->destroy_work) (1)
> > > [via css->destroy_work]
> > > css_killed_work_fn == wq.func
> > > offline_css() // needs fuse
> > > css_put // ref -= F == A: de-fuse
> > > percpu_ref_put
> > > // ref -= 1 == A-1: remove self-protection
> > > css_release // A <= 1 -> 2nd queue_work explodes!
> >
> > I'm not sure I'm following it but it's perfectly fine to re-use the work
> > item at this point. The work item actually can be re-cycled from the very
> > beginning of the work function. The only thing we need to make sure is that
> > we don't css_put() prematurely to avoid it being freed while we're using it.
>
> Yes, it is ok to reuse a work struct, but it's not ok to have the same
> work struct enqueued twice on the same WQ when list debug is enabled.
> That's why we are getting this "BUG: corrupted list.."

The above scenario isn't that tho. Once the work item starts executing, wq
doesn't care about what happens to it and as killed_work_fn is holding a
reference, the release scheduling shouldn't happen before it starts
executing unless somebody is screwing up the refcnting.

> That's right. Michal was on the right track for the kill_css() part.
> What I think is going on is that once css_create() fails then
> cgroup_subtree_control_write() ends up calling first kill_css() and
> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
> The each_live_descendant_post() also iterates on the root.
> Here is the call flow (sorry for long lines):
>
> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
> |
> |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
> |
> | css_killed_ref_fn() <- first css->destroy_work enqueue
> | |
> | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
> |
> |
> |-> goto out_unlock;
> | |
> | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
> |
> |
> css_release(percpu_ref) <- second css->destroy_work enqueue
> |
> |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>
>
> What seems to work for me as the simplest fix is to prevent enqueuing a dying
> css in css_release() as below. Please let me know if that makes sense to you.
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index 1779ccddb734..5618211487cc 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
> struct cgroup_subsys_state *css =
> container_of(ref, struct cgroup_subsys_state, refcnt);
> - INIT_WORK(&css->destroy_work, css_release_work_fn);
> - queue_work(cgroup_destroy_wq, &css->destroy_work);
> + if (!(css->flags & CSS_DYING)) {
> + INIT_WORK(&css->destroy_work, css_release_work_fn);
> + queue_work(cgroup_destroy_wq, &css->destroy_work);
> + }

When the problem is a ref imbalance, how can the above be the solution? Of
course the release path won't cause an issue if it doesn't run, but we still
need to free the thing, right?

Thanks.

--
tejun

2022-06-01 23:41:27

by Tadeusz Struk

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/1/22 16:20, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:13:32PM -0700, Tadeusz Struk wrote:
>>> On Thu, May 26, 2022 at 11:56:34AM +0200, Michal Koutný wrote:
>>>> // ref=A: initial state
>>>> kill_css()
>>>> css_get // ref+=F == A+F: fuse
>>>> percpu_ref_kill_and_confirm
>>>> __percpu_ref_switch_to_atomic
>>>> percpu_ref_get
>>>> // ref += 1 == A+F+1: atomic mode, self-protection
>>>> percpu_ref_put
>>>> // ref -= 1 == A+F: kill the base reference
>>>> [via rcu]
>>>> percpu_ref_switch_to_atomic_rcu
>>>> percpu_ref_call_confirm_rcu
>>>> css_killed_ref_fn == refcnt.confirm_switch
>>>> queue_work(css->destroy_work) (1)
>>>> [via css->destroy_work]
>>>> css_killed_work_fn == wq.func
>>>> offline_css() // needs fuse
>>>> css_put // ref -= F == A: de-fuse
>>>> percpu_ref_put
>>>> // ref -= 1 == A-1: remove self-protection
>>>> css_release // A <= 1 -> 2nd queue_work explodes!
>>>
>>> I'm not sure I'm following it but it's perfectly fine to re-use the work
>>> item at this point. The work item actually can be re-cycled from the very
>>> beginning of the work function. The only thing we need to make sure is that
>>> we don't css_put() prematurely to avoid it being freed while we're using it.
>>
>> Yes, it is ok to reuse a work struct, but it's not ok to have the same
>> work struct enqueued twice on the same WQ when list debug is enabled.
>> That's why we are getting this "BUG: corrupted list.."
>
> The above scenario isn't that tho. Once the work item starts executing, wq
> doesn't care about what happens to it and as killed_work_fn is holding a
> reference, the release scheduling shouldn't happen before it starts
> executing unless somebody is screwing up the refcnting.
>
>> That's right. Michal was on the right track for the kill_css() part.
>> What I think is going on is that once css_create() fails then
>> cgroup_subtree_control_write() ends up calling first kill_css() and
>> then css_put() on the same css, I think it's &cgrp->self of the kernfs_node.
>> The each_live_descendant_post() also iterates on the root.
>> Here is the call flow (sorry for long lines):
>>
>> cgroup_subtree_control_write(of)->cgroup_apply_control(cgrp)->cgroup_apply_control_enable(cgrp)->css_create() <- fails here and returns error
>> |
>> |-> cgroup_finalize_control(cgrp)->cgroup_apply_control_disable(cgrp)->each_live_descendant_post(cgrp)->kill_css()->percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn) <- this triggers css_killed_ref_fn() to be called
>> |
>> | css_killed_ref_fn() <- first css->destroy_work enqueue
>> | |
>> | |-> INIT_WORK(&css->destroy_work, css_killed_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work);
>> |
>> |
>> |-> goto out_unlock;
>> | |
>> | |-> cgroup_kn_unlock(kernfs_node)->cgroup_put(cgrp)->css_put(&cgrp->self)->percpu_ref_put(&css->refcnt) <- this triggers css_release() to be called
>> |
>> |
>> css_release(percpu_ref) <- second css->destroy_work enqueue
>> |
>> |-> INIT_WORK(&css->destroy_work, css_release_work_fn); queue_work(cgroup_destroy_wq, &css->destroy_work) <- and it fails here with BUG: corrupted list in insert_work; list_add corruption.
>>
>>
>> What seems to work for me as the simplest fix is to prevent enqueuing a dying
>> css in css_release() as below. Please let me know if that makes sense to you.
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index 1779ccddb734..5618211487cc 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5210,8 +5210,10 @@ static void css_release(struct percpu_ref *ref)
>> struct cgroup_subsys_state *css =
>> container_of(ref, struct cgroup_subsys_state, refcnt);
>> - INIT_WORK(&css->destroy_work, css_release_work_fn);
>> - queue_work(cgroup_destroy_wq, &css->destroy_work);
>> + if (!(css->flags & CSS_DYING)) {
>> + INIT_WORK(&css->destroy_work, css_release_work_fn);
>> + queue_work(cgroup_destroy_wq, &css->destroy_work);
>> + }
>
> When the problem is a ref imbalance, how can the above be the solution? Of
> course the release path won't cause an issue if it doesn't run, but we still
> need to free the thing, right?

Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
on it. Only css_get() & css_put() modify the refcnt value.
And for the "free the thing" the css_killed_work_fn() does that.
It calls offline_css(css) and css_put(css) for the whole css hierarchy.

--
Thanks,
Tadeusz

2022-06-01 23:45:49

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called

Yeah, the base ref is special for percpu_ref.

> on it. Only css_get() & css_put() modify the refcnt value.
> And for the "free the thing" the css_killed_work_fn() does that.
> It calls offline_css(css) and css_put(css) for the whole css hierarchy.

Yeah, the freeing path depends on the css_put(css) invoking css_release()
which schedules the work item which actually frees. Am I misunderstanding
something here?

Thanks.

--
tejun

2022-06-02 00:01:33

by Tadeusz Struk

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/1/22 16:43, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 04:37:17PM -0700, Tadeusz Struk wrote:
>> Yes, but as far as I can see the percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn)
>> doesn't change the value of the refcnt, it just causes the css_killed_ref_fn() to be called
>
> Yeah, the base ref is special for percpu_ref.
>
>> on it. Only css_get() & css_put() modify the refcnt value.
>> And for the "free the thing" the css_killed_work_fn() does that.
>> It calls offline_css(css) and css_put(css) for the whole css hierarchy.
>
> Yeah, the freeing path depends on the css_put(css) invoking css_release()
> which schedules the work item which actually frees. Am I misunderstanding
> something here?

What I'm trying to say is that it's not really a ref imbalance problem.
I think once the kill_css() has been called on a css, and it is enqueued to be
killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
flow path". It will all be handled by the css_killed_ref_fn() function.

The fact that css_release() is called (via cgroup_kn_unlock()) just after
kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
(cgroup_destroy_wq), just with a different function. This results in the
"BUG: corrupted list in insert_work" issue.

--
Thanks,
Tadeusz

2022-06-02 00:09:56

by Tejun Heo

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

Hello,

On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
> What I'm trying to say is that it's not really a ref imbalance problem.
> I think once the kill_css() has been called on a css, and it is enqueued to be
> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
> flow path". It will all be handled by the css_killed_ref_fn() function.
>
> The fact that css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with a different function. This results in the
> "BUG: corrupted list in insert_work" issue.

I have a hard time following here. The kill / release paths relationship
isn't that complicated. The kill path is invoked when the percpu_ref's base
ref is killed and holds an extra ref so that it's guaranteed that release
can't happen before the kill path is done with the css. When the final put
happens - whether that's from the kill path or someone else, which often is
the case - the release path triggers. If we have release getting scheduled
while the kill path isn't finished, it is a reference counting problem,
right?

Can you elaborate the exact scenario that you think is happening? Please
feel free to omit the function calls and all that. Just lay out who's doing
what.

Thanks.

--
tejun

2022-06-02 00:29:37

by Tadeusz Struk

Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/1/22 17:07, Tejun Heo wrote:
> Hello,
>
> On Wed, Jun 01, 2022 at 05:00:44PM -0700, Tadeusz Struk wrote:
>> What I'm trying to say is that it's not really a ref imbalance problem.
>> I think once the kill_css() has been called on a css, and it is enqueued to be
>> killed, we shouldn't call css_put(css) on it anymore outside of the "killed call
>> flow path". It will all be handled by the css_killed_ref_fn() function.
>>
>> The fact that css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with a different function. This results in the
>> "BUG: corrupted list in insert_work" issue.
>
> I have a hard time following here. The kill / release paths relationship
> isn't that complicated. The kill path is invoked when the percpu_ref's base
> ref is killed and holds an extra ref so that it's guaranteed that release
> can't happen before the kill path is done with the css. When the final put
> happens - whether that's from the kill path or someone else, which often is
> the case - the release path triggers. If we have release getting scheduled
> while the kill path isn't finished, it is a reference counting problem,
> right?
>
> Can you elaborate the exact scenario that you think is happening? Please
> feel free to omit the function calls and all that. Just lay out who's doing
> what.

OK, the problem is that:

1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
2. The last css_put() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq

The same work_struct ends up enqueued twice on the same WQ (cgroup_destroy_wq),
which causes "BUG: corrupted list in insert_work"

So I think the easiest way to solve this would be to have two separate work_structs,
one for the killed_ref path and one for the css_release path, as in:

https://lore.kernel.org/all/[email protected]/

--
Thanks,
Tadeusz

2022-06-02 00:30:03

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
> OK, the problem is that:
>
> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
> 2. The last css_put() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
>
> The same work_struct ends up enqueued twice on the same WQ (cgroup_destroy_wq),
> which causes "BUG: corrupted list in insert_work"

#2 shouldn't be happening before kill_ref_fn() is done with the css. If what
you're saying is happening, what's broken is the fact that the refcnt is
reaching 0 prematurely.

> So I think the easiest way to solve this would be to have two separate work_structs,
> one for the killed_ref path and css_release path as in:

If you do that, you'd just be racing the free path against the kill path and
the css might get freed while the kill path is still accessing it.

Thanks.

--
tejun

2022-06-02 00:41:06

by Tadeusz Struk

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/1/22 17:29, Tejun Heo wrote:
> On Wed, Jun 01, 2022 at 05:26:34PM -0700, Tadeusz Struk wrote:
>> OK, the problem is that:
>>
>> 1. kill_css() triggers css_killed_ref_fn(), which enqueues &css->destroy_work on cgroup_destroy_wq
>> 2. The last css_put() calls css_release(), which enqueues &css->destroy_work on cgroup_destroy_wq
>>
>> The same work_struct ends up enqueued twice on the same WQ (cgroup_destroy_wq),
>> which causes "BUG: corrupted list in insert_work"
>
> #2 shouldn't be happening before kill_ref_fn() is done with the css. If what
> you're saying is happening, what's broken is the fact that the refcnt is
> reaching 0 prematurely.

css_killed_ref_fn() will be called regardless of the value of the refcnt (via percpu_ref_kill_and_confirm()),
and it will only enqueue css_killed_work_fn() to be called later.
Then css_put()->css_release() will be called before css_killed_work_fn() even
gets a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
The problem happens on the second enqueue. So there needs to be something in place that
makes sure css_killed_work_fn() is done before css_release() can enqueue
the second job. Does that sound right?

>> So I think the easiest way to solve this would be to have two separate work_structs,
>> one for the killed_ref path and css_release path as in:
>
> If you do that, you'd just be racing the free path against the kill path and
> the css might get freed while the kill path is still accessing it.
>
> Thanks.
>


--
Thanks,
Tadeusz

2022-06-02 22:40:22

by Michal Koutný

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <[email protected]> wrote:
> css_killed_ref_fn() will be called regardless of the value of the refcnt (via percpu_ref_kill_and_confirm()),
> and it will only enqueue css_killed_work_fn() to be called later.
> Then css_put()->css_release() will be called before css_killed_work_fn() even
> gets a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
> The problem happens on the second enqueue. So there needs to be something in place that
> makes sure css_killed_work_fn() is done before css_release() can enqueue
> the second job.

IIUC, here you describe the same scenario I broke down at [1].

> Does it sound right?

I added a parameter A there (that is, the sum of the base and percpu references
before kill_css()).
I thought it fails because A == 1 (i.e. killing the base reference),
however, that seems an unlikely situation (because cgroup code uses a
"fuse" reference to pin css for offline_css()).

So the remaining option (at least I find it more likely now) is that
A == 0 (A < 0 would trigger the warning in
percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
get to the bottom of this with detailed enough tracing of gets/puts.

Splitting the work struct contradicts the existing approach with
the "fuse" reference.

(BTW, you also wrote on Wed, Jun 01, 2022 at 05:00:44PM -0700:
> The fact the css_release() is called (via cgroup_kn_unlock()) just after
> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
> (cgroup_destroy_wq), just with different function. This results in the
> BUG: corrupted list in insert_work issue.

Where do you see the critical css_release() called from cgroup_kn_unlock()?
I always observed the css_release() being called via
percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot
logs.))

Thanks,
Michal

[1] https://lore.kernel.org/r/Yo7KfEOz92kS2z5Y@blackbook/

2022-06-03 09:27:56

by Tadeusz Struk

[permalink] [raw]
Subject: Re: [PATCH 2/2] cgroup: Use separate work structs on css release path

On 6/2/22 04:47, Michal Koutný wrote:
> On Wed, Jun 01, 2022 at 05:40:51PM -0700, Tadeusz Struk <[email protected]> wrote:
>> css_killed_ref_fn() will be called regardless of the value of the refcnt (via percpu_ref_kill_and_confirm())
>> and it will only enqueue css_killed_work_fn() to be called later.
>> Then css_put()->css_release() will be called before css_killed_work_fn() even
>> gets a chance to run, and it will also *only* enqueue css_release_work_fn() to be called later.
>> The problem happens on the second enqueue. So there needs to be something in place that
>> makes sure css_killed_work_fn() is done before css_release() can enqueue
>> the second job.
> IIUC, here you describe the same scenario I broke down at [1].

Right, except for the last css_put(), which I think is called from cgroup_kn_unlock().
See below.

>> Does it sound right?
> I added a parameter A there (that is sum of base and percpu references
> before kill_css()).
> I thought it fails because A == 1 (i.e. killing the base reference),
> however, that seems an unlikely situation (because cgroup code uses a
> "fuse" reference to pin css for offline_css()).
>
> So the remaining option (at least I find it more likely now) is that
> A == 0 (A < 0 would trigger the warning in
> percpu_ref_switch_to_atomic_rcu()), aka the ref imbalance. I hope we can
> get to the bottom of this with detailed enough tracing of gets/puts.
>
> Splitting the work struct contradicts the existing approach with
> the "fuse" reference.
>
> (BTW, you also wrote on Wed, Jun 01, 2022 at 05:00:44PM -0700:
>> The fact the css_release() is called (via cgroup_kn_unlock()) just after
>> kill_css() causes the css->destroy_work to be enqueued twice on the same WQ
>> (cgroup_destroy_wq), just with different function. This results in the
>> BUG: corrupted list in insert_work issue.
> Where do you see a critical css_release called from cgroup_kn_unlock()?
> I always observed the css_release() being called via
> percpu_ref_call_confirm_rcu() (in the original and subsequent syzbot
> logs.))

It goes like this:
cgroup_kn_unlock(kn)->cgroup_put(cgrp)->css_put(&cgrp->self), which
brings the refcnt to zero and triggers css_release().
I think what's missing is something that will serialize the kill
and release paths. I will try to put something together today.

--
Thanks,
Tadeusz