2022-04-12 23:37:56

by Tadeusz Struk

Subject: [PATCH] cgroup: don't queue css_release_work if one already pending

Syzbot found a corrupted list bug scenario that can be triggered from
cgroup css_create(). The reproducer writes to the cgroup.subtree_control
file, which invokes cgroup_apply_control_enable(), css_create(), and
css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
In this scenario the css_create() error path enqueues, via RCU, the
css_free_rwork_fn work for a css->refcnt that was initialized with the
css_release() destructor, and there is a chance that css_release() will be invoked
for a cgroup_subsys_state for which a destroy_work has already been
queued via the css_create() error path. This causes the list_add corruption
seen in the syzkaller report [1].
This can be avoided by adding a check to css_release() that verifies
whether the work has already been enqueued.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>

Reported-by: [email protected]
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <[email protected]>
---
kernel/cgroup/cgroup.c | 7 +++++--
1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..9ae2de29f8c9 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,11 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
+			      work_data_bits(&css->destroy_work))) {
+		INIT_WORK(&css->destroy_work, css_release_work_fn);
+		queue_work(cgroup_destroy_wq, &css->destroy_work);
+	}
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
--
2.35.1


2022-04-14 13:33:22

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

Hi Hillf,
On 4/13/22 02:56, Hillf Danton wrote:
> On Tue, 12 Apr 2022 12:24:59 -0700 Tadeusz Struk wrote:
>> Syzbot found a corrupted list bug scenario that can be triggered from
>> cgroup css_create(). The reproducer writes to the cgroup.subtree_control
>> file, which invokes cgroup_apply_control_enable(), css_create(), and
>> css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
>> In this scenario the css_create() error path enqueues, via RCU, the
>> css_free_rwork_fn work for a css->refcnt that was initialized with the
>> css_release() destructor, and there is a chance that css_release() will be invoked
>
> Could you pinpoint the percpu_ref_kill() needed to trigger css_release()?

What I think happens is that the write triggers:
cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()

which allocates and initializes the css, then fails in cgroup_idr_alloc(),
bails out, and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

then cgroup_subtree_control_write() bails out to out_unlock, which then goes:

cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)
which then calls ref->data->release(ref) and tries to enqueue the same work
again, causing the list corruption in insert_work().
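
To make the failure mode concrete in isolation, here is a minimal sketch
(not actual kernel code; wq, first_fn, and second_fn are placeholders):

        struct work_struct w;

        INIT_WORK(&w, first_fn);
        queue_work(wq, &w);       /* w.entry is now linked on a pool's list */

        /* ... before first_fn has run ... */
        INIT_WORK(&w, second_fn); /* resets w.data, clearing the
                                   * WORK_STRUCT_PENDING_BIT guard against
                                   * double queueing */
        queue_work(wq, &w);       /* links the already-linked w.entry again,
                                   * tripping the list_add debug check in
                                   * insert_work() */

This is the hazard my patch guards against, by testing
WORK_STRUCT_PENDING_BIT before re-initializing and queueing the work.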


>> for a cgroup_subsys_state for which a destroy_work has already been
>> queued via the css_create() error path. This causes the list_add corruption
>> seen in the syzkaller report [1].
>> This can be avoided by adding a check to css_release() that verifies
>> whether the work has already been enqueued.
>>
>> [1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd
>
> Given my failure of finding the lore URL to what was
> Reported-by: [email protected]
> such a link is welcome to see the list corruption.
>

syzkaller doesn't send reports to any kernel mailing lists that are archived on
lore. The relevant links can be found in the dashboard, link [1] in the patch.

https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

See the links in:
Crash: BUG: corrupted list in insert_work (log)

crash: https://syzkaller.appspot.com/text?tag=CrashReport&x=11bd5e9b700000
log: https://syzkaller.appspot.com/text?tag=CrashLog&x=16bd5e9b700000

>>
>> Cc: Tejun Heo <[email protected]>
>> Cc: Zefan Li <[email protected]>
>> Cc: Johannes Weiner <[email protected]>
>> Cc: Christian Brauner <[email protected]>
>> Cc: Alexei Starovoitov <[email protected]>
>> Cc: Daniel Borkmann <[email protected]>
>> Cc: Andrii Nakryiko <[email protected]>
>> Cc: Martin KaFai Lau <[email protected]>
>> Cc: Song Liu <[email protected]>
>> Cc: Yonghong Song <[email protected]>
>> Cc: John Fastabend <[email protected]>
>> Cc: KP Singh <[email protected]>
>> Cc: <[email protected]>
>> Cc: <[email protected]>
>> Cc: <[email protected]>
>> Cc: <[email protected]>
>> Cc: <[email protected]>
>>
>> Reported-by: [email protected]
>> Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
>> Signed-off-by: Tadeusz Struk <[email protected]>
>> ---
>> kernel/cgroup/cgroup.c | 7 +++++--
>> 1 file changed, 5 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index adb820e98f24..9ae2de29f8c9 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -5210,8 +5210,11 @@ static void css_release(struct percpu_ref *ref)
>>  	struct cgroup_subsys_state *css =
>>  		container_of(ref, struct cgroup_subsys_state, refcnt);
>>  
>> -	INIT_WORK(&css->destroy_work, css_release_work_fn);
>> -	queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	if (!test_and_set_bit(WORK_STRUCT_PENDING_BIT,
>> +			      work_data_bits(&css->destroy_work))) {
>> +		INIT_WORK(&css->destroy_work, css_release_work_fn);
>> +		queue_work(cgroup_destroy_wq, &css->destroy_work);
>> +	}
>>  }
>>  
>>  static void init_and_link_css(struct cgroup_subsys_state *css,
>> --
>> 2.35.1
>>

--
Thanks,
Tadeusz

2022-04-15 14:23:22

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

Hi Michal,
Thanks for your analysis.

On 4/14/22 09:44, Michal Koutný wrote:
> Hello Tadeusz.
>
> Thanks for analyzing this syzbot report. Let me provide my understanding
> of the test case and an explanation of why I think your patch fixes it but is
> not fully correct.
>
> On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <[email protected]> wrote:
>> Syzbot found a corrupted list bug scenario that can be triggered from
>> cgroup css_create(). The reproducer writes to the cgroup.subtree_control
>> file, which invokes cgroup_apply_control_enable(), css_create(), and
>> css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
>
> The reproducer code makes it hard for me to understand which function
> fails with ENOMEM.
> But I can see that your patch fixes the reproducer, and your additional debug
> patch proves that css->destroy_work is re-queued.

Yes, it is hard to see the actual failing point because, I think, it fails
randomly in different places. I think the failure that actually causes the list
corruption is in fact in css_create().
It is the css_create() error path that does the first RCU enqueue in:

https://elixir.bootlin.com/linux/v5.10.109/source/kernel/cgroup/cgroup.c#L5228

and the second enqueue is triggered by css->refcnt calling css_release().

The reason why we don't see it actually failing in css_create() in the trace
dump is that fail_dump() is rate-limited, see:
https://elixir.bootlin.com/linux/v5.18-rc2/source/lib/fault-inject.c#L44
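
(For reference, the rate limiting in fail_dump() looks roughly like this,
condensed rather than verbatim:

        if (attr->verbose > 0 && __ratelimit(&attr->ratelimit_state))
                printk(KERN_NOTICE "FAULT_INJECTION: forcing a failure.\n" ...);

so once the ratelimit burst is exhausted, later injected failures leave no
trace in the log.)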

I was confused as well, so I put additional debug prints in every place
where css_create() can fail, and it was actually
css_create()->cgroup_idr_alloc() that failed in my case.

What happened was, the write triggered:
cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()

which allocates and initializes the css, then fails in cgroup_idr_alloc(),
bails out, and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

then cgroup_subtree_control_write() bails out to out_unlock, which then goes:

cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)

which then calls ref->data->release(ref) and enqueues the same
&css->destroy_rwork on cgroup_destroy_wq, causing the list corruption in insert_work().

>> In this scenario the css_create() error path enqueues, via RCU, the
>> css_free_rwork_fn work for a css->refcnt that was initialized with the
>> css_release() destructor,
>
> Note that css_free_rwork_fn() utilizes css->destroy_*r*work.
> The error path in css_create() open codes relevant parts of
> css_release_work_fn() so that css_release() can be skipped and the
> refcnt is eventually just percpu_ref_exit()'d.
>
>> and there is a chance that css_release() will be invoked
>> for a cgroup_subsys_state for which a destroy_work has already been
>> queued via the css_create() error path.
>
> But I think the problem is css_populate_dir() failing in
> cgroup_apply_control_enable(). (Is this what you actually meant?
> css_create() error path is then irrelevant, no?)

I thought so too at first, as the crash dump shows that it is failing
in css_populate_dir(), but this is not the failure that causes the list corruption.
The code can recover from the failure in css_populate_dir().
The failure that causes trouble is in css_create(), which makes it take its error path.
I can dig out the patch with my debug prints and ask syzbot to run it
if you want.

>
> The already created csses should then be rolled back via
> cgroup_restore_control(cgrp);
> cgroup_apply_control_disable(cgrp);
> ...
> kill_css(css)
>
> I suspect the double-queuing is a result of the fact that there exists
> only the single reference to the css->refcnt. I.e. it's
> percpu_ref_kill_and_confirm()'d and released both at the same time.
>
> (Normally (when not killing the last reference), css->destroy_work reuse
> is not a problem because of the sequenced chain
> css_killed_work_fn()->css_put()->css_release().)
>
>> This can be avoided by adding a check to css_release() that verifies
>> whether the work has already been enqueued.
>
> If that's what's happening, then your patch omits the final
> css_release_work_fn() in favor of css_killed_work_fn() but both should
> be run during the rollback upon css_populate_dir() failure.

This change only prevents the double enqueue:

queue_[rcu]_work(cgroup_destroy_wq, &css->destroy_rwork);

I don't see how it affects the css_killed_work_fn() cleanup path.
I didn't look at it, since I thought it was irrelevant in this case.

--
Thanks,
Tadeusz

2022-04-15 22:00:00

by Michal Koutný

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

Hello Tadeusz.

Thanks for analyzing this syzbot report. Let me provide my understanding
of the test case and an explanation of why I think your patch fixes it but is
not fully correct.

On Tue, Apr 12, 2022 at 12:24:59PM -0700, Tadeusz Struk <[email protected]> wrote:
> Syzbot found a corrupted list bug scenario that can be triggered from
> cgroup css_create(). The reproducer writes to the cgroup.subtree_control
> file, which invokes cgroup_apply_control_enable(), css_create(), and
> css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.

The reproducer code makes it hard for me to understand which function
fails with ENOMEM.
But I can see that your patch fixes the reproducer, and your additional debug
patch proves that css->destroy_work is re-queued.

> In this scenario the css_create() error path enqueues, via RCU, the
> css_free_rwork_fn work for a css->refcnt that was initialized with the
> css_release() destructor,

Note that css_free_rwork_fn() utilizes css->destroy_*r*work.
The error path in css_create() open codes relevant parts of
css_release_work_fn() so that css_release() can be skipped and the
refcnt is eventually just percpu_ref_exit()'d.
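
A condensed sketch of that error path (paraphrased from
kernel/cgroup/cgroup.c, not verbatim):

        err_free_css:
                INIT_RCU_WORK(&css->destroy_rwork, css_free_rwork_fn);
                queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);
                return ERR_PTR(err);

css_free_rwork_fn() then percpu_ref_exit()s css->refcnt without
css_release() ever having run.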

> and there is a chance that css_release() will be invoked
> for a cgroup_subsys_state for which a destroy_work has already been
> queued via the css_create() error path.

But I think the problem is css_populate_dir() failing in
cgroup_apply_control_enable(). (Is this what you actually meant?
css_create() error path is then irrelevant, no?)

The already created csses should then be rolled back via
cgroup_restore_control(cgrp);
cgroup_apply_control_disable(cgrp);
...
kill_css(css)

I suspect the double-queuing is a result of the fact that there exists
only the single reference to the css->refcnt. I.e. it's
percpu_ref_kill_and_confirm()'d and released both at the same time.

(Normally (when not killing the last reference), css->destroy_work reuse
is not a problem because of the sequenced chain
css_killed_work_fn()->css_put()->css_release().)

> This can be avoided by adding a check to css_release() that verifies
> whether the work has already been enqueued.

If that's what's happening, then your patch omits the final
css_release_work_fn() in favor of css_killed_work_fn() but both should
be run during the rollback upon css_populate_dir() failure.

So an alternative approach to tackle this situation would be to split
css->destroy_work into two work_structs (one for killing, one for
releasing) at the cost of inflating cgroup_subsys_state.
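
A sketch of that split, just to make the idea concrete (the field names
here are made up):

        struct cgroup_subsys_state {
                ...
                /* separate items so the kill and release paths cannot
                 * clobber each other's pending work */
                struct work_struct killed_work;         /* css_killed_ref_fn() */
                struct work_struct release_work;        /* css_release() */
                struct rcu_work destroy_rwork;          /* final RCU free */
                ...
        };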

Take my hypothesis with a grain of salt; maybe the assumption (last
reference == initial reference) is no different from normal operation.

Regards,
Michal

2022-04-22 18:31:46

by Tejun Heo

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

Hello,

On Thu, Apr 14, 2022 at 10:51:18AM -0700, Tadeusz Struk wrote:
> What happened was, the write triggered:
> cgroup_subtree_control_write()->cgroup_apply_control()->cgroup_apply_control_enable()->css_create()
>
> which allocates and initializes the css, then fails in cgroup_idr_alloc(),
> bails out, and calls queue_rcu_work(cgroup_destroy_wq, &css->destroy_rwork);

Yes, but this css hasn't been installed yet.

> then cgroup_subtree_control_write() bails out to out_unlock, which then goes:
>
> cgroup_kn_unlock()->cgroup_put()->css_put()->percpu_ref_put(&css->refcnt)->percpu_ref_put_many(ref)

And this is a different css: cgroup->self, which isn't connected to the
half-built css that got destroyed in css_create().
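
(For reference, cgroup_put() is essentially, quoting roughly from
kernel/cgroup/cgroup-internal.h:

        static inline void cgroup_put(struct cgroup *cgrp)
        {
                css_put(&cgrp->self);
        }

so the ref dropped here belongs to the cgroup's own self css, not to the
half-built one from css_create().)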

So, I have a bit of difficulty following this scenario. The way that the
current code uses destroy_work is definitely nasty and it'd probably be a
good idea to separate out the different use cases, but let's first
understand what's failing.

Thanks.

--
tejun

2022-04-22 18:34:02

by Michal Koutný

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <[email protected]> wrote:
> If this is the case, we need to hold an extra reference to be put by the
> css_killed_work_fn(), right?

I looked into it a bit more lately and found that there already is such
a fuse in kill_css() [1].
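
Condensed from the linked kill_css(), with comments trimmed (not verbatim):

        static void kill_css(struct cgroup_subsys_state *css)
        {
                ...
                /*
                 * Killing would put the base ref, but we need to keep it
                 * alive until after ->css_offline().
                 */
                css_get(css);
                ...
                percpu_ref_kill_and_confirm(&css->refcnt, css_killed_ref_fn);
        }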

At the same time, syzbot's stack trace demonstrates the fuse is
ineffective:

> css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146 (**)
> percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
> percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
> percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline] (*)
> percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
> rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
> rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
> rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
> __do_softirq+0x27e/0x596 kernel/softirq.c:305

(*) this calls css_killed_ref_fn as the confirm_switch callback
(**) zero references after a confirmed kill?

So, I was also looking at the possible race with css_free_rwork_fn()
(from failed css_create()) but that would likely emit a warning from
__percpu_ref_exit().

So, I still think there's something fishy (so far possible only via
artificial ENOMEM injection) that needs an explanation...

Michal

[1] https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/tree/kernel/cgroup/cgroup.c#n5608

2022-04-22 22:24:58

by Tejun Heo

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On Thu, Apr 14, 2022 at 06:44:09PM +0200, Michal Koutný wrote:
> I suspect the double-queuing is a result of the fact that there exists
> only the single reference to the css->refcnt. I.e. it's
> percpu_ref_kill_and_confirm()'d and released both at the same time.
>
> (Normally (when not killing the last reference), css->destroy_work reuse
> is not a problem because of the sequenced chain
> css_killed_work_fn()->css_put()->css_release().)

If this is the case, we need to hold an extra reference to be put by the
css_killed_work_fn(), right?

Thanks.

--
tejun

2022-05-18 16:52:19

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 4/22/22 04:05, Michal Koutný wrote:
> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo <[email protected]> wrote:
>> If this is the case, we need to hold an extra reference to be put by the
>> css_killed_work_fn(), right?
>
> I looked into it a bit more lately and found that there already is such
> a fuse in kill_css() [1].
>
> At the same time, syzbot's stack trace demonstrates the fuse is
> ineffective:
>
>> css_release+0xae/0xc0 kernel/cgroup/cgroup.c:5146 (**)
>> percpu_ref_put_many include/linux/percpu-refcount.h:322 [inline]
>> percpu_ref_put include/linux/percpu-refcount.h:338 [inline]
>> percpu_ref_call_confirm_rcu lib/percpu-refcount.c:162 [inline] (*)
>> percpu_ref_switch_to_atomic_rcu+0x5a2/0x5b0 lib/percpu-refcount.c:199
>> rcu_do_batch+0x4f8/0xbc0 kernel/rcu/tree.c:2485
>> rcu_core+0x59b/0xe30 kernel/rcu/tree.c:2722
>> rcu_core_si+0x9/0x10 kernel/rcu/tree.c:2735
>> __do_softirq+0x27e/0x596 kernel/softirq.c:305
>
> (*) this calls css_killed_ref_fn as the confirm_switch callback
> (**) zero references after a confirmed kill?
>
> So, I was also looking at the possible race with css_free_rwork_fn()
> (from failed css_create()) but that would likely emit a warning from
> __percpu_ref_exit().
>
> So, I still think there's something fishy (so far possible only via
> artificial ENOMEM injection) that needs an explanation...

I can't reliably reproduce this issue on either mainline or v5.10, where
syzbot originally found it. It still triggers for syzbot, though.

--
Thanks,
Tadeusz

2022-05-20 12:34:28

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/19/22 04:23, Hillf Danton wrote:
> On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk wrote:
>> On 4/22/22 04:05, Michal Koutny wrote:
>>> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<[email protected]> wrote:
>>>> If this is the case, we need to hold an extra reference to be put by the
>>>> css_killed_work_fn(), right?
> That put could trigger INIT_WORK in css_release() and a warning [1]
> on an init-active (active state 0) object, OTOH, as the same
> css->destroy_work is used in both the kill and release paths.

Would it help if there were two WQs, one for the css_release path
and one for the rcu_work?

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..a4873b33e488 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
  * which may lead to deadlock.
  */
 static struct workqueue_struct *cgroup_destroy_wq;
+static struct workqueue_struct *cgroup_destroy_rcu_wq;
 
 /* generate an array of cgroup subsystem pointers */
 #define SUBSYS(_x) [_x ## _cgrp_id] = &_x ## _cgrp_subsys,

--
Thanks,
Tadeusz

2022-05-23 05:45:15

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/20/22 01:13, Tejun Heo wrote:
> On Thu, May 19, 2022 at 04:26:51PM -0700, Tadeusz Struk wrote:
>> On 5/19/22 04:23, Hillf Danton wrote:
>>> On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk wrote:
>>>> On 4/22/22 04:05, Michal Koutny wrote:
>>>>> On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<[email protected]> wrote:
>>>>>> If this is the case, we need to hold an extra reference to be put by the
>>>>>> css_killed_work_fn(), right?
>>> That put could trigger INIT_WORK in css_release() and a warning [1]
>>> on an init-active (active state 0) object, OTOH, as the same
>>> css->destroy_work is used in both the kill and release paths.
>
> Hmm... wouldn't the extra reference keep release from happening?
>
>> Would it help if there were two WQs, one for the css_release path
>> and one for the rcu_work?
>>
>> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
>> index adb820e98f24..a4873b33e488 100644
>> --- a/kernel/cgroup/cgroup.c
>> +++ b/kernel/cgroup/cgroup.c
>> @@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
>>   * which may lead to deadlock.
>>   */
>>  static struct workqueue_struct *cgroup_destroy_wq;
>> +static struct workqueue_struct *cgroup_destroy_rcu_wq;
>
> I don't understand why this would help. Care to elaborate?

I think it will help to solve the list corruption issue:

list_add corruption. prev->next should be next (ffff8881f705c060), but was ffff888113123870. (prev=ffff888113123870).
------------[ cut here ]------------
kernel BUG at lib/list_debug.c:28!

as this is a result of enqueuing the same css->destroy_work onto the same WQ,
one on the rcu path and one on the css_release path.
I will prototype it today and test with syzbot.

--
Thanks,
Tadeusz

2022-05-23 06:09:04

by Tejun Heo

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On Thu, May 19, 2022 at 04:26:51PM -0700, Tadeusz Struk wrote:
> On 5/19/22 04:23, Hillf Danton wrote:
> > On Wed, 18 May 2022 09:48:21 -0700 Tadeusz Struk wrote:
> > > On 4/22/22 04:05, Michal Koutny wrote:
> > > > On Thu, Apr 21, 2022 at 02:00:56PM -1000, Tejun Heo<[email protected]> wrote:
> > > > > If this is the case, we need to hold an extra reference to be put by the
> > > > > css_killed_work_fn(), right?
> > That put could trigger INIT_WORK in css_release() and a warning [1]
> > on an init-active (active state 0) object, OTOH, as the same
> > css->destroy_work is used in both the kill and release paths.

Hmm... wouldn't the extra reference keep release from happening?

> Would it help if there were two WQs, one for the css_release path
> and one for the rcu_work?
>
> diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
> index adb820e98f24..a4873b33e488 100644
> --- a/kernel/cgroup/cgroup.c
> +++ b/kernel/cgroup/cgroup.c
> @@ -124,6 +124,7 @@ DEFINE_PERCPU_RWSEM(cgroup_threadgroup_rwsem);
>   * which may lead to deadlock.
>   */
>  static struct workqueue_struct *cgroup_destroy_wq;
> +static struct workqueue_struct *cgroup_destroy_rcu_wq;

I don't understand why this would help. Care to elaborate?

Thanks.

--
tejun

2022-05-23 06:19:00

by Michal Koutný

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <[email protected]> wrote:
> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
> one on the rcu path and one on the css_release path.
> I will prototype it today and test with syzbot.

In my understanding, you'd need two independent work_structs in a css,
not two separate workqueues to put the single entry on.
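
(The "single entry" is the list node embedded in the work item itself; from
include/linux/workqueue.h, roughly:

        struct work_struct {
                atomic_long_t data;     /* flags, incl. the pending bit */
                struct list_head entry; /* linked into a pool's worklist */
                work_func_t func;
        };

so queueing one work_struct twice re-links the same entry node, no matter
which workqueue each queue_work() targets.)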

Michal

2022-05-23 06:44:57

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/20/22 09:42, Michal Koutný wrote:
> On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk<[email protected]> wrote:
>> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
>> one on the rcu path and one on the css_release path.
>> I will prototype it today and test with syzbot.
> In my understanding, you'd need two independent work_structs in a css,
> not two separate workqueues to put the single entry on.

I think either way would work, but two workqueues would be less churn.
I can try both.

--
Thanks,
Tadeusz

2022-05-23 19:35:16

by Tejun Heo

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On Mon, May 23, 2022 at 12:00:06PM -0700, Tadeusz Struk wrote:
On 5/20/22 09:42, Michal Koutný wrote:
> > On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <[email protected]> wrote:
> > > as this is a result of enqueuing the same css->destroy_work onto the same WQ,
> > > one on the rcu path and one on the css_release path.
> > > I will prototype it today and test with syzbot.
> >
> > In my understanding, you'd need two independent work_structs in a css,
> > not two separate workqueues to put the single entry on.
>
> Yes, separating the css_killed_ref and css_release paths with two separate work_structs
> fixes the two syzbot list corruption issues [1] & [2].
> I tested it on mainline v5.18.0 and on v5.10.117.
> In the case of [2], mainline triggers an issue, but it is unrelated to the list corruption.

Can you try holding an extra ref in the killed path?

Thanks.

--
tejun

2022-05-23 19:40:00

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/20/22 09:42, Michal Koutný wrote:
> On Fri, May 20, 2022 at 09:38:12AM -0700, Tadeusz Struk <[email protected]> wrote:
>> as this is a result of enqueuing the same css->destroy_work onto the same WQ,
>> one on the rcu path and one on the css_release path.
>> I will prototype it today and test with syzbot.
>
> In my understanding, you'd need two independent work_structs in a css,
> not two separate workqueues to put the single entry on.

Yes, separating the css_killed_ref and css_release paths with two separate work_structs
fixes the two syzbot list corruption issues [1] & [2].
I tested it on mainline v5.18.0 and on v5.10.117.
In the case of [2], mainline triggers an issue, but it is unrelated to the list corruption.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd
[2] https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd5

I will send a patch soon.

--
Thanks,
Tadeusz

2022-05-23 21:18:08

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/23/22 12:02, Tejun Heo wrote:
> Can you try holding an extra ref in the killed path?

Yes, I can try that. Should that be with or without the extra work_struct?

--
Thanks,
Tadeusz

2022-05-23 21:39:22

by Tadeusz Struk

Subject: Re: [PATCH] cgroup: don't queue css_release_work if one already pending

On 5/23/22 12:08, Tadeusz Struk wrote:
> On 5/23/22 12:02, Tejun Heo wrote:
>> Can you try holding an extra ref in the killed path?
>
> Yes, I can try that. Should that be with or without the extra work_struct?

With this patch:
https://syzkaller.appspot.com/text?tag=Patch&x=1264b23df00000

The issue still triggers:
https://syzkaller.appspot.com/text?tag=CrashReport&x=162b5781f00000

--
Thanks,
Tadeusz

2022-05-23 22:35:04

by Tadeusz Struk

Subject: [PATCH v2] cgroups: separate destroy_work into two separate wq

Syzbot found a corrupted list bug scenario that can be triggered from
cgroup css_create(). The reproducer writes to the cgroup.subtree_control
file, which invokes cgroup_apply_control_enable(), css_create(), and
css_populate_dir(), which then randomly fails with a fault-injected -ENOMEM.
In this scenario the css_create() error path enqueues, via RCU, the
css_free_rwork_fn work for a css->refcnt that was initialized with the
css_release() destructor, and there is a chance that css_release() will be invoked
for a cgroup_subsys_state for which a destroy_work has already been
queued via the css_create() error path. This causes the list_add corruption
seen in the syzkaller report [1].
This can be fixed by separating the css_release and ref_kill paths
so that they work with two separate work_structs.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>

Reported-and-tested-by: [email protected]
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <[email protected]>
---
v2: Add a separate work_struct for the css_ref_kill path instead of
checking if a work has already been enqueued.
---
include/linux/cgroup-defs.h | 5 +++--
kernel/cgroup/cgroup.c | 14 +++++++-------
2 files changed, 10 insertions(+), 9 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..92b0c5e8c472 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -178,8 +178,9 @@ struct cgroup_subsys_state {
 	 */
 	atomic_t online_cnt;
 
-	/* percpu_ref killing and RCU release */
-	struct work_struct destroy_work;
+	/* percpu_ref killing, css release, and RCU release work structs */
+	struct work_struct release_work;
+	struct work_struct killed_ref_work;
 	struct rcu_work destroy_rwork;
 
 	/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index adb820e98f24..3e00a793e15d 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5099,7 +5099,7 @@ static struct cftype cgroup_base_files[] = {
  * css_free_work_fn().
  *
  * It is actually hairier because both step 2 and 4 require process context
- * and thus involve punting to css->destroy_work adding two additional
+ * and thus involve punting to css->release_work adding two additional
  * steps to the already complex sequence.
  */
 static void css_free_rwork_fn(struct work_struct *work)
@@ -5154,7 +5154,7 @@ static void css_free_rwork_fn(struct work_struct *work)
 static void css_release_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, release_work);
 	struct cgroup_subsys *ss = css->ss;
 	struct cgroup *cgrp = css->cgroup;
 
@@ -5210,8 +5210,8 @@ static void css_release(struct percpu_ref *ref)
 	struct cgroup_subsys_state *css =
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
-	INIT_WORK(&css->destroy_work, css_release_work_fn);
-	queue_work(cgroup_destroy_wq, &css->destroy_work);
+	INIT_WORK(&css->release_work, css_release_work_fn);
+	queue_work(cgroup_destroy_wq, &css->release_work);
 }
 
 static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5546,7 +5546,7 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
 static void css_killed_work_fn(struct work_struct *work)
 {
 	struct cgroup_subsys_state *css =
-		container_of(work, struct cgroup_subsys_state, destroy_work);
+		container_of(work, struct cgroup_subsys_state, killed_ref_work);
 
 	mutex_lock(&cgroup_mutex);
 
@@ -5567,8 +5567,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
 		container_of(ref, struct cgroup_subsys_state, refcnt);
 
 	if (atomic_dec_and_test(&css->online_cnt)) {
-		INIT_WORK(&css->destroy_work, css_killed_work_fn);
-		queue_work(cgroup_destroy_wq, &css->destroy_work);
+		INIT_WORK(&css->killed_ref_work, css_killed_work_fn);
+		queue_work(cgroup_destroy_wq, &css->killed_ref_work);
 	}
 }

--
2.36.1