LinuxLists.cc - [PATCH v2] cgroup: serialize css kill and release paths

2022-06-06 04:58:03

Subject: [PATCH v2] cgroup: serialize css kill and release paths

Syzbot found a corrupted list bug scenario that can be triggered from
cgroup_subtree_control_write(cgrp). The reproduces writes to
cgroup.subtree_control file, which invokes:
cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
then fails with a fault injected -ENOMEM.
In such scenario the css_killed_work_fn will be en-queued via
cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
css_release_work_fn for the same css instance, causing a list_add
corruption bug, as can be seen in the syzkaller report [1].

Fix this by synchronizing the css ref_kill and css_release jobs.
css_release() function will check if the css_killed_work_fn() has been
scheduled for the css and only en-queue the css_release_work_fn()
if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will
set the CSS_REL_LATER flag for that css. This will cause the
css_release_work_fn() work to be executed after css_killed_work_fn() is finished.

Two scc flags have been introduced to implement this serialization mechanizm:

* CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and
* CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be
scheduled after the css_killed_work_fn is finished.

There is also a new lock, which will protect the integrity of the css flags.

[1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

Cc: Tejun Heo <[email protected]>
Cc: Michal Koutny<[email protected]>
Cc: Zefan Li <[email protected]>
Cc: Johannes Weiner <[email protected]>
Cc: Christian Brauner <[email protected]>
Cc: Alexei Starovoitov <[email protected]>
Cc: Daniel Borkmann <[email protected]>
Cc: Andrii Nakryiko <[email protected]>
Cc: Martin KaFai Lau <[email protected]>
Cc: Song Liu <[email protected]>
Cc: Yonghong Song <[email protected]>
Cc: John Fastabend <[email protected]>
Cc: KP Singh <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>
Cc: <[email protected]>

Reported-and-tested-by: [email protected]
Fixes: 8f36aaec9c92 ("cgroup: Use rcu_work instead of explicit rcu and work item")
Signed-off-by: Tadeusz Struk <[email protected]>
---
v2: Use correct lock in css_killed_work_fn()
---
include/linux/cgroup-defs.h | 4 ++++
kernel/cgroup/cgroup.c | 35 ++++++++++++++++++++++++++++++++---
2 files changed, 36 insertions(+), 3 deletions(-)

diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 1bfcfb1af352..8dc8b4edb242 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -53,6 +53,8 @@ enum {
CSS_RELEASED = (1 << 2), /* refcnt reached zero, released */
CSS_VISIBLE = (1 << 3), /* css is visible to userland */
CSS_DYING = (1 << 4), /* css is dying */
+ CSS_KILL_ENQED = (1 << 5), /* kill work enqueued for the css */
+ CSS_REL_LATER = (1 << 6), /* release needs to be done after kill */
};

/* bits in struct cgroup flags field */
@@ -162,6 +164,8 @@ struct cgroup_subsys_state {
*/
int id;

+ /* lock to protect flags */
+ spinlock_t lock;
unsigned int flags;

/*
diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index 1779ccddb734..b1bbd438d426 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5210,8 +5210,23 @@ static void css_release(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);

- INIT_WORK(&css->destroy_work, css_release_work_fn);
- queue_work(cgroup_destroy_wq, &css->destroy_work);
+ spin_lock_bh(&css->lock);
+
+ /*
+ * Check if the css_killed_work_fn work has been scheduled for this
+ * css and enqueue css_release_work_fn only if it wasn't.
+ * Otherwise set the CSS_REL_LATER flag, which will cause
+ * release to be enqueued after css_killed_work_fn is finished.
+ * This is to prevent list corruption by en-queuing two instance
+ * of the same work struct on the same WQ, namely cgroup_destroy_wq.
+ */
+ if (!(css->flags & CSS_KILL_ENQED)) {
+ INIT_WORK(&css->destroy_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css->destroy_work);
+ } else {
+ css->flags |= CSS_REL_LATER;
+ }
+ spin_unlock_bh(&css->lock);
}

static void init_and_link_css(struct cgroup_subsys_state *css,
@@ -5230,6 +5245,7 @@ static void init_and_link_css(struct cgroup_subsys_state *css,
INIT_LIST_HEAD(&css->rstat_css_node);
css->serial_nr = css_serial_nr_next++;
atomic_set(&css->online_cnt, 0);
+ spin_lock_init(&css->lock);

if (cgroup_parent(cgrp)) {
css->parent = cgroup_css(cgroup_parent(cgrp), ss);
@@ -5545,10 +5561,12 @@ int cgroup_mkdir(struct kernfs_node *parent_kn, const char *name, umode_t mode)
*/
static void css_killed_work_fn(struct work_struct *work)
{
- struct cgroup_subsys_state *css =
+ struct cgroup_subsys_state *css_killed, *css =
container_of(work, struct cgroup_subsys_state, destroy_work);

mutex_lock(&cgroup_mutex);
+ css_killed = css;
+ css_killed->flags &= ~CSS_KILL_ENQED;

do {
offline_css(css);
@@ -5557,6 +5575,14 @@ static void css_killed_work_fn(struct work_struct *work)
css = css->parent;
} while (css && atomic_dec_and_test(&css->online_cnt));

+ spin_lock_bh(&css_killed->lock);
+ if (css_killed->flags & CSS_REL_LATER) {
+ /* If css_release work was delayed for the css enqueue it now. */
+ INIT_WORK(&css_killed->destroy_work, css_release_work_fn);
+ queue_work(cgroup_destroy_wq, &css_killed->destroy_work);
+ css_killed->flags &= ~CSS_REL_LATER;
+ }
+ spin_unlock_bh(&css_killed->lock);
mutex_unlock(&cgroup_mutex);
}

@@ -5566,10 +5592,13 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
struct cgroup_subsys_state *css =
container_of(ref, struct cgroup_subsys_state, refcnt);

+ spin_lock_bh(&css->lock);
if (atomic_dec_and_test(&css->online_cnt)) {
+ css->flags |= CSS_KILL_ENQED;
INIT_WORK(&css->destroy_work, css_killed_work_fn);
queue_work(cgroup_destroy_wq, &css->destroy_work);
}
+ spin_unlock_bh(&css->lock);
}

/**
--
2.36.1

2022-06-06 05:10:27

by Tadeusz Struk

[permalink] [raw]

Subject: Re: [PATCH v2] cgroup: serialize css kill and release paths

On 6/3/22 11:13, Tadeusz Struk wrote:
> Syzbot found a corrupted list bug scenario that can be triggered from
> cgroup_subtree_control_write(cgrp). The reproduces writes to
> cgroup.subtree_control file, which invokes:
> cgroup_apply_control_enable()->css_create()->css_populate_dir(), which
> then fails with a fault injected -ENOMEM.
> In such scenario the css_killed_work_fn will be en-queued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].
>
> Fix this by synchronizing the css ref_kill and css_release jobs.
> css_release() function will check if the css_killed_work_fn() has been
> scheduled for the css and only en-queue the css_release_work_fn()
> if css_killed_work_fn wasn't already en-queued. Otherwise css_release() will
> set the CSS_REL_LATER flag for that css. This will cause the
> css_release_work_fn() work to be executed after css_killed_work_fn() is finished.
>
> Two scc flags have been introduced to implement this serialization mechanizm:
>
> * CSS_KILL_ENQED, which will be set when css_killed_work_fn() is en-queued, and
> * CSS_REL_LATER, which, if set, will cause the css_release_work_fn() to be
> scheduled after the css_killed_work_fn is finished.
>
> There is also a new lock, which will protect the integrity of the css flags.
>
> [1] https://syzkaller.appspot.com/bug?id=e26e54d6eac9d9fb50b221ec3e4627b327465dbd

This also fixes a similar, cgroup related list corrupt issue:
https://syzkaller.appspot.com/bug?id=3c7ff113ccb695e839b859da3fc481c36eb1cfd5

--
Thanks,
Tadeusz

2022-06-06 12:44:08

by Michal Koutný

[permalink] [raw]

Subject: Re: [PATCH v2] cgroup: serialize css kill and release paths

Hello.

On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk <[email protected]> wrote:
> In such scenario the css_killed_work_fn will be en-queued via
> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
> css_release_work_fn for the same css instance, causing a list_add
> corruption bug, as can be seen in the syzkaller report [1].

This hypothesis doesn't add up to me (I am sorry).

The kill_css(css) would be a css associated with a subsys (css.ss !=
NULL) whereas css_put(&cgrp->self) is a different css just for the
cgroup (css.ss == NULL).

Michal

2022-06-08 02:59:03

by Tadeusz Struk

[permalink] [raw]

Subject: Re: [PATCH v2] cgroup: serialize css kill and release paths

On 6/6/22 05:39, Michal Koutný wrote:
> On Fri, Jun 03, 2022 at 11:13:21AM -0700, Tadeusz Struk<[email protected]> wrote:
>> In such scenario the css_killed_work_fn will be en-queued via
>> cgroup_apply_control_disable(cgrp)->kill_css(css), and bail out to
>> cgroup_kn_unlock(). Then cgroup_kn_unlock() will call:
>> cgroup_put(cgrp)->css_put(&cgrp->self), which will try to enqueue
>> css_release_work_fn for the same css instance, causing a list_add
>> corruption bug, as can be seen in the syzkaller report [1].
> This hypothesis doesn't add up to me (I am sorry).
>
> The kill_css(css) would be a css associated with a subsys (css.ss !=
> NULL) whereas css_put(&cgrp->self) is a different css just for the
> cgroup (css.ss == NULL).

Yes, you are right. I couldn't figure it out where the extra css_put()
is called from, and the only place that fitted into my theory was from
the cgroup_kn_unlock() in cgroup_apply_control_disable().
After some more debugging I can see that, as you said, the cgrp->self
is a different css. The offending _put() is actually called by the
percpu_ref_kill_and_confirm(), as it not only calls the passed confirm_kill
percpu_ref_func_t, but also it puts the refcnt iself.
Because the cgroup_apply_control_disable() will loop for_each_live_descendant,
and call css_kill() on all css'es, and css_killed_work_fn() will also loop
and call css_put() on all parents, the css_release() will be called on the
first parent prematurely, causing the BUG(). What I think should be done
to balance put/get is to call css_get() for all the parents in kill_css():

diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c
index c1e1a5c34e77..3ca61325bc4e 100644
--- a/kernel/cgroup/cgroup.c
+++ b/kernel/cgroup/cgroup.c
@@ -5527,6 +5527,8 @@ static void css_killed_ref_fn(struct percpu_ref *ref)
*/
static void kill_css(struct cgroup_subsys_state *css)
{
+ struct cgroup_subsys_state *_css = css;
+
lockdep_assert_held(&cgroup_mutex);

if (css->flags & CSS_DYING)
@@ -5541,10 +5543,13 @@ static void kill_css(struct cgroup_subsys_state *css)
css_clear_dir(css);

/*
- * Killing would put the base ref, but we need to keep it alive
- * until after ->css_offline().
+ * Killing would put the base ref, but we need to keep it alive,
+ * and all its parents, until after ->css_offline().
*/
- css_get(css);
+ do {
+ css_get(_css);
+ _css = _css->parent;
+ } while (_css && atomic_read(&_css->online_cnt));

/*
* cgroup core guarantees that, by the time ->css_offline() is

This will be then "reverted" in css_killed_work_fn()
Please let me know if it makes sense to you.
I'm still testing it, but syzbot is very slow today.

--
Thanks,
Tadeusz

2022-06-09 09:22:30

by kernel test robot

[permalink] [raw]

Subject: [cgroup] 3c87862ca1: WARNING:at_kernel/softirq.c:#__local_bh_enable_ip

Greeting,

FYI, we noticed the following commit (built with gcc-11):

commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths")
url: https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132
base: https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
patch link: https://lore.kernel.org/netdev/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>

[ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363)
[ 55.822745][ C1] Modules linked in: fuse ip_tables
[ 55.823837][ C1] CPU: 1 PID: 1 Comm: systemd Not tainted 5.18.0-rc5-00048-g3c87862ca131 #1
[ 55.825505][ C1] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 55.827516][ C1] RIP: 0010:__local_bh_enable_ip (kernel/softirq.c:363)
[ 55.828671][ C1] Code: 47 65 8b 05 b1 d9 a8 47 a9 00 ff ff 00 74 30 65 ff 0d a3 d9 a8 47 e8 9e c6 39 00 fb 5b 5d c3 65 8b 05 8f e2 a8 47 85 c0 75 b0 <0f> 0b eb ac e8 c6 c7 39 00 eb ad 48 89 ef e8 3c 89 16 00 eb b6 65
All code
========
0: 47 rex.RXB
1: 65 8b 05 b1 d9 a8 47 mov %gs:0x47a8d9b1(%rip),%eax # 0x47a8d9b9
8: a9 00 ff ff 00 test $0xffff00,%eax
d: 74 30 je 0x3f
f: 65 ff 0d a3 d9 a8 47 decl %gs:0x47a8d9a3(%rip) # 0x47a8d9b9
16: e8 9e c6 39 00 callq 0x39c6b9
1b: fb sti
1c: 5b pop %rbx
1d: 5d pop %rbp
1e: c3 retq
1f: 65 8b 05 8f e2 a8 47 mov %gs:0x47a8e28f(%rip),%eax # 0x47a8e2b5
26: 85 c0 test %eax,%eax
28: 75 b0 jne 0xffffffffffffffda
2a:* 0f 0b ud2 <-- trapping instruction
2c: eb ac jmp 0xffffffffffffffda
2e: e8 c6 c7 39 00 callq 0x39c7f9
33: eb ad jmp 0xffffffffffffffe2
35: 48 89 ef mov %rbp,%rdi
38: e8 3c 89 16 00 callq 0x168979
3d: eb b6 jmp 0xfffffffffffffff5
3f: 65 gs

Code starting with the faulting instruction
===========================================
0: 0f 0b ud2
2: eb ac jmp 0xffffffffffffffb0
4: e8 c6 c7 39 00 callq 0x39c7cf
9: eb ad jmp 0xffffffffffffffb8
b: 48 89 ef mov %rbp,%rdi
e: e8 3c 89 16 00 callq 0x16894f
13: eb b6 jmp 0xffffffffffffffcb
15: 65 gs
[ 55.832215][ C1] RSP: 0018:ffffc90000178d60 EFLAGS: 00010046
[ 55.833390][ C1] RAX: 0000000000000000 RBX: 0000000000000201 RCX: 1ffffffff7a3ca41
[ 55.834990][ C1] RDX: 0000000000000000 RSI: 0000000000000201 RDI: ffffffffb88332da
[ 55.836576][ C1] RBP: ffffffffb88332da R08: 0000000000000000 R09: ffff8881c0af245b
[ 55.838123][ C1] R10: ffffed103815e48b R11: 0000000000000001 R12: 0000000000000008
[ 55.839742][ C1] R13: ffff8881a593a000 R14: dffffc0000000000 R15: ffff8881c0af2418
[ 55.841328][ C1] FS: 00007fd6571cc900(0000) GS:ffff88839d500000(0000) knlGS:0000000000000000
[ 55.843084][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 55.844302][ C1] CR2: 00007fd51c623028 CR3: 00000001bfc72000 CR4: 00000000000406e0
[ 55.845905][ C1] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 55.847510][ C1] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 55.849058][ C1] Call Trace:
[ 55.849769][ C1] <IRQ>
[ 55.850460][ C1] put_css_set_locked (include/linux/percpu-refcount.h:335 include/linux/percpu-refcount.h:351 include/linux/cgroup.h:404 kernel/cgroup/cgroup.c:971 kernel/cgroup/cgroup.c:955)
[ 55.851549][ C1] cgroup_free (include/linux/spinlock.h:404 kernel/cgroup/cgroup-internal.h:212 kernel/cgroup/cgroup-internal.h:198 kernel/cgroup/cgroup.c:6480)
[ 55.852470][ C1] __put_task_struct (kernel/fork.c:842)
[ 55.853436][ C1] rcu_do_batch (include/linux/rcupdate.h:273 kernel/rcu/tree.c:2537)
[ 55.854361][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684)
[ 55.855400][ C1] ? rcu_implicit_dynticks_qs (kernel/rcu/tree.c:2474)
[ 55.856581][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4493)
[ 55.857836][ C1] ? note_gp_changes (arch/x86/include/asm/irqflags.h:45 (discriminator 1) arch/x86/include/asm/irqflags.h:80 (discriminator 1) arch/x86/include/asm/irqflags.h:138 (discriminator 1) kernel/rcu/tree.c:1698 (discriminator 1))
[ 55.858828][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22))
[ 55.859889][ C1] rcu_core (kernel/rcu/tree.c:2788)
[ 55.860786][ C1] __do_softirq (arch/x86/include/asm/jump_label.h:27 include/linux/jump_label.h:207 include/trace/events/irq.h:142 kernel/softirq.c:559)
[ 55.861747][ C1] __irq_exit_rcu (kernel/softirq.c:432 kernel/softirq.c:637)
[ 55.862673][ C1] irq_exit_rcu (kernel/softirq.c:651)
[ 55.863601][ C1] sysvec_apic_timer_interrupt (arch/x86/kernel/apic/apic.c:1097 (discriminator 14))
[ 55.864709][ C1] </IRQ>
[ 55.865399][ C1] <TASK>
[ 55.866087][ C1] asm_sysvec_apic_timer_interrupt (arch/x86/include/asm/idtentry.h:645)
[ 55.867251][ C1] RIP: 0010:walk_component (fs/namei.c:2027)
[ 55.868365][ C1] Code: c7 43 08 00 00 00 00 48 8b 84 24 80 00 00 00 65 48 2b 04 25 28 00 00 00 0f 85 99 03 00 00 48 81 c4 88 00 00 00 4c 89 e0 5b 5d <41> 5c 41 5d 41 5e 41 5f c3 48 b8 00 00 00 00 00 fc ff df 48 8d 7d
All code
========
0: c7 43 08 00 00 00 00 movl $0x0,0x8(%rbx)
7: 48 8b 84 24 80 00 00 mov 0x80(%rsp),%rax
e: 00
f: 65 48 2b 04 25 28 00 sub %gs:0x28,%rax
16: 00 00
18: 0f 85 99 03 00 00 jne 0x3b7
1e: 48 81 c4 88 00 00 00 add $0x88,%rsp
25: 4c 89 e0 mov %r12,%rax
28: 5b pop %rbx
29: 5d pop %rbp
2a:* 41 5c pop %r12 <-- trapping instruction
2c: 41 5d pop %r13
2e: 41 5e pop %r14
30: 41 5f pop %r15
32: c3 retq
33: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
3a: fc ff df
3d: 48 rex.W
3e: 8d .byte 0x8d
3f: 7d .byte 0x7d

Code starting with the faulting instruction
===========================================
0: 41 5c pop %r12
2: 41 5d pop %r13
4: 41 5e pop %r14
6: 41 5f pop %r15
8: c3 retq
9: 48 b8 00 00 00 00 00 movabs $0xdffffc0000000000,%rax
10: fc ff df
13: 48 rex.W
14: 8d .byte 0x8d
15: 7d .byte 0x7d
[ 55.871899][ C1] RSP: 0018:ffffc9000001f988 EFLAGS: 00000286
[ 55.873081][ C1] RAX: 0000000000000000 RBX: 000000063dbe831e RCX: ffffc9000001f690
[ 55.874697][ C1] RDX: 0000000000000000 RSI: ffffffffbc798f20 RDI: ffffc9000001fbe0
[ 55.876281][ C1] RBP: 0000000000000006 R08: ffff888100310dd8 R09: ffffffffbd1de4e7
[ 55.877803][ C1] R10: ffffed102003531b R11: 0000000000000001 R12: 0000000000000000
[ 55.879386][ C1] R13: 0000000000000002 R14: 0000000000000000 R15: 0000000000000000
[ 55.881044][ C1] link_path_walk+0x563/0xc80
[ 55.882258][ C1] ? path_init (fs/namei.c:2411)
[ 55.883192][ C1] ? __raw_spin_lock_init (kernel/locking/spinlock_debug.c:26)
[ 55.884260][ C1] ? open_last_lookups (fs/namei.c:2265)
[ 55.885290][ C1] ? debug_mutex_init (kernel/locking/mutex-debug.c:89)
[ 55.886305][ C1] ? __alloc_file (fs/file_table.c:153)
[ 55.887298][ C1] path_openat (fs/namei.c:3605)
[ 55.888247][ C1] ? __lock_acquire (kernel/locking/lockdep.c:5029)
[ 55.889229][ C1] ? path_lookupat (fs/namei.c:3591)
[ 55.890215][ C1] do_filp_open (fs/namei.c:3636)
[ 55.893219][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10))
[ 55.894154][ C1] ? may_open_dev (fs/namei.c:3630)
[ 55.895168][ C1] ? __lock_release (kernel/locking/lockdep.c:5317)
[ 55.896219][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684)
[ 55.897250][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10))
[ 55.898122][ C1] ? _raw_spin_unlock (arch/x86/include/asm/preempt.h:85 include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:186)
[ 55.899114][ C1] ? alloc_fd (fs/file.c:555 (discriminator 10))
[ 55.900031][ C1] ? getname_flags (fs/namei.c:204)
[ 55.901118][ C1] do_sys_openat2 (fs/open.c:1213)
[ 55.902081][ C1] ? rcu_nocb_cb_kthread (kernel/rcu/tree_nocb.h:382)
[ 55.903138][ C1] ? mntput_no_expire (fs/namespace.c:1208)
[ 55.904198][ C1] ? lock_is_held_type (kernel/locking/lockdep.c:5382 kernel/locking/lockdep.c:5684)
[ 55.905231][ C1] ? build_open_flags (fs/open.c:1199)
[ 55.906242][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109)
[ 55.907155][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22))
[ 55.908186][ C1] ? call_rcu (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 arch/x86/include/asm/irqflags.h:138 kernel/rcu/tree.c:3109)
[ 55.909104][ C1] __x64_sys_openat (fs/open.c:1240)
[ 55.910087][ C1] ? __ia32_compat_sys_open (fs/open.c:1240)
[ 55.911242][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501)
[ 55.912481][ C1] ? syscall_enter_from_user_mode (arch/x86/include/asm/irqflags.h:45 arch/x86/include/asm/irqflags.h:80 kernel/entry/common.c:109)
[ 55.913602][ C1] ? trace_hardirqs_on (kernel/trace/trace_preemptirq.c:50 (discriminator 22))
[ 55.914638][ C1] do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80)
[ 55.915572][ C1] ? lockdep_hardirqs_on_prepare (kernel/locking/lockdep.c:4501)
[ 55.916880][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)
[ 55.917809][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)
[ 55.918770][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)
[ 55.919737][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)
[ 55.920694][ C1] ? do_syscall_64 (arch/x86/entry/common.c:87)

To reproduce:

# build kernel
cd linux
cp config-5.18.0-rc5-00048-g3c87862ca131 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=x86_64 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

--
0-DAY CI Kernel Test Service
https://01.org/lkp

Attachments:

(No filename) (11.26 kB)
config-5.18.0-rc5-00048-g3c87862ca131 (169.77 kB)
job-script (4.89 kB)
dmesg.xz (16.41 kB)
Download all attachments

2022-06-09 15:04:17

by Tadeusz Struk

[permalink] [raw]

Subject: Re: [cgroup] 3c87862ca1: WARNING:at_kernel/softirq.c:#__local_bh_enable_ip

On 6/9/22 01:56, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-11):
>
> commit: 3c87862ca13147416d900cf82ca56bb2f23910bf ("[PATCH v2] cgroup: serialize css kill and release paths")
> url:https://github.com/intel-lab-lkp/linux/commits/Tadeusz-Struk/cgroup-serialize-css-kill-and-release-paths/20220606-014132
> base:https://git.kernel.org/cgit/linux/kernel/git/tj/cgroup.git for-next
> patch link:https://lore.kernel.org/netdev/[email protected]
>
> in testcase: boot
>
> on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
>
>
> If you fix the issue, kindly add following tag
> Reported-by: kernel test robot<[email protected]>
>
>
> [ 55.821003][ C1] WARNING: CPU: 1 PID: 1 at kernel/softirq.c:363 __local_bh_enable_ip (kernel/softirq.c:363)

Looks like that will need to be spin_lock_irq(&css->lock) instead of spin_lock_bh(&css->lock)
I can respin the patch, but I would like to request some feedback on it first.

Tejun, Michal
Are you interested in fixing this at syzbot issue all? Do you have any more feedback on this?

--
Thanks,
Tadeusz

2022-06-09 15:10:56

by Michal Koutný

[permalink] [raw]

Subject: Re: [cgroup] 3c87862ca1: WARNING:at_kernel/softirq.c:#__local_bh_enable_ip

Hello Tadeusz.

On Thu, Jun 09, 2022 at 07:30:41AM -0700, Tadeusz Struk <[email protected]> wrote:
> Are you interested in fixing this at syzbot issue all?

The (original) syzbot report is conditioned by allocation failure that's
unlikely under normal conditions (AFAIU). Hence I don't treat it extra
high urgent.
OTOH, it's interesting and it points to some disparity worth fixing --
so I try helping (as time permits, so far I can only run the reproducers
via the syzbot).

> Do you have any more feedback on this?

Ad the patch v2 with spinlock per css -- that looks like an overkill to
me, I didn't look deeper into it.

Ad the in-thread patch with ancestry css_get(), the ->parent ref:
- is inc'd in init_and_link_css(),
- is dec'd in css_free_rwork_fn()
and thanks to ->online_cnt offlining is ordered (child before parent).

Where does your patch dec these ancestry references? (Or why would they
be missing in the first place?)

Thanks for digging into this!
Michal