A css_set represents the relationship between a set of tasks and
css's. css_set never pinned the associated css's. This was okay
because tasks used to always disassociate immediately (in RCU sense) -
either a task is moved to a different css_set or exits and never
accesses css_set again.
Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
and use it to fix pids controller") and patches leading up to it made
a zombie hold onto its css_set and deref the associated css's on its
release. Nothing pins the css's after exit and it might have already
been freed leading to use-after-free.
general protection fault: 0000 [#1] PREEMPT SMP
task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
...
Call Trace:
<IRQ>
[<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
[<ffffffff810f8893>] cgroup_free+0x53/0xe0
[<ffffffff8104ed62>] __put_task_struct+0x42/0x130
[<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
[<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
[<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
[<ffffffff81056e54>] __do_softirq+0xd4/0x460
[<ffffffff81057369>] irq_exit+0x89/0xa0
[<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
[<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
<EOI>
...
Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
RSP <ffff88001fc03e20>
---[ end trace 89a4a4b916b90c49 ]---
Kernel panic - not syncing: Fatal exception in interrupt
Kernel Offset: disabled
---[ end Kernel panic - not syncing: Fatal exception in interrupt
Fix it by making css_set pin the associated css's until its release.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Dave Jones <[email protected]>
Reported-by: Daniel Wagner <[email protected]>
Link: http://lkml.kernel.org/g/[email protected]
Link: http://lkml.kernel.org/g/[email protected]
Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
---
kernel/cgroup.c | 14 ++++++++++----
1 file changed, 10 insertions(+), 4 deletions(-)
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -754,9 +754,11 @@ static void put_css_set_locked(struct cs
if (!atomic_dec_and_test(&cset->refcount))
return;
- /* This css_set is dead. unlink it and release cgroup refcounts */
- for_each_subsys(ss, ssid)
+ /* This css_set is dead. unlink it and release cgroup and css refs */
+ for_each_subsys(ss, ssid) {
list_del(&cset->e_cset_node[ssid]);
+ css_put(cset->subsys[ssid]);
+ }
hash_del(&cset->hlist);
css_set_count--;
@@ -1056,9 +1058,13 @@ static struct css_set *find_css_set(stru
key = css_set_hash(cset->subsys);
hash_add(css_set_table, &cset->hlist, key);
- for_each_subsys(ss, ssid)
+ for_each_subsys(ss, ssid) {
+ struct cgroup_subsys_state *css = cset->subsys[ssid];
+
list_add_tail(&cset->e_cset_node[ssid],
- &cset->subsys[ssid]->cgroup->e_csets[ssid]);
+ &css->cgroup->e_csets[ssid]);
+ css_get(css);
+ }
spin_unlock_bh(&css_set_lock);
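For readers less familiar with the css refcounting API, here is a minimal
sketch of the pattern the patch applies, assuming the usual css_get()/css_put()
semantics; link_css_set()/release_css_set() are simplified stand-ins for
find_css_set()/put_css_set_locked(), not the actual kernel code:

/*
 * Illustration only: every css_set takes a reference on each css it
 * points to when it is set up, and drops those references when its own
 * refcount reaches zero, so the css's cannot go away while a zombie
 * task still holds on to the css_set.
 */
static void link_css_set(struct css_set *cset)
{
        int ssid;

        for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++)
                css_get(cset->subsys[ssid]);    /* pin until release */
}

static void release_css_set(struct css_set *cset)
{
        int ssid;

        if (!atomic_dec_and_test(&cset->refcount))
                return;

        for (ssid = 0; ssid < CGROUP_SUBSYS_COUNT; ssid++)
                css_put(cset->subsys[ssid]);    /* may free the css now */
}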
On Mon, Nov 23, 2015 at 02:55:41PM -0500, Tejun Heo wrote:
> A css_set represents the relationship between a set of tasks and
> css's. css_set never pinned the associated css's. This was okay
> because tasks used to always disassociate immediately (in RCU sense) -
> either a task is moved to a different css_set or exits and never
> accesses css_set again.
>
> Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
> and use it to fix pids controller") and patches leading up to it made
> a zombie hold onto its css_set and deref the associated css's on its
> release. Nothing pins the css's after exit and it might have already
> been freed leading to use-after-free.
>
> Fix it by making css_set pin the associated css's until its release.
This gets me booting again, thanks Tejun!
Dave
Hi Tejun,
On 11/23/2015 08:55 PM, Tejun Heo wrote:
> A css_set represents the relationship between a set of tasks and
> css's. css_set never pinned the associated css's. This was okay
> because tasks used to always disassociate immediately (in RCU sense) -
> either a task is moved to a different css_set or exits and never
> accesses css_set again.
>
> Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
> and use it to fix pids controller") and patches leading up to it made
> a zombie hold onto its css_set and deref the associated css's on its
> release. Nothing pins the css's after exit and it might have already
> been freed leading to use-after-free.
>
> general protection fault: 0000 [#1] PREEMPT SMP
> task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
> RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
> ...
> Call Trace:
> <IRQ>
> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
> [<ffffffff810f8893>] cgroup_free+0x53/0xe0
> [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
> [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
> [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
> [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
> [<ffffffff81056e54>] __do_softirq+0xd4/0x460
> [<ffffffff81057369>] irq_exit+0x89/0xa0
> [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
> [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
> <EOI>
> ...
> Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
> RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
> RSP <ffff88001fc03e20>
> ---[ end trace 89a4a4b916b90c49 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>
> Fix it by making css_set pin the associated css's until its release.
I still see this one with the patch applied:
[ 19.369455] ------------[ cut here ]------------
[ 19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
[ 19.370596] Modules linked in:
[ 19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
[ 19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
[ 19.372542] ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
[ 19.373173] ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
[ 19.374144] ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
[ 19.375185] Call Trace:
[ 19.375506] [<ffffffff81551ffc>] dump_stack+0x4e/0x82
[ 19.376238] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[ 19.376975] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[ 19.377765] [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[ 19.378623] [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[ 19.379451] [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[ 19.380142] [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[ 19.380592] [<ffffffff81188d15>] ? cgroup_migrate+0x5/0x190
[ 19.381041] [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[ 19.381500] [<ffffffff81188ea5>] ? cgroup_attach_task+0x5/0x200
[ 19.381962] [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[ 19.382482] [<ffffffff8118924e>] ? __cgroup_procs_write+0x5e/0x460
[ 19.382949] [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[ 19.383432] [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[ 19.383864] [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[ 19.384367] [<ffffffff81265f88>] __vfs_write+0x28/0xe0
[ 19.384759] [<ffffffff811292d7>] ? percpu_down_read+0x57/0xa0
[ 19.385274] [<ffffffff81268c14>] ? __sb_start_write+0xb4/0xf0
[ 19.385712] [<ffffffff81268c14>] ? __sb_start_write+0xb4/0xf0
[ 19.386160] [<ffffffff812666fc>] vfs_write+0xac/0x1a0
[ 19.386563] [<ffffffff812860b6>] ? __fget_light+0x66/0x90
[ 19.386960] [<ffffffff81267019>] SyS_write+0x49/0xb0
[ 19.387373] [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
[ 19.387861] ---[ end trace 46552476f436a20f ]---
cheers,
daniel
Hello, Daniel.
On Tue, Nov 24, 2015 at 11:31:18AM +0100, Daniel Wagner wrote:
> I still see this one with the patch applied:
Yeap, this is a different one.
> [ 19.369455] ------------[ cut here ]------------
> [ 19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
> [ 19.370596] Modules linked in:
> [ 19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
> [ 19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
> [ 19.372542] ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
> [ 19.373173] ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
> [ 19.374144] ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
> [ 19.375185] Call Trace:
> [ 19.375506] [<ffffffff81551ffc>] dump_stack+0x4e/0x82
> [ 19.376238] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
> [ 19.376975] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
> [ 19.377765] [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
> [ 19.378623] [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
> [ 19.379451] [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
> [ 19.380142] [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
Can you please describe how to reproduce this one? If you have a qemu
image which reproduces this, I'd be happy to take a look at it.
Thanks.
--
tejun
Hi Tejun,
On 11/24/2015 03:44 PM, Tejun Heo wrote:
> On Tue, Nov 24, 2015 at 11:31:18AM +0100, Daniel Wagner wrote:
>> [ 19.369455] ------------[ cut here ]------------
>> [ 19.369851] WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
>> [ 19.370596] Modules linked in:
>> [ 19.370916] CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
>> [ 19.371418] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.8.2-0-g33fbe13 by qemu-project.org 04/01/2014
>> [ 19.372542] ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
>> [ 19.373173] ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
>> [ 19.374144] ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
>> [ 19.375185] Call Trace:
>> [ 19.375506] [<ffffffff81551ffc>] dump_stack+0x4e/0x82
>> [ 19.376238] [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
>> [ 19.376975] [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
>> [ 19.377765] [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
>> [ 19.378623] [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
>> [ 19.379451] [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
>> [ 19.380142] [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
>
> Can you please describe how to reproduce this one?
I start a not-so-up-to-date Rawhide image with some funky kernel options.
They are more or less some leftovers from debugging:
$QEMU -gdb tcp::1235 -enable-kvm -machine accel=kvm \
-m 2G -cpu Haswell \
-smp sockets=1,cores=2,threads=2 \
-hda ~/vm-images/rawhide-big.qcow2\
-net nic,model=virtio \
-net user,hostfwd=tcp::7777-:22 \
-monitor telnet:127.0.0.1:1234,server,nowait \
-serial stdio -display none \
-append "root=/dev/sda1 console=ttyS0 audit=0 isolcpus=3 systemd.unified_cgroup_hierarchy=1" \
-kernel arch/x86_64/boot/bzImage $@
After starting the image I just wait for a few seconds and I'll get it.
No interaction needed.
> If you have a qemu image which reproduces this, I'd be happy to take
> a look at it.
I'll upload it, though it will take a while... the fun of living
with asymmetric connectivity.
cheers,
daniel
Hello,
On Tue, Nov 24, 2015 at 03:58:42PM +0100, Daniel Wagner wrote:
> I start a not-so-up-to-date Rawhide image with some funky kernel options.
> They are more or less some leftovers from debugging:
>
> $QEMU -gdb tcp::1235 -enable-kvm -machine accel=kvm \
> -m 2G -cpu Haswell \
> -smp sockets=1,cores=2,threads=2 \
> -hda ~/vm-images/rawhide-big.qcow2\
> -net nic,model=virtio \
> -net user,hostfwd=tcp::7777-:22 \
> -monitor telnet:127.0.0.1:1234,server,nowait \
> -serial stdio -display none \
> -append "root=/dev/sda1 console=ttyS0 audit=0 isolcpus=3 systemd.unified_cgroup_hierarchy=1" \
> -kernel arch/x86_64/boot/bzImage $@
>
> After starting the image I just wait for a few seconds and I'll get it.
> No interaction needed.
>
> > If you have a qemu image which reproduces this, I'd be happy to take
> > a look at it.
>
> I'll upload it, though it will take a while... the fun of living
> with asymmetric connectivity.
Great, thanks a lot!
--
tejun
On Mon, Nov 23, 2015 at 02:55:41PM -0500, Tejun Heo wrote:
> A css_set represents the relationship between a set of tasks and
> css's. css_set never pinned the associated css's. This was okay
> because tasks used to always disassociate immediately (in RCU sense) -
> either a task is moved to a different css_set or exits and never
> accesses css_set again.
>
> Unfortunately, afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method
> and use it to fix pids controller") and patches leading up to it made
> a zombie hold onto its css_set and deref the associated css's on its
> release. Nothing pins the css's after exit and it might have already
> been freed leading to use-after-free.
>
> general protection fault: 0000 [#1] PREEMPT SMP
> task: ffffffff81bf2500 ti: ffffffff81be4000 task.ti: ffffffff81be4000
> RIP: 0010:[<ffffffff810fa205>] [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
> ...
> Call Trace:
> <IRQ>
> [<ffffffff810fb02d>] ? pids_free+0x3d/0xa0
> [<ffffffff810f8893>] cgroup_free+0x53/0xe0
> [<ffffffff8104ed62>] __put_task_struct+0x42/0x130
> [<ffffffff81053557>] delayed_put_task_struct+0x77/0x130
> [<ffffffff810c6b34>] rcu_process_callbacks+0x2f4/0x820
> [<ffffffff810c6af3>] ? rcu_process_callbacks+0x2b3/0x820
> [<ffffffff81056e54>] __do_softirq+0xd4/0x460
> [<ffffffff81057369>] irq_exit+0x89/0xa0
> [<ffffffff81876212>] smp_apic_timer_interrupt+0x42/0x50
> [<ffffffff818747f4>] apic_timer_interrupt+0x84/0x90
> <EOI>
> ...
> Code: 5b 5d c3 48 89 df 48 c7 c2 c9 f9 ae 81 48 c7 c6 91 2c ae 81 e8 1d 94 0e 00 31 c0 5b 5d c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 <f0> 48 83 87 e0 00 00 00 ff 78 01 c3 80 3d 08 7a c1 00 00 74 02
> RIP [<ffffffff810fa205>] pids_cancel.constprop.4+0x5/0x40
> RSP <ffff88001fc03e20>
> ---[ end trace 89a4a4b916b90c49 ]---
> Kernel panic - not syncing: Fatal exception in interrupt
> Kernel Offset: disabled
> ---[ end Kernel panic - not syncing: Fatal exception in interrupt
>
> Fix it by making css_set pin the associated css's until its release.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Reported-by: Dave Jones <[email protected]>
> Reported-by: Daniel Wagner <[email protected]>
> Link: http://lkml.kernel.org/g/[email protected]
> Link: http://lkml.kernel.org/g/[email protected]
> Fixes: afcf6c8b7544 ("cgroup: add cgroup_subsys->free() method and use it to fix pids controller")
Applied to cgroup/for-4.4-fixes.
--
tejun
From bafc993033ea01429effe50ec693def57154ebe4 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Mon, 30 Nov 2015 17:24:34 -0500
If one or more tasks get moved into a frozen css, the frozen state is
cleared from the destination css so that it can be reasserted once
the migrated tasks are frozen. freezer_attach() implements this in
two separate steps - clearing CGROUP_FROZEN on the target css while
processing each task and propagating the clearing upwards after the
task loop is done if necessary.
This patch merges the two steps. Propagation now takes place inside
the task loop. This simplifies the code and prepares it for the fix
of multi-destination migration.
Signed-off-by: Tejun Heo <[email protected]>
---
kernel/cgroup_freezer.c | 17 +++++++----------
1 file changed, 7 insertions(+), 10 deletions(-)
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index f1b30ad..ff02a8e 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -158,9 +158,7 @@ static void freezer_css_free(struct cgroup_subsys_state *css)
static void freezer_attach(struct cgroup_subsys_state *new_css,
struct cgroup_taskset *tset)
{
- struct freezer *freezer = css_freezer(new_css);
struct task_struct *task;
- bool clear_frozen = false;
mutex_lock(&freezer_mutex);
@@ -175,21 +173,20 @@ static void freezer_attach(struct cgroup_subsys_state *new_css,
* be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
*/
cgroup_taskset_for_each(task, tset) {
+ struct freezer *freezer = css_freezer(new_css);
+
if (!(freezer->state & CGROUP_FREEZING)) {
__thaw_task(task);
} else {
freeze_task(task);
- freezer->state &= ~CGROUP_FROZEN;
- clear_frozen = true;
+ /* clear FROZEN and propagate upwards */
+ while (freezer && (freezer->state & CGROUP_FROZEN)) {
+ freezer->state &= ~CGROUP_FROZEN;
+ freezer = parent_freezer(freezer);
+ }
}
}
- /* propagate FROZEN clearing upwards */
- while (clear_frozen && (freezer = parent_freezer(freezer))) {
- freezer->state &= ~CGROUP_FROZEN;
- clear_frozen = freezer->state & CGROUP_FREEZING;
- }
-
mutex_unlock(&freezer_mutex);
}
--
2.5.0
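The per-task propagation added above boils down to the following helper-style
sketch (illustration only; clear_frozen_upwards() is a hypothetical name, while
parent_freezer(), freezer_mutex and the CGROUP_FROZEN flag are the existing
cgroup_freezer.c symbols):

/*
 * Illustration only: walk from the destination freezer toward the root,
 * clearing CGROUP_FROZEN, and stop at the first ancestor that isn't
 * marked frozen.  Must run with freezer_mutex held, as in
 * freezer_attach() above.
 */
static void clear_frozen_upwards(struct freezer *freezer)
{
        while (freezer && (freezer->state & CGROUP_FROZEN)) {
                freezer->state &= ~CGROUP_FROZEN;
                freezer = parent_freezer(freezer);
        }
}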
From 0d7d444e260493252e30c70813c7657e9ede2f12 Mon Sep 17 00:00:00 2001
From: Tejun Heo <[email protected]>
Date: Mon, 30 Nov 2015 17:24:34 -0500
Consider the following v2 hierarchy.
P0 (+memory) --- P1 (-memory) --- A
\- B
P0 has memory enabled in its subtree_control while P1 doesn't. If
both A and B contain processes, they would belong to the memory css of
P1. Now if memory is enabled on P1's subtree_control, memory csses
should be created on both A and B and A's processes should be moved to
the former and B's processes to the latter. IOW, enabling controllers
can cause atomic migrations into different csses.
The core cgroup migration logic has been updated accordingly but the
controller migration methods haven't and still assume that all tasks
migrate to a single target css; furthermore, the methods were fed the
css in which subtree_control was updated, which is the parent of the
target csses. The pids controller depends on the migration methods to
move charges, and this made the controller attribute charges to the
wrong csses, often triggering the following warning by driving a
counter negative.
WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
Modules linked in:
CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
...
ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
Call Trace:
[<ffffffff81551ffc>] dump_stack+0x4e/0x82
[<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
[<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
[<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
[<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
[<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
[<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
[<ffffffff81189016>] cgroup_attach_task+0x176/0x200
[<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
[<ffffffff81189684>] cgroup_procs_write+0x14/0x20
[<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
[<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
[<ffffffff81265f88>] __vfs_write+0x28/0xe0
[<ffffffff812666fc>] vfs_write+0xac/0x1a0
[<ffffffff81267019>] SyS_write+0x49/0xb0
[<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
This patch fixes the bug by removing the @css parameter from the three
migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
by updating the cgroup_taskset iteration helpers to also return the
destination css in addition to the task being migrated. All controllers
are updated accordingly.
* Controllers which don't care whether there are one or multiple
target csses can be converted trivially. cpu, io, freezer, perf,
netclassid and netprio fall in this category.
* cpuset's current implementation assumes that there's a single source
and destination and thus doesn't support the v2 hierarchy anyway. The
only change made by this patchset is how that single destination css
is obtained.
* The memory migration path already doesn't do anything on v2. How the
single destination css is obtained is updated, and the prep stage of
mem_cgroup_can_attach() is reordered to accommodate the change.
* pids is the only controller which was affected by this bug. It now
correctly handles multi-destination migrations and no longer causes
counter underflow from incorrect accounting.
Signed-off-by: Tejun Heo <[email protected]>
Reported-by: Daniel Wagner <[email protected]>
Cc: Aleksa Sarai <[email protected]>
---
block/blk-cgroup.c | 6 +++---
include/linux/cgroup-defs.h | 9 +++------
include/linux/cgroup.h | 33 +++++++++++++++++++++-----------
kernel/cgroup.c | 43 +++++++++++++++++++++++++++++++++---------
kernel/cgroup_freezer.c | 6 +++---
kernel/cgroup_pids.c | 16 ++++++++--------
kernel/cpuset.c | 33 ++++++++++++++++++++------------
kernel/events/core.c | 6 +++---
kernel/sched/core.c | 12 ++++++------
mm/memcontrol.c | 45 ++++++++++++++++++++++----------------------
net/core/netclassid_cgroup.c | 11 ++++++-----
net/core/netprio_cgroup.c | 9 +++++----
12 files changed, 137 insertions(+), 92 deletions(-)
diff --git a/block/blk-cgroup.c b/block/blk-cgroup.c
index 5bcdfc1..5a37188 100644
--- a/block/blk-cgroup.c
+++ b/block/blk-cgroup.c
@@ -1127,15 +1127,15 @@ void blkcg_exit_queue(struct request_queue *q)
* of the main cic data structures. For now we allow a task to change
* its cgroup only if it's the only owner of its ioc.
*/
-static int blkcg_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int blkcg_can_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
+ struct cgroup_subsys_state *dst_css;
struct io_context *ioc;
int ret = 0;
/* task_lock() is needed to avoid races with exit_io_context() */
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, dst_css, tset) {
task_lock(task);
ioc = task->io_context;
if (ioc && atomic_read(&ioc->nr_tasks) > 1)
diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h
index 869fd4a..06b77f9d 100644
--- a/include/linux/cgroup-defs.h
+++ b/include/linux/cgroup-defs.h
@@ -422,12 +422,9 @@ struct cgroup_subsys {
void (*css_reset)(struct cgroup_subsys_state *css);
void (*css_e_css_changed)(struct cgroup_subsys_state *css);
- int (*can_attach)(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset);
- void (*cancel_attach)(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset);
- void (*attach)(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset);
+ int (*can_attach)(struct cgroup_taskset *tset);
+ void (*cancel_attach)(struct cgroup_taskset *tset);
+ void (*attach)(struct cgroup_taskset *tset);
int (*can_fork)(struct task_struct *task, void **priv_p);
void (*cancel_fork)(struct task_struct *task, void *priv);
void (*fork)(struct task_struct *task, void *priv);
diff --git a/include/linux/cgroup.h b/include/linux/cgroup.h
index f640830..cb91b44 100644
--- a/include/linux/cgroup.h
+++ b/include/linux/cgroup.h
@@ -120,8 +120,10 @@ struct cgroup_subsys_state *css_rightmost_descendant(struct cgroup_subsys_state
struct cgroup_subsys_state *css_next_descendant_post(struct cgroup_subsys_state *pos,
struct cgroup_subsys_state *css);
-struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset);
-struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset);
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
+ struct cgroup_subsys_state **dst_cssp);
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
+ struct cgroup_subsys_state **dst_cssp);
void css_task_iter_start(struct cgroup_subsys_state *css,
struct css_task_iter *it);
@@ -236,30 +238,39 @@ void css_task_iter_end(struct css_task_iter *it);
/**
* cgroup_taskset_for_each - iterate cgroup_taskset
* @task: the loop cursor
+ * @dst_css: the destination css
* @tset: taskset to iterate
*
* @tset may contain multiple tasks and they may belong to multiple
- * processes. When there are multiple tasks in @tset, if a task of a
- * process is in @tset, all tasks of the process are in @tset. Also, all
- * are guaranteed to share the same source and destination csses.
+ * processes.
+ *
+ * On the v2 hierarchy, there may be tasks from multiple processes and they
+ * may not share the source or destination csses.
+ *
+ * On traditional hierarchies, when there are multiple tasks in @tset, if a
+ * task of a process is in @tset, all tasks of the process are in @tset.
+ * Also, all are guaranteed to share the same source and destination csses.
*
* Iteration is not in any specific order.
*/
-#define cgroup_taskset_for_each(task, tset) \
- for ((task) = cgroup_taskset_first((tset)); (task); \
- (task) = cgroup_taskset_next((tset)))
+#define cgroup_taskset_for_each(task, dst_css, tset) \
+ for ((task) = cgroup_taskset_first((tset), &(dst_css)); \
+ (task); \
+ (task) = cgroup_taskset_next((tset), &(dst_css)))
/**
* cgroup_taskset_for_each_leader - iterate group leaders in a cgroup_taskset
* @leader: the loop cursor
+ * @dst_css: the destination css
* @tset: takset to iterate
*
* Iterate threadgroup leaders of @tset. For single-task migrations, @tset
* may not contain any.
*/
-#define cgroup_taskset_for_each_leader(leader, tset) \
- for ((leader) = cgroup_taskset_first((tset)); (leader); \
- (leader) = cgroup_taskset_next((tset))) \
+#define cgroup_taskset_for_each_leader(leader, dst_css, tset) \
+ for ((leader) = cgroup_taskset_first((tset), &(dst_css)); \
+ (leader); \
+ (leader) = cgroup_taskset_next((tset), &(dst_css))) \
if ((leader) != (leader)->group_leader) \
; \
else
diff --git a/kernel/cgroup.c b/kernel/cgroup.c
index 5cea63f..470f653 100644
--- a/kernel/cgroup.c
+++ b/kernel/cgroup.c
@@ -2237,6 +2237,9 @@ struct cgroup_taskset {
struct list_head src_csets;
struct list_head dst_csets;
+ /* the subsys currently being processed */
+ int ssid;
+
/*
* Fields for cgroup_taskset_*() iteration.
*
@@ -2299,25 +2302,29 @@ static void cgroup_taskset_add(struct task_struct *task,
/**
* cgroup_taskset_first - reset taskset and return the first task
* @tset: taskset of interest
+ * @dst_cssp: output variable for the destination css
*
* @tset iteration is initialized and the first task is returned.
*/
-struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset)
+struct task_struct *cgroup_taskset_first(struct cgroup_taskset *tset,
+ struct cgroup_subsys_state **dst_cssp)
{
tset->cur_cset = list_first_entry(tset->csets, struct css_set, mg_node);
tset->cur_task = NULL;
- return cgroup_taskset_next(tset);
+ return cgroup_taskset_next(tset, dst_cssp);
}
/**
* cgroup_taskset_next - iterate to the next task in taskset
* @tset: taskset of interest
+ * @dst_cssp: output variable for the destination css
*
* Return the next task in @tset. Iteration must have been initialized
* with cgroup_taskset_first().
*/
-struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
+struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset,
+ struct cgroup_subsys_state **dst_cssp)
{
struct css_set *cset = tset->cur_cset;
struct task_struct *task = tset->cur_task;
@@ -2332,6 +2339,18 @@ struct task_struct *cgroup_taskset_next(struct cgroup_taskset *tset)
if (&task->cg_list != &cset->mg_tasks) {
tset->cur_cset = cset;
tset->cur_task = task;
+
+ /*
+ * This function may be called both before and
+ * after cgroup_taskset_migrate(). The two cases
+ * can be distinguished by looking at whether @cset
+ * has its ->mg_dst_cset set.
+ */
+ if (cset->mg_dst_cset)
+ *dst_cssp = cset->mg_dst_cset->subsys[tset->ssid];
+ else
+ *dst_cssp = cset->subsys[tset->ssid];
+
return task;
}
@@ -2367,7 +2386,8 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
/* check that we can legitimately attach to the cgroup */
for_each_e_css(css, i, dst_cgrp) {
if (css->ss->can_attach) {
- ret = css->ss->can_attach(css, tset);
+ tset->ssid = i;
+ ret = css->ss->can_attach(tset);
if (ret) {
failed_css = css;
goto out_cancel_attach;
@@ -2400,9 +2420,12 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
*/
tset->csets = &tset->dst_csets;
- for_each_e_css(css, i, dst_cgrp)
- if (css->ss->attach)
- css->ss->attach(css, tset);
+ for_each_e_css(css, i, dst_cgrp) {
+ if (css->ss->attach) {
+ tset->ssid = i;
+ css->ss->attach(tset);
+ }
+ }
ret = 0;
goto out_release_tset;
@@ -2411,8 +2434,10 @@ static int cgroup_taskset_migrate(struct cgroup_taskset *tset,
for_each_e_css(css, i, dst_cgrp) {
if (css == failed_css)
break;
- if (css->ss->cancel_attach)
- css->ss->cancel_attach(css, tset);
+ if (css->ss->cancel_attach) {
+ tset->ssid = i;
+ css->ss->cancel_attach(tset);
+ }
}
out_release_tset:
spin_lock_bh(&css_set_lock);
diff --git a/kernel/cgroup_freezer.c b/kernel/cgroup_freezer.c
index ff02a8e..2d3df82 100644
--- a/kernel/cgroup_freezer.c
+++ b/kernel/cgroup_freezer.c
@@ -155,10 +155,10 @@ static void freezer_css_free(struct cgroup_subsys_state *css)
* @freezer->lock. freezer_attach() makes the new tasks conform to the
* current state and all following state changes can see the new tasks.
*/
-static void freezer_attach(struct cgroup_subsys_state *new_css,
- struct cgroup_taskset *tset)
+static void freezer_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
+ struct cgroup_subsys_state *new_css;
mutex_lock(&freezer_mutex);
@@ -172,7 +172,7 @@ static void freezer_attach(struct cgroup_subsys_state *new_css,
* current state before executing the following - !frozen tasks may
* be visible in a FROZEN cgroup and frozen tasks in a THAWED one.
*/
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, new_css, tset) {
struct freezer *freezer = css_freezer(new_css);
if (!(freezer->state & CGROUP_FREEZING)) {
diff --git a/kernel/cgroup_pids.c b/kernel/cgroup_pids.c
index de3359a..8e27fc5 100644
--- a/kernel/cgroup_pids.c
+++ b/kernel/cgroup_pids.c
@@ -162,13 +162,13 @@ static int pids_try_charge(struct pids_cgroup *pids, int num)
return -EAGAIN;
}
-static int pids_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int pids_can_attach(struct cgroup_taskset *tset)
{
- struct pids_cgroup *pids = css_pids(css);
struct task_struct *task;
+ struct cgroup_subsys_state *dst_css;
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, dst_css, tset) {
+ struct pids_cgroup *pids = css_pids(dst_css);
struct cgroup_subsys_state *old_css;
struct pids_cgroup *old_pids;
@@ -187,13 +187,13 @@ static int pids_can_attach(struct cgroup_subsys_state *css,
return 0;
}
-static void pids_cancel_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void pids_cancel_attach(struct cgroup_taskset *tset)
{
- struct pids_cgroup *pids = css_pids(css);
struct task_struct *task;
+ struct cgroup_subsys_state *dst_css;
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, dst_css, tset) {
+ struct pids_cgroup *pids = css_pids(dst_css);
struct cgroup_subsys_state *old_css;
struct pids_cgroup *old_pids;
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 10ae736..02a8ea5 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1429,15 +1429,16 @@ static int fmeter_getrate(struct fmeter *fmp)
static struct cpuset *cpuset_attach_old_cs;
/* Called by cgroups to determine if a cpuset is usable; cpuset_mutex held */
-static int cpuset_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int cpuset_can_attach(struct cgroup_taskset *tset)
{
- struct cpuset *cs = css_cs(css);
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
struct task_struct *task;
int ret;
/* used later by cpuset_attach() */
- cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset));
+ cpuset_attach_old_cs = task_cs(cgroup_taskset_first(tset, &css));
+ cs = css_cs(css);
mutex_lock(&cpuset_mutex);
@@ -1447,7 +1448,7 @@ static int cpuset_can_attach(struct cgroup_subsys_state *css,
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->cpus_allowed);
if (ret)
goto out_unlock;
@@ -1467,9 +1468,14 @@ static int cpuset_can_attach(struct cgroup_subsys_state *css,
return ret;
}
-static void cpuset_cancel_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void cpuset_cancel_attach(struct cgroup_taskset *tset)
{
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
+
+ cgroup_taskset_first(tset, &css);
+ cs = css_cs(css);
+
mutex_lock(&cpuset_mutex);
css_cs(css)->attach_in_progress--;
mutex_unlock(&cpuset_mutex);
@@ -1482,16 +1488,19 @@ static void cpuset_cancel_attach(struct cgroup_subsys_state *css,
*/
static cpumask_var_t cpus_attach;
-static void cpuset_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void cpuset_attach(struct cgroup_taskset *tset)
{
/* static buf protected by cpuset_mutex */
static nodemask_t cpuset_attach_nodemask_to;
struct task_struct *task;
struct task_struct *leader;
- struct cpuset *cs = css_cs(css);
+ struct cgroup_subsys_state *css;
+ struct cpuset *cs;
struct cpuset *oldcs = cpuset_attach_old_cs;
+ cgroup_taskset_first(tset, &css);
+ cs = css_cs(css);
+
mutex_lock(&cpuset_mutex);
/* prepare for attach */
@@ -1502,7 +1511,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
guarantee_online_mems(cs, &cpuset_attach_nodemask_to);
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, css, tset) {
/*
* can_attach beforehand should guarantee that this doesn't
* fail. TODO: have a better way to handle failure here
@@ -1518,7 +1527,7 @@ static void cpuset_attach(struct cgroup_subsys_state *css,
* sleep and should be moved outside migration path proper.
*/
cpuset_attach_nodemask_to = cs->effective_mems;
- cgroup_taskset_for_each_leader(leader, tset) {
+ cgroup_taskset_for_each_leader(leader, css, tset) {
struct mm_struct *mm = get_task_mm(leader);
if (mm) {
diff --git a/kernel/events/core.c b/kernel/events/core.c
index 36babfd..026305d 100644
--- a/kernel/events/core.c
+++ b/kernel/events/core.c
@@ -9456,12 +9456,12 @@ static int __perf_cgroup_move(void *info)
return 0;
}
-static void perf_cgroup_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void perf_cgroup_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
+ struct cgroup_subsys_state *css;
- cgroup_taskset_for_each(task, tset)
+ cgroup_taskset_for_each(task, css, tset)
task_function_call(task, __perf_cgroup_move, task);
}
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4d568ac..a9db4819 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8217,12 +8217,12 @@ static void cpu_cgroup_fork(struct task_struct *task, void *private)
sched_move_task(task);
}
-static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int cpu_cgroup_can_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
+ struct cgroup_subsys_state *css;
- cgroup_taskset_for_each(task, tset) {
+ cgroup_taskset_for_each(task, css, tset) {
#ifdef CONFIG_RT_GROUP_SCHED
if (!sched_rt_can_attach(css_tg(css), task))
return -EINVAL;
@@ -8235,12 +8235,12 @@ static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
return 0;
}
-static void cpu_cgroup_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void cpu_cgroup_attach(struct cgroup_taskset *tset)
{
struct task_struct *task;
+ struct cgroup_subsys_state *css;
- cgroup_taskset_for_each(task, tset)
+ cgroup_taskset_for_each(task, css, tset)
sched_move_task(task);
}
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index 9acfb16..c92a65b 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -4779,23 +4779,18 @@ static void mem_cgroup_clear_mc(void)
spin_unlock(&mc.lock);
}
-static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
- struct mem_cgroup *memcg = mem_cgroup_from_css(css);
+ struct cgroup_subsys_state *css;
+ struct mem_cgroup *memcg;
struct mem_cgroup *from;
struct task_struct *leader, *p;
struct mm_struct *mm;
unsigned long move_flags;
int ret = 0;
- /*
- * We are now commited to this value whatever it is. Changes in this
- * tunable will only affect upcoming migrations, not the current one.
- * So we need to save it, and keep it going.
- */
- move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
- if (!move_flags)
+ /* charge immigration isn't supported on the default hierarchy */
+ if (cgroup_subsys_on_dfl(memory_cgrp_subsys))
return 0;
/*
@@ -4805,13 +4800,23 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
* multiple.
*/
p = NULL;
- cgroup_taskset_for_each_leader(leader, tset) {
+ cgroup_taskset_for_each_leader(leader, css, tset) {
WARN_ON_ONCE(p);
p = leader;
+ memcg = mem_cgroup_from_css(css);
}
if (!p)
return 0;
+ /*
+ * We are now commited to this value whatever it is. Changes in this
+ * tunable will only affect upcoming migrations, not the current one.
+ * So we need to save it, and keep it going.
+ */
+ move_flags = READ_ONCE(memcg->move_charge_at_immigrate);
+ if (!move_flags)
+ return 0;
+
from = mem_cgroup_from_task(p);
VM_BUG_ON(from == memcg);
@@ -4842,8 +4847,7 @@ static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
return ret;
}
-static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
if (mc.to)
mem_cgroup_clear_mc();
@@ -4985,10 +4989,10 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
atomic_dec(&mc.from->moving_account);
}
-static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(struct cgroup_taskset *tset)
{
- struct task_struct *p = cgroup_taskset_first(tset);
+ struct cgroup_subsys_state *css;
+ struct task_struct *p = cgroup_taskset_first(tset, &css);
struct mm_struct *mm = get_task_mm(p);
if (mm) {
@@ -5000,17 +5004,14 @@ static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
mem_cgroup_clear_mc();
}
#else /* !CONFIG_MMU */
-static int mem_cgroup_can_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static int mem_cgroup_can_attach(struct cgroup_taskset *tset)
{
return 0;
}
-static void mem_cgroup_cancel_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void mem_cgroup_cancel_attach(struct cgroup_taskset *tset)
{
}
-static void mem_cgroup_move_task(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void mem_cgroup_move_task(struct cgroup_taskset *tset)
{
}
#endif
diff --git a/net/core/netclassid_cgroup.c b/net/core/netclassid_cgroup.c
index 6441f47..81cb3c7 100644
--- a/net/core/netclassid_cgroup.c
+++ b/net/core/netclassid_cgroup.c
@@ -67,14 +67,15 @@ static int update_classid(const void *v, struct file *file, unsigned n)
return 0;
}
-static void cgrp_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void cgrp_attach(struct cgroup_taskset *tset)
{
- struct cgroup_cls_state *cs = css_cls_state(css);
- void *v = (void *)(unsigned long)cs->classid;
struct task_struct *p;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ struct cgroup_cls_state *cs = css_cls_state(css);
+ void *v = (void *)(unsigned long)cs->classid;
- cgroup_taskset_for_each(p, tset) {
task_lock(p);
iterate_fd(p->files, 0, update_classid, v);
task_unlock(p);
diff --git a/net/core/netprio_cgroup.c b/net/core/netprio_cgroup.c
index cbd0a19..40fd09f 100644
--- a/net/core/netprio_cgroup.c
+++ b/net/core/netprio_cgroup.c
@@ -218,13 +218,14 @@ static int update_netprio(const void *v, struct file *file, unsigned n)
return 0;
}
-static void net_prio_attach(struct cgroup_subsys_state *css,
- struct cgroup_taskset *tset)
+static void net_prio_attach(struct cgroup_taskset *tset)
{
struct task_struct *p;
- void *v = (void *)(unsigned long)css->cgroup->id;
+ struct cgroup_subsys_state *css;
+
+ cgroup_taskset_for_each(p, css, tset) {
+ void *v = (void *)(unsigned long)css->cgroup->id;
- cgroup_taskset_for_each(p, tset) {
task_lock(p);
iterate_fd(p->files, 0, update_netprio, v);
task_unlock(p);
--
2.5.0
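From a controller's point of view, every converted method now has the shape
sketched below; this is not copied from the patch (demo_can_attach() is a
hypothetical name), and css_pids(), pids_charge(), pids_uncharge() and
task_css() are assumed to keep their existing cgroup_pids.c signatures:

/*
 * Illustration only: the destination css is handed out per task by the
 * iterator, so a single migration may charge several different csses.
 */
static int demo_can_attach(struct cgroup_taskset *tset)
{
        struct task_struct *task;
        struct cgroup_subsys_state *dst_css;

        cgroup_taskset_for_each(task, dst_css, tset) {
                struct pids_cgroup *pids = css_pids(dst_css);
                struct pids_cgroup *old_pids =
                        css_pids(task_css(task, pids_cgrp_id));

                /* move the charge from the source css to the actual destination */
                pids_charge(pids, 1);
                pids_uncharge(old_pids, 1);
        }
        return 0;
}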
Hi Tejun,
On 11/30/2015 11:44 PM, Tejun Heo wrote:
> WARNING: CPU: 1 PID: 1 at kernel/cgroup_pids.c:97 pids_cancel.constprop.6+0x31/0x40()
> Modules linked in:
> CPU: 1 PID: 1 Comm: systemd Not tainted 4.4.0-rc1+ #29
> ...
> ffffffff81f65382 ffff88007c043b90 ffffffff81551ffc 0000000000000000
> ffff88007c043bc8 ffffffff810de202 ffff88007a752000 ffff88007a29ab00
> ffff88007c043c80 ffff88007a1d8400 0000000000000001 ffff88007c043bd8
> Call Trace:
> [<ffffffff81551ffc>] dump_stack+0x4e/0x82
> [<ffffffff810de202>] warn_slowpath_common+0x82/0xc0
> [<ffffffff810de2fa>] warn_slowpath_null+0x1a/0x20
> [<ffffffff8118e031>] pids_cancel.constprop.6+0x31/0x40
> [<ffffffff8118e0fd>] pids_can_attach+0x6d/0xf0
> [<ffffffff81188a4c>] cgroup_taskset_migrate+0x6c/0x330
> [<ffffffff81188e05>] cgroup_migrate+0xf5/0x190
> [<ffffffff81189016>] cgroup_attach_task+0x176/0x200
> [<ffffffff8118949d>] __cgroup_procs_write+0x2ad/0x460
> [<ffffffff81189684>] cgroup_procs_write+0x14/0x20
> [<ffffffff811854e5>] cgroup_file_write+0x35/0x1c0
> [<ffffffff812e26f1>] kernfs_fop_write+0x141/0x190
> [<ffffffff81265f88>] __vfs_write+0x28/0xe0
> [<ffffffff812666fc>] vfs_write+0xac/0x1a0
> [<ffffffff81267019>] SyS_write+0x49/0xb0
> [<ffffffff81bcef32>] entry_SYSCALL_64_fastpath+0x12/0x76
>
> This patch fixes the bug by removing the @css parameter from the three
> migration methods, ->can_attach(), ->cancel_attach() and ->attach(), and
> by updating the cgroup_taskset iteration helpers to also return the
> destination css in addition to the task being migrated. All controllers
> are updated accordingly.
I was not able to verify whether these two patches fix it. I don't see
the call trace on mainline, only when using the cgroup/review-xt_cgroup2
review branch.
So I ported it to review-xt_cgroup2 with only a small merge conflict in
netclassid_cgroup.c. No luck though, I still see it.
Is there a patch missing? The subject indicates there should be 3 patches.
cheers,
daniel
Hello, Daniel.
On Tue, Dec 01, 2015 at 08:02:23AM +0100, Daniel Wagner wrote:
> I was not able to verify whether these two patches fix it. I don't see
> the call trace on mainline, only when using the cgroup/review-xt_cgroup2
> review branch.
>
> So I ported it to review-xt_cgroup2 with only a small merge conflict in
> netclassid_cgroup.c. No luck though, I still see it.
It also needs the previous ref fix patch and Oleg's race fix. Can
you please test the following branch?
git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pids-fixes
> Is there a patch missing? The subject indicates there should be 3 patches.
That's just me messing up the patch title. Sorry.
Thanks.
--
tejun
Hi Tejun,
On 12/01/2015 05:44 PM, Tejun Heo wrote:
> On Tue, Dec 01, 2015 at 08:02:23AM +0100, Daniel Wagner wrote:
>> I was not able to verify whether these two patches fix it. I don't see
>> the call trace on mainline, only when using the cgroup/review-xt_cgroup2
>> review branch.
>>
>> So I ported it to review-xt_cgroup2 with only a small merge conflict in
>> netclassid_cgroup.c. No luck though, I still see it.
>
> It also needs the previous ref fix patch and Oleg's race fix too. Can
> you please test the following branch?
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup.git review-pids-fixes
First I verified that I see the stack trace with v4.4-rc1. Then I
switched to review-pids-fixes and everything looks good now.
You can add a
Tested-by: Daniel Wagner <[email protected]>
if you like.
Thanks!
Daniel
On Mon, Nov 30, 2015 at 05:44:31PM -0500, Tejun Heo wrote:
> From 0d7d444e260493252e30c70813c7657e9ede2f12 Mon Sep 17 00:00:00 2001
> From: Tejun Heo <[email protected]>
> Date: Mon, 30 Nov 2015 17:24:34 -0500
>
> Consider the following v2 hierarchy.
>
> P0 (+memory) --- P1 (-memory) --- A
> \- B
>
> P0 has memory enabled in its subtree_control while P1 doesn't. If
> both A and B contain processes, they would belong to the memory css of
> P1. Now if memory is enabled on P1's subtree_control, memory csses
> should be created on both A and B and A's processes should be moved to
> the former and B's processes to the latter. IOW, enabling controllers
> can cause atomic migrations into different csses.
>
> The core cgroup migration logic has been updated accordingly but the
> controller migration methods haven't and still assume that all tasks
> migrate to a single target css; furthermore, the methods were fed the
> css in which subtree_control was updated, which is the parent of the
> target csses. The pids controller depends on the migration methods to
> move charges, and this made the controller attribute charges to the
> wrong csses, often triggering the following warning by driving a
> counter negative.
Applying 1-2 to libata/for-4.4-fixes.
Thanks.
--
tejun
On Thu, Dec 03, 2015 at 10:16:32AM -0500, Tejun Heo wrote:
> Applying 1-2 to libata/for-4.4-fixes.
That should have been cgroup/for-4.4-fixes.
Thanks.
--
tejun