2008-06-14 22:31:46

by Daniel K.

Subject: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

I get the following on the latest Linus git tree.

Testcase:

mkdir /dev/cgroup
mount -t cgroup -o cpu,cpuset cgroup /dev/cgroup

mkdir -p /dev/cgroup/burn/oops
cd /dev/cgroup/burn

echo 3 > cpuset.cpus
echo 0 > cpuset.mems
echo 1000000 > cpu.rt_period_us
echo 940000 > cpu.rt_runtime_us

echo 3 > oops/cpuset.cpus
echo 0 > oops/cpuset.mems
echo 100000 > oops/cpu.rt_period_us
echo 4000 > oops/cpu.rt_runtime_us

echo $$ > oops/tasks
schedtool -R -p 1 -e burnP6

And then it breaks into the pieces below, as captured by netconsole.

> [ 492.586059] BUG: unable to handle kernel NULL pointer dereference at 0000000000000062
> [ 492.586059] IP: [<ffffffff8022e635>] enqueue_rt_entity+0x55/0x1d0
> [ 492.586059] PGD 21e439067 PUD 21e438067 PMD 0
> [ 492.586059] Oops: 0002 [1] SMP
> [ 492.586059] CPU 3
> [ 492.586059] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables loop af_packet usbhid hid evdev i2c_nforce2 k8temp button pcspkr shpchp pci_hotplug i2c_core tg3 sd_mod ehci_hcd ohci_hcd forcedeth sg usbcore thermal processor fan thermal_sys
> [ 492.586059] Pid: 3405, comm: schedtool Not tainted 2.6.26-rc6 #2
> [ 492.586059] RIP: 0010:[<ffffffff8022e635>] [<ffffffff8022e635>] enqueue_rt_entity+0x55/0x1d0
> [ 492.586059] RSP: 0018:ffff81021e415e48 EFLAGS: 00010012
> [ 492.586059] RAX: ffff810001056d48 RBX: ffff81022309e900 RCX: ffff81022309e860
> [ 492.586059] RDX: 0000000000000062 RSI: 0000000000000086 RDI: ffff81022309e900
> [ 492.586059] RBP: ffff81021e415e58 R08: ffff810001056e50 R09: 000000009b10fa5a
> [ 492.586059] R10: 0000000000000000 R11: ffff810001056670 R12: ffff8100010566f8
> [ 492.586059] R13: 0000000000000001 R14: 0000000000000001 R15: ffff81021e415f38
> [ 492.586059] FS: 00007f675ec286e0(0000) GS:ffff810223022980(0000) knlGS:0000000000000000
> [ 492.586059] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 492.586059] CR2: 0000000000000062 CR3: 000000021e44e000 CR4: 00000000000006e0
> [ 492.586059] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 492.586059] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 492.586059] Process schedtool (pid: 3405, threadinfo ffff81021e414000, task ffff810221771980)
> [ 492.586059] Stack: ffff81022309e900 ffff810221771980 ffff81021e415e78 ffffffff8022e7e8
> [ 492.586059] 0000000000000001 ffff810221771980 ffff81021e415e98 ffffffff80229ea3
> [ 492.586059] ffff810001056670 ffff810001056600 ffff81021e415eb8 ffffffff80229f20
> [ 492.586059] Call Trace:
> [ 492.586059] [<ffffffff8022e7e8>] enqueue_task_rt+0x38/0x50
> [ 492.586059] [<ffffffff80229ea3>] enqueue_task+0x13/0x30
> [ 492.586059] [<ffffffff80229f20>] activate_task+0x30/0x50
> [ 492.586059] [<ffffffff8023336f>] sched_setscheduler+0x28f/0x3b0
> [ 492.586059] [<ffffffff8028b818>] ? do_munmap+0x278/0x2d0
> [ 492.586059] [<ffffffff8023350d>] do_sched_setscheduler+0x7d/0x90
> [ 492.586059] [<ffffffff80233554>] sys_sched_setscheduler+0x14/0x20
> [ 492.586059] [<ffffffff8020b77a>] system_call_after_swapgs+0x8a/0x8f
> [ 492.586059]
> [ 492.586059]
> [ 492.586059] Code: 85 c9 0f 84 76 01 00 00 8b 81 58 06 00 00 48 98 48 8d 8b 60 ff ff ff 48 c1 e0 04 4a 8d 44 20 10 48 8b 50 08 48 89 03 48 89 58 08 <48> 89 1a 48 89 53 08 48 8b 53 40 48 8d 82 58 06 00 00 48 85 d2
> [ 492.586059] RIP [<ffffffff8022e635>] enqueue_rt_entity+0x55/0x1d0
> [ 492.586059] RSP <ffff81021e415e48>
> [ 492.586059] CR2: 0000000000000062
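
For anyone decoding the Oops by hand: the bytes around the marked instruction (`48 8b 50 08 48 89 03 48 89 58 08 <48> 89 1a 48 89 53 08`) read like an inlined list-tail insertion. The marked `mov %rbx,(%rdx)` would then be the `prev->next = new` store, and RDX holds 0x62, the same value as CR2. The sketch below is a user-space model of such an insertion with the decoded instructions as comments. It is not the kernel's code, and `struct list_node`, `list_init` and `list_add_tail_sketch` are names made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a doubly linked list head/node. */
struct list_node {
	struct list_node *next, *prev;
};

static void list_init(struct list_node *head)
{
	head->next = head;
	head->prev = head;
}

/* Insert new_node just before head, i.e. at the tail of the list. */
static void list_add_tail_sketch(struct list_node *new_node,
				 struct list_node *head)
{
	struct list_node *prev;

	assert(new_node != NULL && head != NULL);

	prev = head->prev;		/* mov 0x8(%rax),%rdx */
	new_node->next = head;		/* mov %rax,(%rbx)    */
	head->prev = new_node;		/* mov %rbx,0x8(%rax) */
	prev->next = new_node;		/* mov %rbx,(%rdx), faults if prev is garbage */
	new_node->prev = prev;		/* mov %rdx,0x8(%rbx) */
}
```

On a properly initialised head the insertion is harmless. If `head` points at memory that is not a list head, the value loaded into `prev` is garbage and the fourth store faults, which would match a write fault (Oops: 0002) at a small address such as 0x62.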

Some information about the compiler, and Kconfig

daniel@lc01:~/git/linux-2.6$ cat /proc/version
Linux version 2.6.26-rc6 (daniel@lc01) (gcc version 4.2.3 (Ubuntu 4.2.3-2ubuntu7)) #2 SMP Sat Jun 14 21:51:31 CEST 2008

daniel@lc01:~/git/linux-2.6$ cat .config|egrep "(CGROUP|SCHED)"
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
# CONFIG_CGROUP_NS is not set
# CONFIG_CGROUP_DEVICE is not set
CONFIG_HAVE_UNSTABLE_SCHED_CLOCK=y
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_RT_GROUP_SCHED=y
# CONFIG_USER_SCHED is not set
CONFIG_CGROUP_SCHED=y
CONFIG_CGROUP_CPUACCT=y
# CONFIG_CGROUP_MEM_RES_CTLR is not set
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
CONFIG_DEFAULT_IOSCHED="deadline"
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
CONFIG_SCHED_HRTICK=y
CONFIG_NET_SCHED=y
CONFIG_USB_EHCI_TT_NEWSCHED=y
CONFIG_SCHED_DEBUG=y
# CONFIG_SCHEDSTATS is not set


Daniel K.


2008-06-16 10:34:33

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Sun, 2008-06-15 at 00:26 +0200, Daniel K. wrote:
> I get the following on the latest Linus git tree.
>
> Testcase:
>
> mkdir /dev/cgroup
> mount -t cgroup -o cpu,cpuset cgroup /dev/cgroup
>
> mkdir -p /dev/cgroup/burn/oops
> cd /dev/cgroup/burn
>
> echo 3 > cpuset.cpus
> echo 0 > cpuset.mems
> echo 1000000 > cpu.rt_period_us
> echo 940000 > cpu.rt_runtime_us
>
> echo 3 > oops/cpuset.cpus
> echo 0 > oops/cpuset.mems
> echo 100000 > oops/cpu.rt_period_us
> echo 4000 > oops/cpu.rt_runtime_us
>
> echo $$ > oops/tasks
> schedtool -R -p 1 -e burnP6
>
> And then it breaks into the pieces below, as captured by netconsole.

Excellent report, thanks!


Does the below work for you?


Signed-off-by: Peter Zijlstra <[email protected]>
---
Index: linux-2.6/kernel/sched.c
===================================================================
--- linux-2.6.orig/kernel/sched.c
+++ linux-2.6/kernel/sched.c
@@ -7784,7 +7784,6 @@ static void init_tg_rt_entry(struct task
 	else
 		rt_se->rt_rq = parent->my_q;

-	rt_se->rt_rq = &rq->rt;
 	rt_se->my_q = rt_rq;
 	rt_se->parent = parent;
 	INIT_LIST_HEAD(&rt_se->run_list);
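
To see why the removed line matters, take the hierarchy from the testcase: the root, then `burn`, then `burn/oops`. Each group's entity is meant to be queued on its parent's run queue, but the stray assignment after the if/else pointed every entity back at the root queue. The following is a simplified user-space model under that reading. The struct layouts, the `_sketch` names and the `buggy` flag are invented for illustration and are not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

struct rt_rq_sketch {
	int id;
};

struct rt_entity_sketch {
	struct rt_rq_sketch *rt_rq;		/* queue this entity is queued on */
	struct rt_rq_sketch *my_q;		/* queue this entity owns */
	struct rt_entity_sketch *parent;	/* NULL for a child of the root */
};

/*
 * Model of the group entity setup. With buggy != 0 the unconditional
 * assignment that the patch removes is replayed.
 */
static void init_entity(struct rt_entity_sketch *se,
			struct rt_rq_sketch *own_q,
			struct rt_rq_sketch *root_q,
			struct rt_entity_sketch *parent,
			int buggy)
{
	assert(se != NULL && own_q != NULL && root_q != NULL);

	if (!parent)
		se->rt_rq = root_q;
	else
		se->rt_rq = parent->my_q;

	if (buggy)
		se->rt_rq = root_q;	/* overrides the choice made above */

	se->my_q = own_q;
	se->parent = parent;
}
```

With `buggy` set, the entity for `oops` ends up on the root queue even though its parent is `burn`. For a single level of groups the two variants agree, which may be why only the nested group in the testcase tripped over it.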

2008-06-16 13:11:50

by Daniel K.

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

Peter Zijlstra wrote:
> Does the below work for you?
>
> Signed-off-by: Peter Zijlstra <[email protected]>
> ---
> Index: linux-2.6/kernel/sched.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched.c
> +++ linux-2.6/kernel/sched.c
> @@ -7784,7 +7784,6 @@ static void init_tg_rt_entry(struct task
> else
> rt_se->rt_rq = parent->my_q;
>
> - rt_se->rt_rq = &rq->rt;
> rt_se->my_q = rt_rq;
> rt_se->parent = parent;
> INIT_LIST_HEAD(&rt_se->run_list);

Although this patch seems to be correct, this is what shows up on my
netconsole, when applying it -- with an offset, do you have other fixes
applied as well?


Daniel K.

> [ 129.390189] ------------[ cut here ]------------
> [ 129.390370] Kernel BUG at ffffffff8022acea [verbose debug info unavailable]
> [ 129.390489] invalid opcode: 0000 [1] SMP
> [ 129.390672] CPU 3
> [ 129.390811] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables af_packet usbhid hid loop evdev ehci_hcd tg3 ohci_hcd i2c_nforce2 forcedeth shpchp pci_hotplug button pcspkr k8temp thermal usbcore processor i2c_core sd_mod sg fan thermal_sys
> [ 129.393446] Pid: 3375, comm: burnP6 Not tainted 2.6.26-rc6 #3
> [ 129.393560] RIP: 0010:[<ffffffff8022acea>] [<ffffffff8022acea>] pick_next_task_rt+0x5a/0x90
> [ 129.393784] RSP: 0000:ffff81021eda5ea0 EFLAGS: 00010002
> [ 129.393896] RAX: 0000000000000064 RBX: ffffffff8049ec00 RCX: ffff810221495000
> [ 129.393920] RDX: ffff81022113f9c0 RSI: 0000000000000003 RDI: ffff810001056600
> [ 129.393920] RBP: ffff81021eda5ea0 R08: ffff810001050660 R09: 0000000000001280
> [ 129.393920] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000000
> [ 129.393920] R13: ffff810001056600 R14: 0000000000000003 R15: 0000000000000000
> [ 129.393920] FS: 00007f36635536e0(0000) GS:ffff810223022980(0000) knlGS:0000000000000000
> [ 129.393920] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> [ 129.393920] CR2: 0000000008049130 CR3: 0000000221d08000 CR4: 00000000000006e0
> [ 129.393920] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 129.393920] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 129.393920] Process burnP6 (pid: 3375, threadinfo ffff81021eda4000, task ffff810220ded2e0)
> [ 129.393920] Stack: ffff81021eda5f70 ffffffff8048c302 0000000000000000 ffff81021ec1da00
> [ 129.393920] ffffffff80689600 ffffffff80689600 ffffffff806858a0 ffffffff80689600
> [ 129.393920] ffff810220ded558 0000000000000000 0000000000000292 ffff810220ded2e0
> [ 129.393920] Call Trace:
> [ 129.393920] [<ffffffff8048c302>] thread_return+0x101/0x4af
> [ 129.393921] [<ffffffff80343761>] ? __up_read+0x21/0xb0
> [ 129.393921] [<ffffffff8020bdee>] retint_careful+0x1c/0x42
> [ 129.393921]
> [ 129.393921]
> [ 129.393921] Code: 48 c1 e0 04 48 8b 14 08 48 85 d2 74 49 48 8b 4a 40 48 85 c9 74 1b 48 8b 01 48 85 c0 75 d4 48 0f bc 41 08 83 c0 40 83 f8 63 7e d0 <0f> 0b eb fe 66 90 48 8b 87 f8 07 00 00 48 8d 8a 40 ff ff ff 48
> [ 129.400164] RIP [<ffffffff8022acea>] pick_next_task_rt+0x5a/0x90
> [ 129.400164] RSP <ffff81021eda5ea0>
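
The `invalid opcode` trap at `pick_next_task_rt+0x5a` looks like an assertion firing right after the priority bitmap search: RAX holds 0x64 (100), and the bytes just before the trapping `0f 0b` compare the result against 0x63 (99). Assuming the RT queue keeps one bit per priority (0 to 99) plus an always-set delimiter bit at index 100, a search over a queue with nothing on it returns 100. Below is a user-space sketch of that lookup. `MAX_RT_PRIO_SKETCH`, `struct prio_bitmap` and the function names are assumptions made for illustration, not the kernel's definitions.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RT_PRIO_SKETCH 100	/* assumed number of RT priority levels */

/* Two 64-bit words cover bits 0..100: 100 priorities plus one delimiter. */
struct prio_bitmap {
	uint64_t word[2];
};

static void bitmap_init(struct prio_bitmap *b)
{
	b->word[0] = 0;
	/* Delimiter bit at index 100 so the search always finds something. */
	b->word[1] = UINT64_C(1) << (MAX_RT_PRIO_SKETCH - 64);
}

static void bitmap_set(struct prio_bitmap *b, int prio)
{
	assert(prio >= 0 && prio < MAX_RT_PRIO_SKETCH);
	b->word[prio / 64] |= UINT64_C(1) << (prio % 64);
}

/* Index of the lowest set bit of a non-zero word. */
static int lowest_bit(uint64_t w)
{
	int i = 0;

	assert(w != 0);
	while (!(w & 1)) {
		w >>= 1;
		i++;
	}
	return i;
}

/* Lowest set bit over both words; 100 means only the delimiter is set. */
static int find_first_prio(const struct prio_bitmap *b)
{
	if (b->word[0])
		return lowest_bit(b->word[0]);
	return lowest_bit(b->word[1]) + 64;
}

/* Non-zero when no priority bit is set, the case the check rejects. */
static int queue_looks_empty(const struct prio_bitmap *b)
{
	return find_first_prio(b) >= MAX_RT_PRIO_SKETCH;
}
```

Read this way, the scheduler was asked to pick a task from an RT queue whose bitmap had no priority set, that is, a group that was still queued although it had nothing runnable.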

2008-06-16 13:52:15

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Mon, 2008-06-16 at 15:14 +0200, Daniel K. wrote:
> Peter Zijlstra wrote:
> > Does the below work for you?
> >
> > Signed-off-by: Peter Zijlstra <[email protected]>
> > ---
> > Index: linux-2.6/kernel/sched.c
> > ===================================================================
> > --- linux-2.6.orig/kernel/sched.c
> > +++ linux-2.6/kernel/sched.c
> > @@ -7784,7 +7784,6 @@ static void init_tg_rt_entry(struct task
> > else
> > rt_se->rt_rq = parent->my_q;
> >
> > - rt_se->rt_rq = &rq->rt;
> > rt_se->my_q = rt_rq;
> > rt_se->parent = parent;
> > INIT_LIST_HEAD(&rt_se->run_list);
>
> Although this patch seems to be correct, this is what shows up on my
> netconsole, when applying it -- with an offset, do you have other fixes
> applied as well?

I had indeed, although nothing touching the rt scheduler. I popped all
my patches and pulled an update from Linus, but I fail to reproduce the
below.

/me goes look for that burnp6 thing, I used a simple while (1); loop.

2008-06-16 14:40:54

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Mon, 2008-06-16 at 15:51 +0200, Peter Zijlstra wrote:
> On Mon, 2008-06-16 at 15:14 +0200, Daniel K. wrote:
> > Peter Zijlstra wrote:
> > > Does the below work for you?
> > >
> > > Signed-off-by: Peter Zijlstra <[email protected]>
> > > ---
> > > Index: linux-2.6/kernel/sched.c
> > > ===================================================================
> > > --- linux-2.6.orig/kernel/sched.c
> > > +++ linux-2.6/kernel/sched.c
> > > @@ -7784,7 +7784,6 @@ static void init_tg_rt_entry(struct task
> > > else
> > > rt_se->rt_rq = parent->my_q;
> > >
> > > - rt_se->rt_rq = &rq->rt;
> > > rt_se->my_q = rt_rq;
> > > rt_se->parent = parent;
> > > INIT_LIST_HEAD(&rt_se->run_list);
> >
> > Although this patch seems to be correct, this is what shows up on my
> > netconsole, when applying it -- with an offset, do you have other fixes
> > applied as well?
>
> I had indeed, although nothing touching the rt scheduler. I popped all
> my patches and pulled an update from Linus, but I fail to reproduce the
> below.
>
> /me goes look for that burnp6 thing, I used a simple while (1); loop.

found it, still seems to work for me. do you have a funny number of
cpus? or anything else noteworthy?

2008-06-16 15:09:25

by Daniel K.

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

Peter Zijlstra wrote:
> On Mon, 2008-06-16 at 15:51 +0200, Peter Zijlstra wrote:
>> On Mon, 2008-06-16 at 15:14 +0200, Daniel K. wrote:
>>> Peter Zijlstra wrote:
>>>
>>> Although this patch seems to be correct, this is what shows up on my
>>> netconsole, when applying it -- with an offset, do you have other fixes
>>> applied as well?
>> I had indeed, although nothing touching the rt scheduler. I popped all
>> my patches and pulled an update from Linus, but I fail to reproduce the
>> below.
>>
>> /me goes look for that burnp6 thing, I used a simple while (1); loop.
>
> found it, still seems to work for me. do you have a funny number of
> cpus? or anything else noteworthy?

I don't think so, this is on a SUN X2200 M2, with two AMD Opteron 2214
processors, and 8G RAM.

If I follow the procedure up to 'echo 4000 > oops/cpu.rt_runtime_us'
then I can

# burnP6 &
[1] 3395
# schedtool -R -p 1 3395

but

# echo -n 3395 > /dev/cgroup/burn/oops/tasks

yields this:

> [ 1116.296418] ------------[ cut here ]------------
> [ 1116.296559] Kernel BUG at ffffffff8022acea [verbose debug info unavailable]
> [ 1116.296644] invalid opcode: 0000 [1] SMP
> [ 1116.296721] CPU 3
> [ 1116.296788] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables af_packet usbhid hid loop tg3 evdev i2c_nforce2 ohci_hcd i2c_core ehci_hcd k8temp button thermal processor pcspkr usbcore shpchp pci_hotplug forcedeth sd_mod sg fan thermal_sys
> [ 1116.297161] Pid: 3395, comm: burnP6 Not tainted 2.6.26-rc6 #4
> [ 1116.297240] RIP: 0010:[<ffffffff8022acea>] [<ffffffff8022acea>] pick_next_task_rt+0x5a/0x90
> [ 1116.297390] RSP: 0000:ffff81021edf7ea0 EFLAGS: 00010002
> [ 1116.297467] RAX: 0000000000000064 RBX: ffffffff8049ec00 RCX: ffff81021ef5e800
> [ 1116.297551] RDX: ffff8102214d7c00 RSI: 0000000000000003 RDI: ffff810001056600
> [ 1116.298666] RBP: ffff81021edf7ea0 R08: ffff810001050660 R09: 00000000000010a8
> [ 1116.298750] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000000
> [ 1116.298833] R13: ffff810001056600 R14: 0000000000000003 R15: 0000000000000000
> [ 1116.298917] FS: 00007fecf7cc76e0(0000) GS:ffff810223022980(0000) knlGS:0000000000000000
> [ 1116.299060] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> [ 1116.299142] CR2: 0000000001a0d958 CR3: 0000000220c79000 CR4: 00000000000006e0
> [ 1116.299225] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 1116.299309] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 1116.299393] Process burnP6 (pid: 3395, threadinfo ffff81021edf6000, task ffff81022178aca0)
> [ 1116.299535] Stack: ffff81021edf7f70 ffffffff8048c302 0000000000000000 ffff8102210b1b00
> [ 1116.299682] ffffffff80689600 ffffffff80689600 ffffffff806858a0 ffffffff80689600
> [ 1116.299827] ffff81022178af18 0000000000000000 0000000000000292 ffff81022178aca0
> [ 1116.299914] Call Trace:
> [ 1116.300046] [<ffffffff8048c302>] thread_return+0x101/0x4af
> [ 1116.300130] [<ffffffff8020bdee>] retint_careful+0x1c/0x42
> [ 1116.300210]
> [ 1116.300273]
> [ 1116.300335] Code: 48 c1 e0 04 48 8b 14 08 48 85 d2 74 49 48 8b 4a 40 48 85 c9 74 1b 48 8b 01 48 85 c0 75 d4 48 0f bc 41 08 83 c0 40 83 f8 63 7e d0 <0f> 0b eb fe 66

I'll go say hello to mr. proper, and report back.


Daniel K.

2008-06-16 15:19:19

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Mon, 2008-06-16 at 17:11 +0200, Daniel K. wrote:
> Peter Zijlstra wrote:
> > On Mon, 2008-06-16 at 15:51 +0200, Peter Zijlstra wrote:
> >> On Mon, 2008-06-16 at 15:14 +0200, Daniel K. wrote:
> >>> Peter Zijlstra wrote:
> >>>
> >>> Although this patch seems to be correct, this is what shows up on my
> >>> netconsole, when applying it -- with an offset, do you have other fixes
> >>> applied as well?
> >> I had indeed, although nothing touching the rt scheduler. I popped all
> >> my patches and pulled an update from Linus, but I fail to reproduce the
> >> below.
> >>
> >> /me goes look for that burnp6 thing, I used a simple while (1); loop.
> >
> > found it, still seems to work for me. do you have a funny number of
> > cpus? or anything else noteworthy?
>
> I don't think so, this is on a SUN X2200 M2, with two AMD Opteron 2214
> processors, and 8G RAM.
>
> If I follow the procedure up to 'echo 4000 > oops/cpu.rt_runtime_us'
> then I can
>
> # burnP6 &
> [1] 3395
> # schedtool -R -p 1 3395
>
> but
>
> # echo -n 3395 > /dev/cgroup/burn/oops/tasks

Ah, that did it,.. I'll go poke at it. Thanks!

> yields this:
>
> > [ 1116.296418] ------------[ cut here ]------------
> > [ 1116.296559] Kernel BUG at ffffffff8022acea [verbose debug info unavailable]
> > [ 1116.296644] invalid opcode: 0000 [1] SMP
> > [ 1116.296721] CPU 3
> > [ 1116.296788] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables af_packet usbhid hid loop tg3 evdev i2c_nforce2 ohci_hcd i2c_core ehci_hcd k8temp button thermal processor pcspkr usbcore shpchp pci_hotplug forcedeth sd_mod sg fan thermal_sys
> > [ 1116.297161] Pid: 3395, comm: burnP6 Not tainted 2.6.26-rc6 #4
> > [ 1116.297240] RIP: 0010:[<ffffffff8022acea>] [<ffffffff8022acea>] pick_next_task_rt+0x5a/0x90
> > [ 1116.297390] RSP: 0000:ffff81021edf7ea0 EFLAGS: 00010002
> > [ 1116.297467] RAX: 0000000000000064 RBX: ffffffff8049ec00 RCX: ffff81021ef5e800
> > [ 1116.297551] RDX: ffff8102214d7c00 RSI: 0000000000000003 RDI: ffff810001056600
> > [ 1116.298666] RBP: ffff81021edf7ea0 R08: ffff810001050660 R09: 00000000000010a8
> > [ 1116.298750] R10: 0000000000000001 R11: 00000000ffffffff R12: 0000000000000000
> > [ 1116.298833] R13: ffff810001056600 R14: 0000000000000003 R15: 0000000000000000
> > [ 1116.298917] FS: 00007fecf7cc76e0(0000) GS:ffff810223022980(0000) knlGS:0000000000000000
> > [ 1116.299060] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> > [ 1116.299142] CR2: 0000000001a0d958 CR3: 0000000220c79000 CR4: 00000000000006e0
> > [ 1116.299225] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 1116.299309] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 1116.299393] Process burnP6 (pid: 3395, threadinfo ffff81021edf6000, task ffff81022178aca0)
> > [ 1116.299535] Stack: ffff81021edf7f70 ffffffff8048c302 0000000000000000 ffff8102210b1b00
> > [ 1116.299682] ffffffff80689600 ffffffff80689600 ffffffff806858a0 ffffffff80689600
> > [ 1116.299827] ffff81022178af18 0000000000000000 0000000000000292 ffff81022178aca0
> > [ 1116.299914] Call Trace:
> > [ 1116.300046] [<ffffffff8048c302>] thread_return+0x101/0x4af
> > [ 1116.300130] [<ffffffff8020bdee>] retint_careful+0x1c/0x42
> > [ 1116.300210]
> > [ 1116.300273]
> > [ 1116.300335] Code: 48 c1 e0 04 48 8b 14 08 48 85 d2 74 49 48 8b 4a 40 48 85 c9 74 1b 48 8b 01 48 85 c0 75 d4 48 0f bc 41 08 83 c0 40 83 f8 63 7e d0 <0f> 0b eb fe 66

2008-06-17 08:49:39

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Mon, 2008-06-16 at 17:18 +0200, Peter Zijlstra wrote:
> On Mon, 2008-06-16 at 17:11 +0200, Daniel K. wrote:
> > Peter Zijlstra wrote:
> > > On Mon, 2008-06-16 at 15:51 +0200, Peter Zijlstra wrote:
> > >> On Mon, 2008-06-16 at 15:14 +0200, Daniel K. wrote:
> > >>> Peter Zijlstra wrote:
> > >>>
> > >>> Although this patch seems to be correct, this is what shows up on my
> > >>> netconsole, when applying it -- with an offset, do you have other fixes
> > >>> applied as well?
> > >> I had indeed, although nothing touching the rt scheduler. I popped all
> > >> my patches and pulled an update from Linus, but I fail to reproduce the
> > >> below.
> > >>
> > >> /me goes look for that burnp6 thing, I used a simple while (1); loop.
> > >
> > > found it, still seems to work for me. do you have a funny number of
> > > cpus? or anything else noteworthy?
> >
> > I don't think so, this is on a SUN X2200 M2, with two AMD Opteron 2214
> > processors, and 8G RAM.
> >
> > If I follow the procedure up to 'echo 4000 > oops/cpu.rt_runtime_us'
> > then I can
> >
> > # burnP6 &
> > [1] 3395
> > # schedtool -R -p 1 3395
> >
> > but
> >
> > # echo -n 3395 > /dev/cgroup/burn/oops/tasks
>
> Ah, that did it,.. I'll go poke at it. Thanks!

How's this work for you? (includes the previous patchlet too)

---
Subject: sched: rt-group: fix NULL deref

- the rt group hierarchy got corrupted by always pointing the entity's
rq pointer to the root rq.

- the enqueue/dequeue on throttle should also use the full
dequeue_rt_stack like regular task enqueue/dequeue do. Not doing this
can leave empty groups and possibly corrupt the priority queues.

Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index eaf6751..efbb7d9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7626,7 +7626,6 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 	else
 		rt_se->rt_rq = parent->my_q;

-	rt_se->rt_rq = &rq->rt;
 	rt_se->my_q = rt_rq;
 	rt_se->parent = parent;
 	INIT_LIST_HEAD(&rt_se->run_list);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 3432d57..95c408e 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -449,7 +449,7 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 #endif
 }

-static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
+static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	struct rt_prio_array *array = &rt_rq->active;
@@ -464,7 +464,7 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
 	inc_rt_tasks(rt_se, rt_rq);
 }

-static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
+static void __dequeue_rt_entity(struct sched_rt_entity *rt_se)
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	struct rt_prio_array *array = &rt_rq->active;
@@ -480,11 +480,10 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
  * Because the prio of an upper entry depends on the lower
  * entries, we must remove entries top - down.
  */
-static void dequeue_rt_stack(struct task_struct *p)
+static void dequeue_rt_stack(struct sched_rt_entity *rt_se)
 {
-	struct sched_rt_entity *rt_se, *back = NULL;
+	struct sched_rt_entity *back = NULL;

-	rt_se = &p->rt;
 	for_each_sched_rt_entity(rt_se) {
 		rt_se->back = back;
 		back = rt_se;
@@ -492,7 +491,26 @@ static void dequeue_rt_stack(struct task_struct *p)

 	for (rt_se = back; rt_se; rt_se = rt_se->back) {
 		if (on_rt_rq(rt_se))
-			dequeue_rt_entity(rt_se);
+			__dequeue_rt_entity(rt_se);
+	}
+}
+
+static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
+{
+	dequeue_rt_stack(rt_se);
+	for_each_sched_rt_entity(rt_se)
+		__enqueue_rt_entity(rt_se);
+}
+
+static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
+{
+	dequeue_rt_stack(rt_se);
+
+	for_each_sched_rt_entity(rt_se) {
+		struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+		if (rt_rq && rt_rq->rt_nr_running)
+			__enqueue_rt_entity(rt_se);
 	}
 }

@@ -506,32 +524,15 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 	if (wakeup)
 		rt_se->timeout = 0;

-	dequeue_rt_stack(p);
-
-	/*
-	 * enqueue everybody, bottom - up.
-	 */
-	for_each_sched_rt_entity(rt_se)
-		enqueue_rt_entity(rt_se);
+	enqueue_rt_entity(rt_se);
 }

 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
-	struct rt_rq *rt_rq;

 	update_curr_rt(rq);
-
-	dequeue_rt_stack(p);
-
-	/*
-	 * re-enqueue all non-empty rt_rq entities.
-	 */
-	for_each_sched_rt_entity(rt_se) {
-		rt_rq = group_rt_rq(rt_se);
-		if (rt_rq && rt_rq->rt_nr_running)
-			enqueue_rt_entity(rt_se);
-	}
+	dequeue_rt_entity(rt_se);
 }

 /*
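
The ordering the patch relies on (remove entries top-down, as the comment above `dequeue_rt_stack` says, then enqueue bottom-up) can be illustrated in isolation. The sketch below is a user-space model with made-up types and names. It only records the order in which entities are visited and does not touch any real run queue.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_DEPTH 8	/* assumed upper bound on hierarchy depth */

struct entity_sketch {
	const char *name;
	struct entity_sketch *parent;	/* towards the root group */
	struct entity_sketch *back;	/* scratch link, towards the task */
	int queued;
};

/* Records the names of entities in the order they are touched. */
struct trace {
	const char *step[2 * MAX_DEPTH];
	int n;
};

/* Walk up to the top, threading 'back' links, then dequeue top-down. */
static void dequeue_stack(struct entity_sketch *se, struct trace *t)
{
	struct entity_sketch *back = NULL;

	for (; se; se = se->parent) {
		se->back = back;
		back = se;
	}

	for (se = back; se; se = se->back) {
		if (se->queued) {
			assert(t->n < 2 * MAX_DEPTH);
			se->queued = 0;
			t->step[t->n++] = se->name;
		}
	}
}

/* Enqueue: take the whole stack off first, then re-add it bottom-up. */
static void enqueue_entity(struct entity_sketch *se, struct trace *t)
{
	dequeue_stack(se, t);

	for (; se; se = se->parent) {
		assert(t->n < 2 * MAX_DEPTH);
		se->queued = 1;
		t->step[t->n++] = se->name;
	}
}
```

For a task in `burn/oops` with everything already queued, the recorded order is burn, oops, task for the dequeue pass and task, oops, burn for the enqueue pass.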

2008-06-17 12:25:43

by Daniel K.

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

Peter Zijlstra wrote:
> How's this [patch] work for you? (includes the previous patchlet too)

Thanks,

this patch fixed the obvious problem, namely

# echo $$ > /dev/cgroup/burn/oops/tasks
# schedtool -R -p 1 -e burnP6 &

now works again. However, the last step below

# echo $$ > /dev/cgroup/tasks
# burnP6 &
[1] 3414
# echo 3414 > /dev/cgroup/burn/oops/tasks
# schedtool -R -p 1 3414

gives this new and shiny Oops instead.

> [ 189.274997] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
> [ 189.274997] IP: [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> [ 189.274997] PGD 21e451067 PUD 220d19067 PMD 0
> [ 189.274997] Oops: 0002 [1] SMP
> [ 189.274997] CPU 1
> [ 189.274997] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables af_packet loop evdev usbhid hid i2c_nforce2 button pcspkr i2c_core k8temp shpchp pci_hotplug tg3 sd_mod forcedeth sg ohci_hcd ehci_hcd usbcore thermal processor fan
> [ 189.274997] Pid: 3415, comm: schedtool Not tainted 2.6.26-rc6 #3
> [ 189.274997] RIP: 0010:[<ffffffff8022e165>] [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> [ 189.274997] RSP: 0018:ffff81021e477e48 EFLAGS: 00010012
> [ 189.274997] RAX: ffff810001056d08 RBX: ffff810220dbc060 RCX: ffff810220dbbfc0
> [ 189.274997] RDX: 0000000000000064 RSI: ffff81022120ec60 RDI: ffff810220dbc060
> [ 189.274997] RBP: ffff81021e477e58 R08: 44b0000000000000 R09: 0000000000000001
> [ 189.274997] R10: 0000000000000000 R11: 0000000000000206 R12: ffff8100010566b8
> [ 189.274997] R13: 0000000000000001 R14: 0000000000000001 R15: ffff81021e477f38
> [ 189.274997] FS: 00007f71ec2146e0(0000) GS:ffff810223022580(0000) knlGS:0000000000000000
> [ 189.274997] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 189.274997] CR2: 0000000000000064 CR3: 000000021e439000 CR4: 00000000000006e0
> [ 189.274997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 189.274997] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 189.274997] Process schedtool (pid: 3415, threadinfo ffff81021e476000, task ffff810221daf920)
> [ 189.274997] Stack: ffff810220dbc060 ffff81022120ec60 ffff81021e477e78 ffffffff8022e34e
> [ 189.274997] 0000000000000001 ffff81022120ec60 ffff81021e477e98 ffffffff802299e3
> [ 189.274997] ffff81021e477f38 ffff8100010565c0 ffff81021e477eb8 ffffffff80229a60
> [ 189.274997] Call Trace:
> [ 189.274997] [<ffffffff8022e34e>] enqueue_rt_entity+0x1e/0x30
> [ 189.274997] [<ffffffff802299e3>] enqueue_task+0x13/0x30
> [ 189.274998] [<ffffffff80229a60>] activate_task+0x30/0x50
> [ 189.274998] [<ffffffff80232ebf>] sched_setscheduler+0x28f/0x3b0
> [ 189.274998] [<ffffffff8028b368>] ? do_munmap+0x278/0x2d0
> [ 189.274998] [<ffffffff8023305d>] do_sched_setscheduler+0x7d/0x90
> [ 189.274998] [<ffffffff802330a4>] sys_sched_setscheduler+0x14/0x20
> [ 189.274998] [<ffffffff8020b77a>] system_call_after_swapgs+0x8a/0x8f
> [ 189.274998]
> [ 189.274998]
> [ 189.274998] Code: 85 c9 0f 84 76 01 00 00 8b 81 58 06 00 00 48 98 48 8d 8b 60 ff ff ff 48 c1 e0 04 4a 8d 44 20 10 48 8b 50 08 48 89 03 48 89 58 08 <48> 89 1a 48 89 53 08 48 8b 53 40 48 8d 82 58 06 00 00 48 85 d2
> [ 189.274998] RIP [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> [ 189.274998] RSP <ffff81021e477e48>
> [ 189.274998] CR2: 0000000000000064

If I switch the order of calling schedtool and assigning the task to a
cgroup, then it works OK.

# echo $$ > /dev/cgroup/tasks
# burnP6 &
[1] 3414
# schedtool -R -p 1 3414
# echo 3414 > /dev/cgroup/burn/oops/tasks

Note the distinct lack of an Oops here.

But all is not well. If I now use schedtool (obviously redundantly)

# schedtool -R -p 1 3414

I get a new Oops, which looks essentially the same as the one above.


Daniel K.

2008-06-17 12:47:22

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Tue, 2008-06-17 at 14:25 +0200, Daniel K. wrote:
> Peter Zijlstra wrote:
> > How's this [patch] work for you? (includes the previous patchlet too)
>
> Thanks,
>
> this patch fixed the obvious problem, namely
>
> # echo $$ > /dev/cgroup/burn/oops/tasks
> # schedtool -R -p 1 -e burnP6 &
>
> now works again. However, the last step below
>
> # echo $$ > /dev/cgroup/tasks
> # burnP6 &
> [1] 3414
> # echo 3414 > /dev/cgroup/burn/oops/tasks
> # schedtool -R -p 1 3414
>
> gives this new and shiny Oops instead.
>
> > [ 189.274997] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
> > [ 189.274997] IP: [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> > [ 189.274997] PGD 21e451067 PUD 220d19067 PMD 0
> > [ 189.274997] Oops: 0002 [1] SMP
> > [ 189.274997] CPU 1
> > [ 189.274997] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables af_packet loop evdev usbhid hid i2c_nforce2 button pcspkr i2c_core k8temp shpchp pci_hotplug tg3 sd_mod forcedeth sg ohci_hcd ehci_hcd usbcore thermal processor fan
> > [ 189.274997] Pid: 3415, comm: schedtool Not tainted 2.6.26-rc6 #3
> > [ 189.274997] RIP: 0010:[<ffffffff8022e165>] [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> > [ 189.274997] RSP: 0018:ffff81021e477e48 EFLAGS: 00010012
> > [ 189.274997] RAX: ffff810001056d08 RBX: ffff810220dbc060 RCX: ffff810220dbbfc0
> > [ 189.274997] RDX: 0000000000000064 RSI: ffff81022120ec60 RDI: ffff810220dbc060
> > [ 189.274997] RBP: ffff81021e477e58 R08: 44b0000000000000 R09: 0000000000000001
> > [ 189.274997] R10: 0000000000000000 R11: 0000000000000206 R12: ffff8100010566b8
> > [ 189.274997] R13: 0000000000000001 R14: 0000000000000001 R15: ffff81021e477f38
> > [ 189.274997] FS: 00007f71ec2146e0(0000) GS:ffff810223022580(0000) knlGS:0000000000000000
> > [ 189.274997] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> > [ 189.274997] CR2: 0000000000000064 CR3: 000000021e439000 CR4: 00000000000006e0
> > [ 189.274997] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [ 189.274997] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> > [ 189.274997] Process schedtool (pid: 3415, threadinfo ffff81021e476000, task ffff810221daf920)
> > [ 189.274997] Stack: ffff810220dbc060 ffff81022120ec60 ffff81021e477e78 ffffffff8022e34e
> > [ 189.274997] 0000000000000001 ffff81022120ec60 ffff81021e477e98 ffffffff802299e3
> > [ 189.274997] ffff81021e477f38 ffff8100010565c0 ffff81021e477eb8 ffffffff80229a60
> > [ 189.274997] Call Trace:
> > [ 189.274997] [<ffffffff8022e34e>] enqueue_rt_entity+0x1e/0x30
> > [ 189.274997] [<ffffffff802299e3>] enqueue_task+0x13/0x30
> > [ 189.274998] [<ffffffff80229a60>] activate_task+0x30/0x50
> > [ 189.274998] [<ffffffff80232ebf>] sched_setscheduler+0x28f/0x3b0
> > [ 189.274998] [<ffffffff8028b368>] ? do_munmap+0x278/0x2d0
> > [ 189.274998] [<ffffffff8023305d>] do_sched_setscheduler+0x7d/0x90
> > [ 189.274998] [<ffffffff802330a4>] sys_sched_setscheduler+0x14/0x20
> > [ 189.274998] [<ffffffff8020b77a>] system_call_after_swapgs+0x8a/0x8f
> > [ 189.274998]
> > [ 189.274998]
> > [ 189.274998] Code: 85 c9 0f 84 76 01 00 00 8b 81 58 06 00 00 48 98 48 8d 8b 60 ff ff ff 48 c1 e0 04 4a 8d 44 20 10 48 8b 50 08 48 89 03 48 89 58 08 <48> 89 1a 48 89 53 08 48 8b 53 40 48 8d 82 58 06 00 00 48 85 d2
> > [ 189.274998] RIP [<ffffffff8022e165>] __enqueue_rt_entity+0x55/0x1d0
> > [ 189.274998] RSP <ffff81021e477e48>
> > [ 189.274998] CR2: 0000000000000064
>
> If I switch the order of calling schedtool and assigning the task to a
> cgroup, then it works OK.
>
> # echo $$ > /dev/cgroup/tasks
> # burnP6 &
> [1] 3414
> # schedtool -R -p 1 3414
> # echo 3414 > /dev/cgroup/burn/oops/tasks
>
> Note the distinct lack of an Oops here.
>
> But all is not well. If I now use schedtool (obviously redundantly)
>
> # schedtool -R -p 1 3414
>
> I get a new Oops, which looks essentially the same as the one above.

Fun, I could not reproduce with chrt, but I can with schedtool.

Thanks for testing,. and I guess I'm back to poking at it.. :-)

2008-06-17 20:01:51

by Peter Zijlstra

Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Tue, 2008-06-17 at 14:25 +0200, Daniel K. wrote:
> Peter Zijlstra wrote:
> > How's this [patch] work for you? (includes the previous patchlet too)
>
> Thanks,
>
> this patch fixed the obvious problem, namely
>
> # echo $$ > /dev/cgroup/burn/oops/tasks
> # schedtool -R -p 1 -e burnP6 &
>
> now works again. However, the last step below
>
> # echo $$ > /dev/cgroup/tasks
> # burnP6 &
> [1] 3414
> # echo 3414 > /dev/cgroup/burn/oops/tasks
> # schedtool -R -p 1 3414
>
> gives this new and shiny Oops instead.

Whilst I'm grateful for your testing, I truly hope you're done breaking
my stuff ;-)

How's this for you?

Signed-off-by: Peter Zijlstra <[email protected]>
---
diff --git a/kernel/sched.c b/kernel/sched.c
index eaf6751..efbb7d9 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7626,7 +7626,6 @@ static void init_tg_rt_entry(struct task_group *tg, struct rt_rq *rt_rq,
 	else
 		rt_se->rt_rq = parent->my_q;

-	rt_se->rt_rq = &rq->rt;
 	rt_se->my_q = rt_rq;
 	rt_se->parent = parent;
 	INIT_LIST_HEAD(&rt_se->run_list);
diff --git a/kernel/sched_rt.c b/kernel/sched_rt.c
index 3432d57..2e73cac 100644
--- a/kernel/sched_rt.c
+++ b/kernel/sched_rt.c
@@ -449,13 +449,13 @@ void dec_rt_tasks(struct sched_rt_entity *rt_se, struct rt_rq *rt_rq)
 #endif
 }

-static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
+static void __enqueue_rt_entity(struct sched_rt_entity *rt_se)
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	struct rt_prio_array *array = &rt_rq->active;
 	struct rt_rq *group_rq = group_rt_rq(rt_se);

-	if (group_rq && rt_rq_throttled(group_rq))
+	if (group_rq && (rt_rq_throttled(group_rq) || !group_rq->rt_nr_running))
 		return;

 	list_add_tail(&rt_se->run_list, array->queue + rt_se_prio(rt_se));
@@ -464,7 +464,7 @@ static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
 	inc_rt_tasks(rt_se, rt_rq);
 }

-static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
+static void __dequeue_rt_entity(struct sched_rt_entity *rt_se)
 {
 	struct rt_rq *rt_rq = rt_rq_of_se(rt_se);
 	struct rt_prio_array *array = &rt_rq->active;
@@ -480,11 +480,10 @@ static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
  * Because the prio of an upper entry depends on the lower
  * entries, we must remove entries top - down.
  */
-static void dequeue_rt_stack(struct task_struct *p)
+static void dequeue_rt_stack(struct sched_rt_entity *rt_se)
 {
-	struct sched_rt_entity *rt_se, *back = NULL;
+	struct sched_rt_entity *back = NULL;

-	rt_se = &p->rt;
 	for_each_sched_rt_entity(rt_se) {
 		rt_se->back = back;
 		back = rt_se;
@@ -492,7 +491,26 @@ static void dequeue_rt_stack(struct task_struct *p)

 	for (rt_se = back; rt_se; rt_se = rt_se->back) {
 		if (on_rt_rq(rt_se))
-			dequeue_rt_entity(rt_se);
+			__dequeue_rt_entity(rt_se);
+	}
+}
+
+static void enqueue_rt_entity(struct sched_rt_entity *rt_se)
+{
+	dequeue_rt_stack(rt_se);
+	for_each_sched_rt_entity(rt_se)
+		__enqueue_rt_entity(rt_se);
+}
+
+static void dequeue_rt_entity(struct sched_rt_entity *rt_se)
+{
+	dequeue_rt_stack(rt_se);
+
+	for_each_sched_rt_entity(rt_se) {
+		struct rt_rq *rt_rq = group_rt_rq(rt_se);
+
+		if (rt_rq && rt_rq->rt_nr_running)
+			__enqueue_rt_entity(rt_se);
 	}
 }

@@ -506,32 +524,15 @@ static void enqueue_task_rt(struct rq *rq, struct task_struct *p, int wakeup)
 	if (wakeup)
 		rt_se->timeout = 0;

-	dequeue_rt_stack(p);
-
-	/*
-	 * enqueue everybody, bottom - up.
-	 */
-	for_each_sched_rt_entity(rt_se)
-		enqueue_rt_entity(rt_se);
+	enqueue_rt_entity(rt_se);
 }

 static void dequeue_task_rt(struct rq *rq, struct task_struct *p, int sleep)
 {
 	struct sched_rt_entity *rt_se = &p->rt;
-	struct rt_rq *rt_rq;

 	update_curr_rt(rq);
-
-	dequeue_rt_stack(p);
-
-	/*
-	 * re-enqueue all non-empty rt_rq entities.
-	 */
-	for_each_sched_rt_entity(rt_se) {
-		rt_rq = group_rt_rq(rt_se);
-		if (rt_rq && rt_rq->rt_nr_running)
-			enqueue_rt_entity(rt_se);
-	}
+	dequeue_rt_entity(rt_se);
 }

 /*
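
For readers following the patch, here is a minimal userspace sketch of the
two-pass walk that dequeue_rt_stack() performs: climb from the task entity to
the top of the hierarchy while recording back links, then descend and take
each queued level off. The struct entity type, its on_rq and level fields and
the order array are hypothetical stand-ins for illustration only; they are not
the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for struct sched_rt_entity; only the links matter. */
struct entity {
	struct entity *parent;	/* next level up: task -> group -> parent group */
	struct entity *back;	/* scratch link used to walk back down */
	int on_rq;		/* stand-in for on_rt_rq() */
	int level;		/* 0 = task, larger = closer to the top */
};

/*
 * Take every queued level off, top down, and record the visiting order.
 * Returns the number of levels that were dequeued.
 */
static size_t dequeue_stack(struct entity *se, int *order, size_t cap)
{
	struct entity *back = NULL;
	size_t n = 0;

	/* Pass 1: climb from the task to the top, remembering the way back. */
	for (; se; se = se->parent) {
		se->back = back;
		back = se;
	}

	/* Pass 2: descend from the top, dequeueing each queued level. */
	for (se = back; se; se = se->back) {
		if (!se->on_rq)
			continue;
		assert(n < cap);	/* caller must provide enough room */
		se->on_rq = 0;
		order[n++] = se->level;
	}
	return n;
}
```

With a three-level chain (task, group, parent group) that is fully queued, the
recorded order is top, middle, task, which matches the "top - down" comment in
the patch.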

2008-06-17 21:03:33

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

2008/6/17 Peter Zijlstra <[email protected]>:
> On Tue, 2008-06-17 at 14:25 +0200, Daniel K. wrote:
>> Peter Zijlstra wrote:
>> > How's this [patch] work for you? (includes the previous patchlet too)
>>
>> Thanks,
>>
>> this patch fixed the obvious problem, namely
>>
>> # echo $$ > /dev/cgroup/burn/oops/tasks
>> # schedtool -R -p 1 -e burnP6 &
>>
>> now works again. However, the last step below
>>
>> # echo $$ > /dev/cgroup/tasks
>> # burnP6 &
>> [1] 3414
>> # echo 3414 > /dev/cgroup/burn/oops/tasks
>> # schedtool -R -p 1 3414
>>
>> gives this new and shiny Oops instead.
>
> Whilst I'm grateful for your testing, I truly hope you're done breaking
> my stuff ;-)
>
> How's this for you?

FYI, I could reproduce a crash and this patch fixes it indeed.


--
Best regards,
Dmitry Adamushko

2008-06-17 21:50:37

by Daniel K.

[permalink] [raw]
Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

Peter Zijlstra wrote:
> On Tue, 2008-06-17 at 14:25 +0200, Daniel K. wrote:
>> Peter Zijlstra wrote:
>>> How's this [patch] work for you? (includes the previous patchlet too)
>> Thanks,
>>
>> this patch fixed the obvious problem, namely
>>
>> # echo $$ > /dev/cgroup/burn/oops/tasks
>> # schedtool -R -p 1 -e burnP6 &
>>
>> now works again. However, the last step below
>>
>> # echo $$ > /dev/cgroup/tasks
>> # burnP6 &
>> [1] 3414
>> # echo 3414 > /dev/cgroup/burn/oops/tasks
>> # schedtool -R -p 1 3414
>>
>> gives this new and shiny Oops instead.
>
> Whilst I'm grateful for your testing, I truly hope you're done breaking
> my stuff ;-)
>
> How's this for you?

root@lc01:/dev/cgroup/burn# burnP6 &
[1] 3393
root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393
root@lc01:/dev/cgroup/burn# echo 3393 > oops/tasks
root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393
root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393

Multiple redundant schedtool invocations now work without incident.

I had almost given up trying to break it, but then this happened.

root@lc01:/dev/cgroup/burn# echo $$ > /dev/cgroup/burn/oops/tasks
root@lc01:/dev/cgroup/burn# schedtool -R -p 1 -e burnP6 &
[2] 3397

The following Oops happened immediately, but note that it was the first
burnP6 process (PID 3393) that is reported as the offender.

I tried the above procedure a second time, and now it ran for about one
second before the same Oops manifested itself, but this time with the
other burnP6 process as the culprit (the equivalent of PID 3397)

Yes, I realize I'm starting to sound like a broken record.


Daniel K.

> [ 444.197275] BUG: unable to handle kernel NULL pointer dereference at 0000000000000064
> [ 444.197543] IP: [<ffffffff80229823>] requeue_task_rt+0x53/0x70
> [ 444.197702] PGD 21f133067 PUD 21f5fb067 PMD 0
> [ 444.197923] Oops: 0002 [1] SMP
> [ 444.198102] CPU 3
> [ 444.198240] Modules linked in: netconsole configfs ipmi_msghandler kvm_amd kvm ipv6 iptable_filter ip_tables x_tables loop af_packet usbhid hid evdev i2c_nforce2 button i2c_core shpchp pci_hotplug k8temp pcspkr tg3 sd_mod sg forcedeth ehci_hcd ohci_hcd usbcore thermal processor fan
> [ 444.199793] Pid: 3393, comm: burnP6 Not tainted 2.6.26-rc6 #4
> [ 444.199906] RIP: 0010:[<ffffffff80229823>] [<ffffffff80229823>] requeue_task_rt+0x53/0x70
> [ 444.200123] RSP: 0000:ffff8102230fbe78 EFLAGS: 00010012
> [ 444.200234] RAX: ffff810001056d08 RBX: ffff810221685fa0 RCX: ffff81021f5e9d80
> [ 444.200352] RDX: 0000000000000064 RSI: ffff8100010566b8 RDI: ffff81021f5e9d80
> [ 444.200470] RBP: ffff8102230fbe78 R08: 0000000000000000 R09: ffff810223026c48
> [ 444.200588] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8100010565c0
> [ 444.200706] R13: 0000000000000003 R14: 7fffffffffffffff R15: 0000000000000001
> [ 444.200824] FS: 00007f2d7fea46e0(0000) GS:ffff810223022980(0000) knlGS:0000000000000000
> [ 444.201001] CS: 0010 DS: 002b ES: 002b CR0: 0000000080050033
> [ 444.201114] CR2: 0000000000000064 CR3: 000000021f1fa000 CR4: 00000000000006e0
> [ 444.201232] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [ 444.201350] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> [ 444.201468] Process burnP6 (pid: 3393, threadinfo ffff810220cfa000, task ffff810221685fa0)
> [ 444.201644] Stack: ffff8102230fbe98 ffffffff8022f357 ffff810221685fa0 ffff8100010565c0
> [ 444.202009] ffff8102230fbec8 ffffffff802352a4 0000000000000001 0000000000000003
> [ 444.202331] ffff810221685fa0 ffff810001052680 0000000000000001 ffffffff8024291d
> [ 444.202561] Call Trace:
> [ 444.202759] <IRQ> [<ffffffff8022f357>] task_tick_rt+0xc7/0xe0
> [ 444.202917] [<ffffffff802352a4>] scheduler_tick+0xb4/0x1c0
> [ 444.203032] [<ffffffff8024291d>] update_process_times+0x4d/0x70
> [ 444.203151] [<ffffffff802573b9>] ? tick_sched_timer+0x69/0xd0
> [ 444.203266] [<ffffffff80250120>] ? __run_hrtimer+0x90/0xb0
> [ 444.203380] [<ffffffff80250e08>] ? hrtimer_interrupt+0x108/0x180
> [ 444.203499] [<ffffffff8021d039>] ? smp_apic_timer_interrupt+0x79/0xc0
> [ 444.204829] [<ffffffff8020c5a2>] ? apic_timer_interrupt+0x72/0x80
> [ 444.204943] <EOI>
> [ 444.205082]
> [ 444.205179] Code: d2 48 8b 57 08 48 0f 44 c1 48 8b 0f 8b 00 48 89 51 08 48 89 0a 48 98 48 c1 e0 04 48 8d 44 30 10 48 8b 50 08 48 89 07 48 89 78 08 <48> 89 3a 48 89 57 08 48 8b 7f 30 48 85 ff 75 ad c9 c3 66 66 2e
> [ 444.207862] RIP [<ffffffff80229823>] requeue_task_rt+0x53/0x70
> [ 444.208017] RSP <ffff8102230fbe78>
> [ 444.208122] CR2: 0000000000000064

2008-06-18 11:51:33

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Tue, 2008-06-17 at 21:48 +0000, Daniel K. wrote:
> Peter Zijlstra wrote:
> > On Tue, 2008-06-17 at 14:25 +0200, Daniel K. wrote:
> >> Peter Zijlstra wrote:
> >>> How's this [patch] work for you? (includes the previous patchlet too)
> >> Thanks,
> >>
> >> this patch fixed the obvious problem, namely
> >>
> >> # echo $$ > /dev/cgroup/burn/oops/tasks
> >> # schedtool -R -p 1 -e burnP6 &
> >>
> >> now works again. However, the last step below
> >>
> >> # echo $$ > /dev/cgroup/tasks
> >> # burnP6 &
> >> [1] 3414
> >> # echo 3414 > /dev/cgroup/burn/oops/tasks
> >> # schedtool -R -p 1 3414
> >>
> >> gives this new and shiny Oops instead.
> >
> > Whilst I'm grateful for your testing, I truly hope you're done breaking
> > my stuff ;-)
> >
> > How's this for you?
>
> root@lc01:/dev/cgroup/burn# burnP6 &
> [1] 3393
> root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393
> root@lc01:/dev/cgroup/burn# echo 3393 > oops/tasks
> root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393
> root@lc01:/dev/cgroup/burn# schedtool -R -p 1 3393
>
> Multiple redundant schedtool invocations now work without incident.
>
> I had almost given up trying to break it, but then this happened.
>
> root@lc01:/dev/cgroup/burn# echo $$ > /dev/cgroup/burn/oops/tasks
> root@lc01:/dev/cgroup/burn# schedtool -R -p 1 -e burnP6 &
> [2] 3397
>
> The following Oops happened immediately, but note that it was the first
> burnP6 process (PID 3393) that is reported as the offender.
>
> I tried the above procedure a second time, and now it ran for about one
> second before the same Oops manifested itself, but this time with the
> other burnP6 process as the culprit (the equivalent of PID 3397)

Ah, fun: a race between dequeueing because of runtime quota and
requeueing because of RR slice length.

> Yes, I realize I'm starting to sound like a broken record.

Ah, don't worry - I was just hoping there was an end to the number of
glaring bugs in my code :-/

Reproducing was a bit harder than for you, it took me a whole minute of
runtime and setting the runtime limit above the RR slice length (and
realizing you're running RR, not FIFO).

With the patch below (on top of the other one), this case has not crashed
for at least 15 minutes.

---
Subject: sched: rt-group: fix RR buglet
From: Peter Zijlstra <[email protected]>

In task_tick_rt() we first call update_curr_rt(), which can dequeue a runqueue
that has run out of runtime, and then we try to requeue it if it has also
exhausted its RR quota. Obviously, requeueing something that is no longer
on the runqueue will not have the expected result.

Signed-off-by: Peter Zijlstra <[email protected]>
---
kernel/sched_rt.c | 4 +++-
1 file changed, 3 insertions(+), 1 deletion(-)

Index: linux-2.6/kernel/sched_rt.c
===================================================================
--- linux-2.6.orig/kernel/sched_rt.c
+++ linux-2.6/kernel/sched_rt.c
@@ -549,8 +549,10 @@ static
 void requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
 {
 	struct rt_prio_array *array = &rt_rq->active;
+	struct list_head *queue = array->queue + rt_se_prio(rt_se);

-	list_move_tail(&rt_se->run_list, array->queue + rt_se_prio(rt_se));
+	if (on_rt_rq(rt_se))
+		list_move_tail(&rt_se->run_list, queue);
 }

 static void requeue_task_rt(struct rq *rq, struct task_struct *p)
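
The effect described in the changelog can be shown in miniature in userspace.
The sketch below is an illustration under stated assumptions, not kernel code:
it reimplements a tiny circular list modelled on the kernel's list_head, and
it assumes that "on the runqueue" can be tested by checking whether the entry
is linked into a list. The names on_queue(), requeue_unguarded() and
requeue_guarded() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Minimal circular doubly linked list, modelled on the kernel's list_head. */
struct list_head {
	struct list_head *next, *prev;
};

static void list_init(struct list_head *h) { h->next = h; h->prev = h; }
static bool list_empty(const struct list_head *h) { return h->next == h; }

/* Unlink an entry and leave it pointing at itself (i.e. "not queued"). */
static void list_del_init(struct list_head *e)
{
	e->prev->next = e->next;
	e->next->prev = e->prev;
	list_init(e);
}

static void list_add_tail(struct list_head *e, struct list_head *head)
{
	assert(list_empty(e));	/* entry must not be linked anywhere else */
	e->prev = head->prev;
	e->next = head;
	head->prev->next = e;
	head->prev = e;
}

static void list_move_tail(struct list_head *e, struct list_head *head)
{
	list_del_init(e);
	list_add_tail(e, head);
}

/* Assumption: an entry counts as queued when it is linked into a list. */
static bool on_queue(const struct list_head *e) { return !list_empty(e); }

/* Before the fix: re-adds an entry even if it was already dequeued. */
static void requeue_unguarded(struct list_head *e, struct list_head *queue)
{
	list_move_tail(e, queue);
}

/* After the fix: only rotate entries that are still queued. */
static void requeue_guarded(struct list_head *e, struct list_head *queue)
{
	if (on_queue(e))
		list_move_tail(e, queue);
}
```

If an entry is dequeued and then passed to requeue_unguarded(), it ends up
back on the queue even though nothing enqueued it; requeue_guarded() leaves
the queue empty. The sketch models list membership only and says nothing
about the rest of the runqueue state.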



2008-06-18 13:35:20

by Daniel K.

[permalink] [raw]
Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

Peter Zijlstra wrote:
> On Tue, 2008-06-17 at 21:48 +0000, Daniel K. wrote:
>> I had almost given up trying to break it, but then this happened.
>>
>> [...]
>
> Ah, fun: a race between dequeueing because of runtime quota and
> requeueing because of RR slice length.
>
>> Yes, I realize I'm starting to sound like a broken record.
>
> Ah, don't worry - I was just hoping there was an end to the number of
> glaring bugs in my code :-/

:)

> Reproducing was a bit harder than for you, it took me a whole minute of
> runtime and setting the runtime limit above the RR slice length (and
> realizing you're running RR, not FIFO).
>
> With the patch below (on top of the other one), this case has not crashed
> for at least 15 minutes.

I am happy to say that this nailed it squarely on the head. I no longer
see any of the Oopses I could quite easily trigger before. I added my
Tested-by; please add it to the patch you sent yesterday as well.

I still have a few gripes with RR scheduling, but that is a topic for
another mail.

> ---
> Subject: sched: rt-group: fix RR buglet
> From: Peter Zijlstra <[email protected]>
>
> In task_tick_rt() we first call update_curr_rt(), which can dequeue a runqueue
> that has run out of runtime, and then we try to requeue it if it has also
> exhausted its RR quota. Obviously, requeueing something that is no longer
> on the runqueue will not have the expected result.
>
> Signed-off-by: Peter Zijlstra <[email protected]>
Tested-by: Daniel K. <[email protected]>

> ---
> kernel/sched_rt.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> Index: linux-2.6/kernel/sched_rt.c
> ===================================================================
> --- linux-2.6.orig/kernel/sched_rt.c
> +++ linux-2.6/kernel/sched_rt.c
> @@ -549,8 +549,10 @@ static
> void requeue_rt_entity(struct rt_rq *rt_rq, struct sched_rt_entity *rt_se)
> {
> struct rt_prio_array *array = &rt_rq->active;
> + struct list_head *queue = array->queue + rt_se_prio(rt_se);
>
> - list_move_tail(&rt_se->run_list, array->queue + rt_se_prio(rt_se));
> + if (on_rt_rq(rt_se))
> + list_move_tail(&rt_se->run_list, queue);
> }
>
> static void requeue_task_rt(struct rq *rq, struct task_struct *p)

Daniel K.

2008-06-18 14:13:18

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [BUG: NULL pointer dereference] cgroups and RT scheduling interact badly.

On Wed, 2008-06-18 at 15:35 +0200, Daniel K. wrote:

> I am happy to say that this nailed it squarely on the head. I no longer
> see any of the Oopses I could quite easily trigger before. I added my
> Tested-by; please add it to the patch you sent yesterday as well.

Done, and pushed to Ingo - should hopefully show up in a git tree near
you soonish.

> I still have a few gripes with RR scheduling, but that is a topic for
> another mail.

Please CC me when you get around to writing it.