2007-11-09 08:30:45

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y



With CONFIG_FAIR_CGROUP_SCHED=y, following commands on 2.6.24-rc1 crash
the system.

$ mount -t cgroup none /cgroups
$ ./ns_exec -cm /bin/ls

"ns_exec -cm" calls clone() to clone the mount namespace and then
executes the '/bin/ls' program in the cloned child.

Some observations that Serge and I made (we have been able to reproduce
reliably, but crash logs have not been very useful)

a. If we skip the 'mount' command, there is no crash.

b. If CONFIG_FAIR_CGROUP_SCHED=n again, there is no crash (even with
'mount' command).

c. mounting just the cpu or just the ns subsystem does not
lead to a crash. You can even

mount -t cgroup -o cpu none /mnt1
mount -t cgroup -o ns none /mnt2

and you still don't get a crash

d. mount -t cgroup -o cpu,ns none /cgroup

will always cause a crash with subsequent ns_exec

ns_exec.c and config file are attached. Let us know if you need more info.

Suka

---
Crash log:

Red Hat Enterprise Linux release 4.90 (Tikanga)
Kernel 2.6.24-rc1 on an x86_64

elm3a241 login: Unable to handle kernel NULL pointer dereferenceUnable to handle kernel NULL pointer dereference at 0000000000000000 RIP:
at 0000000000000000 RIP:
[<0000000000000000>]
[<0000000000000000>]
PGD 102d4d067 PGD 102d4d067 PUD 102c88067 PUD 102c88067 PMD 0 PMD 0

Oops: 0000 [1] Oops: 0000 [1] SMP SMP

CPU 2 CPU 2

Modules linked in:Modules linked in:

Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
Pid: 3273, comm: ns_exec Not tainted 2.6.24-rc1 #9
RIP: 0010:[<0000000000000000>] RIP: 0010:[<0000000000000000>] [<0000000000000000>]
[<0000000000000000>]
RSP: 0018:ffff8101006a6af0 EFLAGS: 00055292
RSP: 0018:ffff8101006a6af0 EFLAGS: 00055292
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RAX: 0000000000000000 RBX: ffff810100d11140 RCX: ffff810101de4000
RDX: 0000000000000000 RSI: ffff810100d1a880 RDI: ffff810001037c00
RDX: 0000000000000000 RSI: ffff810100d1a880 RDI: ffff810001037c00
RBP: ffff810102c136c0 R08: ffff810101de4000 R09: ffff810101d31bb8
RBP: ffff810102c136c0 R08: ffff810101de4000 R09: ffff810101d31bb8
R10: 0000000000000000 R11: 00000000ffffffff R12: ffff8101034075b8
R10: 0000000000000000 R11: 00000000ffffffff R12: ffff8101034075b8
R13: ffff8101029d6028 R14: ffff810103407500 R15: ffffffff80869d00
R13: ffff8101029d6028 R14: ffff810103407500 R15: ffffffff80869d00
FS: 00002b80c2a396f0(0000) GS:ffff81010068f3c0(0000) knlGS:0000000000000000
FS: 00002b80c2a396f0(0000) GS:ffff81010068f3c0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000000 CR3: 00000001028be000 CR4: 00000000000006e0
CR2: 0000000000000000 CR3: 00000001028be000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process ns_exec (pid: 3273, threadinfo ffff810101de4000, task ffff810100d11140)
Process ns_exec (pid: 3273, threadinfo ffff810101de4000, task ffff810100d11140)
Stack: Stack: 0000000000000000 0000000000000000 0000000000000001 0000000000000001 6c2f343662696c2f 6c2f343662696c2f 2d78756e696c2d64 2d78756e696c2d64

732e34362d363878 732e34362d363878 ffffffff00322e6f ffffffff00322e6f 762f003561736376 762f003561736376 0000357363762f63 0000357363762f63

0000000000000000 0000000000000000 0000000000000000 0000000000000000 762f73007665642f 762f73007665642f 0000317363762f63 0000317363762f63

Call Trace:
Call Trace:




Code: Code: Bad RIP value. Bad RIP value.

RIP RIP [<0000000000000000>]
[<0000000000000000>]
RSP <ffff8101006a6af0>
RSP <ffff8101006a6af0>
CR2: 0000000000000000
CR2: 0000000000000000


Attachments:
(No filename) (3.76 kB)
cgroup-sched-config (33.04 kB)
config-file
ns_exec.c (3.33 kB)
ns_exec.c
Download all attachments

2007-11-09 06:50:18

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

On Thu, Nov 08, 2007 at 03:48:05PM -0800, [email protected] wrote:
> With CONFIG_FAIR_CGROUP_SCHED=y, following commands on 2.6.24-rc1 crash
> the system.

Thanks for reporting the problem. It was caused because of the fact that
current task isn't kept in its runqueue in case of sched_fair class
tasks.

With the patch below, I could run ns_exec w/o any crash. Can you pls
verify it works for you as well?

Ingo,
Once Suka verifies that the patch fixes his crash, I would request you
to include the same in your tree and route it to Linus.

--

current task is not present in its runqueue in case of sched_fair class
tasks. Take care of this fact in rt_mutex_setprio(),
sched_setscheduler() and sched_move_task() routines.

Signed-off-by : Srivatsa Vaddagiri <[email protected]>


---
kernel/sched.c | 45 +++++++++++++++++++++++++--------------------
1 files changed, 25 insertions(+), 20 deletions(-)

Index: current/kernel/sched.c
===================================================================
--- current.orig/kernel/sched.c
+++ current/kernel/sched.c
@@ -3986,11 +3986,13 @@ void rt_mutex_setprio(struct task_struct
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_running(rq, p);
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(rq, p);
- }
+ /* current task is not kept in its runqueue in case of sched_fair class.
+ * Hence we need the 'on_rq?' and 'running?' tests to be separate.
+ */
+ if (running)
+ p->sched_class->put_prev_task(rq, p);

if (rt_prio(prio))
p->sched_class = &rt_sched_class;
@@ -3999,9 +4001,9 @@ void rt_mutex_setprio(struct task_struct

p->prio = prio;

+ if (running)
+ p->sched_class->set_curr_task(rq);
if (on_rq) {
- if (running)
- p->sched_class->set_curr_task(rq);
enqueue_task(rq, p, 0);
inc_load(rq, p);
/*
@@ -4298,18 +4300,20 @@ recheck:
update_rq_clock(rq);
on_rq = p->se.on_rq;
running = task_running(rq, p);
- if (on_rq) {
+ if (on_rq)
deactivate_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(rq, p);
- }
+ /* current task is not kept in its runqueue in case of sched_fair class.
+ * Hence we need the 'on_rq?' and 'running?' tests to be separate.
+ */
+ if (running)
+ p->sched_class->put_prev_task(rq, p);

oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);

+ if (running)
+ p->sched_class->set_curr_task(rq);
if (on_rq) {
- if (running)
- p->sched_class->set_curr_task(rq);
activate_task(rq, p, 0);
/*
* Reschedule if we are currently running on this runqueue and
@@ -7036,19 +7040,20 @@ void sched_move_task(struct task_struct
running = task_running(rq, tsk);
on_rq = tsk->se.on_rq;

- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, tsk, 0);
- if (unlikely(running))
- tsk->sched_class->put_prev_task(rq, tsk);
- }
+ /* current task is not kept in its runqueue in case of sched_fair class.
+ * Hence we need the 'on_rq?' and 'running?' tests to be separate.
+ */
+ if (unlikely(running))
+ tsk->sched_class->put_prev_task(rq, tsk);

set_task_cfs_rq(tsk);

- if (on_rq) {
- if (unlikely(running))
- tsk->sched_class->set_curr_task(rq);
+ if (unlikely(running))
+ tsk->sched_class->set_curr_task(rq);
+ if (on_rq)
enqueue_task(rq, tsk, 0);
- }

done:
task_rq_unlock(rq, &flags);


--
Regards,
vatsa

2007-11-09 08:45:32

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

Hi Srivatsa,

> [ ... ]
> --
>
> current task is not present in its runqueue in case of sched_fair class
> tasks. Take care of this fact in rt_mutex_setprio(),
> sched_setscheduler() and sched_move_task() routines.
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>
>
>
> ---
> kernel/sched.c | 45 +++++++++++++++++++++++++--------------------
> 1 files changed, 25 insertions(+), 20 deletions(-)
>
> Index: current/kernel/sched.c
> ===================================================================
> --- current.orig/kernel/sched.c
> +++ current/kernel/sched.c
> @@ -3986,11 +3986,13 @@ void rt_mutex_setprio(struct task_struct
> oldprio = p->prio;
> on_rq = p->se.on_rq;
> running = task_running(rq, p);
> - if (on_rq) {
> + if (on_rq)
> dequeue_task(rq, p, 0);
> - if (running)
> - p->sched_class->put_prev_task(rq, p);
> - }
> + /* current task is not kept in its runqueue in case of sched_fair class.
> + * Hence we need the 'on_rq?' and 'running?' tests to be separate.

Humm... the 'current' is not kept within the tree but
current->se.on_rq is supposed to be '1' ,
so the old code looks ok to me (at least for the 'leaf' elements).

Maybe you were able to get more useful oops on your site?


> --
> Regards,
> vatsa
>

--
Best regards,
Dmitry Adamushko

2007-11-09 10:02:36

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote:
> Humm... the 'current' is not kept within the tree but
> current->se.on_rq is supposed to be '1' ,
> so the old code looks ok to me (at least for the 'leaf' elements).

You are damned right! Sorry my mistake with the previous analysis and
(as I now find out) testing :(

There are couple of problems discovered by Suka's test:

- The test requires the cgroup filesystem to be mounted with
atleast the cpu and ns options (i.e both namespace and cpu
controllers are active in the same hierarchy).

# mkdir /dev/cpuctl
# mount -t cgroup -ocpu,ns none cpuctl
(or simply)
# mount -t cgroup none cpuctl -> Will activate all controllers
in same hierarchy.

- The test invokes clone() with CLONE_NEWNS set. This causes a a new child
to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
cgroup_clone) and the child is attached to the new group (cgroup_clone->
attach_task->sched_move_task). At this point in time, the child's scheduler
related fields are uninitialized (including its on_rq field, which it has
inherited from parent). As a result sched_move_task thinks its on
runqueue, when it isn't.

As a solution to this problem, I moved sched_fork() call, which
initializes scheduler related fields on a new task, before
copy_namespaces(). I am not sure though whether moving up will
cause other side-effects. Do you see any issue?

- The second problem exposed by this test is that task_new_fair()
assumes that parent and child will be part of the same group (which
needn't be as this test shows). As a result, cfs_rq->curr can be NULL
for the child.

The solution is to test for curr pointer being NULL in
task_new_fair().


With the patch below, I could run ns_exec() fine w/o a crash.

Suka, can you verify whether this patch fixes your problem?


--

Fix copy_namespace() <-> sched_fork() dependency in do_fork, by moving
up sched_fork().

Also introduce a NULL pointer check for 'curr' in task_new_fair().

Signed-off-by : Srivatsa Vaddagiri <[email protected]>

---
kernel/fork.c | 6 +++---
kernel/sched_fair.c | 2 +-
2 files changed, 4 insertions(+), 4 deletions(-)

Index: current/kernel/fork.c
===================================================================
--- current.orig/kernel/fork.c
+++ current/kernel/fork.c
@@ -1121,6 +1121,9 @@ static struct task_struct *copy_process(
p->blocked_on = NULL; /* not blocked yet */
#endif

+ /* Perform scheduler related setup. Assign this task to a CPU. */
+ sched_fork(p, clone_flags);
+
if ((retval = security_task_alloc(p)))
goto bad_fork_cleanup_policy;
if ((retval = audit_alloc(p)))
@@ -1210,9 +1213,6 @@ static struct task_struct *copy_process(
INIT_LIST_HEAD(&p->ptrace_children);
INIT_LIST_HEAD(&p->ptrace_list);

- /* Perform scheduler related setup. Assign this task to a CPU. */
- sched_fork(p, clone_flags);
-
/* Now that the task is set up, run cgroup callbacks if
* necessary. We need to run them before the task is visible
* on the tasklist. */
Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1023,7 +1023,7 @@ static void task_new_fair(struct rq *rq,
place_entity(cfs_rq, se, 1);

if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
- curr->vruntime < se->vruntime) {
+ curr && curr->vruntime < se->vruntime) {
/*
* Upon rescheduling, sched_class::put_prev_task() will place
* 'current' within the tree based on its new key value.

2007-11-09 10:32:54

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y


* Srivatsa Vaddagiri <[email protected]> wrote:

> With the patch below, I could run ns_exec() fine w/o a crash.
>
> Suka, can you verify whether this patch fixes your problem?

thanks, applied. I'll wait for confirmation from Suka before sending it
to Linus.

Ingo

2007-11-09 10:59:25

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

On 09/11/2007, Srivatsa Vaddagiri <[email protected]> wrote:
> [ ... ]
>
> As a solution to this problem, I moved sched_fork() call, which
> initializes scheduler related fields on a new task, before
> copy_namespaces(). I am not sure though whether moving up will
> cause other side-effects. Do you see any issue?

Should be ok (IMHO and at first glance :-)

> - The second problem exposed by this test is that task_new_fair()
> assumes that parent and child will be part of the same group (which
> needn't be as this test shows). As a result, cfs_rq->curr can be NULL
> for the child.

Would it be better, logically-wise, to use is_same_group() instead?
Although, we can't have 2 groups with cfs_rq->curr != NULL on the same
CPU... so if the child belongs to another group, it's cfs_rq->curr is
automatically NULL indeed.



--
Best regards,
Dmitry Adamushko

2007-11-09 11:59:38

by Srivatsa Vaddagiri

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

On Fri, Nov 09, 2007 at 11:59:15AM +0100, Dmitry Adamushko wrote:
> > - The second problem exposed by this test is that task_new_fair()
> > assumes that parent and child will be part of the same group (which
> > needn't be as this test shows). As a result, cfs_rq->curr can be NULL
> > for the child.
>
> Would it be better, logically-wise, to use is_same_group() instead?
> Although, we can't have 2 groups with cfs_rq->curr != NULL on the same
> CPU... so if the child belongs to another group, it's cfs_rq->curr is
> automatically NULL indeed.

Yeah ..I feel safe with an explicit !curr check, perhaps with a comment like
below added to explain when curr can be NULL?


---
kernel/sched_fair.c | 1 +
1 files changed, 1 insertion(+)

Index: current/kernel/sched_fair.c
===================================================================
--- current.orig/kernel/sched_fair.c
+++ current/kernel/sched_fair.c
@@ -1022,6 +1022,7 @@ static void task_new_fair(struct rq *rq,
update_curr(cfs_rq);
place_entity(cfs_rq, se, 1);

+ /* 'curr' will be NULL if the child belongs to a different group */
if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
curr && curr->vruntime < se->vruntime) {
/*



--
Regards,
vatsa

2007-11-09 16:06:19

by Serge E. Hallyn

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

Quoting Srivatsa Vaddagiri ([email protected]):
> On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote:
> > Humm... the 'current' is not kept within the tree but
> > current->se.on_rq is supposed to be '1' ,
> > so the old code looks ok to me (at least for the 'leaf' elements).
>
> You are damned right! Sorry my mistake with the previous analysis and
> (as I now find out) testing :(
>
> There are couple of problems discovered by Suka's test:
>
> - The test requires the cgroup filesystem to be mounted with
> atleast the cpu and ns options (i.e both namespace and cpu
> controllers are active in the same hierarchy).
>
> # mkdir /dev/cpuctl
> # mount -t cgroup -ocpu,ns none cpuctl
> (or simply)
> # mount -t cgroup none cpuctl -> Will activate all controllers
> in same hierarchy.
>
> - The test invokes clone() with CLONE_NEWNS set. This causes a a new child
> to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
> cgroup_clone) and the child is attached to the new group (cgroup_clone->
> attach_task->sched_move_task). At this point in time, the child's scheduler
> related fields are uninitialized (including its on_rq field, which it has
> inherited from parent). As a result sched_move_task thinks its on
> runqueue, when it isn't.
>
> As a solution to this problem, I moved sched_fork() call, which
> initializes scheduler related fields on a new task, before
> copy_namespaces(). I am not sure though whether moving up will
> cause other side-effects. Do you see any issue?
>
> - The second problem exposed by this test is that task_new_fair()
> assumes that parent and child will be part of the same group (which
> needn't be as this test shows). As a result, cfs_rq->curr can be NULL
> for the child.
>
> The solution is to test for curr pointer being NULL in
> task_new_fair().
>
>
> With the patch below, I could run ns_exec() fine w/o a crash.
>
> Suka, can you verify whether this patch fixes your problem?

Works on my machine. Thanks!

> --
>
> Fix copy_namespace() <-> sched_fork() dependency in do_fork, by moving
> up sched_fork().
>
> Also introduce a NULL pointer check for 'curr' in task_new_fair().
>
> Signed-off-by : Srivatsa Vaddagiri <[email protected]>

Tested-by: Serge Hallyn <[email protected]>

>
> ---
> kernel/fork.c | 6 +++---
> kernel/sched_fair.c | 2 +-
> 2 files changed, 4 insertions(+), 4 deletions(-)
>
> Index: current/kernel/fork.c
> ===================================================================
> --- current.orig/kernel/fork.c
> +++ current/kernel/fork.c
> @@ -1121,6 +1121,9 @@ static struct task_struct *copy_process(
> p->blocked_on = NULL; /* not blocked yet */
> #endif
>
> + /* Perform scheduler related setup. Assign this task to a CPU. */
> + sched_fork(p, clone_flags);
> +
> if ((retval = security_task_alloc(p)))
> goto bad_fork_cleanup_policy;
> if ((retval = audit_alloc(p)))
> @@ -1210,9 +1213,6 @@ static struct task_struct *copy_process(
> INIT_LIST_HEAD(&p->ptrace_children);
> INIT_LIST_HEAD(&p->ptrace_list);
>
> - /* Perform scheduler related setup. Assign this task to a CPU. */
> - sched_fork(p, clone_flags);
> -
> /* Now that the task is set up, run cgroup callbacks if
> * necessary. We need to run them before the task is visible
> * on the tasklist. */
> Index: current/kernel/sched_fair.c
> ===================================================================
> --- current.orig/kernel/sched_fair.c
> +++ current/kernel/sched_fair.c
> @@ -1023,7 +1023,7 @@ static void task_new_fair(struct rq *rq,
> place_entity(cfs_rq, se, 1);
>
> if (sysctl_sched_child_runs_first && this_cpu == task_cpu(p) &&
> - curr->vruntime < se->vruntime) {
> + curr && curr->vruntime < se->vruntime) {
> /*
> * Upon rescheduling, sched_class::put_prev_task() will place
> * 'current' within the tree based on its new key value.
> _______________________________________________
> Containers mailing list
> [email protected]
> https://lists.linux-foundation.org/mailman/listinfo/containers

2007-11-10 23:14:00

by Sukadev Bhattiprolu

[permalink] [raw]
Subject: Re: [BUG]: Crash with CONFIG_FAIR_CGROUP_SCHED=y

Serge E. Hallyn [[email protected]] wrote:
| Quoting Srivatsa Vaddagiri ([email protected]):
| > On Fri, Nov 09, 2007 at 09:45:21AM +0100, Dmitry Adamushko wrote:
| > > Humm... the 'current' is not kept within the tree but
| > > current->se.on_rq is supposed to be '1' ,
| > > so the old code looks ok to me (at least for the 'leaf' elements).
| >
| > You are damned right! Sorry my mistake with the previous analysis and
| > (as I now find out) testing :(
| >
| > There are couple of problems discovered by Suka's test:
| >
| > - The test requires the cgroup filesystem to be mounted with
| > atleast the cpu and ns options (i.e both namespace and cpu
| > controllers are active in the same hierarchy).
| >
| > # mkdir /dev/cpuctl
| > # mount -t cgroup -ocpu,ns none cpuctl
| > (or simply)
| > # mount -t cgroup none cpuctl -> Will activate all controllers
| > in same hierarchy.
| >
| > - The test invokes clone() with CLONE_NEWNS set. This causes a a new child
| > to be created, also a new group (do_fork->copy_namespaces->ns_cgroup_clone->
| > cgroup_clone) and the child is attached to the new group (cgroup_clone->
| > attach_task->sched_move_task). At this point in time, the child's scheduler
| > related fields are uninitialized (including its on_rq field, which it has
| > inherited from parent). As a result sched_move_task thinks its on
| > runqueue, when it isn't.
| >
| > As a solution to this problem, I moved sched_fork() call, which
| > initializes scheduler related fields on a new task, before
| > copy_namespaces(). I am not sure though whether moving up will
| > cause other side-effects. Do you see any issue?
| >
| > - The second problem exposed by this test is that task_new_fair()
| > assumes that parent and child will be part of the same group (which
| > needn't be as this test shows). As a result, cfs_rq->curr can be NULL
| > for the child.
| >
| > The solution is to test for curr pointer being NULL in
| > task_new_fair().
| >
| >
| > With the patch below, I could run ns_exec() fine w/o a crash.
| >
| > Suka, can you verify whether this patch fixes your problem?
|
| Works on my machine. Thanks!

And mine too. Thanks,


|
| > --
| >
| > Fix copy_namespace() <-> sched_fork() dependency in do_fork, by moving
| > up sched_fork().
| >
| > Also introduce a NULL pointer check for 'curr' in task_new_fair().
| >
| > Signed-off-by : Srivatsa Vaddagiri <[email protected]>
|
| Tested-by: Serge Hallyn <[email protected]>
Tested-by: Sukadev Bhattiprolu <[email protected]>