2014-10-24 15:07:53

by Burke Libbey

[permalink] [raw]
Subject: [PATCH] sched: reset sched_entity depth on changing parent

From 2014-02-15: https://lkml.org/lkml/2014/2/15/217

This issue was reported and patched, but it still occurs in some situations on
newer kernel versions.

[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210

se.parent gets out of sync with se.depth, causing a panic when the algorithm in
find_matching_se assumes they are correct. This patch forces se.depth to be
updated every time se.parent is, so they can no longer become desync'd.

CC: Ingo Molnar <[email protected]>
CC: Peter Zijlstra <[email protected]>
Signed-off-by: Burke Libbey <[email protected]>
---

I haven't been able to isolate the problem. Though I'm pretty confident this
fixes the issue I've been having, I have not been able to prove it.

I kept a debugging journal if you want more context for whatever reason:
https://gist.github.com/burke/c60dc5b8f0ba9bfd9275

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 24156c84..bcffba2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -844,6 +844,7 @@ static inline void set_task_rq(struct task_struct *p, unsigned int cpu)
#ifdef CONFIG_FAIR_GROUP_SCHED
p->se.cfs_rq = tg->cfs_rq[cpu];
p->se.parent = tg->se[cpu];
+ p->se.depth = p->se.parent ? p->se.parent->depth + 1 : 0;
#endif

#ifdef CONFIG_RT_GROUP_SCHED


Attachments:
(No filename) (1.36 kB)
signature.asc (181.00 B)
Download all attachments

2014-10-24 15:58:11

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Fri, Oct 24, 2014 at 11:07:46AM -0400, Burke Libbey wrote:
> From 2014-02-15: https://lkml.org/lkml/2014/2/15/217
>
> This issue was reported and patched, but it still occurs in some situations on
> newer kernel versions.
>
> [2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
> [2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210
>
> se.parent gets out of sync with se.depth, causing a panic when the algorithm in
> find_matching_se assumes they are correct. This patch forces se.depth to be
> updated every time se.parent is, so they can no longer become desync'd.
>
> CC: Ingo Molnar <[email protected]>
> CC: Peter Zijlstra <[email protected]>
> Signed-off-by: Burke Libbey <[email protected]>
> ---
>
> I haven't been able to isolate the problem. Though I'm pretty confident this
> fixes the issue I've been having, I have not been able to prove it.

So this isn't correct, switching rq should not change depth. I suspect
you're just papering over the issue by frequently resetting the value,
which simply narrows the race window.

2014-10-24 17:18:56

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent



24.10.2014, 19:58, "Peter Zijlstra" <[email protected]>:
> On Fri, Oct 24, 2014 at 11:07:46AM -0400, Burke Libbey wrote:
>> ?From 2014-02-15: https://lkml.org/lkml/2014/2/15/217
>>
>> ?This issue was reported and patched, but it still occurs in some situations on
>> ?newer kernel versions.
>>
>> ?[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
>> ?[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210
>>
>> ?se.parent gets out of sync with se.depth, causing a panic when the algorithm in
>> ?find_matching_se assumes they are correct. This patch forces se.depth to be
>> ?updated every time se.parent is, so they can no longer become desync'd.
>>
>> ?CC: Ingo Molnar <[email protected]>
>> ?CC: Peter Zijlstra <[email protected]>
>> ?Signed-off-by: Burke Libbey <[email protected]>
>> ?---
>>
>> ?I haven't been able to isolate the problem. Though I'm pretty confident this
>> ?fixes the issue I've been having, I have not been able to prove it.
>
> So this isn't correct, switching rq should not change depth. I suspect
> you're just papering over the issue by frequently resetting the value,
> which simply narrows the race window.

Just a hypothesis.

I was seeking a places where task_group of a task may change. I can't understand
how changing of parent's cgroup during fork() applies to a child.

Child's cgroup is the same as parent's after dup_task_struct(). The only function
changing task_group is sched_move_task(), but we do not call it between
dup_task_struct() and wake_up_new_task(). Shouldn't we do something like this?

(compile tested only)
---
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index cc18694..0ccbbdb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7833,6 +7833,11 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
sched_offline_group(tg);
}

+static void cpu_cgroup_fork(struct task_struct *task)
+{
+ sched_move_task(task);
+}
+
static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
struct cgroup_taskset *tset)
{
@@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
.css_free = cpu_cgroup_css_free,
.css_online = cpu_cgroup_css_online,
.css_offline = cpu_cgroup_css_offline,
+ .fork = cpu_cgroup_fork,
.can_attach = cpu_cgroup_can_attach,
.attach = cpu_cgroup_attach,
.exit = cpu_cgroup_exit,


Or we just should set tsk->sched_task_group?

2014-10-27 09:51:02

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

I've dived into this and found, we are really need this.
I'll send a patch with description soon.

24.10.2014, 22:19, "Kirill Tkhai" <[email protected]>:
> 24.10.2014, 19:58, "Peter Zijlstra" <[email protected]>:
>> ?On Fri, Oct 24, 2014 at 11:07:46AM -0400, Burke Libbey wrote:
>>> ??From 2014-02-15: https://lkml.org/lkml/2014/2/15/217
>>>
>>> ??This issue was reported and patched, but it still occurs in some situations on
>>> ??newer kernel versions.
>>>
>>> ??[2249353.328452] BUG: unable to handle kernel NULL pointer dereference at 0000000000000150
>>> ??[2249353.336528] IP: [<ffffffff810b1cf7>] check_preempt_wakeup+0xe7/0x210
>>>
>>> ??se.parent gets out of sync with se.depth, causing a panic when the algorithm in
>>> ??find_matching_se assumes they are correct. This patch forces se.depth to be
>>> ??updated every time se.parent is, so they can no longer become desync'd.
>>>
>>> ??CC: Ingo Molnar <[email protected]>
>>> ??CC: Peter Zijlstra <[email protected]>
>>> ??Signed-off-by: Burke Libbey <[email protected]>
>>> ??---
>>>
>>> ??I haven't been able to isolate the problem. Though I'm pretty confident this
>>> ??fixes the issue I've been having, I have not been able to prove it.
>> ?So this isn't correct, switching rq should not change depth. I suspect
>> ?you're just papering over the issue by frequently resetting the value,
>> ?which simply narrows the race window.
>
> Just a hypothesis.
>
> I was seeking a places where task_group of a task may change. I can't understand
> how changing of parent's cgroup during fork() applies to a child.
>
> Child's cgroup is the same as parent's after dup_task_struct(). The only function
> changing task_group is sched_move_task(), but we do not call it between
> dup_task_struct() and wake_up_new_task(). Shouldn't we do something like this?
>
> (compile tested only)
> ---
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index cc18694..0ccbbdb 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7833,6 +7833,11 @@ static void cpu_cgroup_css_offline(struct cgroup_subsys_state *css)
> ?????????sched_offline_group(tg);
> ?}
>
> +static void cpu_cgroup_fork(struct task_struct *task)
> +{
> + sched_move_task(task);
> +}
> +
> ?static int cpu_cgroup_can_attach(struct cgroup_subsys_state *css,
> ??????????????????????????????????struct cgroup_taskset *tset)
> ?{
> @@ -8205,6 +8210,7 @@ struct cgroup_subsys cpu_cgrp_subsys = {
> ?????????.css_free = cpu_cgroup_css_free,
> ?????????.css_online = cpu_cgroup_css_online,
> ?????????.css_offline = cpu_cgroup_css_offline,
> + .fork = cpu_cgroup_fork,
> ?????????.can_attach = cpu_cgroup_can_attach,
> ?????????.attach = cpu_cgroup_attach,
> ?????????.exit = cpu_cgroup_exit,
>
> Or we just should set tsk->sched_task_group?
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/

2014-10-27 12:07:40

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
> I was seeking a places where task_group of a task may change. I can't understand
> how changing of parent's cgroup during fork() applies to a child.

I didn't know we could change cgroup on fork(), I though the idea was
you always inherited your parents cgroup.

How can this be?

2014-10-27 12:40:11

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Mon, Oct 27, 2014 at 01:07:34PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
> > I was seeking a places where task_group of a task may change. I can't understand
> > how changing of parent's cgroup during fork() applies to a child.
>
> I didn't know we could change cgroup on fork(), I though the idea was
> you always inherited your parents cgroup.
>
> How can this be?

Hmmm? -ENEEDMORECONTEXT but the inheriting happens at one point during
fork in cgroup_post_fork(). The child inherits whatever the parent
cgroup is at that point.

Thanks.

--
tejun

2014-10-27 13:28:19

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Mon, Oct 27, 2014 at 08:40:02AM -0400, Tejun Heo wrote:
> On Mon, Oct 27, 2014 at 01:07:34PM +0100, Peter Zijlstra wrote:
> > On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
> > > I was seeking a places where task_group of a task may change. I can't understand
> > > how changing of parent's cgroup during fork() applies to a child.
> >
> > I didn't know we could change cgroup on fork(), I though the idea was
> > you always inherited your parents cgroup.
> >
> > How can this be?
>
> Hmmm? -ENEEDMORECONTEXT but the inheriting happens at one point during
> fork in cgroup_post_fork(). The child inherits whatever the parent
> cgroup is at that point.

So Kirill is saying that there is a race between fork and attach such
that a child can end up in a different cgroup than the parent and we
need to use the cgroup_subsys::fork call to fix that up.

I was always under the impression that fork was an 'atomic' operation
from the point of cgroups, an attach (or move) would happen either
before the fork or after, not during. But I appear to be mistaken in
that assumption, going by the comments around cgroup_post_fork().

2014-10-27 13:36:09

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent



27.10.2014, 16:28, "Peter Zijlstra" <[email protected]>:
> On Mon, Oct 27, 2014 at 08:40:02AM -0400, Tejun Heo wrote:
>> ?On Mon, Oct 27, 2014 at 01:07:34PM +0100, Peter Zijlstra wrote:
>>> ?On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
>>>> ?I was seeking a places where task_group of a task may change. I can't understand
>>>> ?how changing of parent's cgroup during fork() applies to a child.
>>> ?I didn't know we could change cgroup on fork(), I though the idea was
>>> ?you always inherited your parents cgroup.
>>>
>>> ?How can this be?
>> ?Hmmm? -ENEEDMORECONTEXT but the inheriting happens at one point during
>> ?fork in cgroup_post_fork(). ?The child inherits whatever the parent
>> ?cgroup is at that point.
>
> So Kirill is saying that there is a race between fork and attach such
> that a child can end up in a different cgroup than the parent and we
> need to use the cgroup_subsys::fork call to fix that up.

I mean cgroup is the same, but sched_task_group is other (sched_task_group
is equal to parent's on the moment of dup_task_struct()).


> I was always under the impression that fork was an 'atomic' operation
> from the point of cgroups, an attach (or move) would happen either
> before the fork or after, not during. But I appear to be mistaken in
> that assumption, going by the comments around cgroup_post_fork().

2014-10-27 13:45:17

by Peter Zijlstra

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Mon, Oct 27, 2014 at 04:36:02PM +0300, Kirill Tkhai wrote:
>
>
> 27.10.2014, 16:28, "Peter Zijlstra" <[email protected]>:
> > On Mon, Oct 27, 2014 at 08:40:02AM -0400, Tejun Heo wrote:
> >> ?On Mon, Oct 27, 2014 at 01:07:34PM +0100, Peter Zijlstra wrote:
> >>> ?On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
> >>>> ?I was seeking a places where task_group of a task may change. I can't understand
> >>>> ?how changing of parent's cgroup during fork() applies to a child.
> >>> ?I didn't know we could change cgroup on fork(), I though the idea was
> >>> ?you always inherited your parents cgroup.
> >>>
> >>> ?How can this be?
> >> ?Hmmm? -ENEEDMORECONTEXT but the inheriting happens at one point during
> >> ?fork in cgroup_post_fork(). ?The child inherits whatever the parent
> >> ?cgroup is at that point.
> >
> > So Kirill is saying that there is a race between fork and attach such
> > that a child can end up in a different cgroup than the parent and we
> > need to use the cgroup_subsys::fork call to fix that up.
>
> I mean cgroup is the same, but sched_task_group is other (sched_task_group
> is equal to parent's on the moment of dup_task_struct()).

But that still means the parent changed cgroup during fork right? It
started out in a different cgroup than it ended up with, and we need
that .fork callback to fixup state.

2014-10-27 13:47:40

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

Hello, Peter, Kirill.

On Mon, Oct 27, 2014 at 02:28:12PM +0100, Peter Zijlstra wrote:
...
> So Kirill is saying that there is a race between fork and attach such
> that a child can end up in a different cgroup than the parent and we
> need to use the cgroup_subsys::fork call to fix that up.
>
> I was always under the impression that fork was an 'atomic' operation
> from the point of cgroups, an attach (or move) would happen either
> before the fork or after, not during. But I appear to be mistaken in
> that assumption, going by the comments around cgroup_post_fork().

cgroup migration is atomic against threadgroup changes - tasks joining
or leaving the threadgroup. If a process is forking a child, the
child inherits the parent's cgroup at one point while forking and the
parent may have changed its cgroup while fork was in progress. So,
yeah, it can't depend on the dup'd result to be correct.

Thanks.

--
tejun

2014-10-27 13:48:16

by Kirill Tkhai

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent



27.10.2014, 16:45, "Peter Zijlstra" <[email protected]>:
> On Mon, Oct 27, 2014 at 04:36:02PM +0300, Kirill Tkhai wrote:
>> ?27.10.2014, 16:28, "Peter Zijlstra" <[email protected]>:
>>> ?On Mon, Oct 27, 2014 at 08:40:02AM -0400, Tejun Heo wrote:
>>>> ??On Mon, Oct 27, 2014 at 01:07:34PM +0100, Peter Zijlstra wrote:
>>>>> ??On Fri, Oct 24, 2014 at 09:18:42PM +0400, Kirill Tkhai wrote:
>>>>>> ??I was seeking a places where task_group of a task may change. I can't understand
>>>>>> ??how changing of parent's cgroup during fork() applies to a child.
>>>>> ??I didn't know we could change cgroup on fork(), I though the idea was
>>>>> ??you always inherited your parents cgroup.
>>>>>
>>>>> ??How can this be?
>>>> ??Hmmm? -ENEEDMORECONTEXT but the inheriting happens at one point during
>>>> ??fork in cgroup_post_fork(). ?The child inherits whatever the parent
>>>> ??cgroup is at that point.
>>> ?So Kirill is saying that there is a race between fork and attach such
>>> ?that a child can end up in a different cgroup than the parent and we
>>> ?need to use the cgroup_subsys::fork call to fix that up.
>> ?I mean cgroup is the same, but sched_task_group is other (sched_task_group
>> ?is equal to parent's on the moment of dup_task_struct()).
>
> But that still means the parent changed cgroup during fork right? It
> started out in a different cgroup than it ended up with, and we need
> that .fork callback to fixup state.

Yeah, I'm agree.

2014-10-27 13:48:54

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH] sched: reset sched_entity depth on changing parent

On Mon, Oct 27, 2014 at 09:47:35AM -0400, Tejun Heo wrote:
...
> cgroup migration is atomic against threadgroup changes - tasks joining
> or leaving the threadgroup. If a process is forking a child, the
^
process
--
tejun