2019-12-03 09:12:07

by Schmid, Carsten

Subject: Crash in fair scheduler

Hi maintainers of the fair scheduler,

we had a crash in the fair scheduler and analysis shows that this could happen again.
Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2 (and 4.14.147 too).

I have collected relevant data from the core dump and put it together with the source.
So I was able to explain why the crash happened, but as I am not an expert on this
scheduler stuff I can't tell whether any of the data fed into the code path is OK or not.

In the appended analysis you can find the details; please check whether we ran into a corner
case that nobody expected yet or whether the data is somehow corrupted (I really can't tell; it looks fine to me).
Can you please check if this is a bug in the scheduler?

To give some "surrounding" info:
It looks like a task had slept for a long time, even through some suspend/resume cycles.
When this task had to wake up, the failure occurred.

If you need more data from the core dump, let me know.

Best regards
Carsten


Attachments:
fair_scheduler_crash.txt (11.10 kB)

2019-12-03 10:31:34

by Valentin Schneider

Subject: Re: Crash in fair scheduler

On 03/12/2019 09:11, Schmid, Carsten wrote:
[...]

> set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
> {
>         /* 'current' is not kept within the tree. */
>         if (se->on_rq) { <<<<<<< crash here
>
> set_next_entity is called from within pick_next_task_fair, from the following piece of code:
>
> static struct task_struct *
> pick_next_task_fair(struct rq *rq, struct task_struct *prev, struct rq_flags *rf)
> {
>         struct cfs_rq *cfs_rq = &rq->cfs;
>         struct sched_entity *se;
>         struct task_struct *p;
>         int new_tasks;
>
> again:
>         if (!cfs_rq->nr_running) // this is 1, so we are not going to idle
>                 goto idle;
>
> #ifdef CONFIG_FAIR_GROUP_SCHED
>         if (prev->sched_class != &fair_sched_class) <<<<< this is true:
>                 crash> p &fair_sched_class
>                 $1 = (const struct sched_class *) 0xffffffffaaa10cc0 <<<<
>                 crash> $prev=ffff99a97895a580
>                 crash> gdb set $prev=(struct task_struct *)0xffff99a97895a580
>                 crash> p $prev->sched_class
>                 $2 = (const struct sched_class *) 0xffffffffaaa10b40 <<<<
>                 goto simple; <<<< so we go to simple
> ....
> (Line 6360, Kernel 4.14.86; Line 6820, Kernel v5.4-rc2)
> simple:
> #endif
>
>         put_prev_task(rq, prev);
>
>         do {
>                 se = pick_next_entity(cfs_rq, NULL); <<<< this returns se=NULL
>                 set_next_entity(cfs_rq, se); <<<<<<<< here we crash
>                 cfs_rq = group_cfs_rq(se);
>         } while (cfs_rq);
>
> So why is se = NULL returned?


That looks a lot like a recent issue we've had, see

https://lore.kernel.org/lkml/[email protected]/

The issue is caused by

67692435c411 ("sched: Rework pick_next_task() slow-path")

which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
shouldn't, unless the kernel you're using has had core scheduling somehow
backported to it?

I've only scratched the surface but I'd like to first ask: can you reproduce
the issue on v5.4 final?

> Best regards
> Carsten
>

2019-12-03 10:32:11

by Peter Zijlstra

Subject: Re: Crash in fair scheduler

On Tue, Dec 03, 2019 at 09:11:14AM +0000, Schmid, Carsten wrote:
> Hi maintainers of the fair scheduler,
>
> we had a crash in the fair scheduler and analysis shows that this could happen again.
> Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2 (and 4.14.147 too).

Please do try to reproduce with Linus' latest git. I've no idea
what is, or is not, in those stable trees.

> crash> * cfs_rq ffff99a96dda9800
> struct cfs_rq {
>         load = { weight = 1048576, inv_weight = 0 },
>         nr_running = 1,
>         h_nr_running = 1,
>         exec_clock = 0,
>         min_vruntime = 190894920101,
>         tasks_timeline = { rb_root = { rb_node = 0xffff99a9502e0d10 }, rb_leftmost = 0x0 },
>         curr = 0x0,
>         next = 0x0,
>         last = 0x0,
>         skip = 0x0,


> &cfs_rq->tasks_timeline->rb_leftmost
> tasks_timeline = {
>         rb_root = {
>                 rb_node = 0xffff99a9502e0d10
>         },
>         rb_leftmost = 0x0
> },

> include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost

> struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> {
>         struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
>
>         if (!left)
>                 return NULL; <<<<<<<<<< the case
>
>         return rb_entry(left, struct sched_entity, run_node);
> }

This is the problem: for some reason the rbtree code got that rb_leftmost
thing wrecked.

> Is this a corner case nobody thought of or do we have cfs_rq data that is unexpected in its content?

No, the rbtree is corrupt. Your tree has a single node (which matches
with nr_running), but for some reason it thinks rb_leftmost is NULL.
This is wrong, if the tree is non-empty, it must have a leftmost
element.
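
Spelled out as code, the invariant your dump violates would be something
like this (just a sketch, not in-tree code):

/*
 * For any struct rb_root_cached: rb_leftmost is NULL iff the tree is
 * empty; otherwise it must equal rb_first() of the root.
 */
static inline void assert_leftmost_sane(struct rb_root_cached *root)
{
        WARN_ON(RB_EMPTY_ROOT(&root->rb_root) != !root->rb_leftmost);
        WARN_ON(root->rb_leftmost != rb_first(&root->rb_root));
}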

Can you reproduce at will? If so, can you please try the latest kernel,
and or share the reproducer?

2019-12-03 10:42:15

by Dietmar Eggemann

Subject: Re: Crash in fair scheduler

On 03/12/2019 11:30, Valentin Schneider wrote:
> On 03/12/2019 09:11, Schmid, Carsten wrote:

[...]

> That looks a lot like a recent issue we've had, see
>
> https://lore.kernel.org/lkml/[email protected]/
>
> The issue is caused by
>
> 67692435c411 ("sched: Rework pick_next_task() slow-path")
>
> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
> shouldn't, unless the kernel you're using has had core scheduling somehow
> backported to it?
>
> I've only scratched the surface but I'd like to first ask: can you reproduce
> the issue on v5.4 final?

Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
slow-path").

2019-12-03 10:53:42

by Schmid, Carsten

Subject: Re: Crash in fair scheduler

> > we had a crash in the fair scheduler and analysis shows that this could
> > happen again.
> > Happened on 4.14.86 (LTS series) but failing code path still exists in 5.4-rc2
> > (and 4.14.147 too).
>
> Please do try to reproduce with Linus' latest git. I've no idea
> what is, or is not, in those stable trees.
>
Unfortunately a one-time issue so far ...


--- snip ---

> > include/linux/rbtree.h:91:#define rb_first_cached(root) (root)->rb_leftmost
>
> > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> > {
> >         struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
> >
> >         if (!left)
> >                 return NULL; <<<<<<<<<< the case
> >
> >         return rb_entry(left, struct sched_entity, run_node);
> > }
>
> This is the problem: for some reason the rbtree code got that rb_leftmost
> thing wrecked.
>
Any known issues in the rbtree code regarding this?

> > Is this a corner case nobody thought of or do we have cfs_rq data that is
> > unexpected in its content?
>
> No, the rbtree is corrupt. Your tree has a single node (which matches
> with nr_running), but for some reason it thinks rb_leftmost is NULL.
> This is wrong, if the tree is non-empty, it must have a leftmost
> element.
Is there a chance to find the leftmost element in the core dump?
Maybe I can dig deeper to find the root cause then.
Does any of the structs/data in this context point to some memory
where I can continue to search?
Where should rb_leftmost point to if only one node is in the tree?
To the node itself?

>
> Can you reproduce at will? If so, can you please try the latest kernel,
> and or share the reproducer?
Unfortunately this was a one-time issue so far; I don't have a reproducer yet.

Thanks,
Carsten

2019-12-03 11:10:22

by Valentin Schneider

Subject: Re: Crash in fair scheduler

On 03/12/2019 10:40, Dietmar Eggemann wrote:
> On 03/12/2019 11:30, Valentin Schneider wrote:
>> On 03/12/2019 09:11, Schmid, Carsten wrote:
>
> [...]
>
>> That looks a lot like a recent issue we've had, see
>>
>> https://lore.kernel.org/lkml/[email protected]/
>>
>> The issue is caused by
>>
>> 67692435c411 ("sched: Rework pick_next_task() slow-path")
>>
>> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
>> shouldn't, unless the kernel you're using has had core scheduling somehow
>> backported to it?
>>
>> I've only scratched the surface but I'd like to first ask: can you reproduce
>> the issue on v5.4 final?
>
> Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
> slow-path").
>

Right, which is why I wondered if the kernel under test had had that
backported to it for some reason (e.g. core scheduling). Peter pointed out
that this is a slightly different issue (nr_running matches the rbt), so
this is probably unrelated.

2019-12-03 14:05:49

by Peter Zijlstra

Subject: Re: Crash in fair scheduler

On Tue, Dec 03, 2019 at 10:51:46AM +0000, Schmid, Carsten wrote:

> > > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
> > > {
> > >         struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
> > >
> > >         if (!left)
> > >                 return NULL; <<<<<<<<<< the case
> > >
> > >         return rb_entry(left, struct sched_entity, run_node);
> > > }
> >
> > This is the problem: for some reason the rbtree code got that rb_leftmost
> > thing wrecked.
> >
> Any known issues in the rbtree code regarding this?

I don't recall ever having seen this before. :/ Adding Davidlohr and
Michel who've poked at the rbtree code 'recently'.

> > > Is this a corner case nobody thought of or do we have cfs_rq data that is
> > > unexpected in its content?
> >
> > No, the rbtree is corrupt. Your tree has a single node (which matches
> > with nr_running), but for some reason it thinks rb_leftmost is NULL.
> > This is wrong, if the tree is non-empty, it must have a leftmost
> > element.
> Is there a chance to find the leftmost element in the core dump?

If there is only one entry in the tree, then that must also be the
leftmost entry. See your own later question :-)
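
In your dump that single node is 0xffff99a9502e0d10; something like this in
crash should hand you the owning sched_entity (from memory, -l takes the
offset of the embedded member, so double-check the syntax):

  crash> struct sched_entity -l sched_entity.run_node ffff99a9502e0d10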

> Maybe I can dig deeper to find the root cause then.
> Does any of the structs/data in this context point to some memory
> where I can continue to search?

There are only two places where rb_leftmost is updated:
rb_insert_color_cached() and rb_erase_cached() (the scheduler does not
use rb_replace_node_cached()).

We can 'forget' to set leftmost on insertion if @leftmost is somehow
false, and we can erroneously clear leftmost on erase if rb_next()
malfunctions.

No clues on which of those two cases happened.
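
For reference, both update sites are small wrappers in
include/linux/rbtree.h; roughly, from a 4.14-era tree:

static inline void rb_insert_color_cached(struct rb_node *node,
                                          struct rb_root_cached *root,
                                          bool leftmost)
{
        if (leftmost)
                root->rb_leftmost = node;
        rb_insert_color(node, &root->rb_root);
}

static inline void rb_erase_cached(struct rb_node *node,
                                   struct rb_root_cached *root)
{
        if (root->rb_leftmost == node)
                root->rb_leftmost = rb_next(node);
        rb_erase(node, &root->rb_root);
}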

> Where should rb_leftmost point to if only one node is in the tree?
> To the node itself?

Exactly.


I suppose one approach is to add code to both __enqueue_entity() and
__dequeue_entity() that compares ->rb_leftmost to the result of
rb_first(). That'd incur some overhead but it'd double check the logic.
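
Something like this, perhaps (untested, against kernel/sched/fair.c):

@@ ... @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
         rb_link_node(&se->run_node, parent, link);
         rb_insert_color_cached(&se->run_node, &cfs_rq->tasks_timeline, leftmost);
+        WARN_ON_ONCE(cfs_rq->tasks_timeline.rb_leftmost !=
+                     rb_first(&cfs_rq->tasks_timeline.rb_root));
 }

@@ ... @@ static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
         rb_erase_cached(&se->run_node, &cfs_rq->tasks_timeline);
+        WARN_ON_ONCE(cfs_rq->tasks_timeline.rb_leftmost !=
+                     rb_first(&cfs_rq->tasks_timeline.rb_root));
 }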

2019-12-03 15:09:41

by Dietmar Eggemann

Subject: Re: Crash in fair scheduler

On 03/12/2019 12:09, Valentin Schneider wrote:
> On 03/12/2019 10:40, Dietmar Eggemann wrote:
>> On 03/12/2019 11:30, Valentin Schneider wrote:
>>> On 03/12/2019 09:11, Schmid, Carsten wrote:
>>
>> [...]
>>
>>> That looks a lot like a recent issue we've had, see
>>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>>
>>> The issue is caused by
>>>
>>> 67692435c411 ("sched: Rework pick_next_task() slow-path")
>>>
>>> which 5.4-rc2 has (without the fix which landed in -rc7) but 4.14 really
>>> shouldn't, unless the kernel you're using has had core scheduling somehow
>>> backported to it?
>>>
>>> I've only scratched the surface but I'd like to first ask: can you reproduce
>>> the issue on v5.4 final?
>>
>> Can't be. 4.14.86 does not have ("sched: Rework pick_next_task()
>> slow-path").
>>
>
> Right, which is why I wondered if the kernel under test had had that
> backported to it for some reason (e.g. core scheduling). Peter pointed out
> that this is a slightly different issue (nr_running matches the rbt), so
> this is probably unrelated.

I can't reproduce it on Arm64 Juno running 4.14.86. I suppose that there
is no extra reproducer testcase since the issue happened with
prev->sched_class eq. &idle_sched_class [prev eq. swapper/X 0] in the
simple path of pick_next_task_fair().

I'm running with CONFIG_SCHED_AUTOGROUP=y and CONFIG_FAIR_GROUP_SCHED=y
some taskgroup related tests for hours now. So the sched_entity (se) can
be a task, an autogroup or a taskgroup in the simple path. prev is
either swapper/X or migration/X.

2019-12-03 15:59:21

by Schmid, Carsten

Subject: Re: Crash in fair scheduler

> On 03/12/2019 12:09, Valentin Schneider wrote:
> > On 03/12/2019 10:40, Dietmar Eggemann wrote:
> >> On 03/12/2019 11:30, Valentin Schneider wrote:
> >>> On 03/12/2019 09:11, Schmid, Carsten wrote:
> >>
> >> [...]
>
> I can't reproduce it on Arm64 Juno running 4.14.86. I suppose that there
> is no extra reproducer testcase since the issue happened with
> prev->sched_class eq. &idle_sched_class [prev eq. swapper/X 0] in the
> simple path of pick_next_task_fair().
>
> I'm running with CONFIG_SCHED_AUTOGROUP=y and
> CONFIG_FAIR_GROUP_SCHED=y
> some taskgroup related tests for hours now. So the sched_entity (se) can
> be a task, an autogroup or a taskgroup in the simple path. prev is
> either swapper/X or migration/X.

We have the same kernel config settings.
However, as I stated in the analysis, we had
prev->sched_class ne. &fair_sched_class
and, unfortunately, no reproducer.

Looks like we need to find out why rb_leftmost is 0x0/NULL.

2019-12-05 10:57:14

by Schmid, Carsten

Subject: Re: Crash in fair scheduler

> From: Peter Zijlstra [mailto:[email protected]]

>
> Exactly.
>
>
> I suppose one approach is to add code to both __enqueue_entity() and
> __dequeue_entity() that compares ->rb_leftmost to the result of
> rb_first(). That'd incur some overhead but it'd double check the logic.

As this is a one-time issue without a reproducer, I would prefer an approach
that checks exactly for this case in the code path where it crashed.
Something like this (a sketch against the 4.14 code, untested):

simple:
        ....

        do {
                se = pick_next_entity(cfs_rq, NULL);
                if (unlikely(!se)) {    /* here we check for the issue */
                        /* write a warning and some useful data to dmesg */
                        WARN(1, "cfs_rq %p: nr_running=%u but no entity to pick\n",
                             cfs_rq, cfs_rq->nr_running);
                        if (!cfs_rq->tasks_timeline.rb_leftmost) {      /* our case */
                                /*
                                 * Repair the cache: as discussed, with one
                                 * node in the tree, leftmost is that node.
                                 */
                                cfs_rq->tasks_timeline.rb_leftmost =
                                        rb_first(&cfs_rq->tasks_timeline.rb_root);
                                /* should now return a valid pointer */
                                se = pick_next_entity(cfs_rq, NULL);
                        } else {        /* another case happened, unknown */
                                /* not known what to do here, would crash anyway */
                                panic("pick_next_task_fair: cfs_rq in unknown state");
                        }
                }
                set_next_entity(cfs_rq, se);
                cfs_rq = group_cfs_rq(se);
        } while (cfs_rq);

This will definitely not fix rb_leftmost being NULL, but we can't tell
where that happened at all, so it's digging in the dark.
Maybe the data written to dmesg will help to diagnose further if the issue
happens again.
And this will not affect performance much, which is something I have to take care of, too.

Thanks for all your suggestions.
Carsten

2019-12-05 17:47:18

by Davidlohr Bueso

Subject: Re: Crash in fair scheduler

On Tue, 03 Dec 2019, Peter Zijlstra wrote:

>On Tue, Dec 03, 2019 at 10:51:46AM +0000, Schmid, Carsten wrote:
>
>> > > struct sched_entity *__pick_first_entity(struct cfs_rq *cfs_rq)
>> > > {
>> > >         struct rb_node *left = rb_first_cached(&cfs_rq->tasks_timeline);
>> > >
>> > >         if (!left)
>> > >                 return NULL; <<<<<<<<<< the case
>> > >
>> > >         return rb_entry(left, struct sched_entity, run_node);
>> > > }
>> >
>> > This is the problem: for some reason the rbtree code got that rb_leftmost
>> > thing wrecked.
>> >
>> Any known issues in the rbtree code regarding this?
>
>I don't recall ever having seen this before. :/ Adding Davidlohr and
>Michel who've poked at the rbtree code 'recently'.

Yeah, I had never seen this either, and would expect the world to fall
apart if leftmost were buggy (much less a one-time occurrence), but the
following certainly raises a red flag:

&cfs_rq->tasks_timeline->rb_leftmost
tasks_timeline = {
        rb_root = {
                rb_node = 0xffff99a9502e0d10
        },
        rb_leftmost = 0x0
},

>
>> > > Is this a corner case nobody thought of or do we have cfs_rq data that is
>> > > unexpected in its content?
>> >
>> > No, the rbtree is corrupt. Your tree has a single node (which matches
>> > with nr_running), but for some reason it thinks rb_leftmost is NULL.
>> > This is wrong, if the tree is non-empty, it must have a leftmost
>> > element.
>> Is there a chance to find the leftmost element in the core dump?
>
>If there is only one entry in the tree, then that must also be the
>leftmost entry. See your own later question :-)
>
>> Maybe I can dig deeper to find the root cause then.
>> Does any of the structs/data in this context point to some memory
>> where I can continue to search?
>
>There are only two places where rb_leftmost is updated:
>rb_insert_color_cached() and rb_erase_cached() (the scheduler does not
>use rb_replace_node_cached()).
>
>We can 'forget' to set leftmost on insertion if @leftmost is somehow
>false, and we can erroneously clear leftmost on erase if rb_next()
>malfunctions.
>
>No clues on which of those two cases happened.

For a bug in insertion, I'm certainly not seeing it: we only call insert
into tasks_timeline in __enqueue_entity()... this is the standard way of
using the API, and I cannot see how leftmost would become false unless we
take at least one path to the right while going down the tree.
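
For completeness, this is roughly what that insert path looks like in a
4.14 tree (kernel/sched/fair.c, quoting from memory); leftmost can only
become false by descending to the right at least once:

static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        struct rb_node **link = &cfs_rq->tasks_timeline.rb_root.rb_node;
        struct rb_node *parent = NULL;
        struct sched_entity *entry;
        bool leftmost = true;

        /* Find the right place in the rbtree: */
        while (*link) {
                parent = *link;
                entry = rb_entry(parent, struct sched_entity, run_node);
                if (entity_before(se, entry)) {
                        link = &parent->rb_left;
                } else {
                        link = &parent->rb_right;
                        leftmost = false;
                }
        }

        rb_link_node(&se->run_node, parent, link);
        rb_insert_color_cached(&se->run_node,
                               &cfs_rq->tasks_timeline, leftmost);
}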

The erase case is more involved than insertion (it uses rb_next()), but
that code has not changed in years.

Fwiw, there have been three flavors of the leftmost pointer caching:

The first is the one the scheduler used by itself.

The second is when we moved the logic into the rbtree cached API:
bfb068892d3 (sched/fair: replace cfs_rq->rb_leftmost)

The third was the recent round of simplifications and cleanups from Michel,
which moved the caching checks into rbtree.h instead of passing them
down to the internal functions that actually do the insert/delete.
9f973cb3808 (lib/rbtree: avoid generating code twice for the cached versions)

Specifically looking at 4.14.86, it is using the bfb068892d3 changes.

Note how all three use the same logic to replace the rb_leftmost pointer.

>
>> Where should rb_leftmost point to if only one node is in the tree?
>> To the node itself?
>
>Exactly.
>
>
>I suppose one approach is to add code to both __enqueue_entity() and
>__dequeue_entity() that compares ->rb_leftmost to the result of
>rb_first(). That'd incur some overhead but it'd double check the logic.

We could benefit from improved debugging in rbtrees, not only the cached
flavor. Perhaps we can start with the following -- this would at least
let us know if the case where the tree is non-empty and leftmost is nil
was hit, whether in the scheduler or another user...

Thanks,
Davidlohr

diff --git a/include/linux/rbtree.h b/include/linux/rbtree.h
index 1fd61a9af45c..b4b4df3ad0fc 100644
--- a/include/linux/rbtree.h
+++ b/include/linux/rbtree.h
@@ -130,7 +130,28 @@ struct rb_root_cached {
 #define RB_ROOT_CACHED (struct rb_root_cached) { {NULL, }, NULL }
 
 /* Same as rb_first(), but O(1) */
-#define rb_first_cached(root) (root)->rb_leftmost
+#define __rb_first_cached(root) (root)->rb_leftmost
+
+#ifndef CONFIG_RBTREE_DEBUG
+# define rb_first_cached(root) __rb_first_cached(root)
+# define rbtree_cached_debug(root) do { } while(0)
+
+#else
+static inline struct rb_node *rb_first_cached(struct rb_root_cached *root)
+{
+        struct rb_node *leftmost = __rb_first_cached(root);
+
+        WARN_ON(leftmost != rb_first(&root->rb_root));
+        return leftmost;
+}
+
+#define rbtree_cached_debug(root)                                        \
+do {                                                                     \
+        WARN_ON(rb_first(&(root)->rb_root) != __rb_first_cached((root))); \
+        WARN_ON(!RB_EMPTY_ROOT(&(root)->rb_root) && !__rb_first_cached((root))); \
+        WARN_ON(RB_EMPTY_ROOT(&(root)->rb_root) && __rb_first_cached((root))); \
+} while (0)
+#endif /* CONFIG_RBTREE_DEBUG */
 
 static inline void rb_insert_color_cached(struct rb_node *node,
                                           struct rb_root_cached *root,
@@ -139,6 +160,8 @@ static inline void rb_insert_color_cached(struct rb_node *node,
         if (leftmost)
                 root->rb_leftmost = node;
         rb_insert_color(node, &root->rb_root);
+
+        rbtree_cached_debug(root);
 }
 
 static inline void rb_erase_cached(struct rb_node *node,
@@ -147,6 +170,8 @@ static inline void rb_erase_cached(struct rb_node *node,
         if (root->rb_leftmost == node)
                 root->rb_leftmost = rb_next(node);
         rb_erase(node, &root->rb_root);
+
+        rbtree_cached_debug(root);
 }
 
 static inline void rb_replace_node_cached(struct rb_node *victim,
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 2f6fb96405af..62ab9f978bc6 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1727,6 +1727,16 @@ config BACKTRACE_SELF_TEST
 
           Say N if you are unsure.
 
+config RBTREE_DEBUG
+        bool "Red-Black tree sanity tests"
+        depends on DEBUG_KERNEL
+        help
+          This option enables runtime sanity checks on all variants
+          of the rbtree library. Doing so can cause significant overhead,
+          so only enable it in non-production environments.
+
+          Say N if you are unsure.
+
 config RBTREE_TEST
         tristate "Red-Black tree test"
         depends on DEBUG_KERNEL