2010-06-29 07:10:40

by shenghui

Subject: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

Hi,

I walked through some code in kernel/sched_fair.c in version 2.6.35-rc3,
and found the following potential failure:

static struct task_struct *pick_next_task_fair(struct rq *rq)
{
...
if (!cfs_rq->nr_running)
return NULL;

do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);
...
}

/*
* The dequeue_task method is called after nr_running is
* decreased. We remove the task from the rbtree and
* update the fair scheduling stats:
*/
static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
{
...
dequeue_entity(cfs_rq, se, flags);
...
}

static void
dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
...
if (se != cfs_rq->curr)
__dequeue_entity(cfs_rq, se);
account_entity_dequeue(cfs_rq, se);
update_min_vruntime(cfs_rq);
...
}

dequeue_task_fair() is used to dequeue a task, and it calls
dequeue_entity() to finish the job. dequeue_entity() calls
__dequeue_entity() first, and only then calls account_entity_dequeue()
to do the subtraction on cfs_rq->nr_running.
So when a task is dequeued, cfs_rq->nr_running is only updated
afterwards, e.g. from 1 to 0.
In pick_next_task_fair(), the if check can then pass before
nr_running is set to 0, and the following loop is executed:

do {
se = pick_next_entity(cfs_rq);
set_next_entity(cfs_rq, se);
cfs_rq = group_cfs_rq(se);
} while (cfs_rq);

We may get se set to NULL here, and set_next_entity() may then
dereference a NULL pointer, which can lead to an Oops.

I think some lock on the metadata can fix this issue, but we may
need to change plenty of code to add locking. I think the easiest
way is just subtracting nr_running before dequeuing tasks.

Following is my patch; please check it.


From 4fe38deac173c7777cd02096950e979749170873 Mon Sep 17 00:00:00 2001
From: Wang Sheng-Hui <[email protected]>
Date: Tue, 29 Jun 2010 14:49:05 +0800
Subject: [PATCH] avoid race condition in pick_next_task_fair in
kernel/sched_fair.c


Signed-off-by: Wang Sheng-Hui <[email protected]>
---
kernel/sched_fair.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index eed35ed..93073ff 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -823,9 +823,9 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
 
 	clear_buddies(cfs_rq, se);
 
+	account_entity_dequeue(cfs_rq, se);
 	if (se != cfs_rq->curr)
 		__dequeue_entity(cfs_rq, se);
-	account_entity_dequeue(cfs_rq, se);
 	update_min_vruntime(cfs_rq);

/*
@@ -1059,7 +1059,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 }
 
 /*
- * The dequeue_task method is called before nr_running is
+ * The dequeue_task method is called after nr_running is
  * decreased. We remove the task from the rbtree and
  * update the fair scheduling stats:
  */
--
1.6.3.3



--


Thanks and Best Regards,
shenghui


2010-06-29 10:43:40

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
> I think some lock on the metadata can fix this issue, but we may
> need to change plenty of code to add locking. I think the easiest
> way is just subtracting nr_running before dequeuing tasks.

But all that is fully serialized by the rq->lock.. so I'm really not
seeing how this can happen.

2010-06-29 11:24:20

by shenghui

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

2010/6/29 Peter Zijlstra <[email protected]>:
> On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
>> I think some lock on the metadata can fix this issue, but we may
>> need to change plenty of code to add locking. I think the easiest
>> way is just subtracting nr_running before dequeuing tasks.
>
> But all that is fully serialized by the rq->lock.. so I'm really not
> seeing how this can happen.
>
>
>

I wonder, is there any chance set_next_entity() can get NULL for
its se parameter in that case?
And would you please give me some pointers on where rq->lock
is required?

--


Thanks and Best Regards,
shenghui

2010-06-29 11:35:24

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Tue, 2010-06-29 at 19:24 +0800, shenghui wrote:

> I wonder is there any chance set_next_entity() can get NULL for
> parameter se if so?

Well, if your machine crashes that way, maybe, but I haven't seen that
happen in a long while.

> And will you please give me some instructions on where rq->lock
> is required?

Pretty much everywhere, if you look at sched.c the only sched_class
method not called with rq->lock held is ::task_fork().

The interesting bits are that schedule()->pre_schedule()/idle_balance()
can drop rq->lock as well as ->select_task_rq().


2010-06-29 12:44:40

by shenghui

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

2010/6/29 Peter Zijlstra <[email protected]>:
> Pretty much everywhere, if you look at sched.c the only sched_class
> method not called with rq->lock held is ::task_fork().
>
> The interesting bits are that schedule()->pre_schedule()/idle_balance()
> can drop rq->lock as well as ->select_task_rq().
>
>
>
>

Thanks for your instructions!


--


Thanks and Best Regards,
shenghui

2010-12-19 02:20:42

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Tue, Jun 29, 2010 at 12:43:35PM +0200, Peter Zijlstra <[email protected]> wrote:
> On Tue, 2010-06-29 at 15:10 +0800, shenghui wrote:
> > I think some lock on the metadata can fix this issue, but we may
> > need to change plenty of code to add locking. I think the easiest
> > way is just subtracting nr_running before dequeuing tasks.
>
> But all that is fully serialized by the rq->lock.. so I'm really not
> seeing how this can happen.

Hi,

Here is a panic I got today:

http://frugalware.org/~vmiklos/pics/bug/2.6.37-rc6.png

More details:

I get this sometimes on boot or shutdown when testing systemd. I did not
get it with sysvinit, so I guess it may be related to systemd's heavy
cgroups usage, but I'm not sure. Sadly it isn't 100% reproducible but I
usually hit it at least once a day.

The config is here:
http://frugalware.org/~vmiklos/logs/2.6.37-rc6.config (I just did a
yes "" | make config to update it to 2.6.37-rc6.)

I got something similar with 2.6.36.1 as well:

http://frugalware.org/~vmiklos/pics/bug/2.6.36.1.png

Ah, and this is on i686 in VMware - though given that I never had this
problem without systemd, I guess it won't be an emulator bug. :)

I'm not familiar with the sched code, is it possible that this is
related?

Thanks,

Miklos



2010-12-22 00:22:57

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Sun, Dec 19, 2010 at 03:03:13AM +0100, Miklos Vajna <[email protected]> wrote:
> Here is a panic I got today:
>
> http://frugalware.org/~vmiklos/pics/bug/2.6.37-rc6.png
>
> More details:
>
> I get this sometimes on boot or shutdown when testing systemd. I did not
> get it with sysvinit, so I guess it may be related to systemd's heavy
> cgroups usage, but I'm not sure. Sadly it isn't 100% reproducible but I
> usually hit it at least once a day.
>
> The config is here:
> http://frugalware.org/~vmiklos/logs/2.6.37-rc6.config (I just did a
> yes "" | make config to update it to 2.6.37-rc6.)
>
> I got something similar with 2.6.36.1 as well:
>
> http://frugalware.org/~vmiklos/pics/bug/2.6.36.1.png
>
> Ah, and this is on i686 in VMware - though given that I never had this
> problem with systemd, I guess it won't be an emulator bug. :)
>
> I'm not familiar with the sched code, is it possible that this is
> related?

Some more info:

- I reproduced the issue on my Lenovo S12 netbook as well, so it's not a
virtual machine-related bug. Here is a poor shot:
http://frugalware.org/~vmiklos/pics/bug/p1040730.jpg
- I'm also attaching the full dmesg from the VM, finally as text,
captured over the serial console.
- On "how to reproduce": sometimes I just get this during boot. If it
does not occur, then it does when using "systemctl emergency".

Please let me know if you need more info.

Thanks,

Miklos



2010-12-22 08:29:52

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> Please let me know if you need more info.

What distro are you using? It looks like systemd is involved, and I'm
actively avoiding anything using that crap. It now triggering a bug
means I need to get myself a 32-bit image using it :/

If you happen to have a qemu image that reproduces.. :-)

2010-12-22 08:40:59

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > Please let me know if you need more info.
>
> What distro are you using? it looks like systemd is involved and I'm
> actively avoiding anything using that crap. It now triggering a bug
> means I need to get myself a 32bit image using it :/
>
> If you happen to have a qemu image that reproduces.. :-)

Removed the systemd list from CC; please don't cross-post to lists
that send bounces for non-members, that's considered rude.

2010-12-22 08:42:01

by Mike Galbraith

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > Please let me know if you need more info.
>
> What distro are you using? it looks like systemd is involved and I'm
> actively avoiding anything using that crap.

Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
get the thing built and installed blew my box _all up_. I still have
smoldering bits lying about a week later.

-Mike

2010-12-22 09:07:23

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 09:41 +0100, Mike Galbraith wrote:
> On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> > On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > > Please let me know if you need more info.
> >
> > What distro are you using? it looks like systemd is involved and I'm
> > actively avoiding anything using that crap.
>
> Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
> get the thing built and installed blew my box _all up_. I still have
> smoldering bits lying about a week later.

There's a reason I was asking for a qemu image ;-), I'm not letting that
stuff anywhere near my regular machines.

2010-12-22 13:32:03

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, Dec 22, 2010 at 10:07:22AM +0100, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-12-22 at 09:41 +0100, Mike Galbraith wrote:
> > On Wed, 2010-12-22 at 09:29 +0100, Peter Zijlstra wrote:
> > > On Wed, 2010-12-22 at 01:22 +0100, Miklos Vajna wrote:
> > > > Please let me know if you need more info.
> > >
> > > What distro are you using? it looks like systemd is involved and I'm
> > > actively avoiding anything using that crap.

First, sorry about the systemd list - I did not know it's
subscriber-only.

This is Frugalware Linux, and it's disabled by default here as well.

> > Heh, gearing up to hunt a bug it triggers, just doing all I had to do to
> > get the thing built and installed blew my box _all up_. I still have
> > smoldering bits lying about a week later.
>
> There's a reason I was asking for a qemu image ;-), I'm not letting that
> stuff anywhere near my regular machines.

Sure - here is a qcow image (871M):

http://frugalware.org/~vmiklos/files/systemd.img

Here is how I started it:

qemu-kvm -vnc :0 -m 1G -hda systemd.img

$ qemu-kvm -version
QEMU PC emulator version 0.12.3 (qemu-kvm-0.12.3), Copyright (c) 2003-2008 Fabrice Bellard

There are two menu entries in the grub config: the first is sysvinit,
which boots up correctly here; you can log in using root / root. The
second is systemd, which just panics here almost all the time. In case
it does not, run 'systemctl emergency' as root and that will trigger
the bug. :)

If you need more packages in the virtual machine and the pacman package
manager sounds exotic, here is a quick help:

http://frugalware.org/docs/pacman-g2#_apt_pacman_g2_cross_reference

Thanks,

Miklos



2010-12-22 14:00:31

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 14:31 +0100, Miklos Vajna wrote:
>
> http://frugalware.org/~vmiklos/files/systemd.img
>
> Here is how I started it:
>
> qemu-kvm -vnc :0 -m 1G -hda systemd.img

awesome, thanks!

I started it with something like:
qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
-hda systemd.img -serial stdio -m 1G

Where foo-build/ contains a kernel build using your .config.

I'll have a poke at it..

2010-12-22 14:11:50

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 15:00 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 14:31 +0100, Miklos Vajna wrote:
> >
> > http://frugalware.org/~vmiklos/files/systemd.img
> >
> > Here is how I started it:
> >
> > qemu-kvm -vnc :0 -m 1G -hda systemd.img
>
> awesome, thanks!
>
> I started it with something like:
> qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
> debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
> -hda systemd.img -serial stdio -m 1G
>
> Where foo-build/ contains a kernel build using your .config.
>
> I'll have a poke at it..

Hrm, it's not really wanting to start properly...

---
EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
Freeing unused kernel memory: 520k freed
EXT4-fs (sda1): re-mounted. Opts: (null)
md: stopping all md devices.
Restarting system.
machine restart
---

Does it _require_ initrd-like muck? Or are there some modules that need
to be built in for the thing to boot?

There's an utter lack of error reporting here; no idea what's wrong.

2010-12-22 15:14:41

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, Dec 22, 2010 at 03:11:52PM +0100, Peter Zijlstra <[email protected]> wrote:
> > I started it with something like:
> > qemu -kernel foo-build/arch/x86/boot/bzImage -append "root=/dev/sda1
> > debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0"
> > -hda systemd.img -serial stdio -m 1G
> >
> > Where foo-build/ contains a kernel build using your .config.
> >
> > I'll have a poke at it..
>
> Hrm,. its not really wanting to start properly..
>
> ---
> EXT4-fs (sda1): mounted filesystem with ordered data mode. Opts: (null)
> VFS: Mounted root (ext4 filesystem) readonly on device 8:1.
> Freeing unused kernel memory: 520k freed
> EXT4-fs (sda1): re-mounted. Opts: (null)
> md: stopping all md devices.
> Restarting system.
> machine restart
> ---
>
> Does it _require_ initrd like muck? Or are there some modules that need
> to get built-in in order for the thing to boot?

Nope.

> This is an utter lack of error reporting here, no idea what's wrong.

I tried to do something really similar to your commandline:

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0" -hda systemd.img -serial stdio -m 1G -vnc :0

This boots up properly here, I can login using root/root from vnc.

qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G -vnc :0

^ Only init=/bin/systemd added, and this results in a panic in most
cases. I'm attaching the stdout of qemu, showing the fail in
put_prev_task_fair.

$ qemu -version
QEMU emulator version 0.13.0, Copyright (c) 2003-2008 Fabrice Bellard

kernel-build is a git build using the config I already sent and after a
'git checkout v2.6.36'. I can try to build master as well, so far what I
saw is that the bug occurs there as well, but less frequently, so maybe
that's a bit harder to debug.

I also want to note that, at least on my machine, if I drop -enable-kvm
the bug is hard to reproduce. Maybe without kvm it no longer triggers
the race condition, or my machine is just too slow without kvm and some
timeout fires, changing the behaviour; I'm not exactly sure. (But
again, I can reproduce the bug on real hardware, so I don't think we
have a kvm bug here.)

Thanks,

Miklos



2010-12-22 15:25:40

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
>
> qemu -enable-kvm -kernel kernel-build/arch/x86/boot/bzImage -append
> "root=/dev/sda1 debug sched_debug ignore_loglevel sysrq_always_enabled
> console=ttyS0 init=/bin/systemd" -hda systemd.img -serial stdio -m 1G
> -vnc :0
>
> ^ Only init=/bin/systemd added, and this results in a panic in most
> cases. I'm attaching the stdout of qemu, showing the fail in
> put_prev_task_fair.

OK, that does seem to work, maybe it really wants the vnc thing, dunno.

Anyway, it looks like I need to go build a kernel with kvm in for that
machine ;-) Thanks for the effort so far.

2010-12-22 17:25:26

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 18:16 +0100, Ingo Molnar wrote:
> * Peter Zijlstra <[email protected]> wrote:
>
> > On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > > kernel-build is a git build using the config I already sent and after a
> > > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > > saw is that the bug occurs there as well, but less frequently, so maybe
> > > that's a bit harder to debug.
> >
> > Right, so I've got v2.6.36 exploding, current -git and -tip won't
> > explode for me, tried both about 20 times. So I'll try and figure out
> > wth makes .36 explode and then see if further kernel still suffer that
> > problem.
>
> Does it explode if you change some of the CONFIG_CC flags, such as
> CONFIG_CC_OPTIMIZE_FOR_SIZE? I.e. is the crash dependent on the precise kernel image
> layout perhaps? If it's sensitive to that then this might keep you from chasing
> shadows ...

I've since changed the .config slightly (enabled frame pointers, added
debug info, enabled the function tracer) and it's still exploding.
Also, I suspect Miklos and I have different gcc versions.

So I suspect it's really something weird; I just can't really make
sense of it yet.

2010-12-22 17:09:44

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> kernel-build is a git build using the config I already sent and after a
> 'git checkout v2.6.36'. I can try to build master as well, so far what I
> saw is that the bug occurs there as well, but less frequently, so maybe
> that's a bit harder to debug.

Right, so I've got v2.6.36 exploding, current -git and -tip won't
explode for me, tried both about 20 times. So I'll try and figure out
wth makes .36 explode and then see if further kernel still suffer that
problem.

2010-12-22 17:17:12

by Ingo Molnar

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c


* Peter Zijlstra <[email protected]> wrote:

> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

Does it explode if you change some of the CONFIG_CC flags, such as
CONFIG_CC_OPTIMIZE_FOR_SIZE? I.e. is the crash dependent on the precise kernel image
layout perhaps? If it's sensitive to that then this might keep you from chasing
shadows ...

Thanks,

Ingo

2010-12-22 20:36:26

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, 2010-12-22 at 18:08 +0100, Peter Zijlstra wrote:
> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

OK, so something goes wrong with load accounting.

Full log at: http://programming.kicks-ass.net/sekrit/log4

I print the entire runqueue before and after each op, the last good one
is a post-print:

kworker/-9       0d..2. 2013644us : pick_next_task_fair: picked: f6a8de5c, kworker/u:1/1251
kworker/-9       0d..2. 2013644us : print_runqueue <-pick_next_task_fair
kworker/-9       0d..2. 2013645us : __print_runqueue:  cfs_rq: c2407c34, nr: 3, load: 3072
kworker/-9       0d..2. 2013645us : __print_runqueue:  curr: f6a8de5c, comm: kworker/u:1/1251, load: 1024
kworker/-9       0d..2. 2013646us : __print_runqueue:  se: f69e6300, load: 1024,
kworker/-9       0d..2. 2013646us : __print_runqueue:    cfs_rq: f69e6540, nr: 2, load: 2048
kworker/-9       0d..2. 2013646us : __print_runqueue:    curr: (null)
kworker/-9       0d..2. 2013647us : __print_runqueue:    se: f69e65a0, load: 1024,
kworker/-9       0d..2. 2013647us : __print_runqueue:      cfs_rq: f69e6240, nr: 1, load: 1024
kworker/-9       0d..2. 2013648us : __print_runqueue:      curr: (null)
kworker/-9       0d..2. 2013648us : __print_runqueue:      se: f725916c, comm: loadkeys/1225, load: 1024
kworker/-9       0d..2. 2013649us : __print_runqueue:    se: f69e6660, load: 1024,
kworker/-9       0d..2. 2013649us : __print_runqueue:      cfs_rq: f69e6600, nr: 1, load: 1024
kworker/-9       0d..2. 2013650us : __print_runqueue:      curr: (null)
kworker/-9       0d..2. 2013650us : __print_runqueue:      se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
kworker/-9       0d..2. 2013651us : __print_runqueue:  se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024


And then the next pre-print looks like:

systemd--1251    0d..5. 2015398us : enqueue_task_fair <-enqueue_task
systemd--1251    0d..5. 2015398us : print_runqueue <-enqueue_task_fair
systemd--1251    0d..5. 2015399us : __print_runqueue:  cfs_rq: c2407c34, nr: 3, load: 3072
systemd--1251    0d..5. 2015400us : __print_runqueue:  curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
systemd--1251    0d..5. 2015401us : __print_runqueue:  se: f69e6300, load: 1024,
systemd--1251    0d..5. 2015401us : __print_runqueue:    cfs_rq: f69e6540, nr: 2, load: 2048
systemd--1251    0d..5. 2015402us : __print_runqueue:    curr: (null)
systemd--1251    0d..5. 2015402us : __print_runqueue:    se: f69e65a0, load: 4137574976,
systemd--1251    0d..5. 2015402us : __print_runqueue:      cfs_rq: f69e6240, nr: 1, load: 4137575360
systemd--1251    0d..5. 2015403us : __print_runqueue:      curr: (null)
systemd--1251    0d..5. 2015404us : __print_runqueue:      se: f725916c, comm: loadkeys/1225, load: 1024
systemd--1251    0d..5. 2015404us : __print_runqueue:    se: f69e6660, load: 1024,
systemd--1251    0d..5. 2015404us : __print_runqueue:      cfs_rq: f69e6600, nr: 1, load: 1024
systemd--1251    0d..5. 2015405us : __print_runqueue:      curr: (null)
systemd--1251    0d..5. 2015406us : __print_runqueue:      se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
systemd--1251    0d..5. 2015406us : __print_runqueue:  se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024


A load value as crazy as this leads to dequeue_task_fair() making weird
decisions, since it checks for load == 0 to decide when to dequeue
groups.

Failing that check then leads to 'dead' groups still being enqueued and
eventually referencing freed memory etc.

I've no clue why any of this happens; it could be random scribbles,
except that the se/cfs_rq combo are always off together (although not
always by the same amount).

Anyway, my brain needs sleep, I'll continue poking tomorrow.

---
kernel/sched_fair.c | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 71 insertions(+), 0 deletions(-)

diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..8b8b038 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -333,6 +333,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;
 
+	trace_printk("%p to %p, %ld tasks\n", se, cfs_rq, cfs_rq->nr_running);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -364,6 +366,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	trace_printk("%p from %p, %ld left\n", se, cfs_rq, cfs_rq->nr_running);
+
 	if (cfs_rq->rb_leftmost == &se->run_node) {
 		struct rb_node *next_node;
 
@@ -394,6 +398,53 @@ static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 	return rb_entry(last, struct sched_entity, run_node);
 }
 
+#define INDENT(i) ((i)*2), "                                        "
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent);
+
+static void __always_inline
+__print_entity(struct sched_entity *se, char *name, int indent)
+{
+	if (!se) {
+		trace_printk("%.*s %s: %p\n", INDENT(indent), name, se);
+		return;
+	}
+
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+		trace_printk("%.*s %s: %p, comm: %s/%d, load: %ld\n",
+				INDENT(indent), name,
+				se, p->comm, p->pid, se->load.weight);
+	} else {
+		trace_printk("%.*s %s: %p, load: %ld, ",
+				INDENT(indent), name,
+				se, se->load.weight);
+		__print_runqueue(group_cfs_rq(se), indent+1);
+	}
+}
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent)
+{
+	struct rb_node *n;
+
+	trace_printk("%.*s cfs_rq: %p, nr: %ld, load: %ld\n",
+			INDENT(indent),
+			cfs_rq, cfs_rq->nr_running,
+			cfs_rq->load.weight);
+
+	__print_entity(cfs_rq->curr, "curr", indent);
+
+	for (n = cfs_rq->rb_leftmost; n; n = rb_next(n)) {
+		struct sched_entity *se = rb_entry(n, struct sched_entity, run_node);
+		__print_entity(se, "se", indent);
+	}
+}
+
+void print_runqueue(struct cfs_rq *cfs_rq)
+{
+	__print_runqueue(cfs_rq, 0);
+}
+
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -1047,6 +1098,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 
+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -1055,6 +1107,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		flags = ENQUEUE_WAKEUP;
 	}
 
+	trace_printk("add task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	hrtick_update(rq);
 }
 
@@ -1068,6 +1123,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;
 
+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
@@ -1077,6 +1133,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		flags |= DEQUEUE_SLEEP;
 	}
 
+	trace_printk("del task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	hrtick_update(rq);
 }
 
@@ -1719,6 +1778,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;
 
+	print_runqueue(&rq->cfs);
+
 	if (!cfs_rq->nr_running)
 		return NULL;
 
@@ -1731,6 +1792,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	p = task_of(se);
 	hrtick_start_fair(rq, p);
 
+	trace_printk("picked: %p, %s/%d\n", se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	return p;
 }
 
@@ -1742,10 +1806,15 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;
 
+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	trace_printk("put: %p, %s/%d\n", se, prev->comm, prev->pid);
+	print_runqueue(&rq->cfs);
+
 }
 
 #ifdef CONFIG_SMP
@@ -3819,8 +3888,10 @@ static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;
 
+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se)
 		set_next_entity(cfs_rq_of(se), se);
+	print_runqueue(&rq->cfs);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
#ifdef CONFIG_FAIR_GROUP_SCHED

2010-12-22 21:11:59

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, Dec 22, 2010 at 06:08:38PM +0100, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
> > kernel-build is a git build using the config I already sent and after a
> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
> > saw is that the bug occurs there as well, but less frequently, so maybe
> > that's a bit harder to debug.
>
> Right, so I've got v2.6.36 exploding, current -git and -tip won't
> explode for me, tried both about 20 times. So I'll try and figure out
> wth makes .36 explode and then see if further kernel still suffer that
> problem.

I reported the issue after testing 2.6.36 and 2.6.37-rc6; both failed.
Now I tried current -git, and indeed I can't reproduce the issue there.

I'm now bisecting between master and rc6 to see which commit fixes the
problem - just to make sure it's not something irrelevant while the bug
is still hidden somewhere.



2010-12-22 23:40:04

by Miklos Vajna

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Wed, Dec 22, 2010 at 10:11:51PM +0100, Miklos Vajna <[email protected]> wrote:
> I reported the issue as I tested 2.6.36 and 2.6.37-rc6 and both failed.
> Now I tried current -git and indeed I can't reproduce the issue there.
>
> I'm now bisecting between master and rc6 to see which commit fixes the
> problem - just to make sure it's not something irrelevant and the bug is
> still hidden somewhere.

I spoke too early. After some trying, it seems I can reproduce with
current -git as well. (But as Peter points out it's harder to
reproduce.)

Just in case it's interesting, I attached the -git dmesg. From a quick
look the trace is the same as with v2.6.36, which is easy to reproduce.



2010-12-23 02:08:59

by Yong Zhang

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Thu, Dec 23, 2010 at 4:36 AM, Peter Zijlstra <[email protected]> wrote:
> On Wed, 2010-12-22 at 18:08 +0100, Peter Zijlstra wrote:
>> On Wed, 2010-12-22 at 16:14 +0100, Miklos Vajna wrote:
>> > kernel-build is a git build using the config I already sent and after a
>> > 'git checkout v2.6.36'. I can try to build master as well, so far what I
>> > saw is that the bug occurs there as well, but less frequently, so maybe
>> > that's a bit harder to debug.
>>
>> Right, so I've got v2.6.36 exploding, current -git and -tip won't
>> explode for me, tried both about 20 times. So I'll try and figure out
> >> wth makes .36 explode and then see if further kernels still suffer that
>> problem.
>
> OK, so something goes wrong with load accounting,
>
> Full log at: http://programming.kicks-ass.net/sekrit/log4
>
> I print the entire runqueue before and after each op, the last good one
> is a post-print:
>
> kworker/-9       0d..2. 2013644us : pick_next_task_fair: picked: f6a8de5c, kworker/u:1/1251
> kworker/-9       0d..2. 2013644us : print_runqueue <-pick_next_task_fair
> kworker/-9       0d..2. 2013645us : __print_runqueue:  cfs_rq: c2407c34, nr: 3, load: 3072
> kworker/-9       0d..2. 2013645us : __print_runqueue:  curr: f6a8de5c, comm: kworker/u:1/1251, load: 1024
> kworker/-9       0d..2. 2013646us : __print_runqueue:  se: f69e6300, load: 1024,
> kworker/-9       0d..2. 2013646us : __print_runqueue:    cfs_rq: f69e6540, nr: 2, load: 2048
> kworker/-9       0d..2. 2013646us : __print_runqueue:    curr: (null)
> kworker/-9       0d..2. 2013647us : __print_runqueue:    se: f69e65a0, load: 1024,
> kworker/-9       0d..2. 2013647us : __print_runqueue:      cfs_rq: f69e6240, nr: 1, load: 1024
> kworker/-9       0d..2. 2013648us : __print_runqueue:      curr: (null)
> kworker/-9       0d..2. 2013648us : __print_runqueue:      se: f725916c, comm: loadkeys/1225, load: 1024
> kworker/-9       0d..2. 2013649us : __print_runqueue:    se: f69e6660, load: 1024,
> kworker/-9       0d..2. 2013649us : __print_runqueue:      cfs_rq: f69e6600, nr: 1, load: 1024
> kworker/-9       0d..2. 2013650us : __print_runqueue:      curr: (null)
> kworker/-9       0d..2. 2013650us : __print_runqueue:      se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
> kworker/-9       0d..2. 2013651us : __print_runqueue:  se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024
>
>
> And then the next pre-print looks like:
>
> systemd--1251    0d..5. 2015398us : enqueue_task_fair <-enqueue_task
> systemd--1251    0d..5. 2015398us : print_runqueue <-enqueue_task_fair
> systemd--1251    0d..5. 2015399us : __print_runqueue:  cfs_rq: c2407c34, nr: 3, load: 3072
> systemd--1251    0d..5. 2015400us : __print_runqueue:  curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
> systemd--1251    0d..5. 2015401us : __print_runqueue:  se: f69e6300, load: 1024,
> systemd--1251    0d..5. 2015401us : __print_runqueue:    cfs_rq: f69e6540, nr: 2, load: 2048
> systemd--1251    0d..5. 2015402us : __print_runqueue:    curr: (null)
> systemd--1251    0d..5. 2015402us : __print_runqueue:    se: f69e65a0, load: 4137574976,

the load == f69e65a0 == address of se, odd

Thanks,
Yong

> systemd--1251    0d..5. 2015402us : __print_runqueue:      cfs_rq: f69e6240, nr: 1, load: 4137575360
> systemd--1251    0d..5. 2015403us : __print_runqueue:      curr: (null)
> systemd--1251    0d..5. 2015404us : __print_runqueue:      se: f725916c, comm: loadkeys/1225, load: 1024
> systemd--1251    0d..5. 2015404us : __print_runqueue:    se: f69e6660, load: 1024,
> systemd--1251    0d..5. 2015404us : __print_runqueue:      cfs_rq: f69e6600, nr: 1, load: 1024
> systemd--1251    0d..5. 2015405us : __print_runqueue:      curr: (null)
> systemd--1251    0d..5. 2015406us : __print_runqueue:      se: f6a8da0c, comm: systemd-tmpfile/1246, load: 1024
> systemd--1251    0d..5. 2015406us : __print_runqueue:  se: f6a8cd1c, comm: kworker/u:1/1252, load: 1024
>
>
> The load value being crazy like this leads to dequeue_task_fair() making
> weird decisions since it checks for load==0 to dequeue groups.
>
> Not doing that, then leads to 'dead' groups still being enqueued and
> eventually referencing freed memory etc..
>
> I've no clue why any of this happens, it could be random scribbles,
> except the se/cfs_rq combo are always off together (although not always
> by the same amount).
>
> Anyway, my brain needs sleep, I'll continue poking tomorrow.
>
> ---
>  kernel/sched_fair.c |   71 +++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 files changed, 71 insertions(+), 0 deletions(-)
>
> diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
> index db3f674..8b8b038 100644
> --- a/kernel/sched_fair.c
> +++ b/kernel/sched_fair.c
> @@ -333,6 +333,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>        s64 key = entity_key(cfs_rq, se);
>        int leftmost = 1;
>
> +       trace_printk("%p to %p, %ld tasks\n", se, cfs_rq, cfs_rq->nr_running);
> +
>        /*
>         * Find the right place in the rbtree:
>         */
> @@ -364,6 +366,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>
>  static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
>  {
> +       trace_printk("%p from %p, %ld left\n", se, cfs_rq, cfs_rq->nr_running);
> +
>        if (cfs_rq->rb_leftmost == &se->run_node) {
>                struct rb_node *next_node;
>
> @@ -394,6 +398,53 @@ static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
>        return rb_entry(last, struct sched_entity, run_node);
>  }
>
> +#define INDENT(i) ((i)*2), "                                  "
> +
> +static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent);
> +
> +static void __always_inline
> +__print_entity(struct sched_entity *se, char *name, int indent)
> +{
> +       if (!se) {
> +               trace_printk("%.*s %s: %p\n", INDENT(indent), name, se);
> +               return;
> +       }
> +
> +       if (entity_is_task(se)) {
> +               struct task_struct *p = task_of(se);
> +               trace_printk("%.*s %s: %p, comm: %s/%d, load: %ld\n",
> +                               INDENT(indent), name,
> +                               se, p->comm, p->pid, se->load.weight);
> +       } else {
> +               trace_printk("%.*s %s: %p, load: %ld, ",
> +                               INDENT(indent), name,
> +                               se, se->load.weight);
> +               __print_runqueue(group_cfs_rq(se), indent+1);
> +       }
> +}
> +
> +static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent)
> +{
> +       struct rb_node *n;
> +
> +       trace_printk("%.*s cfs_rq: %p, nr: %ld, load: %ld\n",
> +                       INDENT(indent),
> +                       cfs_rq, cfs_rq->nr_running,
> +                       cfs_rq->load.weight);
> +
> +       __print_entity(cfs_rq->curr, "curr", indent);
> +
> +       for (n = cfs_rq->rb_leftmost; n; n = rb_next(n)) {
> +               struct sched_entity *se = rb_entry(n, struct sched_entity, run_node);
> +               __print_entity(se, "se", indent);
> +       }
> +}
> +
> +void print_runqueue(struct cfs_rq *cfs_rq)
> +{
> +       __print_runqueue(cfs_rq, 0);
> +}
> +
>  /**************************************************************
>  * Scheduling class statistics methods:
>  */
> @@ -1047,6 +1098,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>        struct cfs_rq *cfs_rq;
>        struct sched_entity *se = &p->se;
>
> +       print_runqueue(&rq->cfs);
>        for_each_sched_entity(se) {
>                if (se->on_rq)
>                        break;
> @@ -1055,6 +1107,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                flags = ENQUEUE_WAKEUP;
>        }
>
> +       trace_printk("add task: %p, %s/%d\n", &p->se, p->comm, p->pid);
> +       print_runqueue(&rq->cfs);
> +
>        hrtick_update(rq);
>  }
>
> @@ -1068,6 +1123,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>        struct cfs_rq *cfs_rq;
>        struct sched_entity *se = &p->se;
>
> +       print_runqueue(&rq->cfs);
>        for_each_sched_entity(se) {
>                cfs_rq = cfs_rq_of(se);
>                dequeue_entity(cfs_rq, se, flags);
> @@ -1077,6 +1133,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                flags |= DEQUEUE_SLEEP;
>        }
>
> +       trace_printk("del task: %p, %s/%d\n", &p->se, p->comm, p->pid);
> +       print_runqueue(&rq->cfs);
> +
>        hrtick_update(rq);
>  }
>
> @@ -1719,6 +1778,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
>        struct cfs_rq *cfs_rq = &rq->cfs;
>        struct sched_entity *se;
>
> +       print_runqueue(&rq->cfs);
> +
>        if (!cfs_rq->nr_running)
>                return NULL;
>
> @@ -1731,6 +1792,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
>        p = task_of(se);
>        hrtick_start_fair(rq, p);
>
> +       trace_printk("picked: %p, %s/%d\n", se, p->comm, p->pid);
> +       print_runqueue(&rq->cfs);
> +
>        return p;
>  }
>
> @@ -1742,10 +1806,15 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
>        struct sched_entity *se = &prev->se;
>        struct cfs_rq *cfs_rq;
>
> +       print_runqueue(&rq->cfs);
>        for_each_sched_entity(se) {
>                cfs_rq = cfs_rq_of(se);
>                put_prev_entity(cfs_rq, se);
>        }
> +
> +       trace_printk("put: %p, %s/%d\n", se, prev->comm, prev->pid);
> +       print_runqueue(&rq->cfs);
> +
>  }
>
>  #ifdef CONFIG_SMP
> @@ -3819,8 +3888,10 @@ static void set_curr_task_fair(struct rq *rq)
>  {
>        struct sched_entity *se = &rq->curr->se;
>
> +       print_runqueue(&rq->cfs);
>        for_each_sched_entity(se)
>                set_next_entity(cfs_rq_of(se), se);
> +       print_runqueue(&rq->cfs);
>  }
>
>  #ifdef CONFIG_FAIR_GROUP_SCHED
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/
>



--
Only stand for myself.

2010-12-23 12:12:20

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Thu, 2010-12-23 at 10:08 +0800, Yong Zhang wrote:
> > systemd--1251 0d..5. 2015398us : enqueue_task_fair <-enqueue_task
> > systemd--1251 0d..5. 2015398us : print_runqueue <-enqueue_task_fair
> > systemd--1251 0d..5. 2015399us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
> > systemd--1251 0d..5. 2015400us : __print_runqueue: curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
> > systemd--1251 0d..5. 2015401us : __print_runqueue: se: f69e6300, load: 1024,
> > systemd--1251 0d..5. 2015401us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
> > systemd--1251 0d..5. 2015402us : __print_runqueue: curr: (null)
> > systemd--1251 0d..5. 2015402us : __print_runqueue: se: f69e65a0, load: 4137574976,
>
> the load == f69e65a0 == address of se, odd

This appears to be consistently true, I've also found that in between
these two prints, there is a free_sched_group() freeing that exact
entry. So post-print is a use-after-free artifact.

What's interesting is that it's freeing a cfs_rq struct with
nr_running=1, that should not be possible...

/me goes stare at the whole cgroup task attach vs cgroup destruction
muck.

2010-12-23 12:33:48

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Thu, 2010-12-23 at 13:12 +0100, Peter Zijlstra wrote:
> On Thu, 2010-12-23 at 10:08 +0800, Yong Zhang wrote:
> > > systemd--1251 0d..5. 2015398us : enqueue_task_fair <-enqueue_task
> > > systemd--1251 0d..5. 2015398us : print_runqueue <-enqueue_task_fair
> > > systemd--1251 0d..5. 2015399us : __print_runqueue: cfs_rq: c2407c34, nr: 3, load: 3072
> > > systemd--1251 0d..5. 2015400us : __print_runqueue: curr: f6a8de5c, comm: systemd-cgroups/1251, load: 1024
> > > systemd--1251 0d..5. 2015401us : __print_runqueue: se: f69e6300, load: 1024,
> > > systemd--1251 0d..5. 2015401us : __print_runqueue: cfs_rq: f69e6540, nr: 2, load: 2048
> > > systemd--1251 0d..5. 2015402us : __print_runqueue: curr: (null)
> > > systemd--1251 0d..5. 2015402us : __print_runqueue: se: f69e65a0, load: 4137574976,
> >
> > the load == f69e65a0 == address of se, odd
>
> This appears to be consistently true, I've also found that in between
> these two prints, there is a free_sched_group() freeing that exact
> entry. So post-print is a use-after-free artifact.
>
> What's interesting is that it's freeing a cfs_rq struct with
> nr_running=1, that should not be possible...
>
> /me goes stare at the whole cgroup task attach vs cgroup destruction
> muck.

systemd-1 0d..1. 2070793us : sched_destroy_group: se: f69e43c0, load: 1024
systemd-1 0d..1. 2070794us : sched_destroy_group: cfs_rq: f69e4720, nr: 1, load: 1024
systemd-1 0d..1. 2070794us : __print_runqueue: cfs_rq: f69e4720, nr: 1, load: 1024
systemd-1 0d..1. 2070795us : __print_runqueue: curr: (null)
systemd-1 0d..1. 2070796us : __print_runqueue: se: f6a8eb4c, comm: systemd-tmpfile/1243, load: 1024
systemd-1 0d..1. 2070796us : _raw_spin_unlock_irqrestore <-sched_destroy_group

So somehow it manages to destroy a group with a task attached.

2010-12-23 18:24:43

by Peter Zijlstra

Subject: Re: [PATCH] avoid race condition in pick_next_task_fair in kernel/sched_fair.c

On Thu, 2010-12-23 at 13:33 +0100, Peter Zijlstra wrote:

> systemd-1 0d..1. 2070793us : sched_destroy_group: se: f69e43c0, load: 1024
> systemd-1 0d..1. 2070794us : sched_destroy_group: cfs_rq: f69e4720, nr: 1, load: 1024
> systemd-1 0d..1. 2070794us : __print_runqueue: cfs_rq: f69e4720, nr: 1, load: 1024
> systemd-1 0d..1. 2070795us : __print_runqueue: curr: (null)
> systemd-1 0d..1. 2070796us : __print_runqueue: se: f6a8eb4c, comm: systemd-tmpfile/1243, load: 1024
> systemd-1 0d..1. 2070796us : _raw_spin_unlock_irqrestore <-sched_destroy_group
>
> So somehow it manages to destroy a group with a task attached.

It's even weirder:

systemd-1 0d..1. 1663489us : sched_destroy_group: se: f69e7360, load: 1024
systemd-1 0d..1. 1663489us : sched_destroy_group: cfs_rq: f69e72a0, nr: 1, load: 1024
systemd-1 0d..1. 1663491us : __print_runqueue: cfs_rq: f69e72a0, nr: 1, load: 1024, cgroup: /system/systemd-sysctl.service
systemd-1 0d..1. 1663491us : __print_runqueue: curr: (null)
systemd-1 0d..1. 1663493us : __print_runqueue: se: f69d95bc, comm: systemd-sysctl/1209, load: 1024, cgroup: /
systemd-1 0d..1. 1663496us : do_invalid_op <-error_code

The task enqueued on the cfs_rq doesn't match the cgroup. The thing is,
I don't see a cpu_cgroup_attach/sched_move_task call in the log, nor
does a BUG_ON() validating the task's cgroup against the cfs_rq's cgroup
in account_entity_enqueue() trigger.

So it looks like a task changes cgroup without passing through the
cgroup_subsys::attach method, which afaict isn't supposed to happen.

---
kernel/sched.c | 26 ++++++++++++++-
kernel/sched_fair.c | 89 +++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 113 insertions(+), 2 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index dc85ceb..bc30efb 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -8033,10 +8033,20 @@ static void free_fair_sched_group(struct task_group *tg)
 	int i;

 	for_each_possible_cpu(i) {
-		if (tg->cfs_rq)
+		if (tg->cfs_rq) {
+			struct cfs_rq *cfs_rq = tg->cfs_rq[i];
+			if (cfs_rq) {
+				trace_printk("cfs_rq: %p, nr: %ld, load: %ld\n",
+						cfs_rq, cfs_rq->nr_running, cfs_rq->load.weight);
+			}
 			kfree(tg->cfs_rq[i]);
-		if (tg->se)
+		}
+		if (tg->se) {
+			struct sched_entity *se = tg->se[i];
+			if (se)
+				trace_printk("se: %p, load: %ld\n", se, se->load.weight);
 			kfree(tg->se[i]);
+		}
 	}

 	kfree(tg->cfs_rq);
@@ -8092,6 +8102,18 @@ static inline void register_fair_sched_group(struct task_group *tg, int cpu)

 static inline void unregister_fair_sched_group(struct task_group *tg, int cpu)
 {
+	struct cfs_rq *cfs_rq = tg->cfs_rq[cpu];
+	struct sched_entity *se = tg->se[cpu];
+
+	trace_printk("se: %p, load: %ld\n", se, se->load.weight);
+	trace_printk("cfs_rq: %p, nr: %ld, load: %ld\n",
+			cfs_rq, cfs_rq->nr_running, cfs_rq->load.weight);
+
+	if (cfs_rq->nr_running) {
+		print_runqueue(cfs_rq);
+		BUG();
+	}
+
 	list_del_rcu(&tg->cfs_rq[cpu]->leaf_cfs_rq_list);
 }
 #else /* !CONFIG_FAIR_GROUP_SCHED */
diff --git a/kernel/sched_fair.c b/kernel/sched_fair.c
index db3f674..37d918a 100644
--- a/kernel/sched_fair.c
+++ b/kernel/sched_fair.c
@@ -333,6 +333,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	s64 key = entity_key(cfs_rq, se);
 	int leftmost = 1;

+	trace_printk("%p to %p, %ld tasks\n", se, cfs_rq, cfs_rq->nr_running);
+
 	/*
 	 * Find the right place in the rbtree:
 	 */
@@ -364,6 +366,8 @@ static void __enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)

 static void __dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 {
+	trace_printk("%p from %p, %ld left\n", se, cfs_rq, cfs_rq->nr_running);
+
 	if (cfs_rq->rb_leftmost == &se->run_node) {
 		struct rb_node *next_node;

@@ -394,6 +398,70 @@ static struct sched_entity *__pick_last_entity(struct cfs_rq *cfs_rq)
 	return rb_entry(last, struct sched_entity, run_node);
 }

+#define INDENT(i) ((i)*2), "                                  "
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent);
+
+static __always_inline
+char *task_group_path(struct task_group *tg, char *buf, int buflen)
+{
+	/* may be NULL if the underlying cgroup isn't fully-created yet */
+	if (!tg->css.cgroup)
+		buf[0] = '\0';
+	else
+		cgroup_path(tg->css.cgroup, buf, buflen);
+
+	return buf;
+}
+
+static void __always_inline
+__print_entity(struct sched_entity *se, char *name, int indent)
+{
+	if (!se) {
+		trace_printk("%.*s %s: %p\n", INDENT(indent), name, se);
+		return;
+	}
+
+	if (entity_is_task(se)) {
+		struct task_struct *p = task_of(se);
+		char buf[64];
+
+		trace_printk("%.*s %s: %p, comm: %s/%d, load: %ld, cgroup: %s\n",
+				INDENT(indent), name,
+				se, p->comm, p->pid, se->load.weight,
+				task_group_path(task_group(p), buf, sizeof(buf)));
+	} else {
+		trace_printk("%.*s %s: %p, load: %ld, ",
+				INDENT(indent), name,
+				se, se->load.weight);
+		__print_runqueue(group_cfs_rq(se), indent+1);
+	}
+}
+
+static void notrace __print_runqueue(struct cfs_rq *cfs_rq, int indent)
+{
+	struct rb_node *n;
+	char buf[64];
+
+	trace_printk("%.*s cfs_rq: %p, nr: %ld, load: %ld, cgroup: %s\n",
+			INDENT(indent),
+			cfs_rq, cfs_rq->nr_running,
+			cfs_rq->load.weight,
+			task_group_path(cfs_rq->tg, buf, sizeof(buf)));
+
+	__print_entity(cfs_rq->curr, "curr", indent);
+
+	for (n = cfs_rq->rb_leftmost; n; n = rb_next(n)) {
+		struct sched_entity *se = rb_entry(n, struct sched_entity, run_node);
+		__print_entity(se, "se", indent);
+	}
+}
+
+void notrace print_runqueue(struct cfs_rq *cfs_rq)
+{
+	__print_runqueue(cfs_rq, 0);
+}
+
 /**************************************************************
  * Scheduling class statistics methods:
  */
@@ -631,6 +699,7 @@ account_entity_enqueue(struct cfs_rq *cfs_rq, struct sched_entity *se)
 	if (entity_is_task(se)) {
 		add_cfs_task_weight(cfs_rq, se->load.weight);
 		list_add(&se->group_node, &cfs_rq->tasks);
+		BUG_ON(task_group(task_of(se)) != cfs_rq->tg);
 	}
 	cfs_rq->nr_running++;
 	se->on_rq = 1;
@@ -1047,6 +1116,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;

+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		if (se->on_rq)
 			break;
@@ -1055,6 +1125,9 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		flags = ENQUEUE_WAKEUP;
 	}

+	trace_printk("add task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	hrtick_update(rq);
 }

@@ -1068,6 +1141,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 	struct cfs_rq *cfs_rq;
 	struct sched_entity *se = &p->se;

+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		dequeue_entity(cfs_rq, se, flags);
@@ -1077,6 +1151,9 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		flags |= DEQUEUE_SLEEP;
 	}

+	trace_printk("del task: %p, %s/%d\n", &p->se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	hrtick_update(rq);
 }

@@ -1719,6 +1796,8 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	struct cfs_rq *cfs_rq = &rq->cfs;
 	struct sched_entity *se;

+	print_runqueue(&rq->cfs);
+
 	if (!cfs_rq->nr_running)
 		return NULL;

@@ -1731,6 +1810,9 @@ static struct task_struct *pick_next_task_fair(struct rq *rq)
 	p = task_of(se);
 	hrtick_start_fair(rq, p);

+	trace_printk("picked: %p, %s/%d\n", se, p->comm, p->pid);
+	print_runqueue(&rq->cfs);
+
 	return p;
 }

@@ -1742,10 +1824,15 @@ static void put_prev_task_fair(struct rq *rq, struct task_struct *prev)
 	struct sched_entity *se = &prev->se;
 	struct cfs_rq *cfs_rq;

+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se) {
 		cfs_rq = cfs_rq_of(se);
 		put_prev_entity(cfs_rq, se);
 	}
+
+	trace_printk("put: %p, %s/%d\n", se, prev->comm, prev->pid);
+	print_runqueue(&rq->cfs);
+
 }

 #ifdef CONFIG_SMP
@@ -3819,8 +3906,10 @@ static void set_curr_task_fair(struct rq *rq)
 {
 	struct sched_entity *se = &rq->curr->se;

+	print_runqueue(&rq->cfs);
 	for_each_sched_entity(se)
 		set_next_entity(cfs_rq_of(se), se);
+	print_runqueue(&rq->cfs);
 }

 #ifdef CONFIG_FAIR_GROUP_SCHED