2008-03-10 18:01:32

by Hiroshi Shimamoto

Subject: [PATCH] sched: fix race in schedule

Hi Ingo,

I found a race condition in the scheduler.
The first report is below;
http://lkml.org/lkml/2008/2/26/459

It took a fairly long time to investigate and I couldn't spend much time
on it last week. It is hard to reproduce, but -rt is a little easier
because it has preemptible spin locks and RCU.

Could you please check the scenario and the patch?
It will be needed for -stable, too.

---
From: Hiroshi Shimamoto <[email protected]>

There is a race condition between schedule() and some dequeue/enqueue
functions: rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processors, and it can drop the rq lock. This means that
those three functions can observe on_rq=0 and running=1. The current
task should be put whenever it is running.

Here is a possible scenario;

    CPU0                            CPU1
     |                              schedule()
     |                              ->deactivate_task()
     |                              ->idle_balance()
     |                              -->load_balance_newidle()
    rt_mutex_setprio()               |
     |                              --->double_lock_balance()
    *get lock                       *rel lock
    * on_rq=0, running=1             |
    * sched_class is changed         |
    *rel lock                       *get lock
     :                               |
                                     :
                                    ->put_prev_task_rt()
                                    ->pick_next_task_fair()
                                    => panic

CPU1's current process (P1) is in the middle of schedule(). P1 has been
deactivated, and the scheduler looks for another process on other CPUs'
runqueues because CPU1 is about to go idle. idle_balance(),
load_balance_newidle() and double_lock_balance() are called, and
double_lock_balance() can drop the rq lock. Meanwhile, CPU0 is boosting
the priority of P1, so only P1's prio and sched_class are changed to RT;
the sched entities of P1 and P1's group are never put. This leaves the
cfs_rq invalid, because the cfs_rq has a curr but no leaf, yet
pick_next_task_fair() is called and the kernel panics.

Signed-off-by: Hiroshi Shimamoto <[email protected]>
---
kernel/sched.c | 38 ++++++++++++++++----------------------
1 files changed, 16 insertions(+), 22 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 52b9867..eedf748 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4268,11 +4268,10 @@ void rt_mutex_setprio(struct task_struct *p, int prio)
oldprio = p->prio;
on_rq = p->se.on_rq;
running = task_current(rq, p);
- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(rq, p);
- }
+ if (running)
+ p->sched_class->put_prev_task(rq, p);

if (rt_prio(prio))
p->sched_class = &rt_sched_class;
@@ -4281,10 +4280,9 @@ void rt_mutex_setprio(struct task_struct *p, int prio)

p->prio = prio;

+ if (running)
+ p->sched_class->set_curr_task(rq);
if (on_rq) {
- if (running)
- p->sched_class->set_curr_task(rq);
-
enqueue_task(rq, p, 0);

check_class_changed(rq, p, prev_class, oldprio, running);
@@ -4581,19 +4579,17 @@ recheck:
update_rq_clock(rq);
on_rq = p->se.on_rq;
running = task_current(rq, p);
- if (on_rq) {
+ if (on_rq)
deactivate_task(rq, p, 0);
- if (running)
- p->sched_class->put_prev_task(rq, p);
- }
+ if (running)
+ p->sched_class->put_prev_task(rq, p);

oldprio = p->prio;
__setscheduler(rq, p, policy, param->sched_priority);

+ if (running)
+ p->sched_class->set_curr_task(rq);
if (on_rq) {
- if (running)
- p->sched_class->set_curr_task(rq);
-
activate_task(rq, p, 0);

check_class_changed(rq, p, prev_class, oldprio, running);
@@ -7617,11 +7613,10 @@ void sched_move_task(struct task_struct *tsk)
running = task_current(rq, tsk);
on_rq = tsk->se.on_rq;

- if (on_rq) {
+ if (on_rq)
dequeue_task(rq, tsk, 0);
- if (unlikely(running))
- tsk->sched_class->put_prev_task(rq, tsk);
- }
+ if (unlikely(running))
+ tsk->sched_class->put_prev_task(rq, tsk);

set_task_rq(tsk, task_cpu(tsk));

@@ -7630,11 +7625,10 @@ void sched_move_task(struct task_struct *tsk)
tsk->sched_class->moved_group(tsk);
#endif

- if (on_rq) {
- if (unlikely(running))
- tsk->sched_class->set_curr_task(rq);
+ if (unlikely(running))
+ tsk->sched_class->set_curr_task(rq);
+ if (on_rq)
enqueue_task(rq, tsk, 0);
- }

task_rq_unlock(rq, &flags);
}
--
1.5.4.1
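
To make the invariant concrete outside the kernel tree, here is a minimal
user-space model of the window the changelog describes. Every name in it
is a simplified stand-in, not the real kernel API:

/*
 * Toy model: the on_rq=0, running=1 window and the check the patch changes.
 */
#include <stdbool.h>
#include <stdio.h>

struct task {
	bool on_rq;        /* stands in for p->se.on_rq */
	bool running;      /* stands in for task_current(rq, p) */
	const char *class; /* stands in for p->sched_class: "fair" or "rt" */
};

/* Old shape: put_prev_task() was only reachable when on_rq was set. */
static void change_class_old(struct task *p, const char *newclass)
{
	if (p->on_rq) {
		/* dequeue_task(rq, p, 0); */
		if (p->running)
			printf("put_prev_task(%s)\n", p->class);
	}
	p->class = newclass;
}

/* New shape: running is tested on its own, so a task that has been
 * deactivated but is still current (on_rq=0, running=1) is still put. */
static void change_class_new(struct task *p, const char *newclass)
{
	/* if (p->on_rq) dequeue_task(rq, p, 0); */
	if (p->running)
		printf("put_prev_task(%s)\n", p->class);
	p->class = newclass;
}

int main(void)
{
	/* The racy window: schedule() has deactivated the task (on_rq=0)
	 * but has not switched away from it yet (running=1). */
	struct task p = { .on_rq = false, .running = true, .class = "fair" };

	change_class_old(&p, "rt"); /* prints nothing: fair entity never put */
	p.class = "fair";
	change_class_new(&p, "rt"); /* prints: put_prev_task(fair) */
	return 0;
}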


2008-03-10 18:37:17

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule

On Mon, 2008-03-10 at 11:01 -0700, Hiroshi Shimamoto wrote:
> Hi Ingo,
>
> I found a race condition in the scheduler.
> The first report is below;
> http://lkml.org/lkml/2008/2/26/459
>
> [ ... ]
>
> CPU1's current process (P1) is in the middle of schedule(). P1 has been
> deactivated, and the scheduler looks for another process on other CPUs'
> runqueues because CPU1 is about to go idle. idle_balance(),
> load_balance_newidle() and double_lock_balance() are called, and
> double_lock_balance() can drop the rq lock. Meanwhile, CPU0 is boosting
> the priority of P1, so only P1's prio and sched_class are changed to RT;
> the sched entities of P1 and P1's group are never put. This leaves the
> cfs_rq invalid, because the cfs_rq has a curr but no leaf, yet
> pick_next_task_fair() is called and the kernel panics.

Very nice catch, this had me puzzled for a while. I'm not quite sure I
fully understand. Could you explain why the below isn't sufficient?

---
diff --git a/kernel/sched.c b/kernel/sched.c
index a0c79e9..ebd9fc5 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4067,10 +4067,11 @@ need_resched_nonpreemptible:
prev->sched_class->pre_schedule(rq, prev);
#endif

+ prev->sched_class->put_prev_task(rq, prev);
+
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- prev->sched_class->put_prev_task(rq, prev);
next = pick_next_task(rq, prev);

sched_info_switch(prev, next);

2008-03-10 20:02:50

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Mon, 2008-03-10 at 11:01 -0700, Hiroshi Shimamoto wrote:
>> [ ... ]
>
> Very nice catch, this had me puzzled for a while. I'm not quite sure I
> fully understand. Could you explain why the below isn't sufficient?

thanks, your patch looks nice to me.
I had focused on the setprio, on_rq=0 and running=1 situation, which led
me to fix those functions.
But I've just noticed one point: I'm not sure about the same situation
with sched_rt. I think the rt pre_schedule() has a chance to drop the rq
lock. Is that OK?


2008-03-10 20:34:48

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule


On Mon, 2008-03-10 at 13:01 -0700, Hiroshi Shimamoto wrote:

> thanks, your patch looks nice to me.
> I had focused on the setprio, on_rq=0 and running=1 situation, which led
> me to fix those functions.
> But I've just noticed one point: I'm not sure about the same situation
> with sched_rt. I think the rt pre_schedule() has a chance to drop the rq
> lock. Is that OK?

Ah, you are quite right, that'll teach me to rush out a patch just
because dinner is ready :-).

How about we submit the following patch for mainline and CC -stable to
fix .23 and .24:

---
From: Hiroshi Shimamoto <[email protected]>

There is a race condition between schedule() and some dequeue/enqueue
functions: rt_mutex_setprio(), __setscheduler() and sched_move_task().

When scheduling to idle, idle_balance() is called to pull tasks from
other busy processors, and it can drop the rq lock. This means that
those three functions can observe on_rq=0 and running=1. The current
task should be put whenever it is running.

Here is a possible scenario;

    CPU0                            CPU1
     |                              schedule()
     |                              ->deactivate_task()
     |                              ->idle_balance()
     |                              -->load_balance_newidle()
    rt_mutex_setprio()               |
     |                              --->double_lock_balance()
    *get lock                       *rel lock
    * on_rq=0, running=1             |
    * sched_class is changed         |
    *rel lock                       *get lock
     :                               |
                                     :
                                    ->put_prev_task_rt()
                                    ->pick_next_task_fair()
                                    => panic

CPU1's current process (P1) is in the middle of schedule(). P1 has been
deactivated, and the scheduler looks for another process on other CPUs'
runqueues because CPU1 is about to go idle. idle_balance(),
load_balance_newidle() and double_lock_balance() are called, and
double_lock_balance() can drop the rq lock. Meanwhile, CPU0 is boosting
the priority of P1, so only P1's prio and sched_class are changed to RT;
the sched entities of P1 and P1's group are never put. This leaves the
cfs_rq invalid, because the cfs_rq has a curr but no leaf, yet
pick_next_task_fair() is called and the kernel panics.

Signed-off-by: Hiroshi Shimamoto <[email protected]>
Signed-off-by: Peter Zijlstra <[email protected]>
CC: [email protected]
---
kernel/sched.c | 8 +++++++-
1 file changed, 7 insertions(+), 1 deletion(-)

Index: linux-2.6-2/kernel/sched.c
===================================================================
--- linux-2.6-2.orig/kernel/sched.c
+++ linux-2.6-2/kernel/sched.c
@@ -4062,6 +4062,13 @@ need_resched_nonpreemptible:
switch_count = &prev->nvcsw;
}

+ /*
+ * ->pre_schedule() and idle_balance() can release the rq->lock so we
+ * have to call ->put_prev_task() before we do the balancing calls,
+ * otherwise it's possible to see the rq in an inconsistent state.
+ */
+ prev->sched_class->put_prev_task(rq, prev);
+
#ifdef CONFIG_SMP
if (prev->sched_class->pre_schedule)
prev->sched_class->pre_schedule(rq, prev);
@@ -4070,7 +4077,6 @@ need_resched_nonpreemptible:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- prev->sched_class->put_prev_task(rq, prev);
next = pick_next_task(rq, prev);

sched_info_switch(prev, next);

2008-03-10 20:54:32

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Mon, 2008-03-10 at 13:01 -0700, Hiroshi Shimamoto wrote:
>
>> [ ... ]
>
> Ah, you are quite right, that'll teach me to rush out a patch just
> because dinner is ready :-).
>
> How about we submit the following patch for mainline and CC -stable to
> fix .23 and .24:

thanks for working on this; I'm OK with it and will test it soon.
IIRC, it came from group scheduling, so .23 probably doesn't have this issue.

thanks.


2008-03-10 21:02:28

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule


On Mon, 2008-03-10 at 13:54 -0700, Hiroshi Shimamoto wrote:
> Peter Zijlstra wrote:
> > [ ... ]
> >
> > How about we submit the following patch for mainline and CC -stable to
> > fix .23 and .24:
>
> thanks for working on this; I'm OK with it and will test it soon.
> IIRC, it came from group scheduling, so .23 probably doesn't have this issue.

It might not have this exact race, but I've checked both .23 and .24; both
can unlock the rq before we do ->put_prev_task(), leaving a window for
potential nasties. I'd rather be safe than sorry :-)

2008-03-10 21:08:19

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Mon, 2008-03-10 at 13:54 -0700, Hiroshi Shimamoto wrote:
>> [ ... ]
>>
>> thanks for working on this; I'm OK with it and will test it soon.
>> IIRC, it came from group scheduling, so .23 probably doesn't have this issue.
>
> It might not have this exact race, but I've checked both .23 and .24; both
> can unlock the rq before we do ->put_prev_task(), leaving a window for
> potential nasties. I'd rather be safe than sorry :-)

Ah, you're correct. I hadn't gotten past the first situation yet :-)

thanks,
Hiroshi Shimamoto

2008-03-11 02:13:00

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Mon, 2008-03-10 at 13:01 -0700, Hiroshi Shimamoto wrote:
>
>> [ ... ]
>
> Ah, you are quite right, that'll teach me to rush out a patch just
> because dinner is ready :-).
>
> How about we submit the following patch for mainline and CC -stable to
> fix .23 and .24:
>

Unfortunately, I encountered a similar panic with this patch on -rt.
I'll look into this again; I might have missed something...

Unable to handle kernel NULL pointer dereference at 0000000000000128 RIP:
[<ffffffff802297f5>] pick_next_task_fair+0x2d/0x42
PGD 13dbb2067 PUD 15146a067 PMD 0
Oops: 0000 [1] PREEMPT SMP
CPU 3
Modules linked in:
Pid: 31981, comm: dbench Not tainted 2.6.24.3-rt3 #1
RIP: 0010:[<ffffffff802297f5>] [<ffffffff802297f5>] pick_next_task_fair+0x2d/0x42
RSP: 0018:ffff8101d75b5b38 EFLAGS: 00010046
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000639
RDX: ffff810005009680 RSI: 0000000000000003 RDI: ffff8100050216e0
RBP: ffff8101d75b5b48 R08: ffff81000501dac0 R09: 0000000000000002
R10: ffff8101d75b5b08 R11: ffff81000501dac0 R12: ffff810005009680
R13: 0000000000000000 R14: ffff810005021680 R15: 00000001002ee6d0
FS: 00002b93ea5fe6f0(0000) GS:ffff81022fd28bc0(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 0000000000000128 CR3: 000000013dbca000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process dbench (pid: 31981, threadinfo ffff8101d75b4000, task ffff8101515f4100)
Stack: ffff8101d75b5b48 0000000000000000 ffff8101d75b5bd8 ffffffff804d98d5
ffff8101d75b5ba0 ffffffff8022f2b6 00000003e9550280 ffff8101515f4100
ffff8101d75b5b98 ffff8101515f4440 00000000000000ff ffffffff804db74f
Call Trace:
[<ffffffff804d98d5>] __schedule+0x414/0x775
[<ffffffff8022f2b6>] add_preempt_count+0x18/0xb2
[<ffffffff804db74f>] __spin_unlock+0x14/0x2e
[<ffffffff804d9f30>] schedule+0xdf/0xff
[<ffffffff804da77d>] rt_spin_lock_slowlock+0xf9/0x19e
[<ffffffff804db100>] __rt_spin_lock+0x6b/0x70
[<ffffffff804db10e>] rt_spin_lock+0x9/0xb
[<ffffffff802de7f0>] journal_invalidatepage+0xdd/0x282
[<ffffffff802d149f>] ext3_invalidatepage+0x38/0x3a
[<ffffffff8026e6cf>] do_invalidatepage+0x23/0x25
[<ffffffff8026ecf5>] truncate_complete_page+0x30/0x4e
[<ffffffff8026eddb>] truncate_inode_pages_range+0xc8/0x302
[<ffffffff8026f022>] truncate_inode_pages+0xd/0xf
[<ffffffff802d1956>] ext3_delete_inode+0x18/0xd8
[<ffffffff802d193e>] ext3_delete_inode+0x0/0xd8
[<ffffffff8029d307>] generic_delete_inode+0x7b/0xfb
[<ffffffff8029d39e>] generic_drop_inode+0x17/0x16f
[<ffffffff8029c898>] iput+0x7c/0x80
[<ffffffff8029433b>] do_unlinkat+0xf5/0x150
[<ffffffff8028d8a6>] sys_newstat+0x31/0x3c
[<ffffffff802943a7>] sys_unlink+0x11/0x13
[<ffffffff8020c19e>] system_call+0x7e/0x83
Code: 48 8b bb 28 01 00 00 48 85 ff 75 dd 48 8d 43 b8 41 58 5b 5d
RIP [<ffffffff802297f5>] pick_next_task_fair+0x2d/0x42
RSP <ffff8101d75b5b38>

thanks,
Hiroshi Shimamoto

2008-03-11 08:40:54

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule

On Mon, 2008-03-10 at 19:12 -0700, Hiroshi Shimamoto wrote:
> Peter Zijlstra wrote:
> > [ ... ]
>
> Unfortunately, I encountered a similar panic with this patch on -rt.
> I'll look into this again; I might have missed something...
>
> Unable to handle kernel NULL pointer dereference at 0000000000000128 RIP:
> [<ffffffff802297f5>] pick_next_task_fair+0x2d/0x42

:-(

OK, so that means I'm not getting it.

So what does your patch do that mine doesn't?

It removes the dependency of running (=task_current()) from on_rq
(p->se.on_rq).

So how can a current task not be on the runqueue?

Only sched.c:dequeue_task() and sched_fair.c:account_entity_dequeue()
set on_rq to 0, the only one changing rq->curr is schedule().

So the only scheme I can come up with is that we did dequeue p (on_rq ==
0), but we didn't yet schedule so rq->curr == p.

Is this how you ended up with your previous analysis that it must be due
to a hole introduced by double_lock_balance()?

Because now we can seemingly call deactivate_task() and put_prev_task()
in non-atomic fashion, but by placing the put_prev_task() before the
load balance calls we should avoid doing that.

So what else is going on... /me puzzled


2008-03-11 17:10:50

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Mon, 2008-03-10 at 19:12 -0700, Hiroshi Shimamoto wrote:
>> [ ... ]
>>
>> Unfortunately, I encountered a similar panic with this patch on -rt.
>> I'll look into this again; I might have missed something...
>>
>> Unable to handle kernel NULL pointer dereference at 0000000000000128 RIP:
>> [<ffffffff802297f5>] pick_next_task_fair+0x2d/0x42
>
> :-(
>
> OK, so that means I'm not getting it.
>
> So what does your patch do that mine doesn't?
>
> It removes the dependency of running (=task_current()) from on_rq
> (p->se.on_rq).
>
> So how can a current task not be on the runqueue?
>
> Only sched.c:dequeue_task() and sched_fair.c:account_entity_dequeue()
> set on_rq to 0, the only one changing rq->curr is schedule().
>
> So the only scheme I can come up with is that we did dequeue p (on_rq ==
> 0), but we didn't yet schedule so rq->curr == p.
>
> Is this how you ended up with your previous analysis that it must be due
> to a hole introduced by double_lock_balance()?
>
> Because now we can seemingly call deactivate_task() and put_prev_task()
> in non-atomic fashion, but by placing the put_prev_task() before the
> load balance calls we should avoid doing that.
>
> So what else is going on... /me puzzled

thanks for taking the time.
I've tested it with a debug tracer; it took from several hours to half a
day to reproduce on my box, and I got a possible scenario.

Before beginning, note that se->on_rq is changed only in enqueue_task()
and dequeue_task() in sched.c.

Here is the flow to the panic that I got;

    CPU0                                 CPU1
     |                                   schedule()
     |                                   ->deactivate_task()
     |                                   -->dequeue_task()
     |                                    * on_rq=0
     |                                   ->put_prev_task_fair()
     |                                   ->idle_balance()
     |                                   -->load_balance_newidle()
    (a wakeup function)                   |
     |                                   --->double_lock_balance()
    *get lock                            *rel lock
    * wake up target is CPU1's curr       |
    ->enqueue_task()                      |
     * on_rq=1                            |
    ->rt_mutex_setprio()                  |
     * on_rq=1, running=1                 |
    -->dequeue_task()!!                   |
    -->put_prev_task_fair()!!             |
     * sched_class is changed             |
    -->set_curr_task_fair()!!             |
    -->enqueue_task()!!                   |
    *rel lock                            *get lock again
     :                                    |
                                          :
                                         ->pick_next_task_fair()
                                         => panic

The difference from the previous scenario is that someone enqueues CPU1's
current task before setprio; that makes on_rq=1, which causes
set_curr_task_fair() to be called. It means the cfs_rq state for the
current task is set back to what it was before put_prev_task_fair(), I
think.

Does this scenario help to solve this issue?
I can only reproduce it on -rt, so I have tested it on -rt only.

thanks,
Hiroshi Shimamoto
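
The "!!" lines are the crux: put_prev_task_fair() re-links the current
entity, so running the put machinery twice links the same node twice. A
toy model (an intrusive sorted list standing in for the cfs_rq rb-tree;
none of this is CFS code) shows how linking an already-linked node
corrupts the structure:

#include <stdio.h>

struct entity {
	int key;
	struct entity *next;
};

static struct entity *head;

/* Sorted insert. Like a real tree-link helper, it blindly overwrites the
 * node's own link field, so inserting an already-linked node corrupts
 * the structure. */
static void enqueue_entity(struct entity *e)
{
	struct entity **pp = &head;

	while (*pp && (*pp)->key < e->key)
		pp = &(*pp)->next;
	e->next = *pp;
	*pp = e;
}

int main(void)
{
	struct entity a = { .key = 1 }, b = { .key = 2 };
	struct entity *e;
	int n;

	enqueue_entity(&a);
	enqueue_entity(&b);
	enqueue_entity(&a); /* the second "put": a gets linked onto itself */

	/* a now points to itself and b is unreachable; a walker loops. */
	for (e = head, n = 0; e && n < 4; e = e->next, n++)
		printf("key=%d\n", e->key); /* prints key=1 four times */
	return 0;
}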

2008-03-11 23:38:46

by Dmitry Adamushko

Subject: Re: [PATCH] sched: fix race in schedule

On 11/03/2008, Hiroshi Shimamoto <[email protected]> wrote:
> Peter Zijlstra wrote:
> > [ ... ]
>
> thanks for taking the time.
> I've tested it with a debug tracer; it took from several hours to half a
> day to reproduce on my box, and I got a possible scenario.
>
> Before beginning, note that se->on_rq is changed only in enqueue_task()
> and dequeue_task() in sched.c.
>
> Here is the flow to the panic that I got;
>
> [ ... ]
>     ->rt_mutex_setprio()                  |
>      * on_rq=1, running=1                 |
>     -->dequeue_task()!!                   |
>     -->put_prev_task_fair()!!             |

humm... this one should have caused the problem.

->put_prev_task() has previously been done in schedule(), so we get two
consecutive ->put_prev_task() calls without set_curr_task/pick_next_task()
being called in between
[ as a result, __enqueue_entity() is called twice for CPU1's curr and
that definitely corrupts the rb-tree ]

your initial patch doesn't have this problem. humm... logically, it
looks like a change of the 'current', which can be expressed by the
pair:

(1) put_prev_task() + (2) pick_next_task() or set_curr_task()
(both end up calling set_next_entity())

and this pair has to be 'atomic' wrt the rq->lock.

For schedule() that also involves a change of rq->curr.

Your initial patch seems to qualify for this rule... but I'm still
thinking about whether the current scheme is a bit 'hairy' :-/



--
Best regards,
Dmitry Adamushko
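
The pairing rule above can be written down as a tiny checkable model. The
names below are stand-ins, and the single flag is of course a crude
approximation of the real rq state:

#include <assert.h>
#include <stdbool.h>

/* True between put_prev_task() and the matching pick_next_task() or
 * set_curr_task(); the rq is in an intermediate state that must never be
 * visible to anyone else who can take the lock. */
static bool curr_is_put;

static void put_prev_task(void)
{
	assert(!curr_is_put); /* trips on the double put in the trace above */
	curr_is_put = true;
}

static void set_next_task(void) /* pick_next_task() or set_curr_task() */
{
	assert(curr_is_put);
	curr_is_put = false;
}

static void unlock_rq(void)
{
	/* the rule: dropping rq->lock mid-pair publishes the broken state */
	assert(!curr_is_put);
}

int main(void)
{
	/* Hiroshi's original patch keeps the pair atomic under the lock: */
	put_prev_task();
	set_next_task();
	unlock_rq();

	/* The interim schedule() fix puts first and then lets
	 * pre_schedule()/idle_balance() drop the lock: */
	put_prev_task();
	/* unlock_rq(); */ /* would abort: the pair is not closed yet */
	set_next_task();
	return 0;
}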

2008-03-12 13:29:22

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule

On Wed, 2008-03-12 at 00:38 +0100, Dmitry Adamushko wrote:
> On 11/03/2008, Hiroshi Shimamoto <[email protected]> wrote:
> > [ ... ]
> >
> > -->dequeue_task()!!                   |
> > -->put_prev_task_fair()!!             |
>
> humm... this one should have caused the problem.
>
> ->put_prev_task() has previously been done in schedule(), so we get two
> consecutive ->put_prev_task() calls without set_curr_task/pick_next_task()
> being called in between
> [ as a result, __enqueue_entity() is called twice for CPU1's curr and
> that definitely corrupts the rb-tree ]
>
> your initial patch doesn't have this problem. humm... logically, it
> looks like a change of the 'current', which can be expressed by the
> pair:
>
> (1) put_prev_task() + (2) pick_next_task() or set_curr_task()
> (both end up calling set_next_entity())
>
> and this pair has to be 'atomic' wrt the rq->lock.
>
> For schedule() that also involves a change of rq->curr.

Right, this seems to 'rely' on rq->curr lagging behind put_prev_task().
So by doing something like:

---
diff --git a/kernel/sched.c b/kernel/sched.c
index a0c79e9..62d796f 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -4061,6 +4061,8 @@ need_resched_nonpreemptible:
}
switch_count = &prev->nvcsw;
}
+ prev->sched_class->put_prev_task(rq, prev);
+ rq->curr = rq->idle;

#ifdef CONFIG_SMP
if (prev->sched_class->pre_schedule)
@@ -4070,14 +4072,13 @@ need_resched_nonpreemptible:
if (unlikely(!rq->nr_running))
idle_balance(cpu, rq);

- prev->sched_class->put_prev_task(rq, prev);
next = pick_next_task(rq, prev);
+ rq->curr = next;

sched_info_switch(prev, next);

if (likely(prev != next)) {
rq->nr_switches++;
- rq->curr = next;
++*switch_count;

context_switch(rq, prev, next); /* unlocks the rq */
---

We would avoid being considered running while we're not.

We set rq->curr to rq->idle, because rq->curr is assumed valid at all
times. Also, should we clear TIF_NEED_RESCHED right before
pick_next_task()?

> Your initial patch seems to qualify for this rule... but I'm still
> thinking about whether the current scheme is a bit 'hairy' :-/

I'm not liking the initial patch much because it allows for a situation
where we're not on the RQ but are considered running. That just doesn't
make sense to me.

2008-03-12 14:49:05

by Dmitry Adamushko

Subject: Re: [PATCH] sched: fix race in schedule

On 12/03/2008, Peter Zijlstra <[email protected]> wrote:
>
> [ ... ]
>
> Right, this seems to 'rely' on rq->curr lagging behind put_prev_task().
> So by doing something like:
>
> [ ... ]

hum, I'm quite suspicious about this approach... mainly due to the
fact that I think your next assumption is wrong
(unless we specify 'running' wrt whom):

>
> We would avoid being considered running while we're not.
>

the fact is that we (i.e. 'prev') are actually running on a cpu until
some point in context_switch().

At the very least, the load balancer has to know exactly who is the
'current' on other cpus in order to treat such tasks differently.

With this patch, the load balancer can be confused/broken by the fact
that 'prev' is considered for migration as a "not-on-rq and
not-running" task [ from another cpu, at the moment when either
pre_schedule() or idle_balance() drops the rq->lock of prev's cpu ].

well, the version of task_current() for __ARCH_WANT_UNLOCKED_CTXSW
would fix it if used by default.

But maybe there is something else that would be exposed by the
'rq->curr = rq->idle' manipulation... I can't provide examples right
now though (I need to think on it).


--
Best regards,
Dmitry Adamushko
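
For readers without the tree at hand: the helper Dmitry refers to
distinguishes "is rq->curr" from "is physically on the CPU". The sketch
below is a self-contained paraphrase of the 2.6.24-era helpers with
simplified types, not an exact quote of kernel/sched.c:

#include <stdbool.h>

struct task_struct {
	bool oncpu; /* set while the task is physically running on a CPU */
};

struct rq {
	struct task_struct *curr;
};

static inline bool task_current(struct rq *rq, struct task_struct *p)
{
	return rq->curr == p;
}

static inline bool task_running(struct rq *rq, struct task_struct *p)
{
#ifdef __ARCH_WANT_UNLOCKED_CTXSW
	/* p->oncpu is cleared only after the context switch completes, so
	 * prev stays "running" here even once rq->curr points elsewhere. */
	return p->oncpu;
#else
	return task_current(rq, p);
#endif
}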

2008-03-12 14:57:29

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule

On Wed, 2008-03-12 at 15:48 +0100, Dmitry Adamushko wrote:
> On 12/03/2008, Peter Zijlstra <[email protected]> wrote:
> > [ ... ]
>
> hum, I'm quite suspicious about this approach... mainly due to the
> fact that I think your next assumption is wrong
> (unless we specify 'running' wrt whom):
>
> > We would avoid being considered running while we're not.
>
> [ ... ]
>
> But maybe there is something else that would be exposed by the
> 'rq->curr = rq->idle' manipulation... I can't provide examples right
> now though (I need to think on it).

I share your concerns, I don't really like it either. Just spewing out
ideas here - bad ideas it seems :-/

Ingo also suggested moving the balance calls right before
deactivate_task(), but that gives a whole other set of headaches.


2008-03-14 17:59:18

by Hiroshi Shimamoto

Subject: Re: [PATCH] sched: fix race in schedule

Peter Zijlstra wrote:
> On Wed, 2008-03-12 at 15:48 +0100, Dmitry Adamushko wrote:
>> [ ... ]
>
> I share your concerns, I don't really like it either. Just spewing out
> ideas here - bad ideas it seems :-/
>
> Ingo also suggested moving the balance calls right before
> deactivate_task(), but that gives a whole other set of headaches.
>

Well, what shall we do about this issue?
I see you're thinking about fixing the inconsistency in the scheduler,
right? I agree with that.

However, I don't think it's good to leave this issue in the -stable
kernel for a long time.

Could you please let me know what I can do?

thanks,
Hiroshi Shimamoto

2008-03-14 22:47:21

by Dmitry Adamushko

Subject: Re: [PATCH] sched: fix race in schedule

On 14/03/2008, Hiroshi Shimamoto <[email protected]> wrote:
>
> [ ... ]
>
> Well, what shall we do about this issue?
> I see you're thinking about fixing the inconsistency in the scheduler,
> right? I agree with that.
>
> However, I don't think it's good to leave this issue in the -stable
> kernel for a long time.
>
> Could you please let me know what I can do?

IMHO, the safest solution for the time being (and esp. for -stable)
would be to proceed with Hiroshi's patch. It looks safe and it does
fix a real problem.



--
Best regards,
Dmitry Adamushko

2008-03-14 22:57:49

by Peter Zijlstra

Subject: Re: [PATCH] sched: fix race in schedule

On Fri, 2008-03-14 at 23:47 +0100, Dmitry Adamushko wrote:
> On 14/03/2008, Hiroshi Shimamoto <[email protected]> wrote:
> > [ ... ]
>
> IMHO, the safest solution for the time being (and esp. for -stable)
> would be to proceed with Hiroshi's patch. It looks safe and it does
> fix a real problem.

Agreed

2008-03-20 05:45:23

by Sripathi Kodi

Subject: Re: [PATCH] sched: fix race in schedule

On Monday 10 March 2008 23:31, Hiroshi Shimamoto wrote:
> Hi Ingo,
>
> I found a race condition in the scheduler.
> The first report is below;
> http://lkml.org/lkml/2008/2/26/459
>
> It took a fairly long time to investigate and I couldn't spend much
> time on it last week. It is hard to reproduce, but -rt is a little
> easier because it has preemptible spin locks and RCU.
>
> Could you please check the scenario and the patch?
> It will be needed for -stable, too.

Hi,

I can recreate a problem that looks very similar to this on a kernel
based on 2.6.24.3-rt3. Hiroshi-san's patch seems to fix the problem for
me.

Ingo, is this patch going to be included in the next -rt patch? I see
that it is already part of mainline.

Thanks,
Sripathi.


> [ ... ]