2008-06-19 16:20:29

by Heiko Carstens

Subject: [BUG] CFS vs cpu hotplug

Hi Ingo, Peter,

I'm still seeing kernel crashes on cpu hotplug with Linus' current git tree.
All I have to do is make all cpus busy (a make -j4 of the kernel source is
sufficient) and then start cpu hotplug stress.
It usually takes less than a minute to crash the system like this:

Unable to handle kernel pointer dereference at virtual kernel address 005a800000031000
Oops: 0038 [#1] PREEMPT SMP
Modules linked in:
CPU: 1 Not tainted 2.6.26-rc6-00232-g9bedbcb #356
Process swapper (pid: 0, task: 000000002fe7ccf8, ksp: 000000002fe93d78)
Krnl PSW : 0400e00180000000 0000000000032c6c (pick_next_task_fair+0x34/0xb0)
R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:2 PM:0 EA:3
Krnl GPRS: 00000000001ff000 0000000000030bd8 000000000075a380 000000002fe7ccf8
0000000000386690 0000000000000008 0000000000000000 000000002fe7cf58
0000000000000001 000000000075a300 0000000000000000 000000002fe93d40
005a800000031201 0000000000386010 000000002fe93d78 000000002fe93d40
Krnl Code: 0000000000032c5c: e3e0f0980024 stg %r14,152(%r15)
0000000000032c62: d507d000c010 clc 0(8,%r13),16(%r12)
0000000000032c68: a784003c brc 8,32ce0
>0000000000032c6c: d507d000c030 clc 0(8,%r13),48(%r12)
0000000000032c72: b904002c lgr %r2,%r12
0000000000032c76: a7a90000 lghi %r10,0
0000000000032c7a: a7840021 brc 8,32cbc
0000000000032c7e: c0e5ffffefe3 brasl %r14,30c44
Call Trace:
([<000000000075a300>] 0x75a300)
[<000000000037195a>] schedule+0x162/0x7f4
[<000000000001a2be>] cpu_idle+0x1ca/0x25c
[<000000000036f368>] start_secondary+0xac/0xb8
[<0000000000000000>] 0x0
[<0000000000000000>] 0x0
Last Breaking-Event-Address:
[<0000000000032cc6>] pick_next_task_fair+0x8e/0xb0
---[ end trace 9bb55df196feedcc ]---
Kernel panic - not syncing: Attempted to kill the idle task!

Please note that the above call trace is from s390; however, Avi reported the
same bug on x86_64.

I tried to bisect this and ended up somewhere around the beginning of 2.6.23,
when the CFS patches got merged. Unfortunately it became harder and harder to
reproduce, so I couldn't bisect it down to a single patch.

One observation however is that this always happens after cpu_up(), not
cpu_down().

I modified the kernel sources a bit (actually only added a single "noinline")
to get some sensible debug data and dumped a crashed system. These are the
contents of the scheduler data structures which cause the crash:

>> px *(cfs_rq *) 0x75a380
struct cfs_rq {
load = struct load_weight {
weight = 0x800
inv_weight = 0x0
}
nr_running = 0x1
exec_clock = 0x0
min_vruntime = 0xbf7e9776
tasks_timeline = struct rb_root {
rb_node = (nil)
}
rb_leftmost = (nil) <<<<<<<<<<<< shouldn't be NULL
tasks = struct list_head {
next = 0x759328
prev = 0x759328
}
balance_iterator = (nil)
curr = 0x759300
next = (nil)
nr_spread_over = 0x0
rq = 0x75a300
leaf_cfs_rq_list = struct list_head {
next = (nil)
prev = (nil)
}
tg = 0x564970
}

The sched_entity that belongs to the cfs_rq:

>> px *(sched_entity *) 0x759300
struct sched_entity {
load = struct load_weight {
weight = 0x800
inv_weight = 0x1ffc01
}
run_node = struct rb_node {
rb_parent_color = 0x1
rb_right = (nil)
rb_left = (nil)
}
group_node = struct list_head {
next = 0x75a3b8
prev = 0x75a3b8
}
on_rq = 0x1
exec_start = 0x189685acb4aa46
sum_exec_runtime = 0x188a2b84c
vruntime = 0xd036bd29
prev_sum_exec_runtime = 0x1672e3f62
last_wakeup = 0x0
avg_overlap = 0x0
parent = (nil)
cfs_rq = 0x75a380
my_q = 0x759400
}

And the rq:

>> px *(rq *) 0x75a300
struct rq {
lock = spinlock_t {
raw_lock = raw_spinlock_t {
owner_cpu = 0xfffffffe
}
break_lock = 0x1
magic = 0xdead4ead
owner_cpu = 0x1
owner = 0x2ef95350
}
nr_running = 0x1
cpu_load = {
[0] 0x3062
[1] 0x2bdf
[2] 0x20db
[3] 0x171e
[4] 0x1010
}
idle_at_tick = 0x0
last_tick_seen = 0x0
in_nohz_recently = 0x0
load = struct load_weight {
weight = 0xc31
inv_weight = 0x0
}
nr_load_updates = 0x95f
nr_switches = 0x3f68
cfs = struct cfs_rq {
load = struct load_weight {
weight = 0x800
inv_weight = 0x0
}
nr_running = 0x1
exec_clock = 0x0
min_vruntime = 0xbf7e9776
tasks_timeline = struct rb_root {
rb_node = (nil)
}
rb_leftmost = (nil)
tasks = struct list_head {
next = 0x759328
prev = 0x759328
}
balance_iterator = (nil)
curr = 0x759300
next = (nil)
nr_spread_over = 0x0
rq = 0x75a300
leaf_cfs_rq_list = struct list_head {
next = (nil)
prev = (nil)
}
tg = 0x564970
}
rt = struct rt_rq {
active = struct rt_prio_array {
bitmap = {
[0] 0x0
[1] 0x1000000000
}
queue = {
[0] struct list_head {
next = 0x75a418
prev = 0x75a418
}
[1] struct list_head {
next = 0x75a428
prev = 0x75a428
}
[2] struct list_head {
next = 0x75a438
prev = 0x75a438
}
[3] struct list_head {
next = 0x75a448
prev = 0x75a448
}
[4] struct list_head {
next = 0x75a458
prev = 0x75a458
}
[5] struct list_head {
next = 0x75a468
prev = 0x75a468
}
[6] struct list_head {
next = 0x75a478
prev = 0x75a478
}
[7] struct list_head {
next = 0x75a488
prev = 0x75a488
}
[8] struct list_head {
next = 0x75a498
prev = 0x75a498
}
[9] struct list_head {
next = 0x75a4a8
prev = 0x75a4a8
}
[10] struct list_head {
next = 0x75a4b8
prev = 0x75a4b8
}
[11] struct list_head {
next = 0x75a4c8
prev = 0x75a4c8
}
[12] struct list_head {
next = 0x75a4d8
prev = 0x75a4d8
}
[13] struct list_head {
next = 0x75a4e8
prev = 0x75a4e8
}
[14] struct list_head {
next = 0x75a4f8
prev = 0x75a4f8
}
[15] struct list_head {
next = 0x75a508
prev = 0x75a508
}
[16] struct list_head {
next = 0x75a518
prev = 0x75a518
}
[17] struct list_head {
next = 0x75a528
prev = 0x75a528
}
[18] struct list_head {
next = 0x75a538
prev = 0x75a538
}
[19] struct list_head {
next = 0x75a548
prev = 0x75a548
}
[20] struct list_head {
next = 0x75a558
prev = 0x75a558
}
[21] struct list_head {
next = 0x75a568
prev = 0x75a568
}
[22] struct list_head {
next = 0x75a578
prev = 0x75a578
}
[23] struct list_head {
next = 0x75a588
prev = 0x75a588
}
[24] struct list_head {
next = 0x75a598
prev = 0x75a598
}
[25] struct list_head {
next = 0x75a5a8
prev = 0x75a5a8
}
[26] struct list_head {
next = 0x75a5b8
prev = 0x75a5b8
}
[27] struct list_head {
next = 0x75a5c8
prev = 0x75a5c8
}
[28] struct list_head {
next = 0x75a5d8
prev = 0x75a5d8
}
[29] struct list_head {
next = 0x75a5e8
prev = 0x75a5e8
}
[30] struct list_head {
next = 0x75a5f8
prev = 0x75a5f8
}
[31] struct list_head {
next = 0x75a608
prev = 0x75a608
}
[32] struct list_head {
next = 0x75a618
prev = 0x75a618
}
[33] struct list_head {
next = 0x75a628
prev = 0x75a628
}
[34] struct list_head {
next = 0x75a638
prev = 0x75a638
}
[35] struct list_head {
next = 0x75a648
prev = 0x75a648
}
[36] struct list_head {
next = 0x75a658
prev = 0x75a658
}
[37] struct list_head {
next = 0x75a668
prev = 0x75a668
}
[38] struct list_head {
next = 0x75a678
prev = 0x75a678
}
[39] struct list_head {
next = 0x75a688
prev = 0x75a688
}
[40] struct list_head {
next = 0x75a698
prev = 0x75a698
}
[41] struct list_head {
next = 0x75a6a8
prev = 0x75a6a8
}
[42] struct list_head {
next = 0x75a6b8
prev = 0x75a6b8
}
[43] struct list_head {
next = 0x75a6c8
prev = 0x75a6c8
}
[44] struct list_head {
next = 0x75a6d8
prev = 0x75a6d8
}
[45] struct list_head {
next = 0x75a6e8
prev = 0x75a6e8
}
[46] struct list_head {
next = 0x75a6f8
prev = 0x75a6f8
}
[47] struct list_head {
next = 0x75a708
prev = 0x75a708
}
[48] struct list_head {
next = 0x75a718
prev = 0x75a718
}
[49] struct list_head {
next = 0x75a728
prev = 0x75a728
}
[50] struct list_head {
next = 0x75a738
prev = 0x75a738
}
[51] struct list_head {
next = 0x75a748
prev = 0x75a748
}
[52] struct list_head {
next = 0x75a758
prev = 0x75a758
}
[53] struct list_head {
next = 0x75a768
prev = 0x75a768
}
[54] struct list_head {
next = 0x75a778
prev = 0x75a778
}
[55] struct list_head {
next = 0x75a788
prev = 0x75a788
}
[56] struct list_head {
next = 0x75a798
prev = 0x75a798
}
[57] struct list_head {
next = 0x75a7a8
prev = 0x75a7a8
}
[58] struct list_head {
next = 0x75a7b8
prev = 0x75a7b8
}
[59] struct list_head {
next = 0x75a7c8
prev = 0x75a7c8
}
[60] struct list_head {
next = 0x75a7d8
prev = 0x75a7d8
}
[61] struct list_head {
next = 0x75a7e8
prev = 0x75a7e8
}
[62] struct list_head {
next = 0x75a7f8
prev = 0x75a7f8
}
[63] struct list_head {
next = 0x75a808
prev = 0x75a808
}
[64] struct list_head {
next = 0x75a818
prev = 0x75a818
}
[65] struct list_head {
next = 0x75a828
prev = 0x75a828
}
[66] struct list_head {
next = 0x75a838
prev = 0x75a838
}
[67] struct list_head {
next = 0x75a848
prev = 0x75a848
}
[68] struct list_head {
next = 0x75a858
prev = 0x75a858
}
[69] struct list_head {
next = 0x75a868
prev = 0x75a868
}
[70] struct list_head {
next = 0x75a878
prev = 0x75a878
}
[71] struct list_head {
next = 0x75a888
prev = 0x75a888
}
[72] struct list_head {
next = 0x75a898
prev = 0x75a898
}
[73] struct list_head {
next = 0x75a8a8
prev = 0x75a8a8
}
[74] struct list_head {
next = 0x75a8b8
prev = 0x75a8b8
}
[75] struct list_head {
next = 0x75a8c8
prev = 0x75a8c8
}
[76] struct list_head {
next = 0x75a8d8
prev = 0x75a8d8
}
[77] struct list_head {
next = 0x75a8e8
prev = 0x75a8e8
}
[78] struct list_head {
next = 0x75a8f8
prev = 0x75a8f8
}
[79] struct list_head {
next = 0x75a908
prev = 0x75a908
}
[80] struct list_head {
next = 0x75a918
prev = 0x75a918
}
[81] struct list_head {
next = 0x75a928
prev = 0x75a928
}
[82] struct list_head {
next = 0x75a938
prev = 0x75a938
}
[83] struct list_head {
next = 0x75a948
prev = 0x75a948
}
[84] struct list_head {
next = 0x75a958
prev = 0x75a958
}
[85] struct list_head {
next = 0x75a968
prev = 0x75a968
}
[86] struct list_head {
next = 0x75a978
prev = 0x75a978
}
[87] struct list_head {
next = 0x75a988
prev = 0x75a988
}
[88] struct list_head {
next = 0x75a998
prev = 0x75a998
}
[89] struct list_head {
next = 0x75a9a8
prev = 0x75a9a8
}
[90] struct list_head {
next = 0x75a9b8
prev = 0x75a9b8
}
[91] struct list_head {
next = 0x75a9c8
prev = 0x75a9c8
}
[92] struct list_head {
next = 0x75a9d8
prev = 0x75a9d8
}
[93] struct list_head {
next = 0x75a9e8
prev = 0x75a9e8
}
[94] struct list_head {
next = 0x75a9f8
prev = 0x75a9f8
}
[95] struct list_head {
next = 0x75aa08
prev = 0x75aa08
}
[96] struct list_head {
next = 0x75aa18
prev = 0x75aa18
}
[97] struct list_head {
next = 0x75aa28
prev = 0x75aa28
}
[98] struct list_head {
next = 0x75aa38
prev = 0x75aa38
}
[99] struct list_head {
next = 0x75aa48
prev = 0x75aa48
}
}
}
rt_nr_running = 0x0
highest_prio = 0x64
rt_nr_migratory = 0x0
overloaded = 0x0
rt_throttled = 0x0
rt_time = 0x123a999
rt_runtime = 0x389fd980
rt_runtime_lock = spinlock_t {
raw_lock = raw_spinlock_t {
owner_cpu = 0x0
}
break_lock = 0x0
magic = 0xdead4ead
owner_cpu = 0xffffffff
owner = 0xffffffffffffffff
}
}
leaf_cfs_rq_list = struct list_head {
next = 0x2f5a8970
prev = 0x759470
}
nr_uninterruptible = 0xfffffffffffffffe
curr = 0x2ef95350
idle = 0x2fe7ccf8
next_balance = 0x10000093b
prev_mm = (nil)
clock = 0x189685acb4d536
nr_iowait = atomic_t {
counter = 0x0
}
rd = 0x564a58
sd = (nil)
active_balance = 0x0
push_cpu = 0x0
cpu = 0x1
migration_thread = 0x2ef95350
migration_queue = struct list_head {
next = 0x75ab10
prev = 0x75ab10
}
rq_lock_key = struct lock_class_key {
}
}

Hopefully some of this debug data is of use. If you need more, just let me
know.

Thanks!


2008-06-19 18:05:38

by Peter Zijlstra

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
> Hi Ingo, Peter,
>
> I'm still seeing kernel crashes on cpu hotplug with Linus' current git tree.
> All I have to do is make all cpus busy (a make -j4 of the kernel source is
> sufficient) and then start cpu hotplug stress.
> It usually takes less than a minute to crash the system like this:
>
> Unable to handle kernel pointer dereference at virtual kernel address 005a800000031000
> Oops: 0038 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 1 Not tainted 2.6.26-rc6-00232-g9bedbcb #356
> Process swapper (pid: 0, task: 000000002fe7ccf8, ksp: 000000002fe93d78)
> Krnl PSW : 0400e00180000000 0000000000032c6c (pick_next_task_fair+0x34/0xb0)

I presume this is:

se = pick_next_entity(cfs_rq);

> R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:2 PM:0 EA:3
> Krnl GPRS: 00000000001ff000 0000000000030bd8 000000000075a380 000000002fe7ccf8
> 0000000000386690 0000000000000008 0000000000000000 000000002fe7cf58
> 0000000000000001 000000000075a300 0000000000000000 000000002fe93d40
> 005a800000031201 0000000000386010 000000002fe93d78 000000002fe93d40
> Krnl Code: 0000000000032c5c: e3e0f0980024 stg %r14,152(%r15)
> 0000000000032c62: d507d000c010 clc 0(8,%r13),16(%r12)
> 0000000000032c68: a784003c brc 8,32ce0
> >0000000000032c6c: d507d000c030 clc 0(8,%r13),48(%r12)
> 0000000000032c72: b904002c lgr %r2,%r12
> 0000000000032c76: a7a90000 lghi %r10,0
> 0000000000032c7a: a7840021 brc 8,32cbc
> 0000000000032c7e: c0e5ffffefe3 brasl %r14,30c44
> Call Trace:
> ([<000000000075a300>] 0x75a300)
> [<000000000037195a>] schedule+0x162/0x7f4
> [<000000000001a2be>] cpu_idle+0x1ca/0x25c
> [<000000000036f368>] start_secondary+0xac/0xb8
> [<0000000000000000>] 0x0
> [<0000000000000000>] 0x0
> Last Breaking-Event-Address:
> [<0000000000032cc6>] pick_next_task_fair+0x8e/0xb0
> ---[ end trace 9bb55df196feedcc ]---
> Kernel panic - not syncing: Attempted to kill the idle task!
>
> Please note that the above call trace is from s390; however, Avi reported the
> same bug on x86_64.
>
> I tried to bisect this and ended up somewhere around the beginning of 2.6.23,
> when the CFS patches got merged. Unfortunately it became harder and harder to
> reproduce, so I couldn't bisect it down to a single patch.
>
> One observation however is that this always happens after cpu_up(), not
> cpu_down().
>
> I modified the kernel sources a bit (actually only added a single "noinline")
> to get some sensible debug data and dumped a crashed system. These are the
> contents of the scheduler data structures which cause the crash:
>
> >> px *(cfs_rq *) 0x75a380
> struct cfs_rq {
> load = struct load_weight {
> weight = 0x800
> inv_weight = 0x0
> }
> nr_running = 0x1
> exec_clock = 0x0
> min_vruntime = 0xbf7e9776
> tasks_timeline = struct rb_root {
> rb_node = (nil)
> }
> rb_leftmost = (nil) <<<<<<<<<<<< shouldn't be NULL
> tasks = struct list_head {
> next = 0x759328
> prev = 0x759328
> }
> balance_iterator = (nil)
> curr = 0x759300
> next = (nil)
> nr_spread_over = 0x0
> rq = 0x75a300
> leaf_cfs_rq_list = struct list_head {
> next = (nil)
> prev = (nil)
> }
> tg = 0x564970
> }

Right, this cfs_rq is buggered. rb_leftmost may be null when the tree is
empty (as is the case here).

However cfs_rq->curr != NULL and cfs_rq->nr_running != 0.

So this hints at a missing put_prev_entity() - we keep current out of
the tree, and put it back in right before we schedule(). The advantage
is that we don't need to reposition (dequeue/enqueue) curr in the tree
every time we update its virtual timeline.

So what is racing such that we can miss put_prev_entity(), and how is
cpu_up() special..

> The sched_entity that belongs to the cfs_rq:
>
> >> px *(sched_entity *) 0x759300
> struct sched_entity {
> load = struct load_weight {
> weight = 0x800
> inv_weight = 0x1ffc01
> }
> run_node = struct rb_node {
> rb_parent_color = 0x1
> rb_right = (nil)
> rb_left = (nil)
> }
> group_node = struct list_head {
> next = 0x75a3b8
> prev = 0x75a3b8
> }
> on_rq = 0x1
> exec_start = 0x189685acb4aa46
> sum_exec_runtime = 0x188a2b84c
> vruntime = 0xd036bd29
> prev_sum_exec_runtime = 0x1672e3f62
> last_wakeup = 0x0
> avg_overlap = 0x0
> parent = (nil)
> cfs_rq = 0x75a380
> my_q = 0x759400
> }
>
> And the rq:
>
> >> px *(rq *) 0x75a300
> struct rq {
> lock = spinlock_t {
> raw_lock = raw_spinlock_t {
> owner_cpu = 0xfffffffe
> }
> break_lock = 0x1
> magic = 0xdead4ead
> owner_cpu = 0x1
> owner = 0x2ef95350
> }
> nr_running = 0x1
> cpu_load = {
> [0] 0x3062
> [1] 0x2bdf
> [2] 0x20db
> [3] 0x171e
> [4] 0x1010
> }
> idle_at_tick = 0x0
> last_tick_seen = 0x0
> in_nohz_recently = 0x0
> load = struct load_weight {
> weight = 0xc31
> inv_weight = 0x0
> }
> nr_load_updates = 0x95f
> nr_switches = 0x3f68
> cfs = struct cfs_rq {
> load = struct load_weight {
> weight = 0x800
> inv_weight = 0x0
> }
> nr_running = 0x1
> exec_clock = 0x0
> min_vruntime = 0xbf7e9776
> tasks_timeline = struct rb_root {
> rb_node = (nil)
> }
> rb_leftmost = (nil)
> tasks = struct list_head {
> next = 0x759328
> prev = 0x759328
> }
> balance_iterator = (nil)
> curr = 0x759300
> next = (nil)
> nr_spread_over = 0x0
> rq = 0x75a300
> leaf_cfs_rq_list = struct list_head {
> next = (nil)
> prev = (nil)
> }
> tg = 0x564970
> }
> rt = struct rt_rq {
> active = struct rt_prio_array {
> bitmap = {
> [0] 0x0
> [1] 0x1000000000
> }
> queue = {
> [0] struct list_head {
> next = 0x75a418
> prev = 0x75a418
> }
> [1] struct list_head {
> next = 0x75a428
> prev = 0x75a428
> }
> [2] struct list_head {
> next = 0x75a438
> prev = 0x75a438
> }
> [3] struct list_head {
> next = 0x75a448
> prev = 0x75a448
> }
> [4] struct list_head {
> next = 0x75a458
> prev = 0x75a458
> }
> [5] struct list_head {
> next = 0x75a468
> prev = 0x75a468
> }
> [6] struct list_head {
> next = 0x75a478
> prev = 0x75a478
> }
> [7] struct list_head {
> next = 0x75a488
> prev = 0x75a488
> }
> [8] struct list_head {
> next = 0x75a498
> prev = 0x75a498
> }
> [9] struct list_head {
> next = 0x75a4a8
> prev = 0x75a4a8
> }
> [10] struct list_head {
> next = 0x75a4b8
> prev = 0x75a4b8
> }
> [11] struct list_head {
> next = 0x75a4c8
> prev = 0x75a4c8
> }
> [12] struct list_head {
> next = 0x75a4d8
> prev = 0x75a4d8
> }
> [13] struct list_head {
> next = 0x75a4e8
> prev = 0x75a4e8
> }
> [14] struct list_head {
> next = 0x75a4f8
> prev = 0x75a4f8
> }
> [15] struct list_head {
> next = 0x75a508
> prev = 0x75a508
> }
> [16] struct list_head {
> next = 0x75a518
> prev = 0x75a518
> }
> [17] struct list_head {
> next = 0x75a528
> prev = 0x75a528
> }
> [18] struct list_head {
> next = 0x75a538
> prev = 0x75a538
> }
> [19] struct list_head {
> next = 0x75a548
> prev = 0x75a548
> }
> [20] struct list_head {
> next = 0x75a558
> prev = 0x75a558
> }
> [21] struct list_head {
> next = 0x75a568
> prev = 0x75a568
> }
> [22] struct list_head {
> next = 0x75a578
> prev = 0x75a578
> }
> [23] struct list_head {
> next = 0x75a588
> prev = 0x75a588
> }
> [24] struct list_head {
> next = 0x75a598
> prev = 0x75a598
> }
> [25] struct list_head {
> next = 0x75a5a8
> prev = 0x75a5a8
> }
> [26] struct list_head {
> next = 0x75a5b8
> prev = 0x75a5b8
> }
> [27] struct list_head {
> next = 0x75a5c8
> prev = 0x75a5c8
> }
> [28] struct list_head {
> next = 0x75a5d8
> prev = 0x75a5d8
> }
> [29] struct list_head {
> next = 0x75a5e8
> prev = 0x75a5e8
> }
> [30] struct list_head {
> next = 0x75a5f8
> prev = 0x75a5f8
> }
> [31] struct list_head {
> next = 0x75a608
> prev = 0x75a608
> }
> [32] struct list_head {
> next = 0x75a618
> prev = 0x75a618
> }
> [33] struct list_head {
> next = 0x75a628
> prev = 0x75a628
> }
> [34] struct list_head {
> next = 0x75a638
> prev = 0x75a638
> }
> [35] struct list_head {
> next = 0x75a648
> prev = 0x75a648
> }
> [36] struct list_head {
> next = 0x75a658
> prev = 0x75a658
> }
> [37] struct list_head {
> next = 0x75a668
> prev = 0x75a668
> }
> [38] struct list_head {
> next = 0x75a678
> prev = 0x75a678
> }
> [39] struct list_head {
> next = 0x75a688
> prev = 0x75a688
> }
> [40] struct list_head {
> next = 0x75a698
> prev = 0x75a698
> }
> [41] struct list_head {
> next = 0x75a6a8
> prev = 0x75a6a8
> }
> [42] struct list_head {
> next = 0x75a6b8
> prev = 0x75a6b8
> }
> [43] struct list_head {
> next = 0x75a6c8
> prev = 0x75a6c8
> }
> [44] struct list_head {
> next = 0x75a6d8
> prev = 0x75a6d8
> }
> [45] struct list_head {
> next = 0x75a6e8
> prev = 0x75a6e8
> }
> [46] struct list_head {
> next = 0x75a6f8
> prev = 0x75a6f8
> }
> [47] struct list_head {
> next = 0x75a708
> prev = 0x75a708
> }
> [48] struct list_head {
> next = 0x75a718
> prev = 0x75a718
> }
> [49] struct list_head {
> next = 0x75a728
> prev = 0x75a728
> }
> [50] struct list_head {
> next = 0x75a738
> prev = 0x75a738
> }
> [51] struct list_head {
> next = 0x75a748
> prev = 0x75a748
> }
> [52] struct list_head {
> next = 0x75a758
> prev = 0x75a758
> }
> [53] struct list_head {
> next = 0x75a768
> prev = 0x75a768
> }
> [54] struct list_head {
> next = 0x75a778
> prev = 0x75a778
> }
> [55] struct list_head {
> next = 0x75a788
> prev = 0x75a788
> }
> [56] struct list_head {
> next = 0x75a798
> prev = 0x75a798
> }
> [57] struct list_head {
> next = 0x75a7a8
> prev = 0x75a7a8
> }
> [58] struct list_head {
> next = 0x75a7b8
> prev = 0x75a7b8
> }
> [59] struct list_head {
> next = 0x75a7c8
> prev = 0x75a7c8
> }
> [60] struct list_head {
> next = 0x75a7d8
> prev = 0x75a7d8
> }
> [61] struct list_head {
> next = 0x75a7e8
> prev = 0x75a7e8
> }
> [62] struct list_head {
> next = 0x75a7f8
> prev = 0x75a7f8
> }
> [63] struct list_head {
> next = 0x75a808
> prev = 0x75a808
> }
> [64] struct list_head {
> next = 0x75a818
> prev = 0x75a818
> }
> [65] struct list_head {
> next = 0x75a828
> prev = 0x75a828
> }
> [66] struct list_head {
> next = 0x75a838
> prev = 0x75a838
> }
> [67] struct list_head {
> next = 0x75a848
> prev = 0x75a848
> }
> [68] struct list_head {
> next = 0x75a858
> prev = 0x75a858
> }
> [69] struct list_head {
> next = 0x75a868
> prev = 0x75a868
> }
> [70] struct list_head {
> next = 0x75a878
> prev = 0x75a878
> }
> [71] struct list_head {
> next = 0x75a888
> prev = 0x75a888
> }
> [72] struct list_head {
> next = 0x75a898
> prev = 0x75a898
> }
> [73] struct list_head {
> next = 0x75a8a8
> prev = 0x75a8a8
> }
> [74] struct list_head {
> next = 0x75a8b8
> prev = 0x75a8b8
> }
> [75] struct list_head {
> next = 0x75a8c8
> prev = 0x75a8c8
> }
> [76] struct list_head {
> next = 0x75a8d8
> prev = 0x75a8d8
> }
> [77] struct list_head {
> next = 0x75a8e8
> prev = 0x75a8e8
> }
> [78] struct list_head {
> next = 0x75a8f8
> prev = 0x75a8f8
> }
> [79] struct list_head {
> next = 0x75a908
> prev = 0x75a908
> }
> [80] struct list_head {
> next = 0x75a918
> prev = 0x75a918
> }
> [81] struct list_head {
> next = 0x75a928
> prev = 0x75a928
> }
> [82] struct list_head {
> next = 0x75a938
> prev = 0x75a938
> }
> [83] struct list_head {
> next = 0x75a948
> prev = 0x75a948
> }
> [84] struct list_head {
> next = 0x75a958
> prev = 0x75a958
> }
> [85] struct list_head {
> next = 0x75a968
> prev = 0x75a968
> }
> [86] struct list_head {
> next = 0x75a978
> prev = 0x75a978
> }
> [87] struct list_head {
> next = 0x75a988
> prev = 0x75a988
> }
> [88] struct list_head {
> next = 0x75a998
> prev = 0x75a998
> }
> [89] struct list_head {
> next = 0x75a9a8
> prev = 0x75a9a8
> }
> [90] struct list_head {
> next = 0x75a9b8
> prev = 0x75a9b8
> }
> [91] struct list_head {
> next = 0x75a9c8
> prev = 0x75a9c8
> }
> [92] struct list_head {
> next = 0x75a9d8
> prev = 0x75a9d8
> }
> [93] struct list_head {
> next = 0x75a9e8
> prev = 0x75a9e8
> }
> [94] struct list_head {
> next = 0x75a9f8
> prev = 0x75a9f8
> }
> [95] struct list_head {
> next = 0x75aa08
> prev = 0x75aa08
> }
> [96] struct list_head {
> next = 0x75aa18
> prev = 0x75aa18
> }
> [97] struct list_head {
> next = 0x75aa28
> prev = 0x75aa28
> }
> [98] struct list_head {
> next = 0x75aa38
> prev = 0x75aa38
> }
> [99] struct list_head {
> next = 0x75aa48
> prev = 0x75aa48
> }
> }
> }
> rt_nr_running = 0x0
> highest_prio = 0x64
> rt_nr_migratory = 0x0
> overloaded = 0x0
> rt_throttled = 0x0
> rt_time = 0x123a999
> rt_runtime = 0x389fd980
> rt_runtime_lock = spinlock_t {
> raw_lock = raw_spinlock_t {
> owner_cpu = 0x0
> }
> break_lock = 0x0
> magic = 0xdead4ead
> owner_cpu = 0xffffffff
> owner = 0xffffffffffffffff
> }
> }
> leaf_cfs_rq_list = struct list_head {
> next = 0x2f5a8970
> prev = 0x759470
> }
> nr_uninterruptible = 0xfffffffffffffffe
> curr = 0x2ef95350
> idle = 0x2fe7ccf8
> next_balance = 0x10000093b
> prev_mm = (nil)
> clock = 0x189685acb4d536
> nr_iowait = atomic_t {
> counter = 0x0
> }
> rd = 0x564a58
> sd = (nil)
> active_balance = 0x0
> push_cpu = 0x0
> cpu = 0x1
> migration_thread = 0x2ef95350
> migration_queue = struct list_head {
> next = 0x75ab10
> prev = 0x75ab10
> }
> rq_lock_key = struct lock_class_key {
> }
> }
>
> Hopefully some of this debug data is of use. If you need more, just let me
> know.
>
> Thanks!

2008-06-19 18:14:28

by Peter Zijlstra

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:

> > The sched_entity that belongs to the cfs_rq:
> >
> > >> px *(sched_entity *) 0x759300
> > struct sched_entity {
> > load = struct load_weight {
> > weight = 0x800
> > inv_weight = 0x1ffc01
> > }
> > run_node = struct rb_node {
> > rb_parent_color = 0x1
> > rb_right = (nil)
> > rb_left = (nil)
> > }
> > group_node = struct list_head {
> > next = 0x75a3b8
> > prev = 0x75a3b8
> > }
> > on_rq = 0x1
> > exec_start = 0x189685acb4aa46
> > sum_exec_runtime = 0x188a2b84c
> > vruntime = 0xd036bd29
> > prev_sum_exec_runtime = 0x1672e3f62
> > last_wakeup = 0x0
> > avg_overlap = 0x0
> > parent = (nil)
> > cfs_rq = 0x75a380
> > my_q = 0x759400
> > }

Ooh, this thing is with CONFIG_GROUP_SCHED... does it still happen when
you disable that?

Not that that is any excuse for crashing.. but it does simplify the
scheduler somewhat.
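For anyone wanting to reproduce this data point: the switch lives under "Group CPU scheduler" in the 2.6.26-era kernel config, and disabling it should leave a .config fragment roughly like this (sketch; exact option names may vary between releases):

```
# CONFIG_GROUP_SCHED is not set
```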

2008-06-19 21:15:46

by Heiko Carstens

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, Jun 19, 2008 at 08:14:02PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> > On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
>
> > > The sched_entity that belongs to the cfs_rq:
> > >
> > > >> px *(sched_entity *) 0x759300
> > > struct sched_entity {
> > > load = struct load_weight {
> > > weight = 0x800
> > > inv_weight = 0x1ffc01
> > > }
> > > run_node = struct rb_node {
> > > rb_parent_color = 0x1
> > > rb_right = (nil)
> > > rb_left = (nil)
> > > }
> > > group_node = struct list_head {
> > > next = 0x75a3b8
> > > prev = 0x75a3b8
> > > }
> > > on_rq = 0x1
> > > exec_start = 0x189685acb4aa46
> > > sum_exec_runtime = 0x188a2b84c
> > > vruntime = 0xd036bd29
> > > prev_sum_exec_runtime = 0x1672e3f62
> > > last_wakeup = 0x0
> > > avg_overlap = 0x0
> > > parent = (nil)
> > > cfs_rq = 0x75a380
> > > my_q = 0x759400
> > > }
>
> Ooh, this thing is with CONFIG_GROUP_SCHED... does it still happen when
> you disable that?

Indeed, when CONFIG_GROUP_SCHED is disabled I cannot reproduce it anymore.

2008-06-19 21:18:51

by Heiko Carstens

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, Jun 19, 2008 at 08:05:10PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
> > Hi Ingo, Peter,
> >
> > I'm still seeing kernel crashes on cpu hotplug with Linus' current git tree.
> > All I have to do is make all cpus busy (a make -j4 of the kernel source is
> > sufficient) and then start cpu hotplug stress.
> > It usually takes less than a minute to crash the system like this:
> >
> > Unable to handle kernel pointer dereference at virtual kernel address 005a800000031000
> > Oops: 0038 [#1] PREEMPT SMP
> > Modules linked in:
> > CPU: 1 Not tainted 2.6.26-rc6-00232-g9bedbcb #356
> > Process swapper (pid: 0, task: 000000002fe7ccf8, ksp: 000000002fe93d78)
> > Krnl PSW : 0400e00180000000 0000000000032c6c (pick_next_task_fair+0x34/0xb0)
>
> I presume this is:
>
> se = pick_next_entity(cfs_rq);

Yes, that is correct. Sorry, I forgot to mention that detail.

2008-06-19 21:27:12

by Peter Zijlstra

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, 2008-06-19 at 23:14 +0200, Heiko Carstens wrote:
> On Thu, Jun 19, 2008 at 08:14:02PM +0200, Peter Zijlstra wrote:
> > On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> > > On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
> >
> > > > The sched_entity that belongs to the cfs_rq:
> > > >
> > > > >> px *(sched_entity *) 0x759300
> > > > struct sched_entity {
> > > > load = struct load_weight {
> > > > weight = 0x800
> > > > inv_weight = 0x1ffc01
> > > > }
> > > > run_node = struct rb_node {
> > > > rb_parent_color = 0x1
> > > > rb_right = (nil)
> > > > rb_left = (nil)
> > > > }
> > > > group_node = struct list_head {
> > > > next = 0x75a3b8
> > > > prev = 0x75a3b8
> > > > }
> > > > on_rq = 0x1
> > > > exec_start = 0x189685acb4aa46
> > > > sum_exec_runtime = 0x188a2b84c
> > > > vruntime = 0xd036bd29
> > > > prev_sum_exec_runtime = 0x1672e3f62
> > > > last_wakeup = 0x0
> > > > avg_overlap = 0x0
> > > > parent = (nil)
> > > > cfs_rq = 0x75a380
> > > > my_q = 0x759400
> > > > }
> >
> > Ooh, this thing is with CONFIG_GROUP_SCHED... does it still happen when
> > you disable that?
>
> Indeed, when CONFIG_GROUP_SCHED is disabled I cannot reproduce it anymore.

Ok, that gives us some idea where to look, thanks for this data point.

2008-06-19 21:36:59

by Peter Zijlstra

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:

> > The sched_entity that belongs to the cfs_rq:
> >
> > >> px *(sched_entity *) 0x759300
> > struct sched_entity {
> > load = struct load_weight {
> > weight = 0x800
> > inv_weight = 0x1ffc01
> > }
> > run_node = struct rb_node {
> > rb_parent_color = 0x1
> > rb_right = (nil)
> > rb_left = (nil)
> > }
> > group_node = struct list_head {
> > next = 0x75a3b8
> > prev = 0x75a3b8
> > }
> > on_rq = 0x1
> > exec_start = 0x189685acb4aa46
> > sum_exec_runtime = 0x188a2b84c
> > vruntime = 0xd036bd29
> > prev_sum_exec_runtime = 0x1672e3f62
> > last_wakeup = 0x0
> > avg_overlap = 0x0
> > parent = (nil)
> > cfs_rq = 0x75a380
> > my_q = 0x759400
> > }

If you still have this dump, could you give the output of:

px *(struct cfs_rq *) 0x759400

And possibly walk down the chain getting its curr and then my_q again
etc..

2008-06-19 21:50:10

by Heiko Carstens

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, Jun 19, 2008 at 11:32:29PM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> > On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
>
> > > The sched_entity that belongs to the cfs_rq:
> > >
> > > >> px *(sched_entity *) 0x759300
> > > struct sched_entity {
> > > load = struct load_weight {
> > > weight = 0x800
> > > inv_weight = 0x1ffc01
> > > }
> > > run_node = struct rb_node {
> > > rb_parent_color = 0x1
> > > rb_right = (nil)
> > > rb_left = (nil)
> > > }
> > > group_node = struct list_head {
> > > next = 0x75a3b8
> > > prev = 0x75a3b8
> > > }
> > > on_rq = 0x1
> > > exec_start = 0x189685acb4aa46
> > > sum_exec_runtime = 0x188a2b84c
> > > vruntime = 0xd036bd29
> > > prev_sum_exec_runtime = 0x1672e3f62
> > > last_wakeup = 0x0
> > > avg_overlap = 0x0
> > > parent = (nil)
> > > cfs_rq = 0x75a380
> > > my_q = 0x759400
> > > }
>
> If you still have this dump, could you give the output of:
>
> px *(struct cfs_rq *) 0x759400
>
> And possibly walk down the chain getting its curr and then my_q again
> etc..

Sure, fortunately just a very short chain:

>> px *(struct cfs_rq *) 0x759400
struct cfs_rq {
load = struct load_weight {
weight = 0xc31
inv_weight = 0x0
}
nr_running = 0x1
exec_clock = 0x0
min_vruntime = 0x4f216b005
tasks_timeline = struct rb_root {
rb_node = 0x2fca4d40
}
rb_leftmost = 0x2fca4d40
tasks = struct list_head {
next = 0x2fca4d58
prev = 0x2fca4d58
}
balance_iterator = 0x2e29e700
curr = 0x2ef4f388
next = (nil)
nr_spread_over = 0x0
rq = 0x75a300
leaf_cfs_rq_list = struct list_head {
next = 0x75aaa0
prev = 0x2e1eca70
}
tg = 0x564910
}

>> px *(sched_entity *) 0x2ef4f388
struct sched_entity {
load = struct load_weight {
weight = 0x400
inv_weight = 0x400000
}
run_node = struct rb_node {
rb_parent_color = 0x2f07b399
rb_right = (nil)
rb_left = (nil)
}
group_node = struct list_head {
next = 0x2ef4f3b0
prev = 0x2ef4f3b0
}
on_rq = 0x0
exec_start = 0x189685c9a77b96
sum_exec_runtime = 0x3c51111
vruntime = 0x493becf68
prev_sum_exec_runtime = 0x3c50997
last_wakeup = 0x0
avg_overlap = 0x4b67d1
parent = 0x763300
cfs_rq = 0x763400
my_q = (nil)
}

2008-06-20 08:51:21

by Peter Zijlstra

Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, 2008-06-19 at 23:49 +0200, Heiko Carstens wrote:
> On Thu, Jun 19, 2008 at 11:32:29PM +0200, Peter Zijlstra wrote:
> > On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> > > On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
> >
> > > > The sched_entity that belongs to the cfs_rq:
> > > >
> > > > >> px *(sched_entity *) 0x759300
> > > > struct sched_entity {
> > > > load = struct load_weight {
> > > > weight = 0x800
> > > > inv_weight = 0x1ffc01
> > > > }
> > > > run_node = struct rb_node {
> > > > rb_parent_color = 0x1
> > > > rb_right = (nil)
> > > > rb_left = (nil)
> > > > }
> > > > group_node = struct list_head {
> > > > next = 0x75a3b8
> > > > prev = 0x75a3b8
> > > > }
> > > > on_rq = 0x1
> > > > exec_start = 0x189685acb4aa46
> > > > sum_exec_runtime = 0x188a2b84c
> > > > vruntime = 0xd036bd29
> > > > prev_sum_exec_runtime = 0x1672e3f62
> > > > last_wakeup = 0x0
> > > > avg_overlap = 0x0
> > > > parent = (nil)
> > > > cfs_rq = 0x75a380
> > > > my_q = 0x759400
> > > > }
> >
> > If you still have this dump, could you give the output of:
> >
> > px *(struct cfs_rq *) 0x759400
> >
> > And possibly walk down the chain getting its curr and then my_q again
> > etc..
>
> Sure, fortunately just a very short chain:
>
> >> px *(struct cfs_rq *) 0x759400
> struct cfs_rq {
> load = struct load_weight {
> weight = 0xc31
> inv_weight = 0x0
> }
> nr_running = 0x1
> exec_clock = 0x0
> min_vruntime = 0x4f216b005
> tasks_timeline = struct rb_root {
> rb_node = 0x2fca4d40
> }
> rb_leftmost = 0x2fca4d40
> tasks = struct list_head {
> next = 0x2fca4d58
> prev = 0x2fca4d58
> }
> balance_iterator = 0x2e29e700
> curr = 0x2ef4f388
> next = (nil)
> nr_spread_over = 0x0
> rq = 0x75a300
> leaf_cfs_rq_list = struct list_head {
> next = 0x75aaa0
> prev = 0x2e1eca70
> }
> tg = 0x564910
> }

Hmm, this one is buggered as well: it has nr_running = 1 and one entry
in the tree, but also a non-NULL curr.

Could you please show:

px *container_of(0x2fca4d40, struct sched_entity, run_node)

which one might have to write like:

px *((struct sched_entity *)((char *)0x2fca4d40 - (unsigned long)&((struct sched_entity *)0)->run_node))

/me prays he got the braces right,..

> >> px *(sched_entity *) 0x2ef4f388
> struct sched_entity {
> load = struct load_weight {
> weight = 0x400
> inv_weight = 0x400000
> }
> run_node = struct rb_node {
> rb_parent_color = 0x2f07b399
> rb_right = (nil)
> rb_left = (nil)
> }
> group_node = struct list_head {
> next = 0x2ef4f3b0
> prev = 0x2ef4f3b0
> }
> on_rq = 0x0
> exec_start = 0x189685c9a77b96
> sum_exec_runtime = 0x3c51111
> vruntime = 0x493becf68
> prev_sum_exec_runtime = 0x3c50997
> last_wakeup = 0x0
> avg_overlap = 0x4b67d1
> parent = 0x763300
> cfs_rq = 0x763400
> my_q = (nil)
> }

This one seems unassociated with the rest of the chain, judging by its
back-pointers.

Fancy puzzle,..

2008-06-20 11:44:52

by Dmitry Adamushko

Subject: Re: [BUG] CFS vs cpu hotplug

2008/6/19 Peter Zijlstra <[email protected]>:
> On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
>> Hi Ingo, Peter,
>>
>> I'm still seeing kernel crashes on cpu hotplug with Linus' current git tree.
>> All I have to do is to make all cpus busy (make -j4 of the kernel source is
>> sufficient) and then start cpu hotplug stress.
>> It usually takes below a minute to crash the system like this:
>>
>> Unable to handle kernel pointer dereference at virtual kernel address 005a800000031000
>> Oops: 0038 [#1] PREEMPT SMP
>> Modules linked in:
>> CPU: 1 Not tainted 2.6.26-rc6-00232-g9bedbcb #356
>> Process swapper (pid: 0, task: 000000002fe7ccf8, ksp: 000000002fe93d78)
>> Krnl PSW : 0400e00180000000 0000000000032c6c (pick_next_task_fair+0x34/0xb0)
>
> I presume this is:
>
> se = pick_next_entity(cfs_rq);
>
>> R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:2 PM:0 EA:3
>> Krnl GPRS: 00000000001ff000 0000000000030bd8 000000000075a380 000000002fe7ccf8
>> 0000000000386690 0000000000000008 0000000000000000 000000002fe7cf58
>> 0000000000000001 000000000075a300 0000000000000000 000000002fe93d40
>> 005a800000031201 0000000000386010 000000002fe93d78 000000002fe93d40
>> Krnl Code: 0000000000032c5c: e3e0f0980024 stg %r14,152(%r15)
>> 0000000000032c62: d507d000c010 clc 0(8,%r13),16(%r12)
>> 0000000000032c68: a784003c brc 8,32ce0
>> >0000000000032c6c: d507d000c030 clc 0(8,%r13),48(%r12)
>> 0000000000032c72: b904002c lgr %r2,%r12
>> 0000000000032c76: a7a90000 lghi %r10,0
>> 0000000000032c7a: a7840021 brc 8,32cbc
>> 0000000000032c7e: c0e5ffffefe3 brasl %r14,30c44
>> Call Trace:
>> ([<000000000075a300>] 0x75a300)
>> [<000000000037195a>] schedule+0x162/0x7f4
>> [<000000000001a2be>] cpu_idle+0x1ca/0x25c
>> [<000000000036f368>] start_secondary+0xac/0xb8
>> [<0000000000000000>] 0x0
>> [<0000000000000000>] 0x0
>> Last Breaking-Event-Address:
>> [<0000000000032cc6>] pick_next_task_fair+0x8e/0xb0
>> <4>---[ end trace 9bb55df196feedcc ]---
>> Kernel panic - not syncing: Attempted to kill the idle task!
>>
>> Please note that the above call trace is from s390, however Avi reported the
>> same bug on x86_64.
>>
>> I tried to bisect this and ended up somewhere at the beginning of 2.6.23 when
>> the CFS patches got merged. Unfortunately it got harder and harder to reproduce
>> so that I couldn't bisect this down to a single patch.
>>
>> One observation however is that this always happens after cpu_up(), not
>> cpu_down().
>>
>> I modified the kernel sources a bit (actually only added a single "noinline")
>> to get some sensible debug data and dumped a crashed system. These are the
>> contents of the scheduler data structures which cause the crash:
>>
>> >> px *(cfs_rq *) 0x75a380
>> struct cfs_rq {
>> load = struct load_weight {
>> weight = 0x800
>> inv_weight = 0x0
>> }
>> nr_running = 0x1
>> exec_clock = 0x0
>> min_vruntime = 0xbf7e9776
>> tasks_timeline = struct rb_root {
>> rb_node = (nil)
>> }
>> rb_leftmost = (nil) <<<<<<<<<<<< shouldn't be NULL
>> tasks = struct list_head {
>> next = 0x759328
>> prev = 0x759328
>> }
>> balance_iterator = (nil)
>> curr = 0x759300
>> next = (nil)
>> nr_spread_over = 0x0
>> rq = 0x75a300
>> leaf_cfs_rq_list = struct list_head {
>> next = (nil)
>> prev = (nil)
>> }
>> tg = 0x564970
>> }
>
> Right, this cfs_rq is buggered. rb_leftmost may be null when the tree is
> empty (as is the case here).
>
> However cfs_rq->curr != NULL and cfs_rq->nr_running != 0.
>
> So this hints at a missing put_prev_entity() - we keep current out of
> the tree, and put it back in right before we schedule(). The advantage
> is that we don't need to reposition (dequeue/enqueue) curr in the tree
> every time we update its virtual timeline.
>
> So what races so that we can miss put_prev_entity() and how is cpu_up()
> special..
>

hum, I'd rather suppose that something weird happened at the time of
cpu_down() and some per-cpu data is already inconsistent by the time
of cpu_up().

Is it with CONFIG_USER_SCHED?

Maybe we can write a small function that does a 'sanity' check:

for all sched_groups (task_groups): check the sanity of
group->cfs_rq[CPU] and group->se[CPU] somewhere early in cpu_up().

That way we can verify whether it's a leftover from cpu_down() or
something related to cpu_up().

hm?


--
Best regards,
Dmitry Adamushko

2008-06-20 22:25:59

by Heiko Carstens

Subject: Re: [BUG] CFS vs cpu hotplug

On Fri, Jun 20, 2008 at 01:44:41PM +0200, Dmitry Adamushko wrote:
> 2008/6/19 Peter Zijlstra <[email protected]>:
> > Right, this cfs_rq is buggered. rb_leftmost may be null when the tree is
> > empty (as is the case here).
> >
> > However cfs_rq->curr != NULL and cfs_rq->nr_running != 0.
> >
> > So this hints at a missing put_prev_entity() - we keep current out of
> > the tree, and put it back in right before we schedule(). The advantage
> > is that we don't need to reposition (dequeue/enqueue) curr in the tree
> > every time we update its virtual timeline.
> >
> > So what races so that we can miss put_prev_entity() and how is cpu_up()
> > special..
> >
>
> hum, I'd rather suppose that something weird happened at the time of
> cpu_down() and some per-cpu data is already inconsistent by the time
> of cpu_up().
>
> Is it with CONFIG_USER_SCHED?

Yes. For full config see below.

> Maybe we can write a small function that does a 'sanity' check:
>
> for all sched_groups (task_groups): check the sanity of
> group->cfs_rq[CPU] and group->se[CPU] somewhere early in cpu_up().
>
> That way we can verify whether it's a leftover from cpu_down() or
> something related to cpu_up().
>
> hm?

If you have a patch at hand, I'll give it a try.

#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.26-rc6
# Sat Jun 21 00:20:36 2008
#
CONFIG_SCHED_MC=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_HAVE_LATENCYTOP_SUPPORT=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_HWEIGHT=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_BUG=y
CONFIG_NO_IOMEM=y
CONFIG_NO_DMA=y
CONFIG_GENERIC_LOCKBREAK=y
CONFIG_PGSTE=y
CONFIG_S390=y
CONFIG_DEFCONFIG_LIST="/lib/modules/$UNAME_RELEASE/.config"

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=""
CONFIG_LOCALVERSION_AUTO=y
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
# CONFIG_BSD_PROCESS_ACCT is not set
# CONFIG_TASKSTATS is not set
CONFIG_AUDIT=y
# CONFIG_AUDITSYSCALL is not set
CONFIG_IKCONFIG=y
CONFIG_IKCONFIG_PROC=y
CONFIG_LOG_BUF_SHIFT=17
CONFIG_CGROUPS=y
# CONFIG_CGROUP_DEBUG is not set
CONFIG_CGROUP_NS=y
# CONFIG_CGROUP_DEVICE is not set
# CONFIG_CPUSETS is not set
CONFIG_GROUP_SCHED=y
CONFIG_FAIR_GROUP_SCHED=y
# CONFIG_RT_GROUP_SCHED is not set
CONFIG_USER_SCHED=y
# CONFIG_CGROUP_SCHED is not set
# CONFIG_CGROUP_CPUACCT is not set
# CONFIG_RESOURCE_COUNTERS is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_SYSFS_DEPRECATED_V2=y
# CONFIG_RELAY is not set
CONFIG_NAMESPACES=y
CONFIG_UTS_NS=y
CONFIG_IPC_NS=y
# CONFIG_USER_NS is not set
# CONFIG_PID_NS is not set
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=""
# CONFIG_CC_OPTIMIZE_FOR_SIZE is not set
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_SYSCTL_SYSCALL=y
CONFIG_SYSCTL_SYSCALL_CHECK=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
# CONFIG_KALLSYMS_EXTRA_PASS is not set
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
# CONFIG_COMPAT_BRK is not set
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_TIMERFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLAB=y
# CONFIG_SLUB is not set
# CONFIG_SLOB is not set
# CONFIG_PROFILING is not set
# CONFIG_MARKERS is not set
CONFIG_HAVE_OPROFILE=y
CONFIG_KPROBES=y
CONFIG_KRETPROBES=y
CONFIG_HAVE_KPROBES=y
CONFIG_HAVE_KRETPROBES=y
# CONFIG_HAVE_DMA_ATTRS is not set
CONFIG_PROC_PAGE_MONITOR=y
CONFIG_SLABINFO=y
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
# CONFIG_MODULE_FORCE_LOAD is not set
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
# CONFIG_MODULE_SRCVERSION_ALL is not set
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_BLK_DEV_IO_TRACE is not set
CONFIG_BLK_DEV_BSG=y
CONFIG_BLOCK_COMPAT=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
CONFIG_DEFAULT_DEADLINE=y
# CONFIG_DEFAULT_CFQ is not set
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED="deadline"
CONFIG_PREEMPT_NOTIFIERS=y
CONFIG_CLASSIC_RCU=y

#
# Base setup
#

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_64BIT=y
CONFIG_SMP=y
CONFIG_NR_CPUS=32
CONFIG_HOTPLUG_CPU=y
CONFIG_COMPAT=y
CONFIG_SYSVIPC_COMPAT=y
CONFIG_AUDIT_ARCH=y
CONFIG_S390_SWITCH_AMODE=y
CONFIG_S390_EXEC_PROTECT=y

#
# Code generation options
#
# CONFIG_MARCH_G5 is not set
CONFIG_MARCH_Z900=y
# CONFIG_MARCH_Z990 is not set
# CONFIG_MARCH_Z9_109 is not set
CONFIG_PACK_STACK=y
# CONFIG_SMALL_STACK is not set
CONFIG_CHECK_STACK=y
CONFIG_STACK_GUARD=256
# CONFIG_WARN_STACK is not set
CONFIG_ARCH_POPULATES_NODE_MAP=y

#
# Kernel preemption
#
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
# CONFIG_PREEMPT_RCU is not set
CONFIG_ARCH_SPARSEMEM_ENABLE=y
CONFIG_ARCH_SPARSEMEM_DEFAULT=y
CONFIG_ARCH_SELECT_MEMORY_MODEL=y
CONFIG_SELECT_MEMORY_MODEL=y
# CONFIG_FLATMEM_MANUAL is not set
# CONFIG_DISCONTIGMEM_MANUAL is not set
CONFIG_SPARSEMEM_MANUAL=y
CONFIG_SPARSEMEM=y
CONFIG_HAVE_MEMORY_PRESENT=y
# CONFIG_SPARSEMEM_STATIC is not set
CONFIG_SPARSEMEM_EXTREME=y
CONFIG_SPARSEMEM_VMEMMAP_ENABLE=y
CONFIG_SPARSEMEM_VMEMMAP=y
CONFIG_PAGEFLAGS_EXTENDED=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_RESOURCES_64BIT=y
CONFIG_ZONE_DMA_FLAG=1
CONFIG_BOUNCE=y
CONFIG_VIRT_TO_BUS=y

#
# I/O subsystem configuration
#
CONFIG_MACHCHK_WARNING=y
CONFIG_QDIO=y
# CONFIG_QDIO_DEBUG is not set

#
# Misc
#
CONFIG_IPL=y
# CONFIG_IPL_TAPE is not set
CONFIG_IPL_VM=y
CONFIG_BINFMT_ELF=y
CONFIG_BINFMT_MISC=m
CONFIG_FORCE_MAX_ZONEORDER=9
# CONFIG_PROCESS_DEBUG is not set
CONFIG_PFAULT=y
# CONFIG_SHARED_KERNEL is not set
# CONFIG_CMM is not set
# CONFIG_PAGE_STATES is not set
CONFIG_VIRT_TIMER=y
CONFIG_VIRT_CPU_ACCOUNTING=y
# CONFIG_APPLDATA_BASE is not set
CONFIG_HZ_100=y
# CONFIG_HZ_250 is not set
# CONFIG_HZ_300 is not set
# CONFIG_HZ_1000 is not set
CONFIG_HZ=100
# CONFIG_SCHED_HRTICK is not set
CONFIG_S390_HYPFS_FS=y
CONFIG_KEXEC=y
# CONFIG_ZFCPDUMP is not set
CONFIG_S390_GUEST=y

#
# Networking
#
CONFIG_NET=y

#
# Networking options
#
CONFIG_PACKET=y
# CONFIG_PACKET_MMAP is not set
CONFIG_UNIX=y
CONFIG_XFRM=y
# CONFIG_XFRM_USER is not set
# CONFIG_XFRM_SUB_POLICY is not set
# CONFIG_XFRM_MIGRATE is not set
# CONFIG_XFRM_STATISTICS is not set
CONFIG_NET_KEY=y
# CONFIG_NET_KEY_MIGRATE is not set
CONFIG_IUCV=m
CONFIG_AFIUCV=m
CONFIG_INET=y
CONFIG_IP_MULTICAST=y
# CONFIG_IP_ADVANCED_ROUTER is not set
CONFIG_IP_FIB_HASH=y
# CONFIG_IP_PNP is not set
# CONFIG_NET_IPIP is not set
# CONFIG_NET_IPGRE is not set
# CONFIG_IP_MROUTE is not set
# CONFIG_ARPD is not set
# CONFIG_SYN_COOKIES is not set
# CONFIG_INET_AH is not set
# CONFIG_INET_ESP is not set
# CONFIG_INET_IPCOMP is not set
# CONFIG_INET_XFRM_TUNNEL is not set
CONFIG_INET_TUNNEL=y
CONFIG_INET_XFRM_MODE_TRANSPORT=y
CONFIG_INET_XFRM_MODE_TUNNEL=y
CONFIG_INET_XFRM_MODE_BEET=y
CONFIG_INET_LRO=y
CONFIG_INET_DIAG=y
CONFIG_INET_TCP_DIAG=y
# CONFIG_TCP_CONG_ADVANCED is not set
CONFIG_TCP_CONG_CUBIC=y
CONFIG_DEFAULT_TCP_CONG="cubic"
# CONFIG_TCP_MD5SIG is not set
# CONFIG_IP_VS is not set
CONFIG_IPV6=y
# CONFIG_IPV6_PRIVACY is not set
# CONFIG_IPV6_ROUTER_PREF is not set
# CONFIG_IPV6_OPTIMISTIC_DAD is not set
# CONFIG_INET6_AH is not set
# CONFIG_INET6_ESP is not set
# CONFIG_INET6_IPCOMP is not set
# CONFIG_IPV6_MIP6 is not set
# CONFIG_INET6_XFRM_TUNNEL is not set
# CONFIG_INET6_TUNNEL is not set
CONFIG_INET6_XFRM_MODE_TRANSPORT=y
CONFIG_INET6_XFRM_MODE_TUNNEL=y
CONFIG_INET6_XFRM_MODE_BEET=y
# CONFIG_INET6_XFRM_MODE_ROUTEOPTIMIZATION is not set
CONFIG_IPV6_SIT=y
CONFIG_IPV6_NDISC_NODETYPE=y
# CONFIG_IPV6_TUNNEL is not set
# CONFIG_IPV6_MULTIPLE_TABLES is not set
# CONFIG_IPV6_MROUTE is not set
# CONFIG_NETWORK_SECMARK is not set
CONFIG_NETFILTER=y
# CONFIG_NETFILTER_DEBUG is not set
CONFIG_NETFILTER_ADVANCED=y

#
# Core Netfilter Configuration
#
CONFIG_NETFILTER_NETLINK=m
CONFIG_NETFILTER_NETLINK_QUEUE=m
CONFIG_NETFILTER_NETLINK_LOG=m
CONFIG_NF_CONNTRACK=m
# CONFIG_NF_CT_ACCT is not set
# CONFIG_NF_CONNTRACK_MARK is not set
# CONFIG_NF_CONNTRACK_EVENTS is not set
# CONFIG_NF_CT_PROTO_DCCP is not set
# CONFIG_NF_CT_PROTO_SCTP is not set
# CONFIG_NF_CT_PROTO_UDPLITE is not set
# CONFIG_NF_CONNTRACK_AMANDA is not set
# CONFIG_NF_CONNTRACK_FTP is not set
# CONFIG_NF_CONNTRACK_H323 is not set
# CONFIG_NF_CONNTRACK_IRC is not set
# CONFIG_NF_CONNTRACK_NETBIOS_NS is not set
# CONFIG_NF_CONNTRACK_PPTP is not set
# CONFIG_NF_CONNTRACK_SANE is not set
# CONFIG_NF_CONNTRACK_SIP is not set
# CONFIG_NF_CONNTRACK_TFTP is not set
# CONFIG_NF_CT_NETLINK is not set
# CONFIG_NETFILTER_XTABLES is not set

#
# IP: Netfilter Configuration
#
# CONFIG_NF_CONNTRACK_IPV4 is not set
# CONFIG_IP_NF_QUEUE is not set
# CONFIG_IP_NF_IPTABLES is not set
# CONFIG_IP_NF_ARPTABLES is not set

#
# IPv6: Netfilter Configuration
#
# CONFIG_NF_CONNTRACK_IPV6 is not set
# CONFIG_IP6_NF_QUEUE is not set
# CONFIG_IP6_NF_IPTABLES is not set
# CONFIG_IP_DCCP is not set
CONFIG_IP_SCTP=m
# CONFIG_SCTP_DBG_MSG is not set
# CONFIG_SCTP_DBG_OBJCNT is not set
# CONFIG_SCTP_HMAC_NONE is not set
# CONFIG_SCTP_HMAC_SHA1 is not set
CONFIG_SCTP_HMAC_MD5=y
# CONFIG_TIPC is not set
# CONFIG_ATM is not set
# CONFIG_BRIDGE is not set
# CONFIG_VLAN_8021Q is not set
# CONFIG_DECNET is not set
# CONFIG_LLC2 is not set
# CONFIG_IPX is not set
# CONFIG_ATALK is not set
# CONFIG_X25 is not set
# CONFIG_LAPB is not set
# CONFIG_ECONET is not set
# CONFIG_WAN_ROUTER is not set
CONFIG_NET_SCHED=y

#
# Queueing/Scheduling
#
CONFIG_NET_SCH_CBQ=m
# CONFIG_NET_SCH_HTB is not set
# CONFIG_NET_SCH_HFSC is not set
CONFIG_NET_SCH_PRIO=m
CONFIG_NET_SCH_RR=m
CONFIG_NET_SCH_RED=m
CONFIG_NET_SCH_SFQ=m
CONFIG_NET_SCH_TEQL=m
CONFIG_NET_SCH_TBF=m
CONFIG_NET_SCH_GRED=m
CONFIG_NET_SCH_DSMARK=m
# CONFIG_NET_SCH_NETEM is not set
# CONFIG_NET_SCH_INGRESS is not set

#
# Classification
#
CONFIG_NET_CLS=y
# CONFIG_NET_CLS_BASIC is not set
CONFIG_NET_CLS_TCINDEX=m
CONFIG_NET_CLS_ROUTE4=m
CONFIG_NET_CLS_ROUTE=y
CONFIG_NET_CLS_FW=m
CONFIG_NET_CLS_U32=m
# CONFIG_CLS_U32_PERF is not set
CONFIG_CLS_U32_MARK=y
CONFIG_NET_CLS_RSVP=m
CONFIG_NET_CLS_RSVP6=m
CONFIG_NET_CLS_FLOW=m
# CONFIG_NET_EMATCH is not set
CONFIG_NET_CLS_ACT=y
CONFIG_NET_ACT_POLICE=y
# CONFIG_NET_ACT_GACT is not set
# CONFIG_NET_ACT_MIRRED is not set
CONFIG_NET_ACT_NAT=m
# CONFIG_NET_ACT_PEDIT is not set
# CONFIG_NET_ACT_SIMP is not set
# CONFIG_NET_CLS_IND is not set
CONFIG_NET_SCH_FIFO=y

#
# Network testing
#
# CONFIG_NET_PKTGEN is not set
# CONFIG_NET_TCPPROBE is not set
CONFIG_CAN=m
CONFIG_CAN_RAW=m
CONFIG_CAN_BCM=m

#
# CAN Device Drivers
#
CONFIG_CAN_VCAN=m
# CONFIG_CAN_DEBUG_DEVICES is not set
# CONFIG_AF_RXRPC is not set
# CONFIG_RFKILL is not set
# CONFIG_NET_9P is not set
# CONFIG_PCMCIA is not set
CONFIG_CCW=y

#
# Device Drivers
#

#
# Generic Driver Options
#
CONFIG_UEVENT_HELPER_PATH="/sbin/hotplug"
CONFIG_STANDALONE=y
CONFIG_PREVENT_FIRMWARE_BUILD=y
# CONFIG_FW_LOADER is not set
# CONFIG_DEBUG_DRIVER is not set
# CONFIG_DEBUG_DEVRES is not set
CONFIG_SYS_HYPERVISOR=y
# CONFIG_CONNECTOR is not set
CONFIG_BLK_DEV=y
# CONFIG_BLK_DEV_COW_COMMON is not set
CONFIG_BLK_DEV_LOOP=m
# CONFIG_BLK_DEV_CRYPTOLOOP is not set
CONFIG_BLK_DEV_NBD=m
CONFIG_BLK_DEV_RAM=y
CONFIG_BLK_DEV_RAM_COUNT=16
CONFIG_BLK_DEV_RAM_SIZE=4096
CONFIG_BLK_DEV_XIP=y
# CONFIG_CDROM_PKTCDVD is not set
# CONFIG_ATA_OVER_ETH is not set

#
# S/390 block device drivers
#
CONFIG_BLK_DEV_XPRAM=m
# CONFIG_DCSSBLK is not set
CONFIG_DASD=y
CONFIG_DASD_PROFILE=y
CONFIG_DASD_ECKD=y
CONFIG_DASD_FBA=y
CONFIG_DASD_DIAG=y
CONFIG_DASD_EER=y
CONFIG_VIRTIO_BLK=m
CONFIG_MISC_DEVICES=y
# CONFIG_EEPROM_93CX6 is not set
# CONFIG_ENCLOSURE_SERVICES is not set
# CONFIG_HAVE_IDE is not set

#
# SCSI device support
#
# CONFIG_RAID_ATTRS is not set
CONFIG_SCSI=y
# CONFIG_SCSI_DMA is not set
# CONFIG_SCSI_TGT is not set
CONFIG_SCSI_NETLINK=y
CONFIG_SCSI_PROC_FS=y

#
# SCSI support type (disk, tape, CD-ROM)
#
CONFIG_BLK_DEV_SD=y
CONFIG_CHR_DEV_ST=y
# CONFIG_CHR_DEV_OSST is not set
CONFIG_BLK_DEV_SR=y
CONFIG_BLK_DEV_SR_VENDOR=y
CONFIG_CHR_DEV_SG=y
# CONFIG_CHR_DEV_SCH is not set

#
# Some SCSI devices (e.g. CD jukebox) support multiple LUNs
#
CONFIG_SCSI_MULTI_LUN=y
CONFIG_SCSI_CONSTANTS=y
CONFIG_SCSI_LOGGING=y
CONFIG_SCSI_SCAN_ASYNC=y
CONFIG_SCSI_WAIT_SCAN=m

#
# SCSI Transports
#
# CONFIG_SCSI_SPI_ATTRS is not set
CONFIG_SCSI_FC_ATTRS=y
# CONFIG_SCSI_ISCSI_ATTRS is not set
# CONFIG_SCSI_SAS_ATTRS is not set
# CONFIG_SCSI_SAS_LIBSAS is not set
# CONFIG_SCSI_SRP_ATTRS is not set
CONFIG_SCSI_LOWLEVEL=y
# CONFIG_ISCSI_TCP is not set
# CONFIG_SCSI_DEBUG is not set
CONFIG_ZFCP=y
CONFIG_MD=y
CONFIG_BLK_DEV_MD=y
CONFIG_MD_LINEAR=m
CONFIG_MD_RAID0=m
CONFIG_MD_RAID1=m
# CONFIG_MD_RAID10 is not set
# CONFIG_MD_RAID456 is not set
CONFIG_MD_MULTIPATH=m
# CONFIG_MD_FAULTY is not set
CONFIG_BLK_DEV_DM=y
# CONFIG_DM_DEBUG is not set
CONFIG_DM_CRYPT=y
CONFIG_DM_SNAPSHOT=y
CONFIG_DM_MIRROR=y
CONFIG_DM_ZERO=y
CONFIG_DM_MULTIPATH=y
# CONFIG_DM_MULTIPATH_EMC is not set
# CONFIG_DM_MULTIPATH_RDAC is not set
# CONFIG_DM_MULTIPATH_HP is not set
# CONFIG_DM_DELAY is not set
# CONFIG_DM_UEVENT is not set
CONFIG_NETDEVICES=y
# CONFIG_NETDEVICES_MULTIQUEUE is not set
# CONFIG_IFB is not set
CONFIG_DUMMY=m
CONFIG_BONDING=m
# CONFIG_MACVLAN is not set
CONFIG_EQUALIZER=m
CONFIG_TUN=m
CONFIG_VETH=m
CONFIG_NET_ETHERNET=y
# CONFIG_MII is not set
# CONFIG_IBM_NEW_EMAC_ZMII is not set
# CONFIG_IBM_NEW_EMAC_RGMII is not set
# CONFIG_IBM_NEW_EMAC_TAH is not set
# CONFIG_IBM_NEW_EMAC_EMAC4 is not set
CONFIG_NETDEV_1000=y
# CONFIG_E1000E_ENABLED is not set
CONFIG_NETDEV_10000=y
# CONFIG_TR is not set
# CONFIG_WAN is not set

#
# S/390 network device drivers
#
CONFIG_LCS=m
CONFIG_CTCM=m
# CONFIG_NETIUCV is not set
# CONFIG_SMSGIUCV is not set
# CONFIG_CLAW is not set
CONFIG_QETH=y
CONFIG_QETH_L2=y
CONFIG_QETH_L3=y
CONFIG_QETH_IPV6=y
CONFIG_CCWGROUP=y
# CONFIG_PPP is not set
# CONFIG_SLIP is not set
# CONFIG_NETCONSOLE is not set
# CONFIG_NETPOLL is not set
# CONFIG_NET_POLL_CONTROLLER is not set
CONFIG_VIRTIO_NET=m

#
# Character devices
#
CONFIG_DEVKMEM=y
CONFIG_UNIX98_PTYS=y
CONFIG_LEGACY_PTYS=y
CONFIG_LEGACY_PTY_COUNT=256
CONFIG_HW_RANDOM=m
# CONFIG_HW_RANDOM_VIRTIO is not set
# CONFIG_R3964 is not set
CONFIG_RAW_DRIVER=m
CONFIG_MAX_RAW_DEVS=256
# CONFIG_HANGCHECK_TIMER is not set

#
# S/390 character device drivers
#
CONFIG_TN3270=y
CONFIG_TN3270_TTY=y
CONFIG_TN3270_FS=m
CONFIG_TN3270_CONSOLE=y
CONFIG_TN3215=y
CONFIG_TN3215_CONSOLE=y
CONFIG_CCW_CONSOLE=y
CONFIG_SCLP_TTY=y
CONFIG_SCLP_CONSOLE=y
CONFIG_SCLP_VT220_TTY=y
CONFIG_SCLP_VT220_CONSOLE=y
CONFIG_SCLP_CPI=m
CONFIG_S390_TAPE=m

#
# S/390 tape interface support
#
CONFIG_S390_TAPE_BLOCK=y

#
# S/390 tape hardware support
#
CONFIG_S390_TAPE_34XX=m
# CONFIG_S390_TAPE_3590 is not set
# CONFIG_VMLOGRDR is not set
# CONFIG_VMCP is not set
# CONFIG_MONREADER is not set
CONFIG_MONWRITER=m
CONFIG_S390_VMUR=m
# CONFIG_POWER_SUPPLY is not set
# CONFIG_THERMAL is not set
# CONFIG_WATCHDOG is not set

#
# Sonics Silicon Backplane
#
# CONFIG_MEMSTICK is not set
# CONFIG_NEW_LEDS is not set
CONFIG_ACCESSIBILITY=y

#
# File systems
#
CONFIG_EXT2_FS=y
# CONFIG_EXT2_FS_XATTR is not set
# CONFIG_EXT2_FS_XIP is not set
CONFIG_EXT3_FS=y
CONFIG_EXT3_FS_XATTR=y
# CONFIG_EXT3_FS_POSIX_ACL is not set
# CONFIG_EXT3_FS_SECURITY is not set
# CONFIG_EXT4DEV_FS is not set
CONFIG_JBD=y
# CONFIG_JBD_DEBUG is not set
CONFIG_FS_MBCACHE=y
# CONFIG_REISERFS_FS is not set
# CONFIG_JFS_FS is not set
CONFIG_FS_POSIX_ACL=y
# CONFIG_XFS_FS is not set
# CONFIG_GFS2_FS is not set
# CONFIG_OCFS2_FS is not set
CONFIG_DNOTIFY=y
CONFIG_INOTIFY=y
CONFIG_INOTIFY_USER=y
# CONFIG_QUOTA is not set
# CONFIG_AUTOFS_FS is not set
# CONFIG_AUTOFS4_FS is not set
# CONFIG_FUSE_FS is not set
CONFIG_GENERIC_ACL=y

#
# CD-ROM/DVD Filesystems
#
# CONFIG_ISO9660_FS is not set
# CONFIG_UDF_FS is not set

#
# DOS/FAT/NT Filesystems
#
# CONFIG_MSDOS_FS is not set
# CONFIG_VFAT_FS is not set
# CONFIG_NTFS_FS is not set

#
# Pseudo filesystems
#
CONFIG_PROC_FS=y
CONFIG_PROC_KCORE=y
CONFIG_PROC_SYSCTL=y
CONFIG_SYSFS=y
CONFIG_TMPFS=y
CONFIG_TMPFS_POSIX_ACL=y
# CONFIG_HUGETLBFS is not set
# CONFIG_HUGETLB_PAGE is not set
CONFIG_CONFIGFS_FS=m

#
# Miscellaneous filesystems
#
# CONFIG_ADFS_FS is not set
# CONFIG_AFFS_FS is not set
# CONFIG_HFS_FS is not set
# CONFIG_HFSPLUS_FS is not set
# CONFIG_BEFS_FS is not set
# CONFIG_BFS_FS is not set
# CONFIG_EFS_FS is not set
# CONFIG_CRAMFS is not set
# CONFIG_VXFS_FS is not set
# CONFIG_MINIX_FS is not set
# CONFIG_HPFS_FS is not set
# CONFIG_QNX4FS_FS is not set
# CONFIG_ROMFS_FS is not set
# CONFIG_SYSV_FS is not set
# CONFIG_UFS_FS is not set
CONFIG_NETWORK_FILESYSTEMS=y
CONFIG_NFS_FS=y
CONFIG_NFS_V3=y
# CONFIG_NFS_V3_ACL is not set
# CONFIG_NFS_V4 is not set
CONFIG_NFSD=y
CONFIG_NFSD_V3=y
# CONFIG_NFSD_V3_ACL is not set
# CONFIG_NFSD_V4 is not set
CONFIG_LOCKD=y
CONFIG_LOCKD_V4=y
CONFIG_EXPORTFS=y
CONFIG_NFS_COMMON=y
CONFIG_SUNRPC=y
# CONFIG_SUNRPC_BIND34 is not set
# CONFIG_RPCSEC_GSS_KRB5 is not set
# CONFIG_RPCSEC_GSS_SPKM3 is not set
# CONFIG_SMB_FS is not set
# CONFIG_CIFS is not set
# CONFIG_NCP_FS is not set
# CONFIG_CODA_FS is not set
# CONFIG_AFS_FS is not set

#
# Partition Types
#
CONFIG_PARTITION_ADVANCED=y
# CONFIG_ACORN_PARTITION is not set
# CONFIG_OSF_PARTITION is not set
# CONFIG_AMIGA_PARTITION is not set
# CONFIG_ATARI_PARTITION is not set
CONFIG_IBM_PARTITION=y
# CONFIG_MAC_PARTITION is not set
CONFIG_MSDOS_PARTITION=y
# CONFIG_BSD_DISKLABEL is not set
# CONFIG_MINIX_SUBPARTITION is not set
# CONFIG_SOLARIS_X86_PARTITION is not set
# CONFIG_UNIXWARE_DISKLABEL is not set
# CONFIG_LDM_PARTITION is not set
# CONFIG_SGI_PARTITION is not set
# CONFIG_ULTRIX_PARTITION is not set
# CONFIG_SUN_PARTITION is not set
# CONFIG_KARMA_PARTITION is not set
# CONFIG_EFI_PARTITION is not set
# CONFIG_SYSV68_PARTITION is not set
# CONFIG_NLS is not set
CONFIG_DLM=m
# CONFIG_DLM_DEBUG is not set

#
# Kernel hacking
#
CONFIG_TRACE_IRQFLAGS_SUPPORT=y
# CONFIG_PRINTK_TIME is not set
CONFIG_ENABLE_WARN_DEPRECATED=y
CONFIG_ENABLE_MUST_CHECK=y
CONFIG_FRAME_WARN=2048
CONFIG_MAGIC_SYSRQ=y
# CONFIG_UNUSED_SYMBOLS is not set
CONFIG_DEBUG_FS=y
# CONFIG_HEADERS_CHECK is not set
CONFIG_DEBUG_KERNEL=y
# CONFIG_SCHED_DEBUG is not set
# CONFIG_SCHEDSTATS is not set
# CONFIG_TIMER_STATS is not set
# CONFIG_DEBUG_OBJECTS is not set
# CONFIG_DEBUG_SLAB is not set
CONFIG_DEBUG_PREEMPT=y
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_RT_MUTEX_TESTER is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_MUTEXES=y
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_SPINLOCK_SLEEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_DEBUG_KOBJECT is not set
CONFIG_DEBUG_BUGVERBOSE=y
# CONFIG_DEBUG_INFO is not set
# CONFIG_DEBUG_VM is not set
# CONFIG_DEBUG_WRITECOUNT is not set
# CONFIG_DEBUG_LIST is not set
# CONFIG_DEBUG_SG is not set
# CONFIG_FRAME_POINTER is not set
# CONFIG_RCU_TORTURE_TEST is not set
# CONFIG_KPROBES_SANITY_TEST is not set
# CONFIG_BACKTRACE_SELF_TEST is not set
# CONFIG_LKDTM is not set
# CONFIG_FAULT_INJECTION is not set
# CONFIG_LATENCYTOP is not set
CONFIG_SAMPLES=y
# CONFIG_SAMPLE_KOBJECT is not set
# CONFIG_SAMPLE_KPROBES is not set
# CONFIG_DEBUG_PAGEALLOC is not set

#
# Security options
#
# CONFIG_KEYS is not set
# CONFIG_SECURITY is not set
# CONFIG_SECURITY_FILE_CAPABILITIES is not set
CONFIG_CRYPTO=y

#
# Crypto core or helper
#
CONFIG_CRYPTO_ALGAPI=y
CONFIG_CRYPTO_AEAD=m
CONFIG_CRYPTO_BLKCIPHER=y
CONFIG_CRYPTO_HASH=m
CONFIG_CRYPTO_MANAGER=y
CONFIG_CRYPTO_GF128MUL=m
# CONFIG_CRYPTO_NULL is not set
# CONFIG_CRYPTO_CRYPTD is not set
CONFIG_CRYPTO_AUTHENC=m
# CONFIG_CRYPTO_TEST is not set

#
# Authenticated Encryption with Associated Data
#
CONFIG_CRYPTO_CCM=m
CONFIG_CRYPTO_GCM=m
CONFIG_CRYPTO_SEQIV=m

#
# Block modes
#
CONFIG_CRYPTO_CBC=y
CONFIG_CRYPTO_CTR=m
CONFIG_CRYPTO_CTS=m
CONFIG_CRYPTO_ECB=m
# CONFIG_CRYPTO_LRW is not set
CONFIG_CRYPTO_PCBC=m
# CONFIG_CRYPTO_XTS is not set

#
# Hash modes
#
CONFIG_CRYPTO_HMAC=m
# CONFIG_CRYPTO_XCBC is not set

#
# Digest
#
# CONFIG_CRYPTO_CRC32C is not set
# CONFIG_CRYPTO_MD4 is not set
CONFIG_CRYPTO_MD5=m
# CONFIG_CRYPTO_MICHAEL_MIC is not set
CONFIG_CRYPTO_SHA1=m
# CONFIG_CRYPTO_SHA256 is not set
# CONFIG_CRYPTO_SHA512 is not set
# CONFIG_CRYPTO_TGR192 is not set
# CONFIG_CRYPTO_WP512 is not set

#
# Ciphers
#
# CONFIG_CRYPTO_AES is not set
# CONFIG_CRYPTO_ANUBIS is not set
# CONFIG_CRYPTO_ARC4 is not set
# CONFIG_CRYPTO_BLOWFISH is not set
CONFIG_CRYPTO_CAMELLIA=m
# CONFIG_CRYPTO_CAST5 is not set
# CONFIG_CRYPTO_CAST6 is not set
# CONFIG_CRYPTO_DES is not set
CONFIG_CRYPTO_FCRYPT=m
# CONFIG_CRYPTO_KHAZAD is not set
CONFIG_CRYPTO_SALSA20=m
CONFIG_CRYPTO_SEED=m
# CONFIG_CRYPTO_SERPENT is not set
# CONFIG_CRYPTO_TEA is not set
# CONFIG_CRYPTO_TWOFISH is not set

#
# Compression
#
# CONFIG_CRYPTO_DEFLATE is not set
CONFIG_CRYPTO_LZO=m
CONFIG_CRYPTO_HW=y
CONFIG_ZCRYPT=m
# CONFIG_ZCRYPT_MONOLITHIC is not set
# CONFIG_CRYPTO_SHA1_S390 is not set
# CONFIG_CRYPTO_SHA256_S390 is not set
CONFIG_CRYPTO_SHA512_S390=m
# CONFIG_CRYPTO_DES_S390 is not set
# CONFIG_CRYPTO_AES_S390 is not set
CONFIG_S390_PRNG=m

#
# Library routines
#
CONFIG_BITREVERSE=m
# CONFIG_GENERIC_FIND_FIRST_BIT is not set
# CONFIG_GENERIC_FIND_NEXT_BIT is not set
# CONFIG_CRC_CCITT is not set
# CONFIG_CRC16 is not set
# CONFIG_CRC_ITU_T is not set
CONFIG_CRC32=m
CONFIG_CRC7=m
CONFIG_LIBCRC32C=m
CONFIG_LZO_COMPRESS=m
CONFIG_LZO_DECOMPRESS=m
CONFIG_PLIST=y
CONFIG_HAVE_KVM=y
CONFIG_VIRTUALIZATION=y
CONFIG_KVM=m
CONFIG_VIRTIO=y
CONFIG_VIRTIO_RING=y
CONFIG_VIRTIO_BALLOON=m

2008-06-21 04:18:35

by Heiko Carstens

Subject: Re: [BUG] CFS vs cpu hotplug

On Fri, Jun 20, 2008 at 10:51:03AM +0200, Peter Zijlstra wrote:
> On Thu, 2008-06-19 at 23:49 +0200, Heiko Carstens wrote:
> > On Thu, Jun 19, 2008 at 11:32:29PM +0200, Peter Zijlstra wrote:
> > > On Thu, 2008-06-19 at 20:05 +0200, Peter Zijlstra wrote:
> > > > On Thu, 2008-06-19 at 18:19 +0200, Heiko Carstens wrote:
> > >
> > > > > The sched_entity that belongs to the cfs_rq:
> > > > >
> > > > > >> px *(sched_entity *) 0x759300
> > > > > struct sched_entity {
> > > > > load = struct load_weight {
> > > > > weight = 0x800
> > > > > inv_weight = 0x1ffc01
> > > > > }
> > > > > run_node = struct rb_node {
> > > > > rb_parent_color = 0x1
> > > > > rb_right = (nil)
> > > > > rb_left = (nil)
> > > > > }
> > > > > group_node = struct list_head {
> > > > > next = 0x75a3b8
> > > > > prev = 0x75a3b8
> > > > > }
> > > > > on_rq = 0x1
> > > > > exec_start = 0x189685acb4aa46
> > > > > sum_exec_runtime = 0x188a2b84c
> > > > > vruntime = 0xd036bd29
> > > > > prev_sum_exec_runtime = 0x1672e3f62
> > > > > last_wakeup = 0x0
> > > > > avg_overlap = 0x0
> > > > > parent = (nil)
> > > > > cfs_rq = 0x75a380
> > > > > my_q = 0x759400
> > > > > }
> > >
> > > If you still have this dump, could you give the output of:
> > >
> > > px *(struct cfs_rq *) 0x759400
> > >
> > > And possibly walk down the chain getting its curr and then my_q again
> > > etc..
> >
> > Sure, fortunately just a very short chain:
> >
> > >> px *(struct cfs_rq *) 0x759400
> > struct cfs_rq {
> > load = struct load_weight {
> > weight = 0xc31
> > inv_weight = 0x0
> > }
> > nr_running = 0x1
> > exec_clock = 0x0
> > min_vruntime = 0x4f216b005
> > tasks_timeline = struct rb_root {
> > rb_node = 0x2fca4d40
> > }
> > rb_leftmost = 0x2fca4d40
> > tasks = struct list_head {
> > next = 0x2fca4d58
> > prev = 0x2fca4d58
> > }
> > balance_iterator = 0x2e29e700
> > curr = 0x2ef4f388
> > next = (nil)
> > nr_spread_over = 0x0
> > rq = 0x75a300
> > leaf_cfs_rq_list = struct list_head {
> > next = 0x75aaa0
> > prev = 0x2e1eca70
> > }
> > tg = 0x564910
> > }
>
> Hmm this one is buggered as well, it has nr_running = 1, and one entry
> in the tree, but also a !NULL curr.
>
> Could you please show:
>
> px *container_of(0x2fca4d40, struct sched_entity, run_node)
>
> which one might have to write like:
>
> px *((struct sched_entity *)((char *)0x2fca4d40 - (unsigned long)&((struct sched_entity *)0)->run_node))
>
> /me prays he got the braces right,..

Here we go:

>> offset sched_entity.run_node
Offset: 16 bytes.

>> px *(sched_entity *) 0x2fca4d30
struct sched_entity {
load = struct load_weight {
weight = 0xc31
inv_weight = 0x14ff97
}
run_node = struct rb_node {
rb_parent_color = 0x1
rb_right = (nil)
rb_left = (nil)
}
group_node = struct list_head {
next = 0x759438
prev = 0x759438
}
on_rq = 0x1
exec_start = 0x1896859fb4ff76
sum_exec_runtime = 0x1f19
vruntime = 0x4f128ead9
prev_sum_exec_runtime = 0x0
last_wakeup = 0x0
avg_overlap = 0x0
parent = 0x759300
cfs_rq = 0x759400
my_q = (nil)
}

2008-06-25 22:12:38

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

2008/6/19 Heiko Carstens <[email protected]>:
> Hi Ingo, Peter,
>
> I'm still seeing kernel crashes on cpu hotplug with Linus' current git tree.
> All I have to do is to make all cpus busy (make -j4 of the kernel source is
> sufficient) and then start cpu hotplug stress.
> It usually takes below a minute to crash the system like this:
>
> Unable to handle kernel pointer dereference at virtual kernel address 005a800000031000
> Oops: 0038 [#1] PREEMPT SMP
> Modules linked in:
> CPU: 1 Not tainted 2.6.26-rc6-00232-g9bedbcb #356
> Process swapper (pid: 0, task: 000000002fe7ccf8, ksp: 000000002fe93d78)
> Krnl PSW : 0400e00180000000 0000000000032c6c (pick_next_task_fair+0x34/0xb0)
> R:0 T:1 IO:0 EX:0 Key:0 M:0 W:0 P:0 AS:3 CC:2 PM:0 EA:3
> Krnl GPRS: 00000000001ff000 0000000000030bd8 000000000075a380 000000002fe7ccf8
> 0000000000386690 0000000000000008 0000000000000000 000000002fe7cf58
> 0000000000000001 000000000075a300 0000000000000000 000000002fe93d40
> 005a800000031201 0000000000386010 000000002fe93d78 000000002fe93d40
> Krnl Code: 0000000000032c5c: e3e0f0980024 stg %r14,152(%r15)
> 0000000000032c62: d507d000c010 clc 0(8,%r13),16(%r12)
> 0000000000032c68: a784003c brc 8,32ce0
> >0000000000032c6c: d507d000c030 clc 0(8,%r13),48(%r12)
> 0000000000032c72: b904002c lgr %r2,%r12
> 0000000000032c76: a7a90000 lghi %r10,0
> 0000000000032c7a: a7840021 brc 8,32cbc
> 0000000000032c7e: c0e5ffffefe3 brasl %r14,30c44
> Call Trace:
> ([<000000000075a300>] 0x75a300)
> [<000000000037195a>] schedule+0x162/0x7f4
> [<000000000001a2be>] cpu_idle+0x1ca/0x25c
> [<000000000036f368>] start_secondary+0xac/0xb8
> [<0000000000000000>] 0x0
> [<0000000000000000>] 0x0
> Last Breaking-Event-Address:
> [<0000000000032cc6>] pick_next_task_fair+0x8e/0xb0
> <4>---[ end trace 9bb55df196feedcc ]---
> Kernel panic - not syncing: Attempted to kill the idle task!
>
> Please note that the above call trace is from s390, however Avi reported the
> same bug on x86_64.

FYI, I've managed to reproduce it 3 times (took 10 to 45 minutes) on
my dual-core Thinkpad R60.

(1) make -j3 of the kernel source

(2) a loop with : offline cpu_1 ; sleep 1 ; online cpu_1 ; sleep 1

Two of those times it happened in the GUI environment, so I couldn't see
an oops (although I could hear it, as for the very first time my laptop
was constantly beeping :-)

Strangely enough, an oops didn't appear in plain console mode either
(well, at least not on the active terminal), although my additional
debugging message from pick_next_task_fair() did appear on the screen
right before the system froze.

It's in the loop of pick_next_task_fair():

do {
	se = pick_next_entity(cfs_rq);

	if (unlikely(!se))
		printk(KERN_ERR "BUG: se == NULL but nr_running (%ld), "
			"load (%ld), rq-nr_running (%ld), rq-load (%ld)\n",
			cfs_rq->nr_running, cfs_rq->load.weight,
			rq->nr_running, rq->load.weight);

	cfs_rq = group_cfs_rq(se);
} while (cfs_rq);


BUG: se == NULL but nr_running (1), load (1024), rq-nr_running (1),
rq-load (1024)

so there is a crouching gremlin somewhere in the code :-/


--
Best regards,
Dmitry Adamushko

2008-06-28 22:17:15

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

Hello,


it seems to be related to migrate_dead_tasks().

Firstly, I added traces to see all tasks being migrated with
migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
pops up (the one with "se == NULL" in the loop of
pick_next_task_fair()) shortly after the traces indicate that something
has been migrated with migrate_dead_tasks(). Btw., I can reproduce it
much faster now with just a plain cpu down/up loop.

[disclaimer] Well, unless I'm really missing something important at
this late hour [/disclaimer], pick_next_task() is not something
appropriate for migrate_dead_tasks() :-)

The following change seems to eliminate the problem on my setup
(although I kept it running only for a few minutes, just long enough to
get a few messages indicating that migrate_dead_tasks() does move tasks
while the system stays OK).

[ quick hack ]

@@ -5887,6 +5907,7 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
next = pick_next_task(rq, rq->curr);
if (!next)
break;
+ next->sched_class->put_prev_task(rq, next);
migrate_dead(dead_cpu, next);

}

just in case, all the changes I've used for this test are attached "as is".

p.s. perhaps I won't be able to verify it carefully till tomorrow's
late evening.


--
Best regards,
Dmitry Adamushko


Attachments:
(No filename) (1.30 kB)
migration-experiment.patch (4.11 kB)
Download all attachments

2008-06-29 06:56:24

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug


* Dmitry Adamushko <[email protected]> wrote:

> Hello,
>
> it seems to be related to migrate_dead_tasks().
>
> Firstly I added traces to see all tasks being migrated with
> migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
> pops up (the one with "se == NULL" in the loop of
> pick_next_task_fair()) shortly after the traces indicate that some has
> been migrated with migrate_dead_tasks()). btw., I can reproduce it
> much faster now with just a plain cpu down/up loop.
>
> [disclaimer] Well, unless I'm really missing something important in
> this late hour [/desclaimer] pick_next_task() is not something
> appropriate for migrate_dead_tasks() :-)
>
> the following change seems to eliminate the problem on my setup
> (although, I kept it running only for a few minutes to get a few
> messages indicating migrate_dead_tasks() does move tasks and the
> system is still ok)
>
> [ quick hack ]
>
> @@ -5887,6 +5907,7 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
> next = pick_next_task(rq, rq->curr);
> if (!next)
> break;
> + next->sched_class->put_prev_task(rq, next);
> migrate_dead(dead_cpu, next);
>

thanks Dmitry - i've applied this chunk to tip/master and
tip/sched/urgent, for more testing.

if this turns out to be the final and full fix today, would you mind
submitting the rest of your checks as well? It seems like a rather
sensible set of sanity checks. Put them under CONFIG_SCHED_DEBUG or a
new (default-off) config option.

it would also be _very_ nice to have a built-in cpu hotplug tester in
the kernel, à la CONFIG_RCU_TORTURE_TEST=y. There's already sample code
in kernel/tracing/ of how to initiate hotplug events from within the
kernel.

Ingo

2008-06-30 09:08:17

by Heiko Carstens

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

On Sun, Jun 29, 2008 at 12:16:56AM +0200, Dmitry Adamushko wrote:
> Hello,
>
>
> it seems to be related to migrate_dead_tasks().
>
> Firstly I added traces to see all tasks being migrated with
> migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
> pops up (the one with "se == NULL" in the loop of
> pick_next_task_fair()) shortly after the traces indicate that some has
> been migrated with migrate_dead_tasks()). btw., I can reproduce it
> much faster now with just a plain cpu down/up loop.
>
> [disclaimer] Well, unless I'm really missing something important in
> this late hour [/desclaimer] pick_next_task() is not something
> appropriate for migrate_dead_tasks() :-)
>
> the following change seems to eliminate the problem on my setup
> (although, I kept it running only for a few minutes to get a few
> messages indicating migrate_dead_tasks() does move tasks and the
> system is still ok)
>
> [ quick hack ]
>
> @@ -5887,6 +5907,7 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
> next = pick_next_task(rq, rq->curr);
> if (!next)
> break;
> + next->sched_class->put_prev_task(rq, next);
> migrate_dead(dead_cpu, next);
>
> }

Thanks Dmitry! With your patch I cannot reproduce the bug anymore.

2008-06-30 09:17:44

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug


* Heiko Carstens <[email protected]> wrote:

> On Sun, Jun 29, 2008 at 12:16:56AM +0200, Dmitry Adamushko wrote:
> > Hello,
> >
> >
> > it seems to be related to migrate_dead_tasks().
> >
> > Firstly I added traces to see all tasks being migrated with
> > migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
> > pops up (the one with "se == NULL" in the loop of
> > pick_next_task_fair()) shortly after the traces indicate that some has
> > been migrated with migrate_dead_tasks()). btw., I can reproduce it
> > much faster now with just a plain cpu down/up loop.
> >
> > [disclaimer] Well, unless I'm really missing something important in
> > this late hour [/desclaimer] pick_next_task() is not something
> > appropriate for migrate_dead_tasks() :-)
> >
> > the following change seems to eliminate the problem on my setup
> > (although, I kept it running only for a few minutes to get a few
> > messages indicating migrate_dead_tasks() does move tasks and the
> > system is still ok)
> >
> > [ quick hack ]
> >
> > @@ -5887,6 +5907,7 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
> > next = pick_next_task(rq, rq->curr);
> > if (!next)
> > break;
> > + next->sched_class->put_prev_task(rq, next);
> > migrate_dead(dead_cpu, next);
> >
> > }
>
> Thanks Dmitry! With your patch I cannot reproduce the bug anymore.

thanks - it passed my testing too. It's lined up for v2.6.26 merge, in
tip/sched/urgent.

Avi, does this patch fix your CPU hotplug problems too?

Ingo

2008-07-01 09:24:58

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

Ingo Molnar wrote:
> * Heiko Carstens <[email protected]> wrote:
>
>> On Sun, Jun 29, 2008 at 12:16:56AM +0200, Dmitry Adamushko wrote:
>>> Hello,
>>>
>>>
>>> it seems to be related to migrate_dead_tasks().
>>>
>>> Firstly I added traces to see all tasks being migrated with
>>> migrate_live_tasks() and migrate_dead_tasks(). On my setup the problem
>>> pops up (the one with "se == NULL" in the loop of
>>> pick_next_task_fair()) shortly after the traces indicate that some has
>>> been migrated with migrate_dead_tasks()). btw., I can reproduce it
>>> much faster now with just a plain cpu down/up loop.
>>>
>>> [disclaimer] Well, unless I'm really missing something important in
>>> this late hour [/desclaimer] pick_next_task() is not something
>>> appropriate for migrate_dead_tasks() :-)
>>>
>>> the following change seems to eliminate the problem on my setup
>>> (although, I kept it running only for a few minutes to get a few
>>> messages indicating migrate_dead_tasks() does move tasks and the
>>> system is still ok)
>>>
>>> [ quick hack ]
>>>
>>> @@ -5887,6 +5907,7 @@ static void migrate_dead_tasks(unsigned int dead_cpu)
>>> next = pick_next_task(rq, rq->curr);
>>> if (!next)
>>> break;
>>> + next->sched_class->put_prev_task(rq, next);
>>> migrate_dead(dead_cpu, next);
>>>
>>> }
>> Thanks Dmitry! With your patch I cannot reproduce the bug anymore.
>
> thanks - it passed my testing too. It's lined up for v2.6.26 merge, in
> tip/sched/urgent.
>
> Avi, does this patch fix your CPU hotplug problems too?
>
> Ingo

Hi, Ingo

The following oops still occurred whether this patch is applied or not.

Lai Jiangshan


------------[ cut here ]------------
kernel BUG at kernel/sched.c:6133!
invalid opcode: 0000 [1] SMP
CPU 0
Modules linked in:
Pid: 4744, comm: cpu_online.sh Not tainted 2.6.26-rc8 #1
RIP: 0010:[<ffffffff8058d0a9>] [<ffffffff8058d0a9>] migration_call+0x3eb/0x494
RSP: 0018:ffff81007115fd28 EFLAGS: 00010202
RAX: ffffffffffffffe3 RBX: ffff810001017580 RCX: 000000801b7c6e42
RDX: ffff81007115fcf8 RSI: 0000009388d2771c RDI: ffff810001017e00
RBP: ffff81007115fd78 R08: ffff81007115e000 R09: ffff8100807d6000
R10: ffff81007fb6d050 R11: 00000000ffffffff R12: 0000000000000283
R13: ffff810001029580 R14: ffff810001029580 R15: 0000000000000002
FS: 00007fbb153d36f0(0000) GS:ffffffff807a3000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
CR2: 00007fabafe2b0a8 CR3: 0000000076901000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
Process cpu_online.sh (pid: 4744, threadinfo ffff81007115e000, task ffff810071447200)
Stack: ffff81007115e000 000000007115fbd8 00000000ffffffff 0000000000000002
ffff81007115fd78 0000000000000000 00000000ffffffff ffffffff807a1d40
0000000000000002 0000000000000007 ffff81007115fdb8 ffffffff8059372c
Call Trace:
[<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
[<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
[<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
[<ffffffff805736d6>] _cpu_down+0x191/0x256
[<ffffffff805737c1>] cpu_down+0x26/0x36
[<ffffffff805749c1>] store_online+0x32/0x75
[<ffffffff803d1982>] sysdev_store+0x24/0x26
[<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
[<ffffffff80290e6b>] vfs_write+0xae/0x137
[<ffffffff802913d3>] sys_write+0x47/0x70
[<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80


Code: 80 07 00 00 48 01 83 80 07 00 00 49 c7 85 80 07 00 00 00 00 00 00 41 fe 45 00 49 39 dd 74 02 fe 03 41 54 9d 49 83 7d 08 00 74 04 <0f> 0b eb fe 4c 89 ef e8 b8 40 00 00 eb 1e 48 8b 11 48 8b 41 08
RIP [<ffffffff8058d0a9>] migration_call+0x3eb/0x494
RSP <ffff81007115fd28>
---[ end trace f22fd757d4f07850 ]---

platform: x86_64 2cores*2cpus fedora9
# cat cpu_online.sh
#!/bin/sh

cpu1=1
cpu2=1
cpu3=1
while ((1))
do
no=$(($RANDOM % 3 + 1))
if ((!cpu$no))
then
echo 1 > /sys/devices/system/cpu/cpu$no/online
((cpu$no=1))
else
echo 0 > /sys/devices/system/cpu/cpu$no/online
((cpu$no=0))
fi
echo 1 $cpu1 $cpu2 $cpu3
sleep 2
done

2008-07-01 09:33:07

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug


* Lai Jiangshan <[email protected]> wrote:

> The following oops still occurred whether this patch is applied or not.

> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
> [<ffffffff805736d6>] _cpu_down+0x191/0x256
> [<ffffffff805737c1>] cpu_down+0x26/0x36
> [<ffffffff805749c1>] store_online+0x32/0x75
> [<ffffffff803d1982>] sysdev_store+0x24/0x26
> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
> [<ffffffff80290e6b>] vfs_write+0xae/0x137
> [<ffffffff802913d3>] sys_write+0x47/0x70
> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80

hm, there were multiple problems in this area and a lot of dormant bugs.
Do you have this recent upstream commit in your tree:

| commit fcb43042ef55d2f46b0efa5d7746967cef38f056
| Author: Zhang, Yanmin <[email protected]>
| Date: Tue Jun 24 16:06:23 2008 +0800
|
| x86: fix cpu hotplug crash
|
| Vegard Nossum reported crashes during cpu hotplug tests:
|
| http://marc.info/?l=linux-kernel&m=121413950227884&w=4

?

Ingo

2008-07-01 10:11:06

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

Ingo Molnar wrote:
> * Lai Jiangshan <[email protected]> wrote:
>
>> The following oops still occurred whether this patch is applied or not.
>
>> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
>> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
>> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
>> [<ffffffff805736d6>] _cpu_down+0x191/0x256
>> [<ffffffff805737c1>] cpu_down+0x26/0x36
>> [<ffffffff805749c1>] store_online+0x32/0x75
>> [<ffffffff803d1982>] sysdev_store+0x24/0x26
>> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
>> [<ffffffff80290e6b>] vfs_write+0xae/0x137
>> [<ffffffff802913d3>] sys_write+0x47/0x70
>> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80
>
> hm, there were multiple problems in this area and a lot of dormant bugs.
> Do you have this recent upstream commit in your tree:
No, I'll apply this patch and test it again. Thanks!
>
> | commit fcb43042ef55d2f46b0efa5d7746967cef38f056
> | Author: Zhang, Yanmin <[email protected]>
> | Date: Tue Jun 24 16:06:23 2008 +0800
> |
> | x86: fix cpu hotplug crash
> |
> | Vegard Nossum reported crashes during cpu hotplug tests:
> |
> | http://marc.info/?l=linux-kernel&m=121413950227884&w=4
>
> ?
>
> Ingo
>
>
>


2008-07-02 07:16:09

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

Ingo Molnar wrote:
> * Lai Jiangshan <[email protected]> wrote:
>
>> The following oops still occurred whether this patch is applied or not.
>
>> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
>> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
>> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
>> [<ffffffff805736d6>] _cpu_down+0x191/0x256
>> [<ffffffff805737c1>] cpu_down+0x26/0x36
>> [<ffffffff805749c1>] store_online+0x32/0x75
>> [<ffffffff803d1982>] sysdev_store+0x24/0x26
>> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
>> [<ffffffff80290e6b>] vfs_write+0xae/0x137
>> [<ffffffff802913d3>] sys_write+0x47/0x70
>> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80
>
> hm, there were multiple problems in this area and a lot of dormant bugs.
> Do you have this recent upstream commit in your tree:
Hi, Ingo
I tested it again with the most recent upstream commits (including the
patch below) applied; the oops still occurred.

Thanks, Lai Jiangshan

>
> | commit fcb43042ef55d2f46b0efa5d7746967cef38f056
> | Author: Zhang, Yanmin <[email protected]>
> | Date: Tue Jun 24 16:06:23 2008 +0800
> |
> | x86: fix cpu hotplug crash
> |
> | Vegard Nossum reported crashes during cpu hotplug tests:
> |
> | http://marc.info/?l=linux-kernel&m=121413950227884&w=4
>
> ?
>
> Ingo
>
>
>

2008-07-02 08:50:46

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

2008/7/2 Lai Jiangshan <[email protected]>:
> Ingo Molnar wrote:
>> * Lai Jiangshan <[email protected]> wrote:
>>
>>> The following oops still occurred whether this patch is applied or not.
>>
>>> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
>>> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
>>> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
>>> [<ffffffff805736d6>] _cpu_down+0x191/0x256
>>> [<ffffffff805737c1>] cpu_down+0x26/0x36
>>> [<ffffffff805749c1>] store_online+0x32/0x75
>>> [<ffffffff803d1982>] sysdev_store+0x24/0x26
>>> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
>>> [<ffffffff80290e6b>] vfs_write+0xae/0x137
>>> [<ffffffff802913d3>] sys_write+0x47/0x70
>>> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80
>>
>> hm, there were multiple problems in this area and a lot of dormant bugs.
>> Do you have this recent upstream commit in your tree:
> Hi, Ingo
> I tested it again with the most recent upstreams(including the
> following patch) committed, the oops still occurred.

[ taken from the oops ]
>
> kernel BUG at kernel/sched.c:6133!
>

is it BUG_ON(rq->nr_running != 0); in your sched.c?

hum, it's line #6134 in the recent sched.c version. So with the recent
version it was "kernel BUG at kernel/sched.c:6134!" right?

could you please try to get a crash with my additional debugging patch
(you may find it in this thread) applied?
We should see then all tasks that have been migrated (or failed to be
migrated) during migration_call(CPU_DEAD, ...).

TIA,

--
Best regards,
Dmitry Adamushko

2008-07-02 09:25:38

by Lai Jiangshan

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

Dmitry Adamushko wrote:
> 2008/7/2 Lai Jiangshan <[email protected]>:
>> Ingo Molnar wrote:
>>> * Lai Jiangshan <[email protected]> wrote:
>>>
>>>> The following oops still occurred whether this patch is applied or not.
>>>> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
>>>> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
>>>> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
>>>> [<ffffffff805736d6>] _cpu_down+0x191/0x256
>>>> [<ffffffff805737c1>] cpu_down+0x26/0x36
>>>> [<ffffffff805749c1>] store_online+0x32/0x75
>>>> [<ffffffff803d1982>] sysdev_store+0x24/0x26
>>>> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
>>>> [<ffffffff80290e6b>] vfs_write+0xae/0x137
>>>> [<ffffffff802913d3>] sys_write+0x47/0x70
>>>> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80
>>> hm, there were multiple problems in this area and a lot of dormant bugs.
>>> Do you have this recent upstream commit in your tree:
>> Hi, Ingo
>> I tested it again with the most recent upstreams(including the
>> following patch) committed, the oops still occurred.
>
> [ taken from the oops ]
>> kernel BUG at kernel/sched.c:6133!
>>
>
> is it BUG_ON(rq->nr_running != 0); in your sched.c?
Yes, I tested it twice yesterday, with and without your patch applied
(no debugging).
>
> hum, it's line #6134 in the recent sched.c version. So with the recent
> version it was "kernel BUG at kernel/sched.c:6134!" right?
Yes, and with yours and Zhang's patches applied, as Ingo advised.
>
> could you please try to get a crash with my additional debugging patch
> (you may find it in this thread) applied?
> We should see then all tasks that have been migrated (or failed to be
> migrated) during migration_call(CPU_DEAD, ...).
>
Thank you. I'll test it again with your debugging patch applied
and get more info.
> TIA,
>


2008-07-07 10:28:29

by Miao Xie

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

on 3:59 Lai Jiangshan wrote:
> Dmitry Adamushko wrote:
>> 2008/7/2 Lai Jiangshan <[email protected]>:
>>> Ingo Molnar wrote:
>>>> * Lai Jiangshan <[email protected]> wrote:
>>>>
>>>>> The following oops still occurred whether this patch is applied or not.
>>>>> [<ffffffff8059372c>] notifier_call_chain+0x33/0x5b
>>>>> [<ffffffff802476a9>] __raw_notifier_call_chain+0x9/0xb
>>>>> [<ffffffff802476ba>] raw_notifier_call_chain+0xf/0x11
>>>>> [<ffffffff805736d6>] _cpu_down+0x191/0x256
>>>>> [<ffffffff805737c1>] cpu_down+0x26/0x36
>>>>> [<ffffffff805749c1>] store_online+0x32/0x75
>>>>> [<ffffffff803d1982>] sysdev_store+0x24/0x26
>>>>> [<ffffffff802d2551>] sysfs_write_file+0xe0/0x11c
>>>>> [<ffffffff80290e6b>] vfs_write+0xae/0x137
>>>>> [<ffffffff802913d3>] sys_write+0x47/0x70
>>>>> [<ffffffff8020b1eb>] system_call_after_swapgs+0x7b/0x80
>>>> hm, there were multiple problems in this area and a lot of dormant bugs.
>>>> Do you have this recent upstream commit in your tree:
>>> Hi, Ingo
>>> I tested it again with the most recent upstreams(including the
>>> following patch) committed, the oops still occurred.
>> [ taken from the oops ]
>>> kernel BUG at kernel/sched.c:6133!
>>>
[snip]
>> We should see then all tasks that have been migrated (or failed to be
>> migrated) during migration_call(CPU_DEAD, ...).
>>
> Thank you. I'll test it again with your debugging patch applied
> and get more info.

I tested it with Dmitry's patch, and found that all the tasks on the offline
cpu were migrated to an online cpu by migrate_live_tasks() in migration_call().
But some tasks (such as klogd) were moved back to the offline cpu immediately,
before the BUG_ON(rq->nr_running != 0) check, and even before rq's lock was
acquired.

static int __cpuinit
migration_call(struct notifier_block *nfb, unsigned long action, void *
{
...
switch (action) {
...
case CPU_DEAD:
case CPU_DEAD_FROZEN:
cpuset_lock();
migrate_live_tasks(cpu);
rq = cpu_rq(cpu);
...
spin_lock_irq(&rq->lock);
...
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
cpuset_unlock();
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);
...
break;
}
...
}

By debugging, I found that this bug was caused by select_task_rq_fair().
After the tasks on the offline cpu were migrated to an online cpu, the kernel
would quickly wake these migrated tasks up again via try_to_wake_up(), which
invokes select_task_rq_fair() to find a lower-loaded cpu in the sched domains
for them. But the sched domains hadn't been updated yet, so the offline cpu
was still in them and select_task_rq_fair() could return the offline cpu's id,
triggering the bug.

I fixed the bug by simply checking select_task_rq_fair()'s return value in
try_to_wake_up().

Signed-off-by: Miao Xie <[email protected]>

---
kernel/sched.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/kernel/sched.c b/kernel/sched.c
index 94ead43..15b5ddf 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2103,6 +2103,9 @@ static int try_to_wake_up(struct task_struct *p, unsigned int state, int sync)
goto out_activate;

cpu = p->sched_class->select_task_rq(p, sync);
+ if (unlikely(cpu_is_offline(cpu)))
+ cpu = orig_cpu;
+
if (cpu != orig_cpu) {
set_task_cpu(p, cpu);
task_rq_unlock(rq, &flags);
--
1.5.4.rc3

2008-07-07 11:31:52

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

2008/7/7 Miao Xie <[email protected]>:
> on 3:59 Lai Jiangshan wrote:
>> Dmitry Adamushko wrote:
>>>
>>> [ ... ]
>>>
>>> We should see then all tasks that have been migrated (or failed to be
>>> migrated) during migration_call(CPU_DEAD, ...).
>>>
>> Thank you. I'll test it again with your debugging patch applied
>> and get more info.
>
> I tested it with Dmitry's patch, and found that all the tasks on the offline
> cpu were migrated to an online cpu by migrate_live_tasks() in migration_call().
> But some tasks(such as klogd and so on)was moved back to the offline cpu
> immediately before BUG_ON(rq->nr_running != 0) checking, even before acquiring
> rq's lock.
>
> static int __cpuinit
> migration_call(struct notifier_block *nfb, unsigned long action, void *
> {
> ...
> switch (action) {
> ...
> case CPU_DEAD:
> case CPU_DEAD_FROZEN:
> cpuset_lock();
> migrate_live_tasks(cpu);
> rq = cpu_rq(cpu);
> ...
> spin_lock_irq(&rq->lock);
> ...
> migrate_dead_tasks(cpu);
> spin_unlock_irq(&rq->lock);
> cpuset_unlock();
> migrate_nr_uninterruptible(rq);
> BUG_ON(rq->nr_running != 0);
> ...
> break;
> }
> ...
> }
>
> By debuging, I found this bug was caused by select_task_rq_fair().

Thanks for tracking this down!


> After migrating the tasks on the offline cpu to an online cpu, the kernel would
> wake up these migrated tasks quickly by try_to_wake_up(). try_to_wake_up() would
> invoke select_task_rq_fair() to find a lower-load cpu in sched domains for them.
> But the sched domains weren't updated and the offline cpu was still in the sched
> domains.

Hmm... if so, then this should be fixed, not select_task_rq_fair(). I
don't think this is expected behavior.


> So select_task_rq_fair() might return the offline cpu's id, then the
> bug occurred.
>
> I fix the bug just by checking the select_task_rq_fair()'s return value in
> try_to_wake_up().
>
> [ ... ]


--
Best regards,
Dmitry Adamushko

2008-07-09 22:32:56

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug


hm, while looking at this code again...


Ingo,

I think we may have a race between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu(),
where the latter may end up looping endlessly.


Subject: sched: prevent a potentially endless loop in move_task_off_dead_cpu()

Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is called, so we may get a race
between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu(). The former may push
a task out of a dead CPU, causing the latter to loop endlessly.


Signed-off-by: Dmitry Adamushko <[email protected]>

---
diff --git a/kernel/sched.c b/kernel/sched.c
index 94ead43..9397b87 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5621,8 +5621,10 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)

double_rq_lock(rq_src, rq_dest);
/* Already moved. */
- if (task_cpu(p) != src_cpu)
+ if (task_cpu(p) != src_cpu) {
+ ret = 1;
goto out;
+ }
/* Affinity changed (again). */
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;

---

2008-07-10 07:31:19

by Heiko Carstens

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug

On Thu, Jul 10, 2008 at 12:32:40AM +0200, Dmitry Adamushko wrote:
>
> hm, while looking at this code again...
>
>
> Ingo,
>
> I think we may have a race between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu()
> when the later one may end up looping endlessly.
>
>
> Subject: sched: prevent a potentially endless loop in move_task_off_dead_cpu()
>
> Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is called so we may get a race
> between try_to_wake_up() and migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push
> a task out of a dead CPU causing the later one to loop endlessly.

That exactly explains a dump I got yesterday. Thanks for fixing! :)

Will apply your patch and let you know if it fixes the problem.
(may take until next week unfortunately).

> Signed-off-by: Dmitry Adamushko <[email protected]>
>
> ---
> diff --git a/kernel/sched.c b/kernel/sched.c
> index 94ead43..9397b87 100644
> --- a/kernel/sched.c
> +++ b/kernel/sched.c
> @@ -5621,8 +5621,10 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)
>
> double_rq_lock(rq_src, rq_dest);
> /* Already moved. */
> - if (task_cpu(p) != src_cpu)
> + if (task_cpu(p) != src_cpu) {
> + ret = 1;
> goto out;
> + }
> /* Affinity changed (again). */
> if (!cpu_isset(dest_cpu, p->cpus_allowed))
> goto out;
>
> ---
>

2008-07-10 07:40:28

by Ingo Molnar

[permalink] [raw]
Subject: Re: [BUG] CFS vs cpu hotplug


* Heiko Carstens <[email protected]> wrote:

> > Subject: sched: prevent a potentially endless loop in
> > move_task_off_dead_cpu()
> >
> > Interrupts are enabled on other CPUs when migration_call(CPU_DEAD,
> > ...) is called so we may get a race between try_to_wake_up() and
> > migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may
> > push a task out of a dead CPU causing the later one to loop
> > endlessly.
>
> That's exactly what explains a dump I got yesterday. Thanks for
> fixing! :)

applied to tip/sched/urgent via the commit below - let's see whether we
can still get it into v2.6.26.

Ingo

---------------->
commit dc7fab8b3bb388c57c6c4a43ba68c8a32ca25204
Author: Dmitry Adamushko <[email protected]>
Date: Thu Jul 10 00:32:40 2008 +0200

sched: fix cpu hotplug

I think we may have a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu() when the later one
may end up looping endlessly.

Interrupts are enabled on other CPUs when migration_call(CPU_DEAD, ...) is
called so we may get a race between try_to_wake_up() and
migrate_live_tasks() -> move_task_off_dead_cpu(). The former one may push
a task out of a dead CPU causing the later one to loop endlessly.

Heiko Carstens observed:

| That's exactly what explains a dump I got yesterday. Thanks for fixing! :)

Signed-off-by: Dmitry Adamushko <[email protected]>
Cc: [email protected]
Cc: Lai Jiangshan <[email protected]>
Cc: Heiko Carstens <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Avi Kivity <[email protected]>
Cc: Andrew Morton <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>

diff --git a/kernel/sched.c b/kernel/sched.c
index 94ead43..9397b87 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -5621,8 +5621,10 @@ static int __migrate_task(struct task_struct *p, int src_cpu, int dest_cpu)

double_rq_lock(rq_src, rq_dest);
/* Already moved. */
- if (task_cpu(p) != src_cpu)
+ if (task_cpu(p) != src_cpu) {
+ ret = 1;
goto out;
+ }
/* Affinity changed (again). */
if (!cpu_isset(dest_cpu, p->cpus_allowed))
goto out;