2008-07-11 19:07:53

by Vegard Nossum

[permalink] [raw]
Subject: current linux-2.6.git: cpusets completely broken

Hi,

I have now "config-bisected" and found that the difference between a
working and a non-working config is just this:

+CONFIG_CPUSETS=y
+CONFIG_PROC_PID_CPUSET=y

The difference between a i386 defconfig base and the non-working config is:

+CONFIG_CGROUPS=y
+CONFIG_CPUSETS=y
+CONFIG_PROC_PID_CPUSET=y

(Note that group scheduling is off and has nothing to with it!)


The result of having CPUSETS enabled as above is a 100% reproducible
BUG on the very first cpu hot-unplug:

------------[ cut here ]------------
kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!
invalid opcode: 0000 [#1] SMP
Modules linked in:
Pid: 3653, comm: bash Not tainted (2.6.26-rc9-00102-ge5a5816 #3)
EIP: 0060:[<c040b6e7>] EFLAGS: 00210046 CPU: 0
EIP is at migration_call+0x29b/0x3bb
EAX: c1816558 EBX: c1816500 ECX: 01213000 EDX: c0603500
ESI: f7035f80 EDI: c1816500 EBP: 00000001 ESP: f6c17ec4
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3653, ti=f6c16000 task=f6cb1c00 task.ti=f6c16000)
Stack: c1816500 c0411b05 c0106fc8 c05b10f8 ffffffff 00000000 c05b11ac c0410375
00000001 00000007 00000007 00000001 00000000 f71f5500 c012f916 ffffffff
00000000 c03f42fb 00000000 00000001 fffffffd 00000003 0000001f 00000001
Call Trace:
[<c0411b05>] _etext+0x0/0xb
[<c0106fc8>] alternatives_smp_unlock+0x42/0x4f
[<c0410375>] notifier_call_chain+0x2a/0x47
[<c012f916>] raw_notifier_call_chain+0x9/0xc
[<c03f42fb>] _cpu_down+0x14c/0x1ee
[<c03f43bd>] cpu_down+0x20/0x2c
[<c03f53d0>] store_online+0x24/0x56
[<c03f53ac>] store_online+0x0/0x56
[<c0272446>] sysdev_store+0x1e/0x22
[<c0191d18>] sysfs_write_file+0xa4/0xd8
[<c0191c74>] sysfs_write_file+0x0/0xd8
[<c0160fe6>] vfs_write+0x83/0xf6
[<c0161521>] sys_write+0x3c/0x63
[<c0103545>] sysenter_past_esp+0x6a/0x91
=======================
Code: 18 85 c0 89 c6 75 04 8b 1b eb f0 8b 4e 24 89 f2 8b 04 24 ff 51 1c ba 00 35
60 c0 8b 0c ad 00 d2 5b c0 83 be c0 00 00 00 00 75 04 <0f> 0b eb fe 8b 06 83 f8
40 75 04 0f 0b eb fe 8d 1c 0a 90 ff 46
EIP: [<c040b6e7>] migration_call+0x29b/0x3bb SS:ESP 0068:f6c17ec4
BUG: NMI Watchdog detected LOCKUP on CPU0, ip c040e76e, registers:
Modules linked in:
Pid: 3653, comm: bash Not tainted (2.6.26-rc9-00102-ge5a5816 #3)
EIP: 0060:[<c040e76e>] EFLAGS: 00200097 CPU: 0
EIP is at _spin_lock+0x10/0x15
EAX: c1816500 EBX: c0603500 ECX: 63adf1df EDX: 0000e2e1
ESI: c1816500 EDI: f6c17d6c EBP: f6db5500 ESP: f6c17d50
DS: 007b ES: 007b FS: 00d8 GS: 0033 SS: 0068
Process bash (pid: 3653, ti=f6c16000 task=f6cb1c00 task.ti=f6c16000)
Stack: c01170bb f6db5500 c180c500 00000000 00000000 c011736c 00000001 00200092
f7353f3c c0566ef8 00000001 00000000 c012c4ad 00000000 f7353f3c c0566ef8
c0115e44 00000000 00000001 c0566f00 c0566f00 00000000 00000001 00200092
Call Trace:
[<c01170bb>] task_rq_lock+0x28/0x4b
[<c011736c>] try_to_wake_up+0x65/0xe0
[<c012c4ad>] autoremove_wake_function+0xd/0x2d
[<c0115e44>] __wake_up_common+0x2e/0x58
[<c0116b8c>] __wake_up+0x29/0x39
[<c011db2b>] wake_up_klogd+0x2b/0x2d
[<c010476f>] die+0xb1/0x10f
[<c0104839>] do_invalid_op+0x0/0x6b
[<c010489b>] do_invalid_op+0x62/0x6b
[<c040b6e7>] migration_call+0x29b/0x3bb
[<c0410cc7>] kprobe_flush_task+0x4b/0x80
[<c0118f1c>] hrtick_set+0x7a/0xd8
[<c040d52b>] schedule+0x5b6/0x5e8
[<c0117f04>] update_curr_rt+0x92/0x339
[<c040ea1a>] error_code+0x72/0x78
[<c01100d8>] send_IPI_mask_sequence+0x24/0x91
[<c040b6e7>] migration_call+0x29b/0x3bb
[<c0411b05>] _etext+0x0/0xb
[<c0106fc8>] alternatives_smp_unlock+0x42/0x4f
[<c0410375>] notifier_call_chain+0x2a/0x47
[<c012f916>] raw_notifier_call_chain+0x9/0xc
[<c03f42fb>] _cpu_down+0x14c/0x1ee
[<c03f43bd>] cpu_down+0x20/0x2c
[<c03f53d0>] store_online+0x24/0x56
[<c03f53ac>] store_online+0x0/0x56
[<c0272446>] sysdev_store+0x1e/0x22
[<c0191d18>] sysfs_write_file+0xa4/0xd8
[<c0191c74>] sysfs_write_file+0x0/0xd8
[<c0160fe6>] vfs_write+0x83/0xf6
[<c0161521>] sys_write+0x3c/0x63
[<c0103545>] sysenter_past_esp+0x6a/0x91
=======================
Code: 00 00 01 0f 94 c0 84 c0 b9 01 00 00 00 75 09 90 81 02 00 00 00 01 30 c9 89
c8 c3 ba 00 01 00 00 90 66 0f c1 10 38 f2 74 06 f3 90 <8a> 10 eb f6 c3 90 81 28
00 00 00 01 74 05 e8 4f ff ff ff c3 53

Also, this is on the latest linux-2.6.git! Since we're so close to
release, maybe cpusets should simply be marked BROKEN for now? (Unless
we can fix it, of course. The alternative is to apply Miao Xie's
workaround patch temporarily.)

I hope this helps at least a little bit.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036


2008-07-11 19:36:58

by Paul Menage

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Fri, Jul 11, 2008 at 12:07 PM, Vegard Nossum <[email protected]> wrote:
>
> The result of having CPUSETS enabled as above is a 100% reproducible
> BUG on the very first cpu hot-unplug:
>
> ------------[ cut here ]------------
> kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!

That doesn't quite match up with any BUG in 2.6.26-rc9 - what tree is
this last crash based on?

>
> Also, this is on the latest linux-2.6.git! Since we're so close to
> release, maybe cpusets should simply be marked BROKEN for now? (Unless
> we can fix it, of course. The alternative is to apply Miao Xie's
> workaround patch temporarily.)

If we were going to mark anything as broken, wouldn't cpu-hotplug be
the more appropriate victim? I suspect that there are more systems
using cpusets in production environments than using cpu hotplug. But
as you say, fixing it sounds better.

Paul

2008-07-11 19:43:28

by Vegard Nossum

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Fri, Jul 11, 2008 at 9:36 PM, Paul Menage <[email protected]> wrote:
> On Fri, Jul 11, 2008 at 12:07 PM, Vegard Nossum <[email protected]> wrote:
>>
>> The result of having CPUSETS enabled as above is a 100% reproducible
>> BUG on the very first cpu hot-unplug:
>>
>> ------------[ cut here ]------------
>> kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!
>
> That doesn't quite match up with any BUG in 2.6.26-rc9 - what tree is
> this last crash based on?

latest mainline. Commit e5a5816f7875207cb0a0a7032e39a4686c5e10a4.

Is this one:

/* called under rq->lock with disabled interrupts */
static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
{
struct rq *rq = cpu_rq(dead_cpu);

/* Must be exiting, otherwise would be on tasklist. */
BUG_ON(!p->exit_state);

>> Also, this is on the latest linux-2.6.git! Since we're so close to
>> release, maybe cpusets should simply be marked BROKEN for now? (Unless
>> we can fix it, of course. The alternative is to apply Miao Xie's
>> workaround patch temporarily.)
>
> If we were going to mark anything as broken, wouldn't cpu-hotplug be
> the more appropriate victim? I suspect that there are more systems
> using cpusets in production environments than using cpu hotplug. But
> as you say, fixing it sounds better.

I'm sorry for the harsh characterization and suggestion; please accept
my apology. It was purely a result of my excitement at having made
some progress in this case.

But I have more good news; reverting this:

commit f18f982abf183e91f435990d337164c7a43d1e6d
Author: Max Krasnyansky <[email protected]>
Date: Thu May 29 11:17:01 2008 -0700

sched: CPU hotplug events must not destroy scheduler domains created by the
cpusets

First issue is not related to the cpusets. We're simply leaking doms_cur.
It's allocated in arch_init_sched_domains() which is called for every
hotplug event. So we just keep reallocation doms_cur without freeing it.
I introduced free_sched_domains() function that cleans things up.

Second issue is that sched domains created by the cpusets are
completely destroyed by the CPU hotplug events. For all CPU hotplug
events scheduler attaches all CPUs to the NULL domain and then puts
them all into the single domain thereby destroying domains created
by the cpusets (partition_sched_domains).
The solution is simple, when cpusets are enabled scheduler should not
create default domain and instead let cpusets do that. Which is
exactly what the patch does.

Signed-off-by: Max Krasnyansky <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Acked-by: Peter Zijlstra <[email protected]>
Signed-off-by: Thomas Gleixner <[email protected]>

gets rid of the BUG! (Added people to Ccs.)

Might I instead suggest a revert of this? (Again, unless somebody else
can spot the real error and fix it before 2.6.26 is out :-))


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-07-11 20:07:18

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Vegard Nossum wrote:
> On Fri, Jul 11, 2008 at 9:36 PM, Paul Menage <[email protected]> wrote:
>> On Fri, Jul 11, 2008 at 12:07 PM, Vegard Nossum <[email protected]> wrote:
>>> The result of having CPUSETS enabled as above is a 100% reproducible
>>> BUG on the very first cpu hot-unplug:
>>>
>>> ------------[ cut here ]------------
>>> kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!
>> That doesn't quite match up with any BUG in 2.6.26-rc9 - what tree is
>> this last crash based on?
>
> latest mainline. Commit e5a5816f7875207cb0a0a7032e39a4686c5e10a4.
>
> Is this one:
>
> /* called under rq->lock with disabled interrupts */
> static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
> {
> struct rq *rq = cpu_rq(dead_cpu);
>
> /* Must be exiting, otherwise would be on tasklist. */
> BUG_ON(!p->exit_state);
>
>>> Also, this is on the latest linux-2.6.git! Since we're so close to
>>> release, maybe cpusets should simply be marked BROKEN for now? (Unless
>>> we can fix it, of course. The alternative is to apply Miao Xie's
>>> workaround patch temporarily.)
>> If we were going to mark anything as broken, wouldn't cpu-hotplug be
>> the more appropriate victim? I suspect that there are more systems
>> using cpusets in production environments than using cpu hotplug. But
>> as you say, fixing it sounds better.
>
> I'm sorry for the harsh characterization and suggestion; please accept
> my apology. It was purely a result of my excitement at having made
> some progress in this case.
>
> But I have more good news; reverting this:
>
> commit f18f982abf183e91f435990d337164c7a43d1e6d
> Author: Max Krasnyansky <[email protected]>
> Date: Thu May 29 11:17:01 2008 -0700
>
> sched: CPU hotplug events must not destroy scheduler domains created by the
> cpusets
>
> First issue is not related to the cpusets. We're simply leaking doms_cur.
> It's allocated in arch_init_sched_domains() which is called for every
> hotplug event. So we just keep reallocation doms_cur without freeing it.
> I introduced free_sched_domains() function that cleans things up.
>
> Second issue is that sched domains created by the cpusets are
> completely destroyed by the CPU hotplug events. For all CPU hotplug
> events scheduler attaches all CPUs to the NULL domain and then puts
> them all into the single domain thereby destroying domains created
> by the cpusets (partition_sched_domains).
> The solution is simple, when cpusets are enabled scheduler should not
> create default domain and instead let cpusets do that. Which is
> exactly what the patch does.
>
> Signed-off-by: Max Krasnyansky <[email protected]>
> Cc: [email protected]
> Cc: [email protected]
> Cc: [email protected]
> Acked-by: Peter Zijlstra <[email protected]>
> Signed-off-by: Thomas Gleixner <[email protected]>
>
> gets rid of the BUG! (Added people to Ccs.)
Really ? Just by looking at the backtraces in your first email it seems
unrelated.

> Might I instead suggest a revert of this? (Again, unless somebody else
> can spot the real error and fix it before 2.6.26 is out :-))
I'd actually be ok with reverting it. Paul and I were looking into some
circular locking issues triggered by the very same patch. Since we do
not have a solution yet we could revert it for now and work on a fix
during .27-rc series.

Max

2008-07-11 23:03:28

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/11 Vegard Nossum <[email protected]>:
> On Fri, Jul 11, 2008 at 9:36 PM, Paul Menage <[email protected]> wrote:
>> On Fri, Jul 11, 2008 at 12:07 PM, Vegard Nossum <[email protected]> wrote:
>>>
>>> The result of having CPUSETS enabled as above is a 100% reproducible
>>> BUG on the very first cpu hot-unplug:
>>>
>>> ------------[ cut here ]------------
>>> kernel BUG at xxx/linux-2.6/kernel/sched.c:5859!
>>
>> That doesn't quite match up with any BUG in 2.6.26-rc9 - what tree is
>> this last crash based on?
>
> latest mainline. Commit e5a5816f7875207cb0a0a7032e39a4686c5e10a4.
>
> Is this one:
>
> /* called under rq->lock with disabled interrupts */
> static void migrate_dead(unsigned int dead_cpu, struct task_struct *p)
> {
> struct rq *rq = cpu_rq(dead_cpu);
>
> /* Must be exiting, otherwise would be on tasklist. */
> BUG_ON(!p->exit_state);
>
>>> Also, this is on the latest linux-2.6.git! Since we're so close to
>>> release, maybe cpusets should simply be marked BROKEN for now? (Unless
>>> we can fix it, of course. The alternative is to apply Miao Xie's
>>> workaround patch temporarily.)
>>
>> If we were going to mark anything as broken, wouldn't cpu-hotplug be
>> the more appropriate victim? I suspect that there are more systems
>> using cpusets in production environments than using cpu hotplug. But
>> as you say, fixing it sounds better.
>
> I'm sorry for the harsh characterization and suggestion; please accept
> my apology. It was purely a result of my excitement at having made
> some progress in this case.
>
> But I have more good news; reverting this:
>
> commit f18f982abf183e91f435990d337164c7a43d1e6d
> Author: Max Krasnyansky <[email protected]>
> Date: Thu May 29 11:17:01 2008 -0700
>
> sched: CPU hotplug events must not destroy scheduler domains created by the
> cpusets

Does the patch below help?

(non-white-space-damaged version is attached)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..ae61dc9 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1912,11 +1912,21 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ swicth (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug();
+ break;
+ default:
return NOTIFY_DONE;

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLU



--
Best regards,
Dmitry Adamushko


Attachments:
(No filename) (2.81 kB)
003-fix-cpuset.patch (760.00 B)
Download all attachments

2008-07-11 23:20:11

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Dmitry Adamushko wrote:
> Does the patch below help?
>
> (non-white-space-damaged version is attached)
>
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 9fceb97..ae61dc9 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -1912,11 +1912,21 @@ static void common_cpu_mem_hotplug_unplug(void)
> static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
> unsigned long phase, void *unused_cpu)
> {
> - if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
> + swicth (phase) {
> + case CPU_UP_CANCELED:
> + case CPU_UP_CANCELED_FROZEN:
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> + case CPU_ONLINE:
> + case CPU_ONLINE_FROZEN:
> + case CPU_DEAD:
> + case CPU_DEAD_FROZEN:
> + common_cpu_mem_hotplug_unplug();
> + break;
> + default:
> return NOTIFY_DONE;

I was always wondering why we're running that logic for every hotplug
event.
For example, do we care for UP_CANCELED* and DOWN_FAILED* ? Presumably
nothing should've changed, no ?
btw I'm just going by the name of the event. So UP_CANCELED to me sounds
like CPU never got to the ONLINE stage and cpuset should not really care.

Max

2008-07-11 23:53:29

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/12 Max Krasnyansky <[email protected]>:
> Dmitry Adamushko wrote:
>>
>> Does the patch below help?
>>
>> (non-white-space-damaged version is attached)
>>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index 9fceb97..ae61dc9 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -1912,11 +1912,21 @@ static void common_cpu_mem_hotplug_unplug(void)
>> static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
>> unsigned long phase, void *unused_cpu)
>> {
>> - if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
>> + swicth (phase) {
>> + case CPU_UP_CANCELED:
>> + case CPU_UP_CANCELED_FROZEN:
>> + case CPU_DOWN_FAILED:
>> + case CPU_DOWN_FAILED_FROZEN:
>> + case CPU_ONLINE:
>> + case CPU_ONLINE_FROZEN:
>> + case CPU_DEAD:
>> + case CPU_DEAD_FROZEN:
>> + common_cpu_mem_hotplug_unplug();
>> + break;
>> + default:
>> return NOTIFY_DONE;
>
> I was always wondering why we're running that logic for every hotplug event.
> For example, do we care for UP_CANCELED* and DOWN_FAILED* ? Presumably
> nothing should've changed, no ?

I guess, it's to revert the effect of CPU_{DOWN | UP}_PREPARE.

e.g. CPU_DOWN_PREPARE detaches/destroys sched-domains... so if a
cpu-down op. fails, we restore them to their inital state.


> btw I'm just going by the name of the event. So UP_CANCELED to me sounds
> like CPU never got to the ONLINE stage and cpuset should not really care.

Yes, if it has not done anything for UP_PREPARE (and no one else on
behave of it).


>
> Max
>

--
Best regards,
Dmitry Adamushko

2008-07-12 03:17:54

by Vegard Nossum

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

> Does the patch below help?

Yes, it does! Well done!

(It needed a couple of minor syntax fixes, so I'm attaching the final
patch as well.)

Can somebody else please test/ack/review it too? This should eventually
go into 2.6.26 if it doesn't break anything else.


Vegard


>From da802021c32be4020ec810f04b56d3654a8d4766 Mon Sep 17 00:00:00 2001
From: Dmitry Adamushko <[email protected]>
Date: Sat, 12 Jul 2008 05:01:55 +0200
Subject: [PATCH] cpuset: fix cpu hotplug

This patch fixes a "kernel BUG at kernel/sched.c:5859!"
when cpu hotplug is used with CONFIG_CPUSETS=y.

[ Fixed invalid syntax ]
Signed-off-by: Vegard Nossum <[email protected]>
---
kernel/cpuset.c | 17 ++++++++++++++---
1 files changed, 14 insertions(+), 3 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..860a190 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1912,11 +1912,22 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ switch (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug();
+ break;
+ default:
return NOTIFY_DONE;
+ }

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
--
1.5.4.1

2008-07-12 03:29:20

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Vegard Nossum wrote:
>
> Can somebody else please test/ack/review it too? This should eventually
> go into 2.6.26 if it doesn't break anything else.

And Dmitry, _please_ also explain what was going on. Why did things break
from calling common_cpu_mem_hotplug_unplug() too much? That function is
called pretty randomly anyway (for just about any random CPU event), so
why did it fail in some circumstances?

Linus

2008-07-12 10:03:08

by Miao Xie

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

on 2008-7-12 11:28 Linus Torvalds wrote:
>
> On Sat, 12 Jul 2008, Vegard Nossum wrote:
>> Can somebody else please test/ack/review it too? This should eventually
>> go into 2.6.26 if it doesn't break anything else.
>
> And Dmitry, _please_ also explain what was going on. Why did things break
> from calling common_cpu_mem_hotplug_unplug() too much? That function is
> called pretty randomly anyway (for just about any random CPU event), so
> why did it fail in some circumstances?
>
> Linus
>

My explanation:
http://lkml.org/lkml/2008/7/7/75
this bug occurred on the kernel compiled with CONFIG_CPUSETS=y.

As Dmitry said in the following mail, modifying try_to_wake_up() to fix this bug
is not perfect. Maybe we need update the sched domain before migrating tasks.
http://lkml.org/lkml/2008/7/7/94

So I remake a patch to fix this bug by updating the sched domain when a cpu is in
CPU_DOWN_PREPARE state.

I think Vegard Nossum's patch is not so good because it is not necessary to detach
all the sched domains when making a cpu offline.

Signed-off-by: Miao Xie <[email protected]>

---
include/linux/sched.h | 1 +
kernel/cpuset.c | 30 +++++++++++++++++++++++++-----
kernel/sched.c | 28 +++++++++++++++++++++++++++-
3 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c5d3f84..cf40eae 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -817,6 +817,7 @@ struct sched_domain {
#endif
};

+extern void detach_sched_domains(int cpu);
extern void partition_sched_domains(int ndoms_new, cpumask_t *doms_new,
struct sched_domain_attr *dattr_new);
extern int arch_reinit_sched_domains(void);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..64fa742 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1910,13 +1910,33 @@ static void common_cpu_mem_hotplug_unplug(void)
*/

static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
- unsigned long phase, void *unused_cpu)
+ unsigned long phase, void *hcpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
- return NOTIFY_DONE;
+ int cpu = (long)hcpu;

- common_cpu_mem_hotplug_unplug();
- return 0;
+ switch (phase) {
+ case CPU_DOWN_PREPARE:
+ case CPU_DOWN_PREPARE_FROZEN:
+ cgroup_lock();
+ get_online_cpus();
+ detach_sched_domains(cpu);
+ put_online_cpus();
+ cgroup_unlock();
+ break;
+
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ common_cpu_mem_hotplug_unplug();
+ break;
+
+ default:
+ return NOTIFY_DONE;
+ }
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
diff --git a/kernel/sched.c b/kernel/sched.c
index 4e2f603..73e0026 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -7315,6 +7315,32 @@ static int dattrs_equal(struct sched_domain_attr *cur, int idx_cur,
sizeof(struct sched_domain_attr));
}

+
+/*
+ * Detach sched domains from a group of cpus which are in the same domain with
+ * the specified cpu. These cpus will now be attach to the NULL domain.
+ *
+ * Call with hotplug lock and cgroup lock held
+ */
+void detach_sched_domains(int cpu)
+{
+ int i;
+
+ unregister_sched_domain_sysctl();
+
+ mutex_lock(&sched_domains_mutex);
+
+ for (i = 0; i < ndoms_cur; i++) {
+ if (cpu_isset(cpu, doms_cur[i])) {
+ detach_destroy_domains(doms_cur + i);
+ cpus_clear(doms_cur[i]);
+ break;
+ }
+ }
+
+ mutex_unlock(&sched_domains_mutex);
+}
+
/*
* Partition sched domains as specified by the 'ndoms_new'
* cpumasks in the array doms_new[] of cpumasks. This compares
@@ -7481,6 +7507,7 @@ int sched_create_sysfs_power_savings_entries(struct sysdev_class *cls)
static int update_sched_domains(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
+#ifndef CONFIG_CPUSETS
switch (action) {
case CPU_UP_PREPARE:
case CPU_UP_PREPARE_FROZEN:
@@ -7506,7 +7533,6 @@ static int update_sched_domains(struct notifier_block *nfb,
return NOTIFY_DONE;
}

-#ifndef CONFIG_CPUSETS
/*
* Create default domain partitioning if cpusets are disabled.
* Otherwise we let cpusets rebuild the domains based on the
--
1.5.4.rc3

2008-07-12 10:04:47

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/12 Linus Torvalds <[email protected]>:
>
>
> On Sat, 12 Jul 2008, Vegard Nossum wrote:
>>
>> Can somebody else please test/ack/review it too? This should eventually
>> go into 2.6.26 if it doesn't break anything else.
>
> And Dmitry, _please_ also explain what was going on. Why did things break
> from calling common_cpu_mem_hotplug_unplug() too much? That function is
> called pretty randomly anyway (for just about any random CPU event), so
> why did it fail in some circumstances?

Upon CPU_DOWN_PREPARE, update_sched_domains() ->
detach_destroy_domains(&cpu_online_map) ;
does the following:

/*
* Force a reinitialization of the sched domains hierarchy. The domains
* and groups cannot be updated in place without racing with the balancing
* code, so we temporarily attach all running cpus to the NULL domain
* which will prevent rebalancing while the sched domains are recalculated.
*/

The sched-domains should be rebuilt when a CPU_DOWN ops. is completed,
effectivelly either upon CPU_DEAD{_FROZEN} (upon success) or
CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
initial state). That's what update_sched_domains() also does but only
for !CPUSETS case.

With Max's patch, sched-domains' reinitialization is delegated to CPUSETS code:

cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
rebuild_sched_domains()

which as you've said "called pretty randomly anyway", e.g. for CPU_UP_PREPARE.

[ ah, then rebuild_sched_domains() should not be there. It should be
nop for MEMPLUG events I presume - should make another patch. ]

Being called for CPU_UP_PREPARE (and if its callback is called after
update_sched_domains()), it just negates all the work done by
update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
the sched-domains and that makes it visible for the load-balancer
while the CPU_DOWN ops. is in progress.

__migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
"offline" when this function is called).

try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops.

Now another funny thing is that we probably have a memory leak with
common_cpu_mem_hotplug_unplug() "randomly" calling
rebuild_sched_domains() and sometimes re-allocating domains when they
already exist.


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-07-12 10:45:54

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


2008/7/12 Dmitry Adamushko <[email protected]>:
> 2008/7/12 Linus Torvalds <[email protected]>:
>>
>>
>> On Sat, 12 Jul 2008, Vegard Nossum wrote:
>>>
>>> Can somebody else please test/ack/review it too? This should eventually
>>> go into 2.6.26 if it doesn't break anything else.
>>
>> And Dmitry, _please_ also explain what was going on. Why did things break
>> from calling common_cpu_mem_hotplug_unplug() too much? That function is
>> called pretty randomly anyway (for just about any random CPU event), so
>> why did it fail in some circumstances?
>
> Upon CPU_DOWN_PREPARE, update_sched_domains() ->
> detach_destroy_domains(&cpu_online_map) ;
> does the following:
>
> /*
> * Force a reinitialization of the sched domains hierarchy. The domains
> * and groups cannot be updated in place without racing with the balancing
> * code, so we temporarily attach all running cpus to the NULL domain
> * which will prevent rebalancing while the sched domains are recalculated.
> */
>
> The sched-domains should be rebuilt when a CPU_DOWN ops. is completed,
> effectivelly either upon CPU_DEAD{_FROZEN} (upon success) or
> CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
> initial state). That's what update_sched_domains() also does but only
> for !CPUSETS case.
>
> With Max's patch, sched-domains' reinitialization is delegated to CPUSETS code:
>
> cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> rebuild_sched_domains()
>
> which as you've said "called pretty randomly anyway", e.g. for CPU_UP_PREPARE.
>
> [ ah, then rebuild_sched_domains() should not be there. It should be
> nop for MEMPLUG events I presume - should make another patch. ]

I had in mind something like this:

[ yes, probably the patch makes things somewhat uglier. I tried to bring a minimal amount of changes so far, just to emulate the 'old' behavior of update_sched_domains().
I guess, common_cpu_mem_hotplug_unplug() needs to be split up into cpu- and mem-hotplug parts to make it cleaner ]

(not tested yet)

---

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..965d9eb 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1882,7 +1882,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
* in order to minimize text size.
*/

-static void common_cpu_mem_hotplug_unplug(void)
+static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
{
cgroup_lock();

@@ -1894,7 +1894,8 @@ static void common_cpu_mem_hotplug_unplug(void)
* Scheduler destroys domains on hotplug events.
* Rebuild them based on the current settings.
*/
- rebuild_sched_domains();
+ if (rebuild_sd)
+ rebuild_sched_domains();

cgroup_unlock();
}
@@ -1912,11 +1913,22 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ swicth (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);
+ break;
+ default:
return NOTIFY_DONE;
+ }

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -1929,7 +1941,7 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,

void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug();
+ common_cpu_mem_hotplug_unplug(0);
}
#endif


2008-07-12 11:05:58

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/12 Miao Xie <[email protected]>:
> [ ... ]
>
> My explanation:
> http://lkml.org/lkml/2008/7/7/75
> this bug occurred on the kernel compiled with CONFIG_CPUSETS=y.
>
> As Dmitry said in the following mail, modifying try_to_wake_up() to fix this bug
> is not perfect. Maybe we need update the sched domain before migrating tasks.
> http://lkml.org/lkml/2008/7/7/94
>
> So I remake a patch to fix this bug by updating the sched domain when a cpu is in
> CPU_DOWN_PREPARE state.
>
> I think Vegard Nossum's patch is not so good because it is not necessary to detach
> all the sched domains when making a cpu offline.

(that was my "not so good" patch :-)

Yes, maybe. OTOH, your patch does it in a more drastic way which
should be analyzed more carefully.
Perhaps, for .27 but that's just my 2 cents.

Plus, rebuild_sched_domains() has to be called only for cpu-hotplug events.

(I guess it just shows once more that common_cpu_mem_hotplug_unplug()
should be re-designed a bit)


--
Best regards,
Dmitry Adamushko

2008-07-12 11:14:59

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Sat, 2008-07-12 at 12:45 +0200, Dmitry Adamushko wrote:
> 2008/7/12 Dmitry Adamushko <[email protected]>:
> > 2008/7/12 Linus Torvalds <[email protected]>:
> >>
> >>
> >> On Sat, 12 Jul 2008, Vegard Nossum wrote:
> >>>
> >>> Can somebody else please test/ack/review it too? This should eventually
> >>> go into 2.6.26 if it doesn't break anything else.
> >>
> >> And Dmitry, _please_ also explain what was going on. Why did things break
> >> from calling common_cpu_mem_hotplug_unplug() too much? That function is
> >> called pretty randomly anyway (for just about any random CPU event), so
> >> why did it fail in some circumstances?
> >
> > Upon CPU_DOWN_PREPARE, update_sched_domains() ->
> > detach_destroy_domains(&cpu_online_map) ;
> > does the following:
> >
> > /*
> > * Force a reinitialization of the sched domains hierarchy. The domains
> > * and groups cannot be updated in place without racing with the balancing
> > * code, so we temporarily attach all running cpus to the NULL domain
> > * which will prevent rebalancing while the sched domains are recalculated.
> > */
> >
> > The sched-domains should be rebuilt when a CPU_DOWN ops. is completed,
> > effectivelly either upon CPU_DEAD{_FROZEN} (upon success) or
> > CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
> > initial state). That's what update_sched_domains() also does but only
> > for !CPUSETS case.
> >
> > With Max's patch, sched-domains' reinitialization is delegated to CPUSETS code:
> >
> > cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> > rebuild_sched_domains()
> >
> > which as you've said "called pretty randomly anyway", e.g. for CPU_UP_PREPARE.
> >
> > [ ah, then rebuild_sched_domains() should not be there. It should be
> > nop for MEMPLUG events I presume - should make another patch. ]
>
> I had in mind something like this:
>
> [ yes, probably the patch makes things somewhat uglier. I tried to bring a minimal amount of changes so far, just to emulate the 'old' behavior of update_sched_domains().
> I guess, common_cpu_mem_hotplug_unplug() needs to be split up into cpu- and mem-hotplug parts to make it cleaner ]
>
> (not tested yet)
>
> ---

argh, this one compiles (will test shortly).


Signed-off-by: Dmitry Adamushko <[email protected]>


diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..798b3ab 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1882,7 +1882,7 @@ static void scan_for_empty_cpusets(const struct
cpuset *root)
* in order to minimize text size.
*/

-static void common_cpu_mem_hotplug_unplug(void)
+static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
{
cgroup_lock();

@@ -1894,7 +1894,8 @@ static void common_cpu_mem_hotplug_unplug(void)
* Scheduler destroys domains on hotplug events.
* Rebuild them based on the current settings.
*/
- rebuild_sched_domains();
+ if (rebuild_sd)
+ rebuild_sched_domains();

cgroup_unlock();
}
@@ -1912,11 +1913,22 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ switch (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);
+ break;
+ default:
return NOTIFY_DONE;
+ }

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -1929,7 +1941,7 @@ static int cpuset_handle_cpuhp(struct
notifier_block *unused_nb,

void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug();
+ common_cpu_mem_hotplug_unplug(0);
}
#endif



2008-07-12 19:15:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Miao Xie wrote:
>
> I think Vegard Nossum's patch is not so good because it is not necessary to detach
> all the sched domains when making a cpu offline.

Well, short-term, I'd like to have a minimal fix.

Long-term, I really think that whole CPU notifier thing needs to be
*fixed*.

It's totally broken to have single callback functions with "flags"
parameters telling people what to do. I don't understand why people keep
doing it. It's *always* broken, and the fix is *always* to a structure
with multiple function pointers - separate functions for separate events.

In the case of those CPU hotplug notifiers, the "freeze" thing could
probably have been a flag, but CPU_DOWN_PREPARE/DEAD/DYING/UP/xyz should
likely just be different function callbacks. That way, when we add a
callback type, it doesn't require changing all existing callbacks that
don't want to care about the new case. And we wouldn't have these stupid
and fragile and ugly "switch()" statements with tons of cases all over.

Sadly, people have latched onto a model (that piece-of-sh*t "notifier"
infrastructure) that encourages - and almost requires - this kind of pure
crap.

But the CPU hotplug stuff _could_ just use separate notifier chains for
the different kinds of events. It wouldn't be perfect, but it would be
better than the mess we have now. Instead of doing

register_cpu_notifier(..);

and having a single thing that has to handle all cases, we could have

/* current "ONLINE/ONLINE_FROZEN" cases */
online = register_cpu_notifier(CPU_ONLINE, online_fn);
dead = register_cpu_nofitier(CPU_DEAD, dead_fn);

which would allocate the "struct notifier_block" an fill it in, and have
_separate_ queues for all those cases. That way, *if* you want to share
the code for multiple cases, you just register the same function. And if
you only cared about one case, you'd only be called for that one case!

I dunno. Maybe the conversion would be painful. And maybe the end result
isn't wonderful either. But the current setup for CPU notifiers is just a
damn disgrace.

Linus

2008-07-12 19:19:30

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Dmitry Adamushko wrote:
> 2008/7/12 Linus Torvalds <[email protected]>:
>>
>> On Sat, 12 Jul 2008, Vegard Nossum wrote:
>>> Can somebody else please test/ack/review it too? This should eventually
>>> go into 2.6.26 if it doesn't break anything else.
>> And Dmitry, _please_ also explain what was going on. Why did things break
>> from calling common_cpu_mem_hotplug_unplug() too much? That function is
>> called pretty randomly anyway (for just about any random CPU event), so
>> why did it fail in some circumstances?
>
> Upon CPU_DOWN_PREPARE, update_sched_domains() ->
> detach_destroy_domains(&cpu_online_map) ;
> does the following:
>
> /*
> * Force a reinitialization of the sched domains hierarchy. The domains
> * and groups cannot be updated in place without racing with the balancing
> * code, so we temporarily attach all running cpus to the NULL domain
> * which will prevent rebalancing while the sched domains are recalculated.
> */
>
> The sched-domains should be rebuilt when a CPU_DOWN ops. is completed,
> effectivelly either upon CPU_DEAD{_FROZEN} (upon success) or
> CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
> initial state). That's what update_sched_domains() also does but only
> for !CPUSETS case.
>
> With Max's patch, sched-domains' reinitialization is delegated to CPUSETS code:
>
> cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> rebuild_sched_domains()
>
> which as you've said "called pretty randomly anyway", e.g. for CPU_UP_PREPARE.
>
> [ ah, then rebuild_sched_domains() should not be there. It should be
> nop for MEMPLUG events I presume - should make another patch. ]
>
> Being called for CPU_UP_PREPARE (and if its callback is called after
> update_sched_domains()), it just negates all the work done by
> update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
> the sched-domains and that makes it visible for the load-balancer
> while the CPU_DOWN ops. is in progress.
>
> __migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
> "offline" when this function is called).
>
> try_to_wake_up() is called for one of these tasks from another CPU ->
> the load-balancer (wake_idle()) picks up a "dead" CPU and places the
> task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
> -> oops.
Ah, makes sense. Thanx for the explanation.

> Now another funny thing is that we probably have a memory leak with
> common_cpu_mem_hotplug_unplug() "randomly" calling
> rebuild_sched_domains() and sometimes re-allocating domains when they
> already exist.
I beleive that part is ok. We used to have a leak in the scheduler code where
arch_init_sched_domains() just allocated new masks without freeing the old
ones. I fixed that. rebuild_sched_domains() -> partition_sched_domains() is
clean (I think). partition_sched_domains() first does the cleanup and then
takes ownership of the domain masks.

btw It's perfectly ok (or at least it has be ok) to call
rebuild_sched_domains() randomly because it's need to run every time
sched_load_balance flags in the cpuset change, and on any other even that
affects domains. As Paul J explained currently that's the only sane way to
reconstruct the domains based on the cpuset settings.

Max

2008-07-12 20:12:13

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Dmitry Adamushko wrote:
>
> With Max's patch, sched-domains' reinitialization is delegated to CPUSETS code:
>
> cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> rebuild_sched_domains()
>
> which as you've said "called pretty randomly anyway", e.g. for CPU_UP_PREPARE.
>
> [ ah, then rebuild_sched_domains() should not be there. It should be
> nop for MEMPLUG events I presume - should make another patch. ]

That whole notion of doing the same thing for UP/DOWN/SIDEWAYS looks
pretty damn odd to me, but whatever.

> Being called for CPU_UP_PREPARE (and if its callback is called after
> update_sched_domains()), it just negates all the work done by
> update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
> the sched-domains and that makes it visible for the load-balancer
> while the CPU_DOWN ops. is in progress.

This sounds like it could trigger various other problems too, but happily
hit the BUG_ON() first.

> __migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
> "offline" when this function is called).
>
> try_to_wake_up() is called for one of these tasks from another CPU ->
> the load-balancer (wake_idle()) picks up a "dead" CPU and places the
> task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
> -> oops.

Grr.

Ok, can you re-send the (fixed-up) patch with the explanation and the
tested-by: from Vegard. It seems that not calling rebuild_sched_domains()
(by not calling common_cpu_mem_hotplug_unplug()) for CPU_UP_PREPARE was
the minimal fix.

IOW, the "switch()" statement was just another way of adding
CPU_UP_PREPARE to the list of things that we don't do anything for, and
your patch was really just equivalent to the following patch?

Anyway, I'd just like a patch that (a) has been tested and (b) comes with
a nice subject line and targeted explanation of this issue, so I can
commit it for 2.6.26.

The patch below is not meant for beign used, it's just to verify that I am
on the same page with you explanation.

Linus

---
kernel/cpuset.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..24f34ce 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1912,7 +1912,8 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ if (phase == CPU_DYING || phase == CPU_DYING_FROZEN ||
+ phase == CPU_UP_PREPARE || phase == CPU_UP_PREPARE_FROZEN)
return NOTIFY_DONE;

common_cpu_mem_hotplug_unplug();

2008-07-12 21:32:02

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Linus Torvalds wrote:
>
> > Being called for CPU_UP_PREPARE (and if its callback is called after
> > update_sched_domains()), it just negates all the work done by
> > update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
> > the sched-domains and that makes it visible for the load-balancer
> > while the CPU_DOWN ops. is in progress.
>
> This sounds like it could trigger various other problems too, but happily
> hit the BUG_ON() first.

Btw - the way to avoid this whole problem might be to make CPU migration
use a *different* CPU map than "online".

This patch almost certainly doesn't work, but let me explain:

- "cpu_online_map" is the CPU's that can be currently be running

It is enabled/disabled by low-level architecture code when the CPU
actually gets disabled.

- Add a new "cpu_active_map", which is the CPU's that are currently fully
set up, and can not just be running tasks, but can be _migrated_ to!

- We can always just clear the "cpu_active_map" entry when we start a CPU
down event - that guarantees that while tasks may be running on it,
there won't be any _new_ tasks migrated to it.

- both cpu_down() and cpu_up() can just end with a simple

if (cpu_online(cpu))
cpu_set(cpu, cpu_active_map);

before they release the hotplug lock, and it will always do the right
thing regardless of whether the up/down succeeded or not.

The _only_ thing that "active" is used for is literally to verify that a
migration is valid before doing it.

Now, a few points:
(a) it's *TOTALLY* untested. It may or may not compile.
(b) I think many architectures do magic things at the initial boot, and
change the online maps in odd ways. I tried to avoid this by simply
doing the initial "cpu_set()" of cpu_active_map pretty late, just
before we bring up other CPUs
(c) I think this is a pretty simple approach - and like how all the code
is architecture-neutral. The "active" map may not be used a lot, but
doesn't this simplify the whole problem a lot? It just makes the
whole scheduling issue go away for CPU's that are going down.

What do you guys think? Ingo?

Vegard, and just out of interest, in case this would happen to work, does
this actually end up also fixing the bug (with the other fixes unapplied?)

Linus

---
include/linux/cpumask.h | 15 ++++++++++++++-
init/main.c | 7 +++++++
kernel/cpu.c | 8 +++++++-
kernel/sched.c | 2 +-
4 files changed, 29 insertions(+), 3 deletions(-)

diff --git a/include/linux/cpumask.h b/include/linux/cpumask.h
index c24875b..88f2dd2 100644
--- a/include/linux/cpumask.h
+++ b/include/linux/cpumask.h
@@ -359,13 +359,14 @@ static inline void __cpus_fold(cpumask_t *dstp, const cpumask_t *origp,

/*
* The following particular system cpumasks and operations manage
- * possible, present and online cpus. Each of them is a fixed size
+ * possible, present, active and online cpus. Each of them is a fixed size
* bitmap of size NR_CPUS.
*
* #ifdef CONFIG_HOTPLUG_CPU
* cpu_possible_map - has bit 'cpu' set iff cpu is populatable
* cpu_present_map - has bit 'cpu' set iff cpu is populated
* cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
+ * cpu_active_map - has bit 'cpu' set iff cpu available to migration
* #else
* cpu_possible_map - has bit 'cpu' set iff cpu is populated
* cpu_present_map - copy of cpu_possible_map
@@ -417,6 +418,16 @@ extern cpumask_t cpu_possible_map;
extern cpumask_t cpu_online_map;
extern cpumask_t cpu_present_map;

+/*
+ * With CONFIG_HOTPLUG_CPU, cpu_active_map is a real instance.
+ * Without hotplugging, "online" and "active" are the same.
+ */
+#ifdef CONFIG_HOTPLUG_CPU
+extern cpumask_t cpu_active_map;
+#else
+#define cpu_active_map cpu_online_map
+#endif
+
#if NR_CPUS > 1
#define num_online_cpus() cpus_weight(cpu_online_map)
#define num_possible_cpus() cpus_weight(cpu_possible_map)
@@ -424,6 +435,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) cpu_isset((cpu), cpu_online_map)
#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map)
#define cpu_present(cpu) cpu_isset((cpu), cpu_present_map)
+#define cpu_active(cpu) cpu_isset((cpu), cpu_active_map)
#else
#define num_online_cpus() 1
#define num_possible_cpus() 1
@@ -431,6 +443,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) ((cpu) == 0)
#define cpu_possible(cpu) ((cpu) == 0)
#define cpu_present(cpu) ((cpu) == 0)
+#define cpu_active(cpu) ((cpu) == 0)
#endif

#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
diff --git a/init/main.c b/init/main.c
index f7fb200..bfccff6 100644
--- a/init/main.c
+++ b/init/main.c
@@ -414,6 +414,13 @@ static void __init smp_init(void)
{
unsigned int cpu;

+ /*
+ * Set up the current CPU as possible to migrate to.
+ * The other ones will be done by cpu_up/cpu_down()
+ */
+ cpu = smp_processor_id();
+ cpu_set(cpu, cpu_active_map);
+
/* FIXME: This should be done in userspace --RR */
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
diff --git a/kernel/cpu.c b/kernel/cpu.c
index c77bc3a..2a30026 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -44,6 +44,8 @@ void __init cpu_hotplug_init(void)

#ifdef CONFIG_HOTPLUG_CPU

+cpumask_t cpu_active_map;
+
void get_online_cpus(void)
{
might_sleep();
@@ -269,11 +271,13 @@ int __ref cpu_down(unsigned int cpu)
int err = 0;

cpu_maps_update_begin();
+ cpu_clear(cpu, cpu_active_map);
if (cpu_hotplug_disabled)
err = -EBUSY;
else
err = _cpu_down(cpu, 0);
-
+ if (cpu_online(cpu))
+ cpu_set(cpu, cpu_active_map);
cpu_maps_update_done();
return err;
}
@@ -337,6 +341,8 @@ int __cpuinit cpu_up(unsigned int cpu)
else
err = _cpu_up(cpu, 0);

+ if (cpu_online(cpu))
+ cpu_set(cpu, cpu_active_map);
cpu_maps_update_done();
return err;
}
diff --git a/kernel/sched.c b/kernel/sched.c
index 4e2f603..21ee025 100644
--- a/kernel/sched.c
+++ b/kernel/sched.c
@@ -2680,7 +2680,7 @@ static void sched_migrate_task(struct task_struct *p, int dest_cpu)

rq = task_rq_lock(p, &flags);
if (!cpu_isset(dest_cpu, p->cpus_allowed)
- || unlikely(cpu_is_offline(dest_cpu)))
+ || unlikely(!cpu_active(dest_cpu)))
goto out;

/* force the process onto the specified CPU */

2008-07-12 22:08:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Linus Torvalds wrote:
>
> This patch almost certainly doesn't work, but let me explain:

Well, I decided to just take the plunge and test it. It WorksForMe(tm), so
it's no _totally_ broken. But I really didn't test it except to see that
it still booted, and I don't use cpusets and never saw the original bug,
of course.

But considering how simple it is, if it works for people as a way to work
around stupid CPU migration issues due to subtle wakeup calls, I'd almost
prefer to really solve this whole issue with that cpu_active_map thing.

It's so simple it should be really _robust_ in the presense of problems
elsewhere (like the whole cpusets scheduling domain mess).

Assuming I didn't do anythign stupid, of course, which is why it would
definitely need much more testing. And especially if Vegard can test it
with the case that oopsed for him due to the bad migration..

Linus

2008-07-12 22:43:17

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



Linus Torvalds wrote:
>
> On Sat, 12 Jul 2008, Linus Torvalds wrote:
>> This patch almost certainly doesn't work, but let me explain:
>
> Well, I decided to just take the plunge and test it. It WorksForMe(tm), so
> it's no _totally_ broken. But I really didn't test it except to see that
> it still booted, and I don't use cpusets and never saw the original bug,
> of course.
>
> But considering how simple it is, if it works for people as a way to work
> around stupid CPU migration issues due to subtle wakeup calls, I'd almost
> prefer to really solve this whole issue with that cpu_active_map thing.
>
> It's so simple it should be really _robust_ in the presense of problems
> elsewhere (like the whole cpusets scheduling domain mess).
>
> Assuming I didn't do anythign stupid, of course, which is why it would
> definitely need much more testing. And especially if Vegard can test it
> with the case that oopsed for him due to the bad migration..

My vote goes for Dmitry's patch. The one with the full switch() statement.
Your simplified version with if() is correct (I think) but the switch() is
more explicit about what events are being processed.

The cpu_active_map thing seems like an overkill. In a sense that we should not
try to add a new map for every such case. Granter this migration case may be
special enough to warrant the new map but in general I think it's not the
right way to go.

btw Dmitry's patch should go in anyway. It makes no sense to kill and then
immediately rebuild domains during hotplug sequence. Actually I might have a
bit better patch. I'm thinking we should just call rebuild_sched_domains() (or
some wrapper) from the sched hotplug handler. That way it'll be clear when
domains are supposed to be created/destroyed. ie We won't have to coordinate
cpu hotplug handling events between scheduler and cpusets. I'll send a patch a
bit later.

Max

2008-07-12 23:00:39

by Vegard Nossum

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

O Master Penguin,

On Sun, Jul 13, 2008 at 12:07 AM, Linus Torvalds
<[email protected]> wrote:
> Assuming I didn't do anythign stupid, of course, which is why it would
> definitely need much more testing. And especially if Vegard can test it
> with the case that oopsed for him due to the bad migration..

Your patch has been tested and found to be working!

It was applied on top of latest linux-2.6.git with the right config
and passed all my tests, including the bad test-case.

(A little pet horse for the occasion: Testing can show the presence of
errors, but not their absence. But that's a different story.)


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-07-12 23:02:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sat, 12 Jul 2008, Max Krasnyansky wrote:
>
> My vote goes for Dmitry's patch. The one with the full switch() statement.
> Your simplified version with if() is correct (I think) but the switch() is
> more explicit about what events are being processed.

Well, I still haven't seen a combined patch+signoff+good explanation, so I
can't really commit it.

> The cpu_active_map thing seems like an overkill. In a sense that we should not
> try to add a new map for every such case. Granter this migration case may be
> special enough to warrant the new map but in general I think it's not the
> right way to go.

Note how cpu_active_map has nothing to do with cpusets per se, and
everything to do with the fact that CPU migration currently seems to be
fundamentally flawed in the presense of a CPU hotunplug.

Can somebody tell me why some _other_ random wakeup cannot cause the same
kind of migration at an inopportune time?

The fact is, Dmitry's patch fixed _one_ particular wakeup from happening
(that just happened to be *guaranteed* to happen when it shouldn't!), but
as far as I can tell, it's a totally generic problem, with any

try_to_wake_up() -> load-balancer

chain being able to trigger it by causing a migration to a CPU that we
are in the process of turning off.

IOW, I don't think that my patch is overkill at all. I think it fixes the
real bug there.

(It's also true that the cpusets code calls rebuild_sched_domains() way
too much, but that's a _stupidity_ issue, not the cause of the bug per se,
if I follow the code!)

Linus

2008-07-12 23:05:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Vegard Nossum wrote:
>
> (A little pet horse for the occasion: Testing can show the presence of
> errors, but not their absence. But that's a different story.)

Absolutely. Which is actually why I prefer my patch. I think it fixes - in
general - the issue of CPU migration migrating tasks back to the CPU that
we're taking down.

The other patches seem to work around just the problem that _triggers_ the
bug. They don't actually make it impossible to migrate to a CPU that is
getting shut down - they just try to avoid the particular sequence that
made it happen for you.

Linus

2008-07-12 23:05:40

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/13 Vegard Nossum <[email protected]>:
> O Master Penguin,
>
> On Sun, Jul 13, 2008 at 12:07 AM, Linus Torvalds
> <[email protected]> wrote:
>> Assuming I didn't do anythign stupid, of course, which is why it would
>> definitely need much more testing. And especially if Vegard can test it
>> with the case that oopsed for him due to the bad migration..
>
> Your patch has been tested and found to be working!

Are you sure that you have removed my patch or the one from Miao? I
would be really surprised if this patch works without them. Linus has
updated sched_migrate_task() which is used _only_ by sched_exec() ---
this is not the path that leads to this bug. try_to_wake_up() -> ...
-> wake_idle() does not see "cpu_active_map".

>
> Vegard
>

--
Best regards,
Dmitry Adamushko

2008-07-12 23:18:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Dmitry Adamushko wrote:
>
> try_to_wake_up() -> ... -> wake_idle() does not see "cpu_active_map".

You're right. I missed a couple places, because that migrate code not only
ends up using "cpu_is_offline()" instead of "!cpu_online()" (so my greps
all failed), and because it has those online checks in multiple places.
Grr.

So it would need to change a few other "cpu_is_offline()" calls to
"!cpu_active()" instead (in __migrate_task at a minimum).

Linus

2008-07-12 23:19:19

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/13 Linus Torvalds <[email protected]>:
>
>> (A little pet horse for the occasion: Testing can show the presence of
>> errors, but not their absence. But that's a different story.)
>
> Absolutely. Which is actually why I prefer my patch. I think it fixes - in
> general - the issue of CPU migration migrating tasks back to the CPU that
> we're taking down.
>
> The other patches seem to work around just the problem that _triggers_ the
> bug. They don't actually make it impossible to migrate to a CPU that is
> getting shut down - they just try to avoid the particular sequence that
> made it happen for you.

Well, they try to make sched-domains consistent for all possible
cases, not just any particular case. So no, they don't allow a
possibility to leave tasks on a dead CPU (unless there is another
bug).

With your patch (and a cpusets :: hotplug handler from the current
-git) sched-domains are still broken and they are used in a number of
places. So why keep them at all?

I'm really surprised that Vegard says this "cpu_active_map" patch
alone fixes the problem.

With your modifications of common_cpu_mem_hotplug_unplug() - yes but
then it will work even without "cpu_active_map".

(ok, unless I'm really blind at this late hour so please direct me to
the right way :-)


>
> Linus
>


--
Best regards,
Dmitry Adamushko

2008-07-12 23:25:16

by Vegard Nossum

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Sun, Jul 13, 2008 at 1:05 AM, Dmitry Adamushko
<[email protected]> wrote:
> Are you sure that you have removed my patch or the one from Miao? I
> would be really surprised if this patch works without them. Linus has
> updated sched_migrate_task() which is used _only_ by sched_exec() ---
> this is not the path that leads to this bug. try_to_wake_up() -> ...
> -> wake_idle() does not see "cpu_active_map".

Absolutely. I could have made an error somewhere along the way, of
course (yes, it does happen), but I am quite sure I did everything
correctly.

If you doubt it, you could try it too. All that is needed to reproduce
the original BUG here is CPUSETS=y and bring cpu1 down.


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-07-12 23:25:48

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/13 Dmitry Adamushko <[email protected]>:
> 2008/7/13 Linus Torvalds <[email protected]>:
>>
>>> (A little pet horse for the occasion: Testing can show the presence of
>>> errors, but not their absence. But that's a different story.)
>>
>> Absolutely. Which is actually why I prefer my patch. I think it fixes - in
>> general - the issue of CPU migration migrating tasks back to the CPU that
>> we're taking down.
>>
>> The other patches seem to work around just the problem that _triggers_ the
>> bug. They don't actually make it impossible to migrate to a CPU that is
>> getting shut down - they just try to avoid the particular sequence that
>> made it happen for you.
>
> Well, they try to make sched-domains consistent for all possible
> cases, not just any particular case. So no, they don't allow a
> possibility to leave tasks on a dead CPU (unless there is another
> bug).
>
> With your patch (and a cpusets :: hotplug handler from the current
> -git) sched-domains are still broken and they are used in a number of
> places. So why keep them at all?
>
> I'm really surprised that Vegard says this "cpu_active_map" patch
> alone fixes the problem.
>
> With your modifications of common_cpu_mem_hotplug_unplug() - yes

No, not even with your modifications of common_mem_hotplug_unplug().

- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ if (phase == CPU_DYING || phase == CPU_DYING_FROZEN ||
+ phase == CPU_UP_PREPARE || phase == CPU_UP_PREPARE_FROZEN)
return NOTIFY_DONE;

You should have added "phase == CPU_DOWN_PREPARE || phase ==
CPU_DOWN_PREPARE_FROZEN" additionally. So I'm really surprised by
Vegard's assertion :-)


but
> then it will work even without "cpu_active_map".
>
> (ok, unless I'm really blind at this late hour so please direct me to
> the right way :-)
>
>
>>
>> Linus
>>
>
>
> --
> Best regards,
> Dmitry Adamushko
>



--
Best regards,
Dmitry Adamushko

2008-07-13 00:10:45

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


Linus,


(just that we have it all together in one place, ready for testing and
further consideration).

below is the patch and explanation.

Basically the fix below just emulates the 'old' behavior
of update_sched_domains(). We call rebuild_sched_domains() for the same hotplug-events
as it was called (and is still called for !CPUSETS case) in update_sched_domains().
The aim is to keep sched-domain consistent wrt cpu-down/up.

This should be a minimal change. Effectively, the change is against
f18f982abf183e91f435990d337164c7a43d1e6d. So the logic of this patch should be easily visible comparing it to
what the aforementioned commit does.

Ingo, could also please comment on this issue? TIA.


Subject: fix cpuset_handle_cpuhp()

The following commit

---
commit f18f982abf183e91f435990d337164c7a43d1e6d
Author: Max Krasnyansky <[email protected]>
Date: Thu May 29 11:17:01 2008 -0700

sched: CPU hotplug events must not destroy scheduler domains created by
the cpusets
---

[ Note, with this commit arch_update_cpu_topology is not called any more for CPUSETS. But it's just a nop.
The whole scheme should be probably reworked later. ]


introduced a hotplug-related problem as described below:

[ Basically the fix below just emulates the 'old' behavior of update_sched_domains().
We call rebuild_sched_domains() for the same hotplug-events as it was called (and is still called
for !CPUSETS case) in update_sched_domains(). ]


Upon CPU_DOWN_PREPARE, update_sched_domains() -> detach_destroy_domains(&cpu_online_map)
does the following:

/*
* Force a reinitialization of the sched domains hierarchy. The domains
* and groups cannot be updated in place without racing with the
balancing
* code, so we temporarily attach all running cpus to the NULL domain
* which will prevent rebalancing while the sched domains are
recalculated.
*/

The sched-domains should be rebuilt when a CPU_DOWN ops. has been
completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
initial state). That's what update_sched_domains() also does but only
for !CPUSETS case.

With Max's patch, sched-domains' reinitialization is delegated to
CPUSETS code:

cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
rebuild_sched_domains()

Being called for CPU_UP_PREPARE and if its callback is called after
update_sched_domains()), it just negates all the work done by
update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
the sched-domains and that makes it visible for the load-balancer
while the CPU_DOWN ops. is in progress.

__migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
"offline" when this function is called).

try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops.


Signed-off-by: Dmitry Adamushko <[email protected]>
CC: Ingo Molnar <[email protected]>
CC: Vegard Nossum <[email protected]>
CC: Paul Menage <[email protected]>
CC: Max Krasnyansky <[email protected]>
CC: Paul Jackson <[email protected]>
CC: Peter Zijlstra <[email protected]>
CC: [email protected]
CC: [email protected]
CC: Thomas Gleixner <[email protected]>

---
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..798b3ab 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1882,7 +1882,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
* in order to minimize text size.
*/

-static void common_cpu_mem_hotplug_unplug(void)
+static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
{
cgroup_lock();

@@ -1894,7 +1894,8 @@ static void common_cpu_mem_hotplug_unplug(void)
* Scheduler destroys domains on hotplug events.
* Rebuild them based on the current settings.
*/
- rebuild_sched_domains();
+ if (rebuild_sd)
+ rebuild_sched_domains();

cgroup_unlock();
}
@@ -1912,11 +1913,22 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ switch (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);
+ break;
+ default:
return NOTIFY_DONE;
+ }

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -1929,7 +1941,7 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,

void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug();
+ common_cpu_mem_hotplug_unplug(0);
}
#endif


---

2008-07-13 08:50:27

by Vegard Nossum

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Sun, Jul 13, 2008 at 2:10 AM, Dmitry Adamushko
<[email protected]> wrote:
> Subject: fix cpuset_handle_cpuhp()
>
> The following commit
>
> ---
> commit f18f982abf183e91f435990d337164c7a43d1e6d
> Author: Max Krasnyansky <[email protected]>
> Date: Thu May 29 11:17:01 2008 -0700
>
> sched: CPU hotplug events must not destroy scheduler domains created by
> the cpusets
> ---
>
> [ Note, with this commit arch_update_cpu_topology is not called any more for CPUSETS. But it's just a nop.
> The whole scheme should be probably reworked later. ]
>
>
> introduced a hotplug-related problem as described below:
>
> [ Basically the fix below just emulates the 'old' behavior of update_sched_domains().
> We call rebuild_sched_domains() for the same hotplug-events as it was called (and is still called
> for !CPUSETS case) in update_sched_domains(). ]
>
>
> Upon CPU_DOWN_PREPARE, update_sched_domains() -> detach_destroy_domains(&cpu_online_map)
> does the following:
>
> /*
> * Force a reinitialization of the sched domains hierarchy. The domains
> * and groups cannot be updated in place without racing with the
> balancing
> * code, so we temporarily attach all running cpus to the NULL domain
> * which will prevent rebalancing while the sched domains are
> recalculated.
> */
>
> The sched-domains should be rebuilt when a CPU_DOWN ops. has been
> completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
> CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
> initial state). That's what update_sched_domains() also does but only
> for !CPUSETS case.
>
> With Max's patch, sched-domains' reinitialization is delegated to
> CPUSETS code:
>
> cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> rebuild_sched_domains()
>
> Being called for CPU_UP_PREPARE and if its callback is called after
> update_sched_domains()), it just negates all the work done by
> update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
> the sched-domains and that makes it visible for the load-balancer
> while the CPU_DOWN ops. is in progress.
>
> __migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
> "offline" when this function is called).
>
> try_to_wake_up() is called for one of these tasks from another CPU ->
> the load-balancer (wake_idle()) picks up a "dead" CPU and places the
> task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
> -> oops.
>
>
> Signed-off-by: Dmitry Adamushko <[email protected]>

Tested-by: Vegard Nossum <[email protected]>

Works :-)


Vegard

--
"The animistic metaphor of the bug that maliciously sneaked in while
the programmer was not looking is intellectually dishonest as it
disguises that the error is the programmer's own creation."
-- E. W. Dijkstra, EWD1036

2008-07-13 09:41:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


* Vegard Nossum <[email protected]> wrote:

> On Sun, Jul 13, 2008 at 2:10 AM, Dmitry Adamushko
> <[email protected]> wrote:
> > Subject: fix cpuset_handle_cpuhp()
> >
> > The following commit
> >
> > ---
> > commit f18f982abf183e91f435990d337164c7a43d1e6d
> > Author: Max Krasnyansky <[email protected]>
> > Date: Thu May 29 11:17:01 2008 -0700
> >
> > sched: CPU hotplug events must not destroy scheduler domains created by
> > the cpusets
> > ---
> >
> > [ Note, with this commit arch_update_cpu_topology is not called any more for CPUSETS. But it's just a nop.
> > The whole scheme should be probably reworked later. ]
> >
> >
> > introduced a hotplug-related problem as described below:
> >
> > [ Basically the fix below just emulates the 'old' behavior of update_sched_domains().
> > We call rebuild_sched_domains() for the same hotplug-events as it was called (and is still called
> > for !CPUSETS case) in update_sched_domains(). ]
> >
> >
> > Upon CPU_DOWN_PREPARE, update_sched_domains() -> detach_destroy_domains(&cpu_online_map)
> > does the following:
> >
> > /*
> > * Force a reinitialization of the sched domains hierarchy. The domains
> > * and groups cannot be updated in place without racing with the
> > balancing
> > * code, so we temporarily attach all running cpus to the NULL domain
> > * which will prevent rebalancing while the sched domains are
> > recalculated.
> > */
> >
> > The sched-domains should be rebuilt when a CPU_DOWN ops. has been
> > completed, effectively either upon CPU_DEAD{_FROZEN} (upon success) or
> > CPU_DOWN_FAILED{_FROZEN} (upon failure -- restore the things to their
> > initial state). That's what update_sched_domains() also does but only
> > for !CPUSETS case.
> >
> > With Max's patch, sched-domains' reinitialization is delegated to
> > CPUSETS code:
> >
> > cpuset_handle_cpuhp() -> common_cpu_mem_hotplug_unplug() ->
> > rebuild_sched_domains()
> >
> > Being called for CPU_UP_PREPARE and if its callback is called after
> > update_sched_domains()), it just negates all the work done by
> > update_sched_domains() -- i.e. a soon-to-be-offline cpu is included in
> > the sched-domains and that makes it visible for the load-balancer
> > while the CPU_DOWN ops. is in progress.
> >
> > __migrate_live_tasks() moves the tasks off a 'dead' cpu (it's already
> > "offline" when this function is called).
> >
> > try_to_wake_up() is called for one of these tasks from another CPU ->
> > the load-balancer (wake_idle()) picks up a "dead" CPU and places the
> > task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
> > -> oops.
> >
> >
> > Signed-off-by: Dmitry Adamushko <[email protected]>
>
> Tested-by: Vegard Nossum <[email protected]>
>
> Works :-)

thanks! I've tidied up the changelog and queued it up into
tip/sched/urgent. I'd prefer this more conservative patch so late in the
cycle, but i'll also queue up the more intrusive real fix from Linus and
Dmitry in sched/devel.

Linus, if you've not applied it already, you can pull Dmitry's fix from:

git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip.git sched-fixes-for-linus

shortlog, diffstat and diff below.

Thanks,

Ingo

------------------>
Dmitry Adamushko (1):
cpusets, hotplug, scheduler: fix scheduler domain breakage


kernel/cpuset.c | 24 ++++++++++++++++++------
1 files changed, 18 insertions(+), 6 deletions(-)

diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 9fceb97..798b3ab 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -1882,7 +1882,7 @@ static void scan_for_empty_cpusets(const struct cpuset *root)
* in order to minimize text size.
*/

-static void common_cpu_mem_hotplug_unplug(void)
+static void common_cpu_mem_hotplug_unplug(int rebuild_sd)
{
cgroup_lock();

@@ -1894,7 +1894,8 @@ static void common_cpu_mem_hotplug_unplug(void)
* Scheduler destroys domains on hotplug events.
* Rebuild them based on the current settings.
*/
- rebuild_sched_domains();
+ if (rebuild_sd)
+ rebuild_sched_domains();

cgroup_unlock();
}
@@ -1912,11 +1913,22 @@ static void common_cpu_mem_hotplug_unplug(void)
static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,
unsigned long phase, void *unused_cpu)
{
- if (phase == CPU_DYING || phase == CPU_DYING_FROZEN)
+ switch (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);
+ break;
+ default:
return NOTIFY_DONE;
+ }

- common_cpu_mem_hotplug_unplug();
- return 0;
+ return NOTIFY_OK;
}

#ifdef CONFIG_MEMORY_HOTPLUG
@@ -1929,7 +1941,7 @@ static int cpuset_handle_cpuhp(struct notifier_block *unused_nb,

void cpuset_track_online_nodes(void)
{
- common_cpu_mem_hotplug_unplug();
+ common_cpu_mem_hotplug_unplug(0);
}
#endif

2008-07-13 09:53:16

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/13 Linus Torvalds <[email protected]>:
> On Sun, 13 Jul 2008, Dmitry Adamushko wrote:
>>
>> try_to_wake_up() -> ... -> wake_idle() does not see "cpu_active_map".
>
> You're right. I missed a couple places, because that migrate code not only
> ends up using "cpu_is_offline()" instead of "!cpu_online()" (so my greps
> all failed), and because it has those online checks in multiple places.
> Grr.
>
> So it would need to change a few other "cpu_is_offline()" calls to
> "!cpu_active()" instead (in __migrate_task at a minimum).

it should have checked the result of select_task_rq() in
try_to_wake_up() or modify wake_idle() alternatively.

And let me explain one last time why I opposed your 'cpu_active_map' approach.

I do agree that there are likely ways to optimize the hotplug
machinery but I have been focused on fixing bugs in a scope of the
current framework trying to keep it intact with _minimal_ changes (as
it's probably .26 material).

The current way to synchronize with the load-balancer is to attach
NULL domains to all sched-domains upon CPU_DOWN_PREPARE and rebuild
sched-domains upon CPU_DOWN, effectively making the load-balancer
'blind' (and this way it's workable indeed). Perhaps it's an overkill
and something like being proposed by Miao or you should be
considered/tried as an alternative.

Even if we place "!cpu_active()" in all the load-balancer-related
places (btw., we can also do it with !cpu_online() / cpu_offline() as
Miao did with his initial patch) :

(1) common_cpu_mem_hotplug_unplug() -> rebuild_sched_domain() is still
called pretty "randomly" (breaking the aforementioned model). At the
very least it's an overkill;

(2) sched-domains are broken (at least while CPU_{UP,DOMS} ops. are in
progress) and in this state they are still used in a number of places.
That's just illogic;

With (2) in place, "cpu_mask_active" acts as a workaround to the
existing (broken by CPUSETS) model.
If we want "cpu_mask_active" as a primary solution, then the current
model should be altered (presumably, we don't need NULL domains any
more). Otherwise, it's kind of a strange (illogical) hybrid.


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-07-13 15:29:38

by Andi Kleen

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Linus Torvalds <[email protected]> writes:
>
> Btw - the way to avoid this whole problem might be to make CPU migration
> use a *different* CPU map than "online".

In general I think we need more such masks. Some time ago I had
a patch to introduce cpu_ever_online_mask, which prevented some
problems with accessing uninitialized per cpu data.

-Andi

2008-07-13 17:11:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Dmitry Adamushko wrote:

> And let me explain one last time why I opposed your 'cpu_active_map' approach.

And let me explain why you are totally off base.

> I do agree that there are likely ways to optimize the hotplug
> machinery [ .. deleted rambling .. ]

This has *NOTHING* to do with optimizing any hotplug machinery.

> The current way to synchronize with the load-balancer is to attach
> NULL domains [ .. deleted more ramblings .. ]

This has *NOTHING* to do even with cpusets and scheduler domains!

Until you can understand that, all your arguments are total and utter
CRAP.

So Dmitry - please follow along, and think this through.

This is a *fundamental* scheduler issue. It has nothing what-so-ever to do
with optimization, and it has nothing to do with cpusets. It's about the
fact that we migrate threads from one CPU to another - and we do that
whether cpusets are even enabled or not!

And anything that uses "cpu_active_map" to decide if the migration target
is alive is simply _buggy_.

See? Not "un-optimized". Not "cpusets". Just pure scheduling and hotplug
issues with taking a CPU down.

As long as you continue to only look at wake_idle() and scheduler domains,
you are missing all the *other* cases of migration. Like the one we do at
execve() time, or in balance_task.

The thing is, we should fix the top level code to never even _consider_ an
invalid CPU as a target, and that in turn should mean that all the other
code should be able to just totally ignore CPU hotplug events.

In other words, it vey fundamentally SHOULD NOT MATTER that somebody
happened to call "try_to_wake_up()" during the cpu unplug sequence. We
should fix the fundamental scheduler routines to simply make it impossible
for that to ever balance something back to a CPU that is going down.

And we shouldn't _care_ about what crazy things the cpusets code does.

See?

THAT is the reason for my patch. I think the cpusets callbacks are totally
insane, but I don't care. What I care about is that the scheduler got
confused just because those insane callbacks happened to make timing be
just subtle enough that (and I quote):

"try_to_wake_up() is called for one of these tasks from another CPU ->
the load-balancer (wake_idle()) picks up a "dead" CPU and places the
task on it. Then e.g. BUG_ON(rq->nr_running) detects this a bit later
-> oops."

IOW, we should never have had code that was that fragile in the first
place! It's totally INSANE to depend on complex and fragile code, when
we'd be much better off with simple code that always says: "I will not
migrate a task to a CPU that is going down".

Depending on complex (and conditional) scheduler domains data structures
is a *bug*. It's fragile, and it's a horrible design mistake.

Linus

2008-07-13 17:43:26

by Ingo Molnar

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


* Linus Torvalds <[email protected]> wrote:

> The thing is, we should fix the top level code to never even
> _consider_ an invalid CPU as a target, and that in turn should mean
> that all the other code should be able to just totally ignore CPU
> hotplug events.

agreed. We thought we could get away by hiding the "is the CPU dead"
information in existing data structures (by shaping sched domains to
never lead to a dead CPU) - but this method has proven itself fragile
via a series of bugs.

It was one micro-optimization one too many. We should just accept the
fact that the current model is not maintainable and add your extra (and
trivial) cpu_active_map layer that protects against migrating to CPUs
that are going down. [we'll basically introduce a "going down" state
inbetween 'online' and 'offline']

And this will get rid of some other fragile trickery - because from that
point on we dont have to be super-careful about the whole sequence of
manipulating sched domains anymore. Cpusets can do whatever it wants, it
wont be able to break hotplug+scheduling - and that's important for any
functionality that is not used by default.

[ I dont think Dmitry will disagree with this notion all that much, it's
just that his personal limit for calling an algorithm unmaintainable
is probably a lot higher than normal :-) ]

Ingo

2008-07-13 17:47:44

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Linus Torvalds wrote:
>
> The thing is, we should fix the top level code to never even _consider_ an
> invalid CPU as a target, and that in turn should mean that all the other
> code should be able to just totally ignore CPU hotplug events.

IOW, I think we should totally remove the whole "update_sched_domains()"
thing too. Any logic that needs it is broken. We shouldn't detach the
scheduler domains in DOWN_PREPARE (much less UP_PREPARE), we should just
leave them damn well alone.

As the comment says, "The domains and groups cannot be updated in place
without racing with the balancing code". The thing is, we shouldn't even
try. The correct way to handle all this is to make the balancing code use
the domains regardless, but protect against CPU's going down with
_another_ data structure that is much easier to update.

Namely something like 'cpu_active_map'.

Then we just get rid of all the crap in update_sched_domains() entirely,
and then we can make the cpusets code do the *sane* thing, which is to
rebuild the scheduler domains only when the CPU up/down has completed.

So instead of this illogical and crazy mess:

+ switch (phase) {
+ case CPU_UP_CANCELED:
+ case CPU_UP_CANCELED_FROZEN:
+ case CPU_DOWN_FAILED:
+ case CPU_DOWN_FAILED_FROZEN:
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);

it should just say

+ switch (phase) {
+ case CPU_ONLINE:
+ case CPU_ONLINE_FROZEN:
+ case CPU_DEAD:
+ case CPU_DEAD_FROZEN:
+ common_cpu_mem_hotplug_unplug(1);

because it only makes sense to rebuild the scheduler domains when the
thing SUCCEEDS.

See? By having a sane design, the code is not just more robust and easy to
follow, you can also simplify it and make it more logical.

The current design is not sane.

Linus

2008-07-13 18:13:25

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/13 Linus Torvalds <[email protected]>:
>
> On Sun, 13 Jul 2008, Linus Torvalds wrote:
>>
>> The thing is, we should fix the top level code to never even _consider_ an
>> invalid CPU as a target, and that in turn should mean that all the other
>> code should be able to just totally ignore CPU hotplug events.
>
> IOW, I think we should totally remove the whole "update_sched_domains()"
> thing too. Any logic that needs it is broken. We shouldn't detach the
> scheduler domains in DOWN_PREPARE (much less UP_PREPARE), we should just
> leave them damn well alone.
>
> As the comment says, "The domains and groups cannot be updated in place
> without racing with the balancing code". The thing is, we shouldn't even
> try. The correct way to handle all this is to make the balancing code use
> the domains regardless, but protect against CPU's going down with
> _another_ data structure that is much easier to update.
>
> Namely something like 'cpu_active_map'.
>
> Then we just get rid of all the crap in update_sched_domains() entirely,
> and then we can make the cpusets code do the *sane* thing, which is to
> rebuild the scheduler domains only when the CPU up/down has completed.
>
> So instead of this illogical and crazy mess:
>
> + switch (phase) {
> + case CPU_UP_CANCELED:
> + case CPU_UP_CANCELED_FROZEN:
> + case CPU_DOWN_FAILED:
> + case CPU_DOWN_FAILED_FROZEN:
> + case CPU_ONLINE:
> + case CPU_ONLINE_FROZEN:
> + case CPU_DEAD:
> + case CPU_DEAD_FROZEN:
> + common_cpu_mem_hotplug_unplug(1);
>
> it should just say
>
> + switch (phase) {
> + case CPU_ONLINE:
> + case CPU_ONLINE_FROZEN:
> + case CPU_DEAD:
> + case CPU_DEAD_FROZEN:
> + common_cpu_mem_hotplug_unplug(1);
>
> because it only makes sense to rebuild the scheduler domains when the
> thing SUCCEEDS.
>
> See? By having a sane design, the code is not just more robust and easy to
> follow, you can also simplify it and make it more logical.

Yes, I agree. And I did _not_ say that the current design is sane. My
impression about changes acceptable during a late release cycle was
utterly CRAPPY (indeed, it's always better to immediately fix a
problem the right way, not just add another patch and pray it doesn't
break somewhere else).


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-07-13 18:19:45

by Ingo Molnar

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


* Dmitry Adamushko <[email protected]> wrote:

> > So instead of this illogical and crazy mess:
> >
> > + switch (phase) {
> > + case CPU_UP_CANCELED:
> > + case CPU_UP_CANCELED_FROZEN:
> > + case CPU_DOWN_FAILED:
> > + case CPU_DOWN_FAILED_FROZEN:
> > + case CPU_ONLINE:
> > + case CPU_ONLINE_FROZEN:
> > + case CPU_DEAD:
> > + case CPU_DEAD_FROZEN:
> > + common_cpu_mem_hotplug_unplug(1);
> >
> > it should just say
> >
> > + switch (phase) {
> > + case CPU_ONLINE:
> > + case CPU_ONLINE_FROZEN:
> > + case CPU_DEAD:
> > + case CPU_DEAD_FROZEN:
> > + common_cpu_mem_hotplug_unplug(1);
> >
> > because it only makes sense to rebuild the scheduler domains when the
> > thing SUCCEEDS.
> >
> > See? By having a sane design, the code is not just more robust and
> > easy to follow, you can also simplify it and make it more logical.
>
> Yes, I agree. And I did _not_ say that the current design is sane. My
> impression about changes acceptable during a late release cycle was
> utterly CRAPPY (indeed, it's always better to immediately fix a
> problem the right way, not just add another patch and pray it doesn't
> break somewhere else).

mind sending Linus's patch as a completed patchset against tip/master
(or tip/sched/devel) so that we can do it in early v2.6.27?

i still think your cpusets.c fix is what we should do for v2.6.26, given
that there's agreement about how to fix it for real and thus in terms of
regression/bug risk your patch is lower-impact and CPU hotplug has been
broken for such a long time.

But we should follow it up with Linus's patch immediately afterwards in
v2.6.27. Hm?

Ingo

2008-07-13 18:21:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Dmitry Adamushko wrote:
>
> Yes, I agree. And I did _not_ say that the current design is sane. My
> impression about changes acceptable during a late release cycle was
> utterly CRAPPY (indeed, it's always better to immediately fix a
> problem the right way, not just add another patch and pray it doesn't
> break somewhere else).

I do agree that since I really want to do 2.6.26 (and I don't think this
single issue is worth holding anything up for, considering that we do have
a tested and understood patch - I was worried for a while), for now we'll
do the minimal thing, and release 2.6.26 with just that.

I just want the longer-term strategy to be something more understandable
that doesn't depend on very subtle ordering of various different callback
functions etc.

Linus

2008-07-13 18:39:39

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Sun, 13 Jul 2008, Ingo Molnar wrote:
>
> mind sending Linus's patch as a completed patchset against tip/master
> (or tip/sched/devel) so that we can do it in early v2.6.27?

Note that my patch does need lots more eyes and more testing, since it
needs to make sure that anything that depended on online_map and/or just
on the scheduling domains now checks active_map instead (or in addition
to).

> i still think your cpusets.c fix is what we should do for v2.6.26, given
> that there's agreement about how to fix it for real and thus in terms of
> regression/bug risk your patch is lower-impact and CPU hotplug has been
> broken for such a long time.

Agreed. I've pulled it from your tree.

Linus

2008-07-14 15:50:02

by Mike Travis

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Andi Kleen wrote:
> Linus Torvalds <[email protected]> writes:
>> Btw - the way to avoid this whole problem might be to make CPU migration
>> use a *different* CPU map than "online".
>
> In general I think we need more such masks. Some time ago I had
> a patch to introduce cpu_ever_online_mask, which prevented some
> problems with accessing uninitialized per cpu data.

Not to divert the conversation too much but I would like to see these
maps allocated based on the real number of cpus and not NR_CPUS.

Thanks,
Mike

2008-07-14 22:38:48

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Sat, 12 Jul 2008, Linus Torvalds wrote:
> [ ... ]
>
> Btw - the way to avoid this whole problem might be to make CPU migration
> use a *different* CPU map than "online".
>
> This patch almost certainly doesn't work, but let me explain:
>
> - "cpu_online_map" is the CPU's that can be currently be running
>
> It is enabled/disabled by low-level architecture code when the CPU
> actually gets disabled.
>
> - Add a new "cpu_active_map", which is the CPU's that are currently fully
> set up, and can not just be running tasks, but can be _migrated_ to!
>
> - We can always just clear the "cpu_active_map" entry when we start a CPU
> down event - that guarantees that while tasks may be running on it,
> there won't be any _new_ tasks migrated to it.

(please correct me if I misinterpreted your point)

cpu_clear(cpu, cpu_active_map); _alone_ does not guarantee that after
its completion, no new tasks can appear on (be migrated to) 'cpu'.

cpu_clear() may race against migration operations which are already in
progress on other CPUs : executing right after a check for
!cpu_active(cpu) and before doing actual migration [*]

Am I missing something?

[ If no, then what I dare to say below is that: (a) with only
cpu_clear(cpu, cpu_active_map) in cpu_down(), "cpu_active_map" is
perhaps not much better than (alternatively) using existing
"cpu_online_map" to check if a task can be migrated to 'cpu' _and_ (b)
there are also a few (rough) speculations on how to fix [*] ]

New tasks may appear on (soon-to-be-dead) 'cpu' at any point until
_cpu_down() calls

__stop_machine_run() -> [ next is called by 'kstopmachine' ] do_stop()
-> stop_machine()

stop_machine() starts a RT high-prio thread on each online cpu and
waits until these threads get scheduled in (take control of cpus).
That guarantees a re-schedule on each CPU has taken place.
In turn, it means none of the CPUs are in the middle of task-migration
operation [**] and further task-migration operations can not race
against cpu_down() -> cpu_clear() (in a sense, stop_machine() is a
synchronization point).

[**] migration operations are done with rq->lock being held.

OTOH, cpu_clear(cpu, cpu_online_map) takes place right after
stop_machine() : do_stop() -> take_cpu_down() (via smdata->fn()) ->
__cpu_disable().

Let's imagine we update all places in the scheduler where
task-migration may take place with a check for either
(a) !cpu_active(cpu) _or_ (b) cpu_offline(cpu) :

then for both cases new tasks may apear on 'cpu' for which cpu_down()
is in progress and for both cases - until __stop_machine_run() -> ...
-> stop_machine() gets called.

Hm?

In any case, the scheduler does not depend on sched-domains to do
migration and migration to offline cpus is not possible (although,
it's possible to soon-to-be-offline cpus), but OTOH we depend on
internals of __stop_machine_run() [ it acts as a sync. point ].

To solve both, we might introduce a special synchronization point
right after cpu_clear(cpu, cpu_active_map) gets called in cpu_down().

[ simplest (probably stupid) approaches ]

(a)

per-cpu rw_lock, readers' part is taken by task-migration code,
writer's part is in cpu_down():

rw_write_lock(per_cpu(migration_lock, cpu)); cpu_clear(cpu,
cpu_active_map); rw_write_unlock(...);

(b)

add rq->migration counter (per-cpu)

inc(rq->migration);
if (cpu_active(dst_cpu))
do_migration(dst_cpu);
dec(rq->migration);

cpu_active_sync(cpu)
{
for_each_online_cpu:
while (rq->migration) { cpu_relax(); }
}

(c)

per-cpu "migration_counter" so per_cpu(migration_counter, dst_cpu)
gets +1 while a migration operation _to_ this cpu is in progress and
then

cpu_active_sync(to_be_offline_cpu)
{
while (per_cpu(migration_counter, to_be_offline_cpu) != 0) { cpu_relax(); }
}


--
Best regards,
Dmitry Adamushko

2008-07-14 23:06:25

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Tue, 15 Jul 2008, Dmitry Adamushko wrote:
>
> cpu_clear(cpu, cpu_active_map); _alone_ does not guarantee that after
> its completion, no new tasks can appear on (be migrated to) 'cpu'.

But I think we should make it do that.

I do realize that we "queue" processes, but that's part of the whole
complexity. More importantly, the people who do that kind of asynchronous
queueing don't even really care - *if* they cared about the process
_having_ to show up on the destination core, they'd be waiting
synchronously and re-trying (which they do).

So by doing the test for cpu_active_map not at queuing time, but at the
time when we actually try to do the migration, we can now also make that
cpu_active_map be totally serialized.

(Of course, anybody who clears the bit does need to take the runqueue lock
of that CPU too, but cpu_down() will have to do that as it does the
"migrate away live tasks" anyway, so that's not a problem)

Linus

2008-07-15 00:00:57

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/15 Linus Torvalds <[email protected]>:
>
> On Tue, 15 Jul 2008, Dmitry Adamushko wrote:
>>
>> cpu_clear(cpu, cpu_active_map); _alone_ does not guarantee that after
>> its completion, no new tasks can appear on (be migrated to) 'cpu'.
>
> But I think we should make it do that.
>
> I do realize that we "queue" processes, but that's part of the whole
> complexity. More importantly, the people who do that kind of asynchronous
> queueing don't even really care - *if* they cared about the process
> _having_ to show up on the destination core, they'd be waiting
> synchronously and re-trying (which they do).
>
> So by doing the test for cpu_active_map not at queuing time, but at the
> time when we actually try to do the migration,
> we can now also make that
> cpu_active_map be totally serialized.
>
> (Of course, anybody who clears the bit does need to take the runqueue lock
> of that CPU too, but cpu_down() will have to do that as it does the
> "migrate away live tasks" anyway, so that's not a problem)

The 'synchronization' point occurs even earlier - when cpu_down() ->
__stop_machine_run() gets called (as I described in my previous mail).

My point was that if it's ok to have a _delayed_ synchronization
point, having it not immediately after cpu_clear(cpu, cpu_active_map)
but when the "runqueue lock" is taken a bit later (as you pointed out
above) or __stop_machine_run() gets executed (which is a sync point,
scheduling-wise),

then we can implement the proper synchronization (hotplugging vs.
task-migration) with cpu_online_map (no need for cpu_active_map).

Note, currently, _not_ all places in the scheduler where an actual
migration (not just queuing requests) takes place do the test for
cpu_offline(). Instead, they (blindly) rely on the assumption that if
a cpu is available via sched-domains, then it's guaranteed to be
online (and can be migrated to).

Provided all those places had cpu_offline() (additionally) in place,
the bug which has been discussed in this thread would _not_ happen
and, moreover, we would _not_ need to do all the fancy "attach NULL
domains" sched-domain manipulations (which depend on DOWN_PREPARE,
DOWN and other hotpluging events). We would only need to rebuild
domains once upon CPU_DOWN (on success).

p.s. hope my point is more understandable now (or it's clear that I'm
missing something at this late hour :^)


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-07-15 00:24:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Tue, 15 Jul 2008, Dmitry Adamushko wrote:
>
> The 'synchronization' point occurs even earlier - when cpu_down() ->
> __stop_machine_run() gets called (as I described in my previous mail).
>
> My point was that if it's ok to have a _delayed_ synchronization
> point, having it not immediately after cpu_clear(cpu, cpu_active_map)
> but when the "runqueue lock" is taken a bit later (as you pointed out
> above) or __stop_machine_run() gets executed (which is a sync point,
> scheduling-wise),
>
> then we can implement the proper synchronization (hotplugging vs.
> task-migration) with cpu_online_map (no need for cpu_active_map).

Maybe. But what is the point? And more importantly, I think it's wrong.

There's really a *difference* between "this CPU is still running, but
going down" and "this CPU is running".

And it's a valid difference. For example, a process should be able to
absolutely DEPEND on being able to depend on cpu_online(current_cpu())
*always* being true.

I also don't understand why people are arguing against a single new CPU
map (it's _global_ to the whole kernel, for crissake!) when it clearly
makes the rules much simpler. Look at the patch I sent out, and tell me it
isn't 100% obvious what cpu_active_map does, and what the logic is.

In contrast, try to follow the same for cpu_online_map. I dare you. You
have to already know that code really really well in order to understand
what happens to it, both at bootup _and_ at hotplug events.

Dmitry, THIS CODE WAS BUGGY. Not just once. Multiple f*cking times!

That should tell you something.

In particular, it should tell you that the code is too hard to follow, and
too fragile, and a total mess.

I do NOT understand why you seem to argue for being "subtle" and "clever",
considering the history of this whole setup. Subtle and clever and complex
is what got us to the crap situation.

So here's the code-word of today: KISS - Keep It Simple Stupid.

And I _guarantee_ that the "cpu_active_map" approach is a hell of a lot
simpler than the alternatives. Partly because it really matches what we
want much more closely: it gives a clear new state for "this CPU is going
down, even though things are still running on it".

And then it's 100% logical to say: "ok, if it's going down, we agree to
not add new processes to it".

THAT is the kind of logic we should strive for. Not "let's avoid the
obvious and simple code because we can re-use the existing messy code for
yet another thing".

Dammit, this code should be easier to understand, not harder!

Linus

2008-07-15 02:21:22

by Dmitry Adamushko

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

2008/7/15 Linus Torvalds <[email protected]>:
>
>
> On Tue, 15 Jul 2008, Dmitry Adamushko wrote:
>>
>> The 'synchronization' point occurs even earlier - when cpu_down() ->
>> __stop_machine_run() gets called (as I described in my previous mail).
>>
>> My point was that if it's ok to have a _delayed_ synchronization
>> point, having it not immediately after cpu_clear(cpu, cpu_active_map)
>> but when the "runqueue lock" is taken a bit later (as you pointed out
>> above) or __stop_machine_run() gets executed (which is a sync point,
>> scheduling-wise),
>>
>> then we can implement the proper synchronization (hotplugging vs.
>> task-migration) with cpu_online_map (no need for cpu_active_map).
>
> [ ... ]
>
> In particular, it should tell you that the code is too hard to follow, and
> too fragile, and a total mess.
>
> I do NOT understand why you seem to argue for being "subtle" and "clever",
> considering the history of this whole setup. Subtle and clever and complex
> is what got us to the crap situation.

Fair enough, agreed.


>
> Linus
>

--
Best regards,
Dmitry Adamushko

2008-07-15 03:03:35

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Dmitry Adamushko wrote:
> 2008/7/15 Linus Torvalds <[email protected]>:
>>
>> On Tue, 15 Jul 2008, Dmitry Adamushko wrote:
>>> The 'synchronization' point occurs even earlier - when cpu_down() ->
>>> __stop_machine_run() gets called (as I described in my previous mail).
>>>
>>> My point was that if it's ok to have a _delayed_ synchronization
>>> point, having it not immediately after cpu_clear(cpu, cpu_active_map)
>>> but when the "runqueue lock" is taken a bit later (as you pointed out
>>> above) or __stop_machine_run() gets executed (which is a sync point,
>>> scheduling-wise),
>>>
>>> then we can implement the proper synchronization (hotplugging vs.
>>> task-migration) with cpu_online_map (no need for cpu_active_map).
>> [ ... ]
>>
>> In particular, it should tell you that the code is too hard to follow, and
>> too fragile, and a total mess.
>>
>> I do NOT understand why you seem to argue for being "subtle" and "clever",
>> considering the history of this whole setup. Subtle and clever and complex
>> is what got us to the crap situation.
>
> Fair enough, agreed.

Ok. Sounds like the consensus is to try and do this cpu_active_map thing, and
it sounds like it will lets us get rid of the "destroy domains / rebuild
domains" logic, which would be a good thing. I've spent a good part of the
weekend chasing circular locking dependencies in calling
rebuild_sched_domains() from cpu hotplug handler path. Which we'll still need
(to update domains on CPU UP and DOWN events) but not having to blow away the
domains as often as we do now will simplify things, and probably make hotplug
events a bit less disruptive.

Did you guys an updated patch ? Dmitry pointed out several things that Linus
missed in his original version. I guess I can go through the thread and
reconstruct that but if you have a patch I can try let me know.

Max

2008-07-15 03:23:20

by Steven Rostedt

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


On Mon, 14 Jul 2008, Linus Torvalds wrote:
> So by doing the test for cpu_active_map not at queuing time, but at the
> time when we actually try to do the migration, we can now also make that
> cpu_active_map be totally serialized.
>
> (Of course, anybody who clears the bit does need to take the runqueue lock
> of that CPU too, but cpu_down() will have to do that as it does the
> "migrate away live tasks" anyway, so that's not a problem)

Wouldn't simply doing a synchronize_sched() after clearing the bit also
make sure that no new task will be scheduled on that CPU?

-- Steve

2008-07-15 03:37:26

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Mon, 14 Jul 2008, Steven Rostedt wrote:

>
> On Mon, 14 Jul 2008, Linus Torvalds wrote:
> > So by doing the test for cpu_active_map not at queuing time, but at the
> > time when we actually try to do the migration, we can now also make that
> > cpu_active_map be totally serialized.
> >
> > (Of course, anybody who clears the bit does need to take the runqueue lock
> > of that CPU too, but cpu_down() will have to do that as it does the
> > "migrate away live tasks" anyway, so that's not a problem)
>
> Wouldn't simply doing a synchronize_sched() after clearing the bit also
> make sure that no new task will be scheduled on that CPU?

My point was that it DOESN'T NEED TO DO ANYTHING AT ALL.

It has to get the runqueue lock in order to move the currently executing
threads off that CPU anyway. The fact that it can (and actually does)
synchronize with the scheduler in other ways is totally and utterly
immaterial.

That's what "robust" means. It means that things just work, and there are
no subtle interactions.

Sure, you can serialize with something complicated and heavy.

But isn't it nice that the absolutely _least_ complicated and heavy
operation (ie getting the runqueue lock) also serializes sufficiently?
Isn't it nice how you have to do that in order to do all those other
things that you obviously have to do?

Please don't argue about how we can add more subtle rules, or how other
thigns can serialize this thing as well. Isn't it sufficient that the
_obvious_ things serialize it?

Linus

2008-07-15 03:47:37

by Steven Rostedt

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Mon, 14 Jul 2008, Linus Torvalds wrote:
> >
> > On Mon, 14 Jul 2008, Linus Torvalds wrote:
> > > So by doing the test for cpu_active_map not at queuing time, but at the
> > > time when we actually try to do the migration, we can now also make that
> > > cpu_active_map be totally serialized.
> > >
> > > (Of course, anybody who clears the bit does need to take the runqueue lock
> > > of that CPU too, but cpu_down() will have to do that as it does the
> > > "migrate away live tasks" anyway, so that's not a problem)
> >
> > Wouldn't simply doing a synchronize_sched() after clearing the bit also
> > make sure that no new task will be scheduled on that CPU?
>
> My point was that it DOESN'T NEED TO DO ANYTHING AT ALL.
>
> It has to get the runqueue lock in order to move the currently executing
> threads off that CPU anyway. The fact that it can (and actually does)
> synchronize with the scheduler in other ways is totally and utterly
> immaterial.
>
> That's what "robust" means. It means that things just work, and there are
> no subtle interactions.
>
> Sure, you can serialize with something complicated and heavy.
>
> But isn't it nice that the absolutely _least_ complicated and heavy
> operation (ie getting the runqueue lock) also serializes sufficiently?
> Isn't it nice how you have to do that in order to do all those other
> things that you obviously have to do?
>
> Please don't argue about how we can add more subtle rules, or how other
> thigns can serialize this thing as well. Isn't it sufficient that the
> _obvious_ things serialize it?

Oh, I'm not arguing. My mind is going off to an even bigger picture, where
something in the future would need to stop migration to a particular CPU,
and that it could simply clear the bit and call synchronize_sched. The run
queue lock is only visible to the scheduler. Sorry, I may have been day
dreaming out loud ;-)

But for the case at hand, yes I agree, simply grabbing the run queue lock
is a very elegant and simple solution.

-- Steve

2008-07-15 04:04:54

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Mon, 14 Jul 2008, Steven Rostedt wrote:
>
> Oh, I'm not arguing. My mind is going off to an even bigger picture, where
> something in the future would need to stop migration to a particular CPU,
> and that it could simply clear the bit and call synchronize_sched. The run
> queue lock is only visible to the scheduler. Sorry, I may have been day
> dreaming out loud ;-)

Well, you'd be stuck right now anyway.

At least in my trivial patch, the cpu_active_map locking is protected by
'cpu_add_remove_lock' which is static to cpu.c. The only thing that
modifies it (apart from the initial setup before SMP has been brought up)
is the hotplug code.

Linus

2008-07-15 04:13:23

by Linus Torvalds

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Mon, 14 Jul 2008, Max Krasnyansky wrote:
>
> Did you guys an updated patch ? Dmitry pointed out several things that Linus
> missed in his original version. I guess I can go through the thread and
> reconstruct that but if you have a patch I can try let me know.

I didn't update it, and right now I'm just merging too much (and
discussing the merges) to have time.

The patch really needs to have some scheduler person look at the use fo
cpu_active_map - I was kind of hoping that Ingo would.

The other thing I wish somebody would at least double-check (and
preferably just remove) is the !CPU_HOTPLUG special case. For that case, I
literally did

#define cpu_active_map cpu_online_map

and just depended on the fact that the few cases that touched
cpu_online_map even when CPU_HOTPLUG was disabled would do

if (cpu_online(cpu))
set_cpu(cpu, cpu_active_map);

(or something - I ended up discaring the patch as I was doing the whole
2.6.26 release, so I don't even have it in front of me right now, just in
my email archives and in my memory).

And quite frankly, I don't think that was a good idea. It was exactly the
sort of "subtle cleverness" that I myself argue against. So I think that
aliasing of cpu_active/online_map was a mistake - it was kind of cute, but
there's really no good reason to do it. It's not like that one CPU map
wastes any memory, and the special-case simply isn't worth it.

Linus

2008-07-15 04:16:18

by Steven Rostedt

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken



On Mon, 14 Jul 2008, Linus Torvalds wrote:
> >
> > Oh, I'm not arguing. My mind is going off to an even bigger picture, where
> > something in the future would need to stop migration to a particular CPU,
> > and that it could simply clear the bit and call synchronize_sched. The run
> > queue lock is only visible to the scheduler. Sorry, I may have been day
> > dreaming out loud ;-)
>
> Well, you'd be stuck right now anyway.
>
> At least in my trivial patch, the cpu_active_map locking is protected by
> 'cpu_add_remove_lock' which is static to cpu.c. The only thing that
> modifies it (apart from the initial setup before SMP has been brought up)
> is the hotplug code.

Yeah, I see that now. I was thinking back to a time I was doing some crazy
kernel hacking and would have liked this feature (stop new tasks from
migrating to a CPU). cpusets today are probably a better answer, and what
I was thinking with using the cpu_active_map is more of a hack.

Sorry for the noise, I'll go back to updating ftrace.txt

-- Steve

2008-07-15 08:33:00

by Ingo Molnar

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


* Linus Torvalds <[email protected]> wrote:

> On Mon, 14 Jul 2008, Max Krasnyansky wrote:
> >
> > Did you guys an updated patch ? Dmitry pointed out several things that Linus
> > missed in his original version. I guess I can go through the thread and
> > reconstruct that but if you have a patch I can try let me know.
>
> I didn't update it, and right now I'm just merging too much (and
> discussing the merges) to have time.
>
> The patch really needs to have some scheduler person look at the use
> fo cpu_active_map - I was kind of hoping that Ingo would.

yeah - it's very high on our TODO list :-) Peter, Dmitry and me are
looking into it.

I didnt touch most of -tip in the past few days to get a rock solid QA
track record for all items we have.

Ingo

2008-07-15 08:42:49

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>> On Mon, 14 Jul 2008, Max Krasnyansky wrote:
>>> Did you guys an updated patch ? Dmitry pointed out several things that Linus
>>> missed in his original version. I guess I can go through the thread and
>>> reconstruct that but if you have a patch I can try let me know.
>> I didn't update it, and right now I'm just merging too much (and
>> discussing the merges) to have time.
>>
>> The patch really needs to have some scheduler person look at the use
>> fo cpu_active_map - I was kind of hoping that Ingo would.
>
> yeah - it's very high on our TODO list :-) Peter, Dmitry and me are
> looking into it.
>
> I didnt touch most of -tip in the past few days to get a rock solid QA
> track record for all items we have.

I just sent you guys a patch. Please take a look. I've probably missed something
but it should close (I think). Also we'd probably at least want the bits that
streamline the domain reinitialization because it helps with cpusets (ie uses
same exact path for all cases).

Max

2008-07-15 08:58:20

by Ingo Molnar

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken


* Max Krasnyansky <[email protected]> wrote:

> Ingo Molnar wrote:
>> * Linus Torvalds <[email protected]> wrote:
>>
>>> On Mon, 14 Jul 2008, Max Krasnyansky wrote:
>>>> Did you guys an updated patch ? Dmitry pointed out several things that Linus
>>>> missed in his original version. I guess I can go through the thread and
>>>> reconstruct that but if you have a patch I can try let me know.
>>> I didn't update it, and right now I'm just merging too much (and
>>> discussing the merges) to have time.
>>>
>>> The patch really needs to have some scheduler person look at the use
>>> fo cpu_active_map - I was kind of hoping that Ingo would.
>>
>> yeah - it's very high on our TODO list :-) Peter, Dmitry and me are
>> looking into it.
>>
>> I didnt touch most of -tip in the past few days to get a rock solid QA
>> track record for all items we have.
>
> I just sent you guys a patch. Please take a look. I've probably missed
> something but it should close (I think). Also we'd probably at least
> want the bits that streamline the domain reinitialization because it
> helps with cpusets (ie uses same exact path for all cases).

thanks Max. Since upstream already has Dmitry's it conflicted with your
patch - i fixed the interactions up, see it below. (completely untested)

It's not ready for inclusion yet though - these new
#ifdefs are quite ugly:

+#if !defined(CONFIG_CPUSETS)
+ partition_sched_domains(0, NULL, NULL);
+#else
+ rebuild_sched_domains();
+#endif

we should just have a single method for refreshing sched domains
hierarchy, and in the !CONFIG_CPUSETS case that should simply fall back
to partition_sched_domains().

We can do that by making rebuild_sched_domains() the primary method that
is called - and in the !CPUSETS case it's an inline that calls
partition_sched_domains().

also, small nits:

use #ifdef CONFIG_CPUSETS instead of "#if !defined(CONFIG_CPUSETS)".

and while at it:

+#if !defined(CONFIG_CPUSETS)
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
+#endif

that race should be closed now, hm?

Ingo

------------------>
Subject: cpu hotplug, sched: introduce cpu_active_map and redo sched domain managment
From: Max Krasnyansky <[email protected]>
Date: Tue, 15 Jul 2008 01:40:10 -0700

This is based on Linus' idea of creating cpu_active_map that prevents
scheduler load balancer from migrating tasks to the cpu that is going
down.

It allows us to simplify domain management code and avoid unecessary
domain rebuilds during cpu hotplug event handling.

Please ignore the cpusets part for now. It needs some more work in order
to avoid crazy lock nesting. Although I did simplfy and unify domain
reinitialization logic. We now simply call partition_sched_domains() in
all the cases. This means that we're using exact same code paths as in
cpusets case and hence the test below cover cpusets too.

This not only boots but also easily handles
while true; do make clean; make -j 8; done
and
while true; do on-off-cpu 1; done
at the same time.
(on-off-cpu 1 simple does echo 0/1 > /sys/.../cpu1/online thing).

Suprising the box (dualcore Core2) is quite usable. In fact I'm typing
this on right now in gnome-terminal and things are moving just fine.

I beleive I addressed all of the Dmitry's comments for original Linus'
version. I changed both fair and rt balancer to mask out non-active cpus.
And replaced cpu_is_offline() with !cpu_active() in the main scheduler
code where it made sense (to me).

I've probably missed something but I'd dare to say consider for the
inclusion ;-)

Signed-off-by: Max Krasnyanskiy <[email protected]>
Cc: [email protected]
Cc: Max Krasnyanskiy <[email protected]>
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Cc: [email protected]
Signed-off-by: Ingo Molnar <[email protected]>
---
include/linux/cpumask.h | 6 ++
init/main.c | 7 +++
kernel/cpu.c | 30 +++++++++++---
kernel/sched.c | 100 ++++++++++++++----------------------------------
kernel/sched_fair.c | 2
kernel/sched_rt.c | 5 ++
6 files changed, 74 insertions(+), 76 deletions(-)

Index: tip/include/linux/cpumask.h
===================================================================
--- tip.orig/include/linux/cpumask.h
+++ tip/include/linux/cpumask.h
@@ -396,13 +396,14 @@ int __next_cpu_nr(int n, const cpumask_t

/*
* The following particular system cpumasks and operations manage
- * possible, present and online cpus. Each of them is a fixed size
+ * possible, present, active and online cpus. Each of them is a fixed size
* bitmap of size NR_CPUS.
*
* #ifdef CONFIG_HOTPLUG_CPU
* cpu_possible_map - has bit 'cpu' set iff cpu is populatable
* cpu_present_map - has bit 'cpu' set iff cpu is populated
* cpu_online_map - has bit 'cpu' set iff cpu available to scheduler
+ * cpu_active_map - has bit 'cpu' set iff cpu available to migration
* #else
* cpu_possible_map - has bit 'cpu' set iff cpu is populated
* cpu_present_map - copy of cpu_possible_map
@@ -453,6 +454,7 @@ int __next_cpu_nr(int n, const cpumask_t
extern cpumask_t cpu_possible_map;
extern cpumask_t cpu_online_map;
extern cpumask_t cpu_present_map;
+extern cpumask_t cpu_active_map;

#if NR_CPUS > 1
#define num_online_cpus() cpus_weight_nr(cpu_online_map)
@@ -461,6 +463,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) cpu_isset((cpu), cpu_online_map)
#define cpu_possible(cpu) cpu_isset((cpu), cpu_possible_map)
#define cpu_present(cpu) cpu_isset((cpu), cpu_present_map)
+#define cpu_active(cpu) cpu_isset((cpu), cpu_active_map)
#else
#define num_online_cpus() 1
#define num_possible_cpus() 1
@@ -468,6 +471,7 @@ extern cpumask_t cpu_present_map;
#define cpu_online(cpu) ((cpu) == 0)
#define cpu_possible(cpu) ((cpu) == 0)
#define cpu_present(cpu) ((cpu) == 0)
+#define cpu_active(cpu) ((cpu) == 0)
#endif

#define cpu_is_offline(cpu) unlikely(!cpu_online(cpu))
Index: tip/init/main.c
===================================================================
--- tip.orig/init/main.c
+++ tip/init/main.c
@@ -417,6 +417,13 @@ static void __init smp_init(void)
{
unsigned int cpu;

+ /*
+ * Set up the current CPU as possible to migrate to.
+ * The other ones will be done by cpu_up/cpu_down()
+ */
+ cpu = smp_processor_id();
+ cpu_set(cpu, cpu_active_map);
+
/* FIXME: This should be done in userspace --RR */
for_each_present_cpu(cpu) {
if (num_online_cpus() >= setup_max_cpus)
Index: tip/kernel/cpu.c
===================================================================
--- tip.orig/kernel/cpu.c
+++ tip/kernel/cpu.c
@@ -64,6 +64,8 @@ void __init cpu_hotplug_init(void)
cpu_hotplug.refcount = 0;
}

+cpumask_t cpu_active_map;
+
#ifdef CONFIG_HOTPLUG_CPU

void get_online_cpus(void)
@@ -291,11 +293,20 @@ int __ref cpu_down(unsigned int cpu)
int err = 0;

cpu_maps_update_begin();
- if (cpu_hotplug_disabled)
+
+ if (cpu_hotplug_disabled) {
err = -EBUSY;
- else
- err = _cpu_down(cpu, 0);
+ goto out;
+ }
+
+ cpu_clear(cpu, cpu_active_map);
+
+ err = _cpu_down(cpu, 0);
+
+ if (cpu_online(cpu))
+ cpu_set(cpu, cpu_active_map);

+out:
cpu_maps_update_done();
return err;
}
@@ -354,11 +365,18 @@ int __cpuinit cpu_up(unsigned int cpu)
}

cpu_maps_update_begin();
- if (cpu_hotplug_disabled)
+
+ if (cpu_hotplug_disabled) {
err = -EBUSY;
- else
- err = _cpu_up(cpu, 0);
+ goto out;
+ }

+ err = _cpu_up(cpu, 0);
+
+ if (cpu_online(cpu))
+ cpu_set(cpu, cpu_active_map);
+
+out:
cpu_maps_update_done();
return err;
}
Index: tip/kernel/sched.c
===================================================================
--- tip.orig/kernel/sched.c
+++ tip/kernel/sched.c
@@ -2877,7 +2877,7 @@ static void sched_migrate_task(struct ta

rq = task_rq_lock(p, &flags);
if (!cpu_isset(dest_cpu, p->cpus_allowed)
- || unlikely(cpu_is_offline(dest_cpu)))
+ || unlikely(!cpu_active(dest_cpu)))
goto out;

trace_kernel_sched_migrate_task(p, cpu_of(rq), dest_cpu);
@@ -3846,7 +3846,7 @@ int select_nohz_load_balancer(int stop_t
/*
* If we are going offline and still the leader, give up!
*/
- if (cpu_is_offline(cpu) &&
+ if (!cpu_active(cpu) &&
atomic_read(&nohz.load_balancer) == cpu) {
if (atomic_cmpxchg(&nohz.load_balancer, cpu, -1) != cpu)
BUG();
@@ -5868,7 +5868,7 @@ static int __migrate_task(struct task_st
struct rq *rq_dest, *rq_src;
int ret = 0, on_rq;

- if (unlikely(cpu_is_offline(dest_cpu)))
+ if (unlikely(!cpu_active(dest_cpu)))
return ret;

rq_src = cpu_rq(src_cpu);
@@ -7545,18 +7545,6 @@ void __attribute__((weak)) arch_update_c
}

/*
- * Free current domain masks.
- * Called after all cpus are attached to NULL domain.
- */
-static void free_sched_domains(void)
-{
- ndoms_cur = 0;
- if (doms_cur != &fallback_doms)
- kfree(doms_cur);
- doms_cur = &fallback_doms;
-}
-
-/*
* Set up scheduler domains and groups. Callers must hold the hotplug lock.
* For now this just excludes isolated cpus, but could be used to
* exclude other special cases in the future.
@@ -7634,7 +7622,7 @@ static int dattrs_equal(struct sched_dom
* ownership of it and will kfree it when done with it. If the caller
* failed the kmalloc call, then it can pass in doms_new == NULL,
* and partition_sched_domains() will fallback to the single partition
- * 'fallback_doms'.
+ * 'fallback_doms', it also forces the domains to be rebuilt.
*
* Call with hotplug lock held
*/
@@ -7648,12 +7636,8 @@ void partition_sched_domains(int ndoms_n
/* always unregister in case we don't destroy any domains */
unregister_sched_domain_sysctl();

- if (doms_new == NULL) {
- ndoms_new = 1;
- doms_new = &fallback_doms;
- cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
- dattr_new = NULL;
- }
+ if (doms_new == NULL)
+ ndoms_new = 0;

/* Destroy deleted domains */
for (i = 0; i < ndoms_cur; i++) {
@@ -7668,6 +7652,14 @@ match1:
;
}

+ if (doms_new == NULL) {
+ ndoms_cur = 0;
+ ndoms_new = 1;
+ doms_new = &fallback_doms;
+ cpus_andnot(doms_new[0], cpu_online_map, cpu_isolated_map);
+ dattr_new = NULL;
+ }
+
/* Build new domains */
for (i = 0; i < ndoms_new; i++) {
for (j = 0; j < ndoms_cur; j++) {
@@ -7698,17 +7690,15 @@ match2:
#if defined(CONFIG_SCHED_MC) || defined(CONFIG_SCHED_SMT)
int arch_reinit_sched_domains(void)
{
- int err;
-
get_online_cpus();
- mutex_lock(&sched_domains_mutex);
- detach_destroy_domains(&cpu_online_map);
- free_sched_domains();
- err = arch_init_sched_domains(&cpu_online_map);
- mutex_unlock(&sched_domains_mutex);
+#if !defined(CONFIG_CPUSETS)
+ partition_sched_domains(0, NULL, NULL);
+#else
+ rebuild_sched_domains();
+#endif
put_online_cpus();

- return err;
+ return 0;
}

static ssize_t sched_power_savings_store(const char *buf, size_t count, int smt)
@@ -7774,60 +7764,28 @@ int sched_create_sysfs_power_savings_ent
}
#endif /* CONFIG_SCHED_MC || CONFIG_SCHED_SMT */

+#if !defined(CONFIG_CPUSETS)
/*
- * Force a reinitialization of the sched domains hierarchy. The domains
- * and groups cannot be updated in place without racing with the balancing
- * code, so we temporarily attach all running cpus to the NULL domain
- * which will prevent rebalancing while the sched domains are recalculated.
+ * Add online and remove offline CPUs from the scheduler domains.
+ * When CPUSETS are enabled they take over this function.
*/
static int update_sched_domains(struct notifier_block *nfb,
unsigned long action, void *hcpu)
{
- int cpu = (int)(long)hcpu;
-
switch (action) {
- case CPU_DOWN_PREPARE:
- case CPU_DOWN_PREPARE_FROZEN:
- disable_runtime(cpu_rq(cpu));
- /* fall-through */
- case CPU_UP_PREPARE:
- case CPU_UP_PREPARE_FROZEN:
- detach_destroy_domains(&cpu_online_map);
- free_sched_domains();
- return NOTIFY_OK;
-
-
- case CPU_DOWN_FAILED:
- case CPU_DOWN_FAILED_FROZEN:
case CPU_ONLINE:
case CPU_ONLINE_FROZEN:
- enable_runtime(cpu_rq(cpu));
- /* fall-through */
- case CPU_UP_CANCELED:
- case CPU_UP_CANCELED_FROZEN:
case CPU_DEAD:
case CPU_DEAD_FROZEN:
- /*
- * Fall through and re-initialise the domains.
- */
- break;
+ partition_sched_domains(0, NULL, NULL);
+ return NOTIFY_OK;
+
default:
return NOTIFY_DONE;
}
-
-#ifndef CONFIG_CPUSETS
- /*
- * Create default domain partitioning if cpusets are disabled.
- * Otherwise we let cpusets rebuild the domains based on the
- * current setup.
- */
-
- /* The hotplug lock is already held by cpu_up/cpu_down */
- arch_init_sched_domains(&cpu_online_map);
+}
#endif

- return NOTIFY_OK;
-}

void __init sched_init_smp(void)
{
@@ -7846,8 +7804,12 @@ void __init sched_init_smp(void)
cpu_set(smp_processor_id(), non_isolated_cpus);
mutex_unlock(&sched_domains_mutex);
put_online_cpus();
+
+#if !defined(CONFIG_CPUSETS)
/* XXX: Theoretical race here - CPU may be hotplugged now */
hotcpu_notifier(update_sched_domains, 0);
+#endif
+
init_hrtick();

/* Move init over to a non-isolated CPU */
Index: tip/kernel/sched_fair.c
===================================================================
--- tip.orig/kernel/sched_fair.c
+++ tip/kernel/sched_fair.c
@@ -1004,6 +1004,8 @@ static void yield_task_fair(struct rq *r
* not idle and an idle cpu is available. The span of cpus to
* search starts with cpus closest then further out as needed,
* so we always favor a closer, idle cpu.
+ * Domains may include CPUs that are not usable for migration,
+ * hence we need to mask them out (cpu_active_map)
*
* Returns the CPU we should wake onto.
*/
Index: tip/kernel/sched_rt.c
===================================================================
--- tip.orig/kernel/sched_rt.c
+++ tip/kernel/sched_rt.c
@@ -915,6 +915,11 @@ static int find_lowest_rq(struct task_st
int this_cpu = smp_processor_id();
int cpu = task_cpu(task);

+ /*
+ * Only consider CPUs that are usable for migration.
+ */
+ cpus_and(*lowest_mask, *lowest_mask, cpu_active_map);
+
if (task->rt.nr_cpus_allowed == 1)
return -1; /* No other targets possible */

2008-07-15 09:12:36

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Ingo Molnar wrote:
> * Max Krasnyansky <[email protected]> wrote:
>
>> Ingo Molnar wrote:
>>> * Linus Torvalds <[email protected]> wrote:
>>>
>>>> On Mon, 14 Jul 2008, Max Krasnyansky wrote:
>>>>> Did you guys an updated patch ? Dmitry pointed out several things that Linus
>>>>> missed in his original version. I guess I can go through the thread and
>>>>> reconstruct that but if you have a patch I can try let me know.
>>>> I didn't update it, and right now I'm just merging too much (and
>>>> discussing the merges) to have time.
>>>>
>>>> The patch really needs to have some scheduler person look at the use
>>>> fo cpu_active_map - I was kind of hoping that Ingo would.
>>> yeah - it's very high on our TODO list :-) Peter, Dmitry and me are
>>> looking into it.
>>>
>>> I didnt touch most of -tip in the past few days to get a rock solid QA
>>> track record for all items we have.
>> I just sent you guys a patch. Please take a look. I've probably missed
>> something but it should close (I think). Also we'd probably at least
>> want the bits that streamline the domain reinitialization because it
>> helps with cpusets (ie uses same exact path for all cases).
>
> thanks Max. Since upstream already has Dmitry's it conflicted with your
> patch - i fixed the interactions up, see it below. (completely untested)
Hmm, I did it on top of 2.6.26 final which has Dmitry's patch. It must have
been something in the -tip.
Anyway, I'll apply it here and retest.

> It's not ready for inclusion yet though - these new
> #ifdefs are quite ugly:
>
> +#if !defined(CONFIG_CPUSETS)
> + partition_sched_domains(0, NULL, NULL);
> +#else
> + rebuild_sched_domains();
> +#endif
>
> we should just have a single method for refreshing sched domains
> hierarchy, and in the !CONFIG_CPUSETS case that should simply fall back
> to partition_sched_domains().
>
> We can do that by making rebuild_sched_domains() the primary method that
> is called - and in the !CPUSETS case it's an inline that calls
> partition_sched_domains().

Actually that's exactly what I have in the cpuset part of the patch. But as
I mentioned there is some circular locking issues with rebuild_sched_domains().
(cgroup_lock and get_online_cpus()) and I wanted to fix that fix. Now that I
think about it's kind of unrelated. So in other words I totally agree. I'll
go ahead fix it and resend.

btw While we're at it. Does arch_init_sched_domains() and arch_reinit_sched_domains()
still make sense. I'm talking about naming here. I suppose arch_ part means that it
can be replaced by the arch code. But it's not the case with those two guys.
What do you think ?

> also, small nits:
>
> use #ifdef CONFIG_CPUSETS instead of "#if !defined(CONFIG_CPUSETS)".
Will do.

> and while at it:
>
> +#if !defined(CONFIG_CPUSETS)
> /* XXX: Theoretical race here - CPU may be hotplugged now */
> hotcpu_notifier(update_sched_domains, 0);
> +#endif
>
> that race should be closed now, hm?
Yeah, I stared at that comment for a second and decided to leave it.
I guess it's still there. I mean in theory cpu can still be hotplugged just
before sched or cpuset register their notifier callbacks.

Max

2008-07-16 06:36:22

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

There are now three patches that are outstanding (pending reviews, etc) on this:

1. cpu hotplug, sched: Introduce cpu_active_map and redo sched\
domain managment (take 2)

2. cpu hotplug: Make cpu_active_map synchronization dependency clear

3. cpuset: Make rebuild_sched_domains() usable from any context (take 2)

Please review/ack/nak/etc.

----
Linus ACKed #1 and Ingo seemed to be more or less ok with it. We need ACKs
from Dmitry, Peter and Greg on the general approach and due to some
non-trivial merges with their stuff. RT balancer cpupri thingy and and RT
bandwidth handling.

#2 Should be fairly straightforward.

#3 Need ACKs from Paul J. and Paul M. overall they seemed ok with the first
version, and new version implements the suggested workqueue based rebuild.


Thanx
Max

2008-07-16 07:10:15

by Peter Zijlstra

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

On Tue, 2008-07-15 at 23:35 -0700, Max Krasnyansky wrote:
> There are now three patches that are outstanding (pending reviews, etc) on this:
>
> 1. cpu hotplug, sched: Introduce cpu_active_map and redo sched\
> domain managment (take 2)
>
> 2. cpu hotplug: Make cpu_active_map synchronization dependency clear
>
> 3. cpuset: Make rebuild_sched_domains() usable from any context (take 2)
>
> Please review/ack/nak/etc.
>
> ----
> Linus ACKed #1 and Ingo seemed to be more or less ok with it. We need ACKs
> from Dmitry, Peter and Greg on the general approach and due to some
> non-trivial merges with their stuff. RT balancer cpupri thingy and and RT
> bandwidth handling.

ACK

> #2 Should be fairly straightforward.

Has a misformed comment, ACK when that's fixed.

> #3 Need ACKs from Paul J. and Paul M. overall they seemed ok with the first
> version, and new version implements the suggested workqueue based rebuild.

Has more misformed comments. Please use

/*
* this is a multiline
* comment
*/

instead of

/* this is a multiline
* comment */


2008-07-16 17:01:31

by Max Krasnyansky

[permalink] [raw]
Subject: Re: current linux-2.6.git: cpusets completely broken

Peter Zijlstra wrote:
> On Tue, 2008-07-15 at 23:35 -0700, Max Krasnyansky wrote:
>> There are now three patches that are outstanding (pending reviews, etc) on this:
>>
>> 1. cpu hotplug, sched: Introduce cpu_active_map and redo sched\
>> domain managment (take 2)
>>
>> 2. cpu hotplug: Make cpu_active_map synchronization dependency clear
>>
>> 3. cpuset: Make rebuild_sched_domains() usable from any context (take 2)
>>
>> Please review/ack/nak/etc.
>>
>> ----
>> Linus ACKed #1 and Ingo seemed to be more or less ok with it. We need ACKs
>> from Dmitry, Peter and Greg on the general approach and due to some
>> non-trivial merges with their stuff. RT balancer cpupri thingy and and RT
>> bandwidth handling.
>
> ACK
>
>> #2 Should be fairly straightforward.
>
> Has a misformed comment, ACK when that's fixed.
>
>> #3 Need ACKs from Paul J. and Paul M. overall they seemed ok with the first
>> version, and new version implements the suggested workqueue based rebuild.
>
> Has more misformed comments. Please use
>
> /*
> * this is a multiline
> * comment
> */
>
> instead of
>
> /* this is a multiline
> * comment */

Got it. Will fix and resend.

Thanx for the ACKs and review.

Max