LinuxLists.cc - [PATCH] cpusets+hotplug+preepmt broken

[permalink] [raw]

Subject: Re: [Lse-tech] [PATCH] cpusets+hotplug+preepmt broken

Dinakar wrote:
> It looks like hotplug operations need to take cpuset_sem as well.

Could you explain why this is -- what is the deadlock? I don't doubt
you found a deadlock, but since I don't understand it yet, it is
difficult for me to judge whether this is the best fix.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-11 19:52:15

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

On Thu, May 12, 2005 at 12:46:54AM +0530, Dinakar Guniguntala wrote:
>
> cpusets+hotplug+preempt hangs up the machine, if both
> cpuset and hotplug operations are going on simultaneously
>
> I also get this oops first
>
> scheduling while atomic: cpuoff.sh/0x00000001/25331
> [<c040195d>] schedule+0xa4d/0xa60
> [<c0401b25>] wait_for_completion+0xc5/0xf0
> [<c011a200>] default_wake_function+0x0/0x20
> [<c0400cd3>] __down+0x83/0x110
> [<c011a200>] default_wake_function+0x0/0x20
> [<c0400eab>] __down_failed+0x7/0xc
> [<c0141fe7>] .text.lock.cpuset+0x105/0x15e
> [<c011b860>] move_task_off_dead_cpu+0x130/0x1f0
> [<c011ba7c>] migrate_live_tasks+0x8c/0x90
> [<c011be25>] migration_call+0x75/0x2c0
> [<c01423f2>] __stop_machine_run+0x92/0xb0
> [<c012e0dd>] notifier_call_chain+0x2d/0x50
> [<c013a6fb>] cpu_down+0x16b/0x2a0
> [<c027ddfb>] store_online+0x5b/0x80
> [<c027ae15>] sysdev_store+0x35/0x40
> [<c019f43e>] flush_write_buffer+0x3e/0x50
> [<c019f4a8>] sysfs_write_file+0x58/0x80
> [<c0163d26>] vfs_write+0xc6/0x180
> [<c0163eb1>] sys_write+0x51/0x80
> [<c01034e5>] syscall_call+0x7/0xb

This trace is what should be fixed -- we're trying to schedule while
the machine is "stopped" (all cpus except for one spinning with
interrupts off). I'm not too familiar with the cpusets code but I
would like to stay away from nesting these semaphores if at all
possible.

Will you share your testcase please?

Nathan

2005-05-11 19:55:34

[permalink] [raw]

Subject: Re: [Lse-tech] [PATCH] cpusets+hotplug+preepmt broken

pj wrote:
> Could you explain why this is -- what is the deadlock?

On further reading, I see it. You're right.

Deep in the bowels of the hotplug code, when removing a dead cpu, while
holding the runqueue lock (task_rq_lock), the hotplug code might need to
walk up the cpuset hierarchy, to find the nearest enclosing cpuset that
still has online cpus, as part of figuring where to run a task that is
being kicked off the dead cpu. The runqueue lock is atomic, but getting
the cpuset_sem (needed to walk up the cpuset hierarchy) can sleep. So
you need to get the cpuset_sem before calling task_rq_lock, so as to
avoid the "scheduling while atomic" oops that you reported. Therefore
the hotplug code, and anyone else calling cpuset_cpus_allowed(), which
means sched_setaffinity(), needs to first grab cpuset_sem, so that they
can grab any atomic locks needed, after getting cpuset_sem, not before.

Acked-by: Paul Jackson <[email protected]>

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-11 20:27:00

[permalink] [raw]

Subject: Re: [Lse-tech] [PATCH] cpusets+hotplug+preepmt broken

On Wed, May 11, 2005 at 12:55:08PM -0700, Paul Jackson wrote:
> pj wrote:
> > Could you explain why this is -- what is the deadlock?
>
> On further reading, I see it. You're right.
>
> Deep in the bowels of the hotplug code, when removing a dead cpu, while
> holding the runqueue lock (task_rq_lock), the hotplug code might need to
> walk up the cpuset hierarchy, to find the nearest enclosing cpuset that
> still has online cpus, as part of figuring where to run a task that is
> being kicked off the dead cpu. The runqueue lock is atomic, but getting
> the cpuset_sem (needed to walk up the cpuset hierarchy) can sleep. So
> you need to get the cpuset_sem before calling task_rq_lock, so as to
> avoid the "scheduling while atomic" oops that you reported. Therefore
> the hotplug code, and anyone else calling cpuset_cpus_allowed(), which
> means sched_setaffinity(), needs to first grab cpuset_sem, so that they
> can grab any atomic locks needed, after getting cpuset_sem, not before.

Won't holding cpuset_sem while calling cpuset_cpus_allowed cause a
deadlock?

Nathan

2005-05-11 20:46:54

[permalink] [raw]

Subject: Re: [Lse-tech] [PATCH] cpusets+hotplug+preepmt broken

Nathan asked:
> Won't holding cpuset_sem while calling cpuset_cpus_allowed cause a
> deadlock?

It definitely would, which is why Dinakar's patch both added a call to
get cpuset_sem earlier in the hot unplug code, as well as removed the
call to get cpuset_sem that was inside the cpuset_cpus_allowed code (as
well as modified the other caller of cpuset_cpus_allowed, the
sched_setaffinity code, to explicitly get cpuset_sem before calling
cpuset_cpus_allowed, since that call no longer acquired cpuset_sem on
its own.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-11 20:46:44

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Nathan wrote:
> I'm not too familiar with the cpusets code but I
> would like to stay away from nesting these semaphores if at all
> possible.

I share you preference for not nesting these semaphores.

The other choice I am aware of would be for the hotplug code to be less
cpuset-friendly. In the move_task_off_dead_cpu() code, at the point it
says "No more Mr. Nice Guy", instead of looking for the nearest
enclosing cpuset that has something online, which is what the
cpuset_cpus_allowed() does, instead we could just take any damn cpu that
was online.

Something along the lines of the following fix:

--- pj/kernel.old/sched.c 2005-05-11 13:00:17.000000000 -0700
+++ pj/kernel.new/sched.c 2005-05-11 13:02:24.000000000 -0700
@@ -4229,7 +4229,7 @@ static void move_task_off_dead_cpu(int d

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ tsk->cpus_allowed = cpu_online_map;
dest_cpu = any_online_cpu(tsk->cpus_allowed);

/*

We've already decided here that we had to violate the cpuset container -
as apparently someone hot unplugged every cpu in the current tasks
cpuset.

Hmmm ... that's not quite right ... we've decided that we had to violate
the current tasks cpus_allowed, as apparently someone hot unplugged
every cpu allowed there. This might be a proper subset of what cpus the
tasks cpuset allows, perhaps due to a sched_setaffinity() call to
restrict a task to just one or a few cpus that are allowed in its
cpuset.

So what we'd really like to do would be to first fallback to all the
cpus allowed in the specified tasks cpuset (no walking the cpuset
hierarchy), and see if any of those cpus are still online to receive
this orphan task. Unless someone has botched the system configuration,
and taken offline all the cpus in a cpuset, this should yield up a cpu
that is still both allowed and online. If that fails, then to heck with
honoring cpuset placement - just take the first online cpu we can find.

This is doable without holding cpuset_sem. We can look at a current
tasks cpuset without cpuset_sem, just with the task lock. And this
should almost always work (yield up an online cpu that the cpuset
allows). And when it doesn't work, we can reasonably blame the system
administrator for forcing us to blow out the cpuset confinement.

The following untested, uncompiled patch claims to do this:

--- 2.6.12-rc1-mm4/include/linux/cpuset.h 2005-04-02 15:43:28.000000000 -0800
+++ 2.6.12-rc1-mm4.new/include/linux/cpuset.h 2005-05-11 13:26:10.000000000 -0700
@@ -19,6 +19,7 @@ extern void cpuset_init_smp(void);
extern void cpuset_fork(struct task_struct *p);
extern void cpuset_exit(struct task_struct *p);
extern const cpumask_t cpuset_cpus_allowed(const struct task_struct *p);
+extern const cpumask_t cpuset_task_cpus_allowed(const struct task_struct *p);
void cpuset_init_current_mems_allowed(void);
void cpuset_update_current_mems_allowed(void);
void cpuset_restrict_to_mems_allowed(unsigned long *nodes);
@@ -38,6 +39,10 @@ static inline cpumask_t cpuset_cpus_allo
{
return cpu_possible_map;
}
+static inline cpumask_t cpuset_task_cpus_allowed(struct task_struct *p)
+{
+ return cpu_possible_map;
+}

static inline void cpuset_init_current_mems_allowed(void) {}
static inline void cpuset_update_current_mems_allowed(void) {}
--- 2.6.12-rc1-mm4/kernel/cpuset.c 2005-04-22 19:35:34.000000000 -0700
+++ 2.6.12-rc1-mm4.new/kernel/cpuset.c 2005-05-11 13:40:05.000000000 -0700
@@ -1570,6 +1570,27 @@ const cpumask_t cpuset_cpus_allowed(cons
return mask;
}

+/**
+ * cpuset_task_cpus_allowed - return cpus_allowed mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
+ *
+ * Description: Returns the cpumask_t cpus_allowed of the cpuset
+ * attached to the specified @tsk. Unlike cpuset_cpus_allowed(),
+ * is not guaranteed to return a non-empty subset of cpu_online_map.
+ * Does not walk up the cpuset hierarchy, and does not attempt to
+ * acquire the cpuset_sem. If called on a task about to exit,
+ * where tsk->cpuset is already NULL, return cpu_online_map.
+ *
+ * Call with task locked.
+ **/
+
+const cpumask_t cpuset_task_cpus_allowed(const struct task_struct *tsk)
+{
+ if (!tsk->cpuset)
+ return cpu_online_map;
+ return tsk->cpuset->cpus_allowed;
+}
+
void cpuset_init_current_mems_allowed(void)
{
current->mems_allowed = NODE_MASK_ALL;
--- 2.6.12-rc1-mm4/kernel/sched.c 2005-04-22 19:51:44.000000000 -0700
+++ 2.6.12-rc1-mm4.new/kernel/sched.c 2005-05-11 13:33:20.000000000 -0700
@@ -4303,7 +4303,7 @@ static void move_task_off_dead_cpu(int d

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ tsk->cpus_allowed = cpuset_task_cpus_allowed(tsk);
dest_cpu = any_online_cpu(tsk->cpus_allowed);

/*

I retract my Ack of Dinakar's patch, in favor of further consideration
of this last patch, above.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-11 20:59:04

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Well, that last patch obviously doesn't do all that I said
it did - it doesn't do the final fallback to cpu_online_map,
if no cpu in tsk->cpuset->cpus_allowed is online.

Another variation will be forthcoming soon.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-12 14:59:53

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

On Wed, May 11, 2005 at 01:42:35PM -0700, Paul Jackson wrote:
> So what we'd really like to do would be to first fallback to all the
> cpus allowed in the specified tasks cpuset (no walking the cpuset
> hierarchy), and see if any of those cpus are still online to receive
> this orphan task. Unless someone has botched the system configuration,
> and taken offline all the cpus in a cpuset, this should yield up a cpu
> that is still both allowed and online. If that fails, then to heck with
> honoring cpuset placement - just take the first online cpu we can find.
>

I tried your suggestion, but hit another oops. This has more to do with
hotplug+preempt looks like

Code: fc 89 ec 5d e9 8b 0e 00 00 c7 04 24 5c 49 41 c0 e8 7f 39 00 00 e8 da 87 fe ff eb c1 0f 0b 36 11 32 96 41 c0 eb 92 83 f8
<6>note: cpuoff.sh[25439] exited with preempt_count 1
scheduling while atomic: cpuoff.sh/0x10000001/25439
[<c04010d2>] schedule+0x4b2/0x4c0
[<c0152da0>] unmap_page_range+0xc0/0xe0
[<c0401988>] cond_resched+0x28/0x40
[<c0152f56>] unmap_vmas+0x196/0x290
[<c0157c93>] exit_mmap+0xa3/0x1a0
[<c011ce53>] mmput+0x43/0x100
[<c01222f0>] do_exit+0xf0/0x3b0
[<c01047da>] die+0x18a/0x190
[<c0104bb0>] do_invalid_op+0x0/0xd0
[<c0104c5e>] do_invalid_op+0xae/0xd0
[<c011bbe7>] migrate_dead+0xa7/0xc0
[<c0400f14>] schedule+0x2f4/0x4c0
[<c040112b>] preempt_schedule+0x4b/0x70
[<c040112b>] preempt_schedule+0x4b/0x70
[<c0103fd7>] error_code+0x4f/0x54
[<c040007b>] svc_proc_unregister+0x1b/0x20
[<c011007b>] __acpi_map_table+0x7b/0xc0
[<c011bbe7>] migrate_dead+0xa7/0xc0
[<c011fef9>] profile_handoff_task+0x39/0x50
[<c011bc62>] migrate_dead_tasks+0x62/0x90
[<c011bda6>] migration_call+0x116/0x2c0
[<c0142102>] __stop_machine_run+0x92/0xb0
[<c012ddad>] notifier_call_chain+0x2d/0x50
[<c013a3cb>] cpu_down+0x16b/0x2a0
[<c027db0b>] store_online+0x5b/0x80
[<c027ab25>] sysdev_store+0x35/0x40
[<c019f14e>] flush_write_buffer+0x3e/0x50
[<c019f1b8>] sysfs_write_file+0x58/0x80
[<c0163a36>] vfs_write+0xc6/0x180
[<c0163bc1>] sys_write+0x51/0x80
[<c01034e5>] syscall_call+0x7/0xb

On 2.6.12-rc4-mm1 it is even worse, it panics

Booting processor 2/1 eip 3000
Calibrating delay using timer specific routine.. 1800.08 BogoMIPS (lpj=900043)
CPU2: Intel Pentium III (Cascades) stepping 04
Unable to handle kernel NULL pointer dereference<7>CPU0 attaching sched-domain:
at virtual address 00000001
printing eip:
00000001
*pde = 000001e3
*pte = 00000001
Oops: 0000 [#1]
PREEMPT SMP
Modules linked in:
CPU: 2
EIP: 0060:[<00000001>] Not tainted VLI
EFLAGS: 00010006 (2.6.12-rc4-mm1)
EIP is at 0x1
eax: c18c0000 ebx: c04758a0 ecx: 00000001 edx: e8c13eb4
esi: ffffffff edi: c185c030 ebp: c18c1f80 esp: c18c1f2c
ds: 007b es: 007b ss: 0068
Process swapper (pid: 0, threadinfo=c18c0000 task=c185c030)
Stack: c0114e0d e8c13eb4 c18c0000 c0103ecc c18c0000 00000008 00000002 ffffffff
c185c030 c18c1f80 00000001 0000007b 0000007b fffffffb c04048a0 00000060
00000206 c040511c 00000003 00000000 00000000 c18c0000 c01033ee 00000003
Call Trace:
[<c0114e0d>] smp_call_function_interrupt+0x3d/0x60
[<c0103ecc>] call_function_interrupt+0x1c/0x24
[<c04048a0>] schedule+0x0/0x7c0
[<c040511c>] preempt_schedule_irq+0x4c/0x80
[<c01033ee>] need_resched+0x1f/0x21
[<c011516f>] start_secondary+0x10f/0x1a0
Code: Bad EIP value.
<0>Kernel panic - not syncing: Fatal exception in interrupt
Booting processor 3/2 eip 3000
Initializing CPU#3

I really havent had a chance to investigate whats going on, should be able to
do that tomorrow. Here is the patch I tried, my .config and scripts

-Dinakar

Attachments:

(No filename) (3.54 kB)
dyn-sd-v0.6-0.patch (2.82 kB)
test-scripts.tar.gz (8.93 kB)
Download all attachments

2005-05-13 00:34:37

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Paul Jackson wrote:
>
> I share you preference for not nesting these semaphores.
>
> The other choice I am aware of would be for the hotplug code to be less
> cpuset-friendly. In the move_task_off_dead_cpu() code, at the point it
> says "No more Mr. Nice Guy", instead of looking for the nearest
> enclosing cpuset that has something online, which is what the
> cpuset_cpus_allowed() does, instead we could just take any damn cpu that
> was online.
>
> Something along the lines of the following fix:
>
> --- pj/kernel.old/sched.c 2005-05-11 13:00:17.000000000 -0700
> +++ pj/kernel.new/sched.c 2005-05-11 13:02:24.000000000 -0700
> @@ -4229,7 +4229,7 @@ static void move_task_off_dead_cpu(int d
>
> /* No more Mr. Nice Guy. */
> if (dest_cpu == NR_CPUS) {
> - tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
> + tsk->cpus_allowed = cpu_online_map;
> dest_cpu = any_online_cpu(tsk->cpus_allowed);

Well, CPU_MASK_ALL rather than cpu_online_map, I would think. That is
what the behavior was before the cpuset merge, anyway. It might be
the best short term solution, more below...

> So what we'd really like to do would be to first fallback to all the
> cpus allowed in the specified tasks cpuset (no walking the cpuset
> hierarchy), and see if any of those cpus are still online to receive
> this orphan task. Unless someone has botched the system configuration,
> and taken offline all the cpus in a cpuset, this should yield up a cpu
> that is still both allowed and online. If that fails, then to heck with
> honoring cpuset placement - just take the first online cpu we can find.
>
> This is doable without holding cpuset_sem. We can look at a current
> tasks cpuset without cpuset_sem, just with the task lock.

Yes, but your patch doesn't lock the task itself (unless I'm
misreading patches again). However, have a look at the comments above
task_lock in sched.h:

/*
* Protects ->fs, ->files, ->mm, ->ptrace, ->group_info, ->comm, keyring
* subscriptions and synchronises with wait4(). Also used in procfs.
*
* Nests both inside and outside of read_lock(&tasklist_lock).
* It must not be nested with write_lock_irq(&tasklist_lock),
* neither inside nor outside.
*/
static inline void task_lock(struct task_struct *p)
{
spin_lock(&p->alloc_lock);
}

Unfortunately, move_task_off_dead_cpu is called from
migrate_live_tasks while the latter has a write_lock_irq on
tasklist_lock. So we can't use task_lock in this context, assuming
the comments are valid. Right?

Nathan

2005-05-13 12:04:35

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Thu, May 12, 2005 at 08:40:48PM +0530, Dinakar Guniguntala wrote:
>
> I really havent had a chance to investigate whats going on, should be
> able to > do that tomorrow. Here is the patch I tried, my .config and
> scripts

Ok I confirmed that this is a hotplug bug. This problem is hit by turning
off cpusets with only hotplug and preempt on. It happens in the latest mm
as well (2.6.12-rc4-mm1)

I will not be looking into this right now, I can provide details to
anyone interested

-Dinakar

2005-05-13 12:21:16

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Wed, May 11, 2005 at 02:51:56PM -0500, Nathan Lynch wrote:

> This trace is what should be fixed -- we're trying to schedule while
> the machine is "stopped" (all cpus except for one spinning with
> interrupts off). I'm not too familiar with the cpusets code but I
> would like to stay away from nesting these semaphores if at all
> possible.

Vatsa pointed out another scenario where cpusets+hotplug is currently
broken. attach_task in cpuset.c is called without holding the hotplug
lock and it is possible to call set_cpus_allowed for a task with no
online cpus.
Given this I think the patch I sent first is the most appropriate
patch. In addition we also need to take hotplug lock in the cpusets
code whenever we are modifying cpus_allowed of a task. IOW make cpusets
and hotplug operations completly exclusive to each other. The same
applies to memory hotplug code once it gets in.

However on the downside this would mean
1. A lot of nested locks (mostly in cpuset_common_file_write)
2. Taking of hotplug (cpu now and later memory) locks for operations
that may just be updating a flag

-Dinakar

2005-05-13 17:28:54

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 06:02:17PM +0530, Dinakar Guniguntala wrote:
> attach_task in cpuset.c is called without holding the hotplug
> lock and it is possible to call set_cpus_allowed for a task with no
> online cpus.

This in fact was the reason that we added lock_cpu_hotplug in sched_setaffinity.

Also guarantee_online_cpus seems to be accessing cpu_online_map with preemption
enabled (& no hotplug lock taken). This is highly not recommended.

> Given this I think the patch I sent first is the most appropriate
> patch.

I agree that taking the hotplug lock seems reasonable here.

> In addition we also need to take hotplug lock in the cpusets
> code whenever we are modifying cpus_allowed of a task. IOW make cpusets
> and hotplug operations completly exclusive to each other. The same
> applies to memory hotplug code once it gets in.
>
> However on the downside this would mean
> 1. A lot of nested locks (mostly in cpuset_common_file_write)
> 2. Taking of hotplug (cpu now and later memory) locks for operations
> that may just be updating a flag

Given the fact that CPU/Memory hotplug and cpuset operation may
be infrequent events, this will probably be not a concern.

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-13 20:26:37

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Srivatsa, replying to Dinakar:
> This in fact was the reason that we added lock_cpu_hotplug
> in sched_setaffinity.

Why just in sched_setaffinity()? What about the other 60+ calls to
set_cpus_allowed(). Shouldn't most of those calls be checking that the
passed in cpus are online (holding lock_cpu_hotplug while doing all
this)? Either that, or at least handling the error from
set_cpus_allowed() if the requested cpus end up not being online? I see
only 2 set_cpus_allowed() calls that make any pretense of examining the
return value.

> I agree that taking the hotplug lock seems reasonable here.

I take it that you are recommending changing cpuset_cpus_allowed() in
the way sched_setaffinity() was changed, to grab lock_cpu_hotplug around
the small piece of code that examines cpu_online_map and calls
set_cpus_allowed().

This sounds good to me. I wonder why it doesn't need to be considered
in the other 60+ calls to set_cpus_allowed.

> Given the fact that CPU/Memory hotplug and cpuset operation may
> be infrequent events, this will probably be not a concern.

I agree that performance is not the key issue here. Little if any
of this is on any hot code path.

We do need to be clear about how these locks work, their semantics, what
they require and what they insure, and their various interactions.

This is not easy stuff to get right.

I notice that the documentation for lock_cpu_hotplug() is a tad on
the skimpy side:

/* Stop CPUs going up and down. */

That's it, so far as I can see. Interaction of hotplug locking with
the rest of the kernel has been, is now, and will continue to be, a
difficult area. More care than this needs to be put into explaining
the use, semantics and interactions of any locking involved.

In particular, in my view, locks should guard data. What data does
lock_cpu_hotplug() guard? I propose that it guards cpu_online_map.

I recommend considering a different name for this lock. Perhaps
'cpu_online_sem', instead of 'cpucontrol'? And perhaps the
lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?

Then a survey of all uses of cpu_online_map, and by extension, of all
calls to set_cpus_allowed(), should be made, to see in which cases this
semaphore should be held. For extra credit, make sure that every caller
to set_cpus_allowed() makes an effort to only pass in cpus that are
currently online. The caller should do this, since only the caller can
know what alternative cpus to use, if their first choice is offline.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-13 20:34:41

by Dipankar Sarma

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 12:59:53PM -0700, Paul Jackson wrote:
> Srivatsa, replying to Dinakar:
> > This in fact was the reason that we added lock_cpu_hotplug
> > in sched_setaffinity.
>
> We do need to be clear about how these locks work, their semantics, what
> they require and what they insure, and their various interactions.
>
> This is not easy stuff to get right.
>
> I notice that the documentation for lock_cpu_hotplug() is a tad on
> the skimpy side:
>
> /* Stop CPUs going up and down. */
>
> That's it, so far as I can see. Interaction of hotplug locking with
> the rest of the kernel has been, is now, and will continue to be, a
> difficult area. More care than this needs to be put into explaining
> the use, semantics and interactions of any locking involved.
>
> In particular, in my view, locks should guard data. What data does
> lock_cpu_hotplug() guard? I propose that it guards cpu_online_map.
>
> I recommend considering a different name for this lock. Perhaps
> 'cpu_online_sem', instead of 'cpucontrol'? And perhaps the
> lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?

No. CPU hotplug uses two different locking - see both lock_cpu_hotplug()
and __stop_machine_run(). Anyone reading cpu_online_map with
preemption disabled is safe from cpu hotplug even without taking
any lock.

Thanks
Dipankar

2005-05-13 20:47:43

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Note:

I believe that the following, starting with Dinakar's post to which I am
responding, is a _separate_ issue from the one that Dinakar proposed to
address with his patch of a couple days ago.

The problem two days ago involves trying to take cpuset_sem in the
cpu_down() code path, while holding a spinlock. My (still not yet
reposted) alternative patch will seek to avoid needing cpuset_sem on
this code path entirely.

The following problem involves a race condition between code that would
examine the cpu_online_map prior to calling set_cpus_allowed(), and the
hotplog code possibly changing that cpu_online_map.

Details ...

Dinakar wrote:
> Vatsa pointed out another scenario where cpusets+hotplug is currently
> broken. attach_task in cpuset.c is called without holding the hotplug
> lock and it is possible to call set_cpus_allowed for a task with no
> online cpus.

First - my apologies for not replying on this thread yesterday
with an updated patch proposal. I spent yesterday rebuilding my
email server - a bad disk and a bad fan.

If kernel/cpuset.c attach_task() can call set_cpus_allowed() with a
a mask that has no online cpus, it is certainly not for lack of trying.

The following two lines of code are adjacent in attach_task():

guarantee_online_cpus(cs, &cpus);
set_cpus_allowed(tsk, cpus);

The last line of guarantee_online_cpus() calls BUG_ON if the cpus
mask has no online cpus:

BUG_ON(!cpus_intersects(*pmask, cpu_online_map));

So this code tries really hard to see that online cpus are passed to
set_cpus_allowed(). Before the hotplug code came along, this was
probably sufficient.

But, yes indeed, there is now a race here, as there is no lock on
cpu_online_map between this BUG_ON() check, and the call to
set_cpus_allowed(), a few instructions later. The hotplug code could
change what is online, in the meanwhile.

The code for set_cpus_allowed() anticipates this - after getting the
relevant task_rq_lock, which (I presume) holds off hotplug, it verifies
that the target mask has some online cpus:

=========================================================
int set_cpus_allowed(task_t *p, cpumask_t new_mask)
{
...
rq = task_rq_lock(p, &flags);
if (!cpus_intersects(new_mask, cpu_online_map)) {
ret = -EINVAL;
goto out;
}

p->cpus_allowed = new_mask;
=========================================================

Seems to me that the problem here is not particularly to do with
cpusets, but with the call to set_cpus_allowed() and with hotplug
possibly changing the cpu_online_map at the same time.

I see over 60 calls to set_cpus_allowed() in the kernel, approximately 2
of which actually pay any attention to the return value.

Could not any of these calls fail, due to hotplug taking offline all the
cpus requested in the set_cpus_allowed() call before the task_rq_lock()
is obtained? It wouldn't surprise me if some of these calls don't even
try to see that the requested cpus are online, much less avoid racing
with hotplug or handle the -EINVAL error from set_cpus_allowed().

I think that hotplug needs to think carefully about how to handle
set_cpus_allowed() calls.

Perhaps this means a lock that needs to be held around the code that
composes the cpu mask to be set, verifies these cpus are online, and
calls set_cpus_allowed. This lock would need to be considered for
likely use in most of the 60 calls of set_cpus_allowed(). Having done
this, then the set_cpus_allowed() code probably shouldn't be returning
an error silently, but rather complaining in some sort of BUG() report
if passed cpus that were not online.

Hmmm ... looking ahead to Srivatsa's post, it looks like this might have
been done, in the form of lock_cpu_hotplug().

I will continue this line of thought as a reply to Srivatsa's post.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-13 20:56:21

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

> No.

What part of what I wrote are you saying "No" to?

And what does that have to do with the two different
locking mechanisms for hotplug?

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-13 20:49:09

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

> > In particular, in my view, locks should guard data. What data does
> > lock_cpu_hotplug() guard? I propose that it guards cpu_online_map.
> >
> > I recommend considering a different name for this lock. Perhaps
> > 'cpu_online_sem', instead of 'cpucontrol'? And perhaps the
> > lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?
>
> No. CPU hotplug uses two different locking - see both lock_cpu_hotplug()
> and __stop_machine_run(). Anyone reading cpu_online_map with
> preemption disabled is safe from cpu hotplug even without taking
> any lock.

More precisely (I think), reading cpu_online_map with preemption
disabled guarantees that none of the cpus in the map will go offline
-- it does not prevent an online operation in progress (but most code
only cares about the former case). Also note that __stop_machine_run
is used only to offline a cpu.

The cpucontrol semaphore does not only protect cpu_online_map and
cpu_present_map, but also serializes cpu hotplug operations, so that
only one may be in progress at a time.

I've been mulling over submitting a Documentation/cpuhotplug.txt,
sounds like there's sufficient demand...

Nathan

2005-05-13 21:10:16

by Dipankar Sarma

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 03:46:52PM -0500, Nathan Lynch wrote:
> > > In particular, in my view, locks should guard data. What data does
> > > lock_cpu_hotplug() guard? I propose that it guards cpu_online_map.
> > >
> > > I recommend considering a different name for this lock. Perhaps
> > > 'cpu_online_sem', instead of 'cpucontrol'? And perhaps the
> > > lock_cpu_hotplug() should be renamed, to say 'lock_cpu_online'?
> >
> > No. CPU hotplug uses two different locking - see both lock_cpu_hotplug()
> > and __stop_machine_run(). Anyone reading cpu_online_map with
> > preemption disabled is safe from cpu hotplug even without taking
> > any lock.
>
> More precisely (I think), reading cpu_online_map with preemption
> disabled guarantees that none of the cpus in the map will go offline
> -- it does not prevent an online operation in progress (but most code
> only cares about the former case). Also note that __stop_machine_run
> is used only to offline a cpu.
>
> The cpucontrol semaphore does not only protect cpu_online_map and
> cpu_present_map, but also serializes cpu hotplug operations, so that
> only one may be in progress at a time.

Exactly. So, naming it lock_cpu_online() would make no sense.

>
> I've been mulling over submitting a Documentation/cpuhotplug.txt,
> sounds like there's sufficient demand...

That would be very good considering the fact that CPU hotplug
is getting broken often. Another nice thing to see would be
i386 cpu hotplug support. That would allow lot more developers
to test their code with cpu hotplug. Not sure where that one is now.
Perhaps Zwane knows.

Thanks
Dipankar

2005-05-13 21:18:36

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Nathan wrote:
> sounds like there's sufficient demand...

That's a safe bet ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-13 21:17:59

by Dipankar Sarma

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 01:52:33PM -0700, Paul Jackson wrote:
> > No.
>
> What part of what I wrote are you saying "No" to?

The question right above "No" :)

>
> And what does that have to do with the two different
> locking mechanisms for hotplug?

Because lock_cpu_hotplug() isn't a lock just for cpu_online_map.

I think we need better in-tree documentation of cpu hotplug locking.
For now most of the descriptions are in conversations between
Rusty and Vatsa in linux hotplug cpu support mailing list
while they were designing it.

Thanks
Dipankar

2005-05-13 21:34:16

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 06:02:17PM +0530, Dinakar Guniguntala wrote:
> On Wed, May 11, 2005 at 02:51:56PM -0500, Nathan Lynch wrote:
>
> > This trace is what should be fixed -- we're trying to schedule while
> > the machine is "stopped" (all cpus except for one spinning with
> > interrupts off). I'm not too familiar with the cpusets code but I
> > would like to stay away from nesting these semaphores if at all
> > possible.
>
> Vatsa pointed out another scenario where cpusets+hotplug is currently
> broken. attach_task in cpuset.c is called without holding the hotplug
> lock and it is possible to call set_cpus_allowed for a task with no
> online cpus.
> Given this I think the patch I sent first is the most appropriate
> patch.

Your original patch is deadlocky since cpuset_cpus_allowed() does
task_lock() while write_lock_irq(&tasklist_lock) is already held by
migrate_live_tasks().

Nathan

2005-05-14 02:24:03

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Two days ago, I wrote:
> Another variation will be forthcoming soon.

Don't apply the following yet, Andrew. It is untested, and we've not
yet obtained agreement.

I'll sqawk again, if this patch survives long enough to warrant inclusion.

===

Ah to heck with it. This subtle distinction over what level of cpuset
we fall back to when a hot unplug leaves a task with no online cpuset in
its current allowed set is not worth it.

Every variation I consider is either sufficiently complicated that I
can't be sure it's right, or sufficiently simple that it's obviously
broken.

Revert the move_task_off_dead_cpu() code to its previous code, before
cpusets were added. If none of the remaining allowed cpus are online,
then let the task run on any cpu, no limit. This is a legal fallback,
and indeed one of the possible outcomes of the previous code. It's just
not so Nice.

If a system administrator doesn't like a task being allowed to run
anywhere as a result of this, then they should clear out a cpuset of the
tasks running in it, before they take the last cpu in that cpuset
offline, and they should use taskset (sched_setaffinity) or other means
to ensure that tasks aren't pinned to a cpu that is about to be taken
offline.

Unless and until someone can make a good case to the contrary, it is not
worth nesting hotplug and cpuset semaphores to attempt to provide a more
subtle fallback, that few people would understand anyway.

At least do one thing right - attach the task to the top_cpuset if we
have to force its cpus_allowed there. That keeps the tasks apparent
cpuset in sync with its cpus_allowed (any online cpu or CPU_MASK_ALL,
which are roughly equivalent in this context).

Signed-off-by: Paul Jackson <[email protected]>

diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
--- 2.6.12-rc1-mm4.orig/kernel/sched.c 2005-05-13 18:39:54.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c 2005-05-13 19:02:49.000000000 -0700
@@ -4301,7 +4301,8 @@ static void move_task_off_dead_cpu(int d

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ cpus_setall(tsk->cpus_allowed);
+ tsk->cpuset = &top_cpuset;
dest_cpu = any_online_cpu(tsk->cpus_allowed);

/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 02:59:16

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Dipankar, replying to pj:
> > What part of what I wrote are you saying "No" to?
>
> The question right above "No" :)

Well ... that was less than obvious. You quoted too much, and
responded with information about other semaphores, not about
why other duties of _this_ semaphore made such a rename wrong.

Fortunately, Nathan clarified matters.

So how would you, or Srivatsa or Nathan, respond to my more substantive
point, to repeat:

Srivatsa, replying to Dinakar:
> This in fact was the reason that we added lock_cpu_hotplug
> in sched_setaffinity.

Why just in sched_setaffinity()? What about the other 60+ calls to
set_cpus_allowed(). Shouldn't most of those calls be checking that the
passed in cpus are online (holding lock_cpu_hotplug while doing all
this)? Either that, or at least handling the error from
set_cpus_allowed() if the requested cpus end up not being online? I see
only 2 set_cpus_allowed() calls that make any pretense of examining the
return value.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 12:14:46

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Paul Jackson wrote:
>
> Ah to heck with it. This subtle distinction over what level of cpuset
> we fall back to when a hot unplug leaves a task with no online cpuset in
> its current allowed set is not worth it.
>
> Every variation I consider is either sufficiently complicated that I
> can't be sure it's right, or sufficiently simple that it's obviously
> broken.

Heh, I feel your pain.

> diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
> --- 2.6.12-rc1-mm4.orig/kernel/sched.c 2005-05-13 18:39:54.000000000 -0700
> +++ 2.6.12-rc1-mm4/kernel/sched.c 2005-05-13 19:02:49.000000000 -0700
> @@ -4301,7 +4301,8 @@ static void move_task_off_dead_cpu(int d
>
> /* No more Mr. Nice Guy. */
> if (dest_cpu == NR_CPUS) {
> - tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
> + cpus_setall(tsk->cpus_allowed);
> + tsk->cpuset = &top_cpuset;

Hmm, tsk->cpuset will break the build if !CONFIG_CPUSETS, no? Plus,
the task's original cpuset->count will be unbalanced and it will never
be released, methinks.

As a short-term (i.e. 2.6.12) solution, would it be terrible to set
tsk->cpus_allowed to the default value without messing with the
cpuset? That is, is it Bad for a task's cpus_allowed to be a superset
of its cpuset's cpus_allowed? I ran Dinakar's test on 2.6.12-rc4 +
this proposed "fix" on ppc64 without any crashes or warnings, but I
would need you to confirm that this doesn't violate some fundamental
cpuset constraint.

However, I think making a best effort to honor the task's cpuset is a
reasonable goal in this context. But it will require some nontrivial
changes to the code for migrating tasks off the dead cpu, as well as
some changes to the cpuset code. Not 2.6.12 material.

I'll patch soon.

Nathan

2005-05-14 16:09:47

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 07:58:51PM -0700, Paul Jackson wrote:
> Why just in sched_setaffinity()? What about the other 60+ calls to
> set_cpus_allowed().

Most of the 60+ calls seem to be in arch-specific code. We don't bother
about arch'es which dont support CPU Hotplug. AFAIK, only ppc64, ia64
and s390 support it (as of today that is :)

> Shouldn't most of those calls be checking that the
> passed in cpus are online (holding lock_cpu_hotplug while doing all
> this)? Either that, or at least handling the error from
> set_cpus_allowed() if the requested cpus end up not being online?

Yes, if these calls can run on CPU Hotplug supported platforms.

Apart from being buggy on this count (that they don't lock_cpu_hotplug
and/or don't check return value), they will also be buggy
if any code uses set_cpus_allowed against a user-space task to
migrate to a different CPU. Something like:

old_allowed = tsk->cpus_allowed;
set_cpus_allowed(tsk, new_mask);
/* Do something */
/* Restore mask */
set_cpus_allowed(tsk, old_allowed);

This is buggy because it can race with sched_setaffinity against the same task.

Having said that, the major remaining offenders (in arch-independent code) seem
to be cpufreq, sn_hwperf, acpi, perfctr/virtual.c, attach_task :),
net/core/pktgen.c (needs to be converted over to kthread_start/stop interface).

Now how much of this actually runs on the CPU Hotplug supported platforms?
Don't know ..But it is neverthless good to fix few of these ..

In fact, some of these and other set_cpus_allowed woes were discussed here:

http://sourceforge.net/mailarchive/forum.php?thread_id=3582487&forum_id=35582

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-14 16:28:41

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 07:23:31PM -0700, Paul Jackson wrote:
> Revert the move_task_off_dead_cpu() code to its previous code, before
> cpusets were added. If none of the remaining allowed cpus are online,
> then let the task run on any cpu, no limit.

I would agree to that!

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-14 16:31:04

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Fri, May 13, 2005 at 07:58:51PM -0700, Paul Jackson wrote:
>
> So how would you, or Srivatsa or Nathan, respond to my more substantive
> point, to repeat:
>
> Srivatsa, replying to Dinakar:
> > This in fact was the reason that we added lock_cpu_hotplug
> > in sched_setaffinity.
>
> Why just in sched_setaffinity()?

I suspect that the lock_cpu_hotplug is no longer necessary in
sched_setaffinity. I found the original changeset which introduced
it, and it's short enough that I'll just duplicate it here:

diff -Naru a/kernel/sched.c b/kernel/sched.c
--- a/kernel/sched.c 2005-05-14 07:21:39 -07:00
+++ b/kernel/sched.c 2005-05-14 07:21:39 -07:00
@@ -1012,6 +1012,7 @@
unsigned long flags;
cpumask_t old_mask, new_mask = cpumask_of_cpu(dest_cpu);

+ lock_cpu_hotplug();
rq = task_rq_lock(p, &flags);
old_mask = p->cpus_allowed;
if (!cpu_isset(dest_cpu, old_mask) || !cpu_online(dest_cpu))
@@ -1035,6 +1036,7 @@
}
out:
task_rq_unlock(rq, &flags);
+ unlock_cpu_hotplug();
}

/*
@@ -2309,11 +2311,13 @@
if (copy_from_user(&new_mask, user_mask_ptr, sizeof(new_mask)))
return -EFAULT;

+ lock_cpu_hotplug();
read_lock(&tasklist_lock);

p = find_process_by_pid(pid);
if (!p) {
read_unlock(&tasklist_lock);
+ unlock_cpu_hotplug();
return -ESRCH;
}

@@ -2334,6 +2338,7 @@

out_unlock:
put_task_struct(p);
+ unlock_cpu_hotplug();
return retval;
}

# This is a BitKeeper generated diff -Nru style patch.
#
# ChangeSet
# 2004/03/19 08:02:56-08:00 [email protected]
# [PATCH] Hotplug CPUs: Take cpu Lock Around Migration
#
# Grab cpu lock around sched_migrate_task() and
#sys_sched_setaffinity().
# This is a noop without CONFIG_HOTPLUG_CPU.
#
# The sched_migrate_task may have a performance penalty on NUMA if
#lots
# of exec rebalancing is happening, however this only applies to
# CONFIG_NUMA and CONFIG_HOTPLUG_CPU, which noone does at the moment
# anyway.
#
# Also, the scheduler in -mm solves the race another way, so this
#will
# vanish then.
#
# kernel/sched.c
# 2004/03/16 18:10:10-08:00 [email protected] +5 -0
# Hotplug CPUs: Take cpu Lock Around Migration
#

The lock/unlock_cpu_hotplug is no longer there in sched_migrate_task.
The changelog leads me to believe that it was intended that the same
change should have been made to sched_setaffinity by now. I think
it's safe to remove it; I can't see why it would be necessary any
more.

Nathan

2005-05-14 17:05:00

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Nathan wrote:
> Hmm, tsk->cpuset will break the build if !CONFIG_CPUSETS, no? Plus,
> the task's original cpuset->count will be unbalanced and it will never
> be released, methinks.

Ah - yup.

Well, at least I am keeping my mistakes simple enough that others can
spot them quickly ;).

Fixing the tasks cpuset by moving it to top_cpuset to match its
newly extended tsk->cpus_allowed setting probably requires holding
cpuset_sem, which is what we were trying to avoid. Drat.

> As a short-term (i.e. 2.6.12) solution, would it be terrible to set
> tsk->cpus_allowed to the default value without messing with the
> cpuset?

No - not terrible at all. The code that was there before, using the
cpuset_cpus_allowed() and guarantee_online_cpus(), could do just that,
setting tsk->cpus_allowed to a superset of tsk->cpuset->cpus_allowed,
possibly even setting tsk->cpus_allowed to all bits. No one has noticed
any terrible disasters from this yet.

Granted, no one has pushed this code path very hard yet.

> However, I think making a best effort to honor the task's cpuset is a
> reasonable goal in this context. But it will require some nontrivial
> changes to the code for migrating tasks off the dead cpu, as well as
> some changes to the cpuset code.

If these nontrivial changes have other merits, such as making some code
simpler or clearer or more elegant, then that's goodness in any case.

But I am reluctant to complicate either the cpuset or the hotplug code
on this account. In my view, the kernel reserves the right to blow out
a cpuset, if it is convenient for the kernel to do so in order stay
sane, which includes:

* insuring every task has a cpu it is allowed to run on,
* insuring every task has a node is is allowed to allocate on, and
* no deadlocks, hangs, oops, panics or other disasters.

The mm/page_alloc.c __alloc_pages() code makes similar tradeoffs,
allowing GFP_ATOMIC requests to ignore cpusets under memory pressure,
for the "convenience of the kernel."

A key property of any solution to this is that it is so simple and
robust that people who don't understand this stuff (including obviously
myself after a month's vacation) won't break it in the future.

The following untested patch is probably what you meant by your
"short-term solution", above. It _just_ reverts the "No more Mr. Nice
Guy" code back to setting all bits in tsk->cpus_allowed. I see on
another post in my inbox that Srivatsa has just heartily endorsed this
approach as well.

Signed-off-by: Paul Jackson <[email protected]>

diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
--- 2.6.12-rc1-mm4.orig/kernel/sched.c 2005-05-13 18:39:54.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c 2005-05-14 09:06:29.000000000 -0700
@@ -4301,7 +4301,7 @@ static void move_task_off_dead_cpu(int d

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ cpus_setall(tsk->cpus_allowed);
dest_cpu = any_online_cpu(tsk->cpus_allowed);

/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 17:23:48

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Sat, May 14, 2005 at 11:30:12AM -0500, Nathan Lynch wrote:
> I suspect that the lock_cpu_hotplug is no longer necessary in
> sched_setaffinity.

Digging harder into my memory, I think the primary reason why lock_cpu_hotplug
was added in sched_setaffinity was this: some code wants to temporarily
override a (user-space) task's cpu mask. Before stop_machine came along, module
code was doing that (again if my memory serves me right). With stop_machine,
temporary changing of user-space (rmmod's) cpus_allowed is not reqd.
However I still see other code (like acpi_processor_set_performance) which
does something like this:

saved_mask = current->cpus_allowed;

set_cpus_allowed(current, new_mask);

/* Do something ..Could block */

set_cpus_allowed(current, saved_mask);

How do you ensure that saved_mask has not changed since the time
we took a snapshot of it (bcoz of sched_setaffinity having
changed it)? lock_cpu_hotplug could serialize such pieces of code

> I found the original changeset which introduced
> it, and it's short enough that I'll just duplicate it here:

[snip]

> The lock/unlock_cpu_hotplug is no longer there in sched_migrate_task.

Essentialy, if I remember correct, sched_migrate_task was also doing
something like acpi_processor_set_performance to migrate a task to a new cpu.
Thats why we discussed that sched_migrate_task should take lock_cpu_hotplug.
But I don't think it ever went in. There were concerns of cache-line bounces
on NUMA m/c bcoz of lock_cpu_hotplug in a frequently executed code like
sched_migrate_task. Moreover sched_migrate_task no longer behaves like
acpi_processor_set_performance when it wants to migrate a user-space task to a
different CPU. It now seems to take the help of migration thread (which is
much neater compared to what was there earlier i guess).

> The changelog leads me to believe that it was intended that the same
> change should have been made to sched_setaffinity by now. I think
> it's safe to remove it; I can't see why it would be necessary any
> more.

I recommend that we keep the lock_cpu_hotplug in sched_affinity
unless we figure out a different way of solving the race I highlighted above
or we ban code like that in acpi_processor_set_performance :)

--

Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2005-05-14 17:44:59

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

This patch removes the entwining of cpusets and hotplug code in the "No
more Mr. Nice Guy" case of sched.c move_task_off_dead_cpu().

This is the same patch I sent last hour, with this patch comment cleaned
up, and Documentation/cpusets.txt fixed as well.

Since the hotplug code is holding a spinlock at this point, we cannot
take the cpuset semaphore, cpuset_sem, as would seem to be required
either to update the tasks cpuset, or to scan up the nested cpuset
chain, looking for the nearest cpuset ancestor that still has some CPUs
that are online. So we just punt and blast the tasks cpus_allowed with
all bits allowed.

This reverts these lines of code to what they were before the cpuset
patch. And it updates the cpuset Doc file, to match. This patch is
untested.

The one known alternative to this that seems to work came from Dinakar
Guniguntala, and required the hotplug code to take the cpuset_sem
semaphore much earlier in its processing. So far as we know, the
increased locking entanglement between cpusets and hot plug of this
alternative approach is not worth doing in this case.

Signed-off-by: Paul Jackson <[email protected]>

diff -Naurp 2.6.12-rc1-mm4.orig/Documentation/cpusets.txt 2.6.12-rc1-mm4/Documentation/cpusets.txt
--- 2.6.12-rc1-mm4.orig/Documentation/cpusets.txt 2005-05-14 10:20:27.000000000 -0700
+++ 2.6.12-rc1-mm4/Documentation/cpusets.txt 2005-05-14 10:24:13.000000000 -0700
@@ -252,8 +252,7 @@ in a tasks processor placement.
There is an exception to the above. If hotplug funtionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then the kernel will automatically update the cpus_allowed of all
-tasks attached to CPUs in that cpuset with the online CPUs of the
-nearest parent cpuset that still has some CPUs online. When memory
+tasks attached to CPUs in that cpuset to allow all CPUs. When memory
hotplug functionality for removing Memory Nodes is available, a
similar exception is expected to apply there as well. In general,
the kernel prefers to violate cpuset placement, over starving a task
diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
--- 2.6.12-rc1-mm4.orig/kernel/sched.c 2005-05-13 18:39:54.000000000 -0700
+++ 2.6.12-rc1-mm4/kernel/sched.c 2005-05-14 09:06:29.000000000 -0700
@@ -4301,7 +4301,7 @@ static void move_task_off_dead_cpu(int d

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ cpus_setall(tsk->cpus_allowed);
dest_cpu = any_online_cpu(tsk->cpus_allowed);

/*

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 17:51:09

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Srivatsa wrote:
> In fact, some of these and other set_cpus_allowed woes were discussed here

So indeed, there are issues with lock_cpu_hotplug, sched_setaffinity,
set_cpus_allowed() and tsk->cpus_allowed and cpu_online_map and such.

Indeed, the tar pit is deeper than I imagined.

Ok ... I will quietly sneak out the back door and leave this to the experts.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 17:58:18

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Srivatsa wrote:
> We don't bother about arch'es which dont support CPU Hotplug.
> ...
> Apart from being buggy on this count (that they don't lock_cpu_hotplug
> and/or don't check return value), they will also be buggy if ...

Ok. That makes sense.

Though it is a bit worrisome.

One of the two key ways to document something is to have all existing
code examples doing it the right way. Then someone adding a new
architecture, or adding hotplug to an existing architecture, will tend
to do the right thing, simply by doing what others have done.

Here we seem to have some cases where the "normal" ways to code some
things are broken for hotplug. But most of the arch's still have the
broken code patterns.

The other way to document something is with comments and Doc files.

It would seem that hotplug still has opportunities for improvement,
on both fronts.

Good luck ;).

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-14 23:18:09

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

Srivatsa Vaddagiri wrote:
> > The changelog leads me to believe that it was intended that the same
> > change should have been made to sched_setaffinity by now. I think
> > it's safe to remove it; I can't see why it would be necessary any
> > more.
>
> I recommend that we keep the lock_cpu_hotplug in sched_affinity
> unless we figure out a different way of solving the race I highlighted above
> or we ban code like that in acpi_processor_set_performance :)

Ok, thanks for the education. I agree.

Nathan

2005-05-18 04:15:50

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Srivatsa, Dinakar and/or Nathan,

If the version of this patch that I posted:

Sat, 14 May 2005 10:44:29 -0700
Message-Id: <[email protected]>

with the comment change, and the code change:

/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
- tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
+ cpus_setall(tsk->cpus_allowed);

is one that any of you have tested, and if you agree with it, then with
an "Acked-by:" reply (to that patch), it should make sense to recommend
it to Andrew for inclusion in his and/or Linus's kernel.

I'm still happy with it. My understanding is that you all are ok with
it as well, and I just need to push it over the finish line.

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373, 1.925.600.0401

2005-05-18 09:20:01

[permalink] [raw]

Subject: Re: [Lse-tech] Re: [PATCH] cpusets+hotplug+preepmt broken

On Sat, May 14, 2005 at 10:44:29AM -0700, Paul Jackson wrote:
> Signed-off-by: Paul Jackson <[email protected]>
>
> diff -Naurp 2.6.12-rc1-mm4.orig/Documentation/cpusets.txt 2.6.12-rc1-mm4/Documentation/cpusets.txt
> --- 2.6.12-rc1-mm4.orig/Documentation/cpusets.txt 2005-05-14 10:20:27.000000000 -0700
> +++ 2.6.12-rc1-mm4/Documentation/cpusets.txt 2005-05-14 10:24:13.000000000 -0700
> @@ -252,8 +252,7 @@ in a tasks processor placement.
> There is an exception to the above. If hotplug funtionality is used
> to remove all the CPUs that are currently assigned to a cpuset,
> then the kernel will automatically update the cpus_allowed of all
> -tasks attached to CPUs in that cpuset with the online CPUs of the
> -nearest parent cpuset that still has some CPUs online. When memory
> +tasks attached to CPUs in that cpuset to allow all CPUs. When memory
> hotplug functionality for removing Memory Nodes is available, a
> similar exception is expected to apply there as well. In general,
> the kernel prefers to violate cpuset placement, over starving a task
> diff -Naurp 2.6.12-rc1-mm4.orig/kernel/sched.c 2.6.12-rc1-mm4/kernel/sched.c
> --- 2.6.12-rc1-mm4.orig/kernel/sched.c 2005-05-13 18:39:54.000000000 -0700
> +++ 2.6.12-rc1-mm4/kernel/sched.c 2005-05-14 09:06:29.000000000 -0700
> @@ -4301,7 +4301,7 @@ static void move_task_off_dead_cpu(int d
>
> /* No more Mr. Nice Guy. */
> if (dest_cpu == NR_CPUS) {
> - tsk->cpus_allowed = cpuset_cpus_allowed(tsk);
> + cpus_setall(tsk->cpus_allowed);
> dest_cpu = any_online_cpu(tsk->cpus_allowed);
>
> /*
>

This patch is fine by me. (Vatsa is on a vacation, so maybe Nathan can ack it?)
Andrew this resolves the oops I had reported earlier in the thread.

Acked-by: Dinakar Guniguntala <[email protected]>

2005-05-18 14:51:27

[permalink] [raw]

Subject: Re: [PATCH] cpusets+hotplug+preepmt broken

Acked-by: Nathan Lynch <[email protected]>

2005-05-18 21:20:24