2007-08-24 22:18:19

by Cliff Wickman

[permalink] [raw]
Subject: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset


When a cpu is disabled, move_task_off_dead_cpu() is called for tasks
that have been running on that cpu.

Currently, such a task is migrated:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any cpu which is both online and among that task's cpus_allowed

It is typical of a multithreaded application running on a large NUMA system
to have its tasks confined to a cpuset so as to cluster them near the
memory that they share. Furthermore, it is typical to explicitly place such
a task on a specific cpu in that cpuset. And in that case the task's
cpus_allowed includes only a single cpu.

This patch would insert a preference to migrate such a task to some cpu within
its cpuset (and set its cpus_allowed to its entire cpuset).

With this patch, migrate the task to:
1) to any cpu on the same node as the disabled cpu, which is both online
and among that task's cpus_allowed
2) to any online cpu within the task's cpuset
3) to any cpu which is both online and among that task's cpus_allowed


In order to do this, move_task_off_dead_cpu() must make a call to
cpuset_cpus_allowed_lock(), a new variant of cpuset_cpus_allowed() that
will not block.
Calls are made to cpuset_lock() and cpuset_unlock() in migration_call()
to set the cpuset mutex during the whole migrate_live_tasks() and
migrate_dead_tasks() procedure.

This patch depends on 2 patches from Oleg Nesterov:
[PATCH 1/2] do CPU_DEAD migrating under read_lock(tasklist) instead of write_lock_irq(tasklist)
[PATCH 2/2] migration_call(CPU_DEAD): use spin_lock_irq() instead of task_rq_lock()

Diffed against 2.6.23-rc3

Signed-off-by: Cliff Wickman <[email protected]>

---
include/linux/cpuset.h | 5 +++++
kernel/cpuset.c | 19 +++++++++++++++++++
kernel/sched.c | 14 ++++++++++++++
3 files changed, 38 insertions(+)

Index: linus.070821/kernel/sched.c
===================================================================
--- linus.070821.orig/kernel/sched.c
+++ linus.070821/kernel/sched.c
@@ -61,6 +61,7 @@
#include <linux/delayacct.h>
#include <linux/reciprocal_div.h>
#include <linux/unistd.h>
+#include <linux/cpuset.h>

#include <asm/tlb.h>

@@ -5091,6 +5092,17 @@ restart:
if (dest_cpu == NR_CPUS)
dest_cpu = any_online_cpu(p->cpus_allowed);

+ /* try to stay on the same cpuset */
+ if (dest_cpu == NR_CPUS) {
+ /*
+ * The cpuset_cpus_allowed_lock() variant of
+ * cpuset_cpus_allowed() will not block
+ * It must be called within calls to cpuset_lock/cpuset_unlock.
+ */
+ p->cpus_allowed = cpuset_cpus_allowed_lock(p);
+ dest_cpu = any_online_cpu(p->cpus_allowed);
+ }
+
/* No more Mr. Nice Guy. */
if (dest_cpu == NR_CPUS) {
rq = task_rq_lock(p, &flags);
@@ -5412,6 +5424,7 @@ migration_call(struct notifier_block *nf

case CPU_DEAD:
case CPU_DEAD_FROZEN:
+ cpuset_lock(); /* around calls to cpuset_cpus_allowed_lock() */
migrate_live_tasks(cpu);
rq = cpu_rq(cpu);
kthread_stop(rq->migration_thread);
@@ -5425,6 +5438,7 @@ migration_call(struct notifier_block *nf
rq->idle->sched_class = &idle_sched_class;
migrate_dead_tasks(cpu);
spin_unlock_irq(&rq->lock);
+ cpuset_unlock();
migrate_nr_uninterruptible(rq);
BUG_ON(rq->nr_running != 0);

Index: linus.070821/include/linux/cpuset.h
===================================================================
--- linus.070821.orig/include/linux/cpuset.h
+++ linus.070821/include/linux/cpuset.h
@@ -22,6 +22,7 @@ extern void cpuset_init_smp(void);
extern void cpuset_fork(struct task_struct *p);
extern void cpuset_exit(struct task_struct *p);
extern cpumask_t cpuset_cpus_allowed(struct task_struct *p);
+extern cpumask_t cpuset_cpus_allowed_lock(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
#define cpuset_current_mems_allowed (current->mems_allowed)
void cpuset_init_current_mems_allowed(void);
@@ -87,6 +88,10 @@ static inline cpumask_t cpuset_cpus_allo
{
return cpu_possible_map;
}
+static inline cpumask_t cpuset_cpus_allowed_lock(struct task_struct *p)
+{
+ return cpu_possible_map;
+}

static inline nodemask_t cpuset_mems_allowed(struct task_struct *p)
{
Index: linus.070821/kernel/cpuset.c
===================================================================
--- linus.070821.orig/kernel/cpuset.c
+++ linus.070821/kernel/cpuset.c
@@ -2333,6 +2333,25 @@ cpumask_t cpuset_cpus_allowed(struct tas
return mask;
}

+/**
+ * cpuset_cpus_allowed_lock - return cpus_allowed mask from a tasks cpuset.
+ * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
+ *
+ * Description: Same as cpuset_cpus_allowed, but called with callback_mutex
+ * already held.
+ **/
+
+cpumask_t cpuset_cpus_allowed_lock(struct task_struct *tsk)
+{
+ cpumask_t mask;
+
+ task_lock(tsk);
+ guarantee_online_cpus(tsk->cpuset, &mask);
+ task_unlock(tsk);
+
+ return mask;
+}
+
void cpuset_init_current_mems_allowed(void)
{
current->mems_allowed = NODE_MASK_ALL;
--
Cliff Wickman
Silicon Graphics, Inc.
[email protected]
(651) 683-3824


2007-08-24 22:56:13

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On Fri, 24 Aug 2007 17:18:06 -0500
Cliff Wickman <[email protected]> wrote:

> When a cpu is disabled, move_task_off_dead_cpu() is called for tasks
> that have been running on that cpu.
>
> Currently, such a task is migrated:
> 1) to any cpu on the same node as the disabled cpu, which is both online
> and among that task's cpus_allowed
> 2) to any cpu which is both online and among that task's cpus_allowed
>
> It is typical of a multithreaded application running on a large NUMA system
> to have its tasks confined to a cpuset so as to cluster them near the
> memory that they share. Furthermore, it is typical to explicitly place such
> a task on a specific cpu in that cpuset. And in that case the task's
> cpus_allowed includes only a single cpu.

operator error..

> This patch would insert a preference to migrate such a task to some cpu within
> its cpuset (and set its cpus_allowed to its entire cpuset).
>
> With this patch, migrate the task to:
> 1) to any cpu on the same node as the disabled cpu, which is both online
> and among that task's cpus_allowed
> 2) to any online cpu within the task's cpuset
> 3) to any cpu which is both online and among that task's cpus_allowed

Wouldn't it be saner to refuse the offlining request if the CPU has tasks
which cannot be migrated to any other CPU? I mean, the operator has gone
and asked the machine to perform two inconsistent/incompatible things at
the same time.

Look at it this way. If we were to merge this patch then it would be
logical to also merge a patch which has the following description:

"if an process attempts to pin itself onto an presently-offlined CPU,
the kernel will choose a different CPU according to <heuristics> and
will pin the process to that CPU instead".

Which is the same thing as your patch, only it handles the two events when
they occur in the other order.

No?

2007-08-25 09:45:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On 08/24, Andrew Morton wrote:
>
> On Fri, 24 Aug 2007 17:18:06 -0500
> Cliff Wickman <[email protected]> wrote:
>
> > When a cpu is disabled, move_task_off_dead_cpu() is called for tasks
> > that have been running on that cpu.
> >
> > Currently, such a task is migrated:
> > 1) to any cpu on the same node as the disabled cpu, which is both online
> > and among that task's cpus_allowed
> > 2) to any cpu which is both online and among that task's cpus_allowed
> >
> > It is typical of a multithreaded application running on a large NUMA system
> > to have its tasks confined to a cpuset so as to cluster them near the
> > memory that they share. Furthermore, it is typical to explicitly place such
> > a task on a specific cpu in that cpuset. And in that case the task's
> > cpus_allowed includes only a single cpu.
>
> operator error..
>
> > This patch would insert a preference to migrate such a task to some cpu within
> > its cpuset (and set its cpus_allowed to its entire cpuset).
> >
> > With this patch, migrate the task to:
> > 1) to any cpu on the same node as the disabled cpu, which is both online
> > and among that task's cpus_allowed
> > 2) to any online cpu within the task's cpuset
> > 3) to any cpu which is both online and among that task's cpus_allowed
>
> Wouldn't it be saner to refuse the offlining request if the CPU has tasks
> which cannot be migrated to any other CPU? I mean, the operator has gone
> and asked the machine to perform two inconsistent/incompatible things at
> the same time.

I don't think so (regardless of this patch and CONFIG_CPUSETS). Any user
can bind its process to (say) CPU 4. This shouldn't block cpu-unplug.

Now, let's suppose that this process is a member of some cpuset which
contains CPUs 3 and 4, and CPU 4 goes down.

Before this patch, process leaves its ->cpuset and migrates to some "random"
any_online_cpu(). With this patch it stays within ->cpuset and migrates to
CPU 3.

> Look at it this way. If we were to merge this patch then it would be
> logical to also merge a patch which has the following description:
>
> "if an process attempts to pin itself onto an presently-offlined CPU,
> the kernel will choose a different CPU according to <heuristics> and
> will pin the process to that CPU instead".

set_cpus_allowed() just returns -EINVAL in that case, this looks a bit
more logical.

Oleg.

2007-08-25 09:56:20

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On 08/24, Cliff Wickman wrote:
> --- linus.070821.orig/kernel/cpuset.c
> +++ linus.070821/kernel/cpuset.c
> @@ -2333,6 +2333,25 @@ cpumask_t cpuset_cpus_allowed(struct tas
> return mask;
> }
>
> +/**
> + * cpuset_cpus_allowed_lock - return cpus_allowed mask from a tasks cpuset.
> + * @tsk: pointer to task_struct from which to obtain cpuset->cpus_allowed.
> + *
> + * Description: Same as cpuset_cpus_allowed, but called with callback_mutex
> + * already held.
> + **/
> +
> +cpumask_t cpuset_cpus_allowed_lock(struct task_struct *tsk)
> +{
> + cpumask_t mask;
> +
> + task_lock(tsk);
> + guarantee_online_cpus(tsk->cpuset, &mask);
> + task_unlock(tsk);
> +
> + return mask;
> +}

Very minor nit, perhaps it makes sense to use this function in cpuset_cpus_allowed()
to avoid the code duplication. Its name is a bit confusing, imho. How about
cpuset_cpus_allowed_locked() or __cpuset_cpus_allowed() ?

Oleg.

2007-08-25 11:15:47

by Oleg Nesterov

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On 08/24, Cliff Wickman wrote:
>
> @@ -5091,6 +5092,17 @@ restart:
> if (dest_cpu == NR_CPUS)
> dest_cpu = any_online_cpu(p->cpus_allowed);
>
> + /* try to stay on the same cpuset */
> + if (dest_cpu == NR_CPUS) {
> + /*
> + * The cpuset_cpus_allowed_lock() variant of
> + * cpuset_cpus_allowed() will not block
> + * It must be called within calls to cpuset_lock/cpuset_unlock.
> + */
> + p->cpus_allowed = cpuset_cpus_allowed_lock(p);
> + dest_cpu = any_online_cpu(p->cpus_allowed);

Ugh, didn't notice while reading this patch.

We shouldn't change ->cpus_allowed without rq->lock. Please look what
the current code does in "/* No more Mr. Nice Guy. */" case.

Oleg.

Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On Sat, Aug 25, 2007 at 01:47:40PM +0400, Oleg Nesterov wrote:
> On 08/24, Andrew Morton wrote:
> >
> > On Fri, 24 Aug 2007 17:18:06 -0500
> > Cliff Wickman <[email protected]> wrote:
> >
> > > When a cpu is disabled, move_task_off_dead_cpu() is called for tasks
> > > that have been running on that cpu.
> > >
> > > Currently, such a task is migrated:
> > > 1) to any cpu on the same node as the disabled cpu, which is both online
> > > and among that task's cpus_allowed
> > > 2) to any cpu which is both online and among that task's cpus_allowed
> > >
> > > It is typical of a multithreaded application running on a large NUMA system
> > > to have its tasks confined to a cpuset so as to cluster them near the
> > > memory that they share. Furthermore, it is typical to explicitly place such
> > > a task on a specific cpu in that cpuset. And in that case the task's
> > > cpus_allowed includes only a single cpu.
> >
> > operator error..
> >
> > > This patch would insert a preference to migrate such a task to some cpu within
> > > its cpuset (and set its cpus_allowed to its entire cpuset).
> > >
> > > With this patch, migrate the task to:
> > > 1) to any cpu on the same node as the disabled cpu, which is both online
> > > and among that task's cpus_allowed
> > > 2) to any online cpu within the task's cpuset
> > > 3) to any cpu which is both online and among that task's cpus_allowed
> >
> > Wouldn't it be saner to refuse the offlining request if the CPU has tasks
> > which cannot be migrated to any other CPU? I mean, the operator has gone
> > and asked the machine to perform two inconsistent/incompatible things at
> > the same time.
>
> I don't think so (regardless of this patch and CONFIG_CPUSETS). Any user
> can bind its process to (say) CPU 4. This shouldn't block cpu-unplug.
>
> Now, let's suppose that this process is a member of some cpuset which
> contains CPUs 3 and 4, and CPU 4 goes down.
>
> Before this patch, process leaves its ->cpuset and migrates to some "random"
> any_online_cpu(). With this patch it stays within ->cpuset and migrates to
> CPU 3.

The decision to bind a task to a specific cpu, was taken by the userspace
for a reason, which is _unknown_ to the kernel.
So logically, shouldn't the userspace decide what should be
the fate of those exclusive-affined tasks, whose cpu is about to go
offline? After all, the reason to offline the cpu is, again, unknown to
the kernel.

Though we have been breaking such a task's affinity in
/* No more Mr. Nice Guy. */ part, IMO a nicer way to do it would be to:
- Fail the cpu-offline operation with -EBUSY, if there exist userspace tasks
exclusively affined to that cpu.
- Provide some kind of infrastructure like a sysfs file
/sys/devices/system/cpu/cpuX/exclusive_tasks which will help
the administrator take proactive steps before requesting a
cpu offline, instead of the kernel taking the reactive step once the
offline is done.

The side-effect, ofcourse would be that it would break some of the
existing applications, which are not used to cpu-offline failing unless
the cpu was already offline or there was only one online cpu. But is the
side effect so critical, that we continue with this funny contradiction in
the kernel?! Or is there something important, I'm missing here?

>
> > Look at it this way. If we were to merge this patch then it would be
> > logical to also merge a patch which has the following description:
> >
> > "if an process attempts to pin itself onto an presently-offlined CPU,
> > the kernel will choose a different CPU according to <heuristics> and
> > will pin the process to that CPU instead".
>
> set_cpus_allowed() just returns -EINVAL in that case, this looks a bit
> more logical.
>

Yup, it sure does!

> Oleg.
>

Thanks and Regards
gautham.
--
Gautham R Shenoy
Linux Technology Center
IBM India.
"Freedom comes with a price tag of responsibility, which is still a bargain,
because Freedom is priceless!"

2007-08-26 00:48:18

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On Sun, 2007-08-26 at 05:46 +0530, Gautham R Shenoy wrote:
> On Sat, Aug 25, 2007 at 01:47:40PM +0400, Oleg Nesterov wrote:
> > Before this patch, process leaves its ->cpuset and migrates to some "random"
> > any_online_cpu(). With this patch it stays within ->cpuset and migrates to
> > CPU 3.
>
> The decision to bind a task to a specific cpu, was taken by the userspace
> for a reason, which is _unknown_ to the kernel.
> So logically, shouldn't the userspace decide what should be
> the fate of those exclusive-affined tasks, whose cpu is about to go
> offline? After all, the reason to offline the cpu is, again, unknown to
> the kernel.

Userspace is not monolithic. If you refuse to take a CPU offline
because a task is affine, then any user can prevent a CPU from going
offline.

You could, perhaps, introduce a "gentle" offline which fails if process
affinity can no longer be met.

Rusty.

2007-08-26 08:10:27

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On Sun, 26 Aug 2007 10:47:24 +1000 Rusty Russell <[email protected]> wrote:

> On Sun, 2007-08-26 at 05:46 +0530, Gautham R Shenoy wrote:
> > On Sat, Aug 25, 2007 at 01:47:40PM +0400, Oleg Nesterov wrote:
> > > Before this patch, process leaves its ->cpuset and migrates to some "random"
> > > any_online_cpu(). With this patch it stays within ->cpuset and migrates to
> > > CPU 3.
> >
> > The decision to bind a task to a specific cpu, was taken by the userspace
> > for a reason, which is _unknown_ to the kernel.
> > So logically, shouldn't the userspace decide what should be
> > the fate of those exclusive-affined tasks, whose cpu is about to go
> > offline? After all, the reason to offline the cpu is, again, unknown to
> > the kernel.
>
> Userspace is not monolithic. If you refuse to take a CPU offline
> because a task is affine, then any user can prevent a CPU from going
> offline.

That's a kernel bug.

> You could, perhaps, introduce a "gentle" offline which fails if process
> affinity can no longer be met.

Suitably privileged userspace should be able to

1) prevent tasks from binding to CPU N then
2) migrate all tasks which can use CPU N over to other CPU(s) then
3) offline CPU N.

2007-08-28 17:23:16

by Rusty Russell

[permalink] [raw]
Subject: Re: [PATCH 1/1] hotplug cpu: migrate a task within its cpuset

On Sun, 2007-08-26 at 01:09 -0700, Andrew Morton wrote:
> On Sun, 26 Aug 2007 10:47:24 +1000 Rusty Russell <[email protected]> wrote:
> > Userspace is not monolithic. If you refuse to take a CPU offline
> > because a task is affine, then any user can prevent a CPU from going
> > offline.
>
> That's a kernel bug.

You mean "would be if it were implemented"? Although consider the
equivalent forkbomb or thrashing userspace problems, where we just say
"use quotas".

Just to clarify: that is not how we work, we migrate tasks off a dying
CPU, breaking affinity and printing a warning if necessary.

It was simple and has not proven problematic in practice. (Userspace
cpu affinity has been a question of optimality not correctness)

> > You could, perhaps, introduce a "gentle" offline which fails if process
> > affinity can no longer be met.
>
> Suitably privileged userspace should be able to
>
> 1) prevent tasks from binding to CPU N then
> 2) migrate all tasks which can use CPU N over to other CPU(s) then
> 3) offline CPU N.

Indeed, (1) is missing. I would hesitate to introduce more
infrastructure in an under-worn and over-buggy part of the kernel
though.

Rusty.