There are several issues in how cpusets are handled during CPU hotplug.
One of the most annoying and noticeable consequences of these flaws is
that, after a suspend/resume or hibernation/restore, all non-root cpusets
end up with only one CPU (the boot CPU) in their cpu masks. Hence the tasks
in those cpusets get pinned to the boot CPU alone, leading to a drastic
performance degradation of those tasks/workloads.
One major user of cpusets is libvirt, which means that after a
suspend/hibernation cycle, all VMs suddenly end up running terribly slow!
This patchset solves these problems by reworking the way cpusets are handled
during CPU hotplug.
Patches 1 & 2 are cleanups that separate out the hotplug handling so that we
can implement different logic for different hotplug events (CPU/Mem
online/offline). This also enables some optimizations and, more importantly,
prepares the ground for implementing the cpuset fix for CPU hotplug.
Patches 3 & 4 are the core of the new solution for cpuset handling for
CPU hotplug. Patch 3 introduces the infrastructure needed, and Patch 4
exploits it for hotplug handling.
Patch 5 adds documentation for the new cpuset handling.
Patch 6 is an optimization opened up by the previous patches.
Patch 7 is a trivial removal of an outdated comment.
--
Srivatsa S. Bhat (7):
cpusets, hotplug: Implement cpuset tree traversal in a helper function
cpusets, hotplug: Restructure functions that are invoked during hotplug
cpusets: Introduce 'user_cpus_allowed' and rework the semantics of 'cpus_allowed'
CPU hotplug, cpusets: Work out hotplug handling for cpusets
Docs, cpusets: Update the cpuset documentation
cpusets: Optimize the implementation of guarantee_online_cpus()
cpusets: Remove outdated comment about cpuset_track_online_cpus
Documentation/cgroups/cpusets.txt | 43 +++--
include/linux/cpuset.h | 4
kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
kernel/sched/core.c | 4
4 files changed, 274 insertions(+), 94 deletions(-)
Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center
At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. And this takes a toll on optimization as well. For example,
cpuset_update_active_cpus() is called on both CPU offline and CPU online
events, and it invokes scan_for_empty_cpusets(), which makes sense only for
CPU offline events. Hence, the current code ends up unnecessarily traversing
the cpuset tree during CPU online events as well.
As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 26 ++++++++++++++++++--------
1 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..23e5da6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2000,6 +2000,23 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
move_member_tasks_to_cpuset(cs, parent);
}
+static struct cpuset *traverse_cpusets(struct list_head *queue)
+{
+ struct cpuset *cp;
+ struct cpuset *child; /* scans child cpusets of cp */
+ struct cgroup *cont;
+
+ cp = list_first_entry(queue, struct cpuset, stack_list);
+ list_del(queue->next);
+ list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
+ child = cgroup_cs(cont);
+ list_add_tail(&child->stack_list, queue);
+ }
+
+ return cp;
+}
+
+
/*
* Walk the specified cpuset subtree and look for empty cpusets.
* The tasks of such cpuset must be moved to a parent cpuset.
@@ -2019,19 +2036,12 @@ static void scan_for_empty_cpusets(struct cpuset *root)
{
LIST_HEAD(queue);
struct cpuset *cp; /* scans cpusets being updated */
- struct cpuset *child; /* scans child cpusets of cp */
- struct cgroup *cont;
static nodemask_t oldmems; /* protected by cgroup_mutex */
list_add_tail((struct list_head *)&root->stack_list, &queue);
while (!list_empty(&queue)) {
- cp = list_first_entry(&queue, struct cpuset, stack_list);
- list_del(queue.next);
- list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
- child = cgroup_cs(cont);
- list_add_tail(&child->stack_list, &queue);
- }
+ cp = traverse_cpusets(&queue);
/* Continue past cpusets with all cpus, mems online */
if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
Separate out the cpuset-related handling for CPU/Memory online/offline
events. This also helps us exploit the most obvious and basic level of
optimization that any notification mechanism (CPU/Mem online/offline) has
to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!"
While at it, rename scan_for_empty_cpusets() to scan_cpusets_upon_hotplug(),
which is more appropriate considering the upcoming changes.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
include/linux/cpuset.h | 4 +-
kernel/cpuset.c | 82 +++++++++++++++++++++++++++++++++---------------
kernel/sched/core.c | 4 +-
3 files changed, 60 insertions(+), 30 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 668f66b..838320f 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -20,7 +20,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(void);
+extern void cpuset_update_active_cpus(bool cpu_online);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -124,7 +124,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}
-static inline void cpuset_update_active_cpus(void)
+static inline void cpuset_update_active_cpus(bool cpu_online)
{
partition_sched_domains(1, NULL, NULL);
}
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 23e5da6..f1de35b 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -147,6 +147,12 @@ typedef enum {
CS_SPREAD_SLAB,
} cpuset_flagbits_t;
+/* the type of hotplug event */
+enum hotplug_event {
+ CPUSET_CPU_OFFLINE,
+ CPUSET_MEM_OFFLINE,
+};
+
/* convenient tests for these bits */
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
@@ -2032,39 +2038,60 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
* that has tasks along with an empty 'mems'. But if we did see such
* a cpuset, we'd handle it just like we do if its 'cpus' was empty.
*/
-static void scan_for_empty_cpusets(struct cpuset *root)
+static void
+scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
{
LIST_HEAD(queue);
- struct cpuset *cp; /* scans cpusets being updated */
+ struct cpuset *cp; /* scans cpusets being updated */
static nodemask_t oldmems; /* protected by cgroup_mutex */
list_add_tail((struct list_head *)&root->stack_list, &queue);
- while (!list_empty(&queue)) {
- cp = traverse_cpusets(&queue);
+ switch (event) {
+ case CPUSET_CPU_OFFLINE:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
- /* Continue past cpusets with all cpus, mems online */
- if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
- nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
- continue;
+ /* Continue past cpusets with all cpus online */
+ if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
+ continue;
- oldmems = cp->mems_allowed;
+ /* Remove offline cpus from this cpuset. */
+ mutex_lock(&callback_mutex);
+ cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
+ cpu_active_mask);
+ mutex_unlock(&callback_mutex);
- /* Remove offline cpus and mems from this cpuset. */
- mutex_lock(&callback_mutex);
- cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_active_mask);
- nodes_and(cp->mems_allowed, cp->mems_allowed,
+ /* Move tasks from the empty cpuset to a parent */
+ if (cpumask_empty(cp->cpus_allowed))
+ remove_tasks_in_empty_cpuset(cp);
+ else
+ update_tasks_cpumask(cp, NULL);
+ }
+ break;
+
+ case CPUSET_MEM_OFFLINE:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
+
+ /* Continue past cpusets with all mems online */
+ if (nodes_subset(cp->mems_allowed,
+ node_states[N_HIGH_MEMORY]))
+ continue;
+
+ oldmems = cp->mems_allowed;
+
+ /* Remove offline mems from this cpuset. */
+ mutex_lock(&callback_mutex);
+ nodes_and(cp->mems_allowed, cp->mems_allowed,
node_states[N_HIGH_MEMORY]);
- mutex_unlock(&callback_mutex);
+ mutex_unlock(&callback_mutex);
- /* Move tasks from the empty cpuset to a parent */
- if (cpumask_empty(cp->cpus_allowed) ||
- nodes_empty(cp->mems_allowed))
- remove_tasks_in_empty_cpuset(cp);
- else {
- update_tasks_cpumask(cp, NULL);
- update_tasks_nodemask(cp, &oldmems, NULL);
+ /* Move tasks from the empty cpuset to a parent */
+ if (nodes_empty(cp->mems_allowed))
+ remove_tasks_in_empty_cpuset(cp);
+ else
+ update_tasks_nodemask(cp, &oldmems, NULL);
}
}
}
@@ -2080,8 +2107,11 @@ static void scan_for_empty_cpusets(struct cpuset *root)
*
* Called within get_online_cpus(). Needs to call cgroup_lock()
* before calling generate_sched_domains().
+ *
+ * @cpu_online: Indicates whether this is a CPU online event (true) or
+ * a CPU offline event (false).
*/
-void cpuset_update_active_cpus(void)
+void cpuset_update_active_cpus(bool cpu_online)
{
struct sched_domain_attr *attr;
cpumask_var_t *doms;
@@ -2091,7 +2121,7 @@ void cpuset_update_active_cpus(void)
mutex_lock(&callback_mutex);
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
mutex_unlock(&callback_mutex);
- scan_for_empty_cpusets(&top_cpuset);
+ scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();
@@ -2122,9 +2152,9 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
case MEM_OFFLINE:
/*
* needn't update top_cpuset.mems_allowed explicitly because
- * scan_for_empty_cpusets() will update it.
+ * scan_cpusets_upon_hotplug() will update it.
*/
- scan_for_empty_cpusets(&top_cpuset);
+ scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_MEM_OFFLINE);
break;
default:
break;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a68..55cfe8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6812,7 +6812,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_ONLINE:
case CPU_DOWN_FAILED:
- cpuset_update_active_cpus();
+ cpuset_update_active_cpus(true);
return NOTIFY_OK;
default:
return NOTIFY_DONE;
@@ -6824,7 +6824,7 @@ static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
{
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
- cpuset_update_active_cpus();
+ cpuset_update_active_cpus(false);
return NOTIFY_OK;
default:
return NOTIFY_DONE;
The kernel doesn't do a good job of differentiating between the following
2 events:
a. altering a cpuset's cpus_allowed mask, as per a user request
b. a CPU hotplug operation.
As a result, we have the following pain points:
1. The cpuset configuration set by the user goes haywire after a CPU hotplug,
most noticeably after a suspend/resume or hibernation/restore.
This is because the kernel tries to accommodate its workload on the
available hardware resources (online cpus) upon a CPU hotplug event, and in
the meantime, forgets the user's original preferences for the cpuset.
2. It gets worse than that. The kernel chooses to *move* tasks from one
cpuset to another upon unfavourable CPU hotplug operations. This is even
more irksome from the user's point of view.
Of course, in doing all this, the kernel was only trying to help the user,
but it turns out to be more of a pain than anything useful. Luckily, this
problem _can_ be solved properly, while still being "correct" from both the
user's and the kernel's points of view.
The solution can be logically summarized as follows:
1. The kernel will remember the cpuset's cpus_allowed mask set by the user
and will not alter it. (It can be altered only by an explicit request
by the user, by writing to the cpuset.cpus file.)
2. When CPU Hotplug events occur, the kernel will try to run the tasks on
the remaining online cpus in that cpuset.
i.e., mask = (cpus_allowed mask set by user) & (cpu_active_mask)
However, when the last online cpu in that cpuset is taken offline, the
kernel will run these tasks on the set of online cpus pertaining to some
parent cpuset. So here, the change from the existing behaviour is only this:
instead of *moving* the tasks to a parent cpuset, we retain them in their
cpusets, and run them on the parent cpuset's cpu mask. This solves
pain point #2.
3. However, we don't want to keep the user in the dark, as to what cpus
the tasks in the cpuset are actually running on. So, the kernel exposes
a new per-cpuset file that indicates the true internal state of affairs -
the set of cpus that the tasks in the cpuset are actually running on.
(This is nothing but 'mask' mentioned in the equation above).
4. Because of the mask calculation shown above, cpu offline + cpu online,
or suspend/resume etc. are automatically handled: if a cpu goes offline,
it is removed from the cpuset; if it comes back online, it is added back to
the cpuset. No surprises! All in all, this solves pain point #1.
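To make these semantics concrete, here is a minimal user-space sketch of the
scheme (masks modelled as plain bitmasks; all names are illustrative and this
is not the kernel code):

#include <stdio.h>

/* Illustrative model of a cpuset with the two masks described above. */
struct cpuset_model {
        unsigned long user_cpus_allowed; /* user's preference; hotplug never touches it */
        unsigned long cpus_allowed;      /* what the tasks actually run on */
        struct cpuset_model *parent;
};

/* mask = (cpus_allowed mask set by user) & (cpu_active_mask), falling
 * back to the parent's mask if the result is empty. Parents must be
 * updated before their children (top-down traversal). */
static void hotplug_update(struct cpuset_model *cs, unsigned long active)
{
        cs->cpus_allowed = cs->user_cpus_allowed & active;
        if (!cs->cpus_allowed && cs->parent)
                cs->cpus_allowed = cs->parent->cpus_allowed;
}

int main(void)
{
        struct cpuset_model root = { 0xf, 0xf, NULL };  /* cpus 0-3 */
        struct cpuset_model cs   = { 0x4, 0x4, &root }; /* cpu 2 only */

        /* cpu 2 goes offline: the tasks run on the parent's cpus meanwhile */
        hotplug_update(&root, 0xb);
        hotplug_update(&cs, 0xb);
        printf("offline: user=%#lx actual=%#lx\n",
               cs.user_cpus_allowed, cs.cpus_allowed);

        /* cpu 2 comes back (e.g. resume): the user's setting is restored */
        hotplug_update(&root, 0xf);
        hotplug_update(&cs, 0xf);
        printf("online:  user=%#lx actual=%#lx\n",
               cs.user_cpus_allowed, cs.cpus_allowed);
        return 0;
}

The point is that user_cpus_allowed never changes behind the user's back, so
the cpuset recovers its exact configuration once the cpus return.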
Implementation details:
----------------------
To properly handle updates to cpusets during CPU Hotplug, introduce a new
per-cpuset mask called user_cpus_allowed, and also expose a new per-cpuset
file in userspace called cpuset.actual_cpus with the following semantics:
Userspace file                    Kernel variable/mask
==============                    ====================
cpuset.cpus              <= will map to =>  user_cpus_allowed (new)
cpuset.actual_cpus (new) <= will map to =>  cpus_allowed
The user_cpus_allowed mask will be used to track the user's preferences for
the cpuset. That is, the kernel will update it upon writes to the .cpus file,
but not during CPU hotplug events. That way, the mask will always represent
the user's preferences for the cpuset.
The cpus_allowed mask will begin by reflecting the user_cpus_allowed mask.
However, during CPU hotplug events, the kernel is free to update this mask
suitably to ensure that the tasks in the cpuset have at least one online CPU
to run on.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 140 ++++++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 112 insertions(+), 28 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index f1de35b..4bafbc4 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -92,7 +92,25 @@ struct cpuset {
struct cgroup_subsys_state css;
unsigned long flags; /* "unsigned long" so bitops work */
- cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
+
+ /*
+ * CPUs allowed to tasks in this cpuset, as per user preference. The
+ * kernel doesn't modify this in spite of CPU hotplug events. Modified
+ * only upon explicit user request (i.e., upon writes to the cpuset's
+ * .cpus file).
+ * Reflected in userspace as (r/w) cpuset.cpus file.
+ */
+ cpumask_var_t user_cpus_allowed;
+
+ /*
+ * CPUs that the tasks in this cpuset can actually run on. To begin
+ * with, it is the same as user_cpus_allowed. But in the case of CPU
+ * hotplug events, the kernel is free to modify this mask, to ensure
+ * that the tasks run on *some* CPU.
+ * Reflected in userspace as the (read-only) cpuset.actual_cpus file.
+ */
+ cpumask_var_t cpus_allowed;
+
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
struct cpuset *parent; /* my parent */
@@ -272,7 +290,7 @@ static struct file_system_type cpuset_fs_type = {
};
/*
- * Return in pmask the portion of a cpusets's cpus_allowed that
+ * Return in pmask the portion of a cpusets's user_cpus_allowed that
* are online. If none are online, walk up the cpuset hierarchy
* until we find one that does have some online cpus. If we get
* all the way to the top and still haven't found any online cpus,
@@ -288,10 +306,12 @@ static struct file_system_type cpuset_fs_type = {
static void guarantee_online_cpus(const struct cpuset *cs,
struct cpumask *pmask)
{
- while (cs && !cpumask_intersects(cs->cpus_allowed, cpu_online_mask))
+ while (cs && !cpumask_intersects(cs->user_cpus_allowed,
+ cpu_online_mask))
cs = cs->parent;
+
if (cs)
- cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
+ cpumask_and(pmask, cs->user_cpus_allowed, cpu_online_mask);
else
cpumask_copy(pmask, cpu_online_mask);
BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
@@ -351,7 +371,7 @@ static void cpuset_update_task_spread_flag(struct cpuset *cs,
static int is_cpuset_subset(const struct cpuset *p, const struct cpuset *q)
{
- return cpumask_subset(p->cpus_allowed, q->cpus_allowed) &&
+ return cpumask_subset(p->user_cpus_allowed, q->user_cpus_allowed) &&
nodes_subset(p->mems_allowed, q->mems_allowed) &&
is_cpu_exclusive(p) <= is_cpu_exclusive(q) &&
is_mem_exclusive(p) <= is_mem_exclusive(q);
@@ -369,13 +389,23 @@ static struct cpuset *alloc_trial_cpuset(const struct cpuset *cs)
if (!trial)
return NULL;
- if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL)) {
- kfree(trial);
- return NULL;
- }
+ if (!alloc_cpumask_var(&trial->user_cpus_allowed, GFP_KERNEL))
+ goto out_trial;
+
+ if (!alloc_cpumask_var(&trial->cpus_allowed, GFP_KERNEL))
+ goto out_user_cpus_allowed;
+
+ cpumask_copy(trial->user_cpus_allowed, cs->user_cpus_allowed);
cpumask_copy(trial->cpus_allowed, cs->cpus_allowed);
return trial;
+
+ out_user_cpus_allowed:
+ free_cpumask_var(trial->user_cpus_allowed);
+
+ out_trial:
+ kfree(trial);
+ return NULL;
}
/**
@@ -384,6 +414,7 @@ static struct cpuset *alloc_trial_cpuset(const struct cpuset *cs)
*/
static void free_trial_cpuset(struct cpuset *trial)
{
+ free_cpumask_var(trial->user_cpus_allowed);
free_cpumask_var(trial->cpus_allowed);
kfree(trial);
}
@@ -402,7 +433,7 @@ static void free_trial_cpuset(struct cpuset *trial)
* cpuset in the list must use cur below, not trial.
*
* 'trial' is the address of bulk structure copy of cur, with
- * perhaps one or more of the fields cpus_allowed, mems_allowed,
+ * perhaps one or more of the fields user_cpus_allowed, mems_allowed,
* or flags changed to new, trial values.
*
* Return 0 if valid, -errno if not.
@@ -437,7 +468,8 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
c = cgroup_cs(cont);
if ((is_cpu_exclusive(trial) || is_cpu_exclusive(c)) &&
c != cur &&
- cpumask_intersects(trial->cpus_allowed, c->cpus_allowed))
+ cpumask_intersects(trial->user_cpus_allowed,
+ c->user_cpus_allowed))
return -EINVAL;
if ((is_mem_exclusive(trial) || is_mem_exclusive(c)) &&
c != cur &&
@@ -445,9 +477,12 @@ static int validate_change(const struct cpuset *cur, const struct cpuset *trial)
return -EINVAL;
}
- /* Cpusets with tasks can't have empty cpus_allowed or mems_allowed */
+ /*
+ * Cpusets with tasks can't have empty user_cpus_allowed or
+ * mems_allowed
+ */
if (cgroup_task_count(cur->css.cgroup)) {
- if (cpumask_empty(trial->cpus_allowed) ||
+ if (cpumask_empty(trial->user_cpus_allowed) ||
nodes_empty(trial->mems_allowed)) {
return -ENOSPC;
}
@@ -554,6 +589,10 @@ update_domain_attr_tree(struct sched_domain_attr *dattr, struct cpuset *c)
* all cpusets having the same 'pn' value then form the one
* element of the partition (one sched domain) to be passed to
* partition_sched_domains().
+ *
+ * Note: We use the cpuset's cpus_allowed mask and *not* the
+ * user_cpus_allowed mask, because cpus_allowed is the cpu mask that
+ * we actually want to run the tasks on.
*/
static int generate_sched_domains(cpumask_var_t **domains,
struct sched_domain_attr **attributes)
@@ -862,7 +901,8 @@ static void update_tasks_cpumask(struct cpuset *cs, struct ptr_heap *heap)
}
/**
- * update_cpumask - update the cpus_allowed mask of a cpuset and all tasks in it
+ * update_cpumask - update the user_cpus_allowed mask of a cpuset and all
+ * tasks in it. Note that this is a user request.
* @cs: the cpuset to consider
* @buf: buffer of cpu numbers written to this cpuset
*/
@@ -873,24 +913,28 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
int retval;
int is_load_balanced;
- /* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
+ /*
+ * top_cpuset.user_cpus_allowed tracks cpu_online_mask;
+ * it's read-only
+ */
if (cs == &top_cpuset)
return -EACCES;
/*
- * An empty cpus_allowed is ok only if the cpuset has no tasks.
+ * An empty user_cpus_allowed is ok only if the cpuset has no tasks.
* Since cpulist_parse() fails on an empty mask, we special case
* that parsing. The validate_change() call ensures that cpusets
* with tasks have cpus.
*/
if (!*buf) {
- cpumask_clear(trialcs->cpus_allowed);
+ cpumask_clear(trialcs->user_cpus_allowed);
} else {
- retval = cpulist_parse(buf, trialcs->cpus_allowed);
+ retval = cpulist_parse(buf, trialcs->user_cpus_allowed);
if (retval < 0)
return retval;
- if (!cpumask_subset(trialcs->cpus_allowed, cpu_active_mask))
+ if (!cpumask_subset(trialcs->user_cpus_allowed,
+ cpu_active_mask))
return -EINVAL;
}
retval = validate_change(cs, trialcs);
@@ -898,7 +942,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
return retval;
/* Nothing to do if the cpus didn't change */
- if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
+ if (cpumask_equal(cs->user_cpus_allowed, trialcs->user_cpus_allowed))
return 0;
retval = heap_init(&heap, PAGE_SIZE, GFP_KERNEL, NULL);
@@ -908,7 +952,9 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
is_load_balanced = is_sched_load_balance(trialcs);
mutex_lock(&callback_mutex);
- cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);
+ cpumask_copy(cs->user_cpus_allowed, trialcs->user_cpus_allowed);
+ /* Initialize the cpus_allowed mask too. */
+ cpumask_copy(cs->cpus_allowed, cs->user_cpus_allowed);
mutex_unlock(&callback_mutex);
/*
@@ -1395,7 +1441,7 @@ static int cpuset_can_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
* unnecessary. Thus, cpusets are not applicable for such
* threads. This prevents checking for success of
* set_cpus_allowed_ptr() on all attached tasks before
- * cpus_allowed may be changed.
+ * user_cpus_allowed may be changed.
*/
if (task->flags & PF_THREAD_BOUND)
return -EINVAL;
@@ -1454,6 +1500,7 @@ static void cpuset_attach(struct cgroup *cgrp, struct cgroup_taskset *tset)
typedef enum {
FILE_MEMORY_MIGRATE,
+ FILE_USERCPULIST,
FILE_CPULIST,
FILE_MEMLIST,
FILE_CPU_EXCLUSIVE,
@@ -1553,7 +1600,7 @@ static int cpuset_write_resmask(struct cgroup *cgrp, struct cftype *cft,
}
switch (cft->private) {
- case FILE_CPULIST:
+ case FILE_USERCPULIST:
retval = update_cpumask(cs, trialcs, buf);
break;
case FILE_MEMLIST:
@@ -1582,6 +1629,17 @@ out:
* across a page fault.
*/
+static size_t cpuset_sprintf_usercpulist(char *page, struct cpuset *cs)
+{
+ size_t count;
+
+ mutex_lock(&callback_mutex);
+ count = cpulist_scnprintf(page, PAGE_SIZE, cs->user_cpus_allowed);
+ mutex_unlock(&callback_mutex);
+
+ return count;
+}
+
static size_t cpuset_sprintf_cpulist(char *page, struct cpuset *cs)
{
size_t count;
@@ -1622,6 +1680,9 @@ static ssize_t cpuset_common_file_read(struct cgroup *cont,
s = page;
switch (type) {
+ case FILE_USERCPULIST:
+ s += cpuset_sprintf_usercpulist(s, cs);
+ break;
case FILE_CPULIST:
s += cpuset_sprintf_cpulist(s, cs);
break;
@@ -1697,7 +1758,14 @@ static struct cftype files[] = {
.read = cpuset_common_file_read,
.write_string = cpuset_write_resmask,
.max_write_len = (100U + 6 * NR_CPUS),
+ .private = FILE_USERCPULIST,
+ },
+
+ {
+ .name = "actual_cpus",
+ .read = cpuset_common_file_read,
.private = FILE_CPULIST,
+ .mode = S_IRUGO,
},
{
@@ -1826,6 +1894,7 @@ static void cpuset_post_clone(struct cgroup *cgroup)
mutex_lock(&callback_mutex);
cs->mems_allowed = parent_cs->mems_allowed;
+ cpumask_copy(cs->user_cpus_allowed, parent_cs->user_cpus_allowed);
cpumask_copy(cs->cpus_allowed, parent_cs->cpus_allowed);
mutex_unlock(&callback_mutex);
return;
@@ -1848,10 +1917,12 @@ static struct cgroup_subsys_state *cpuset_create(struct cgroup *cont)
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
- if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
- kfree(cs);
- return ERR_PTR(-ENOMEM);
- }
+
+ if (!alloc_cpumask_var(&cs->user_cpus_allowed, GFP_KERNEL))
+ goto out_cs;
+
+ if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
+ goto out_user_cpus_allowed;
cs->flags = 0;
if (is_spread_page(parent))
@@ -1859,6 +1930,7 @@ static struct cgroup_subsys_state *cpuset_create(struct cgroup *cont)
if (is_spread_slab(parent))
set_bit(CS_SPREAD_SLAB, &cs->flags);
set_bit(CS_SCHED_LOAD_BALANCE, &cs->flags);
+ cpumask_clear(cs->user_cpus_allowed);
cpumask_clear(cs->cpus_allowed);
nodes_clear(cs->mems_allowed);
fmeter_init(&cs->fmeter);
@@ -1867,6 +1939,12 @@ static struct cgroup_subsys_state *cpuset_create(struct cgroup *cont)
cs->parent = parent;
number_of_cpusets++;
return &cs->css ;
+
+ out_user_cpus_allowed:
+ free_cpumask_var(cs->user_cpus_allowed);
+ out_cs:
+ kfree(cs);
+ return ERR_PTR(-ENOMEM);
}
/*
@@ -1883,6 +1961,7 @@ static void cpuset_destroy(struct cgroup *cont)
update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
number_of_cpusets--;
+ free_cpumask_var(cs->user_cpus_allowed);
free_cpumask_var(cs->cpus_allowed);
kfree(cs);
}
@@ -1909,9 +1988,13 @@ int __init cpuset_init(void)
{
int err = 0;
+ if (!alloc_cpumask_var(&top_cpuset.user_cpus_allowed, GFP_KERNEL))
+ BUG();
+
if (!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL))
BUG();
+ cpumask_setall(top_cpuset.user_cpus_allowed);
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
@@ -2166,13 +2249,14 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
#endif
/**
- * cpuset_init_smp - initialize cpus_allowed
+ * cpuset_init_smp - initialize user_cpus_allowed and cpus_allowed
*
* Description: Finish top cpuset after cpu, node maps are initialized
**/
void __init cpuset_init_smp(void)
{
+ cpumask_copy(top_cpuset.user_cpus_allowed, cpu_active_mask);
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
top_cpuset.mems_allowed = node_states[N_HIGH_MEMORY];
Now that we have 2 per-cpuset masks, namely user_cpus_allowed and
cpus_allowed, implement the CPU Hotplug handling for cpusets.
The cpuset update upon hotplug simplifies to:
For any CPU hotplug event (online/offline), traverse the cpuset hierarchy
top-down, doing:
1. cpus_allowed mask = (user_cpus_allowed mask) & (cpu_active_mask)
2. If the resulting cpus_allowed mask is empty,
cpus_allowed mask = parent cpuset's cpus_allowed mask
(Because of the top-down traversal, and the guarantee that the root cpuset
will always have online cpus, it is enough to copy the parent's
cpus_allowed mask.)
3. No need to move tasks from one cpuset to another, during any CPU Hotplug
operation.
This ensures that we are as close to the user's preference as possible,
within the constraints imposed by CPU hotplug.
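As a sanity check of step 2 above, here is a small user-space sketch (plain
C with illustrative names; not the kernel code) that performs the same
breadth-first walk as traverse_cpusets() over a toy hierarchy. Since every
cpuset is dequeued only after its parent has been updated, copying the
parent's cpus_allowed in the empty case can never yield an empty mask:

#include <stdio.h>

struct cs {
        const char *name;
        unsigned long user_mask;   /* user_cpus_allowed analogue */
        unsigned long actual_mask; /* cpus_allowed analogue */
        struct cs *parent;
        struct cs *child;          /* single child, for simplicity */
};

int main(void)
{
        /* root(0-3) -> mid(2-3) -> leaf(2); now take cpus 2 and 3 offline */
        struct cs leaf = { "leaf", 0x4, 0x4, NULL, NULL };
        struct cs mid  = { "mid",  0xc, 0xc, NULL, &leaf };
        struct cs root = { "root", 0xf, 0xf, NULL, &mid  };
        unsigned long active = 0x3;     /* only cpus 0 and 1 remain online */
        struct cs *queue[3];
        int head = 0, tail = 0;

        mid.parent = &root;
        leaf.parent = &mid;

        queue[tail++] = &root;
        while (head < tail) {
                struct cs *cp = queue[head++];  /* dequeue: BFS, top-down */

                if (cp->child)
                        queue[tail++] = cp->child;

                cp->actual_mask = cp->user_mask & active;
                /* The parent was updated first, so its mask is non-empty. */
                if (!cp->actual_mask && cp->parent)
                        cp->actual_mask = cp->parent->actual_mask;

                printf("%s: actual=%#lx\n", cp->name, cp->actual_mask);
        }
        return 0;
}

Here both 'mid' and 'leaf' lose all their cpus, and both end up running on
cpus 0-1 (the root's mask), without any task ever being moved to another
cpuset.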
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 78 +++++++++++++++++++++++++++++++++++++++++--------------
1 files changed, 58 insertions(+), 20 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 4bafbc4..c501a90 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -167,6 +167,7 @@ typedef enum {
/* the type of hotplug event */
enum hotplug_event {
+ CPUSET_CPU_ONLINE,
CPUSET_CPU_OFFLINE,
CPUSET_MEM_OFFLINE,
};
@@ -2056,11 +2057,10 @@ static void move_member_tasks_to_cpuset(struct cpuset *from, struct cpuset *to)
}
/*
- * If CPU and/or memory hotplug handlers, below, unplug any CPUs
- * or memory nodes, we need to walk over the cpuset hierarchy,
- * removing that CPU or node from all cpusets. If this removes the
- * last CPU or node from a cpuset, then move the tasks in the empty
- * cpuset to its next-highest non-empty parent.
+ * If memory hotplug handlers, below, unplug any memory nodes, we need
+ * to walk over the cpuset hierarchy, removing that node from all cpusets.
+ * If this removes the last node from a cpuset, then move the tasks in
+ * the empty cpuset to its next-highest non-empty parent.
*
* Called with cgroup_mutex held
* callback_mutex must not be held, as cpuset_attach() will take it.
@@ -2079,11 +2079,10 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
/*
* Find its next-highest non-empty parent, (top cpuset
- * has online cpus, so can't be empty).
+ * has online cpus and memory nodes, so can't be empty).
*/
parent = cs->parent;
- while (cpumask_empty(parent->cpus_allowed) ||
- nodes_empty(parent->mems_allowed))
+ while (nodes_empty(parent->mems_allowed))
parent = parent->parent;
move_member_tasks_to_cpuset(cs, parent);
@@ -2107,8 +2106,12 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
/*
- * Walk the specified cpuset subtree and look for empty cpusets.
- * The tasks of such cpuset must be moved to a parent cpuset.
+ * Walk the specified cpuset subtree upon a hotplug operation (CPU/Memory
+ * online/offline) and update the cpusets accordingly.
+ *
+ * For CPU Hotplug, update the cpus_allowed mask of the cpuset such that
+ * it has online cpus for the tasks in the cpuset to run on, without
+ * deviating much from the user set preference for the cpuset.
*
* Called with cgroup_mutex held. We take callback_mutex to modify
* cpus_allowed and mems_allowed.
@@ -2119,7 +2122,8 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
*
* For now, since we lack memory hot unplug, we'll never see a cpuset
* that has tasks along with an empty 'mems'. But if we did see such
- * a cpuset, we'd handle it just like we do if its 'cpus' was empty.
+ * a cpuset, we would move the tasks of such a cpuset to a non-empty parent
+ * cpuset.
*/
static void
scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
@@ -2131,6 +2135,30 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
list_add_tail((struct list_head *)&root->stack_list, &queue);
switch (event) {
+ case CPUSET_CPU_ONLINE:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
+
+ /*
+ * Continue past cpusets which don't need to be
+ * updated.
+ */
+ if (cpumask_equal(cp->user_cpus_allowed,
+ cp->cpus_allowed))
+ continue;
+
+ /*
+ * Restore new CPU to this cpuset if it was
+ * originally present in this cpuset.
+ */
+ mutex_lock(&callback_mutex);
+ cpumask_and(cp->cpus_allowed,
+ cp->user_cpus_allowed, cpu_active_mask);
+ mutex_unlock(&callback_mutex);
+
+ update_tasks_cpumask(cp, NULL);
+ }
+ break;
case CPUSET_CPU_OFFLINE:
while (!list_empty(&queue)) {
cp = traverse_cpusets(&queue);
@@ -2141,15 +2169,22 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
/* Remove offline cpus from this cpuset. */
mutex_lock(&callback_mutex);
- cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_active_mask);
+ cpumask_and(cp->cpus_allowed,
+ cp->user_cpus_allowed, cpu_active_mask);
mutex_unlock(&callback_mutex);
- /* Move tasks from the empty cpuset to a parent */
+ /*
+ * If cpuset is empty, copy parent cpuset's
+ * cpus_allowed mask. Because our traversal is
+ * top-down, and because the root cpuset will always
+ * have online cpus, it is sufficient to copy the
+ * parent cpuset's mask here.
+ */
if (cpumask_empty(cp->cpus_allowed))
- remove_tasks_in_empty_cpuset(cp);
- else
- update_tasks_cpumask(cp, NULL);
+ cpumask_copy(cp->cpus_allowed,
+ cp->parent->cpus_allowed);
+
+ update_tasks_cpumask(cp, NULL);
}
break;
@@ -2185,8 +2220,9 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
* (of no affect) on systems that are actively using CPU hotplug
* but making no active use of cpusets.
*
- * This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_active_mask on each CPU hotplug (cpuhp) event.
+ * This routine ensures that top_cpuset.user_cpus_allowed and
+ * top_cpuset.cpus_allowed track cpu_active_mask on each CPU hotplug
+ * (cpuhp) event.
*
* Called within get_online_cpus(). Needs to call cgroup_lock()
* before calling generate_sched_domains().
@@ -2202,9 +2238,11 @@ void cpuset_update_active_cpus(bool cpu_online)
cgroup_lock();
mutex_lock(&callback_mutex);
+ cpumask_copy(top_cpuset.user_cpus_allowed, cpu_active_mask);
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
mutex_unlock(&callback_mutex);
- scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
+ scan_cpusets_upon_hotplug(&top_cpuset,
+ cpu_online ? CPUSET_CPU_ONLINE : CPUSET_CPU_OFFLINE);
ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();
Add documentation for the newly introduced cpuset.actual_cpus file and
describe the new semantics for updating cpusets upon CPU hotplug.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
Documentation/cgroups/cpusets.txt | 43 +++++++++++++++++++++++++------------
1 files changed, 29 insertions(+), 14 deletions(-)
diff --git a/Documentation/cgroups/cpusets.txt b/Documentation/cgroups/cpusets.txt
index cefd3d8..374b9d2 100644
--- a/Documentation/cgroups/cpusets.txt
+++ b/Documentation/cgroups/cpusets.txt
@@ -168,7 +168,12 @@ Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:
- - cpuset.cpus: list of CPUs in that cpuset
+ - cpuset.cpus: list of CPUs in that cpuset, as set by the user;
+ the kernel will not alter this upon CPU hotplug;
+ this file has read/write permissions
+ - cpuset.actual_cpus: list of CPUs actually available for the tasks in the
+ cpuset; the kernel can change this in the event of
+ CPU hotplug; this file is read-only
- cpuset.mems: list of Memory Nodes in that cpuset
- cpuset.memory_migrate flag: if set, move pages to cpusets nodes
- cpuset.cpu_exclusive flag: is cpu placement exclusive?
@@ -640,16 +645,25 @@ prior 'cpuset.mems' setting, will not be moved.
There is an exception to the above. If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
-then all the tasks in that cpuset will be moved to the nearest ancestor
-with non-empty cpus. But the moving of some (or all) tasks might fail if
-cpuset is bound with another cgroup subsystem which has some restrictions
-on task attaching. In this failing case, those tasks will stay
-in the original cpuset, and the kernel will automatically update
-their cpus_allowed to allow all online CPUs. When memory hotplug
-functionality for removing Memory Nodes is available, a similar exception
-is expected to apply there as well. In general, the kernel prefers to
-violate cpuset placement, over starving a task that has had all
-its allowed CPUs or Memory Nodes taken offline.
+then the cpuset hierarchy is traversed, searching for the nearest
+ancestor whose cpu mask has at least one online cpu. Then the tasks in
+the empty cpuset will be run on the cpus specified in that ancestor's cpu mask.
+Note that during CPU hotplug operations, the tasks in a cpuset will not
+be moved from one cpuset to another; only the cpu mask of that cpuset
+will be updated to ensure that there is at least one online cpu, by trying
+to closely resemble the cpu mask of the nearest non-empty ancestor containing
+online cpus.
+
+When memory hotplug functionality for removing Memory Nodes is available,
+if all the memory nodes currently assigned to a cpuset are removed via
+hotplug, then all the tasks in that cpuset will be moved to the nearest
+ancestor with non-empty memory nodes. But the moving of some (or all)
+tasks might fail if cpuset is bound with another cgroup subsystem which
+has some restrictions on task attaching. In this failing case, those
+tasks will stay in the original cpuset, and the kernel will automatically
+update their mems_allowed to allow all online nodes.
+In general, the kernel prefers to violate cpuset placement, over starving
+a task that has had all its allowed CPUs or Memory Nodes taken offline.
There is a second exception to the above. GFP_ATOMIC requests are
kernel internal allocations that must be satisfied, immediately.
@@ -730,9 +744,10 @@ cgroup.event_control cpuset.memory_spread_page
cgroup.procs cpuset.memory_spread_slab
cpuset.cpu_exclusive cpuset.mems
cpuset.cpus cpuset.sched_load_balance
-cpuset.mem_exclusive cpuset.sched_relax_domain_level
-cpuset.mem_hardwall notify_on_release
-cpuset.memory_migrate tasks
+cpuset.actual_cpus cpuset.sched_relax_domain_level
+cpuset.mem_exclusive notify_on_release
+cpuset.mem_hardwall tasks
+cpuset.memory_migrate
Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
A cpuset's cpus_allowed mask is kept updated upon CPU hotplug
events in such a way that it always contains online cpus. Using
this observation, rework the body of guarantee_online_cpus().
Signed-off-by: Srivatsa S. Bhat <[email protected]>
---
kernel/cpuset.c | 15 +++++++++------
1 files changed, 9 insertions(+), 6 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index c501a90..6446095 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -301,20 +301,23 @@ static struct file_system_type cpuset_fs_type = {
* One way or another, we guarantee to return some non-empty subset
* of cpu_online_mask.
*
+ * Optimization: The cpus_allowed mask of a cpuset is maintained in such
+ * a way as to always have online cpus, by doing the cpuset hierarchy walk
+ * if/when necessary during CPU hotplug. And hence, it fits the above
+ * requirement perfectly. The only point to watch out for is that cpus_allowed
+ * can be empty when the cpuset has no tasks and user_cpus_allowed is empty.
+ *
* Call with callback_mutex held.
*/
static void guarantee_online_cpus(const struct cpuset *cs,
struct cpumask *pmask)
{
- while (cs && !cpumask_intersects(cs->user_cpus_allowed,
- cpu_online_mask))
- cs = cs->parent;
-
- if (cs)
- cpumask_and(pmask, cs->user_cpus_allowed, cpu_online_mask);
+ if (cs && !cpumask_empty(cs->cpus_allowed))
+ cpumask_copy(pmask, cs->cpus_allowed);
else
cpumask_copy(pmask, cpu_online_mask);
+
BUG_ON(!cpumask_intersects(pmask, cpu_online_mask));
}
cpuset_track_online_cpus() is no longer present. So remove the outdated
comment and replace it with a reference to cpuset_update_active_cpus(),
its equivalent.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
---
kernel/cpuset.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 6446095..2fdbe10 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2257,7 +2257,7 @@ void cpuset_update_active_cpus(bool cpu_online)
/*
* Keep top_cpuset.mems_allowed tracking node_states[N_HIGH_MEMORY].
* Call this routine anytime after node_states[N_HIGH_MEMORY] changes.
- * See also the previous routine cpuset_track_online_cpus().
+ * See cpuset_update_active_cpus() for CPU hotplug handling.
*/
static int cpuset_track_online_nodes(struct notifier_block *self,
unsigned long action, void *arg)
> Documentation/cgroups/cpusets.txt | 43 +++--
> include/linux/cpuset.h | 4
> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
> kernel/sched/core.c | 4
> 4 files changed, 274 insertions(+), 94 deletions(-)
Bah, I really hate this complexity you've created for a problem that
really doesn't exist.
So why not fix the active mask crap?
On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
>
>> Documentation/cgroups/cpusets.txt | 43 +++--
>> include/linux/cpuset.h | 4
>> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
>> kernel/sched/core.c | 4
>> 4 files changed, 274 insertions(+), 94 deletions(-)
>
> Bah, I really hate this complexity you've created for a problem that
> really doesn't exist.
>
Doesn't exist? Well, I believe we do have a problem and a serious one
at that too!
The heart of the problem can be summarized in 2 sentences:
o During a CPU hotplug, tasks can move between cpusets, and never
come back to their original cpuset.
o Tasks might get pinned to lesser number of cpus, unreasonably.
Both these are undesirable from a system-admin point of view.
Moreover, having workarounds for this from userspace is way too messy and
ugly, if not impossible.
> So why not fix the active mask crap?
Because I doubt if that is the right way to approach this problem.
An updated cpu_active_mask not being the necessary and sufficient condition
for all scheduler related activities, is a different problem altogether, IMHO.
(Btw, Ingo had also suggested reworking this whole cpuset thing, while
reviewing the previous version of this fix.
http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133)
Also, we need to fix this problem at the CPU Hotplug level itself, and
not just for the suspend/resume case. Because, we have had numerous bug
reports and people complaining about this issue, in various scenarios,
including those that didn't involve suspend/resume.
I am sure some of the people in Cc will have more to add to this, but in
general, when the CPU hotplug (maybe even cpu offline + online) and the
cpuset administration are done asynchronously, it leads to nasty surprises.
In fact, there have been reports where people spent inordinate amounts of
time before they figured out that a long-forgotten cpu hotplug operation
was the root cause of a low-performing workload!
All these only suggest that it is time that we cleaned this up thoroughly,
and at the root cause level itself.
Btw, though there are 7 patches in this series, I don't think this patchset
increases the complexity of the code. In fact, it makes many things simpler
and saner/cleaner, IMHO.
Regards,
Srivatsa S. Bhat
On Sat, 2012-05-05 at 01:28 +0530, Srivatsa S. Bhat wrote:
> On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
>
> >
> >> Documentation/cgroups/cpusets.txt | 43 +++--
> >> include/linux/cpuset.h | 4
> >> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
> >> kernel/sched/core.c | 4
> >> 4 files changed, 274 insertions(+), 94 deletions(-)
> >
> > Bah, I really hate this complexity you've created for a problem that
> > really doesn't exist.
> >
>
>
> Doesn't exist? Well, I believe we do have a problem and a serious one
> at that too!
Still not convinced,..
> The heart of the problem can be summarized in 2 sentences:
>
> o During a CPU hotplug, tasks can move between cpusets, and never
> come back to their original cpuset.
This is a feature! You cannot say a task is part of a cpuset and then
run it elsewhere just because things don't work out.
That's actively violating the meaning of cpusets.
> o Tasks might get pinned to lesser number of cpus, unreasonably.
-ENOPARSE, are you trying to say that when the set contains 4 cpus and
you unplug one it's left with 3? Sounds pretty damn obvious, that's
what unplug does, it takes a cpu away.
> Both these are undesirable from a system-admin point of view.
Both of those are fundamental principles you cannot change.
> Moreover, having workarounds for this from userspace is way too messy and
> ugly, if not impossible.
There's nothing to work around -- with the exception of the suspend case
-- things work as they ought to.
> > So why not fix the active mask crap?
>
>
> Because I doubt if that is the right way to approach this problem.
>
> An updated cpu_active_mask not being the necessary and sufficient condition
> for all scheduler related activities, is a different problem altogether, IMHO.
It was the sole reason the previous, simple patch didn't work. So fixing
that seems important.
> (Btw, Ingo had also suggested reworking this whole cpuset thing, while
> reviewing the previous version of this fix.
> http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133)
I still maintain that what you're proposing is wrong. You simply cannot
run a task outside of the set for a little while and say that's ok.
A set becoming empty while still having tasks is a hard error and not
something that should be swept under the carpet. Currently we printk()
and move them to the parent set until we find a set with !0 cpus. I
think Paul Jackson was wrong there, he should have simply SIGKILL'ed the
tasks or failed the hotplug.
> Also, we need to fix this problem at the CPU Hotplug level itself, and
> not just for the suspend/resume case. Because, we have had numerous bug
> reports and people complaining about this issue, in various scenarios,
> including those that didn't involve suspend/resume.
NO, absolutely not and I will NAK any and all such nonsense. WTF is a
cpuset worth if you can run on random other cpus?
> I am sure some of the people in Cc will have more to add to this, but in
> general, when the CPU hotplug (maybe even cpu offline + online) and the
> cpuset administration are done asynchronously, it leads to nasty surprises.
> In fact, there have been reports where people spent inordinate amounts of
> time before they figured out that a long-forgotten cpu hotplug operation
> was the root cause of a low-performing workload!
Yeah so? I'm sure you can find infinite examples of clueless people
wasting time because they don't know how things work.
On Fri, 2012-05-04 at 22:14 +0200, Peter Zijlstra wrote:
> > Also, we need to fix this problem at the CPU Hotplug level itself, and
> > not just for the suspend/resume case. Because, we have had numerous bug
> > reports and people complaining about this issue, in various scenarios,
> > including those that didn't involve suspend/resume.
>
> NO, absolutely not and I will NAK any and all such nonsense. WTF is a
> cpuset worth if you can run on random other cpus?
Sorting your cpuset 'problem' is nowhere near enough to make hotplug
'safe'. Unplug also destroys task_struct::cpus_allowed.
Try it:
# schedtool -a 2 $$
# grep Cpus_allowed /proc/self/status
Cpus_allowed: 000004
# echo 0 > /sys/devices/system/cpu/cpu2/online
# grep Cpus_allowed /proc/self/status
Cpus_allowed: ffffff
See, hotplug is destructive, it has to be, there's no saying the cpu
will ever come back.
So mucking about trying to make cpusets non-destructive is pointless.
The real bug is people using hotplug (for all kinds of stupid stuff) and
expecting anything different.
On 04.05.2012 [22:14:16 +0200], Peter Zijlstra wrote:
> On Sat, 2012-05-05 at 01:28 +0530, Srivatsa S. Bhat wrote:
> > On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
> >
> > >
> > >> Documentation/cgroups/cpusets.txt | 43 +++--
> > >> include/linux/cpuset.h | 4
> > >> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
> > >> kernel/sched/core.c | 4
> > >> 4 files changed, 274 insertions(+), 94 deletions(-)
> > >
> > > Bah, I really hate this complexity you've created for a problem that
> > > really doesn't exist.
> > >
> >
> >
> > Doesn't exist? Well, I believe we do have a problem and a serious one
> > at that too!
>
> Still not convinced,..
>
> > The heart of the problem can be summarized in 2 sentences:
> >
> > o During a CPU hotplug, tasks can move between cpusets, and never
> > come back to their original cpuset.
>
> This is a feature! You cannot say a task is part of a cpuset and then
> run it elsewhere just because things don't work out.
>
> That's actively violating the meaning of cpusets.
Tbh, I agree with you Peter, as I think that's how cpusets *should*
work. But I'll also reference `man cpuset`:
    Not all allocations of system memory are constrained by cpusets,
    for the following reasons.

    If hot-plug functionality is used to remove all the CPUs that
    are currently assigned to a cpuset, then the kernel will
    automatically update the cpus_allowed of all processes attached
    to CPUs in that cpuset to allow all CPUs. When memory hot-plug
    functionality for removing memory nodes is available, a similar
    exception is expected to apply there as well. In general, the
    kernel prefers to violate cpuset placement, rather than starving
    a process that has had all its allowed CPUs or memory nodes taken
    offline. User code should reconfigure cpusets to only refer to
    online CPUs and memory nodes when using hot-plug to add or remove
    such resources.
So cpusets are, per their own documentation, not hard-limits in the face
of hotplug.
I, personally, think we should just kill off tasks in cpuset-constrained
environments that are nonsensical (no memory, no cpus, etc.). But, it
would seem we've already supported this (inherit the parent in the face
of hotplug) behavior in the past. Not sure we should break it ... at
least on the surface.
> > o Tasks might get pinned to lesser number of cpus, unreasonably.
>
> -ENOPARSE, are you trying to say that when the set contains 4 cpus and
> you unplug one it's left with 3? Sounds pretty damn obvious, that's
> what unplug does, it takes a cpu away.
I think he's saying that it's pinned to 3 forever, even if that 4th CPU
is re-plugged.
> > Both these are undesirable from a system-admin point of view.
>
> Both of those are fundamental principles you cannot change.
I see what you did there :)
<snip>
> > (Btw, Ingo had also suggested reworking this whole cpuset thing, while
> > reviewing the previous version of this fix.
> > http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1252133)
>
> I still maintain that what you're proposing is wrong. You simply cannot
> run a task outside of the set for a little while and say that's ok.
>
> A set becoming empty while still having tasks is a hard error and not
> something that should be swept under the carpet. Currently we printk()
> and move them to the parent set until we find a set with !0 cpus. I
> think Paul Jackson was wrong there, he should have simply SIGKILL'ed the
> tasks or failed the hotplug.
Ah, excuse my quoting of the man-page, it would seem you are aware of
the pre-existing behavior.
So, I think I'm ok with putting the onus of all this on the
configuration owner -- don't configure/hotplug, etc. things stupidly.
We should change the cpusets implementation, then, though; update the
man-pages, etc.
So I can see several solutions:
- Rework cpusets to not be so nice to the user and kill off tasks that
run in stupid cpusets. (to be written)
- Keep current behavior to be nice to the user, but make it much noisier
when the cpuset rules are being broken because they are stupid (do
nothing choice)
- Track/restore the user's setup when it's possible to do so. (this
patchset)
I'm not sure any of these is "better" than the rest, but they probably
all have distinct merits.
How easy will it be for something like libvirt to handle that first
case? Can libvirt be modified to recognize that a VM has been killed due
to having an empty cpuset? And is that reasonable? What about other
users of cpusets (what are they?)?
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On 04.05.2012 [22:28:01 +0200], Peter Zijlstra wrote:
> On Fri, 2012-05-04 at 22:14 +0200, Peter Zijlstra wrote:
> > > Also, we need to fix this problem at the CPU Hotplug level itself, and
> > > not just for the suspend/resume case. Because, we have had numerous bug
> > > reports and people complaining about this issue, in various scenarios,
> > > including those that didn't involve suspend/resume.
> >
> > NO, absolutely not and I will NAK any and all such nonsense. WTF is a
> > cpuset worth if you can run on random other cpus?
>
> Sorting your cpuset 'problem' is nowhere near enough to make hotplug
> 'safe'. Unplug also destroys task_struct::cpus_allowed.
>
> Try it:
>
> # schedtool -a 2 $$
> # grep Cpus_allowed /proc/self/status
> Cpus_allowed: 000004
> # echo 0 > /sys/devices/system/cpu/cpu2/online
> # grep Cpus_allowed /proc/self/status
> Cpus_allowed: ffffff
>
>
> See, hotplug is destructive, it has to be, there's no saying the cpu
> will ever come back.
I think it's ok for hotplug to be destructive. But I guess I'm not
entirely sure why cpusets can't retain user-inputted
configuration/policy information even while destroying things currently?
And re-instating that policy if possible in the future?
> So mucking about trying to make cpusets non-destructive is pointless.
>
> The real bug is people using hotplug (for all kinds of stupid stuff) and
> expecting anything different.
Probably true :)
-Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On Fri, 2012-05-04 at 13:46 -0700, Nishanth Aravamudan wrote:
> What about other users of cpusets (what are they?)?
cpusets came from SGI, its traditionally used to partition _large_
machines. Things like the batch/job-schedulers that go with that type of
setup use it.
I've no clue why libvirt uses it (or why one would use libvirt for that
matter).
On Fri, 2012-05-04 at 13:49 -0700, Nishanth Aravamudan wrote:
> I think it's ok for hotplug to be destructive. But I guess I'm not
> entirely sure why cpusets can't retain user-inputted
> configuration/policy information even while destroying things currently?
> And re-instating that policy if possible in the future?
Two issues here:
- if you retain it for cpuset but not others that's confusing (too);
- we never retain anything, if you unload a module you lose all state
that was associated with it too. What makes this special?
On 04.05.2012 [23:01:34 +0200], Peter Zijlstra wrote:
> On Fri, 2012-05-04 at 13:49 -0700, Nishanth Aravamudan wrote:
> > I think it's ok for hotplug to be destructive. But I guess I'm not
> > entirely sure why cpusets can't retain user-inputted
> > configuration/policy information even while destroying things currently?
> > And re-instating that policy if possible in the future?
>
> Two issues here:
> - if you retain it for cpuset but not others that's confusing (too);
That's a good point.
Related, possibly counter-example, and perhaps I'm wrong about it. When
we hot-unplug a CPU, and a task's scheduler affinity (via
sched_setaffinity) refers to that CPU only, do we kill that task? Can
you sched_setaffinity a task to a CPU that is offline (alone or in a
group of possible CPUs)? Or is it allowed to run anywhere? Do we destroy
its affinity policy when that situation is run across? Or do we restore
the task to the CPU again when we re-plug it? I'm just curious about
what the kernel does here and you probably know off the top of your head
:) It feels like a similar situation.
> - we never retain anything, if you unload a module you lose all state
> that was associated with it too. What makes this special?
Another good point. I would think if we were talking about unmounting
cpusetfs altogether then what you say would correspond. But I feel like
there is some distinction between module unloading (and remembering
information about the module in question, I assume you mean things like
module parameters) and cpu hotplug (and remembering information about
cpuset configuration, which isn't necessarily directly related).
I don't have good answers to either of your points off the top of my
head -- but even if we say that "hey, userspace, you're dumb, get over
it", it feels like there could be more that we could do here. It seems
wrong to make every cpuset user (presuming there are more than just
libvirt & SGI) scan regularly for empty cpusets (I'm operating under the
assumption that the person setting up the cpuset configuration may not
be the same person that does the CPU hotplug operation).
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On Fri, 2012-05-04 at 14:27 -0700, Nishanth Aravamudan wrote:
> (I'm operating under the
> assumption that the person setting up the cpuset configuration may not
> be the same person that does the CPU hotplug operation).
They're using the same account though: root, so they'd better sync up
anyway :-)
On Fri, 2012-05-04 at 14:27 -0700, Nishanth Aravamudan wrote:
> > - if you retain it for cpuset but not others that's confusing (too);
>
> That's a good point.
>
> Related, possibly counter-example, and perhaps I'm wrong about it. When
> we hot-unplug a CPU, and a task's scheduler affinity (via
> sched_setaffinity) refers to that CPU only, do we kill that task? Can
> you sched_setaffinity a task to a CPU that is offline (alone or in a
> group of possible CPUs)? Or is it allowed to run anywhere? Do we destroy
> its affinity policy when that situation is run across?
See a few emails back, we destroy the affinity. Current cpuset behaviour
can be said to match that.
> Or do we restore the task to the CPU again when we re-plug it?
Nope, that information is lost forever from the kernel's pov.
Keeping this information around for the off-chance of needing it is
rather expensive (512 bytes per task for your regular distro kernel that
has NR_CPUS=4096).
On Fri, 2012-05-04 at 14:27 -0700, Nishanth Aravamudan wrote:
> (presuming there are more than just libvirt & SGI)
I think acme's tuna might also use it to partition the system for RT
workloads. But if you use hotplug and RT you're really doing it
wrong ;-)
On 04.05.2012 [22:56:21 +0200], Peter Zijlstra wrote:
> On Fri, 2012-05-04 at 13:46 -0700, Nishanth Aravamudan wrote:
> > What about other users of cpusets (what are they?)?
>
> cpusets came from SGI, it's traditionally used to partition _large_
> machines. Things like the batch/job-schedulers that go with that type of
> setup use it.
Yeah, I recall that usage (or some description similar). Do we have any
other known users of cpusets (beyond libvirt)?
> I've no clue why libvirt uses it (or why one would use libvirt for that
> matter).
Well, it is the case that libvirt does use it, and libvirt is used
pretty widely (or so it seems to me). I don't use it (cpusets or libvirt
:) either, but it seems like we should either tell libvirt directly that
cpusets are inappropriate for their use-case (once we figure out what
exactly that is, and why they chose cpusets) or work with them to
support their use-case?
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On Fri, 2012-05-04 at 14:30 -0700, Nishanth Aravamudan wrote:
> Well, it is the case that libvirt does use it, and libvirt is used
> pretty widely (or so it seems to me). I don't use it (cpusets or libvirt
> :) either, but it seems like we should either tell libvirt directly that
> cpusets are inappropriate for their use-case (once we figure out what
> exactly that is, and why they chose cpusets) or work with them to
> support their use-case?
Sure, let's start by getting their use-case, then see if cpuset is the
right solution and if so take it from there.
That said, the whole suspend/resume 'problem' does seem worth fixing and
is a very special case where we absolutely know we're going to get back
in the state we are in and userspace isn't actually running. So ideally
we'd go with Bhat's patch that skips the sched_domain rebuilds
entirely +- some bug-fixes ;-).
On 04.05.2012 [23:34:25 +0200], Peter Zijlstra wrote:
> On Fri, 2012-05-04 at 14:27 -0700, Nishanth Aravamudan wrote:
> > > - if you retain it for cpuset but not others that's confusing (too);
> >
> > That's a good point.
> >
> > Related, possibly counter-example, and perhaps I'm wrong about it. When
> > we hot-unplug a CPU, and a task's scheduler affinity (via
> > sched_setaffinity) refers to that CPU only, do we kill that task? Can
> > you sched_setaffinity a task to a CPU that is offline (alone or in a
> > group of possible CPUs)? Or is it allowed to run anywhere? Do we destroy
> > its affinity policy when that situation is run across?
>
> See a few emails back, we destroy the affinity. Current cpuset behaviour
> can be said to match that.
Ah you're right, sorry for glossing over that case. Does that also
happen if you affinitize it to a group of CPUs?
Seems not, we "remember" the original mask in that case:
# taskset -p f $$
pid 1424's current affinity mask: ff
pid 1424's new affinity mask: f
# grep Cpus_allowed /proc/self/status
Cpus_allowed: 0000000f
Cpus_allowed_list: 0-3
# echo 0 > /sys/devices/system/cpu/cpu2/online
# grep Cpus_allowed /proc/self/status
Cpus_allowed: 0000000f
Cpus_allowed_list: 0-3
So ... it seems like we come to a crossroads of sorts? I would think
cpusets and sched_setaffinity should behave the same in terms of
hotplug.
*Maybe* a compromise is that we remember cpuset information up to the
point a cpuset becomes empty; once you empty a cpuset, you forget
everything? That roughly corresponds to your and my test-case results?
Maybe that's more work than it's worth. It seems, though, like they
should have some similarity in functionality.
> > Or do we restore the task to the CPU again when we re-plug it?
>
> Nope that information is lost forever from the kernels pov.
>
> Keeping this information around for the off-chance of needing it is
> rather expensive (512 bytes per task for your regular distro kernel that
> has NR_CPUS=4096).
Yep, that's another good point.
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On 05/04/2012 02:19 PM, Srivatsa S. Bhat wrote:
> Add documentation for the newly introduced cpuset.actual_cpus file and
> describe the new semantics for updating cpusets upon CPU hotplug.
>
> Signed-off-by: Srivatsa S. Bhat <[email protected]>
> Cc: [email protected]
Yup, it looks like documentation:
Acked-by: Rob Landley <[email protected]>
> +then the cpuset hierarchy is traversed, searching for the nearest
> +ancestor whose cpu mask has atleast one online cpu. Then the tasks in
at least is two words.
Rob
--
GNU/Linux isn't: Linux=GPLv2, GNU=GPLv3+, they can't share code.
Either it's "mere aggregation", or a license violation. Pick one.
On Fri, 2012-05-04 at 22:56 +0200, Peter Zijlstra wrote:
> On Fri, 2012-05-04 at 13:46 -0700, Nishanth Aravamudan wrote:
> > What about other users of cpusets (what are they?)?
>
> cpusets came from SGI, it's traditionally used to partition _large_
> machines. Things like the batch/job-schedulers that go with that type of
> setup use it.
Includes RT on non-dinky boxen - _large_ + RT I hope to never meet ;-)
-Mike
On Fri, 4 May 2012, Peter Zijlstra wrote:
> That said, the whole suspend/resume 'problem' does seem worth fixing and
> is a very special case where we absolutely know we're going to get back
> in the state we are in and userspace isn't actually running. So ideally
> we'd go with Bhat's patch that skips the sched_domain rebuilds
> entirely +- some bug-fixes ;-).
Just as an interesting side comment...
The USB subsystem faced this same problem years ago. The question was:
When a USB device (especially a mass-storage device) is unplugged and
then reconnected, is the new device instance the same as the old one?
Linus stepped in and firmly assured us that it was not. That's very
much like the situation you're describing: If CPU 4 is hot-unplugged
and then a new CPU appears in slot 4, is it the same CPU as before (and
does it therefore belong to the same cpusets as before)?
But this led to problems during suspend, because not all systems could
maintain bus connectivity while the system was asleep, and almost none
can during hibernation. As a result, mounted filesystems would become
unavailable after resume even though the USB storage device had been
plugged in the whole time. To the kernel, it appeared that the device
had been unplugged during suspend and then replugged during resume.
We ended up adopting a special-purpose solution just to handle that
case. It's described in Documentation/usb/persist.txt if you want the
full details. In brief, when the system resumes it checks to see if a
device appears to be present at the same port where a device used to
be. If it is, and if its descriptors match the values remembered for
the former device, then we accept the new device as being the same as
the old one, even though the hardware indicates that the connection was
not maintained during the system sleep.
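In pseudo-C, that check amounts to something like the following (a
simplified sketch; both names are made up for illustration, the real
logic lives in the USB-persist code):

	static int usb_device_may_persist(const struct usb_device *old,
					  const struct usb_device *new)
	{
		/* Same port as before, and the freshly read descriptors
		 * match what we remembered for the old device: treat it
		 * as the same device instance across the system sleep. */
		return new->portnum == old->portnum &&
		       descriptors_match(old, new);	/* made-up helper */
	}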
From my point of view, this suggests that CPU hot-unplug is not quite
the right tool to use during suspend. The CPU doesn't actually go
away; it merely becomes unusable for a while. In other words, this
approach applies an incorrect abstraction. What's really needed is
something a little different: a way to avoid running any tasks on that
CPU while not removing it from the system. If this means some tasks
can no longer run on any CPUs, so be it -- this happens only during
suspend, after all. Then during resume, when the CPU is brought back
up, tasks are allowed to run on it again.
Alan Stern
On 05/05/2012 02:16 AM, Nishanth Aravamudan wrote:
> On 04.05.2012 [22:14:16 +0200], Peter Zijlstra wrote:
>> On Sat, 2012-05-05 at 01:28 +0530, Srivatsa S. Bhat wrote:
>>> On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
>>>
>>>>
>>>>> Documentation/cgroups/cpusets.txt | 43 +++--
>>>>> include/linux/cpuset.h | 4
>>>>> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
>>>>> kernel/sched/core.c | 4
>>>>> 4 files changed, 274 insertions(+), 94 deletions(-)
>>>>
>>>> Bah, I really hate this complexity you've created for a problem that
>>>> really doesn't exist.
>>>>
>>>
>>>
>>> Doesn't exist? Well, I believe we do have a problem and a serious one
>>> at that too!
>>
>> Still not convinced,..
>>
>>> The heart of the problem can be summarized in 2 sentences:
>>>
>>> o During a CPU hotplug, tasks can move between cpusets, and never
>>> come back to their original cpuset.
>>
>> This is a feature! You cannot say a task is part of a cpuset and then
>> run it elsewhere just because things don't work out.
>>
>> That's actively violating the meaning of cpusets.
>
> Tbh, I agree with you Peter, as I think that's how cpusets *should*
> work.
I agree that that's how cpusets must and should work in usual scenarios.
Otherwise, the whole concept of cpusets wouldn't make much sense.
However, in the face of hotplug, there are examples in the existing kernel
itself where that principle is, strictly speaking, 'violated'.
sched_setaffinity():
It calls cpuset_cpus_allowed() to find out what cpus are allowed for that
task (looking at the cpuset it is attached to), so that it can validate or
reduce the newly requested mask keeping the allowed cpus in mind.
But how does cpuset_cpus_allowed() calculate the "allowed cpus" for this task?
It calls guarantee_online_cpus(), which does exactly what I tried to do in
this patchset! That is, if the task's cpuset doesn't have any online cpus,
it goes up the cpuset hierarchy, trying to find a parent cpuset that does
have some online cpus and returns that mask! That too, without complaining!
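For reference, that upward walk looks roughly like this (paraphrasing
guarantee_online_cpus() from kernel/cpuset.c; locking omitted):

	static void guarantee_online_cpus(const struct cpuset *cs,
					  struct cpumask *pmask)
	{
		/* Climb towards the root until we find a cpuset that
		 * still has at least one online cpu. */
		while (cs && !cpumask_intersects(cs->cpus_allowed,
						 cpu_online_mask))
			cs = cs->parent;

		if (cs)
			cpumask_and(pmask, cs->cpus_allowed, cpu_online_mask);
		else	/* fell off the top: fall back to all online cpus */
			cpumask_copy(pmask, cpu_online_mask);
	}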
So it looks like the kernel already has relaxations with respect to cpusets
or allowed cpus when it is faced with hotplug.
> But I'll also reference `man cpuset`:
>
> Not all allocations of system memory are constrained by cpusets,
> for the following reasons.
>
> If hot-plug functionality is used to remove all the CPUs that
> are currently assigned to a cpuset, then the kernel will
> automatically update the cpus_allowed of all processes attached
> to CPUs in that cpuset to allow all CPUs. When memory hot-plug
> functionality for removing memory nodes is available, a
> similar exception is expected to apply there as well. In
> general, the kernel prefers to violate cpuset placement, rather
> than starving a process that has had all its allowed CPUs or
> memory nodes taken offline. User code should reconfigure
> cpusets to only refer to online CPUs and memory nodes when using
> hot-plug to add or remove such resources.
>
> So cpusets are, per their own documentation, not hard-limits in the face
> of hotplug.
>
Right. So it is up to us to strike a balance in whatever way we choose -
o just kill those tasks and be done with it
o or come up with nice variants (it is worth noting that the documentation
is flexible in the sense that it doesn't impose any hard-and-fast rule
as to how exactly we should implement the nice variants.)
> I, personally, think we should just kill off tasks in cpuset-constrained
> environments that are nonsensical (no memory, no cpus, etc.).
Even I think just killing the tasks, or maybe even preventing such destructive
hotplug (the last cpu in a cpuset going offline), would have been much easier
to handle and also more logical... and userspace would have been more
cautious while dealing with cpusets from the beginning.
> But, it
> would seem we've already supported this (inherit the parent in the face
> of hotplug) behavior in the past. Not sure we should break it ... at
> least on the surface.
>
Yes. Now that the kernel has sported a nice variant for a long time,
it wouldn't be good to break that, IMHO. But the question is, is that particular
nice variant/feature (move tasks to another cpuset, so that we strictly follow
the cpuset concept no matter what) really that good of a compromise?
IOW, if we came up with such a nice variant to be good/accommodating to users,
have we really achieved that goal, or have we created more woes instead?
So, there are 2 factors worth noting here:
o cpusets are not hard-limits in the face of hotplug, as per their own
documentation, and the doc itself promises flexible nice variants,
whose implementation we are free to choose.
o sched_setaffinity() already does what I intended to do for cpusets
with this patchset.
Considering these, I think it is OK to revisit and rework how we deal with
cpusets during hotplug...
Oh by the way, my patchset also exposes a new file that shows exactly what
cpus the tasks in a cpuset are allowed to run on - so that we are not doing
anything sneaky under the hood, without the user's knowledge. So it is also
easy for userspace to check if things deviated from the original configuration,
and establish a new configuration if needed, by writing to the cpuset.cpus file,
which the kernel will immediately honour.
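For instance, a userspace check could be as small as this (a minimal
sketch; the /dev/cpuset/mygroup path is made up, cpuset.actual_cpus is
the file introduced by this patchset, and it assumes cpuset.cpus
continues to show the configured mask):

	#include <stdio.h>
	#include <string.h>

	/* Compare the configured mask with the effective one, and
	 * re-assert the configuration if they diverged. Error handling
	 * kept minimal for brevity. */
	int main(void)
	{
		char want[256] = "", have[256] = "";
		FILE *f;

		if ((f = fopen("/dev/cpuset/mygroup/cpuset.cpus", "r"))) {
			if (!fgets(want, sizeof(want), f))
				want[0] = '\0';
			fclose(f);
		}
		if ((f = fopen("/dev/cpuset/mygroup/cpuset.actual_cpus", "r"))) {
			if (!fgets(have, sizeof(have), f))
				have[0] = '\0';
			fclose(f);
		}

		if (strcmp(want, have) &&
		    (f = fopen("/dev/cpuset/mygroup/cpuset.cpus", "w"))) {
			fputs(want, f);	/* the kernel honours this immediately */
			fclose(f);
		}
		return 0;
	}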
>>> o Tasks might get pinned to lesser number of cpus, unreasonably.
>>
>> -ENOPARSE, are you trying to say that when the set contains 4 cpus and
>> you unplug one it's left with 3? Sounds pretty damn obvious, that's
>> what unplug does, it takes a cpu away.
>
> I think he's saying that it's pinned to 3 forever, even if that 4th CPU
> is re-plugged.
>
Yes, I meant that. Sorry for not being clear.
<snip>
>
> So I can see several solutions:
>
> - Rework cpusets to not be so nice to the user and kill off tasks that
> run in stupid cpusets. (to be written)
> - Keep current behavior to be nice to the user, but make it much noisier
> when the cpuset rules are being broken because they are stupid (do
> nothing choice)
> - Track/restore the user's setup when it's possible to do so. (this
> patchset)
>
> I'm not sure any of these is "better" than the rest, but they probably
> all have distinct merits.
>
Yep, and it makes sense to choose the one which the kernel is willing
to support, within its constraints. And if that happens to be a nice variant
anyway, why not choose the one which actually helps...?
Regards,
Srivatsa S. Bhat
On Sat, May 05, 2012 at 11:24:55AM -0400, Alan Stern wrote:
> On Fri, 4 May 2012, Peter Zijlstra wrote:
>
> > That said, the whole suspend/resume 'problem' does seem worth fixing and
> > is a very special case where we absolutely know we're going to get back
> > in the state we are in and userspace isn't actually running. So ideally
> > we'd go with Bhat's patch that skips the sched_domain rebuilds
> > entirely +- some bug-fixes ;-).
>
> Just as an interesting side comment...
>
> The USB subsystem faced this same problem years ago. The question was:
> When a USB device (especially a mass-storage device) is unplugged and
> then reconnected, is the new device instance the same as the old one?
> Linus stepped in and firmly assured us that it was not. That's very
> much like the situation you're describing: If CPU 4 is hot-unplugged
> and then a new CPU appears in slot 4, is it the same CPU as before (and
> does it therefore belong to the same cpusets as before)?
>
> But this led to problems during suspend, because not all systems could
> maintain bus connectivity while the system was asleep, and almost none
> can during hibernation. As a result, mounted filesystems would become
> unavailable after resume even though the USB storage device had been
> plugged in the whole time. To the kernel, it appeared that the device
> had been unplugged during suspend and then replugged during resume.
>
> We ended up adopting a special-purpose solution just to handle that
> case. It's described in Documentation/usb/persist.txt if you want the
> full details. In brief, when the system resumes it checks to see if a
> device appears to be present at the same port where a device used to
> be. If it is, and if its descriptors match the values remembered for
> the former device, then we accept the new device as being the same as
> the old one, even though the hardware indicates that the connection was
> not maintained during the system sleep.
>
> From my point of view, this suggests that CPU hot-unplug is not quite
> the right tool to use during suspend. The CPU doesn't actually go
> away; it merely becomes unusable for a while. In other words, this
> approach applies an incorrect abstraction. What's really needed is
> something a little different: a way to avoid running any tasks on that
> CPU while not removing it from the system. If this means some tasks
> can no longer run on any CPUs, so be it -- this happens only during
> suspend, after all. Then during resume, when the CPU is brought back
> up, tasks are allowed to run on it again.
If I understand correctly, Thomas Gleixner is pushing in this direction,
allowing CPUs to be brought down partially (preventing anything from
running on them) or completely. The big obstacle in the current kernel
is the lack of an organized way of bringing CPUs down.
Thanx, Paul
On Saturday, May 05, 2012, Paul E. McKenney wrote:
> On Sat, May 05, 2012 at 11:24:55AM -0400, Alan Stern wrote:
> > On Fri, 4 May 2012, Peter Zijlstra wrote:
> >
> > > That said, the whole suspend/resume 'problem' does seem worth fixing and
> > > is a very special case where we absolutely know we're going to get back
> > > in the state we are in and userspace isn't actually running. So ideally
> > > we'd go with Bhat's patch that skips the sched_domain rebuilds
> > > entirely +- some bug-fixes ;-).
> >
> > Just as an interesting side comment...
> >
> > The USB subsystem faced this same problem years ago. The question was:
> > When a USB device (especially a mass-storage device) is unplugged and
> > then reconnected, is the new device instance the same as the old one?
> > Linus stepped in and firmly assured us that it was not. That's very
> > much like the situation you're describing: If CPU 4 is hot-unplugged
> > and then a new CPU appears in slot 4, is it the same CPU as before (and
> > does it therefore belong to the same cpusets as before)?
> >
> > But this led to problems during suspend, because not all systems could
> > maintain bus connectivity while the system was asleep, and almost none
> > can during hibernation. As a result, mounted filesystems would become
> > unavailable after resume even though the USB storage device had been
> > plugged in the whole time. To the kernel, it appeared that the device
> > had been unplugged during suspend and then replugged during resume.
> >
> > We ended up adopting a special-purpose solution just to handle that
> > case. It's described in Documentation/usb/persist.txt if you want the
> > full details. In brief, when the system resumes it checks to see if a
> > device appears to be present at the same port where a device used to
> > be. If it is, and if its descriptors match the values remembered for
> > the former device, then we accept the new device as being the same as
> > the old one, even though the hardware indicates that the connection was
> > not maintained during the system sleep.
> >
> > From my point of view, this suggests that CPU hot-unplug is not quite
> > the right tool to use during suspend. The CPU doesn't actually go
> > away; it merely becomes unusable for a while. In other words, this
> > approach applies an incorrect abstraction. What's really needed is
> > something a little different: a way to avoid running any tasks on that
> > CPU while not removing it from the system. If this means some tasks
> > can no longer run on any CPUs, so be it -- this happens only during
> > suspend, after all. Then during resume, when the CPU is brought back
> > up, tasks are allowed to run on it again.
>
> If I understand correctly, Thomas Gleixner is pushing in this direction,
> allowing CPUs to be brought down partially (preventing anything from
> running on them) or completely. The big obstacle in the current kernel
> is the lack of an organized way of bringing CPUs down.
Yet, this is the only viable way to go, IMHO.
Thanks,
Rafael
On 05/06/2012 01:15 AM, Srivatsa S. Bhat wrote:
>> I, personally, think we should just kill off tasks in cpuset-constrained
>> environments that are nonsensical (no memory, no cpus, etc.).
>
>
> Even I think just killing the tasks, or maybe even preventing such destructive
> hotplug (the last cpu in a cpuset going offline), would have been much easier
> to handle and also more logical... and userspace would have been more
> cautious while dealing with cpusets from the beginning.
On another OS, there's a "force" flag for cpu_down(). If cpu_down() is
called with the "force" flag as false, the request will be rejected if
any cpuset becomes empty; otherwise it will try to assign other CPUs to
the empty cpusets.
So the administrator could choose different behaviors.
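In rough pseudo-C, that policy would look something like this (purely
illustrative; neither the flag nor these helpers exist in the Linux
kernel):

	int cpu_down(unsigned int cpu, bool force)
	{
		if (offlining_would_empty_a_cpuset(cpu)) {
			if (!force)
				return -EBUSY;	/* reject the hot-unplug */

			/* force == true: give the to-be-emptied cpusets
			 * some other cpus instead. */
			reassign_cpus_of_empty_cpusets(cpu);
		}

		return __cpu_down(cpu);
	}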
On Fri, May 04, 2012 at 02:30:11PM -0700, Nishanth Aravamudan wrote:
> On 04.05.2012 [22:56:21 +0200], Peter Zijlstra wrote:
> > On Fri, 2012-05-04 at 13:46 -0700, Nishanth Aravamudan wrote:
> > > What about other users of cpusets (what are they?)?
> >
> > cpusets came from SGI, it's traditionally used to partition _large_
> > machines. Things like the batch/job-schedulers that go with that type of
> > setup use it.
>
> Yeah, I recall that usage (or some description similar). Do we have any
> other known users of cpusets (beyond libvirt)?
IIRC, the lxc.sf.net project also uses cpusets (no connection to the libvirt
LXC driver mentioned below, which is an alternative impl of the same concept).
> > I've no clue why libvirt uses it (or why one would use libvirt for that
> > matter).
>
> Well, it is the case that libvirt does use it, and libvirt is used
> pretty widely (or so it seems to me). I don't use it (cpusets or libvirt
> :) either, but it seems like we should either tell libvirt directly that
> cpusets are inappropriate for their use-case (once we figure out what
> exactly that is, and why they chose cpusets) or work with them to
> support their use-case?
Libvirt uses the cpuset cgroups functionality in two of its
virtualization drivers:
- LXC. Container based virt. The cpuset controller is used to
constrain all processes running inside the container to a
specific collection of CPUs. While we could use the traditional
sched_setaffinity() syscall at initial startup of the container (see
the sketch after this list), this is not so practical when we want to
dynamically change the
affinity of an existing container. It would require that we
iterate over all tasks changing their affinity, and to avoid
fork() race conditions we'd need to suspend the container while
doing this. Thus we've long used the cpuset cgroups controller
for LXC.
- KVM. Full machine virt. By default we use sched_setaffinity
to apply constraints on what host CPUs a VM executes on. Fairly
recently we added the ability to optionally use the cpuset
controller instead (only if the sysadmin has already mounted
it). The advantage of this is that if we update the cpuset
of an existing VM, then IIUC, the kernel will migrate its
allocated memory to be local to the new CPU set mask.
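For illustration, the sched_setaffinity() approach mentioned above
boils down to something like this (a minimal userspace sketch; the cpu
numbers are arbitrary):

	#define _GNU_SOURCE
	#include <sched.h>
	#include <stdio.h>
	#include <sys/types.h>

	/* Pin one task (e.g. the main VM process) to host cpus 2-3.
	 * This affects only the given task: to constrain a whole
	 * container this way, every task inside it would have to be
	 * iterated over, racing against fork() all the while. */
	static int pin_task(pid_t pid)
	{
		cpu_set_t set;

		CPU_ZERO(&set);
		CPU_SET(2, &set);
		CPU_SET(3, &set);

		return sched_setaffinity(pid, sizeof(set), &set);
	}

	int main(void)
	{
		if (pin_task(0))	/* 0 == the calling task itself */
			perror("sched_setaffinity");
		return 0;
	}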
The pain point we're hitting is that upon suspend/restore the cgroups
cpuset masks are not preserved. This is not a problem for server virt
usage scenarios, but it is for desktop users with virt on laptops.
I don't see a viable alternative to the cpuset controller for our LXC
container driver. For KVM we could do without the cpuset controller
if there is an alternative way to tell the kernel to migrate the KVM
process memory to be local to the new CPU affinity set using the
sched_setaffinity() call.
We are open to suggestions of alternative approaches, particularly since
we have had no end of trouble with pretty much all of the kernel's
cgroups controllers :-(
Regards,
Daniel
--
|: http://berrange.com -o- http://www.flickr.com/photos/dberrange/ :|
|: http://libvirt.org -o- http://virt-manager.org :|
|: http://autobuild.org -o- http://search.cpan.org/~danberr/ :|
|: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :|
On 05/05/2012 01:44 AM, Peter Zijlstra wrote:
> On Sat, 2012-05-05 at 01:28 +0530, Srivatsa S. Bhat wrote:
>> On 05/05/2012 12:54 AM, Peter Zijlstra wrote:
>>
>>>
>>>> Documentation/cgroups/cpusets.txt | 43 +++--
>>>> include/linux/cpuset.h | 4
>>>> kernel/cpuset.c | 317 ++++++++++++++++++++++++++++---------
>>>> kernel/sched/core.c | 4
>>>> 4 files changed, 274 insertions(+), 94 deletions(-)
>>>
>>> Bah, I really hate this complexity you've created for a problem that
>>> really doesn't exist.
>>>
>>
>>
>> Doesn't exist? Well, I believe we do have a problem and a serious one
>> at that too!
>
>>> So why not fix the active mask crap?
>>
>>
>> Because I doubt if that is the right way to approach this problem.
>>
>> An updated cpu_active_mask not being the necessary and sufficient condition
>> for all scheduler related activities, is a different problem altogether, IMHO.
>
> It was the sole reason the previous, simple, patch didn't work. So fixing
> that seems important.
>
Some thoughts on this..
First of all, why would it be reasonable to expect the scheduler to work
flawlessly with half its infrastructure (sched domains for example) in a
stale/inconsistent/outdated state?
IOW, I am finding it difficult to understand why you would consider it a bug if
the scheduler falters when the cpu_active mask is up-to-date but the sched domains
are old/outdated. Is it not expected? And hence, wouldn't it make sense to keep
the sched domains up-to-date so that the scheduler functions properly?
Also, to "fix" that, sprinkling checks for active cpu, wherever the sched domain
tree traversal is done, like:
	if (!cpu_active(cpu))
		return;		/* go out, don't walk the domains */

	for_each_domain(cpu, sd) {
		...
	}
looks quite ugly/hacky to me, because, if the sched domains were up-to-date
(as they should be), then the domain traversal would automatically become a
nop since the sd pointer would have been NULL... Thus, there wouldn't be a
need for such checks.
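(For reference, for_each_domain() expands to roughly the following, so
once the domains are properly detached and the rq's sd pointer is NULL,
the loop body never runs:

	#define for_each_domain(cpu, __sd) \
		for (__sd = rcu_dereference(cpu_rq(cpu)->sd); __sd; \
		     __sd = __sd->parent)

The actual definition in kernel/sched/core.c differs only in the RCU
annotation it uses.)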
Moreover, those checks for an active cpu, if added, could also end up in hot
paths, such as schedule().
Regards,
Srivatsa S. Bhat