Currently the kernel doesn't handle cpusets properly during suspend/resume.
After a resume, all non-root cpusets end up having only 1 cpu (the boot cpu),
causing massive performance degradation of workloads. One major user of cpusets
is libvirt, which means that after a suspend/hibernation cycle, all VMs
suddenly end up running terribly slow!
Also, the kernel moves the tasks from one cpuset to another during CPU hotplug
in the suspend/resume path, leading to a task-management nightmare after
resume.
This patchset solves these problems by reworking the way cpusets are handled
during CPU hotplug in the suspend/resume path. This doesn't change the
semantics of how cpusets are handled during regular CPU hotplug, since that
behaviour is correct as it stands.
Patches 1 & 2 are cleanups that separate out hotplug handling so that we can
implement different logic for different hotplug events (CPU/Mem
online/offline). This also leads to some optimizations and more importantly
prepares the ground for any further work dealing with cpusets during hotplug.
Patch 3 is a bug fix - it ensures that the tasks attached to the root cpuset
see the updated cpus_allowed mask upon CPU hotplug.
Patches 4 and 5 implement the fix for cpusets handling during suspend/resume.
--
Srivatsa S. Bhat (5):
cpusets, hotplug: Implement cpuset tree traversal in a helper function
cpusets, hotplug: Restructure functions that are invoked during hotplug
cpusets: Update tasks' cpus_allowed mask upon updates to root cpuset
cpusets: Add provisions for distinguishing CPU Hotplug in suspend/resume path
cpusets, suspend: Save and restore cpusets during suspend/resume
include/linux/cpuset.h | 4 -
kernel/cpuset.c | 235 +++++++++++++++++++++++++++++++++++++++---------
kernel/sched/core.c | 29 +++++-
3 files changed, 218 insertions(+), 50 deletions(-)
Thanks,
Srivatsa S. Bhat
IBM Linux Technology Center
At present, the functions that deal with cpusets during CPU/Mem hotplug
are quite messy, since a lot of the functionality is mixed up without clear
separation. This also gets in the way of optimization. For example,
cpuset_update_active_cpus() is called on both CPU offline and CPU online
events, and it invokes scan_for_empty_cpusets(), which makes sense only for
CPU offline events. Hence, the current code ends up needlessly traversing
the cpuset tree during CPU online as well.
As a first step towards cleaning up those functions, encapsulate the cpuset
tree traversal in a helper function, so as to facilitate upcoming changes.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 26 ++++++++++++++++++--------
1 files changed, 18 insertions(+), 8 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 14f7070..23e5da6 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -2000,6 +2000,23 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
move_member_tasks_to_cpuset(cs, parent);
}
+static struct cpuset *traverse_cpusets(struct list_head *queue)
+{
+ struct cpuset *cp;
+ struct cpuset *child; /* scans child cpusets of cp */
+ struct cgroup *cont;
+
+ cp = list_first_entry(queue, struct cpuset, stack_list);
+ list_del(queue->next);
+ list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
+ child = cgroup_cs(cont);
+ list_add_tail(&child->stack_list, queue);
+ }
+
+ return cp;
+}
+
+
/*
* Walk the specified cpuset subtree and look for empty cpusets.
* The tasks of such cpuset must be moved to a parent cpuset.
@@ -2019,19 +2036,12 @@ static void scan_for_empty_cpusets(struct cpuset *root)
{
LIST_HEAD(queue);
struct cpuset *cp; /* scans cpusets being updated */
- struct cpuset *child; /* scans child cpusets of cp */
- struct cgroup *cont;
static nodemask_t oldmems; /* protected by cgroup_mutex */
list_add_tail((struct list_head *)&root->stack_list, &queue);
while (!list_empty(&queue)) {
- cp = list_first_entry(&queue, struct cpuset, stack_list);
- list_del(queue.next);
- list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
- child = cgroup_cs(cont);
- list_add_tail(&child->stack_list, &queue);
- }
+ cp = traverse_cpusets(&queue);
/* Continue past cpusets with all cpus, mems online */
if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
Separate out the cpuset related handling for CPU/Memory online/offline.
This also helps us exploit the most obvious and basic level of optimization
that any notification mechanism (CPU/Mem online/offline) has to offer us:
"We *know* why we have been invoked. So stop pretending that we are lost,
and do only the necessary amount of processing!".
And while at it, rename scan_for_empty_cpusets() to
scan_cpusets_upon_hotplug(), which is more appropriate considering the
upcoming changes.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
include/linux/cpuset.h | 4 +-
kernel/cpuset.c | 91 +++++++++++++++++++++++++++++++++---------------
kernel/sched/core.c | 4 +-
3 files changed, 67 insertions(+), 32 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 668f66b..838320f 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -20,7 +20,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(void);
+extern void cpuset_update_active_cpus(bool cpu_online);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
@@ -124,7 +124,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
static inline int cpuset_init(void) { return 0; }
static inline void cpuset_init_smp(void) {}
-static inline void cpuset_update_active_cpus(void)
+static inline void cpuset_update_active_cpus(bool cpu_online)
{
partition_sched_domains(1, NULL, NULL);
}
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 23e5da6..87b0048 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -147,6 +147,12 @@ typedef enum {
CS_SPREAD_SLAB,
} cpuset_flagbits_t;
+/* the type of hotplug event */
+enum hotplug_event {
+ CPUSET_CPU_OFFLINE,
+ CPUSET_MEM_OFFLINE,
+};
+
/* convenient tests for these bits */
static inline int is_cpu_exclusive(const struct cpuset *cs)
{
@@ -2018,8 +2024,10 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
/*
- * Walk the specified cpuset subtree and look for empty cpusets.
- * The tasks of such cpuset must be moved to a parent cpuset.
+ * Walk the specified cpuset subtree upon a hotplug operation (CPU/Memory
+ * online/offline) and update the cpusets accordingly.
+ * For regular CPU/Mem hotplug, look for empty cpusets; the tasks of such
+ * cpuset must be moved to a parent cpuset.
*
* Called with cgroup_mutex held. We take callback_mutex to modify
* cpus_allowed and mems_allowed.
@@ -2032,39 +2040,60 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
* that has tasks along with an empty 'mems'. But if we did see such
* a cpuset, we'd handle it just like we do if its 'cpus' was empty.
*/
-static void scan_for_empty_cpusets(struct cpuset *root)
+static void
+scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
{
LIST_HEAD(queue);
- struct cpuset *cp; /* scans cpusets being updated */
+ struct cpuset *cp; /* scans cpusets being updated */
static nodemask_t oldmems; /* protected by cgroup_mutex */
list_add_tail((struct list_head *)&root->stack_list, &queue);
- while (!list_empty(&queue)) {
- cp = traverse_cpusets(&queue);
+ switch (event) {
+ case CPUSET_CPU_OFFLINE:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
- /* Continue past cpusets with all cpus, mems online */
- if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
- nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
- continue;
+ /* Continue past cpusets with all cpus online */
+ if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
+ continue;
- oldmems = cp->mems_allowed;
+ /* Remove offline cpus from this cpuset. */
+ mutex_lock(&callback_mutex);
+ cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
+ cpu_active_mask);
+ mutex_unlock(&callback_mutex);
- /* Remove offline cpus and mems from this cpuset. */
- mutex_lock(&callback_mutex);
- cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_active_mask);
- nodes_and(cp->mems_allowed, cp->mems_allowed,
+ /* Move tasks from the empty cpuset to a parent */
+ if (cpumask_empty(cp->cpus_allowed))
+ remove_tasks_in_empty_cpuset(cp);
+ else
+ update_tasks_cpumask(cp, NULL);
+ }
+ break;
+
+ case CPUSET_MEM_OFFLINE:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
+
+ /* Continue past cpusets with all mems online */
+ if (nodes_subset(cp->mems_allowed,
+ node_states[N_HIGH_MEMORY]))
+ continue;
+
+ oldmems = cp->mems_allowed;
+
+ /* Remove offline mems from this cpuset. */
+ mutex_lock(&callback_mutex);
+ nodes_and(cp->mems_allowed, cp->mems_allowed,
node_states[N_HIGH_MEMORY]);
- mutex_unlock(&callback_mutex);
+ mutex_unlock(&callback_mutex);
- /* Move tasks from the empty cpuset to a parent */
- if (cpumask_empty(cp->cpus_allowed) ||
- nodes_empty(cp->mems_allowed))
- remove_tasks_in_empty_cpuset(cp);
- else {
- update_tasks_cpumask(cp, NULL);
- update_tasks_nodemask(cp, &oldmems, NULL);
+ /* Move tasks from the empty cpuset to a parent */
+ if (nodes_empty(cp->mems_allowed))
+ remove_tasks_in_empty_cpuset(cp);
+ else
+ update_tasks_nodemask(cp, &oldmems, NULL);
}
}
}
@@ -2080,8 +2109,11 @@ static void scan_for_empty_cpusets(struct cpuset *root)
*
* Called within get_online_cpus(). Needs to call cgroup_lock()
* before calling generate_sched_domains().
+ *
+ * @cpu_online: Indicates whether this is a CPU online event (true) or
+ * a CPU offline event (false).
*/
-void cpuset_update_active_cpus(void)
+void cpuset_update_active_cpus(bool cpu_online)
{
struct sched_domain_attr *attr;
cpumask_var_t *doms;
@@ -2091,7 +2123,10 @@ void cpuset_update_active_cpus(void)
mutex_lock(&callback_mutex);
cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
mutex_unlock(&callback_mutex);
- scan_for_empty_cpusets(&top_cpuset);
+
+ if (!cpu_online)
+ scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
+
ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();
@@ -2122,9 +2157,9 @@ static int cpuset_track_online_nodes(struct notifier_block *self,
case MEM_OFFLINE:
/*
* needn't update top_cpuset.mems_allowed explicitly because
- * scan_for_empty_cpusets() will update it.
+ * scan_cpusets_upon_hotplug() will update it.
*/
- scan_for_empty_cpusets(&top_cpuset);
+ scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_MEM_OFFLINE);
break;
default:
break;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 0533a68..55cfe8c 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6812,7 +6812,7 @@ static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_ONLINE:
case CPU_DOWN_FAILED:
- cpuset_update_active_cpus();
+ cpuset_update_active_cpus(true);
return NOTIFY_OK;
default:
return NOTIFY_DONE;
@@ -6824,7 +6824,7 @@ static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
{
switch (action & ~CPU_TASKS_FROZEN) {
case CPU_DOWN_PREPARE:
- cpuset_update_active_cpus();
+ cpuset_update_active_cpus(false);
return NOTIFY_OK;
default:
return NOTIFY_DONE;
The root cpuset's cpus_allowed mask can get updated during CPU hotplug.
However, during those updates, the tasks belonging to the root cpuset
aren't touched at all - their cpus_allowed masks aren't updated to reflect
the change in the configuration of the cpuset they belong to.
Fix this by moving the update of top_cpuset.cpus_allowed to
scan_cpusets_upon_hotplug() and ensure that the call to update_tasks_cpumask()
is not missed for the root cpuset, including on CPU online events.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 35 +++++++++++++++++++++++++++--------
1 files changed, 27 insertions(+), 8 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 87b0048..0fb9bff 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -150,6 +150,7 @@ typedef enum {
/* the type of hotplug event */
enum hotplug_event {
CPUSET_CPU_OFFLINE,
+ CPUSET_CPU_ONLINE,
CPUSET_MEM_OFFLINE,
};
@@ -2058,10 +2059,14 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
continue;
- /* Remove offline cpus from this cpuset. */
mutex_lock(&callback_mutex);
- cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
- cpu_active_mask);
+ /* top_cpuset.cpus_allowed must track cpu_active_mask */
+ if (cp == &top_cpuset)
+ cpumask_copy(cp->cpus_allowed, cpu_active_mask);
+ else
+ /* Remove offline cpus from this cpuset. */
+ cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
+ cpu_active_mask);
mutex_unlock(&callback_mutex);
/* Move tasks from the empty cpuset to a parent */
@@ -2072,6 +2077,20 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
}
break;
+ case CPUSET_CPU_ONLINE:
+ /*
+ * Restore only the top_cpuset because it has to track
+ * cpu_active_mask always.
+ */
+ if (root == &top_cpuset) {
+ mutex_lock(&callback_mutex);
+ cpumask_copy(root->cpus_allowed, cpu_active_mask);
+ mutex_unlock(&callback_mutex);
+ update_tasks_cpumask(root, NULL);
+ }
+ list_del(queue.next);
+ break;
+
case CPUSET_MEM_OFFLINE:
while (!list_empty(&queue)) {
cp = traverse_cpusets(&queue);
@@ -2105,7 +2124,8 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
* but making no active use of cpusets.
*
* This routine ensures that top_cpuset.cpus_allowed tracks
- * cpu_active_mask on each CPU hotplug (cpuhp) event.
+ * cpu_active_mask on each CPU hotplug (cpuhp) event. (This maintenance
+ * is actually implemented in scan_cpusets_upon_hotplug().)
*
* Called within get_online_cpus(). Needs to call cgroup_lock()
* before calling generate_sched_domains().
@@ -2120,11 +2140,10 @@ void cpuset_update_active_cpus(bool cpu_online)
int ndoms;
cgroup_lock();
- mutex_lock(&callback_mutex);
- cpumask_copy(top_cpuset.cpus_allowed, cpu_active_mask);
- mutex_unlock(&callback_mutex);
- if (!cpu_online)
+ if (cpu_online)
+ scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_ONLINE);
+ else
scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
ndoms = generate_sched_domains(&doms, &attr);
The cpuset code needs to distinguish between a regular CPU Hotplug operation
and a CPU Hotplug operation carried out as part of the suspend/resume
sequence.
So add provisions to facilitate that, so that the two operations can be
handled differently.
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
include/linux/cpuset.h | 2 +-
kernel/cpuset.c | 19 ++++++++++++++++---
kernel/sched/core.c | 29 +++++++++++++++++++++++------
3 files changed, 40 insertions(+), 10 deletions(-)
diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
index 838320f..184cb94 100644
--- a/include/linux/cpuset.h
+++ b/include/linux/cpuset.h
@@ -20,7 +20,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
extern int cpuset_init(void);
extern void cpuset_init_smp(void);
-extern void cpuset_update_active_cpus(bool cpu_online);
+extern void cpuset_update_active_cpus(bool cpu_online, bool frozen);
extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 0fb9bff..0723183 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -151,6 +151,8 @@ typedef enum {
enum hotplug_event {
CPUSET_CPU_OFFLINE,
CPUSET_CPU_ONLINE,
+ CPUSET_CPU_OFFLINE_FROZEN,
+ CPUSET_CPU_ONLINE_FROZEN,
CPUSET_MEM_OFFLINE,
};
@@ -2077,6 +2079,12 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
}
break;
+ case CPUSET_CPU_OFFLINE_FROZEN:
+ break;
+
+ case CPUSET_CPU_ONLINE_FROZEN:
+ break;
+
case CPUSET_CPU_ONLINE:
/*
* Restore only the top_cpuset because it has to track
@@ -2132,19 +2140,24 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
*
* @cpu_online: Indicates whether this is a CPU online event (true) or
* a CPU offline event (false).
+ * @frozen: Indicates whether the tasks are frozen (true) or not (false)
+ * Frozen tasks indicate a CPU hotplug operation in the suspend/resume path.
*/
-void cpuset_update_active_cpus(bool cpu_online)
+void cpuset_update_active_cpus(bool cpu_online, bool frozen)
{
struct sched_domain_attr *attr;
+ enum hotplug_event event;
cpumask_var_t *doms;
int ndoms;
cgroup_lock();
if (cpu_online)
- scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_ONLINE);
+ event = frozen ? CPUSET_CPU_ONLINE_FROZEN : CPUSET_CPU_ONLINE;
else
- scan_cpusets_upon_hotplug(&top_cpuset, CPUSET_CPU_OFFLINE);
+ event = frozen ? CPUSET_CPU_OFFLINE_FROZEN : CPUSET_CPU_OFFLINE;
+
+ scan_cpusets_upon_hotplug(&top_cpuset, event);
ndoms = generate_sched_domains(&doms, &attr);
cgroup_unlock();
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 55cfe8c..08463d5 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6809,26 +6809,43 @@ int __init sched_create_sysfs_power_savings_entries(struct device *dev)
static int cpuset_cpu_active(struct notifier_block *nfb, unsigned long action,
void *hcpu)
{
- switch (action & ~CPU_TASKS_FROZEN) {
+ bool frozen;
+
+ switch (action) {
case CPU_ONLINE:
case CPU_DOWN_FAILED:
- cpuset_update_active_cpus(true);
- return NOTIFY_OK;
+ frozen = false;
+ break;
+ case CPU_ONLINE_FROZEN:
+ case CPU_DOWN_FAILED_FROZEN:
+ frozen = true;
+ break;
default:
return NOTIFY_DONE;
}
+
+ cpuset_update_active_cpus(true, frozen);
+ return NOTIFY_OK;
}
static int cpuset_cpu_inactive(struct notifier_block *nfb, unsigned long action,
void *hcpu)
{
- switch (action & ~CPU_TASKS_FROZEN) {
+ bool frozen;
+
+ switch (action) {
case CPU_DOWN_PREPARE:
- cpuset_update_active_cpus(false);
- return NOTIFY_OK;
+ frozen = false;
+ break;
+ case CPU_DOWN_PREPARE_FROZEN:
+ frozen = true;
+ break;
default:
return NOTIFY_DONE;
}
+
+ cpuset_update_active_cpus(false, frozen);
+ return NOTIFY_OK;
}
void __init sched_init_smp(void)
In the event of CPU hotplug, the kernel modifies the cpusets' cpus_allowed
masks as and when necessary to ensure that the tasks belonging to the cpusets
have some place (online CPUs) to run on. And regular CPU hotplug is
destructive in the sense that the kernel doesn't remember the original cpuset
configurations set by the user, across hotplug operations.
However, suspend/resume (which uses CPU hotplug) is a special case in which
the kernel has the responsibility to restore the system (during resume), to
exactly the same state it was in before suspend. And this calls for special
handling of cpusets when CPU hotplug is carried out in the suspend/resume
path.
That special handling for suspend/resume is implemented as follows:
1. Explicitly save all the cpusets' cpus_allowed mask during suspend and
restore them during resume. Use a new per-cpuset mask to facilitate this.
2. During CPU hotplug, modify the cpusets' cpus_allowed mask as necessary to
keep them non-empty.
3. Do not move the tasks from one cpuset to another during hotplug.
Note that all of userspace will already have been frozen before CPU hotplug
is carried out in the suspend/resume path; hence, in reality, no userspace
task will actually observe any of this special-case handling (which is good
news for the kernel, because this deviates slightly from true hotplug
semantics, so as to resume the system properly).
(Also, while implementing this special case handling for suspend/resume, we
don't modify the existing cpuset handling for regular CPU hotplug, since it
is correct as it is anyway.)
Signed-off-by: Srivatsa S. Bhat <[email protected]>
Cc: [email protected]
---
kernel/cpuset.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++---
1 files changed, 79 insertions(+), 5 deletions(-)
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 0723183..671bf26 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -93,6 +93,13 @@ struct cpuset {
unsigned long flags; /* "unsigned long" so bitops work */
cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
+
+ /*
+ * used to save cpuset's cpus_allowed mask during suspend and restore
+ * it during resume.
+ */
+ cpumask_var_t suspend_cpus_allowed;
+
nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
struct cpuset *parent; /* my parent */
@@ -1851,10 +1858,12 @@ static struct cgroup_subsys_state *cpuset_create(struct cgroup *cont)
cs = kmalloc(sizeof(*cs), GFP_KERNEL);
if (!cs)
return ERR_PTR(-ENOMEM);
- if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL)) {
- kfree(cs);
- return ERR_PTR(-ENOMEM);
- }
+
+ if (!alloc_cpumask_var(&cs->cpus_allowed, GFP_KERNEL))
+ goto out_cs;
+
+ if (!alloc_cpumask_var(&cs->suspend_cpus_allowed, GFP_KERNEL))
+ goto out_cpus_allowed;
cs->flags = 0;
if (is_spread_page(parent))
@@ -1870,6 +1879,12 @@ static struct cgroup_subsys_state *cpuset_create(struct cgroup *cont)
cs->parent = parent;
number_of_cpusets++;
return &cs->css ;
+
+ out_cpus_allowed:
+ free_cpumask_var(cs->cpus_allowed);
+ out_cs:
+ kfree(cs);
+ return ERR_PTR(-ENOMEM);
}
/*
@@ -1887,6 +1902,7 @@ static void cpuset_destroy(struct cgroup *cont)
number_of_cpusets--;
free_cpumask_var(cs->cpus_allowed);
+ free_cpumask_var(cs->suspend_cpus_allowed);
kfree(cs);
}
@@ -1915,6 +1931,9 @@ int __init cpuset_init(void)
if (!alloc_cpumask_var(&top_cpuset.cpus_allowed, GFP_KERNEL))
BUG();
+ if (!alloc_cpumask_var(&top_cpuset.suspend_cpus_allowed, GFP_KERNEL))
+ BUG();
+
cpumask_setall(top_cpuset.cpus_allowed);
nodes_setall(top_cpuset.mems_allowed);
@@ -2031,6 +2050,12 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
* online/offline) and update the cpusets accordingly.
* For regular CPU/Mem hotplug, look for empty cpusets; the tasks of such
* cpuset must be moved to a parent cpuset.
+ * For CPU hotplug in the suspend/resume path,
+ * - save cpusets' cpus_allowed mask during suspend and restore them during
+ * resume
+ * - update the cpusets' cpus_allowed mask to keep them non-empty during the
+ * suspend/resume transition
+ * - don't move the tasks from one cpuset to another during these updates
*
* Called with cgroup_mutex held. We take callback_mutex to modify
* cpus_allowed and mems_allowed.
@@ -2049,6 +2074,7 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
LIST_HEAD(queue);
struct cpuset *cp; /* scans cpusets being updated */
static nodemask_t oldmems; /* protected by cgroup_mutex */
+ static int frozen_cpu_count; /* marks begin/end of suspend/resume */
list_add_tail((struct list_head *)&root->stack_list, &queue);
@@ -2080,10 +2106,58 @@ scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
break;
case CPUSET_CPU_OFFLINE_FROZEN:
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
+
+ /*
+ * Save the cpuset's original cpus_allowed mask,
+ * so that we can restore it during resume.
+ *
+ * frozen_cpu_count == 0 indicates the begin/end of CPU
+ * hotplug initiated as part of the suspend/resume
+ * sequence.
+ */
+ if (unlikely(!frozen_cpu_count))
+ cpumask_copy(cp->suspend_cpus_allowed,
+ cp->cpus_allowed);
+
+ /* Continue past cpusets with all cpus online */
+ if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
+ continue;
+
+ /*
+ * The userspace is frozen since we are in the
+ * suspend path. So to avoid unnecessary overhead,
+ * just set cpus_allowed to cpu_active_mask and carry
+ * on, since no one will notice it anyway.
+ * Moreover, top_cpuset.cpus_allowed must track
+ * cpu_active_mask, which is taken care of as well.
+ */
+ mutex_lock(&callback_mutex);
+ cpumask_copy(cp->cpus_allowed, cpu_active_mask);
+ mutex_unlock(&callback_mutex);
+
+ update_tasks_cpumask(cp, NULL);
+ }
+ frozen_cpu_count++;
break;
case CPUSET_CPU_ONLINE_FROZEN:
- break;
+ frozen_cpu_count--;
+ if (unlikely(!frozen_cpu_count)) {
+ while (!list_empty(&queue)) {
+ cp = traverse_cpusets(&queue);
+
+ mutex_lock(&callback_mutex);
+ cpumask_copy(cp->cpus_allowed,
+ cp->suspend_cpus_allowed);
+ mutex_unlock(&callback_mutex);
+ update_tasks_cpumask(cp, NULL);
+ }
+ break;
+ } else {
+ /* Fall through */
+ }
case CPUSET_CPU_ONLINE:
/*
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> Currently the kernel doesn't handle cpusets properly during suspend/resume.
> After a resume, all non-root cpusets end up having only 1 cpu (the boot cpu),
> causing massive performance degradation of workloads. One major user of cpusets
> is libvirt, which means that after a suspend/hibernation cycle, all VMs
> suddenly end up running terribly slow!
>
> Also, the kernel moves the tasks from one cpuset to another during CPU hotplug
> in the suspend/resume path, leading to a task-management nightmare after
> resume.
>
To deal with mempolicy rebinding when a cpuset changes, I made a change to
mempolicies to store the user nodemask passed to set_mempolicy() or
mbind() so the intention of the user could be preserved. It seems like
you should do the same thing for cpusets to store the "intended" set of
cpus and respect that during cpu online?
> Patches 1 & 2 are cleanups that separate out hotplug handling so that we can
> implement different logic for different hotplug events (CPU/Mem
> online/offline). This also leads to some optimizations and more importantly
> prepares the ground for any further work dealing with cpusets during hotplug.
>
> Patch 3 is a bug fix - it ensures that the tasks attached to the root cpuset
> see the updated cpus_allowed mask upon CPU hotplug.
>
> Patches 4 and 5 implement the fix for cpusets handling during suspend/resume.
All of your patches are labeled to [email protected], but I seriously
doubt any of this is stable material since it has been a long-standing
issue (and perhaps intentional in some cases) and your series includes
cleanups and optimizations that wouldn't be stable candidates, so I'd
suggest removing that annotation.
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 14f7070..23e5da6 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -2000,6 +2000,23 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
> move_member_tasks_to_cpuset(cs, parent);
> }
>
> +static struct cpuset *traverse_cpusets(struct list_head *queue)
I suggest a different name for this, traverse_cpusets() doesn't imply that
it will be returning struct cpuset *.
> +{
> + struct cpuset *cp;
> + struct cpuset *child; /* scans child cpusets of cp */
> + struct cgroup *cont;
> +
> + cp = list_first_entry(queue, struct cpuset, stack_list);
> + list_del(queue->next);
> + list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
> + child = cgroup_cs(cont);
> + list_add_tail(&child->stack_list, queue);
> + }
> +
> + return cp;
Eek, what happens if queue is empty? It can't happen with only this
patch applied, but since you're doing this to be used in other places then
you'd need to check for list_empty().
> +}
> +
> +
> /*
> * Walk the specified cpuset subtree and look for empty cpusets.
> * The tasks of such cpuset must be moved to a parent cpuset.
> @@ -2019,19 +2036,12 @@ static void scan_for_empty_cpusets(struct cpuset *root)
> {
> LIST_HEAD(queue);
> struct cpuset *cp; /* scans cpusets being updated */
> - struct cpuset *child; /* scans child cpusets of cp */
> - struct cgroup *cont;
> static nodemask_t oldmems; /* protected by cgroup_mutex */
>
> list_add_tail((struct list_head *)&root->stack_list, &queue);
>
> while (!list_empty(&queue)) {
> - cp = list_first_entry(&queue, struct cpuset, stack_list);
> - list_del(queue.next);
> - list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
> - child = cgroup_cs(cont);
> - list_add_tail(&child->stack_list, &queue);
> - }
> + cp = traverse_cpusets(&queue);
>
> /* Continue past cpusets with all cpus, mems online */
> if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
Otherwise, looks good.
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> Separate out the cpuset related handling for CPU/Memory online/offline.
> This also helps us exploit the most obvious and basic level of optimization
> that any notification mechanism (CPU/Mem online/offline) has to offer us:
> "We *know* why we have been invoked. So stop pretending that we are lost,
> and do only the necessary amount of processing!".
>
> And while at it, rename scan_for_empty_cpusets() to
> scan_cpusets_upon_hotplug(), which will be more appropriate, considering
> the upcoming changes.
>
If it's more appropriate in upcoming changes, then change it in the
upcoming changes that make it more appropriate?
> Signed-off-by: Srivatsa S. Bhat <[email protected]>
> Cc: [email protected]
> ---
>
> include/linux/cpuset.h | 4 +-
> kernel/cpuset.c | 91 +++++++++++++++++++++++++++++++++---------------
> kernel/sched/core.c | 4 +-
> 3 files changed, 67 insertions(+), 32 deletions(-)
>
> diff --git a/include/linux/cpuset.h b/include/linux/cpuset.h
> index 668f66b..838320f 100644
> --- a/include/linux/cpuset.h
> +++ b/include/linux/cpuset.h
> @@ -20,7 +20,7 @@ extern int number_of_cpusets; /* How many cpusets are defined in system? */
>
> extern int cpuset_init(void);
> extern void cpuset_init_smp(void);
> -extern void cpuset_update_active_cpus(void);
> +extern void cpuset_update_active_cpus(bool cpu_online);
> extern void cpuset_cpus_allowed(struct task_struct *p, struct cpumask *mask);
> extern void cpuset_cpus_allowed_fallback(struct task_struct *p);
> extern nodemask_t cpuset_mems_allowed(struct task_struct *p);
> @@ -124,7 +124,7 @@ static inline void set_mems_allowed(nodemask_t nodemask)
> static inline int cpuset_init(void) { return 0; }
> static inline void cpuset_init_smp(void) {}
>
> -static inline void cpuset_update_active_cpus(void)
> +static inline void cpuset_update_active_cpus(bool cpu_online)
> {
> partition_sched_domains(1, NULL, NULL);
> }
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 23e5da6..87b0048 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -147,6 +147,12 @@ typedef enum {
> CS_SPREAD_SLAB,
> } cpuset_flagbits_t;
>
> +/* the type of hotplug event */
> +enum hotplug_event {
> + CPUSET_CPU_OFFLINE,
> + CPUSET_MEM_OFFLINE,
> +};
> +
> /* convenient tests for these bits */
> static inline int is_cpu_exclusive(const struct cpuset *cs)
> {
> @@ -2018,8 +2024,10 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
>
>
> /*
> - * Walk the specified cpuset subtree and look for empty cpusets.
> - * The tasks of such cpuset must be moved to a parent cpuset.
> + * Walk the specified cpuset subtree upon a hotplug operation (CPU/Memory
> + * online/offline) and update the cpusets accordingly.
> + * For regular CPU/Mem hotplug, look for empty cpusets; the tasks of such
> + * cpuset must be moved to a parent cpuset.
> *
> * Called with cgroup_mutex held. We take callback_mutex to modify
> * cpus_allowed and mems_allowed.
> @@ -2032,39 +2040,60 @@ static struct cpuset *traverse_cpusets(struct list_head *queue)
> * that has tasks along with an empty 'mems'. But if we did see such
> * a cpuset, we'd handle it just like we do if its 'cpus' was empty.
> */
> -static void scan_for_empty_cpusets(struct cpuset *root)
> +static void
> +scan_cpusets_upon_hotplug(struct cpuset *root, enum hotplug_event event)
> {
> LIST_HEAD(queue);
> - struct cpuset *cp; /* scans cpusets being updated */
> + struct cpuset *cp; /* scans cpusets being updated */
> static nodemask_t oldmems; /* protected by cgroup_mutex */
>
> list_add_tail((struct list_head *)&root->stack_list, &queue);
>
> - while (!list_empty(&queue)) {
> - cp = traverse_cpusets(&queue);
> + switch (event) {
> + case CPUSET_CPU_OFFLINE:
> + while (!list_empty(&queue)) {
> + cp = traverse_cpusets(&queue);
>
> - /* Continue past cpusets with all cpus, mems online */
> - if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
> - nodes_subset(cp->mems_allowed, node_states[N_HIGH_MEMORY]))
> - continue;
> + /* Continue past cpusets with all cpus online */
> + if (cpumask_subset(cp->cpus_allowed, cpu_active_mask))
> + continue;
>
> - oldmems = cp->mems_allowed;
> + /* Remove offline cpus from this cpuset. */
> + mutex_lock(&callback_mutex);
> + cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
> + cpu_active_mask);
> + mutex_unlock(&callback_mutex);
>
> - /* Remove offline cpus and mems from this cpuset. */
> - mutex_lock(&callback_mutex);
> - cpumask_and(cp->cpus_allowed, cp->cpus_allowed,
> - cpu_active_mask);
> - nodes_and(cp->mems_allowed, cp->mems_allowed,
> + /* Move tasks from the empty cpuset to a parent */
> + if (cpumask_empty(cp->cpus_allowed))
> + remove_tasks_in_empty_cpuset(cp);
> + else
> + update_tasks_cpumask(cp, NULL);
> + }
> + break;
> +
> + case CPUSET_MEM_OFFLINE:
> + while (!list_empty(&queue)) {
> + cp = traverse_cpusets(&queue);
> +
> + /* Continue past cpusets with all mems online */
> + if (nodes_subset(cp->mems_allowed,
> + node_states[N_HIGH_MEMORY]))
> + continue;
> +
> + oldmems = cp->mems_allowed;
> +
> + /* Remove offline mems from this cpuset. */
> + mutex_lock(&callback_mutex);
> + nodes_and(cp->mems_allowed, cp->mems_allowed,
> node_states[N_HIGH_MEMORY]);
> - mutex_unlock(&callback_mutex);
> + mutex_unlock(&callback_mutex);
>
> - /* Move tasks from the empty cpuset to a parent */
> - if (cpumask_empty(cp->cpus_allowed) ||
> - nodes_empty(cp->mems_allowed))
> - remove_tasks_in_empty_cpuset(cp);
> - else {
> - update_tasks_cpumask(cp, NULL);
> - update_tasks_nodemask(cp, &oldmems, NULL);
> + /* Move tasks from the empty cpuset to a parent */
> + if (nodes_empty(cp->mems_allowed))
> + remove_tasks_in_empty_cpuset(cp);
> + else
> + update_tasks_nodemask(cp, &oldmems, NULL);
> }
> }
> }
This looks like a good optimization, but the existing comment for
scan_for_empty_cpusets() is wrong: we certainly do not lack memory
hot-unplug and it will remove nodes from N_HIGH_MEMORY if all present
pages from a node are offlined. I had a patch that emulated node
hot-remove on x86 and this worked fine. So perhaps update that existing
comment as well (not shown in this diff)?
Otherwise, looks good.
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> The root cpuset's cpus_allowed mask can get updated during CPU hotplug.
> However, during those updates, the tasks belonging to the root cpuset
> aren't touched at all - their cpus_allowed masks aren't updated to reflect
> the change in the configuration of the cpuset they belong to.
>
> Fix this by moving the update of top_cpuset.cpus_allowed to
> scan_cpusets_upon_hotplug() and ensure that the call to update_tasks_cpumask()
> is not missed out for the root cpuset, including for cpu online events.
>
> Signed-off-by: Srivatsa S. Bhat <[email protected]>
Acked-by: David Rientjes <[email protected]>
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> Cpusets needs to distinguish between a regular CPU Hotplug operation and a
> CPU Hotplug operation carried out as part of the suspend/resume sequence.
> So add provisions to facilitate that, so that the two operations can be
> handled differently.
>
There's no functional change with this patch and it's unclear from this
changelog why we need to distinguish between the two, so perhaps fold this
into patch 5 or explain how this will be helpful in this changelog?
Otherwise it doesn't seem justifiable to add 30 more lines of code.
On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> index 0723183..671bf26 100644
> --- a/kernel/cpuset.c
> +++ b/kernel/cpuset.c
> @@ -93,6 +93,13 @@ struct cpuset {
>
> unsigned long flags; /* "unsigned long" so bitops work */
> cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
> +
> + /*
> + * used to save cpuset's cpus_allowed mask during suspend and restore
> + * it during resume.
> + */
> + cpumask_var_t suspend_cpus_allowed;
> +
> nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
>
> struct cpuset *parent; /* my parent */
I see what you're doing with this and think it will fix the problem that
you're trying to address, but I think it could become much more general
to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
example, and cpu 5 goes offline, then I believe the cpuset should once
again become 4-6 if cpu 5 comes back online. So I think this should be
implemented like mempolicies are which save the user intended nodemask
that may become restricted by cpuset placement but will be rebound if the
cpuset includes the intended nodes.
If that's done, then it takes care of the suspend case as well and adds
generic cpu hotplug support, not just for suspend/resume.
On 14.05.2012 [17:37:50 -0700], David Rientjes wrote:
> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>
> > diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> > index 0723183..671bf26 100644
> > --- a/kernel/cpuset.c
> > +++ b/kernel/cpuset.c
> > @@ -93,6 +93,13 @@ struct cpuset {
> >
> > unsigned long flags; /* "unsigned long" so bitops work */
> > cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
> > +
> > + /*
> > + * used to save cpuset's cpus_allowed mask during suspend and restore
> > + * it during resume.
> > + */
> > + cpumask_var_t suspend_cpus_allowed;
> > +
> > nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
> >
> > struct cpuset *parent; /* my parent */
>
> I see what you're doing with this and think it will fix the problem that
> you're trying to address, but I think it could become much more general
> to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
> example, and cpu 5 goes offline, then I believe the cpuset should once
> again become 4-6 if cpu 5 comes back online. So I think this should be
> implemented like mempolicies are which save the user intended nodemask
> that may become restricted by cpuset placement but will be rebound if the
> cpuset includes the intended nodes.
Heh, please read the thread at
http://marc.info/?l=linux-kernel&m=133615922717112&w=2 ... subject is
"[PATCH v2 0/7] CPU hotplug, cpusets: Fix issues with cpusets handling
upon CPU hotplug". That was effectively the same solution Srivatsa
originally posted. But after lengthy discussions with PeterZ and others,
it was decided that suspend/resume is a special case where it makes
sense to save "policy" but that generally cpu/memory hotplug is a
destructive operation and nothing is required to be retained (that
certain policies are retained is unfortunately now expected, but isn't
guaranteed for cpusets, at least).
> If that's done, then it takes care of the suspend case as well and adds
> generic cpu hotplug support, not just for suspend/resume.
Yeah ... but that's not the intent here (to add generic cpu hotplug
support).
Srivatsa, perhaps a reference to some summary of why it's not good to
support the general hotplug case would have been good in 0/5?
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On Mon, 14 May 2012, Nishanth Aravamudan wrote:
> > I see what you're doing with this and think it will fix the problem that
> > you're trying to address, but I think it could become much more general
> > to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
> > example, and cpu 5 goes offline, then I believe the cpuset should once
> > again become 4-6 if cpu 5 comes back online. So I think this should be
> > implemented like mempolicies are which save the user intended nodemask
> > that may become restricted by cpuset placement but will be rebound if the
> > cpuset includes the intended nodes.
>
> Heh, please read the thread at
> http://marc.info/?l=linux-kernel&m=133615922717112&w=2 ... subject is
> "[PATCH v2 0/7] CPU hotplug, cpusets: Fix issues with cpusets handling
> upon CPU hotplug". That was effectively the same solution Srivatsa
> originally posted. But after lengthy discussions with PeterZ and others,
> it was decided that suspend/resume is a special case where it makes
> sense to save "policy" but that generally cpu/memory hotplug is a
> destructive operation and nothing is required to be retained (that
> certain policies are retained is unfortunately now expected, but isn't
> guaranteed for cpusets, at least).
>
If you do set_mempolicy(MPOL_BIND, 2-3) to bind a thread to nodes 2-3 that
is attached to a cpuset whereas cpuset.mems == 2-3, and then cpuset.mems
changes to 0-1, what is the expected behavior? Do we immediately oom on
the next allocation? If cpuset.mems is set again to 2-3, what's the
desired behavior?
I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2) by
saving the user intended nodemask passed by set_mempolicy() and respecting
it whenever allowed by cpusets.
Right now, the behavior of what happens for a cpuset where cpuset.cpus ==
2-3 and then cpus 2-3 go offline and then are brought back online is
undefined. The same is true of cpuset.cpus during resume. So if you're
going to add a cpumask to struct cpuset, then why not respect it for all
offline events and get rid of all this specialized suspend-only stuff?
It's very simple to make this consistent across all cpu hotplug events and
build suspend on top of it from a cpuset perspective.
On 14.05.2012 [21:04:16 -0700], David Rientjes wrote:
> On Mon, 14 May 2012, Nishanth Aravamudan wrote:
>
> > > I see what you're doing with this and think it will fix the problem that
> > > you're trying to address, but I think it could become much more general
> > > to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
> > > example, and cpu 5 goes offline, then I believe the cpuset should once
> > > again become 4-6 if cpu 5 comes back online. So I think this should be
> > > implemented like mempolicies are which save the user intended nodemask
> > > that may become restricted by cpuset placement but will be rebound if the
> > > cpuset includes the intended nodes.
> >
> > Heh, please read the thread at
> > http://marc.info/?l=linux-kernel&m=133615922717112&w=2 ... subject is
> > "[PATCH v2 0/7] CPU hotplug, cpusets: Fix issues with cpusets handling
> > upon CPU hotplug". That was effectively the same solution Srivatsa
> > originally posted. But after lengthy discussions with PeterZ and others,
> > it was decided that suspend/resume is a special case where it makes
> > sense to save "policy" but that generally cpu/memory hotplug is a
> > destructive operation and nothing is required to be retained (that
> > certain policies are retained is unfortunately now expected, but isn't
> > guaranteed for cpusets, at least).
> >
>
> If you do set_mempolicy(MPOL_BIND, 2-3) to bind a thread to nodes 2-3
> that is attached to a cpuset whereas cpuset.mems == 2-3, and then
> cpuset.mems changes to 0-1, what is the expected behavior? Do we
> immediately oom on the next allocation? If cpuset.mems is set again
> to 2-3, what's the desired behavior?
"expected [or desired] behavior" always makes me cringe. It's usually
some insane user-level expectations that don't really make sense :).
But I don't honestly know the answer here as I've not polled any
customers on it. `man cpuset` does provide some insight into the
implementation, though:
Cpusets are integrated with the sched_setaffinity(2) scheduling
affinity mechanism and the mbind(2) and set_mempolicy(2)
memory-placement mechanisms in the kernel. Neither of these
mechanisms let a process make use of a CPU or memory node that
is not allowed by that process's cpuset. If changes to a
process's cpuset placement conflict with these other mechanisms,
then cpuset placement is enforced even if it means overriding
these other mechanisms. The kernel accomplishes this overriding
by silently restricting the CPUs and memory nodes requested by
these other mechanisms to those allowed by the invoking
process's cpuset. This can result in these other calls
returning an error, if for example, such a call ends up
requesting an empty set of CPUs or memory nodes, after that
request is restricted to the invoking process's cpuset.
So no, it should not OOM, but instead the mempolicy is ignored.
> I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2)
> by saving the user intended nodemask passed by set_mempolicy() and
> respecting it whenever allowed by cpusets.
So, if you read that thread, this is what (in essence) Srivatsa proposed
in v2. We store the user-defined cpumask and keep it regardless of
kernel decisions. We intersect the user-defined cpumask with the kernel
(which is really reflecting the administrator's hotplug decisions)
topology and run tasks in constrained cpusets on the result. We reflect
this decision in a new read-only file in each cpuset that indicates the
"actual" cpus that a task in a given cpuset may be scheduled on.
But PeterZ nack-ed it and his reasoning was sound -- CPU (and memory, I
would think) hotplug is a necessarily destructive behavior.
> Right now, the behavior of what happens for a cpuset where cpuset.cpus ==
> 2-3 and then cpus 2-3 go offline and then are brought back online is
> undefined.
Erm, no it's rather clearly defined by what actually happens. It may not
be "specified" in a formal document, but behavior is a heckuva thing.
What happens is that the offlining process pushes the tasks in that
constrained cpuset up into the parent cpuset (actually moves them). In a
suspend case, since we're offlining all CPUs, this results in all task
being pushed up to the root cpuset.
I would also quote `man cpuset` here to actually say the behavior is
"specified", technically:
If hot-plug functionality is used to remove all the CPUs that
are currently assigned to a cpuset, then the kernel will
automatically update the cpus_allowed of all processes attached
to CPUs in that cpuset to allow all CPUs.
The fact that those CPUs are eventually (or immediately) brought back
online is not considered in the decision of how to handle tasks in the
constrained cpuset when the CPUs are taken offline. That seems to make
sense, since there isn't any guarantee that an offlined CPU will ever
return to online status in the future.
> The same is true of cpuset.cpus during resume. So if you're going to
> add a cpumask to struct cpuset, then why not respect it for all
> offline events and get rid of all this specialized suspend-only stuff?
> It's very simple to make this consistent across all cpu hotplug events
> and build suspend on top of it from a cpuset perspective.
"simple" -- sure. Read v2 of the patchset, as I said. But then read all
the discussion that follows and I think you will see that this has been
hashed out before with similar reasoning on both sides, and that the
policy side of things is not obviously simple. The resulting decision
was to special-case suspend, but not "remember" state across other
hotplug actions, which is more of an "unintentional hotplug" (and from
what Paul McKenney mentions in that thread, sounds like tglx is working
on patches to remove the full hotplug usage from s/r).
Thanks,
Nish
--
Nishanth Aravamudan <[email protected]>
IBM Linux Technology Center
On 05/15/2012 07:10 AM, Nishanth Aravamudan wrote:
> On 14.05.2012 [17:37:50 -0700], David Rientjes wrote:
>> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>>
>>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>>> index 0723183..671bf26 100644
>>> --- a/kernel/cpuset.c
>>> +++ b/kernel/cpuset.c
>>> @@ -93,6 +93,13 @@ struct cpuset {
>>>
>>> unsigned long flags; /* "unsigned long" so bitops work */
>>> cpumask_var_t cpus_allowed; /* CPUs allowed to tasks in cpuset */
>>> +
>>> + /*
>>> + * used to save cpuset's cpus_allowed mask during suspend and restore
>>> + * it during resume.
>>> + */
>>> + cpumask_var_t suspend_cpus_allowed;
>>> +
>>> nodemask_t mems_allowed; /* Memory Nodes allowed to tasks */
>>>
>>> struct cpuset *parent; /* my parent */
>>
>> I see what you're doing with this and think it will fix the problem that
>> you're trying to address, but I think it could become much more general
>> to just the suspend case: if an admin sets a cpuset to have cpus 4-6, for
>> example, and cpu 5 goes offline, then I believe the cpuset should once
>> again become 4-6 if cpu 5 comes back online. So I think this should be
>> implemented like mempolicies are which save the user intended nodemask
>> that may become restricted by cpuset placement but will be rebound if the
>> cpuset includes the intended nodes.
>
> Heh, please read the thread at
> http://marc.info/?l=linux-kernel&m=133615922717112&w=2 ... subject is
> "[PATCH v2 0/7] CPU hotplug, cpusets: Fix issues with cpusets handling
> upon CPU hotplug". That was effectively the same solution Srivatsa
> originally posted. But after lengthy discussions with PeterZ and others,
> it was decided that suspend/resume is a special case where it makes
> sense to save "policy" but that generally cpu/memory hotplug is a
> destructive operation and nothing is required to be retained (that
> certain policies are retained is unfortunately now expected, but isn't
> guaranteed for cpusets, at least).
>
>> If that's done, then it takes care of the suspend case as well and adds
>> generic cpu hotplug support, not just for suspend/resume.
>
> Yeah ... but that's not the intent here (to add generic cpu hotplug
> support).
>
> Srivatsa, perhaps a reference to some summary of why it's not good to
> support the general hotplug case would have been good in 0/5?
>
Yep, I forgot to add that..
Anyway, here is the summary, for the benefit of reviewers:
CPU hotplug, by definition, is a destructive operation. As a result, there
is no guarantee that after a hot-unplug, the cpu will ever get hot-replugged.
And hence, in the event of a hotplug operation, remembering the state
doesn't make much sense.
However, suspend/resume is a special case in which, it uses CPU hotplug
reversibly - all nonboot cpus are hot-removed during suspend, and hot-added
back during resume.. *every one of them*!
So during suspend/resume, we are 100% sure that whatever we hot-removed
will be added back, because the goal of suspend/resume is to bring things
to the same state anyway. So, we are forced to remember the state for CPU
hotplug done in the suspend/resume path, so that we can restore it back.
And for the regular CPU hotplug case, Peter Zijlstra also gave the example
of sched_setaffinity(), whose implementation differs slightly from that
of cpusets, but in the end, it also respects hotplug semantics and doesn't
remember the state after hotplug.
So, the bottomline is, in Peter's own words:
"Generic hotplug is a destructive operation. And hence, no state should be
remembered. No ifs and buts about that!"
But Peter agreed (in an off-list discussion) to a special case handling of
cpusets during suspend/resume alone, without altering how it is dealt with
during regular CPU hotplug. And this patchset (v3) implements that design.
Regards,
Srivatsa S. Bhat
On 05/15/2012 05:28 AM, David Rientjes wrote:
> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>
>> Currently the kernel doesn't handle cpusets properly during suspend/resume.
>> After a resume, all non-root cpusets end up having only 1 cpu (the boot cpu),
>> causing massive performance degradation of workloads. One major user of cpusets
>> is libvirt, which means that after a suspend/hibernation cycle, all VMs
>> suddenly end up running terribly slow!
>>
>> Also, the kernel moves the tasks from one cpuset to another during CPU hotplug
>> in the suspend/resume path, leading to a task-management nightmare after
>> resume.
>>
>
> To deal with mempolicy rebinding when a cpuset changes, I made a change to
> mempolicies to store the user nodemask passed to set_mempolicy() or
> mbind() so the intention of the user could be preserved. It seems like
> you should do the same thing for cpusets to store the "intended" set of
> cpus and respect that during cpu online?
>
Well, I think Nishanth addressed this one already.. As he said, that idea was
implemented in v2 of the patchset[1], and it turned out to be against hotplug
semantics, as pointed out by Peter Zijlstra.
[1]. http://thread.gmane.org/gmane.linux.documentation/4805
>> Patches 1 & 2 are cleanups that separate out hotplug handling so that we can
>> implement different logic for different hotplug events (CPU/Mem
>> online/offline). This also leads to some optimizations and more importantly
>> prepares the ground for any further work dealing with cpusets during hotplug.
>>
>> Patch 3 is a bug fix - it ensures that the tasks attached to the root cpuset
>> see the updated cpus_allowed mask upon CPU hotplug.
>>
>> Patches 4 and 5 implement the fix for cpusets handling during suspend/resume.
>
> All of your patches are labeled to [email protected], but I seriously
> doubt any of this is stable material since it has been a long-standing
> issue (and perhaps intentional in some cases)
Yes, it is a long-standing issue (bug), but it is not intentional.
People are struggling to deal with this kernel bug for suspend/resume and
there have been numerous bug reports everywhere. It is high time we
fix this in the kernel and get it into stable kernels too (because they too have
this bug).
> and your series includes
> cleanups and optimizations that wouldn't be stable candidates, so I'd
> suggest removing that annotation.
>
Well, the existing code was so messed up that I didn't have a choice but to
clean it up before fixing the suspend/resume case. Had I tried to implement
the fix without cleaning it up, it would have been absolutely horrible, I believe.
And the optimizations? those are just side effects of that cleanup! That really
tells the extent to which it was messed up in the first place!
Regards,
Srivatsa S. Bhat
On 05/15/2012 05:33 AM, David Rientjes wrote:
> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>
>> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
>> index 14f7070..23e5da6 100644
>> --- a/kernel/cpuset.c
>> +++ b/kernel/cpuset.c
>> @@ -2000,6 +2000,23 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
>> move_member_tasks_to_cpuset(cs, parent);
>> }
>>
>> +static struct cpuset *traverse_cpusets(struct list_head *queue)
>
> I suggest a different name for this, traverse_cpusets() doesn't imply that
> it will be returning struct cpuset *.
OK, I guess something like cpuset_next() or next_cpuset() should be fine?
>
>> +{
>> + struct cpuset *cp;
>> + struct cpuset *child; /* scans child cpusets of cp */
>> + struct cgroup *cont;
>> +
>> + cp = list_first_entry(queue, struct cpuset, stack_list);
>> + list_del(queue->next);
>> + list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
>> + child = cgroup_cs(cont);
>> + list_add_tail(&child->stack_list, queue);
>> + }
>> +
>> + return cp;
>
> Eek, what happens if queue is empty? It can't happen with only this
> patch applied, but since you're doing this to be used in other places then
> you'd need to check for list_empty().
>
Actually, I didn't intend this to be used in other places because their
traversals are a little bit different anyway.. But yes, having a check for
list_empty() would be good. I'll add it and change the loop to something like:
while (cpuset_next(&queue)) {
...
}
Thanks for pointing this out!
>> +}
>> +
>> +
>> /*
>> * Walk the specified cpuset subtree and look for empty cpusets.
>> * The tasks of such cpuset must be moved to a parent cpuset.
>> @@ -2019,19 +2036,12 @@ static void scan_for_empty_cpusets(struct cpuset *root)
>> {
>> LIST_HEAD(queue);
>> struct cpuset *cp; /* scans cpusets being updated */
>> - struct cpuset *child; /* scans child cpusets of cp */
>> - struct cgroup *cont;
>> static nodemask_t oldmems; /* protected by cgroup_mutex */
>>
>> list_add_tail((struct list_head *)&root->stack_list, &queue);
>>
>> while (!list_empty(&queue)) {
>> - cp = list_first_entry(&queue, struct cpuset, stack_list);
>> - list_del(queue.next);
>> - list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
>> - child = cgroup_cs(cont);
>> - list_add_tail(&child->stack_list, &queue);
>> - }
>> + cp = traverse_cpusets(&queue);
>>
>> /* Continue past cpusets with all cpus, mems online */
>> if (cpumask_subset(cp->cpus_allowed, cpu_active_mask) &&
>
> Otherwise, looks good.
>
Thanks a lot!
Regards,
Srivatsa S. Bhat
On 05/15/2012 05:57 AM, David Rientjes wrote:
> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>
>> Separate out the cpuset related handling for CPU/Memory online/offline.
>> This also helps us exploit the most obvious and basic level of optimization
>> that any notification mechanism (CPU/Mem online/offline) has to offer us:
>> "We *know* why we have been invoked. So stop pretending that we are lost,
>> and do only the necessary amount of processing!".
>>
>> And while at it, rename scan_for_empty_cpusets() to
>> scan_cpusets_upon_hotplug(), which will be more appropriate, considering
>> the upcoming changes.
>>
>
> If it's more appropriate in upcoming changes, then change it in the
> upcoming changes that make it more appropriate?
>
Well, I wanted to split out the core of the fix from the rest of the cleanup,
so that the fix patch can be more focussed, thereby easing review.
And I think renaming this function is more of a noise, when compared with the
fix being implemented in the later patches.. So I thought I'll get it out of
the way by doing it here itself.
Moreover, that renaming is justified in this patch itself, IMHO.. It doesn't
really have to wait till the later ones, because considering the restructuring
that this patch does, the renaming is in order too..
>> Signed-off-by: Srivatsa S. Bhat <[email protected]>
>> Cc: [email protected]
>> ---
[...]
>> +
>> + case CPUSET_MEM_OFFLINE:
>> + while (!list_empty(&queue)) {
>> + cp = traverse_cpusets(&queue);
>> +
>> + /* Continue past cpusets with all mems online */
>> + if (nodes_subset(cp->mems_allowed,
>> + node_states[N_HIGH_MEMORY]))
>> + continue;
>> +
>> + oldmems = cp->mems_allowed;
>> +
>> + /* Remove offline mems from this cpuset. */
>> + mutex_lock(&callback_mutex);
>> + nodes_and(cp->mems_allowed, cp->mems_allowed,
>> node_states[N_HIGH_MEMORY]);
>> - mutex_unlock(&callback_mutex);
>> + mutex_unlock(&callback_mutex);
>>
>> - /* Move tasks from the empty cpuset to a parent */
>> - if (cpumask_empty(cp->cpus_allowed) ||
>> - nodes_empty(cp->mems_allowed))
>> - remove_tasks_in_empty_cpuset(cp);
>> - else {
>> - update_tasks_cpumask(cp, NULL);
>> - update_tasks_nodemask(cp, &oldmems, NULL);
>> + /* Move tasks from the empty cpuset to a parent */
>> + if (nodes_empty(cp->mems_allowed))
>> + remove_tasks_in_empty_cpuset(cp);
>> + else
>> + update_tasks_nodemask(cp, &oldmems, NULL);
>> }
>> }
>> }
>
> This looks like a good optimization, but the existing comment for
> scan_for_empty_cpusets() is wrong: we certainly do not lack memory
> hot-unplug and it will remove nodes from N_HIGH_MEMORY if all present
> pages from a node are offlined. I had a patch that emulated node
> hot-remove on x86 and this worked fine. So perhaps update that existing
> comment as well (not shown in this diff)?
>
Sure, will do.
> Otherwise, looks good.
Thanks a lot!
Regards,
Srivatsa S. Bhat
On 05/15/2012 06:03 AM, David Rientjes wrote:
> On Mon, 14 May 2012, Srivatsa S. Bhat wrote:
>
>> Cpusets needs to distinguish between a regular CPU Hotplug operation and a
>> CPU Hotplug operation carried out as part of the suspend/resume sequence.
>> So add provisions to facilitate that, so that the two operations can be
>> handled differently.
>>
>
> There's no functional change with this patch and it's unclear from this
> changelog why we need to distinguish between the two, so perhaps fold this
> into patch 5 or explain how this will be helpful in this changelog?
> Otherwise it doesn't seem justifiable to add 30 more lines of code.
Well, as 0/5 explains, this whole patchset is a suspend/resume-only fix.
So we need special-case handling for suspend/resume in cpusets. So the
additional code is justified, IMHO. It prepares the ground for patch 5.
Again, I split it up here because I didn't want to clutter patch 5; it is
complex enough as it is... ;-)
Regards,
Srivatsa S. Bhat
On Tue, 15 May 2012, Srivatsa S. Bhat wrote:
> >> diff --git a/kernel/cpuset.c b/kernel/cpuset.c
> >> index 14f7070..23e5da6 100644
> >> --- a/kernel/cpuset.c
> >> +++ b/kernel/cpuset.c
> >> @@ -2000,6 +2000,23 @@ static void remove_tasks_in_empty_cpuset(struct cpuset *cs)
> >> move_member_tasks_to_cpuset(cs, parent);
> >> }
> >>
> >> +static struct cpuset *traverse_cpusets(struct list_head *queue)
> >
> > I suggest a different name for this, traverse_cpusets() doesn't imply that
> > it will be returning struct cpuset *.
>
>
> OK, I guess something like cpuset_next() or next_cpuset() should be fine?
>
The generic cgroup code seems to prefer cgroup_iter_next(), so
cpuset_next() sounds appropriate.
> >> +{
> >> + struct cpuset *cp;
> >> + struct cpuset *child; /* scans child cpusets of cp */
> >> + struct cgroup *cont;
> >> +
> >> + cp = list_first_entry(queue, struct cpuset, stack_list);
> >> + list_del(queue->next);
> >> + list_for_each_entry(cont, &cp->css.cgroup->children, sibling) {
> >> + child = cgroup_cs(cont);
> >> + list_add_tail(&child->stack_list, queue);
> >> + }
> >> +
> >> + return cp;
> >
> > Eek, what happens if queue is empty? It can't happen with only this
> > patch applied, but since you're doing this to be used in other places then
> > you'd need to check for list_empty().
> >
>
>
> Actually, I didn't intend this to be used in other places because their
> traversals are a little bit different anyway.. But yes, having a check for
> list_empty() would be good. I'll add it and change the loop to something like:
>
> while (cpuset_next(&queue)) {
> ...
> }
>
> Thanks for pointing this out!
>
Sounds good if cpuset_next() returns NULL for list_empty().
On Mon, 14 May 2012, Nishanth Aravamudan wrote:
> > If you do set_mempolicy(MPOL_BIND, 2-3) to bind a thread to nodes 2-3
> > that is attached to a cpuset whereas cpuset.mems == 2-3, and then
> > cpuset.mems changes to 0-1, what is the expected behavior? Do we
> > immediately oom on the next allocation? If cpuset.mems is set again
> > to 2-3, what's the desired behavior?
>
> "expected [or desired] behavior" always makes me cringe. It's usually
> some insane user-level expectations that don't really make sense :).
Yeah, and I think we should be moving in a direction where this behavior
is defined so that nobody can expect anything else.
> Cpusets are integrated with the sched_setaffinity(2) scheduling
> affinity mechanism and the mbind(2) and set_mempolicy(2)
> memory-placement mechanisms in the kernel. Neither of these
> mechanisms let a process make use of a CPU or memory node that
> is not allowed by that process's cpuset. If changes to a
> process's cpuset placement conflict with these other mechanisms,
> then cpuset placement is enforced even if it means overriding
> these other mechanisms.
This makes perfect sense because an admin wants to be able to move the
cpuset placement of a thread regardless of whether that thread did
sched_setaffinity() or mbind() itself so that it is running on a set of
isolated nodes that have affinity to its cpus. I agree that cpusets
should always take precedent.
However, if a thread did set_mempolicy(MPOL_BIND, 2-3) where cpuset.mems
== node_online_map, cpuset.mems changes to 0-1, then cpuset.mems changes
back to node_online_map, then I believe (and implemented in the mempolicy
code and added the specification in the man page) that the thread should
be bound to nodes 2-3.
> > I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2)
> > by saving the user intended nodemask passed by set_mempolicy() and
> > respecting it whenever allowed by cpusets.
>
> So, if you read that thread, this is what (in essence) Srivatsa proposed
> in v2. We store the user-defined cpumask and keep it regardless of
> kernel decisions. We intersect the user-defined cpumask with the kernel
> (which is really reflecting the administrator's hotplug decisions)
> topology and run tasks in constrained cpusets on the result. We reflect
> this decision in a new read-only file in each cpuset that indicates the
> "actual" cpus that a task in a given cpuset may be scheduled on.
>
I don't think we need a new read-only file that exposes the stored
cpumask, I think it should be stored and respected when possible and the
set of allowed cpus be exported in the way it always has been, through
cpuset.cpus.
> But PeterZ nack-ed it and his reasoning was sound -- CPU (and memory, I
> would think) hotplug is a necessarily destructive behavior.
>
From a thread perspective, how is hot-removing a node different from
clearing the node's bit in cpuset.mems?
How is hot-adding a node different from setting the node's bit in
cpuset.mems?
> > Right now, the behavior of what happens for a cpuset where cpuset.cpus ==
> > 2-3 and then cpus 2-3 go offline and then are brought back online is
> > undefined.
>
> Erm, no it's rather clearly defined by what actually happens. It may not
> be "specified" in a formal document, but behavior is a heckuva thing.
>
"Undefined" in the sense that there's no formal specification for what the
behavior is; of course it has a current behavior, just like gcc compiles
1-bit int bit-fields as signed although that behavior is
implementation-defined. You'll be defining the behavior with this patchset.
> What happens is that the offlining process pushes the tasks in that
> constrained cpuset up into the parent cpuset (actually moves them). In a
> suspend case, since we're offlining all CPUs, this results in all task
> being pushed up to the root cpuset.
>
> I would also quote `man cpuset` here to actually say the behavior is
> "specified", technically:
>
> If hot-plug functionality is used to remove all the CPUs that
> are currently assigned to a cpuset, then the kernel will
> automatically update the cpus_allowed of all processes attached
> to CPUs in that cpuset to allow all CPUs.
>
Right, and that's consistent because the root cpuset requires all cpus.
> > The same is true of cpuset.cpus during resume. So if you're going to
> > add a cpumask to struct cpuset, then why not respect it for all
> > offline events and get rid of all this specialized suspend-only stuff?
> > It's very simple to make this consistent across all cpu hotplug events
> > and build suspend on top of it from a cpuset perspective.
>
> "simple" -- sure. Read v2 of the patchset, as I said. But then read all
> the discussion that follows and I think you will see that this has been
> hashed out before with similar reasoning on both sides, and that the
> policy side of things is not obviously simple. The resulting decision
> was to special-case suspend, but not "remember" state across other
> hotplug actions, which is more of an "unintentional hotplug" (and from
> what Paul McKenney mentions in that thread, sounds like tglx is working
> on patches to remove the full hotplug usage from s/r).
>
We're talking about two very different things. The suspend case is
special in this regard _only_ because it moves all threads to the root
cpuset and obviously you can't have a user-specified cpumask for the root
cpuset. That's irrelevant to my question about why we aren't storing the
user-specified cpumask in all non-root cpusets, which certainly remains
consistent even with suspend since these non-root cpusets cease to exist.
If a cpuset is defined to have cpuset.cpus == 2-3, cpu 3 is offlined, and
then cpu 3 is onlined, the behavior is currently undefined. You could
make the argument that cpusets is purely about NUMA and that cpu 3 may no
longer have affinity to cpuset.mems in which case I would agree that we
should not reset cpuset.cpus to 2-3 in this case. But that doesn't seem
to be the motivation here because we keep talking about suspend.
On Tue, 15 May 2012, Srivatsa S. Bhat wrote:
> >> Cpusets needs to distinguish between a regular CPU Hotplug operation and a
> >> CPU Hotplug operation carried out as part of the suspend/resume sequence.
> >> So add provisions to facilitate that, so that the two operations can be
> >> handled differently.
> >>
> >
> > There's no functional change with this patch and it's unclear from this
> > changelog why we need to distinguish between the two, so perhaps fold this
> > into patch 5 or explain how this will be helpful in this changelog?
> > Otherwise it doesn't seem justifiable to add 30 more lines of code.
>
>
> Well, as 0/5 explains, this whole patchset is a suspend/resume-only fix.
> So we need special-case handling for suspend/resume in cpusets. So the
> additional code is justified, IMHO. It prepares the ground for patch 5.
>
Your change, once merged, will not carry patch 0/5 here, so it would be
helpful to understand why we need to distinguish between the two as a
stand-alone patch in your changelog.
On 05/16/2012 12:04 AM, David Rientjes wrote:
> On Tue, 15 May 2012, Srivatsa S. Bhat wrote:
>
>>>> Cpusets needs to distinguish between a regular CPU Hotplug operation and a
>>>> CPU Hotplug operation carried out as part of the suspend/resume sequence.
>>>> So add provisions to facilitate that, so that the two operations can be
>>>> handled differently.
>>>>
>>>
>>> There's no functional change with this patch and it's unclear from this
>>> changelog why we need to distinguish between the two, so perhaps fold this
>>> into patch 5 or explain how this will be helpful in this changelog?
>>> Otherwise it doesn't seem justifiable to add 30 more lines of code.
>>
>>
>> Well, as 0/5 explains, this whole patchset is a suspend/resume-only fix.
>> So we need special-case handling for suspend/resume in cpusets. So the
>> additional code is justified, IMHO. It prepares the ground for patch 5.
>>
>
> Your change, once merged, will not carry patch 0/5 here, so it would be
> helpful to understand why we need to distinguish between the two as a
> stand-alone patch in your changelog.
>
Sorry, I didn't quite get your point. Do you just want me to update this
changelog or do you still want me to squash this patch with patch 5?
If it's the former, ok, I can try (though, I really don't know what else
to add.. it already says that we need to have special-case handling of
cpusets for suspend/resume.. and also says that this patch adds the
necessary provisions.. which implies I am going to use these provisions
next (in patch 5)..) Or, are you asking me to explain *why* we need to
have special-case handling for suspend/resume? If that is the case, I
could probably move some bits from the changelog of patch 5 to this patch.
However, if it's the latter (i.e., you want me to still squash it with
patch 5), I am not quite convinced about it.. I still think doing that
will clutter patch 5 unnecessarily.
Regards,
Srivatsa S. Bhat
On Tue, 2012-05-15 at 11:31 -0700, David Rientjes wrote:
> However, if a thread did set_mempolicy(MPOL_BIND, 2-3) where cpuset.mems
> == node_online_map, cpuset.mems changes to 0-1, then cpuset.mems changes
> back to node_online_map, then I believe (and implemented in the mempolicy
> code and added the specification in the man page) that the thread should
> be bound to nodes 2-3.
I disagree, but alas that is done :-(
But what happens if you unplug nodes 2-3?
> > > I fixed this problem by introducing MPOL_F_* flags in set_mempolicy(2)
> > > by saving the user intended nodemask passed by set_mempolicy() and
> > > respecting it whenever allowed by cpusets.
> >
> > So, if you read that thread, this is what (in essence) Srivatsa proposed
> > in v2. We store the user-defined cpumask and keep it regardless of
> > kernel decisions. We intersect the user-defined cpumask with the kernel
> > (which is really reflecting the administrator's hotplug decisions)
> > topology and run tasks in constrained cpusets on the result. We reflect
> > this decision in a new read-only file in each cpuset that indicates the
> > "actual" cpus that a task in a given cpuset may be scheduled on.
> >
>
> I don't think we need a new read-only file that exposes the stored
> cpumask, I think it should be stored and respected when possible and the
> set of allowed cpus be exported in the way it always has been, through
> cpuset.cpus.
I agree we don't want the new file, I'm not sure what you mean with the
rest though.
> If a cpuset is defined to have cpuset.cpus == 2-3, cpu 3 is offlined, and
> then cpu 3 is onlined, the behavior is currently undefined.
Uhm, it's documented to not restore 3. And changing this at this point
seems pointless, it doesn't solve Srivatsa's problem and is otherwise
pointless churn.
> You could
> make the argument that cpusets is purely about NUMA and that cpu 3 may no
> longer have affinity to cpuset.mems in which case I would agree that we
> should not reset cpuset.cpus to 2-3 in this case. But that doesn't seem
> to be the motivation here because we keep talking about suspend.
The problem is that if you have some cpusets configuration and then do a
s/r cycle the entire configuration is wrecked because suspend
hot-unplugs all but cpu0 and resume re-plugs the cpus.
This destroys all masks and migrates all tasks in sets not including
cpu0 to the root set.
Srivatsa proposed to 'fix' this by remembering state of regular hotplug,
to which I strongly oppose, hotplug is destructive and should be,
there's no point in remembering state that might never be used again.
Worse you temporarily 'break' your cpuset 'promise' to then silently
restore it.
The s/r resume case is special in that userspace isn't actually around
to observe the cpus going away and coming back, also it has the
guarantee the cpus will be coming back.
So s/r is special and should not destroy state, hotplug should.
On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
> > Your change, once merged, will not carry patch 0/5 here, so it would be
> > helpful to understand why we need to distinguish between the two as a
> > stand-alone patch in your changelog.
> >
>
>
> Sorry, I didn't quite get your point. Do you just want me to update this
> changelog or do you still want me to squash this patch with patch 5?
>
I think it should either be folded into patch 5 or the changelog updated
to reflect why we need to distinguish between hotplug and suspend; the
text from patch 0 will not be carried along to git so as a stand-alone
commit in Linus' tree, it's confusing why this is being done.
On Tue, 15 May 2012, Peter Zijlstra wrote:
> > However, if a thread did set_mempolicy(MPOL_BIND, 2-3) where cpuset.mems
> > == node_online_map, cpuset.mems changes to 0-1, then cpuset.mems changes
> > back to node_online_map, then I believe (and implemented in the mempolicy
> > code and added the specification in the man page) that the thread should
> > be bound to nodes 2-3.
>
> I disagree, but alas that is done :-(
>
Not sure what you're disagreeing with, it only happens with
MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES and I've clearly defined the
behavior in the man page. I personally never had a use-case for
MPOL_F_RELATIVE_NODES but Paul Jackson asked that it be added for SGI when
we added mempolicy mode flags.
> But what happens if you unplug nodes 2-3?
>
Mempolicies lack hotplug notifiers entirely so it falls back to no policy
at allocation time, the rebinding of nodes in the context of cpusets only
occurs when cpuset.mems changes, not with hotplug.
> The problem is that if you have some cpusets configuration and then do a
> s/r cycle the entire configuration is wrecked because suspend
> hot-unplugs all but cpu0 and resume re-plugs the cpus.
>
> This destroys all masks and migrates all tasks in sets not including
> cpu0 to the root set.
>
Yup.
> Srivatsa proposed to 'fix' this by remembering state of regular hotplug,
> to which I strongly oppose, hotplug is destructive and should be,
> there's no point in remembering state that might never be used again.
> Worse you temporarily 'break' your cpuset 'promise' to then silently
> restore it.
>
> The s/r resume case is special in that userspace isn't actually around
> to observe the cpus going away and coming back, also it has the
> guarantee the cpus will be coming back.
>
> So s/r is special and should not destroy state, hotplug should.
>
Why can't we leave cpuset.cpus unaltered for all cpusets during suspend?
Having a cpu set in cpuset.cpus does not mean the attached threads are
allowed to run on all of them.
On Tue, 2012-05-15 at 14:05 -0700, David Rientjes wrote:
>
> Why can't we leave cpuset.cpus unaltered for all cpusets during suspend?
We can, that's what Srivatsa is going to make. The only thing I object
to is the !suspend hotplug case.
On Tue, 2012-05-15 at 14:05 -0700, David Rientjes wrote:
> Not sure what you're disagreeing with, it only happens with
> MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES and I've clearly defined the
> behavior in the man page. I personally never had a use-case for
> MPOL_F_RELATIVE_NODES but Paul Jackson asked that it be added for SGI when
> we added mempolicy mode flags.
So what you're saying is that:
task t: set_mempolicy(,mask=2-3);
add t to cpuset A
A.mems = 0-1
A.mems = 0-n
At which point t will again have 2-3, right?
My objection is that you'll have to keep the 2-3 mask around some place
because t->mems_allowed will be wrecked by cpuset actions.
Also, what does it mean to silently return to 2-3 after you've broken
that promise by forcing it to 0-1 using cpusets?
On 05/16/2012 02:38 AM, Peter Zijlstra wrote:
> On Tue, 2012-05-15 at 14:05 -0700, David Rientjes wrote:
>>
>> Why can't we leave cpuset.cpus unaltered for all cpusets during suspend?
>
> We can, that's what Srivatsa is going to make.
And which I did :-) (well, essentially that's what this patchset (v3) did.)
> The only thing I object
> to is the !suspend hotplug case.
>
...which is out of the way now, in this version. This version of the patchset
touches only the suspend/resume case. Nothing else.
Regards,
Srivatsa S. Bhat
On Tue, 15 May 2012, Peter Zijlstra wrote:
> > Why can't we leave cpuset.cpus unaltered for all cpusets during suspend?
>
> We can, that's what Srivatsa is going to make. The only thing I object
> to is the !suspend hotplug case.
>
Srivatsa is going to do that in another patchset in addition to this one?
We shouldn't need to store any old cpumask at all, just allow cpuset.cpus
to be a superset of online cpus during s/r and don't touch cpusets at all
since the cpus, as you said, are guaranteed to come back.
On Wed, 2012-05-16 at 02:51 +0530, Srivatsa S. Bhat wrote:
> And which I did :-) (well, essentially that's what this patchset (v3) did.)
>
Oh right.. :-) I'll try and have a look tomorrow.
On Tue, 15 May 2012, Peter Zijlstra wrote:
> > Not sure what you're disagreeing with, it only happens with
> > MPOL_F_STATIC_NODES or MPOL_F_RELATIVE_NODES and I've clearly defined the
> > behavior in the man page. I personally never had a use-case for
> > MPOL_F_RELATIVE_NODES but Paul Jackson asked that it be added for SGI when
> > we added mempolicy mode flags.
>
> So what you're saying is that:
>
> task t: set_mempolicy(,mask=2-3);
>
> add t to cpuset A
>
> A.mems = 0-1
> A.mems = 0-n
>
> At which point t will again have 2-3, right?
>
Depends on the mode which was omitted from your set_mempolicy().
When A.mems becomes 0-1, the task obviously also has a mempolicy of 0-1;
for MPOL_F_STATIC_NODES, the resulting nodemask of (0-1) & (2-3) is empty
so it falls back to the cpuset placement as required by the documentation,
and without MPOL_F_STATIC_NODES the nodemask is remapped onto the new set
which is also 0-1.
When A.mems is changed to 0-n, the task's mempolicy is either
- 2-3 again (if n >= 3) with MPOL_F_STATIC_NODES, the flag that means we
really want these specific nodes when allowed, or
- 0-1 still without MPOL_F_STATIC_NODES.
> Also, what does it mean to silently return to 2-3 after you've broken
> that promise by forcing it to 0-1 using cpusets?
>
This obviously isn't the typical usecase and I doubt moving tasks to a
cpuset with a disjoint set of allowed nodes ever happens in practice, since
cpusets is for NUMA optimizations and it's being inappropriately moved
around. MPOL_F_STATIC_NODES is typically used when mems are removed from
a cpuset and then subsequently added later. Doing
set_mempolicy(MPOL_INTERLEAVE, 0-3), attaching to cpuset.mems = 0-1, then
cpuset.mems = 0-3, would result in an interleave over 0-1 before
MPOL_F_STATIC_NODES was introduced. With the flag, however, the mems
becomes 0-3 once again.
On 05/16/2012 02:54 AM, David Rientjes wrote:
> On Tue, 15 May 2012, Peter Zijlstra wrote:
>
>>> Why can't we leave cpuset.cpus unaltered for all cpusets during suspend?
>>
>> We can, that's what Srivatsa is going to make. The only thing I object
>> to is the !suspend hotplug case.
>>
>
> Srivatsa is going to do that in another patchset in addition to this one?
Nope. This v3 itself is the full implementation.
> We shouldn't need to store any old cpumask at all, just allow cpuset.cpus
> to be a superset of online cpus during s/r and don't touch cpusets at all
> since the cpus, as you said, are guaranteed to come back.
>
Oh, I am really sorry to say this, but this method has got 'history' ;-)
(Argh, I really should have put pointers to v1 and v2 in patch 0/5...)
What you are suggesting was precisely the v1 of this patchset, which went
upstream as commit 8f2f748b06562 (CPU hotplug, cpusets, suspend: Don't touch
cpusets during suspend/resume).
It got reverted due to a nasty suspend hang in some corner case, where the
sched domains not being up-to-date got the scheduler confused.
Here is the thread with that discussion:
http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289
As Peter suggested, I'll try to fix the issues at the 2 places that I found
where the scheduler gets confused despite the cpu_active mask being up-to-date.
But, I really want to avoid that scheduler fix and this cpuset fix from
being tied together, for the fear that until we root-cause and fix all
scheduler bugs related to cpu_active mask, we can never get cpusets fixed
once and for all for suspend/resume. So, this patchset does an explicit
save and restore to be sure, and so that we don't depend on some other/unknown
factors to make this work reliably.
Regards,
Srivatsa S. Bhat
On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
> What you are suggesting was precisely the v1 of this patchset, which went
> upstream as commit 8f2f748b06562 (CPU hotplug, cpusets, suspend: Don't touch
> cpusets during suspend/resume).
>
> It got reverted due to a nasty suspend hang in some corner case, where the
> sched domains not being up-to-date got the scheduler confused.
> Here is the thread with that discussion:
> http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289
>
> As Peter suggested, I'll try to fix the issues at the 2 places that I found
> where the scheduler gets confused despite the cpu_active mask being up-to-date.
>
> But, I really want to avoid that scheduler fix and this cpuset fix from
> being tied together, for the fear that until we root-cause and fix all
> scheduler bugs related to cpu_active mask, we can never get cpusets fixed
> once and for all for suspend/resume. So, this patchset does an explicit
> save and restore to be sure, and so that we don't depend on some other/unknown
> factors to make this work reliably.
>
Ok, so it seems like this is papering over an existing cpusets issue or an
interaction with the scheduler that is buggy. There's no reason why a
cpuset.cpus that is a superset of cpu_active_mask should cause an issue
since that's exactly what the root cpuset has. I know root is special
cased all over the cpuset code, but I think the real fix here is to figure
out why it can't be left as a superset and then we end up doing nothing
for s/r.
I don't have a preference for cpu hotplug and whether cpuset.cpus = 1-3
remains 1-3 when cpu 2 is offlined or not, I think it could be argued both
ways, but I disagree with saving the cpumask, removing all suspended cpus,
and then reinstating it for no reason.
On 05/16/2012 03:19 AM, David Rientjes wrote:
> On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
>
>> What you are suggesting was precisely the v1 of this patchset, which went
>> upstream as commit 8f2f748b06562 (CPU hotplug, cpusets, suspend: Don't touch
>> cpusets during suspend/resume).
>>
>> It got reverted due to a nasty suspend hang in some corner case, where the
>> sched domains not being up-to-date got the scheduler confused.
>> Here is the thread with that discussion:
>> http://thread.gmane.org/gmane.linux.kernel/1262802/focus=1286289
>>
>> As Peter suggested, I'll try to fix the issues at the 2 places that I found
>> where the scheduler gets confused despite the cpu_active mask being up-to-date.
>>
>> But, I really want to avoid that scheduler fix and this cpuset fix from
>> being tied together, for the fear that until we root-cause and fix all
>> scheduler bugs related to cpu_active mask, we can never get cpusets fixed
>> once and for all for suspend/resume. So, this patchset does an explicit
>> save and restore to be sure, and so that we don't depend on some other/unknown
>> factors to make this work reliably.
>>
>
> Ok, so it seems like this is papering over an existing cpusets issue or an
> interaction with the scheduler that is buggy.
Well, I don't think so. See below.
> There's no reason why a
> cpuset.cpus that is a superset of cpu_active_mask should cause an issue
> since that's exactly what the root cpuset has.
No, even root cpuset is modified during hotplug (and hence during suspend/resume).
If you look at the code, the root cpuset's cpuset.cpus is always exactly equal
to cpu_active_mask (and it is updated during every hotplug, both online and
offline).
> I know root is special
> cased all over the cpuset code, but I think the real fix here is to figure
> out why it can't be left as a superset and then we end up doing nothing
> for s/r.
>
> I don't have a preference for cpu hotplug and whether cpuset.cpus = 1-3
> remains 1-3 when cpu 2 is offlined or not, I think it could be argued both
> ways, but I disagree with saving the cpumask, removing all suspended cpus,
> and then reinstating it for no reason.
>
I think there is a valid reason behind doing that.
Cpusets translates to sched domains in scheduler terms. So whenever you update
cpusets, the sched domains are updated. IOW, if you don't touch cpusets during
hotplug (suspend/resume case), you allow them to have offline cpus, meaning,
you allow sched domains to have offline cpus. Hence sched domains are rendered
stale.
Now, from my point of view, keeping the sched domains stale and expecting
the scheduler to work flawlessly doesn't seem right.
However, the cpu_active_mask was introduced to handle situations where hotplug
transition is still in progress, and the scheduler needs to take appropriate
decisions even with some of its data-structures in an inconsistent/stale state.
But once the hotplug operation is completed, the scheduler doesn't need to
depend on cpu_active_mask.
Now coming to the suspend/resume/cpuset case, what you are suggesting will
essentially boil down to keeping the sched domains stale not just _during_
hotplug transitions, but also after that, in fact across multiple hotplug
operations in succession! So, from my point of view, the real bug is keeping
the sched domains stale for so long, across hotplug operations.
(And on those lines, making the scheduler work correctly even in such cases
is only a good-to-have as a robustness measure and not a "bugfix".)
So to avoid keeping the sched domains stale during suspend/resume, we need
to update cpusets for every hotplug as necessary, and this means we need to do
an explicit save and restore.
Regards,
Srivatsa S. Bhat
On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
> > I know root is special
> > cased all over the cpuset code, but I think the real fix here is to figure
> > out why it can't be left as a superset and then we end up doing nothing
> > for s/r.
> >
> > I don't have a preference for cpu hotplug and whether cpuset.cpus = 1-3
> > remains 1-3 when cpu 2 is offlined or not, I think it could be argued both
> > ways, but I disagree with saving the cpumask, removing all suspended cpus,
> > and then reinstating it for no reason.
> >
>
> I think there is a valid reason behind doing that.
>
> Cpusets translates to sched domains in scheduler terms. So whenever you update
> cpusets, the sched domains are updated. IOW, if you don't touch cpusets during
> hotplug (suspend/resume case), you allow them to have offline cpus, meaning,
> you allow sched domains to have offline cpus. Hence sched domains are rendered
> stale.
>
It's not possible to update the sched domains for s/r to be a subset of
cpuset.cpus? It would be the same situation for a thread using
sched_setaffinity() while bound to a cpuset with a superset of allowed
nodes. If you do that, there's no reason to alter cpuset.cpus at all and
you don't need to carry another cpumask around for each cpuset.
On 05/16/2012 04:02 AM, David Rientjes wrote:
> On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
>
>>> I know root is special
>>> cased all over the cpuset code, but I think the real fix here is to figure
>>> out why it can't be left as a superset and then we end up doing nothing
>>> for s/r.
>>>
>>> I don't have a preference for cpu hotplug and whether cpuset.cpus = 1-3
>>> remains 1-3 when cpu 2 is offlined or not, I think it could be argued both
>>> ways, but I disagree with saving the cpumask, removing all suspended cpus,
>>> and then reinstating it for no reason.
>>>
>>
>> I think there is a valid reason behind doing that.
>>
>> Cpusets translates to sched domains in scheduler terms. So whenever you update
>> cpusets, the sched domains are updated. IOW, if you don't touch cpusets during
>> hotplug (suspend/resume case), you allow them to have offline cpus, meaning,
>> you allow sched domains to have offline cpus. Hence sched domains are rendered
>> stale.
>>
>
> It's not possible to update the sched domains for s/r to be a subset of
> cpuset.cpus?
Subset? See below..
(Btw, the above statement reminds me of a different idea I had long back
which I will write about in a separate mail.)
> It would be the same situation for a thread using
> sched_setaffinity() while bound to a cpuset with a superset of allowed
> nodes.
First of all, sched domains are built by looking at the cpusets' ->cpus_allowed
mask, not individual task's ->cpus_allowed mask. So we would gain nothing by
altering individual task's ->cpus_allowed mask, like what sched_setaffinity()
does.
On top of that, the "subset" argument wouldn't hold good in the s/r case.
sched_setaffinity() tries its best to keep the ->cpus_allowed mask of a task
as a subset of the ->cpus_allowed mask of the cpuset it belongs to.
But with s/r, that's not the case - it can very well become a disjoint set.
Consider a cpuset having cpuset.cpus = 1. What happens during suspend/resume
then? Going by your suggestion, the tasks in that cpuset will have
->cpus_allowed = 0,2-3 or some other combination not having cpu 1 when cpu 1
gets offlined. And it will keep getting changed into other things depending
on which phase of suspend/resume we are in.
IOW, ->cpus_allowed of the cpuset and ->cpus_allowed of the tasks belonging
to the cpuset can go totally out-of-sync, with no relationship like
subset/superset being preserved between them. Which is not the case with
sched_setaffinity(), where we always try to maintain a superset-subset
relationship between the two.
And in any case, altering individual task's ->cpus_allowed wouldn't buy us
anything, as I mentioned above.
> If you do that, there's no reason to alter cpuset.cpus at all and
> you don't need to carry another cpumask around for each cpuset.
>
Regards,
Srivatsa S. Bhat
On 05/16/2012 01:50 PM, Srivatsa S. Bhat wrote:
> On 05/16/2012 04:02 AM, David Rientjes wrote:
>
>> On Wed, 16 May 2012, Srivatsa S. Bhat wrote:
>>
>>>> I know root is special
>>>> cased all over the cpuset code, but I think the real fix here is to figure
>>>> out why it can't be left as a superset and then we end up doing nothing
>>>> for s/r.
>>>>
>>>> I don't have a preference for cpu hotplug and whether cpuset.cpus = 1-3
>>>> remains 1-3 when cpu 2 is offlined or not, I think it could be argued both
>>>> ways, but I disagree with saving the cpumask, removing all suspended cpus,
>>>> and then reinstating it for no reason.
>>>>
>>>
>>> I think there is a valid reason behind doing that.
>>>
>>> Cpusets translates to sched domains in scheduler terms. So whenever you update
>>> cpusets, the sched domains are updated. IOW, if you don't touch cpusets during
>>> hotplug (suspend/resume case), you allow them to have offline cpus, meaning,
>>> you allow sched domains to have offline cpus. Hence sched domains are rendered
>>> stale.
>>>
>>
>> It's not possible to update the sched domains for s/r to be a subset of
>> cpuset.cpus?
>
>
> (Btw, the above statement reminds me of a different idea I had long back
> which I will write about in a separate mail.)
>
You suggested keeping sched domains updated during s/r without altering cpuset.cpus.
That is a very good point!
Because we will then be distinguishing between the 2 distinct reasons why
sched domains can be "stale", one of which is troublesome but the other is
harmless:
1. offline cpus are included in some sched domains, and some offline cpus have
a non-NULL sched domain pointer. This is the problematic situation.
2. sched domains don't reflect the cpuset configurations set up in cpuset.cpus of
different cpusets. This is not really harmful, because if this happens only during
s/r, the userspace wouldn't really notice it, as long as we reinstate the
cpuset<->sched domain dependency properly at the end of resume.
So you are suggesting implementing point #2, where we keep the sched domains updated
(partially, at least in a way that is not harmful), so that we avoid the problem in
#1.
I had written a patch for this long ago:
http://thread.gmane.org/gmane.linux.kernel/1250097/focus=1254715
The idea there was to create a single sched domain at the beginning of suspend,
temporarily ignoring cpuset configurations, and reinstating the proper sched domain
tree taking cpusets into consideration, at the end of resume. That way we need not
touch cpusets during s/r, we need not explicitly save/restore cpusets, and we still
manage to keep the scheduler sane and happy. And the frozen userspace cannot observe
the temporary mismatch between cpusets<->sched domains. So no problems there too.
IMHO, the only reason we didn't finalize on that patch earlier was because the version
in commit 8f2f748b06562 looked much simpler (and at that point, we had no clue that
the latter would lead to suspend hangs).
So, now, we can either go with the design in this v3 (explicit save/restore) or the
one in the link shown above (temporary cpuset<->sched domain mismatch during s/r).
Not sure what Peter has to say about the latter though... He might have reservations
about it, I don't know ;-)
Regards,
Srivatsa S. Bhat
On Wed, 2012-05-16 at 03:46 +0530, Srivatsa S. Bhat wrote:
>
> However, the cpu_active_mask was introduced to handle situations where hotplug
> transition is still in progress, and the scheduler needs to take appropriate
> decisions even with some of its data-structures in an inconsistent/stale state.
> But once the hotplug operation is completed, the scheduler doesn't need to
> depend on cpu_active_mask.
> (And on those lines, making the scheduler work correctly even in such cases
> is only a good-to-have as a robustness measure and not a "bugfix".)
I think those 2(?) cases you found not covered by active mask could
actually happen during a hotplug. So far they simply haven't.
On 05/17/2012 02:54 AM, Peter Zijlstra wrote:
> On Wed, 2012-05-16 at 03:46 +0530, Srivatsa S. Bhat wrote:
>>
>> However, the cpu_active_mask was introduced to handle situations where hotplug
>> transition is still in progress, and the scheduler needs to take appropriate
>> decisions even with some of its data-structures in an inconsistent/stale state.
>> But once the hotplug operation is completed, the scheduler doesn't need to
>> depend on cpu_active_mask.
>
>> (And on those lines, making the scheduler work correctly even in such cases
>> is only a good-to-have as a robustness measure and not a "bugfix".)
>
> I think those 2(?) cases you found not covered by active mask could
> actually happen during a hotplug. So far they simply haven't.
>
Yes, it could happen during a hotplug too. And its on my todo list to fix.
(But the point I was trying to make was that keeping the sched domains outdated
across multiple hotplug operations isn't really correct.)
Anyway, I think this is the state where the discussion stands atm:
1. We are not going to alter how cpusets are handled during regular cpu hotplug.
We will do a suspend/resume-only fix for cpusets.
David Rientjes said that this can probably be argued about endlessly, but he
has no objections to doing it this way.
2. David Rientjes pointed out that explicit save and restore for suspend/resume
needs a new per-cpuset cpu mask and he would prefer a design where we wouldn't
need a new per-cpuset mask.
Such a design is possible, by building a single sched domain (ignoring cpuset
configurations, by using partition_sched_domains(1, NULL, NULL)) throughout
suspend/resume cpu hotplug operations, and then restoring the original sched
domain tree by looking up the cpusets, towards end of resume (last online
operation).
The frozen userspace won't notice any of this.
3. David had some comments/suggestions about some patches in this series.
So, I'll go with the design mentioned in 2, and address 3 and spin out a new
version.
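In kernel terms, point 2 above would look roughly like the following. This
is pseudocode only, not the actual patch: all_cpus_back_online() is a
hypothetical helper standing in for whatever check the last resume-path
online would use, while partition_sched_domains() (mentioned above) and the
cpuset-driven domain rebuild are the existing mechanisms.

```c
/* Pseudocode sketch of the design in point 2 above. */
static void cpuset_suspend_cpu_offline(void)
{
	/*
	 * First offline on the suspend path: collapse everything into a
	 * single sched domain spanning the active cpus, deliberately
	 * ignoring the (untouched) cpuset configuration.
	 */
	partition_sched_domains(1, NULL, NULL);
}

static void cpuset_resume_cpu_online(void)
{
	/*
	 * Last online on the resume path: rebuild the full sched domain
	 * tree from the cpuset hierarchy, restoring the cpuset <->
	 * sched domain correspondence before userspace is thawed.
	 */
	if (all_cpus_back_online())
		rebuild_sched_domains();
}
```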
Regards,
Srivatsa S. Bhat