v9:
- Add a new patch 1 to remove the child cpuset restriction on parent's
"cpuset.cpus".
- Relax initial root partition entry limitation to allow cpuset.cpus to
overlap that of parent's.
- An "isolated invalid" displayed type is added to
cpuset.cpus.partition.
- Resetting partition root to "member" will leave child partition root
as invalid.
- Update documentation and test accordingly.
v8:
- Reorganize the patch series and rationalize the features and
constraints of a partition.
- Update patch descriptions and documentation accordingly.
v7:
- Simplify the documentation patch (patch 5) as suggested by Tejun.
- Fix a typo in patch 2 and improper commit log in patch 3.
This patchset includes one bug fix and four enhancements to the cpuset v2 code.
Patch 1: Allow parent to set "cpuset.cpus" that may not be a superset
of children's "cpuset.cpus" for default hierarchy.
Patch 2: Enable partition with no task to have empty cpuset.cpus.effective.
Patch 3: Refining the features and constraints of a cpuset partition
clarifying what changes are allowed.
Patch 4: Add a new partition state "isolated" to create a partition
root without load balancing. This is for handling intermitten workloads
that have a strict low latency requirement.
Patch 5: Enable the "cpuset.cpus.partition" file to show the reason
that causes invalid partition like "root invalid (No cpu available
due to hotplug)".
Patch 6 updates the cgroup-v2.rst file accordingly. Patch 7 adds a new
cpuset test to test the new cpuset partition code.
Waiman Long (7):
cgroup/cpuset: Don't let child cpusets restrict parent in default
hierarchy
cgroup/cpuset: Allow no-task partition to have empty
cpuset.cpus.effective
cgroup/cpuset: Refining features and constraints of a partition
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Update description of cpuset.cpus.partition in
cgroup-v2.rst
kselftest/cgroup: Add cpuset v2 partition root state test
Documentation/admin-guide/cgroup-v2.rst | 168 +++--
kernel/cgroup/cpuset.c | 440 +++++++-----
tools/testing/selftests/cgroup/Makefile | 5 +-
.../selftests/cgroup/test_cpuset_prs.sh | 667 ++++++++++++++++++
tools/testing/selftests/cgroup/wait_inotify.c | 87 +++
5 files changed, 1142 insertions(+), 225 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
--
2.27.0
In validate_change(), there is a check since v2.6.12 to make sure that
each of the child cpusets must be a subset of a parent cpuset. IOW, it
allows child cpusets to restrict what changes can be made to a parent's
"cpuset.cpus". This actually violates one of the core principles of the
default hierarchy where a cgroup higher up in the hierarchy should be
able to change configuration however it sees fit as deligation breaks
down otherwise.
To address this issue, the check is now removed for the default hierarchy
to free parent cpusets from being restricted by child cpusets. The
check will still apply for legacy hierarchy.
Suggested-by: Tejun Heo <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 14 +++-----------
1 file changed, 3 insertions(+), 11 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d0e163a02099..0dd7d853ed17 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -616,19 +616,11 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
struct cpuset *c, *par;
int ret;
- rcu_read_lock();
-
- /* Each of our child cpusets must be a subset of us */
- ret = -EBUSY;
- cpuset_for_each_child(c, css, cur)
- if (!is_cpuset_subset(c, trial))
- goto out;
-
- /* Remaining checks don't apply to root cpuset */
- ret = 0;
+ /* The checks don't apply to root cpuset */
if (cur == &top_cpuset)
- goto out;
+ return 0;
+ rcu_read_lock();
par = parent_cs(cur);
/* On legacy hierarchy, we must be a subset of our parent cpuset. */
--
2.27.0
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its CPUs
to child partitions with no CPUs left. However in most cases, there
shouldn't be any tasks associated with intermediate nodes of the default
hierarchy. So the current rule is too restrictive and can waste valuable
CPU resource.
To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Therefore, a parent
partition with no task can now have all its CPUs distributed out to its
child partitions. The top cpuset always have some house-keeping tasks
running and so its list of effective cpu can't never be empty.
Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 113 +++++++++++++++++++++++++++++++----------
1 file changed, 85 insertions(+), 28 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0dd7d853ed17..dfa15677845e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -404,6 +404,41 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+ struct cpuset *excluded_child)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *child;
+
+ if (cs->css.cgroup->nr_populated_csets)
+ return true;
+ if (!excluded_child && !cs->nr_subparts_cpus)
+ return cgroup_is_populated(cs->css.cgroup);
+
+ rcu_read_lock();
+ cpuset_for_each_child(child, css, cs) {
+ if (child == excluded_child)
+ continue;
+ if (is_partition_root(child))
+ continue;
+ if (cgroup_is_populated(child->css.cgroup)) {
+ rcu_read_unlock();
+ return true;
+ }
+ }
+ rcu_read_unlock();
+ return false;
+}
+
/*
* Return in pmask the portion of a task's cpusets's cpus_allowed that
* are online and are capable of running the task. If none are found,
@@ -1208,22 +1243,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
return -EBUSY;
- /*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus or at least one
- * CPU will be left after that.
- */
- if ((cmd == partcmd_enable) &&
- (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
- cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
- return -EINVAL;
-
- /*
- * A cpumask update cannot make parent's effective_cpus become empty.
- */
adding = deleting = false;
old_prs = new_prs = cpuset->partition_root_state;
if (cmd == partcmd_enable) {
+ /*
+ * Enabling partition root is not allowed if not all the CPUs
+ * can be granted from parent's effective_cpus.
+ */
+ if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
+ return -EINVAL;
+
+ /*
+ * A parent can be left with no CPU as long as there is no
+ * task directly associated with the parent partition. For
+ * such a parent, no new task can be moved into it.
+ */
+ if (partition_is_populated(parent, cpuset) &&
+ cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
+ return -EINVAL;
+
cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
@@ -1245,9 +1283,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty.
+ * Return error if the new effective_cpus could become empty
+ * and there are tasks in the parent.
*/
- if (adding &&
+ if (adding && partition_is_populated(parent, cpuset) &&
cpumask_equal(parent->effective_cpus, tmp->addmask)) {
if (!deleting)
return -EINVAL;
@@ -1273,8 +1312,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
*/
adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask,
- parent->effective_cpus);
+ part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+ partition_is_populated(parent, cpuset);
}
if (cmd == partcmd_update) {
@@ -1376,9 +1415,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
/*
* If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some CPUs.
+ * parent, which is guaranteed to have some CPUs unless
+ * it is a partition root that has explicitly distributed
+ * out all its CPUs.
*/
if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+ if (is_partition_root(cp) &&
+ cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+ goto update_parent_subparts;
+
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
if (!cp->use_parent_ecpus) {
cp->use_parent_ecpus = true;
@@ -1400,6 +1445,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
}
+update_parent_subparts:
/*
* update_parent_subparts_cpumask() should have been called
* for cs already in update_cpumask(). We should also call
@@ -2201,6 +2247,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;
+ /*
+ * On default hierarchy, task cannot be moved to a cpuset with empty
+ * effective cpus.
+ */
+ if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+ goto out_unlock;
+
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->cpus_allowed);
if (ret)
@@ -3065,7 +3118,8 @@ hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated)
{
- if (cpumask_empty(new_cpus))
+ /* A partition root is allowed to have empty effective cpus */
+ if (cpumask_empty(new_cpus) && !is_partition_root(cs))
cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems;
@@ -3134,11 +3188,12 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/*
* In the unlikely event that a partition root has empty
- * effective_cpus or its parent becomes erroneous, we have to
- * transition it to the erroneous state.
+ * effective_cpus with tasks or its parent becomes erroneous, we
+ * have to transition it to the erroneous state.
*/
- if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
- (parent->partition_root_state == PRS_ERROR))) {
+ if (is_partition_root(cs) &&
+ ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
+ (parent->partition_root_state == PRS_ERROR))) {
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
@@ -3148,13 +3203,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
}
/*
- * If the effective_cpus is empty because the child
- * partitions take away all the CPUs, we can keep
- * the current partition and let the child partitions
- * fight for available CPUs.
+ * Force the partition to become invalid if either one of
+ * the following conditions hold:
+ * 1) empty effective cpus but not valid empty partition.
+ * 2) parent is invalid or doesn't grant any cpus to child
+ * partitions.
*/
if ((parent->partition_root_state == PRS_ERROR) ||
- cpumask_empty(&new_cpus)) {
+ (cpumask_empty(&new_cpus) &&
+ partition_is_populated(cs, NULL))) {
int old_prs;
update_parent_subparts_cpumask(cs, partcmd_disable,
--
2.27.0
In order to plan for proper partitioning of a system, a valid partition
must be using only CPUs designated in "cpuset.cpus" of the partition
root. It can be a subset, but it can't use CPUs outside of its designed
list. If none of the CPUs in the desginated list can be granted to the
partition, its "cpuset.cpus.effective" becomes empty. This is allowed
as long as there is no task in the partition.
To ease implementation, there are additional constraints in enabling a
partition root.
1) The "cpuset.cpus" is non-empty and exclusive.
2) The parent cgroup is a valid partition root.
3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".
4) There is no child cgroups with cpuset enabled.
This allows offlined cpus in parent's "cpuset.cpus" to be granted to
a child partition which can have empty "cpuset.cpus.effective" when it
has no task.
The cpuset's subparts_cpus keeps track of CPUs (including offline CPUs)
that are allocated to child partitions. It does not change during
hotplug operations.
Once a partition root has been enabled, changes to "cpuset.cpus" is
generally allowed as long as the cpu list is exclusive and non-empty.
A partition will become invalid when one or more of the following
constraints are violated:
1) The parent cgroup is a valid partition root.
2) "cpuset.cpus.effective" is a subset of "cpuset.cpus"
3) "cpuset.cpus.effective" is non-empty when there are tasks
in the partition.
Disabling a partition root is always allowed even if there are child
partitions underneath it. In this case, all the child partitions become
invalid and unrecoverable. So care must be taken to double check if
there are child partitions underneath it before disabling a partition.
This patch makes the necessary change to support the above features
and constraints.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 253 ++++++++++++++++++++---------------------
1 file changed, 121 insertions(+), 132 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index dfa15677845e..c877eecf4c5e 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1237,29 +1237,27 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
return -EINVAL;
/*
- * Enabling/disabling partition root is not allowed if there are
- * online children.
+ * Enabling partition root is not allowed if there are online children.
*/
- if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+ if ((cmd == partcmd_enable) && css_has_online_children(&cpuset->css))
return -EBUSY;
adding = deleting = false;
old_prs = new_prs = cpuset->partition_root_state;
if (cmd == partcmd_enable) {
/*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus.
+ * Enabling partition root is not allowed if cpus_allowed
+ * doesn't overlap parent's cpus_allowed.
*/
- if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
+ if (!cpumask_intersects(cpuset->cpus_allowed, parent->cpus_allowed))
return -EINVAL;
/*
* A parent can be left with no CPU as long as there is no
- * task directly associated with the parent partition. For
- * such a parent, no new task can be moved into it.
+ * task directly associated with the parent partition.
*/
if (partition_is_populated(parent, cpuset) &&
- cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
+ cpumask_subset(parent->effective_cpus, cpuset->cpus_allowed))
return -EINVAL;
cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
@@ -1271,54 +1269,52 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
/*
* partcmd_update with newmask:
*
+ * Compute add/delete mask to/from subparts_cpus
+ *
* delmask = cpus_allowed & ~newmask & parent->subparts_cpus
- * addmask = newmask & parent->effective_cpus
+ * addmask = newmask & parent->cpus_allowed
* & ~parent->subparts_cpus
*/
cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
deleting = cpumask_and(tmp->delmask, tmp->delmask,
parent->subparts_cpus);
- cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+ cpumask_and(tmp->addmask, newmask, parent->cpus_allowed);
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty
- * and there are tasks in the parent.
+ * Make partition invalid if parent's effective_cpus could
+ * become empty and there are tasks in the parent.
*/
- if (adding && partition_is_populated(parent, cpuset) &&
- cpumask_equal(parent->effective_cpus, tmp->addmask)) {
- if (!deleting)
- return -EINVAL;
- /*
- * As some of the CPUs in subparts_cpus might have
- * been offlined, we need to compute the real delmask
- * to confirm that.
- */
- if (!cpumask_and(tmp->addmask, tmp->delmask,
- cpu_active_mask))
- return -EINVAL;
- cpumask_copy(tmp->addmask, parent->effective_cpus);
- }
+ part_error = partition_is_populated(parent, cpuset) &&
+ cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+ !cpumask_intersects(tmp->delmask, cpu_active_mask);
} else {
/*
* partcmd_update w/o newmask:
*
* addmask = cpus_allowed & parent->effective_cpus
*
- * Note that parent's subparts_cpus may have been
- * pre-shrunk in case there is a change in the cpu list.
- * So no deletion is needed.
+ * This gets invoked either due to a hotplug event or
+ * from update_cpumasks_hier() where we can't return an
+ * error. This can cause a partition root to become invalid
+ * in the case of a hotplug.
+ *
+ * A partition error happens when:
+ * 1) Cpuset is valid partition, but parent does not distribute
+ * out any CPUs.
+ * 2) Parent has tasks and all its effective CPUs will have
+ * to be distributed out.
*/
adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
- partition_is_populated(parent, cpuset);
+ part_error = (is_partition_root(cpuset) &&
+ !parent->nr_subparts_cpus) ||
+ (cpumask_equal(parent->effective_cpus, tmp->addmask) &&
+ partition_is_populated(parent, cpuset));
}
if (cmd == partcmd_update) {
- int prev_prs = cpuset->partition_root_state;
-
/*
* Check for possible transition between PRS_ENABLED
* and PRS_ERROR.
@@ -1333,13 +1329,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
new_prs = PRS_ENABLED;
break;
}
- /*
- * Set part_error if previously in invalid state.
- */
- part_error = (prev_prs == PRS_ERROR);
}
- if (!part_error && (new_prs == PRS_ERROR))
+ if ((old_prs == PRS_ERROR) && (new_prs == PRS_ERROR))
return 0; /* Nothing need to be done */
if (new_prs == PRS_ERROR) {
@@ -1392,6 +1384,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
* update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
* @cs: the cpuset to consider
* @tmp: temp variables for calculating effective_cpus & partition setup
+ * @force: don't skip any descendant cpusets if set
*
* When configured cpumask is changed, the effective cpumasks of this cpuset
* and all its descendants need to be updated.
@@ -1400,7 +1393,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
*
* Called with cpuset_rwsem held
*/
-static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
+ bool force)
{
struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
@@ -1410,6 +1404,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
+ bool update_parent = false;
compute_effective_cpumask(tmp->new_cpus, cp, parent);
@@ -1437,9 +1432,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
/*
* Skip the whole subtree if the cpumask remains the same
- * and has no partition root state.
+ * with no partition root state and force flag not set.
*/
- if (!cp->partition_root_state &&
+ if (!cp->partition_root_state && !force &&
cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
@@ -1455,34 +1450,16 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
old_prs = new_prs = cp->partition_root_state;
if ((cp != cs) && old_prs) {
switch (parent->partition_root_state) {
- case PRS_DISABLED:
- /*
- * If parent is not a partition root or an
- * invalid partition root, clear its state
- * and its CS_CPU_EXCLUSIVE flag.
- */
- WARN_ON_ONCE(cp->partition_root_state
- != PRS_ERROR);
- new_prs = PRS_DISABLED;
-
- /*
- * clear_bit() is an atomic operation and
- * readers aren't interested in the state
- * of CS_CPU_EXCLUSIVE anyway. So we can
- * just update the flag without holding
- * the callback_lock.
- */
- clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
- break;
-
case PRS_ENABLED:
- if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
- update_tasks_cpumask(parent);
+ update_parent = true;
break;
+ case PRS_DISABLED:
case PRS_ERROR:
/*
- * When parent is invalid, it has to be too.
+ * When parent is not a partition root or is
+ * invalid, child partition roots become
+ * invalid too.
*/
new_prs = PRS_ERROR;
break;
@@ -1493,40 +1470,41 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
rcu_read_unlock();
+ if (update_parent) {
+ if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
+ update_tasks_cpumask(parent);
+ /*
+ * The cpuset partition_root_state may be changed
+ * to PRS_ERROR. Capture it.
+ */
+ new_prs = cp->partition_root_state;
+ }
+
spin_lock_irq(&callback_lock);
- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
- if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+ if (cp->nr_subparts_cpus && (new_prs <= 0)) {
+ /*
+ * Put all active subparts_cpus back to effective_cpus.
+ */
+ cpumask_or(tmp->new_cpus, tmp->new_cpus,
+ cp->subparts_cpus);
+ cpumask_and(tmp->new_cpus, tmp->new_cpus,
+ cpu_active_mask);
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
- } else if (cp->nr_subparts_cpus) {
+ }
+
+ cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+ if (cp->nr_subparts_cpus) {
/*
* Make sure that effective_cpus & subparts_cpus
* are mutually exclusive.
- *
- * In the unlikely event that effective_cpus
- * becomes empty. we clear cp->nr_subparts_cpus and
- * let its child partition roots to compete for
- * CPUs again.
*/
cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
cp->subparts_cpus);
- if (cpumask_empty(cp->effective_cpus)) {
- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
- cpumask_clear(cp->subparts_cpus);
- cp->nr_subparts_cpus = 0;
- } else if (!cpumask_subset(cp->subparts_cpus,
- tmp->new_cpus)) {
- cpumask_andnot(cp->subparts_cpus,
- cp->subparts_cpus, tmp->new_cpus);
- cp->nr_subparts_cpus
- = cpumask_weight(cp->subparts_cpus);
- }
}
- if (new_prs != old_prs)
- cp->partition_root_state = new_prs;
-
+ cp->partition_root_state = new_prs;
spin_unlock_irq(&callback_lock);
notify_partition_change(cp, old_prs, new_prs);
@@ -1580,7 +1558,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
if (!sibling->use_parent_ecpus)
continue;
- update_cpumasks_hier(sibling, tmp);
+ update_cpumasks_hier(sibling, tmp, false);
}
rcu_read_unlock();
}
@@ -1653,13 +1631,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
* Make sure that subparts_cpus is a subset of cpus_allowed.
*/
if (cs->nr_subparts_cpus) {
- cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
- cs->cpus_allowed);
+ cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+ cs->cpus_allowed);
cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
}
spin_unlock_irq(&callback_lock);
- update_cpumasks_hier(cs, &tmp);
+ update_cpumasks_hier(cs, &tmp, false);
if (cs->partition_root_state) {
struct cpuset *parent = parent_cs(cs);
@@ -2083,20 +2061,23 @@ static int update_prstate(struct cpuset *cs, int new_prs)
}
} else {
/*
- * Turning off partition root will clear the
- * CS_CPU_EXCLUSIVE bit.
+ * Switch back to member is always allowed even if it
+ * disables child partitions.
*/
- if (old_prs == PRS_ERROR) {
- update_flag(CS_CPU_EXCLUSIVE, cs, 0);
- err = 0;
- goto out;
+ err = 0;
+ update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+ &tmpmask);
+ /*
+ * If there are child partitions, they will all become invalid.
+ */
+ if (unlikely(cs->nr_subparts_cpus)) {
+ spin_lock_irq(&callback_lock);
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ compute_effective_cpumask(cs->effective_cpus, cs, parent);
+ spin_unlock_irq(&callback_lock);
}
- err = update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, &tmpmask);
- if (err)
- goto out;
-
/* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
}
@@ -2117,6 +2098,11 @@ static int update_prstate(struct cpuset *cs, int new_prs)
spin_lock_irq(&callback_lock);
cs->partition_root_state = new_prs;
spin_unlock_irq(&callback_lock);
+ /*
+ * Update child cpusets when disabling partition.
+ */
+ if (new_prs == PRS_DISABLED && !list_empty(&cs->css.children))
+ update_cpumasks_hier(cs, &tmpmask, true);
notify_partition_change(cs, old_prs, new_prs);
}
@@ -3188,12 +3174,32 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/*
* In the unlikely event that a partition root has empty
- * effective_cpus with tasks or its parent becomes erroneous, we
- * have to transition it to the erroneous state.
+ * effective_cpus with tasks, we will have to invalidate child
+ * partitions, if present, by setting nr_subparts_cpus to 0 to
+ * reclaim their cpus.
+ */
+ if (is_partition_root(cs) && cpumask_empty(&new_cpus) &&
+ cs->nr_subparts_cpus && partition_is_populated(cs, NULL)) {
+ spin_lock_irq(&callback_lock);
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ spin_unlock_irq(&callback_lock);
+ compute_effective_cpumask(&new_cpus, cs, parent);
+ }
+
+ /*
+ * Force the partition to become invalid if either one of
+ * the following conditions hold:
+ * 1) empty effective cpus with tasks in partition
+ * 2) parent is invalid or doesn't grant any cpus to child partitions.
*/
if (is_partition_root(cs) &&
((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
- (parent->partition_root_state == PRS_ERROR))) {
+ !parent->nr_subparts_cpus)) {
+ int old_prs;
+
+ update_parent_subparts_cpumask(cs, partcmd_disable,
+ NULL, tmp);
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
@@ -3202,40 +3208,23 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
compute_effective_cpumask(&new_cpus, cs, parent);
}
- /*
- * Force the partition to become invalid if either one of
- * the following conditions hold:
- * 1) empty effective cpus but not valid empty partition.
- * 2) parent is invalid or doesn't grant any cpus to child
- * partitions.
- */
- if ((parent->partition_root_state == PRS_ERROR) ||
- (cpumask_empty(&new_cpus) &&
- partition_is_populated(cs, NULL))) {
- int old_prs;
-
- update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, tmp);
- old_prs = cs->partition_root_state;
- if (old_prs != PRS_ERROR) {
- spin_lock_irq(&callback_lock);
- cs->partition_root_state = PRS_ERROR;
- spin_unlock_irq(&callback_lock);
- notify_partition_change(cs, old_prs, PRS_ERROR);
- }
+ old_prs = cs->partition_root_state;
+ if (old_prs != PRS_ERROR) {
+ spin_lock_irq(&callback_lock);
+ cs->partition_root_state = PRS_ERROR;
+ spin_unlock_irq(&callback_lock);
+ notify_partition_change(cs, old_prs, PRS_ERROR);
}
cpuset_force_rebuild();
}
/*
* On the other hand, an erroneous partition root may be transitioned
- * back to a regular one or a partition root with no CPU allocated
- * from the parent may change to erroneous.
+ * back to a regular one.
*/
- if (is_partition_root(parent) &&
- ((cs->partition_root_state == PRS_ERROR) ||
- !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
- update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+ else if (is_partition_root(parent) &&
+ (cs->partition_root_state == PRS_ERROR) &&
+ update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
cpuset_force_rebuild();
update_tasks:
--
2.27.0
Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.
For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.
When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.
To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of isolated partition to the current need
of the workload without rebooting the system.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 50 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 45 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index c877eecf4c5e..cfab10911682 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -179,6 +179,8 @@ struct cpuset {
*
* 1 - partition root
*
+ * 2 - partition root without load balancing (isolated)
+ *
* -1 - invalid partition root
* None of the cpus in cpus_allowed can be put into the parent's
* subparts_cpus. In this case, the cpuset is not a real partition
@@ -188,6 +190,7 @@ struct cpuset {
*/
#define PRS_DISABLED 0
#define PRS_ENABLED 1
+#define PRS_ISOLATED 2
#define PRS_ERROR -1
/*
@@ -1316,17 +1319,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
if (cmd == partcmd_update) {
/*
- * Check for possible transition between PRS_ENABLED
- * and PRS_ERROR.
+ * Check for possible transition between PRS_ERROR and
+ * PRS_ENABLED/PRS_ISOLATED.
*/
switch (cpuset->partition_root_state) {
case PRS_ENABLED:
+ case PRS_ISOLATED:
if (part_error)
new_prs = PRS_ERROR;
break;
case PRS_ERROR:
- if (!part_error)
+ if (part_error)
+ break;
+ if (is_sched_load_balance(cpuset))
new_prs = PRS_ENABLED;
+ else
+ new_prs = PRS_ISOLATED;
break;
}
}
@@ -1451,6 +1459,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
if ((cp != cs) && old_prs) {
switch (parent->partition_root_state) {
case PRS_ENABLED:
+ case PRS_ISOLATED:
update_parent = true;
break;
@@ -2023,6 +2032,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
static int update_prstate(struct cpuset *cs, int new_prs)
{
int err, old_prs = cs->partition_root_state;
+ bool sched_domain_rebuilt = false;
struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmpmask;
@@ -2059,6 +2069,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
goto out;
}
+
+ if (new_prs == PRS_ISOLATED) {
+ /*
+ * Disable the load balance flag should not return an
+ * error unless the system is running out of memory.
+ */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+ sched_domain_rebuilt = true;
+ }
+ } else if (old_prs && new_prs) {
+ /*
+ * A change in load balance state only, no change in cpumasks.
+ */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+ err = 0;
+ goto out; /* Sched domain is rebuilt in update_flag() */
} else {
/*
* Switch back to member is always allowed even if it
@@ -2080,6 +2106,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)
/* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+ if (!is_sched_load_balance(cs)) {
+ /* Make sure load balance is on */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+ sched_domain_rebuilt = true;
+ }
}
/*
@@ -2092,7 +2124,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
if (parent->child_ecpus_count)
update_sibling_cpumasks(parent, cs, &tmpmask);
- rebuild_sched_domains_locked();
+ if (!sched_domain_rebuilt)
+ rebuild_sched_domains_locked();
out:
if (!err) {
spin_lock_irq(&callback_lock);
@@ -2599,16 +2632,21 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
static int sched_partition_show(struct seq_file *seq, void *v)
{
struct cpuset *cs = css_cs(seq_css(seq));
+ const char *type;
switch (cs->partition_root_state) {
case PRS_ENABLED:
seq_puts(seq, "root\n");
break;
+ case PRS_ISOLATED:
+ seq_puts(seq, "isolated\n");
+ break;
case PRS_DISABLED:
seq_puts(seq, "member\n");
break;
case PRS_ERROR:
- seq_puts(seq, "root invalid\n");
+ type = is_sched_load_balance(cs) ? "root" : "isolated";
+ seq_printf(seq, "%s invalid\n", type);
break;
}
return 0;
@@ -2630,6 +2668,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
val = PRS_ENABLED;
else if (!strcmp(buf, "member"))
val = PRS_DISABLED;
+ else if (!strcmp(buf, "isolated"))
+ val = PRS_ISOLATED;
else
return -EINVAL;
--
2.27.0
There are a number of different reasons which can cause a partition to
become invalid. A user seeing an invalid partition may not know exactly
why. To help user to get a better understanding of the underlying reason,
The cpuset.cpus.partition control file, when read, will now report the
reason why a partition become invalid. When a partition does become
invalid, reading the control file will show "root invalid (<reason>)"
where <reason> is a string that describes why the partition is invalid.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 58 ++++++++++++++++++++++++++++++++++++++----
1 file changed, 53 insertions(+), 5 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cfab10911682..d1025470b9ea 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -85,6 +85,26 @@ struct fmeter {
spinlock_t lock; /* guards read or write of above */
};
+/*
+ * Invalid partition error code
+ */
+enum prs_errcode {
+ PERR_NONE = 0,
+ PERR_INVCPUS,
+ PERR_INVPARENT,
+ PERR_NOTPART,
+ PERR_NOCPUS,
+ PERR_HOTPLUG,
+};
+
+static const char * const perr_strings[] = {
+ [PERR_INVCPUS] = "Invalid change to cpuset.cpus",
+ [PERR_INVPARENT] = "Parent is an invalid partition root",
+ [PERR_NOTPART] = "Parent is not a partition root",
+ [PERR_NOCPUS] = "Parent unable to distribute cpu downstream",
+ [PERR_HOTPLUG] = "No cpu available due to hotplug",
+};
+
struct cpuset {
struct cgroup_subsys_state css;
@@ -168,6 +188,9 @@ struct cpuset {
int use_parent_ecpus;
int child_ecpus_count;
+ /* Invalid partition error code, not lock protected */
+ enum prs_errcode prs_err;
+
/* Handle for cpuset.cpus.partition */
struct cgroup_file partition_file;
};
@@ -282,8 +305,13 @@ static inline int is_partition_root(const struct cpuset *cs)
static inline void notify_partition_change(struct cpuset *cs,
int old_prs, int new_prs)
{
- if (old_prs != new_prs)
- cgroup_file_notify(&cs->partition_file);
+ if (old_prs == new_prs)
+ return;
+ cgroup_file_notify(&cs->partition_file);
+
+ /* Reset prs_err if not invalid */
+ if (new_prs != PRS_ERROR)
+ WRITE_ONCE(cs->prs_err, PERR_NONE);
}
static struct cpuset top_cpuset = {
@@ -1292,6 +1320,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
part_error = partition_is_populated(parent, cpuset) &&
cpumask_subset(parent->effective_cpus, tmp->addmask) &&
!cpumask_intersects(tmp->delmask, cpu_active_mask);
+
+ if ((READ_ONCE(cpuset->prs_err) == PERR_NONE) && part_error)
+ WRITE_ONCE(cpuset->prs_err, PERR_INVCPUS);
} else {
/*
* partcmd_update w/o newmask:
@@ -1315,6 +1346,9 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
!parent->nr_subparts_cpus) ||
(cpumask_equal(parent->effective_cpus, tmp->addmask) &&
partition_is_populated(parent, cpuset));
+
+ if (is_partition_root(cpuset) && part_error)
+ WRITE_ONCE(cpuset->prs_err, PERR_NOCPUS);
}
if (cmd == partcmd_update) {
@@ -1471,6 +1505,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
* invalid too.
*/
new_prs = PRS_ERROR;
+ WRITE_ONCE(cp->prs_err,
+ (parent->partition_root_state == PRS_ERROR)
+ ? PERR_INVPARENT : PERR_NOTPART);
break;
}
}
@@ -2632,7 +2669,7 @@ static s64 cpuset_read_s64(struct cgroup_subsys_state *css, struct cftype *cft)
static int sched_partition_show(struct seq_file *seq, void *v)
{
struct cpuset *cs = css_cs(seq_css(seq));
- const char *type;
+ const char *err, *type;
switch (cs->partition_root_state) {
case PRS_ENABLED:
@@ -2646,7 +2683,11 @@ static int sched_partition_show(struct seq_file *seq, void *v)
break;
case PRS_ERROR:
type = is_sched_load_balance(cs) ? "root" : "isolated";
- seq_printf(seq, "%s invalid\n", type);
+ err = perr_strings[READ_ONCE(cs->prs_err)];
+ if (err)
+ seq_printf(seq, "%s invalid (%s)\n", type, err);
+ else
+ seq_printf(seq, "%s invalid\n", type);
break;
}
return 0;
@@ -3236,7 +3277,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
if (is_partition_root(cs) &&
((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
!parent->nr_subparts_cpus)) {
- int old_prs;
+ int old_prs, parent_prs;
update_parent_subparts_cpumask(cs, partcmd_disable,
NULL, tmp);
@@ -3249,10 +3290,17 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
}
old_prs = cs->partition_root_state;
+ parent_prs = parent->partition_root_state;
if (old_prs != PRS_ERROR) {
spin_lock_irq(&callback_lock);
cs->partition_root_state = PRS_ERROR;
spin_unlock_irq(&callback_lock);
+ if (parent_prs == PRS_ERROR)
+ WRITE_ONCE(cs->prs_err, PERR_INVPARENT);
+ else if (!parent_prs)
+ WRITE_ONCE(cs->prs_err, PERR_NOTPART);
+ else
+ WRITE_ONCE(cs->prs_err, PERR_HOTPLUG);
notify_partition_change(cs, old_prs, PRS_ERROR);
}
cpuset_force_rebuild();
--
2.27.0
Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.
Signed-off-by: Waiman Long <[email protected]>
---
tools/testing/selftests/cgroup/Makefile | 5 +-
.../selftests/cgroup/test_cpuset_prs.sh | 667 ++++++++++++++++++
tools/testing/selftests/cgroup/wait_inotify.c | 87 +++
3 files changed, 757 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 59e222460581..3f1fd3f93f41 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -Wall -pthread
-all:
+all: ${HELPER_PROGS}
TEST_FILES := with_stress.sh
-TEST_PROGS := test_stress.sh
+TEST_PROGS := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..cf8d20e3c544
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,667 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+ echo "$1"
+ echo "Test SKIPPED"
+ exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag and delay factor
+PROG=$1
+VERBOSE=
+DELAY_FACTOR=1
+while [[ "$1" = -* ]]
+do
+ case "$1" in
+ -v) VERBOSE=1
+ break
+ ;;
+ -d) DELAY_FACTOR=$2
+ shift
+ break
+ ;;
+ *) echo "Usage: $PROG [-v] [-d <delay-factor>"
+ exit
+ ;;
+ esac
+ shift
+done
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+# Pause in ms
+pause()
+{
+ DELAY=$1
+ LOOP=0
+ while [[ $LOOP -lt $DELAY_FACTOR ]]
+ do
+ sleep $DELAY
+ ((LOOP++))
+ done
+ return 0
+}
+
+console_msg()
+{
+ MSG=$1
+ echo "$MSG"
+ echo "" > /dev/console
+ echo "$MSG" > /dev/console
+ pause 0.01
+}
+
+test_partition()
+{
+ EXPECTED_VAL=$1
+ echo $EXPECTED_VAL > cpuset.cpus.partition
+ [[ $? -eq 0 ]] || exit 1
+ ACTUAL_VAL=$(cat cpuset.cpus.partition)
+ [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+ echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+test_effective_cpus()
+{
+ EXPECTED_VAL=$1
+ ACTUAL_VAL=$(cat cpuset.cpus.effective)
+ [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+ echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+ OUTSTR="$1"
+ ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+ echo $ERRMSG | grep -q "$OUTSTR"
+ [[ $? -ne 0 ]] && {
+ echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+ echo "Test FAILED"
+ exit 1
+ }
+ echo $$ > $CGROUP2/cgroup.procs # Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+ echo 2-3 > cpuset.cpus
+ TYPE=$(cat cpuset.cpus.partition)
+ [[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+ console_msg "Change from member to root"
+ test_partition root
+
+ console_msg "Change from root to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to member"
+ test_partition member
+
+ console_msg "Change from member to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to root"
+ test_partition root
+
+ console_msg "Change from root to member"
+ test_partition member
+
+ #
+ # Testing partition root with no cpu
+ #
+ console_msg "Distribute all cpus to child partition"
+ echo +cpuset > cgroup.subtree_control
+ test_partition root
+
+ mkdir A1
+ cd A1
+ echo 2-3 > cpuset.cpus
+ test_partition root
+ test_effective_cpus 2-3
+ cd ..
+ test_effective_cpus ""
+
+ console_msg "Moving task to partition test"
+ test_add_proc "No space left"
+ cd A1
+ test_add_proc ""
+ cd ..
+
+ console_msg "Shrink and expand child partition"
+ cd A1
+ echo 2 > cpuset.cpus
+ cd ..
+ test_effective_cpus 3
+ cd A1
+ echo 2-3 > cpuset.cpus
+ cd ..
+ test_effective_cpus ""
+
+ # Cleaning up
+ console_msg "Cleaning up"
+ echo $$ > $CGROUP2/cgroup.procs
+ [[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+# \- B1
+#
+# P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+# C<l> = add cpu-list
+# S<p> = use prefix in subtree_control
+# T = put a task into cgroup
+# O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1"
+ " S+ C0-1 . . C2-3 P1 . . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 "
+ " S+ C0-1:S+ . . C2-3 . . . P1 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 "
+ " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5"
+ " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5"
+ " S+ C0-1 . . C2-3:P1 . . . C2 0 "
+ " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5"
+ " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0"
+ " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1"
+ " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # CPU offlining cases:
+ " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1"
+ " S+ C2-3:P1:S+ C3:P1 . . . T:O3-0 . . 0 A1:2,A2:2 A1:P1,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ #
+ # Incorrect change to cpuset.cpus invalidates partition root
+ #
+ # Adding CPUs to partition root that are not in parent's
+ # cpuset.cpus is allowed, but those extra CPUs are ignored.
+ " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:,A2:2-3 A1:P1,A2:P1"
+
+ # Taking away all CPUs from parent or itself if there are tasks
+ # will make the partition invalid.
+ " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # Changing a partition root to member makes child partitions invalid
+ " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1"
+
+ # cpuset.cpus can contains cpus not in parent's cpuset.cpus as long
+ # as they overlap.
+ " S+ C2-3:P1:S+ . . . . C3-4:P1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+
+ # Deletion of CPUs distributed to child cgroup is allowed.
+ " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 0 A1:4-5,A2:4-5"
+
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ # Failure cases:
+
+ # To become a partition root, cpuset.cpus must overlap parent's
+ # cpuset.cpus.
+ " S+ C0-1 . . C2-3 S+ C4-5:P1 . . 1 "
+
+ # A cpuset cannot become a partition root if it has child cpusets
+ # with non-empty cpuset.cpus.
+ " S+ C0-1:S+ C1 . C2-3 P1 . . . 1 "
+
+ # Any change to cpuset.cpus of a partition root must be exclusive.
+ " S+ C0-1:P1 . . C2-3 C0-2 . . . 1 "
+ " S+ C0-1 . . C2-3:P1 . . . C1 1 "
+ " S+ C2-3:P1:S+ C2:P1 . C1 C1-3 . . . 1 "
+
+ # A task cannot be added to a partition with no cpu
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+# $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+ CPU=${1%-*}
+ VAL=${1#*-}
+ CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online
+ if [[ $VAL -eq 0 ]]
+ then
+ OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+ else
+ [[ -n "$OFFLINE_CPUS" ]] && {
+ OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+ sort | uniq -u)
+ }
+ fi
+ echo $VAL > $CPUFILE
+ pause 0.01
+}
+
+#
+# Set controller state
+# $1 - cgroup directory
+# $2 - state
+# $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+ TMPMSG=/tmp/.msg_$$
+ CGRP=$1
+ STATE=$2
+ SHOWERR=${3}${VERBOSE}
+ CTRL=${CTRL:=$CONTROLLER}
+ HASERR=0
+ REDIRECT="2> $TMPMSG"
+ [[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+ rm -f $TMPMSG
+ for CMD in $(echo $STATE | sed -e "s/:/ /g")
+ do
+ TFILE=$CGRP/cgroup.procs
+ SFILE=$CGRP/cgroup.subtree_control
+ PFILE=$CGRP/cpuset.cpus.partition
+ CFILE=$CGRP/cpuset.cpus
+ S=$(expr substr $CMD 1 1)
+ if [[ $S = S ]]
+ then
+ PREFIX=${CMD#?}
+ COMM="echo ${PREFIX}${CTRL} > $SFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = C ]]
+ then
+ CPUS=${CMD#?}
+ COMM="echo $CPUS > $CFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = P ]]
+ then
+ VAL=${CMD#?}
+ case $VAL in
+ 0) VAL=member
+ ;;
+ 1) VAL=root
+ ;;
+ 2) VAL=isolated
+ ;;
+ *)
+ echo "Invalid partition state - $VAL"
+ exit 1
+ ;;
+ esac
+ COMM="echo $VAL > $PFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = O ]]
+ then
+ VAL=${CMD#?}
+ write_cpu_online $VAL
+ elif [[ $S = T ]]
+ then
+ COMM="echo 0 > $TFILE"
+ eval $COMM $REDIRECT
+ fi
+ RET=$?
+ [[ $RET -ne 0 ]] && {
+ [[ -n "$SHOWERR" ]] && {
+ echo "$COMM"
+ cat $TMPMSG
+ }
+ HASERR=1
+ }
+ pause 0.01
+ rm -f $TMPMSG
+ done
+ return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+ CGRP=$1
+ STATE=$2
+ [[ -d $CGRP ]] || mkdir $CGRP
+ set_ctrl_state $CGRP $STATE 1
+ [[ $? -ne 0 ]] && {
+ echo "ERROR: Failed to set $2 to cgroup $1!"
+ exit 1
+ }
+}
+
+online_cpus()
+{
+ [[ -n "OFFLINE_CPUS" ]] && {
+ for C in $OFFLINE_CPUS
+ do
+ write_cpu_online ${C}-1
+ done
+ }
+}
+
+#
+# Return 1 if the list of effective cpus isn't the same as the initial list.
+#
+reset_cgroup_states()
+{
+ echo 0 > $CGROUP2/cgroup.procs
+ online_cpus
+ rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+ set_ctrl_state . S-
+ pause 0.01
+}
+
+dump_states()
+{
+ for DIR in A1 A1/A2 A1/A2/A3 B1
+ do
+ ECPUS=$DIR/cpuset.cpus.effective
+ PRS=$DIR/cpuset.cpus.partition
+ [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+ [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)"
+ done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ CPUS=$2
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+ FILE=$CGRP/cpuset.cpus.effective
+ [[ -e $FILE ]] || return 1
+ [[ $CPUS = $(cat $FILE) ]] || return 1
+ done
+}
+
+#
+# Check cgroup states
+# $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ STATE=$2
+ FILE=
+ EVAL=$(expr substr $STATE 2 2)
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+ case $STATE in
+ P*) FILE=$CGRP/cpuset.cpus.partition
+ ;;
+ *) echo "Unknown state: $STATE!"
+ exit 1
+ ;;
+ esac
+ VAL=$(cat $FILE)
+
+ case "$VAL" in
+ member) VAL=0
+ ;;
+ root) VAL=1
+ ;;
+ isolated)
+ VAL=2
+ ;;
+ "root invalid"*)
+ VAL=-1
+ ;;
+ esac
+ [[ $EVAL != $VAL ]] && return 1
+ done
+ return 0
+}
+
+#
+# Run cpuset state transition test
+# $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running in a slower machine.
+#
+run_state_test()
+{
+ TEST=$1
+ CONTROLLER=cpuset
+ CPULIST=0-6
+ I=0
+ eval CNT="\${#$TEST[@]}"
+
+ reset_cgroup_states
+ echo $CPULIST > cpuset.cpus
+ echo root > cpuset.cpus.partition
+ console_msg "Running state transition test ..."
+
+ while [[ $I -lt $CNT ]]
+ do
+ echo "Running test $I ..." > /dev/console
+ eval set -- "\${$TEST[$I]}"
+ ROOT=$1
+ OLD_A1=$2
+ OLD_A2=$3
+ OLD_A3=$4
+ OLD_B1=$5
+ NEW_A1=$6
+ NEW_A2=$7
+ NEW_A3=$8
+ NEW_B1=$9
+ RESULT=${10}
+ ECPUS=${11}
+ STATES=${12}
+
+ set_ctrl_state_noerr . $ROOT
+ set_ctrl_state_noerr A1 $OLD_A1
+ set_ctrl_state_noerr A1/A2 $OLD_A2
+ set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+ set_ctrl_state_noerr B1 $OLD_B1
+ RETVAL=0
+ set_ctrl_state A1 $NEW_A1; ((RETVAL += $?))
+ set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?))
+ set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+ set_ctrl_state B1 $NEW_B1; ((RETVAL += $?))
+
+ [[ $RETVAL -ne $RESULT ]] && {
+ echo "Test $TEST[$I] failed result check!"
+ eval echo \"\${$TEST[$I]}\"
+ dump_states
+ online_cpus
+ exit 1
+ }
+
+ [[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+ check_effective_cpus $ECPUS
+ [[ $? -ne 0 ]] && {
+ echo "Test $TEST[$I] failed effective CPU check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ [[ -n "$STATES" ]] && {
+ check_cgroup_states $STATES
+ [[ $? -ne 0 ]] && {
+ echo "FAILED: Test $TEST[$I] failed states check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ reset_cgroup_states
+ #
+ # Check to see if effective cpu list changes
+ #
+ pause 0.05
+ NEWLIST=$(cat cpuset.cpus.effective)
+ [[ $NEWLIST != $CPULIST ]] && {
+ echo "Effective cpus changed to $NEWLIST after test $I!"
+ exit 1
+ }
+ [[ -n "$VERBOSE" ]] && echo "Test $I done."
+ ((I++))
+ done
+ echo "All $I tests of $TEST PASSED."
+
+ echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+ CGROUP_FILE=$1
+ OUTPUT_FILE=$2
+
+ $WAIT_INOTIFY $CGROUP_FILE
+ cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+ ERR=0
+ PRS=/tmp/.prs_$$
+ [[ -f $WAIT_INOTIFY ]] || {
+ echo "wait_inotify not found, inotify test SKIPPED."
+ return
+ }
+
+ pause 0.01
+ echo 1 > cpuset.cpus
+ echo 0 > cgroup.procs
+ echo root > cpuset.cpus.partition
+ pause 0.01
+ rm -f $PRS
+ wait_inotify $PWD/cpuset.cpus.partition $PRS &
+ pause 0.01
+ set_ctrl_state . "O1-0"
+ pause 0.01
+ check_cgroup_states ".:P-1"
+ if [[ $? -ne 0 ]]
+ then
+ echo "FAILED: Inotify test - partition not invalid"
+ ERR=1
+ elif [[ ! -f $PRS ]]
+ then
+ echo "FAILED: Inotify test - event not generated"
+ ERR=1
+ kill %1
+ elif [[ $(cat $PRS) != "root invalid"* ]]
+ then
+ echo "FAILED: Inotify test - incorrect state"
+ cat $PRS
+ ERR=1
+ fi
+ online_cpus
+ echo member > cpuset.cpus.partition
+ echo 0 > ../cgroup.procs
+ if [[ $ERR -ne 0 ]]
+ then
+ exit 1
+ else
+ echo "Inotify test PASSED"
+ fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..e11b431e1b62
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+ fprintf(stderr, msg, file);
+ exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+ char *cmd = argv[0];
+ int c, fd;
+ struct pollfd fds = { .events = POLLIN, };
+
+ while ((c = getopt(argc, argv, "v")) != -1) {
+ switch (c) {
+ case 'v':
+ verbose++;
+ break;
+ }
+ argv++, argc--;
+ }
+
+ if (argc != 2) {
+ fprintf(stderr, usage, cmd);
+ return -1;
+ }
+ file = argv[1];
+ fd = open(file, O_RDONLY);
+ if (fd < 0)
+ fail_message("Cgroup file %s not found!\n");
+ close(fd);
+
+ fd = inotify_init();
+ if (fd < 0)
+ fail_message("inotify_init() fails on %s!\n");
+ if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+ fail_message("inotify_add_watch() fails on %s!\n");
+ fds.fd = fd;
+
+ /*
+ * poll waiting loop
+ */
+ for (;;) {
+ int ret = poll(&fds, 1, 10000);
+
+ if (ret < 0) {
+ if (errno == EINTR)
+ continue;
+ perror("poll");
+ exit(1);
+ }
+ if ((ret > 0) && (fds.revents & POLLIN))
+ break;
+ }
+ if (verbose) {
+ struct inotify_event events[10];
+ long len;
+
+ usleep(1000);
+ len = read(fd, events, sizeof(events));
+ printf("Number of events read = %ld\n",
+ len/sizeof(struct inotify_event));
+ }
+ close(fd);
+ return 0;
+}
--
2.27.0
Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.
Signed-off-by: Waiman Long <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 168 ++++++++++++++----------
1 file changed, 102 insertions(+), 66 deletions(-)
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 2aeb7ae8b393..9612319b353f 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2099,74 +2099,110 @@ Cpuset Interface Files
It accepts only the following input values when written to.
======== ================================
- "root" a partition root
- "member" a non-root member of a partition
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
======== ================================
- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- ============== ==============================
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
- ============== ==============================
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ The root cgroup is always a partition root and its state
+ cannot be changed. All other non-root cgroups start out as
+ "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain that comprises itself and
+ all its descendants except those that are separate partition
+ roots themselves and their descendants.
+
+ The value shown in "cpuset.cpus.effective" of a partition root is
+ the CPUs that the parent partition root can dedicate to the new
+ partition root. They are subtracted from "cpuset.cpus.effective"
+ of the parent and may be different from "cpuset.cpus"
+
+ When set to "isolated", the CPUs in that partition root will
+ be in an isolated state without any load balancing from the
+ scheduler. Tasks placed in such a partition with multiple
+ CPUs should be carefully distributed and bound to each of the
+ individual CPUs for optimal performance.
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+ root is in a degraded state where some state information are
+ retained, but behaves more like a "member".
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ====================== ==============================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ====================== ==============================
+
+ In the case of an invalid partition root, a descriptive string on
+ why the partition is invalid is included within parentheses.
+
+ Almost all possible state transitions among "member", valid
+ and invalid partition roots are allowed except from "member"
+ to invalid partition root.
+
+ Before the "member" to partition root transition can happen,
+ the following conditions must be met or the transition will
+ not be allowed.
+
+ 1) The "cpuset.cpus" is non-empty and exclusive, i.e. they are
+ not shared by any of its siblings.
+ 2) The parent cgroup is a valid partition root.
+ 3) The "cpuset.cpus" must contain at least one of the CPUs from
+ parent's "cpuset.cpus", i.e. they overlap.
+ 4) There is no child cgroups with cpuset enabled. This avoids
+ cpu migrations of multiple cgroups simultaneously which can
+ be problematic.
+
+ Once becoming a partition root, the only rule restricting
+ changes made to "cpuset.cpus" is the exclusivity rule where
+ none of the siblings of a partition root can share CPUs with
+ it.
+
+ External events like hotplug or inappropriate changes to
+ "cpuset.cpus" can cause a valid partition root to become invalid.
+ Besides the exclusivity rule listed above, the other conditions
+ required to maintain the validity of a partition root are
+ as follows:
+
+ 1) The parent cgroup is a valid partition root.
+ 2) If "cpuset.cpus.effective" is empty, the partition must have
+ no task associated with it. Otherwise, the partition becomes
+ invalid and "cpuset.cpus.effective" will fall back to that
+ of the nearest non-empty ancestor.
+
+ A corollary of a valid partition root is that
+ "cpuset.cpus.effective" is always a subset of "cpuset.cpus".
+ Note that a task cannot be moved to a cgroup with empty
+ "cpuset.cpus.effective".
+
+ A valid non-root parent partition may distribute out all its CPUs
+ to its child partitions when there is no task associated with it.
+
+ An invalid partition root will be reverted back to a valid
+ one if none of the validity constraints of a valid partition
+ root are violated due to hotplug events or proper changes to
+ "cpuset.cpus" files.
+
+ Changing a partition root (valid or invalid) to "member" is
+ always allowed. If there are child partition roots underneath
+ it, they will become invalid and unrecoverable. So care must
+ be taken to double check for this condition before disabling
+ a partition root.
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+ by write to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.
Device controller
--
2.27.0
On 12/5/21 13:32, Waiman Long wrote:
> v9:
> - Add a new patch 1 to remove the child cpuset restriction on parent's
> "cpuset.cpus".
> - Relax initial root partition entry limitation to allow cpuset.cpus to
> overlap that of parent's.
> - An "isolated invalid" displayed type is added to
> cpuset.cpus.partition.
> - Resetting partition root to "member" will leave child partition root
> as invalid.
> - Update documentation and test accordingly.
>
> v8:
> - Reorganize the patch series and rationalize the features and
> constraints of a partition.
> - Update patch descriptions and documentation accordingly.
>
> v7:
> - Simplify the documentation patch (patch 5) as suggested by Tejun.
> - Fix a typo in patch 2 and improper commit log in patch 3.
>
> This patchset includes one bug fix and four enhancements to the cpuset v2 code.
>
> Patch 1: Allow parent to set "cpuset.cpus" that may not be a superset
> of children's "cpuset.cpus" for default hierarchy.
>
> Patch 2: Enable partition with no task to have empty cpuset.cpus.effective.
>
> Patch 3: Refining the features and constraints of a cpuset partition
> clarifying what changes are allowed.
>
> Patch 4: Add a new partition state "isolated" to create a partition
> root without load balancing. This is for handling intermitten workloads
> that have a strict low latency requirement.
>
> Patch 5: Enable the "cpuset.cpus.partition" file to show the reason
> that causes invalid partition like "root invalid (No cpu available
> due to hotplug)".
>
> Patch 6 updates the cgroup-v2.rst file accordingly. Patch 7 adds a new
> cpuset test to test the new cpuset partition code.
>
> Waiman Long (7):
> cgroup/cpuset: Don't let child cpusets restrict parent in default
> hierarchy
> cgroup/cpuset: Allow no-task partition to have empty
> cpuset.cpus.effective
> cgroup/cpuset: Refining features and constraints of a partition
> cgroup/cpuset: Add a new isolated cpus.partition type
> cgroup/cpuset: Show invalid partition reason string
> cgroup/cpuset: Update description of cpuset.cpus.partition in
> cgroup-v2.rst
> kselftest/cgroup: Add cpuset v2 partition root state test
>
> Documentation/admin-guide/cgroup-v2.rst | 168 +++--
> kernel/cgroup/cpuset.c | 440 +++++++-----
> tools/testing/selftests/cgroup/Makefile | 5 +-
> .../selftests/cgroup/test_cpuset_prs.sh | 667 ++++++++++++++++++
> tools/testing/selftests/cgroup/wait_inotify.c | 87 +++
> 5 files changed, 1142 insertions(+), 225 deletions(-)
> create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
> create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
>
Hi,
Is this patch series good enough or is there other changes you would
still like to make in this series?
Cheers,
Longman
On Sun, Dec 05, 2021 at 01:32:14PM -0500, Waiman Long wrote:
> In validate_change(), there is a check since v2.6.12 to make sure that
> each of the child cpusets must be a subset of a parent cpuset. IOW, it
> allows child cpusets to restrict what changes can be made to a parent's
> "cpuset.cpus". This actually violates one of the core principles of the
> default hierarchy where a cgroup higher up in the hierarchy should be
> able to change configuration however it sees fit as deligation breaks
> down otherwise.
>
> To address this issue, the check is now removed for the default hierarchy
> to free parent cpusets from being restricted by child cpusets. The
> check will still apply for legacy hierarchy.
>
> Suggested-by: Tejun Heo <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
Applied to cgroup/for-5.17.
Thanks.
--
tejun
On Sun, Dec 05, 2021 at 01:32:15PM -0500, Waiman Long wrote:
> adding = deleting = false;
> old_prs = new_prs = cpuset->partition_root_state;
> if (cmd == partcmd_enable) {
> + /*
> + * Enabling partition root is not allowed if not all the CPUs
> + * can be granted from parent's effective_cpus.
> + */
> + if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
> + return -EINVAL;
> +
> + /*
> + * A parent can be left with no CPU as long as there is no
> + * task directly associated with the parent partition. For
> + * such a parent, no new task can be moved into it.
> + */
> + if (partition_is_populated(parent, cpuset) &&
> + cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
> + return -EINVAL;
So, given that this only happens with threaded domains, can we just not
allow partitions within threaded domains? The combination doesn't make whole
lot of sense to me anyway.
> + /*
> + * On default hierarchy, task cannot be moved to a cpuset with empty
> + * effective cpus.
> + */
> + if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
> + goto out_unlock;
And then we can avoid this extra restriction too, right?
Thanks.
--
tejun
Hello,
On Sun, Dec 05, 2021 at 01:32:19PM -0500, Waiman Long wrote:
> + In the case of an invalid partition root, a descriptive string on
> + why the partition is invalid is included within parentheses.
> +
> + Almost all possible state transitions among "member", valid
> + and invalid partition roots are allowed except from "member"
> + to invalid partition root.
So, this part still bothers me for the following two reasons that I brought
up earlier:
* When a valid partition turns invalid, now we have a reliable way of
discovering what exactly caused the transition. However, when a user now
fails to turn a member into partition, all they get is -EINVAL and there's
no way to discover why it failed and the failure conditions that -EINVAL
represents aren't simple.
* In an automated configuration scenarios, this operation mode may be
difficult to make reliable and lead to sporadic failures which can be
tricky to track down. The core problem is that whether a given operation
succeeds or not may depend on external states (CPU on/offline) which may
change asynchronously in a way that the configuring entity doesn't have
any control over.
It's true that both are existing problems with the current partition
interface and given that this is a pretty spcialized feature, this can be
okay. Michal, what are your thoughts?
Thanks.
--
tejun
On 12/13/21 15:45, Tejun Heo wrote:
> On Sun, Dec 05, 2021 at 01:32:15PM -0500, Waiman Long wrote:
>> adding = deleting = false;
>> old_prs = new_prs = cpuset->partition_root_state;
>> if (cmd == partcmd_enable) {
>> + /*
>> + * Enabling partition root is not allowed if not all the CPUs
>> + * can be granted from parent's effective_cpus.
>> + */
>> + if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus))
>> + return -EINVAL;
>> +
>> + /*
>> + * A parent can be left with no CPU as long as there is no
>> + * task directly associated with the parent partition. For
>> + * such a parent, no new task can be moved into it.
>> + */
>> + if (partition_is_populated(parent, cpuset) &&
>> + cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus))
>> + return -EINVAL;
> So, given that this only happens with threaded domains, can we just not
> allow partitions within threaded domains? The combination doesn't make whole
> lot of sense to me anyway.
AFAICS, there are code in cpuset.c that disallows the an non-child node
to hold tasks, but the check doesn't cover all the possible cases. I
remembered that I was able to create such a scenario without using
threaded domains. That is why I put in this conditional check. It has
nothing to do with the use of threaded domains.
>> + /*
>> + * On default hierarchy, task cannot be moved to a cpuset with empty
>> + * effective cpus.
>> + */
>> + if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
>> + goto out_unlock;
> And then we can avoid this extra restriction too, right?
This check is supposed to prevent a task to be moved to a leaf cpuset
partition with just offlined cpus and hence no effective cpu. A possible
alternative is to force the partition to become invalid, but I think not
allowing the move is easier until one or more offlined cpus are onlined.
Cheers,
Longman
On Tue, Dec 14, 2021 at 10:24:22PM -0500, Waiman Long <[email protected]> wrote:
> AFAICS, there are code in cpuset.c that disallows the an non-child node to
> hold tasks, but the check doesn't cover all the possible cases.
> I remembered that I was able to create such a scenario without using
> threaded domains.
On the default hierarchy (with controller(s) enabled)? That sounds like a bug.
> That is why I put in this conditional check. It has nothing to do with the
> use of threaded domains.
But threaded domains are important nevertheless.
I think that a structure like
app-cgroup cgroup.type=threaded domain cpuset.partition=root
`- rt cgroup.type=threaded cpuset.partition=isolated
`- normal cgroup.type=threaded
is a valid use case. Therefore I would not disallow partitioning inside
threaded subtrees (as suggested).
Michal
On Mon, Dec 13, 2021 at 10:41:23AM -1000, Tejun Heo <[email protected]> wrote:
> > To address this issue, the check is now removed for the default hierarchy
> > to free parent cpusets from being restricted by child cpusets. The
> > check will still apply for legacy hierarchy.
I'm trying to find whether something in update_cpumasks_hier() ensures
the constraint is checkd on the legacy hierarchy but it seems to me this
baby was thrown out with the bathwater. How is the legacy check still
applied?
> Applied to cgroup/for-5.17.
It comes out a bit more complex if I want to achieve both variants in
the below followup:
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0dd7d853ed17..8b6e06f504f6 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -590,6 +590,35 @@ static inline void free_cpuset(struct cpuset *cs)
kfree(cs);
}
+/*
+ * validate_change_legacy() - Validate conditions specific to legacy (v1)
+ * behavior.
+ */
+static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *c, *par;
+ int ret;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* Each of our child cpusets must be a subset of us */
+ ret = -EBUSY;
+ cpuset_for_each_child(c, css, cur)
+ if (!is_cpuset_subset(c, trial))
+ goto out;
+
+ /* On legacy hierarchy, we must be a subset of our parent cpuset. */
+ ret = -EACCES;
+ par = parent_cs(cur);
+ if (par && !is_cpuset_subset(trial, par))
+ goto out;
+
+ ret = 0;
+out:
+ return ret;
+}
+
/*
* validate_change() - Used to validate that any proposed cpuset change
* follows the structural rules for cpusets.
@@ -614,20 +643,21 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
struct cgroup_subsys_state *css;
struct cpuset *c, *par;
- int ret;
-
- /* The checks don't apply to root cpuset */
- if (cur == &top_cpuset)
- return 0;
+ int ret = 0;
rcu_read_lock();
- par = parent_cs(cur);
- /* On legacy hierarchy, we must be a subset of our parent cpuset. */
- ret = -EACCES;
- if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
+ ret = validate_change_legacy(cur, trial);
+ if (ret)
+ goto out;
+
+ /* Remaining checks don't apply to root cpuset */
+ ret = 0;
+ if (cur == &top_cpuset)
goto out;
+ par = parent_cs(cur);
+
/*
* If either I or some sibling (!= me) is exclusive, we can't
* overlap
@@ -1175,9 +1205,7 @@ enum subparts_cmd {
*
* Because of the implicit cpu exclusive nature of a partition root,
* cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change(). The validate_change()
- * function will also prevent any changes to the cpu list if it is not
- * a superset of children's cpu lists.
+ * permitted when checked by validate_change().
*/
static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
struct cpumask *newmask,
On Mon, Dec 13, 2021 at 11:00:17AM -1000, Tejun Heo <[email protected]> wrote:
> * When a valid partition turns invalid, now we have a reliable way of
> discovering what exactly caused the transition. However, when a user now
> fails to turn a member into partition, all they get is -EINVAL and there's
> no way to discover why it failed and the failure conditions that -EINVAL
> represents aren't simple.
>
> * In an automated configuration scenarios, this operation mode may be
> difficult to make reliable and lead to sporadic failures which can be
> tricky to track down. The core problem is that whether a given operation
> succeeds or not may depend on external states (CPU on/offline) which may
> change asynchronously in a way that the configuring entity doesn't have
> any control over.
>
> It's true that both are existing problems with the current partition
> interface and given that this is a pretty spcialized feature, this can be
> okay. Michal, what are your thoughts?
Because of asynchronous changes, the return value should not be that
important and the user should watch cpuset.partitions for the result
(end state) anyway.
Furthermore, the reasons should be IMO just informative (i.e. I like
they're not explicitly documented) and not API.
But I see there could be a distinction between -EINVAL (the supplied
input makes no sense) and -EAGAIN(?) denoting that the switch to
partition root could not happen (due to outer constraints).
You seem to propose to replace the -EAGAIN above with a success code and
allow the switch to an invalid root.
The action of the configuring entity would be different: retry (when?)
vs wait till transition happens (notification) (although the immediate
effect (the change did not happen) is same).
I considered the two variants equal but the clear information about when
the change can happen I'd favor the variant allowing the switch to
invalid root now.
Michal
On Sun, Dec 05, 2021 at 01:32:16PM -0500, Waiman Long <[email protected]> wrote:
> @@ -1455,34 +1450,16 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
> switch (parent->partition_root_state) {
> [...]
> -
> case PRS_ENABLED:
> - if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
> - update_tasks_cpumask(parent);
> + update_parent = true;
> [...]
> + if (update_parent) {
> + if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
> + update_tasks_cpumask(parent);
> + /*
> + * The cpuset partition_root_state may be changed
> + * to PRS_ERROR. Capture it.
> + */
> + new_prs = cp->partition_root_state;
> + }
IIUC, this ensures that when a parent becomes partition root again, this
would propagate downwards to invalidated children.
However, the documentation says:
> + Changing a partition root (valid or invalid) to "member" is
> + always allowed. If there are child partition roots underneath
> + it, they will become invalid and unrecoverable. So care must
> + be taken to double check for this condition before disabling
> + a partition root.
I.e. it suggests a child can be trapped in the unrecoverable state (i.e.
not fixable by writing into cpuset.cpus.partition).
But this does not happen, right?
Michal
On 12/15/21 09:49, Michal Koutný wrote:
> On Sun, Dec 05, 2021 at 01:32:16PM -0500, Waiman Long <[email protected]> wrote:
>> @@ -1455,34 +1450,16 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
>> switch (parent->partition_root_state) {
>> [...]
>> -
>> case PRS_ENABLED:
>> - if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
>> - update_tasks_cpumask(parent);
>> + update_parent = true;
>> [...]
>> + if (update_parent) {
>> + if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
>> + update_tasks_cpumask(parent);
>> + /*
>> + * The cpuset partition_root_state may be changed
>> + * to PRS_ERROR. Capture it.
>> + */
>> + new_prs = cp->partition_root_state;
>> + }
> IIUC, this ensures that when a parent becomes partition root again, this
> would propagate downwards to invalidated children.
>
> However, the documentation says:
>
>> + Changing a partition root (valid or invalid) to "member" is
>> + always allowed. If there are child partition roots underneath
>> + it, they will become invalid and unrecoverable. So care must
>> + be taken to double check for this condition before disabling
>> + a partition root.
> I.e. it suggests a child can be trapped in the unrecoverable state (i.e.
> not fixable by writing into cpuset.cpus.partition).
> But this does not happen, right?
There are additional checks for the member to partition transition which
requires that the target cpuset shouldn't have child cpuset. That
prevents the recovering of a invalid partition root under a member
cpuset. We could certainly remove that restriction by adding additional
code as well as additional tests to verify it works. I haven't done that
simply to avoid adding more complexity to the current code.
Cheers,
Longman
On 12/15/21 07:23, Michal Koutný wrote:
> On Mon, Dec 13, 2021 at 10:41:23AM -1000, Tejun Heo <[email protected]> wrote:
>>> To address this issue, the check is now removed for the default hierarchy
>>> to free parent cpusets from being restricted by child cpusets. The
>>> check will still apply for legacy hierarchy.
> I'm trying to find whether something in update_cpumasks_hier() ensures
> the constraint is checkd on the legacy hierarchy but it seems to me this
> baby was thrown out with the bathwater. How is the legacy check still
> applied?
Yes, you are right. I did remove the check for legacy hierarchy too.
>> Applied to cgroup/for-5.17.
> It comes out a bit more complex if I want to achieve both variants in
> the below followup:
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0dd7d853ed17..8b6e06f504f6 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -590,6 +590,35 @@ static inline void free_cpuset(struct cpuset *cs)
> kfree(cs);
> }
>
> +/*
> + * validate_change_legacy() - Validate conditions specific to legacy (v1)
> + * behavior.
> + */
> +static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *c, *par;
> + int ret;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /* Each of our child cpusets must be a subset of us */
> + ret = -EBUSY;
> + cpuset_for_each_child(c, css, cur)
> + if (!is_cpuset_subset(c, trial))
> + goto out;
> +
> + /* On legacy hierarchy, we must be a subset of our parent cpuset. */
> + ret = -EACCES;
> + par = parent_cs(cur);
> + if (par && !is_cpuset_subset(trial, par))
> + goto out;
> +
> + ret = 0;
> +out:
> + return ret;
> +}
> +
> /*
> * validate_change() - Used to validate that any proposed cpuset change
> * follows the structural rules for cpusets.
> @@ -614,20 +643,21 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
> {
> struct cgroup_subsys_state *css;
> struct cpuset *c, *par;
> - int ret;
> -
> - /* The checks don't apply to root cpuset */
> - if (cur == &top_cpuset)
> - return 0;
> + int ret = 0;
>
> rcu_read_lock();
> - par = parent_cs(cur);
>
> - /* On legacy hierarchy, we must be a subset of our parent cpuset. */
> - ret = -EACCES;
> - if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
I think you still need to guard it with "!is_in_v2_mode()".
if (!is_in_v2_mode()) {
ret = validate_change_legacy(cur, trial);
if (ret)
goto out;
}
> + ret = validate_change_legacy(cur, trial);
> + if (ret)
> + goto out;
> +
> + /* Remaining checks don't apply to root cpuset */
> + ret = 0;
> + if (cur == &top_cpuset)
> goto out;
>
> + par = parent_cs(cur);
> +
> /*
> * If either I or some sibling (!= me) is exclusive, we can't
> * overlap
Cheers,
Longman
On 12/15/21 09:44, Michal Koutný wrote:
> On Mon, Dec 13, 2021 at 11:00:17AM -1000, Tejun Heo <[email protected]> wrote:
>> * When a valid partition turns invalid, now we have a reliable way of
>> discovering what exactly caused the transition. However, when a user now
>> fails to turn a member into partition, all they get is -EINVAL and there's
>> no way to discover why it failed and the failure conditions that -EINVAL
>> represents aren't simple.
>>
>> * In an automated configuration scenarios, this operation mode may be
>> difficult to make reliable and lead to sporadic failures which can be
>> tricky to track down. The core problem is that whether a given operation
>> succeeds or not may depend on external states (CPU on/offline) which may
>> change asynchronously in a way that the configuring entity doesn't have
>> any control over.
>>
>> It's true that both are existing problems with the current partition
>> interface and given that this is a pretty spcialized feature, this can be
>> okay. Michal, what are your thoughts?
> Because of asynchronous changes, the return value should not be that
> important and the user should watch cpuset.partitions for the result
> (end state) anyway.
> Furthermore, the reasons should be IMO just informative (i.e. I like
> they're not explicitly documented) and not API.
>
> But I see there could be a distinction between -EINVAL (the supplied
> input makes no sense) and -EAGAIN(?) denoting that the switch to
> partition root could not happen (due to outer constraints).
>
> You seem to propose to replace the -EAGAIN above with a success code and
> allow the switch to an invalid root.
> The action of the configuring entity would be different: retry (when?)
> vs wait till transition happens (notification) (although the immediate
> effect (the change did not happen) is same).
> I considered the two variants equal but the clear information about when
> the change can happen I'd favor the variant allowing the switch to
> invalid root now.
Allowing direct transition from member to invalid partition doesn't feel
right for me. A casual user may assume a partition is correctly formed
without double checking the "cpuset.partition" value. Returning an error
will prevent this kind of issue. If returning more information about the
failure is the main reason for allowing the invalid partition
transition, we can extend the "cpuset.partition" read syntax to also
show the reason for the previous failure.
Cheers,
Longman
Hello, Waiman.
On Wed, Dec 15, 2021 at 01:16:43PM -0500, Waiman Long wrote:
> Allowing direct transition from member to invalid partition doesn't feel
> right for me. A casual user may assume a partition is correctly formed
> without double checking the "cpuset.partition" value. Returning an error
> will prevent this kind of issue. If returning more information about the
> failure is the main reason for allowing the invalid partition transition, we
> can extend the "cpuset.partition" read syntax to also show the reason for
> the previous failure.
I don't think it's a good idea to display error messages without a way to
link the error to the one who triggered it. This is the same problem we had
with resettable counters. It only works for scenarios where one guy is
sitting in front of the computer but gets nastry for more complex scnearios
and automation.
I understand that allowing transitions to invalid state can feel jarring.
There are pros and cons to both approaches. It's similar dynamics tho.
Erroring out may be more intuitive for a casual user but makes it harder for
more complex scenarios because whether a given operation errors or not is
dependent on external asynchronous states, there's no good way of reporting
the exact nature of the error or detecting when the operation would succeed
in the future, and the error conditions are rather arbitrary.
Thanks.
--
tejun
On 12/15/21 13:35, Tejun Heo wrote:
> Hello, Waiman.
>
> On Wed, Dec 15, 2021 at 01:16:43PM -0500, Waiman Long wrote:
>> Allowing direct transition from member to invalid partition doesn't feel
>> right for me. A casual user may assume a partition is correctly formed
>> without double checking the "cpuset.partition" value. Returning an error
>> will prevent this kind of issue. If returning more information about the
>> failure is the main reason for allowing the invalid partition transition, we
>> can extend the "cpuset.partition" read syntax to also show the reason for
>> the previous failure.
> I don't think it's a good idea to display error messages without a way to
> link the error to the one who triggered it. This is the same problem we had
> with resettable counters. It only works for scenarios where one guy is
> sitting in front of the computer but gets nastry for more complex scnearios
> and automation.
Yes, I agree it is not a good way to handle this issue.
>
> I understand that allowing transitions to invalid state can feel jarring.
> There are pros and cons to both approaches. It's similar dynamics tho.
> Erroring out may be more intuitive for a casual user but makes it harder for
> more complex scenarios because whether a given operation errors or not is
> dependent on external asynchronous states, there's no good way of reporting
> the exact nature of the error or detecting when the operation would succeed
> in the future, and the error conditions are rather arbitrary.
Thanks for the explanation. Yes, there are always pros and cons for
different approach to a problem. I am not totally against allowing
member to invalid partition transition. In that case, reading back
"cpuset.partition" is a must to verify that it is really a success.
How about we allow transition to an invalid partition state but still
return an error?
Regards,
Longman
On Wed, Dec 15, 2021 at 11:29:41AM -0500, Waiman Long <[email protected]> wrote:
> There are additional checks for the member to partition transition which
> requires that the target cpuset shouldn't have child cpuset.
Ah, I forgot the transition condition no. 4 will apply here. Clear.
So, currently full bottom up + top down walk is needed in (rare?) case
the switch from root partition to member and back.
> That prevents the recovering of a invalid partition root under a
> member cpuset. We could certainly remove that restriction by adding
> additional code as well as additional tests to verify it works. I
> haven't done that simply to avoid adding more complexity to the
> current code.
I agree this restriction can be lifted later independently when the rest
settles. (It's not so different from controllers disabling on the
unified hierarchy after all.)
Thanks,
Michal
The commit 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets
restrict parent in default hierarchy") inteded to relax the check only
on the default hierarchy (or v2 mode) but it dropped the check in v1
too.
This patch returns and separates the legacy-only validations so that
they can be considered only in the v1 mode, which should enforce the old
constraints for the sake of compatibility.
Fixes: 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
Suggested-by: Waiman Long <[email protected]>
Signed-off-by: Michal Koutný <[email protected]>
---
kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++++++++++----------
1 file changed, 40 insertions(+), 12 deletions(-)
This is formatted as a separate patch fixing the already queued change in
for-5.17 but it can be eventually squashed into the referenced commit AFAIAC.
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 0dd7d853ed17..ce6929ddc0b0 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -590,6 +590,35 @@ static inline void free_cpuset(struct cpuset *cs)
kfree(cs);
}
+/*
+ * validate_change_legacy() - Validate conditions specific to legacy (v1)
+ * behavior.
+ */
+static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *c, *par;
+ int ret;
+
+ WARN_ON_ONCE(!rcu_read_lock_held());
+
+ /* Each of our child cpusets must be a subset of us */
+ ret = -EBUSY;
+ cpuset_for_each_child(c, css, cur)
+ if (!is_cpuset_subset(c, trial))
+ goto out;
+
+ /* On legacy hierarchy, we must be a subset of our parent cpuset. */
+ ret = -EACCES;
+ par = parent_cs(cur);
+ if (par && !is_cpuset_subset(trial, par))
+ goto out;
+
+ ret = 0;
+out:
+ return ret;
+}
+
/*
* validate_change() - Used to validate that any proposed cpuset change
* follows the structural rules for cpusets.
@@ -614,20 +643,21 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
{
struct cgroup_subsys_state *css;
struct cpuset *c, *par;
- int ret;
-
- /* The checks don't apply to root cpuset */
- if (cur == &top_cpuset)
- return 0;
+ int ret = 0;
rcu_read_lock();
- par = parent_cs(cur);
- /* On legacy hierarchy, we must be a subset of our parent cpuset. */
- ret = -EACCES;
- if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
+ if (!is_in_v2_mode())
+ ret = validate_change_legacy(cur, trial);
+ if (ret)
+ goto out;
+
+ /* Remaining checks don't apply to root cpuset */
+ if (cur == &top_cpuset)
goto out;
+ par = parent_cs(cur);
+
/*
* If either I or some sibling (!= me) is exclusive, we can't
* overlap
@@ -1175,9 +1205,7 @@ enum subparts_cmd {
*
* Because of the implicit cpu exclusive nature of a partition root,
* cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change(). The validate_change()
- * function will also prevent any changes to the cpu list if it is not
- * a superset of children's cpu lists.
+ * permitted when checked by validate_change().
*/
static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
struct cpumask *newmask,
--
2.33.1
On 12/17/21 10:48, Michal Koutný wrote:
> The commit 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets
> restrict parent in default hierarchy") inteded to relax the check only
> on the default hierarchy (or v2 mode) but it dropped the check in v1
> too.
>
> This patch returns and separates the legacy-only validations so that
> they can be considered only in the v1 mode, which should enforce the old
> constraints for the sake of compatibility.
>
> Fixes: 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
> Suggested-by: Waiman Long <[email protected]>
> Signed-off-by: Michal Koutný <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 52 ++++++++++++++++++++++++++++++++----------
> 1 file changed, 40 insertions(+), 12 deletions(-)
>
> This is formatted as a separate patch fixing the already queued change in
> for-5.17 but it can be eventually squashed into the referenced commit AFAIAC.
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index 0dd7d853ed17..ce6929ddc0b0 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -590,6 +590,35 @@ static inline void free_cpuset(struct cpuset *cs)
> kfree(cs);
> }
>
> +/*
> + * validate_change_legacy() - Validate conditions specific to legacy (v1)
> + * behavior.
> + */
> +static int validate_change_legacy(struct cpuset *cur, struct cpuset *trial)
> +{
> + struct cgroup_subsys_state *css;
> + struct cpuset *c, *par;
> + int ret;
> +
> + WARN_ON_ONCE(!rcu_read_lock_held());
> +
> + /* Each of our child cpusets must be a subset of us */
> + ret = -EBUSY;
> + cpuset_for_each_child(c, css, cur)
> + if (!is_cpuset_subset(c, trial))
> + goto out;
> +
> + /* On legacy hierarchy, we must be a subset of our parent cpuset. */
> + ret = -EACCES;
> + par = parent_cs(cur);
> + if (par && !is_cpuset_subset(trial, par))
> + goto out;
> +
> + ret = 0;
> +out:
> + return ret;
> +}
> +
> /*
> * validate_change() - Used to validate that any proposed cpuset change
> * follows the structural rules for cpusets.
> @@ -614,20 +643,21 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
> {
> struct cgroup_subsys_state *css;
> struct cpuset *c, *par;
> - int ret;
> -
> - /* The checks don't apply to root cpuset */
> - if (cur == &top_cpuset)
> - return 0;
> + int ret = 0;
>
> rcu_read_lock();
> - par = parent_cs(cur);
>
> - /* On legacy hierarchy, we must be a subset of our parent cpuset. */
> - ret = -EACCES;
> - if (!is_in_v2_mode() && !is_cpuset_subset(trial, par))
> + if (!is_in_v2_mode())
> + ret = validate_change_legacy(cur, trial);
> + if (ret)
> + goto out;
> +
> + /* Remaining checks don't apply to root cpuset */
> + if (cur == &top_cpuset)
> goto out;
>
> + par = parent_cs(cur);
> +
> /*
> * If either I or some sibling (!= me) is exclusive, we can't
> * overlap
> @@ -1175,9 +1205,7 @@ enum subparts_cmd {
> *
> * Because of the implicit cpu exclusive nature of a partition root,
> * cpumask changes that violates the cpu exclusivity rule will not be
> - * permitted when checked by validate_change(). The validate_change()
> - * function will also prevent any changes to the cpu list if it is not
> - * a superset of children's cpu lists.
> + * permitted when checked by validate_change().
> */
> static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
> struct cpumask *newmask,
Thanks for addressing this issue.
Reviewed-by: Waiman Long <[email protected]>
On Sun, Dec 05, 2021 at 01:32:17PM -0500, Waiman Long wrote:
> Cpuset v1 uses the sched_load_balance control file to determine if load
> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
> as its use may require disabling load balancing at cgroup root.
>
> For workloads that require very low latency like DPDK, the latency
> jitters caused by periodic load balancing may exceed the desired
> latency limit.
>
> When cpuset v2 is in use, the only way to avoid this latency cost is to
> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> the kernel boot, however, there is no way to add or remove CPUs from
> this isolated set. For workloads that are more dynamic in nature, that
> means users have to provision enough CPUs for the worst case situation
> resulting in excess idle CPUs.
>
> To address this issue for cpuset v2, a new cpuset.cpus.partition type
> "isolated" is added which allows the creation of a cpuset partition
> without load balancing. This will allow system administrators to
> dynamically adjust the size of isolated partition to the current need
> of the workload without rebooting the system.
you can, ofcourse, create lots of 1 cpu partitions, which is effectively
what you're doing, except there was a problem with that which you also
forgot to mention.
On 1/12/22 10:21, Peter Zijlstra wrote:
> On Sun, Dec 05, 2021 at 01:32:17PM -0500, Waiman Long wrote:
>> Cpuset v1 uses the sched_load_balance control file to determine if load
>> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
>> as its use may require disabling load balancing at cgroup root.
>>
>> For workloads that require very low latency like DPDK, the latency
>> jitters caused by periodic load balancing may exceed the desired
>> latency limit.
>>
>> When cpuset v2 is in use, the only way to avoid this latency cost is to
>> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>> the kernel boot, however, there is no way to add or remove CPUs from
>> this isolated set. For workloads that are more dynamic in nature, that
>> means users have to provision enough CPUs for the worst case situation
>> resulting in excess idle CPUs.
>>
>> To address this issue for cpuset v2, a new cpuset.cpus.partition type
>> "isolated" is added which allows the creation of a cpuset partition
>> without load balancing. This will allow system administrators to
>> dynamically adjust the size of isolated partition to the current need
>> of the workload without rebooting the system.
> you can, ofcourse, create lots of 1 cpu partitions, which is effectively
> what you're doing, except there was a problem with that which you also
> forgot to mention.
Yes, that is a possible workaround. However, it makes cgroup management
much harder especially in the cgroup v2 environment where multiple
controllers are likely to be enabled in the same cgroup.
Cheers,
Longman
Hello,
On Wed, Dec 15, 2021 at 01:55:05PM -0500, Waiman Long wrote:
> How about we allow transition to an invalid partition state but still return
> an error?
Like -EAGAIN as Michal suggested? I don't know. I understand the motivation
but one problem is that error return usually means that the operation didn't
change the state of the system and that holds even for -EAGAIN. So, we'd be
trading one locally jarring thing (this thing is asynchrnous and the actual
state transitions should be monitored separately) to another one which is
jarring in a wider context (this thing returned error but the system state
changed anyway). To me, the latter seems noticeably worse given how common
the assumption that an error return indicate that nothing actually happened.
We have other cases where we split operation submissions and completions
(aios being the obvious one) but I don't think we have any where -EAGAIN
indicates successful initiation of an operation. At least I hope not.
Thanks.
--
tejun
On Wed, Jan 12, 2022 at 10:40:01AM -0500, Waiman Long wrote:
> Yes, that is a possible workaround. However, it makes cgroup management much
> harder especially in the cgroup v2 environment where multiple controllers
> are likely to be enabled in the same cgroup.
In most cases, cgroup2 shouldn't be a problem here given that controllers
can be enabled selectively and this 1-cpu cgroups will most likely be
leaves. But yeah, not super convenient.
Thanks.
--
tejun
On Fri, Dec 17, 2021 at 04:48:54PM +0100, Michal Koutn? wrote:
> The commit 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets
> restrict parent in default hierarchy") inteded to relax the check only
> on the default hierarchy (or v2 mode) but it dropped the check in v1
> too.
>
> This patch returns and separates the legacy-only validations so that
> they can be considered only in the v1 mode, which should enforce the old
> constraints for the sake of compatibility.
>
> Fixes: 1f1562fcd04a ("cgroup/cpuset: Don't let child cpusets restrict parent in default hierarchy")
> Suggested-by: Waiman Long <[email protected]>
> Signed-off-by: Michal Koutn? <[email protected]>
Applied to cgroup/for-5.17-fixes.
Thanks.
--
tejun