v11:
- Fix incorrect spacing in patch 7 and include documentation suggestions
by Michal.
- Move partition_is_populated() check to the last one in list of
conditions to be checked.
v10:
- Relax constraints for changes made to "cpuset.cpus"
and "cpuset.cpus.partition" as suggested. Now almost all changes
are allowed.
- Add patch 1 to signal that we may need to do additional work in
the future to relax the constraint that tasks' cpumask may need
some adjustment if child partitions are present.
- Add patch 2 for miscellaneous cleanups.
This patchset include the following enhancements to the cpuset v2
partition code.
1) Allow partitions that have no task to have empty effective cpus.
2) Relax the constraints on what changes are allowed in cpuset.cpus
and cpuset.cpus.partition. However, the partition remain invalid
until the constraints of a valid partition root is satisfied.
3) Add a new "isolated" partition type for partitions with no load
balancing which is available in v1 but not yet in v2.
4) Allow the reading of cpuset.cpus.partition to include a reason
string as to why the partition remain invalid.
In addition, the cgroup-v2.rst documentation file is updated and a self
test is added to verify the correctness the partition code.
The code diff from v10 is listed below.
Waiman Long (8):
cgroup/cpuset: Add top_cpuset check in update_tasks_cpumask()
cgroup/cpuset: Miscellaneous cleanups & add helper functions
cgroup/cpuset: Allow no-task partition to have empty
cpuset.cpus.effective
cgroup/cpuset: Relax constraints to partition & cpus changes
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Update description of cpuset.cpus.partition in
cgroup-v2.rst
kselftest/cgroup: Add cpuset v2 partition root state test
Documentation/admin-guide/cgroup-v2.rst | 149 ++--
kernel/cgroup/cpuset.c | 718 +++++++++++-------
tools/testing/selftests/cgroup/Makefile | 5 +-
.../selftests/cgroup/test_cpuset_prs.sh | 674 ++++++++++++++++
tools/testing/selftests/cgroup/wait_inotify.c | 87 +++
5 files changed, 1304 insertions(+), 329 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c
--
diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 94e1e3771830..9184a09e0fc9 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2130,10 +2130,10 @@ Cpuset Interface Files
CPUs should be carefully distributed and bound to each of the
individual CPUs for optimal performance.
- The value shown in "cpuset.cpus.effective" of a partition root is
- the CPUs that the parent partition root can dedicate to the new
- partition root. They are subtracted from "cpuset.cpus.effective"
- of the parent and may be different from "cpuset.cpus"
+ The value shown in "cpuset.cpus.effective" of a partition root
+ is the CPUs that the partition root can dedicate to a potential
+ new child partition root. The new child subtracts available
+ CPUs from its parent "cpuset.cpus.effective".
A partition root ("root" or "isolated") can be in one of the
two possible states - valid or invalid. An invalid partition
@@ -2165,24 +2165,28 @@ Cpuset Interface Files
2) The parent cgroup is a valid partition root.
3) The "cpuset.cpus" is not empty and must contain at least
one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
- 4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
- and cannot be empty unless there is no task associated with
- this partition.
+ 4) The "cpuset.cpus.effective" must be a subset of "cpuset.cpus"
+ and cannot be empty unless there is no task associated with
+ this partition.
External events like hotplug or changes to "cpuset.cpus" can
cause a valid partition root to become invalid and vice versa.
Note that a task cannot be moved to a cgroup with empty
"cpuset.cpus.effective".
- For a valid partition root or an invalid partition root with
- the exclusivity rule enabled, changes made to "cpuset.cpus"
- that violate the exclusivity rule will not be allowed.
+ For a valid partition root or an invalid partition root with
+ the exclusivity rule enabled, changes made to "cpuset.cpus"
+ that violate the exclusivity rule will not be allowed.
A valid non-root parent partition may distribute out all its CPUs
to its child partitions when there is no task associated with it.
- Care must be taken to change a valid partition root to "member"
- as all its child partitions, if present, will become invalid.
+ Care must be taken to change a valid partition root to
+ "member" as all its child partitions, if present, will become
+ invalid causing disruption to tasks running in those child
+ partitions. These inactivated partitions could be recovered if
+ their parent is switched back to a partition root with a proper
+ set of "cpuset.cpus".
Poll and inotify events are triggered whenever the state of
"cpuset.cpus.partition" changes. That includes changes caused
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 90ee0e4d8d7e..261974f5bb3c 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1283,9 +1283,12 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
* invalid to valid violates the exclusivity rule.
*
* The partcmd_enable and partcmd_disable commands are used by
- * update_prstate(). The partcmd_update command is used by
- * update_cpumasks_hier() with newmask NULL and update_cpumask() with
- * newmask set.
+ * update_prstate(). An error code may be returned and the caller will check
+ * for error.
+ *
+ * The partcmd_update command is used by update_cpumasks_hier() with newmask
+ * NULL and update_cpumask() with newmask set. The callers won't check for
+ * error and so partition_root_state and prs_error will be updated directly.
*/
static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
struct cpumask *newmask,
@@ -1326,8 +1329,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
* A parent can be left with no CPU as long as there is no
* task directly associated with the parent partition.
*/
- if (partition_is_populated(parent, cs) &&
- !cpumask_intersects(cs->cpus_allowed, parent->effective_cpus))
+ if (!cpumask_intersects(cs->cpus_allowed, parent->effective_cpus) &&
+ partition_is_populated(parent, cs))
return PERR_NOCPUS;
cpumask_copy(tmp->addmask, cs->cpus_allowed);
@@ -1361,9 +1364,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
* Make partition invalid if parent's effective_cpus could
* become empty and there are tasks in the parent.
*/
- if (adding && partition_is_populated(parent, cs) &&
+ if (adding &&
cpumask_subset(parent->effective_cpus, tmp->addmask) &&
- !cpumask_intersects(tmp->delmask, cpu_active_mask)) {
+ !cpumask_intersects(tmp->delmask, cpu_active_mask) &&
+ partition_is_populated(parent, cs)) {
part_error = PERR_NOCPUS;
adding = false;
deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
@@ -1749,13 +1753,13 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
/*
* Make sure that subparts_cpus, if not empty, is a subset of
- * cpus_allowed. Clear subparts_cpus if there is an error or
+ * cpus_allowed. Clear subparts_cpus if partition not valid or
* empty effective cpus with tasks.
*/
if (cs->nr_subparts_cpus) {
- if (cs->prs_err ||
- (partition_is_populated(cs, NULL) &&
- cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus))) {
+ if (!is_partition_valid(cs) ||
+ (cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus) &&
+ partition_is_populated(cs, NULL))) {
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
} else {
Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.
To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Therefore, a parent
partition with no task can now have all its CPUs distributed out to its
child partitions. The top cpuset always have some house-keeping tasks
running and so its list of effective cpu can't be empty.
Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 112 +++++++++++++++++++++++++++++++----------
1 file changed, 85 insertions(+), 27 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d156a39d7a08..6c65bcf278cb 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -412,6 +412,41 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}
+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+ struct cpuset *excluded_child)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *child;
+
+ if (cs->css.cgroup->nr_populated_csets)
+ return true;
+ if (!excluded_child && !cs->nr_subparts_cpus)
+ return cgroup_is_populated(cs->css.cgroup);
+
+ rcu_read_lock();
+ cpuset_for_each_child(child, css, cs) {
+ if (child == excluded_child)
+ continue;
+ if (is_partition_valid(child))
+ continue;
+ if (cgroup_is_populated(child->css.cgroup)) {
+ rcu_read_unlock();
+ return true;
+ }
+ }
+ rcu_read_unlock();
+ return false;
+}
+
/*
* Return in pmask the portion of a task's cpusets's cpus_allowed that
* are online and are capable of running the task. If none are found,
@@ -1252,22 +1287,25 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
return -EBUSY;
- /*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus or at least one
- * CPU will be left after that.
- */
- if ((cmd == partcmd_enable) &&
- (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
- cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
- return -EINVAL;
-
- /*
- * A cpumask update cannot make parent's effective_cpus become empty.
- */
adding = deleting = false;
old_prs = new_prs = cs->partition_root_state;
if (cmd == partcmd_enable) {
+ /*
+ * Enabling partition root is not allowed if not all the CPUs
+ * can be granted from parent's effective_cpus.
+ */
+ if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
+ return -EINVAL;
+
+ /*
+ * A parent can be left with no CPU as long as there is no
+ * task directly associated with the parent partition. For
+ * such a parent, no new task can be moved into it.
+ */
+ if (cpumask_equal(cs->cpus_allowed, parent->effective_cpus) &&
+ partition_is_populated(parent, cs))
+ return -EINVAL;
+
cpumask_copy(tmp->addmask, cs->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
@@ -1289,10 +1327,12 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty.
+ * Return error if the new effective_cpus could become empty
+ * and there are tasks in the parent.
*/
if (adding &&
- cpumask_equal(parent->effective_cpus, tmp->addmask)) {
+ cpumask_equal(parent->effective_cpus, tmp->addmask) &&
+ partition_is_populated(parent, cs)) {
if (!deleting)
return -EINVAL;
/*
@@ -1317,8 +1357,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
*/
adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask,
- parent->effective_cpus);
+ part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+ partition_is_populated(parent, cs);
}
if (cmd == partcmd_update) {
@@ -1420,9 +1460,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
/*
* If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some CPUs.
+ * parent, which is guaranteed to have some CPUs unless
+ * it is a partition root that has explicitly distributed
+ * out all its CPUs.
*/
if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+ if (is_partition_valid(cp) &&
+ cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+ goto update_parent_subparts;
+
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
if (!cp->use_parent_ecpus) {
cp->use_parent_ecpus = true;
@@ -1444,6 +1490,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
}
+update_parent_subparts:
/*
* update_parent_subparts_cpumask() should have been called
* for cs already in update_cpumask(). We should also call
@@ -2249,6 +2296,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;
+ /*
+ * On default hierarchy, task cannot be moved to a cpuset with empty
+ * effective cpus.
+ */
+ if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+ goto out_unlock;
+
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->cpus_allowed);
if (ret)
@@ -3115,7 +3169,8 @@ hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated)
{
- if (cpumask_empty(new_cpus))
+ /* A partition root is allowed to have empty effective cpus */
+ if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems;
@@ -3184,10 +3239,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
/*
* In the unlikely event that a partition root has empty
- * effective_cpus or its parent becomes invalid, we have to
- * transition it to the invalid state.
+ * effective_cpus with tasks or its parent becomes invalid, we
+ * have to transition it to the invalid state.
*/
- if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
+ if (is_partition_valid(cs) &&
+ ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
is_partition_invalid(parent))) {
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
@@ -3198,13 +3254,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
}
/*
- * If the effective_cpus is empty because the child
- * partitions take away all the CPUs, we can keep
- * the current partition and let the child partitions
- * fight for available CPUs.
+ * Force the partition to become invalid if either one of
+ * the following conditions hold:
+ * 1) empty effective cpus but not valid empty partition.
+ * 2) parent is invalid or doesn't grant any cpus to child
+ * partitions.
*/
if (is_partition_invalid(parent) ||
- cpumask_empty(&new_cpus)) {
+ (cpumask_empty(&new_cpus) &&
+ partition_is_populated(cs, NULL))) {
int old_prs;
update_parent_subparts_cpumask(cs, partcmd_disable,
--
2.27.0
On 5/20/22 12:00, Sebastian Andrzej Siewior wrote:
> On 2022-05-10 11:34:05 [-0400], Waiman Long wrote:
>> v11:
>> - Fix incorrect spacing in patch 7 and include documentation suggestions
>> by Michal.
>> - Move partition_is_populated() check to the last one in list of
>> conditions to be checked.
> If I follow this correctly, then this is the latest version of the
> isolcpus= replacement with cgroup's cpusets, correct?
>
> Sebastian
It is just the beginning, there is still a lot of work to do before
isolcpus= can be completely replaced.
Cheers,
Longman
On 2022-05-10 11:34:05 [-0400], Waiman Long wrote:
> v11:
> - Fix incorrect spacing in patch 7 and include documentation suggestions
> by Michal.
> - Move partition_is_populated() check to the last one in list of
> conditions to be checked.
If I follow this correctly, then this is the latest version of the
isolcpus= replacement with cgroup's cpusets, correct?
Sebastian
On 2022-05-20 12:46:52 [-0400], Waiman Long wrote:
> On 5/20/22 12:00, Sebastian Andrzej Siewior wrote:
> > On 2022-05-10 11:34:05 [-0400], Waiman Long wrote:
> > > v11:
> > > - Fix incorrect spacing in patch 7 and include documentation suggestions
> > > by Michal.
> > > - Move partition_is_populated() check to the last one in list of
> > > conditions to be checked.
> > If I follow this correctly, then this is the latest version of the
> > isolcpus= replacement with cgroup's cpusets, correct?
> >
> > Sebastian
>
> It is just the beginning, there is still a lot of work to do before
> isolcpus= can be completely replaced.
Okay. Thanks.
> Cheers,
> Longman
Sebastian
Hello,
Sorry about the long delay.
On Tue, May 10, 2022 at 11:34:08AM -0400, Waiman Long wrote:
> Once a partition with empty "cpuset.cpus.effective" is formed, no
> new task can be moved into it until "cpuset.cpus.effective" becomes
> non-empty.
This is always true due to no-tasks-in-intermediate-cgroups requirement,
right?
Thanks.
--
tejun
On Sun, Jun 12, 2022 at 07:40:25AM -1000, Tejun Heo wrote:
> Hello,
>
> Sorry about the long delay.
>
> On Tue, May 10, 2022 at 11:34:08AM -0400, Waiman Long wrote:
> > Once a partition with empty "cpuset.cpus.effective" is formed, no
> > new task can be moved into it until "cpuset.cpus.effective" becomes
> > non-empty.
>
> This is always true due to no-tasks-in-intermediate-cgroups requirement,
> right?
or rather, I should have asked, why does this need an explicit check?
Thanks.
--
tejun
On 6/12/22 13:40, Tejun Heo wrote:
> Hello,
>
> Sorry about the long delay.
>
> On Tue, May 10, 2022 at 11:34:08AM -0400, Waiman Long wrote:
>> Once a partition with empty "cpuset.cpus.effective" is formed, no
>> new task can be moved into it until "cpuset.cpus.effective" becomes
>> non-empty.
> This is always true due to no-tasks-in-intermediate-cgroups requirement,
> right?
I seems to remember there are corner cases where a task can be moved to
an intermediate cgroup under circumstances. I need to dig further to
find out what it is.
Cheers,
Longman
On 6/12/22 13:41, Tejun Heo wrote:
> On Sun, Jun 12, 2022 at 07:40:25AM -1000, Tejun Heo wrote:
>> Hello,
>>
>> Sorry about the long delay.
>>
>> On Tue, May 10, 2022 at 11:34:08AM -0400, Waiman Long wrote:
>>> Once a partition with empty "cpuset.cpus.effective" is formed, no
>>> new task can be moved into it until "cpuset.cpus.effective" becomes
>>> non-empty.
>> This is always true due to no-tasks-in-intermediate-cgroups requirement,
>> right?
> or rather, I should have asked, why does this need an explicit check?
Without this patch, cpus.effective will never be empty. It just falls
back to its parent if it becomes empty. Now with an empty
cpus.effective, I am afraid that if a task is somehow moved to such a
cpuset, something bad may happen. So I add this check as a safeguard.
Cheers,
Longman
On 6/12/22 22:55, Tejun Heo wrote:
> On Sun, Jun 12, 2022 at 10:53:53PM -0400, Waiman Long wrote:
>> Without this patch, cpus.effective will never be empty. It just falls back
>> to its parent if it becomes empty. Now with an empty cpus.effective, I am
> Yeah, that part is fine.
>
>> afraid that if a task is somehow moved to such a cpuset, something bad may
>> happen. So I add this check as a safeguard.
> But how would that happen? A lot of other things would break too if that
> were to happen.
I will perform further check to see if this check is necessary.
Thanks,
Longman
On Sun, Jun 12, 2022 at 10:53:53PM -0400, Waiman Long wrote:
> Without this patch, cpus.effective will never be empty. It just falls back
> to its parent if it becomes empty. Now with an empty cpus.effective, I am
Yeah, that part is fine.
> afraid that if a task is somehow moved to such a cpuset, something bad may
> happen. So I add this check as a safeguard.
But how would that happen? A lot of other things would break too if that
were to happen.
Thanks.
--
tejun
On Sun, Jun 12, 2022 at 04:55:13PM -1000, Tejun Heo <[email protected]> wrote:
> But how would that happen? A lot of other things would break too if that
> were to happen.
cpuset is a threaded controller where the internal-node-constraint does
not hold. So the additional condition for cpuset migrations is IMO
warranted (and needed if there's no "fall up").
Michal
On 6/13/22 10:02, Michal Koutný wrote:
> On Sun, Jun 12, 2022 at 04:55:13PM -1000, Tejun Heo <[email protected]> wrote:
>> But how would that happen? A lot of other things would break too if that
>> were to happen.
> cpuset is a threaded controller where the internal-node-constraint does
> not hold. So the additional condition for cpuset migrations is IMO
> warranted (and needed if there's no "fall up").
Yes, you are right. cpuset is threaded and so it may have tasks even if
it is not the leaf node.
Thanks,
Longman
Hello,
On Mon, Jun 13, 2022 at 12:47:37PM -0400, Waiman Long wrote:
> On 6/13/22 10:02, Michal Koutn? wrote:
> > On Sun, Jun 12, 2022 at 04:55:13PM -1000, Tejun Heo <[email protected]> wrote:
> > > But how would that happen? A lot of other things would break too if that
> > > were to happen.
> > cpuset is a threaded controller where the internal-node-constraint does
> > not hold. So the additional condition for cpuset migrations is IMO
> > warranted (and needed if there's no "fall up").
>
> Yes, you are right. cpuset is threaded and so it may have tasks even if it
> is not the leaf node.
And we had this same exchange the last time. Can you please add a comment?
We might also already have had this exchange before too but is it necessary
to allow threaded cgroups to be isolated roots? The interaction between
being threaded and isolated is cleaner at that layer as it's interactions
between two explicit mode changes.
Thanks.
--
tejun