2021-06-21 18:51:11

by Waiman Long

Subject: [PATCH v2 0/6] cgroup/cpuset: Add new cpuset partition type & empty effective cpus

v2:
- Drop v1 patch 1.
- Break out some cosmetic changes into a separate patch (patch #1).
- Add a new patch to clarify the transition to invalid partition root
is mainly caused by hotplug events.
- Enhance the partition root state test including CPU online/offline
behavior and fix issues found by the test.

This patchset makes the following three major changes to the cpuset v2 code:

Patch 2: Clarify the use of invalid partition root and add new checks
to make sure that normal cpuset control file operations will not be
allowed to create an invalid partition root. It also fixes some of the
issues in the existing code.

Patch 3: Add a new partition state "isolated" to create a partition
root without load balancing. This is for handling intermittent workloads
that have a strict low latency requirement.

Patch 4: Allow a partition root that is not the top cpuset to distribute
all its cpus to child partitions as long as there is no task associated
with that partition root. This allows more flexibility for middleware
to manage multiple partitions.

Patch 5 updates the cgroup-v2.rst file accordingly. Patch 6 adds a new
cpuset test to test the new cpuset partition code.
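
As a rough illustration of what patches 3 and 4 allow, here is a minimal
sketch. The helper make_partition is hypothetical and not part of this
patchset. It only writes the two cpuset control files for one cgroup
directory. The commented commands assume cgroup v2 is mounted at
/sys/fs/cgroup and that no task runs in the "test" cgroup.

```shell
# Hypothetical helper (not part of the patchset): set cpuset.cpus and
# cpuset.cpus.partition for one cgroup directory.
#   $1 - cgroup directory
#   $2 - cpu list, e.g. 2-3
#   $3 - partition type: member, root or isolated
make_partition() {
    mkdir -p "$1" || return 1
    echo "$2" > "$1/cpuset.cpus" || return 1
    echo "$3" > "$1/cpuset.cpus.partition" || return 1
}

# Example, to be run as root (paths are assumptions):
#   cd /sys/fs/cgroup
#   echo +cpuset > cgroup.subtree_control
#   make_partition test 2-3 root
#   echo +cpuset > test/cgroup.subtree_control
#   make_partition test/A1 2-3 root      # or "isolated" (patch 3)
#   cat test/cpuset.cpus.effective       # empty once A1 owns CPUs 2-3
```

The selftest in patch 6 checks the same sequence: the parent partition
ends up with an empty cpuset.cpus.effective and refuses new tasks.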


Waiman Long (6):
cgroup/cpuset: Miscellaneous code cleanup
cgroup/cpuset: Clarify the use of invalid partition root
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Allow non-top parent partition root to distribute out
all CPUs
cgroup/cpuset: Update description of cpuset.cpus.partition in
cgroup-v2.rst
kselftest/cgroup: Add cpuset v2 partition root state test

Documentation/admin-guide/cgroup-v2.rst | 65 +-
kernel/cgroup/cpuset.c | 285 ++++++---
tools/testing/selftests/cgroup/Makefile | 2 +-
.../selftests/cgroup/test_cpuset_prs.sh | 558 ++++++++++++++++++
4 files changed, 794 insertions(+), 116 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh

--
2.18.1


2021-06-21 18:51:29

by Waiman Long

Subject: [PATCH v2 1/6] cgroup/cpuset: Miscellaneous code cleanup

Use more descriptive variable names for update_prstate(), remove
unnecessary code and fix some typos. There is no functional change.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 30 ++++++++++++++----------------
1 file changed, 14 insertions(+), 16 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index adb5190c4429..d4164e07c61b 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1114,7 +1114,7 @@ enum subparts_cmd {
* cpus_allowed can be granted or an error code will be returned.
*
* For partcmd_disable, the cpuset is being transofrmed from a partition
- * root back to a non-partition root. any CPUs in cpus_allowed that are in
+ * root back to a non-partition root. Any CPUs in cpus_allowed that are in
* parent's subparts_cpus will be taken away from that cpumask and put back
* into parent's effective_cpus. 0 should always be returned.
*
@@ -1225,7 +1225,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
/*
* partcmd_update w/o newmask:
*
- * addmask = cpus_allowed & parent->effectiveb_cpus
+ * addmask = cpus_allowed & parent->effective_cpus
*
* Note that parent's subparts_cpus may have been
* pre-shrunk in case there is a change in the cpu list.
@@ -1937,30 +1937,28 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,

/*
* update_prstate - update partititon_root_state
- * cs: the cpuset to update
- * val: 0 - disabled, 1 - enabled
+ * cs: the cpuset to update
+ * new_prs: new partition root state
*
* Call with cpuset_mutex held.
*/
-static int update_prstate(struct cpuset *cs, int val)
+static int update_prstate(struct cpuset *cs, int new_prs)
{
int err;
struct cpuset *parent = parent_cs(cs);
- struct tmpmasks tmp;
+ struct tmpmasks tmpmask;

- if ((val != 0) && (val != 1))
- return -EINVAL;
- if (val == cs->partition_root_state)
+ if (new_prs == cs->partition_root_state)
return 0;

/*
* Cannot force a partial or invalid partition root to a full
* partition root.
*/
- if (val && cs->partition_root_state)
+ if (new_prs && (cs->partition_root_state < 0))
return -EINVAL;

- if (alloc_cpumasks(NULL, &tmp))
+ if (alloc_cpumasks(NULL, &tmpmask))
return -ENOMEM;

err = -EINVAL;
@@ -1978,7 +1976,7 @@ static int update_prstate(struct cpuset *cs, int val)
goto out;

err = update_parent_subparts_cpumask(cs, partcmd_enable,
- NULL, &tmp);
+ NULL, &tmpmask);
if (err) {
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
goto out;
@@ -1997,7 +1995,7 @@ static int update_prstate(struct cpuset *cs, int val)
}

err = update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, &tmp);
+ NULL, &tmpmask);
if (err)
goto out;

@@ -2015,11 +2013,11 @@ static int update_prstate(struct cpuset *cs, int val)
update_tasks_cpumask(parent);

if (parent->child_ecpus_count)
- update_sibling_cpumasks(parent, cs, &tmp);
+ update_sibling_cpumasks(parent, cs, &tmpmask);

rebuild_sched_domains_locked();
out:
- free_cpumasks(NULL, &tmp);
+ free_cpumasks(NULL, &tmpmask);
return err;
}

@@ -3060,7 +3058,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
goto retry;
}

- parent = parent_cs(cs);
+ parent = parent_cs(cs);
compute_effective_cpumask(&new_cpus, cs, parent);
nodes_and(new_mems, cs->mems_allowed, parent->effective_mems);

--
2.18.1

2021-06-21 18:51:56

by Waiman Long

Subject: [PATCH v2 5/6] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Update Documentation/admin-guide/cgroup-v2.rst to describe the newly
introduced "isolated" cpuset partition type as well as the ability to
create a non-top cpuset partition with no cpu allocated to it.

Signed-off-by: Waiman Long <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 65 +++++++++++++++++--------
1 file changed, 44 insertions(+), 21 deletions(-)
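
Since an "isolated" partition has no load balancing, each task has to be
pinned by hand. A minimal sketch of one way to do that follows. The
helper expand_cpulist is hypothetical, and ./worker and the cgroup path
in the comments are placeholders.

```shell
# Hypothetical helper: expand a cpu list such as "2-3,5" into "2 3 5".
expand_cpulist() {
    out=""
    for range in $(echo "$1" | tr ',' ' '); do
        first=${range%-*}       # "2-3" -> 2, "5" -> 5
        last=${range#*-}        # "2-3" -> 3, "5" -> 5
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            out="$out${out:+ }$cpu"
            cpu=$((cpu + 1))
        done
    done
    echo "$out"
}

# Example, to be run as root (paths and ./worker are assumptions):
#   cd /sys/fs/cgroup/test
#   for cpu in $(expand_cpulist "$(cat cpuset.cpus.effective)"); do
#       ./worker &
#       echo $! > cgroup.procs       # move the task into the partition
#       taskset -pc "$cpu" $!        # then bind it to a single CPU
#   done
```

The task is moved into the cgroup before taskset is called because the
requested CPU has to be within the cpuset the task currently belongs to.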

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index b1e81aa8598a..cf40a7f499c0 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2010,8 +2010,9 @@ Cpuset Interface Files
It accepts only the following input values when written to.

======== ================================
- "root" a partition root
- "member" a non-root member of a partition
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
======== ================================

When set to be a partition root, the current cgroup is the
@@ -2020,6 +2021,11 @@ Cpuset Interface Files
partition roots themselves and their descendants. The root
cgroup is always a partition root.

+ With "isolated", the CPUs in that partition root will be in an
+ isolated state without any load balancing from the scheduler.
+ Tasks in such a partition must be explicitly bound to each
+ individual CPU.
+
There are constraints on where a partition root can be set.
It can only be set in a cgroup if all the following conditions
are true.
@@ -2038,12 +2044,25 @@ Cpuset Interface Files
file cannot be reverted back to "member" if there are any child
cgroups with cpuset enabled.

- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
+ A parent partition may distribute all its CPUs to its child
+ partitions as long as it is not the root cgroup and there is no
+ task directly associated with that parent partition. Otherwise,
+ there must be at least one cpu left in the parent partition.
+ A new task cannot be moved to a partition root with no effective
+ cpu.

Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
+ generally allowed as long as the first condition above is true.
+ Other constraints for this operation are as follows.
+
+ 1) Any newly added CPUs must be a subset of the parent's
+ "cpuset.cpus.effective".
+ 2) Taking away all the CPUs from the parent's "cpuset.cpus.effective"
+ is only allowed if there is no task associated with the
+ parent partition.
+ 3) Deletion of CPUs that have been distributed to child partition
+ roots is not allowed.
+
the change will not take away all the CPUs from the parent
partition and the new "cpuset.cpus" value is a superset of its
children's "cpuset.cpus" values.
@@ -2056,6 +2075,7 @@ Cpuset Interface Files
============== ==============================
"member" Non-root member of a partition
"root" Partition root
+ "isolated" Partition root without load balancing
"root invalid" Invalid partition root
============== ==============================

@@ -2063,21 +2083,24 @@ Cpuset Interface Files
above are true and at least one CPU from "cpuset.cpus" is
granted by the parent cgroup.

- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ A partition root becomes invalid if all the CPUs requested in
+ "cpuset.cpus" become unavailable. This can happen if all the
+ CPUs have been offlined, or the state of an ancestor partition
+ root becomes invalid. In this case, it is not a real partition
+ even though the restriction of the first partition root condition
+ above will still apply. The cpu affinity of all the tasks in
+ the cgroup will then be associated with CPUs in the nearest
+ ancestor partition. In the special case of a parent partition
+ competing with a child partition for the only CPU left, the
+ parent partition wins and the child partition becomes invalid.
+
+ An invalid partition root can be transitioned back to a real
+ partition root if at least one of the requested CPUs becomes
+ available again. In this case, the cpu affinity of all the
+ tasks in the formerly invalid partition will be associated to
+ the CPUs of the newly formed partition. Changing the partition
+ state of an invalid partition root to "member" is always allowed
+ even if child cpusets are present.


Device controller
--
2.18.1

2021-06-21 18:52:04

by Waiman Long

Subject: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

For cpuset partition, the special state of PRS_ERROR (invalid partition
root) was originally designed to handle hotplug events. In this state,
CPUs allocated to the partition root are released back to the parent
but the cpuset flags remain unchanged. However, certain manipulation
of cpuset control files could also cause a partition root to become
invalid though that was not the original intention.

Additional checks are now added to make sure that regular cpuset control
file manipulations are not allowed to make a partition root invalid. These
additional checks are:
1) A partition root can't be changed to member if it has child partition
roots.
2) Removing CPUs from cpuset.cpus that causes it to become invalid is
not allowed.

Comments are also added to clarify that a partition root becomes
invalid only when an external event like hotplug causes all the
CPUs allocated to the partition root to become unavailable.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 136 ++++++++++++++++++++++++-----------------
1 file changed, 79 insertions(+), 57 deletions(-)
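
The two checks added to the partcmd_update path can be modelled with
plain integers, one bit per CPU. This is only a sketch under that
assumption: check_cpus_update and its argument order are made up for
illustration, the kernel works on struct cpumask, and the other existing
checks in update_parent_subparts_cpumask() are left out.

```shell
# Model of the two new checks on a cpuset.cpus update of a partition root.
# Every mask is an integer where bit N set means CPU N is in the mask.
#   $1 newmask                  $2 cpus_allowed
#   $3 parent effective_cpus    $4 parent subparts_cpus
#   $5 own subparts_cpus        $6 own effective_cpus
# Prints OK, EINVAL or EBUSY.
check_cpus_update() {
    new=$(( $1 )); allowed=$(( $2 )); p_eff=$(( $3 ))
    p_sub=$(( $4 )); sub=$(( $5 )); eff=$(( $6 ))

    # newmask must be a subset of (cpus_allowed | parent->effective_cpus)
    if [ $(( new & ~(allowed | p_eff) )) -ne 0 ]; then
        echo EINVAL
        return
    fi

    # delmask = cpus_allowed & ~newmask & parent->subparts_cpus
    del=$(( allowed & ~new & p_sub ))

    # Reject deleting CPUs that were handed to child partitions, and
    # reject deleting every effective CPU of this partition root.
    if [ "$del" -ne 0 ] &&
       { [ $(( del & sub )) -ne 0 ] || [ "$del" -eq "$eff" ]; }; then
        echo EBUSY
        return
    fi
    echo OK
}

# A1 owns CPUs 0-3 (15), child A2 took 2-3 (12), so A1 keeps 0-1 (3).
# The parent has 4-6 left (112). Shrinking A1 to 0-2 (7) drops CPU 3:
#   check_cpus_update 7 15 112 15 12 3
```

The last example corresponds to the "C0-3:P1:S+ C2-3:P1 ... C0-2" failure
case in the selftest of patch 6.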

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index d4164e07c61b..3fe68d0f593d 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -174,7 +174,9 @@ struct cpuset {
* subparts_cpus. In this case, the cpuset is not a real partition
* root anymore. However, the CPU_EXCLUSIVE bit will still be set
* and the cpuset can be restored back to a partition root if the
- * parent cpuset can give more CPUs back to this child cpuset.
+ * parent cpuset can give more CPUs back to this child cpuset. A
+ * partition root becomes invalid when all its cpus become unavailable
+ * like being offlined.
*/
#define PRS_DISABLED 0
#define PRS_ENABLED 1
@@ -1193,6 +1195,15 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
/*
* partcmd_update with newmask:
*
+ * Return error if newmask isn't a subset of
+ * (cpus_allowed | parent->effective_cpus).
+ */
+ cpumask_or(tmp->addmask, cpuset->cpus_allowed,
+ parent->effective_cpus);
+ if (!cpumask_subset(newmask, tmp->addmask))
+ return -EINVAL;
+
+ /*
* delmask = cpus_allowed & ~newmask & parent->subparts_cpus
* addmask = newmask & parent->effective_cpus
* & ~parent->subparts_cpus
@@ -1205,7 +1216,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty.
+ * Return error if parent's effective_cpus could become empty.
*/
if (adding &&
cpumask_equal(parent->effective_cpus, tmp->addmask)) {
@@ -1221,20 +1232,35 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
return -EINVAL;
cpumask_copy(tmp->addmask, parent->effective_cpus);
}
+
+ /*
+ * Return error if effective_cpus becomes empty or any CPU
+ * distributed to child partitions is deleted.
+ */
+ if (deleting &&
+ (cpumask_intersects(tmp->delmask, cpuset->subparts_cpus) ||
+ cpumask_equal(tmp->delmask, cpuset->effective_cpus)))
+ return -EBUSY;
} else {
/*
* partcmd_update w/o newmask:
*
* addmask = cpus_allowed & parent->effective_cpus
*
+ * This gets invoked either due to a hotplug event or
+ * from update_cpumasks_hier() where we can't return an
+ * error. This can cause a partition root to become invalid
+ * in the case of a hotplug.
+ *
* Note that parent's subparts_cpus may have been
* pre-shrunk in case there is a change in the cpu list.
* So no deletion is needed.
*/
adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask,
- parent->effective_cpus);
+ part_error = (is_partition_root(cpuset) &&
+ !parent->nr_subparts_cpus) ||
+ cpumask_equal(tmp->addmask, parent->effective_cpus);
}

if (cmd == partcmd_update) {
@@ -1392,10 +1418,6 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
* When parent is invalid, it has to be too.
*/
cp->partition_root_state = PRS_ERROR;
- if (cp->nr_subparts_cpus) {
- cp->nr_subparts_cpus = 0;
- cpumask_clear(cp->subparts_cpus);
- }
break;
}
}
@@ -1406,38 +1428,32 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)

spin_lock_irq(&callback_lock);

- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
if (cp->nr_subparts_cpus &&
(cp->partition_root_state != PRS_ENABLED)) {
+ /*
+ * Put all active subparts_cpus back to effective_cpus.
+ */
+ cpumask_or(tmp->new_cpus, tmp->new_cpus,
+ cp->subparts_cpus);
+ cpumask_and(tmp->new_cpus, tmp->new_cpus,
+ cpu_active_mask);
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
- } else if (cp->nr_subparts_cpus) {
+ }
+
+ cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+ if (cp->nr_subparts_cpus) {
/*
* Make sure that effective_cpus & subparts_cpus
- * are mutually exclusive.
- *
- * In the unlikely event that effective_cpus
- * becomes empty. we clear cp->nr_subparts_cpus and
- * let its child partition roots to compete for
- * CPUs again.
+ * of a partition root are mutually exclusive.
*/
cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
cp->subparts_cpus);
- if (cpumask_empty(cp->effective_cpus)) {
- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
- cpumask_clear(cp->subparts_cpus);
- cp->nr_subparts_cpus = 0;
- } else if (!cpumask_subset(cp->subparts_cpus,
- tmp->new_cpus)) {
- cpumask_andnot(cp->subparts_cpus,
- cp->subparts_cpus, tmp->new_cpus);
- cp->nr_subparts_cpus
- = cpumask_weight(cp->subparts_cpus);
- }
+ WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
}
spin_unlock_irq(&callback_lock);

- WARN_ON(!is_in_v2_mode() &&
+ WARN_ON_ONCE(!is_in_v2_mode() &&
!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));

update_tasks_cpumask(cp);
@@ -1560,8 +1576,8 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
* Make sure that subparts_cpus is a subset of cpus_allowed.
*/
if (cs->nr_subparts_cpus) {
- cpumask_andnot(cs->subparts_cpus, cs->subparts_cpus,
- cs->cpus_allowed);
+ cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+ cs->cpus_allowed);
cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
}
spin_unlock_irq(&callback_lock);
@@ -1984,21 +2000,26 @@ static int update_prstate(struct cpuset *cs, int new_prs)
cs->partition_root_state = PRS_ENABLED;
} else {
/*
- * Turning off partition root will clear the
- * CS_CPU_EXCLUSIVE bit.
+ * Switch back to member is always allowed if PRS_ERROR.
*/
if (cs->partition_root_state == PRS_ERROR) {
- cs->partition_root_state = 0;
- update_flag(CS_CPU_EXCLUSIVE, cs, 0);
err = 0;
- goto out;
+ goto reset_flags;
}

+ /*
+ * A partition root cannot be reverted to member if some
+ * CPUs have been distributed to child partition roots.
+ */
+ if (!cpumask_empty(cs->subparts_cpus))
+ return -EBUSY;
+
err = update_parent_subparts_cpumask(cs, partcmd_disable,
NULL, &tmpmask);
if (err)
goto out;

+reset_flags:
cs->partition_root_state = 0;

/* Turning off CS_CPU_EXCLUSIVE will not return error */
@@ -3074,41 +3095,42 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)

/*
* In the unlikely event that a partition root has empty
- * effective_cpus or its parent becomes erroneous, we have to
- * transition it to the erroneous state.
+ * effective_cpus, we will have to force any child partitions,
+ * if present, to become invalid by setting nr_subparts_cpus to 0
+ * without causing itself to become invalid.
+ */
+ if (is_partition_root(cs) && cs->nr_subparts_cpus &&
+ cpumask_empty(&new_cpus)) {
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ compute_effective_cpumask(&new_cpus, cs, parent);
+ }
+
+ /*
+ * If empty effective_cpus or zero nr_subparts_cpus or its parent
+ * becomes erroneous, we have to transition it to the erroneous state.
*/
if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
- (parent->partition_root_state == PRS_ERROR))) {
+ (parent->partition_root_state == PRS_ERROR) ||
+ !parent->nr_subparts_cpus)) {
+ update_parent_subparts_cpumask(cs, partcmd_disable,
+ NULL, tmp);
if (cs->nr_subparts_cpus) {
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
compute_effective_cpumask(&new_cpus, cs, parent);
}
-
- /*
- * If the effective_cpus is empty because the child
- * partitions take away all the CPUs, we can keep
- * the current partition and let the child partitions
- * fight for available CPUs.
- */
- if ((parent->partition_root_state == PRS_ERROR) ||
- cpumask_empty(&new_cpus)) {
- update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, tmp);
- cs->partition_root_state = PRS_ERROR;
- }
+ cs->partition_root_state = PRS_ERROR;
cpuset_force_rebuild();
}

/*
* On the other hand, an erroneous partition root may be transitioned
- * back to a regular one or a partition root with no CPU allocated
- * from the parent may change to erroneous.
+ * back to a regular one.
*/
- if (is_partition_root(parent) &&
- ((cs->partition_root_state == PRS_ERROR) ||
- !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
- update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
+ else if (is_partition_root(parent) &&
+ (cs->partition_root_state == PRS_ERROR) &&
+ update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
cpuset_force_rebuild();

update_tasks:
--
2.18.1

2021-06-21 18:52:15

by Waiman Long

Subject: [PATCH v2 6/6] kselftest/cgroup: Add cpuset v2 partition root state test

Add a test script for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <[email protected]>
---
tools/testing/selftests/cgroup/Makefile | 2 +-
.../selftests/cgroup/test_cpuset_prs.sh | 558 ++++++++++++++++++
2 files changed, 559 insertions(+), 1 deletion(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
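
The test matrix encodes each cgroup's actions as a colon-separated
string such as "C2-3:P1:S+". A small decoder, written here only to
illustrate the notation, may help when reading the matrix. The function
describe_state is hypothetical and is not part of the script. It prints
what each command means and writes nothing to any cgroup file.

```shell
# Hypothetical decoder for one state string of the test matrix.
#   $1 - state string, e.g. "C2-3:P1:S+" or "T:O2-0"
describe_state() {
    for cmd in $(echo "$1" | tr ':' ' '); do
        arg=${cmd#?}            # drop the one-letter command prefix
        case $cmd in
        S*) echo "write ${arg}cpuset to cgroup.subtree_control" ;;
        C*) echo "write $arg to cpuset.cpus" ;;
        P0) echo "write member to cpuset.cpus.partition" ;;
        P1) echo "write root to cpuset.cpus.partition" ;;
        P2) echo "write isolated to cpuset.cpus.partition" ;;
        O*) echo "write ${arg#*-} to the online file of cpu ${arg%-*}" ;;
        T)  echo "write 0 to cgroup.procs" ;;
        .)  ;;                  # "." means no change for this cgroup
        *)  echo "unknown command: $cmd"; return 1 ;;
        esac
    done
}

# Example:
#   describe_state "T:O2-0"
```

Commands within one string are applied left to right, which is how
set_ctrl_state() in the script processes them.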

diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index f027d933595b..e8ff3ffc3a43 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -4,7 +4,7 @@ CFLAGS += -Wall -pthread
all:

TEST_FILES := with_stress.sh
-TEST_PROGS := test_stress.sh
+TEST_PROGS := test_stress.sh test_cpuset_prs.sh
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..0e7839d37325
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,558 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+ echo "$1"
+ echo "Test SKIPPED"
+ exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag
+VERBOSE=
+[[ "$1" = -v ]] && VERBOSE=1
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+console_msg()
+{
+ MSG=$1
+ echo "$MSG"
+ echo "" > /dev/console
+ echo "$MSG" > /dev/console
+ sleep 0.01
+}
+
+test_partition()
+{
+ EXPECTED_VAL=$1
+ echo $EXPECTED_VAL > cpuset.cpus.partition
+ [[ $? -eq 0 ]] || exit 1
+ ACTUAL_VAL=$(cat cpuset.cpus.partition)
+ [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+ echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $ACTUAL_VAL"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+test_effective_cpus()
+{
+ EXPECTED_VAL=$1
+ ACTUAL_VAL=$(cat cpuset.cpus.effective)
+ [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+ echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$ACTUAL_VAL'"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+ OUTSTR="$1"
+ ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+ echo $ERRMSG | grep -q "$OUTSTR"
+ [[ $? -ne 0 ]] && {
+ echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+ echo "Test FAILED"
+ exit 1
+ }
+ echo $$ > $CGROUP2/cgroup.procs # Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+ echo 2-3 > cpuset.cpus
+ TYPE=$(cat cpuset.cpus.partition)
+ [[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+ console_msg "Change from member to root"
+ test_partition root
+
+ console_msg "Change from root to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to member"
+ test_partition member
+
+ console_msg "Change from member to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to root"
+ test_partition root
+
+ console_msg "Change from root to member"
+ test_partition member
+
+ #
+ # Testing partition root with no cpu
+ #
+ console_msg "Distribute all cpus to child partition"
+ echo +cpuset > cgroup.subtree_control
+ test_partition root
+
+ mkdir A1
+ cd A1
+ echo 2-3 > cpuset.cpus
+ test_partition root
+ test_effective_cpus 2-3
+ cd ..
+ test_effective_cpus ""
+
+ console_msg "Moving task to partition test"
+ test_add_proc "No space left"
+ cd A1
+ test_add_proc ""
+ cd ..
+
+ console_msg "Shrink and expand child partition"
+ cd A1
+ echo 2 > cpuset.cpus
+ cd ..
+ test_effective_cpus 3
+ cd A1
+ echo 2-3 > cpuset.cpus
+ cd ..
+ test_effective_cpus ""
+
+ # Cleaning up
+ console_msg "Cleaning up"
+ echo $$ > $CGROUP2/cgroup.procs
+ [[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+# \- B1
+#
+# P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+# C<l> = add cpu-list
+# S<p> = use prefix in subtree_control
+# T = put a task into cgroup
+# O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1"
+ " S+ C0-1 . . C2-3 P1 . . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 "
+ " S+ C0-1:S+ . . C2-3 . . . P1 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 "
+ " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5"
+ " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5"
+ " S+ C0-1 . . C2-3:P1 . . . C2 0 "
+ " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5"
+ " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0"
+ " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1"
+ " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # CPU offlining cases:
+ " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ # Failure cases:
+
+ # To become a partition root, cpuset.cpus must be a subset of
+ # parent's cpuset.cpus.effective.
+ " S+ C0-1 . . C2-3 S+ C4-5:P1 . . 1 "
+
+ # A cpuset cannot become a partition root if it has child cpusets
+ # with non-empty cpuset.cpus.
+ " S+ C0-1:S+ C1 . C2-3 P1 . . . 1 "
+
+ # Any change to cpuset.cpus of a partition root must be exclusive.
+ " S+ C0-1:P1 . . C2-3 C0-2 . . . 1 "
+ " S+ C0-1 . . C2-3:P1 . . . C1 1 "
+ " S+ C2-3:P1:S+ C2:P1 . C1 C1-3 . . . 1 "
+
+ # Deletion of CPUs distributed to child partition root is not allowed.
+ " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 1 "
+ " S+ C0-3:P1:S+ C2-3:P1 . . C0-2 . . . 1 "
+
+ # Adding CPUs to partition root that are not in parent's
+ # cpuset.cpus.effective is not allowed.
+ " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 1 "
+
+ # Taking away all CPUs from parent or itself is not allowed if there are tasks.
+ " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 1 A1:2,A2:3"
+ " S+ C1-3:P1:S+ C2-3:P1:S+
+ C3:P1 . T:C2-3 . . . 1 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # A partition root cannot change to member if it has child partition.
+ " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 1 "
+ " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 1 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # A task cannot be added to a partition with no cpu
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 T . . 1 A1:2,A2: A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+# $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+ CPU=${1%-*}
+ VAL=${1#*-}
+ CPUFILE=/sys/devices/system/cpu/cpu${CPU}/online
+ if [[ $VAL -eq 0 ]]
+ then
+ OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+ else
+ [[ -n "$OFFLINE_CPUS" ]] && {
+ OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+ sort | uniq -u)
+ }
+ fi
+ echo $VAL > $CPUFILE
+ sleep 0.01
+}
+
+#
+# Set controller state
+# $1 - cgroup directory
+# $2 - state
+# $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+ TMPMSG=/tmp/.msg_$$
+ CGRP=$1
+ STATE=$2
+ SHOWERR=${3}${VERBOSE}
+ CTRL=${CTRL:=$CONTROLLER}
+ HASERR=0
+ REDIRECT="2> $TMPMSG"
+ [[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+ rm -f $TMPMSG
+ for CMD in $(echo $STATE | sed -e "s/:/ /g")
+ do
+ TFILE=$CGRP/cgroup.procs
+ SFILE=$CGRP/cgroup.subtree_control
+ PFILE=$CGRP/cpuset.cpus.partition
+ CFILE=$CGRP/cpuset.cpus
+ S=$(expr substr $CMD 1 1)
+ if [[ $S = S ]]
+ then
+ PREFIX=${CMD#?}
+ COMM="echo ${PREFIX}${CTRL} > $SFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = C ]]
+ then
+ CPUS=${CMD#?}
+ COMM="echo $CPUS > $CFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = P ]]
+ then
+ VAL=${CMD#?}
+ case $VAL in
+ 0) VAL=member
+ ;;
+ 1) VAL=root
+ ;;
+ 2) VAL=isolated
+ ;;
+ *)
+ echo "Invalid partition state - $VAL"
+ exit 1
+ ;;
+ esac
+ COMM="echo $VAL > $PFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = O ]]
+ then
+ VAL=${CMD#?}
+ write_cpu_online $VAL
+ elif [[ $S = T ]]
+ then
+ COMM="echo 0 > $TFILE"
+ eval $COMM $REDIRECT
+ fi
+ RET=$?
+ [[ $RET -ne 0 ]] && {
+ [[ -n "$SHOWERR" ]] && {
+ echo "$COMM"
+ cat $TMPMSG
+ }
+ HASERR=1
+ }
+ sleep 0.01
+ rm -f $TMPMSG
+ done
+ return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+ CGRP=$1
+ STATE=$2
+ [[ -d $CGRP ]] || mkdir $CGRP
+ set_ctrl_state $CGRP $STATE 1
+ [[ $? -ne 0 ]] && {
+ echo "ERROR: Failed to set $2 to cgroup $1!"
+ exit 1
+ }
+}
+
+online_cpus()
+{
+ [[ -n "$OFFLINE_CPUS" ]] && {
+ for C in $OFFLINE_CPUS
+ do
+ write_cpu_online ${C}-1
+ done
+ }
+}
+
+#
+# Move the task out, bring offlined cpus back online and remove test cgroups.
+#
+reset_cgroup_states()
+{
+ echo 0 > $CGROUP2/cgroup.procs
+ online_cpus
+ rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+ set_ctrl_state . S-
+ sleep 0.005 # 5ms artificial delay to complete the deletion
+}
+
+dump_states()
+{
+ for DIR in A1 A1/A2 A1/A2/A3 B1
+ do
+ ECPUS=$DIR/cpuset.cpus.effective
+ PRS=$DIR/cpuset.cpus.partition
+ [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+ [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)"
+ done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ CPUS=$2
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+ FILE=$CGRP/cpuset.cpus.effective
+ [[ -e $FILE ]] || return 1
+ [[ $CPUS = $(cat $FILE) ]] || return 1
+ done
+}
+
+#
+# Check cgroup states
+# $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ STATE=$2
+ FILE=
+ EVAL=$(expr substr $STATE 2 2)
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+ case $STATE in
+ P*) FILE=$CGRP/cpuset.cpus.partition
+ ;;
+ *) echo "Unknown state: $STATE!"
+ exit 1
+ ;;
+ esac
+ VAL=$(cat $FILE)
+
+ case "$VAL" in
+ member) VAL=0
+ ;;
+ root) VAL=1
+ ;;
+ isolated)
+ VAL=2
+ ;;
+ "root invalid")
+ VAL=-1
+ ;;
+ esac
+ [[ $EVAL != $VAL ]] && return 1
+ done
+ return 0
+}
+
+#
+# Run cpuset state transition test
+# $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running on a slower machine.
+#
+run_state_test()
+{
+ TEST=$1
+ CONTROLLER=cpuset
+ I=0
+ CPULIST=0-6
+ eval CNT="\${#$TEST[@]}"
+
+ reset_cgroup_states
+ echo $CPULIST > cpuset.cpus
+ echo root > cpuset.cpus.partition
+ console_msg "Running state transition test ..."
+
+ while [[ $I -lt $CNT ]]
+ do
+ echo "Running test $I ..." > /dev/console
+ eval set -- "\${$TEST[$I]}"
+ ROOT=$1
+ OLD_A1=$2
+ OLD_A2=$3
+ OLD_A3=$4
+ OLD_B1=$5
+ NEW_A1=$6
+ NEW_A2=$7
+ NEW_A3=$8
+ NEW_B1=$9
+ RESULT=${10}
+ ECPUS=${11}
+ STATES=${12}
+
+ set_ctrl_state_noerr . $ROOT
+ set_ctrl_state_noerr A1 $OLD_A1
+ set_ctrl_state_noerr A1/A2 $OLD_A2
+ set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+ set_ctrl_state_noerr B1 $OLD_B1
+ RETVAL=0
+ set_ctrl_state A1 $NEW_A1; ((RETVAL += $?))
+ set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?))
+ set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+ set_ctrl_state B1 $NEW_B1; ((RETVAL += $?))
+
+ [[ $RETVAL -ne $RESULT ]] && {
+ echo "Test $TEST[$I] failed result check!"
+ eval echo \"\${$TEST[$I]}\"
+ online_cpus
+ exit 1
+ }
+
+ [[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+ check_effective_cpus $ECPUS
+ [[ $? -ne 0 ]] && {
+ echo "Test $TEST[$I] failed effective CPU check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ [[ -n "$STATES" ]] && {
+ check_cgroup_states $STATES
+ [[ $? -ne 0 ]] && {
+ echo "FAILED: Test $TEST[$I] failed states check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ reset_cgroup_states
+ [[ -n "$VERBOSE" ]] && echo "Test $I done."
+ ((I++))
+ done
+ echo "All $I tests of $TEST PASSED."
+
+ #
+ # Check to see if the effective cpu list changes
+ #
+ sleep 0.05
+ NEWLIST=$(cat cpuset.cpus.effective)
+ [[ $NEWLIST != $CPULIST ]] && {
+ echo "Effective cpus changed to $NEWLIST!"
+ }
+ echo member > cpuset.cpus.partition
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+echo "All tests PASSED."
+cd ..
+rmdir test
--
2.18.1

2021-06-21 18:53:41

by Waiman Long

Subject: [PATCH v2 3/6] cgroup/cpuset: Add a new isolated cpus.partition type

Bugzilla: https://bugzilla.redhat.com/show_bug.cgi?id=TBD

commit 994fb794cb252edd124a46ca0994e37a4726a100
Author: Waiman Long <[email protected]>
Date: Sat, 19 Jun 2021 13:28:19 -0400

cgroup/cpuset: Add a new isolated cpus.partition type

Cpuset v1 uses the sched_load_balance control file to determine if load
balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
as its use may require disabling load balancing at cgroup root.

For workloads that require very low latency like DPDK, the latency
jitters caused by periodic load balancing may exceed the desired
latency limit.

When cpuset v2 is in use, the only way to avoid this latency cost is to
use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
the kernel boot, however, there is no way to add or remove CPUs from
this isolated set. For workloads that are more dynamic in nature, that
means users have to provision enough CPUs for the worst case situation
resulting in excess idle CPUs.

To address this issue for cpuset v2, a new cpuset.cpus.partition type
"isolated" is added which allows the creation of a cpuset partition
without load balancing. This will allow system administrators to
dynamically adjust the size of the isolated partition to the current need
of the workload without rebooting the system.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 53 +++++++++++++++++++++++++++++++++++++-----
1 file changed, 47 insertions(+), 6 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 3fe68d0f593d..1a4b90e70e68 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -169,6 +169,8 @@ struct cpuset {
*
* 1 - partition root
*
+ * 2 - partition root without load balancing (isolated)
+ *
* -1 - invalid partition root
* None of the cpus in cpus_allowed can be put into the parent's
* subparts_cpus. In this case, the cpuset is not a real partition
@@ -180,6 +182,7 @@ struct cpuset {
*/
#define PRS_DISABLED 0
#define PRS_ENABLED 1
+#define PRS_ISOLATED 2
#define PRS_ERROR -1

/*
@@ -1267,17 +1270,22 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
int prev_prs = cpuset->partition_root_state;

/*
- * Check for possible transition between PRS_ENABLED
- * and PRS_ERROR.
+ * Check for possible transition between PRS_ERROR and
+ * PRS_ENABLED/PRS_ISOLATED.
*/
switch (cpuset->partition_root_state) {
case PRS_ENABLED:
+ case PRS_ISOLATED:
if (part_error)
cpuset->partition_root_state = PRS_ERROR;
break;
case PRS_ERROR:
- if (!part_error)
+ if (part_error)
+ break;
+ if (is_sched_load_balance(cpuset))
cpuset->partition_root_state = PRS_ENABLED;
+ else
+ cpuset->partition_root_state = PRS_ISOLATED;
break;
}
/*
@@ -1409,6 +1417,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
break;

case PRS_ENABLED:
+ case PRS_ISOLATED:
if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
update_tasks_cpumask(parent);
break;
@@ -1429,7 +1438,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
spin_lock_irq(&callback_lock);

if (cp->nr_subparts_cpus &&
- (cp->partition_root_state != PRS_ENABLED)) {
+ (cp->partition_root_state <= 0)) {
/*
* Put all active subparts_cpus back to effective_cpus.
*/
@@ -1963,6 +1972,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
int err;
struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmpmask;
+ bool sched_domain_rebuilt = false;

if (new_prs == cs->partition_root_state)
return 0;
@@ -1993,11 +2003,30 @@ static int update_prstate(struct cpuset *cs, int new_prs)

err = update_parent_subparts_cpumask(cs, partcmd_enable,
NULL, &tmpmask);
+
if (err) {
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
goto out;
}
- cs->partition_root_state = PRS_ENABLED;
+ if (new_prs == PRS_ISOLATED) {
+ /*
+ * Disabling the load balance flag should not return an
+ * error unless the system is running out of memory.
+ */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 0);
+ sched_domain_rebuilt = true;
+ }
+
+ cs->partition_root_state = new_prs;
+ } else if (cs->partition_root_state && new_prs) {
+ /*
+ * A change in load balance state only, no change in cpumasks.
+ */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, (new_prs != PRS_ISOLATED));
+
+ cs->partition_root_state = new_prs;
+ err = 0;
+ goto out; /* Sched domain is rebuilt in update_flag() */
} else {
/*
* Switch back to member is always allowed if PRS_ERROR.
@@ -2024,6 +2053,12 @@ static int update_prstate(struct cpuset *cs, int new_prs)

/* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+
+ if (!is_sched_load_balance(cs)) {
+ /* Make sure load balance is on */
+ update_flag(CS_SCHED_LOAD_BALANCE, cs, 1);
+ sched_domain_rebuilt = true;
+ }
}

/*
@@ -2036,7 +2071,8 @@ static int update_prstate(struct cpuset *cs, int new_prs)
if (parent->child_ecpus_count)
update_sibling_cpumasks(parent, cs, &tmpmask);

- rebuild_sched_domains_locked();
+ if (!sched_domain_rebuilt)
+ rebuild_sched_domains_locked();
out:
free_cpumasks(NULL, &tmpmask);
return err;
@@ -2531,6 +2567,9 @@ static int sched_partition_show(struct seq_file *seq, void *v)
case PRS_ENABLED:
seq_puts(seq, "root\n");
break;
+ case PRS_ISOLATED:
+ seq_puts(seq, "isolated\n");
+ break;
case PRS_DISABLED:
seq_puts(seq, "member\n");
break;
@@ -2557,6 +2596,8 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
val = PRS_ENABLED;
else if (!strcmp(buf, "member"))
val = PRS_DISABLED;
+ else if (!strcmp(buf, "isolated"))
+ val = PRS_ISOLATED;
else
return -EINVAL;

--
2.18.1

2021-06-21 18:53:41

by Waiman Long

Subject: [PATCH v2 4/6] cgroup/cpuset: Allow non-top parent partition root to distribute out all CPUs

Currently, a parent partition root cannot distribute all its CPUs to
child partition roots and be left with no CPUs itself. However, in some
use cases, a management application may want to create a parent
partition root as a management unit that has no task associated with it
and has all its CPUs distributed to various child partition roots
dynamically according to their needs. Leaving a CPU in the parent
partition root in such a case is a waste.

To accommodate such use cases, a parent partition root can now have
all its CPUs distributed to its child partition roots as long as:
1) it is not the top cpuset; and
2) there is no task directly associated with the parent.

Once an empty parent partition root is formed, no new task can be moved
into it.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 90 +++++++++++++++++++++++++++++-------------
1 file changed, 63 insertions(+), 27 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 1a4b90e70e68..01b861b97a40 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -268,6 +268,11 @@ static inline int is_partition_root(const struct cpuset *cs)
return cs->partition_root_state > 0;
}

+static inline int cpuset_has_tasks(const struct cpuset *cs)
+{
+ return cs->css.cgroup->nr_populated_csets;
+}
+
static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
(1 << CS_MEM_EXCLUSIVE)),
@@ -1174,21 +1179,31 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
return -EBUSY;

- /*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus or at least one
- * CPU will be left after that.
- */
- if ((cmd == partcmd_enable) &&
- (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
- cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
- return -EINVAL;
-
/*
* A cpumask update cannot make parent's effective_cpus become empty.
*/
adding = deleting = false;
if (cmd == partcmd_enable) {
+ bool parent_is_top_cpuset = !parent_cs(parent);
+ bool no_cpu_in_parent = cpumask_equal(cpuset->cpus_allowed,
+ parent->effective_cpus);
+ /*
+ * Enabling partition root is not allowed if not all the CPUs
+ * can be granted from parent's effective_cpus. If the parent
+ * is the top cpuset, at least one CPU must be left after that.
+ */
+ if (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
+ (parent_is_top_cpuset && no_cpu_in_parent))
+ return -EINVAL;
+
+ /*
+ * A non-top parent can be left with no CPU as long as there
+ * is no task directly associated with the parent. For such
+ * a parent, no new task can be moved into it.
+ */
+ if (no_cpu_in_parent && cpuset_has_tasks(parent))
+ return -EINVAL;
+
cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
@@ -1219,9 +1234,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if parent's effective_cpus could become empty.
+ * Return error if parent's effective_cpus could become empty
+ * and there are tasks in the parent.
*/
- if (adding &&
+ if (adding && cpuset_has_tasks(parent) &&
cpumask_equal(parent->effective_cpus, tmp->addmask)) {
if (!deleting)
return -EINVAL;
@@ -1237,12 +1253,13 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
}

/*
- * Return error if effective_cpus becomes empty or any CPU
- * distributed to child partitions is deleted.
+ * Return error if effective_cpus becomes empty with tasks
+ * or any CPU distributed to child partitions is deleted.
*/
if (deleting &&
(cpumask_intersects(tmp->delmask, cpuset->subparts_cpus) ||
- cpumask_equal(tmp->delmask, cpuset->effective_cpus)))
+ (cpumask_equal(tmp->delmask, cpuset->effective_cpus) &&
+ cpuset_has_tasks(cpuset))))
return -EBUSY;
} else {
/*
@@ -1263,7 +1280,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
parent->effective_cpus);
part_error = (is_partition_root(cpuset) &&
!parent->nr_subparts_cpus) ||
- cpumask_equal(tmp->addmask, parent->effective_cpus);
+ (cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+ cpuset_has_tasks(parent));
}

if (cmd == partcmd_update) {
@@ -1364,9 +1382,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)

/*
* If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some CPUs.
+ * parent, which is guaranteed to have some CPUs unless
+ * it is a partition root that has explicitly distributed
+ * out all its CPUs.
*/
if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+ if (is_partition_root(cp) &&
+ cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+ goto update_parent_subparts;
+
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
if (!cp->use_parent_ecpus) {
cp->use_parent_ecpus = true;
@@ -1388,6 +1412,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
}

+update_parent_subparts:
/*
* update_parent_subparts_cpumask() should have been called
* for cs already in update_cpumask(). We should also call
@@ -1458,7 +1483,8 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
*/
cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
cp->subparts_cpus);
- WARN_ON_ONCE(cpumask_empty(cp->effective_cpus));
+ WARN_ON_ONCE(cpumask_empty(cp->effective_cpus) &&
+ cpuset_has_tasks(cp));
}
spin_unlock_irq(&callback_lock);

@@ -1787,7 +1813,7 @@ static void update_nodemasks_hier(struct cpuset *cs, nodemask_t *new_mems)
cp->effective_mems = *new_mems;
spin_unlock_irq(&callback_lock);

- WARN_ON(!is_in_v2_mode() &&
+ WARN_ON_ONCE(!is_in_v2_mode() &&
!nodes_equal(cp->mems_allowed, cp->effective_mems));

update_tasks_nodemask(cp);
@@ -2201,6 +2227,13 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;

+ /*
+ * On default hierarchy, task cannot be moved to a cpuset with empty
+ * effective cpus.
+ */
+ if (is_in_v2_mode() && cpumask_empty(cs->effective_cpus))
+ goto out_unlock;
+
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->cpus_allowed);
if (ret)
@@ -3067,7 +3100,8 @@ hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated)
{
- if (cpumask_empty(new_cpus))
+ /* A partition root is allowed to have empty effective cpus */
+ if (cpumask_empty(new_cpus) && !is_partition_root(cs))
cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems;
@@ -3136,22 +3170,24 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)

/*
* In the unlikely event that a partition root has empty
- * effective_cpus, we will have to force any child partitions,
- * if present, to become invalid by setting nr_subparts_cpus to 0
- * without causing itself to become invalid.
+ * effective_cpus with tasks, we will have to force any child
+ * partitions, if present, to become invalid by setting
+ * nr_subparts_cpus to 0 without causing itself to become invalid.
*/
if (is_partition_root(cs) && cs->nr_subparts_cpus &&
- cpumask_empty(&new_cpus)) {
+ cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) {
cs->nr_subparts_cpus = 0;
cpumask_clear(cs->subparts_cpus);
compute_effective_cpumask(&new_cpus, cs, parent);
}

/*
- * If empty effective_cpus or zero nr_subparts_cpus or its parent
- * becomes erroneous, we have to transition it to the erroneous state.
+ * If empty effective_cpus with tasks or zero nr_subparts_cpus or
+ * its parent becomes erroneous, we have to transition it to the
+ * erroneous state.
*/
- if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
+ if (is_partition_root(cs) &&
+ ((cpumask_empty(&new_cpus) && cpuset_has_tasks(cs)) ||
(parent->partition_root_state == PRS_ERROR) ||
!parent->nr_subparts_cpus)) {
update_parent_subparts_cpumask(cs, partcmd_disable,
--
2.18.1

2021-06-24 12:53:23

by Michal Koutný

Subject: Re: [PATCH v2 3/6] cgroup/cpuset: Add a new isolated cpus.partition type

Hello.

On Mon, Jun 21, 2021 at 02:49:21PM -0400, Waiman Long <[email protected]> wrote:
> cgroup/cpuset: Add a new isolated cpus.partition type
>
> Cpuset v1 uses the sched_load_balance control file to determine if load
> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
> as its use may require disabling load balancing at cgroup root.
>
> For workloads that require very low latency like DPDK, the latency
> jitters caused by periodic load balancing may exceed the desired
> latency limit.
>
> When cpuset v2 is in use, the only way to avoid this latency cost is to
> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
> the kernel boot, however, there is no way to add or remove CPUs from
> this isolated set. For workloads that are more dynamic in nature, that
> means users have to provision enough CPUs for the worst case situation
> resulting in excess idle CPUs.
>
> To address this issue for cpuset v2, a new cpuset.cpus.partition type
> "isolated" is added which allows the creation of a cpuset partition
> without load balancing. This will allow system administrators to
> dynamically adjust the size of isolated partition to the current need
> of the workload without rebooting the system.

I like this work.
Would it be worth generalizing the API to be on par with what isolcpus=
can configure? (I.e. not only load balancing but the other dimensions of
isolation (like the flags nohz and managed_irq now).)

I don't know if all such behaviors could be implemented dynamically
(likely not easy) but the API could initially implement just what you do
here with the "isolated" partition type.

The variant I'm thinking of would keep just the "root" and "member"
partitions type and the "root" type could be additionally configured via
cpuset.cpus.partition.flags (for example).

WDYT?

Michal



2021-06-24 15:25:52

by Waiman Long

Subject: Re: [PATCH v2 3/6] cgroup/cpuset: Add a new isolated cpus.partition type

On 6/24/21 8:51 AM, Michal Koutný wrote:
> Hello.
>
> On Mon, Jun 21, 2021 at 02:49:21PM -0400, Waiman Long <[email protected]> wrote:
>> cgroup/cpuset: Add a new isolated cpus.partition type
>>
>> Cpuset v1 uses the sched_load_balance control file to determine if load
>> balancing should be enabled. Cpuset v2 gets rid of sched_load_balance
>> as its use may require disabling load balancing at cgroup root.
>>
>> For workloads that require very low latency like DPDK, the latency
>> jitters caused by periodic load balancing may exceed the desired
>> latency limit.
>>
>> When cpuset v2 is in use, the only way to avoid this latency cost is to
>> use the "isolcpus=" kernel boot option to isolate a set of CPUs. After
>> the kernel boot, however, there is no way to add or remove CPUs from
>> this isolated set. For workloads that are more dynamic in nature, that
>> means users have to provision enough CPUs for the worst case situation
>> resulting in excess idle CPUs.
>>
>> To address this issue for cpuset v2, a new cpuset.cpus.partition type
>> "isolated" is added which allows the creation of a cpuset partition
>> without load balancing. This will allow system administrators to
>> dynamically adjust the size of isolated partition to the current need
>> of the workload without rebooting the system.
> I like this work.
> Would it be worth generalizing the API to be on par with what isolcpus=
> can configure? (I.e. not only load balancing but the other dimensions of
> isolation (like the flags nohz and managed_irq now).)
Good point, the isolated partition is equivalent to isolcpus=domain. I
will need to evaluate the nohz and managed_irq options to see if they
can be done dynamically without adding a lot of overhead. If so, we can
extend the functionality to cover that in future patches. Right now,
this is for the domain functionality only. If we can cover the nohz and
managed_irq options, we can deprecate isolcpus and advocate the use of
cgroup instead.
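
As a rough illustration of the dynamic counterpart to isolcpus=domain, the following shell sketch sets up an isolated partition through the interface added by this patch. The helper name make_isolated_partition and the example path are assumptions made for illustration; only the cpuset.cpus and cpuset.cpus.partition files and the "isolated" keyword come from the patch. It also assumes the cpuset controller is already enabled in the parent's cgroup.subtree_control.

```shell
# Sketch: turn an existing cgroup v2 directory into an "isolated" partition.
# $1 - cgroup directory (hypothetical example: /sys/fs/cgroup/rt)
# $2 - CPU list to dedicate to the partition, e.g. "2-3"
make_isolated_partition()
{
    cgrp=$1
    cpus=$2

    # Set the CPU list first; the partition takes these CPUs from its parent.
    printf '%s\n' "$cpus" > "$cgrp/cpuset.cpus" || return 1

    # "isolated" is the new keyword accepted by cpuset.cpus.partition.
    printf '%s\n' isolated > "$cgrp/cpuset.cpus.partition" || return 1

    # Report the state that was actually recorded.
    cat "$cgrp/cpuset.cpus.partition"
}
```

Unlike the boot option, the same helper can be called again later with a different CPU list to grow or shrink the isolated set without a reboot.
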
>
> I don't know if all such behaviors could be implemented dynamically
> (likely not easy) but the API could initially implement just what you do
> here with the "isolated" partition type.
>
> The variant I'm thinking of would keep just the "root" and "member"
> partitions type and the "root" type could be additionally configured via
> cpuset.cpus.partition.flags (for example).
>
> WDYT?

What I am thinking is that "isolated" means "isolated:domain" or one can
do "isolated:nohz,domain,managed_irq" just like the current isolcpus boot
option. I don't think we really need to add an extra flags control file.
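
To make the proposed value format concrete, here is a small shell sketch that splits such a string into a partition type and a flag list. The "isolated:<flags>" syntax is only a proposal in this thread and is not implemented by the patch; the function name parse_partition_spec and the rule that "member" and "root" take no flags are assumptions for illustration.

```shell
# Sketch: parse a hypothetical "<type>[:flag,flag,...]" partition value.
# Prints "<type>" or "<type> <flags>"; returns 1 on an unknown type or flag.
parse_partition_spec()
{
    spec=$1
    type=${spec%%:*}
    flags=

    case $spec in
    *:*) flags=${spec#*:} ;;
    esac

    case $type in
    member|root)
        # Assumption: flags are only meaningful for isolated partitions.
        [ -z "$flags" ] || return 1
        ;;
    isolated)
        # A bare "isolated" is shorthand for "isolated:domain".
        [ -n "$flags" ] || flags=domain
        ;;
    *)
        return 1
        ;;
    esac

    # Check each comma-separated flag against the isolcpus= flag names.
    rest=$flags
    while [ -n "$rest" ]; do
        flag=${rest%%,*}
        case $flag in
        nohz|domain|managed_irq) ;;
        *) return 1 ;;
        esac
        case $rest in
        *,*) rest=${rest#*,} ;;
        *) rest= ;;
        esac
    done

    echo "$type${flags:+ $flags}"
}
```

Keeping everything in one value means a single write changes both the type and the isolation flags, which is the argument made above against a separate flags file.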

Cheers,
Longman


2021-06-26 10:57:19

by Tejun Heo

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

Hello, Waiman.

On Mon, Jun 21, 2021 at 02:49:20PM -0400, Waiman Long wrote:
> 1) A partition root can't be changed to member if it has child partition
> roots.
> 2) Removing CPUs from cpuset.cpus that causes it to become invalid is
> not allowed.

I'm not a fan of this approach. No matter what we have to be able to handle
CPU removals which are user-initiated operations anyway, so I don't see why
we're adding a different way of handling a different set of operations. Just
handle them the same?

Thanks.

--
tejun

2021-06-28 13:09:29

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 6/26/21 6:53 AM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Jun 21, 2021 at 02:49:20PM -0400, Waiman Long wrote:
>> 1) A partition root can't be changed to member if it has child partition
>> roots.
>> 2) Removing CPUs from cpuset.cpus that causes it to become invalid is
>> not allowed.
> I'm not a fan of this approach. No matter what we have to be able to handle
> CPU removals which are user-initiated operations anyway, so I don't see why
> we're adding a different way of handling a different set of operations. Just
> handle them the same?

The main reason for doing this is because normal cpuset control file
actions are under the direct control of the cpuset code. So it is up to
us to decide whether to grant it or deny it. Hotplug, on the other hand,
is not under the control of cpuset code. It can't deny a hotplug
operation. This is the main reason why the partition root error state
was added in the first place.

Normally, users can set cpuset.cpus to whatever value they want even
though the CPUs are not actually granted. However, turning on partition
root is under stricter control. You can't turn on partition root if the
CPUs requested cannot actually be granted. The problem with setting the
state to just partition error is that users may not be aware that the
partition creation operation failed. We can't assume all users will do
the proper error checking. I would rather let them know the operation
failed than rely on them doing the proper check afterward.
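
A minimal shell sketch of what this stricter behavior looks like from the caller's side: the write itself fails, so checking its exit status is enough. The function name request_partition is an assumption for illustration, and the sketch assumes a shell such as bash that reports a failed write through the exit status, as the selftest in this series relies on.

```shell
# Sketch: request a partition type and rely on the write's exit status.
# $1 - cgroup directory
# $2 - partition type to request, e.g. "root"
request_partition()
{
    pfile=$1/cpuset.cpus.partition

    # A rejected request shows up as a failed write; run it in a subshell
    # so that any error message from the shell can be silenced.
    if ! ( printf '%s\n' "$2" > "$pfile" ) 2>/dev/null; then
        echo "request for '$2' rejected" >&2
        return 1
    fi
}
```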

Yes, I agree that it is a different philosophy than the original cpuset
code, but I thought one reason for doing cgroup v2 is to simplify the
interface and make it a bit more error-proof. Since partition root
creation is a relatively rare operation, we can afford to make it
stricter than the other operations.

Cheers,
Longman

2021-07-05 17:54:42

by Tejun Heo

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

Hello, Waiman.

On Mon, Jun 28, 2021 at 09:06:50AM -0400, Waiman Long wrote:
> The main reason for doing this is because normal cpuset control file actions
> are under the direct control of the cpuset code. So it is up to us to decide
> whether to grant it or deny it. Hotplug, on the other hand, is not under the
> control of cpuset code. It can't deny a hotplug operation. This is the main
> reason why the partition root error state was added in the first place.

I have a difficult time convincing myself that this difference justifies the
behavior difference and it keeps bothering me that there is a state which
can be reached through one path but rejected by the other. I'll continue
below.

> Normally, users can set cpuset.cpus to whatever value they want even though
> they are not actually granted. However, turning on partition root is under
> more strict control. You can't turn on partition root if the CPUs requested
> cannot actually be granted. The problem with setting the state to just
> partition error is that users may not be aware that the partition creation
> operation fails. We can't assume all users will do the proper error
> checking. I would rather let them know the operation fails rather than
> relying on them doing the proper check afterward.
>
> Yes, I agree that it is a different philosophy than the original cpuset
> code, but I thought one reason of doing cgroup v2 is to simplify the
> interface and make it a bit more error-proof. Since partition root creation
> is a relatively rare operation, we can afford to make it more strict than
> the other operations.

So, IMO, one of the reasons why cgroup1 interface was such a mess was
because each piece of interaction was designed ad-hoc without regard to the
overall consistency. One person feels a particular way of interacting with
the interface is "correct" and does it that way and another person does
another part in a different way. In the end, we ended up with a messy
patchwork.

One problematic aspect of cpuset in cgroup1 was the handling of failure
modes, which was caused by the same exact approach - we wanted the interface
to reject invalid configurations outright even though we didn't have the
ability to prevent those configurations from occurring through other paths,
which makes the failure mode more subtle by further obscuring them.

I think a better approach would be having a clear signal and mechanism to
watch the state and explicitly requiring users to verify and monitor the
state transitions.
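
A rough shell sketch of the "verify and monitor" style described above, where a request is never rejected outright and the caller reads the state back. The function names are assumptions for illustration, and the polling loop is only a stand-in: the thread does not specify what the signalling mechanism would be. The "root invalid" string is the one the selftest in this series checks for.

```shell
# Sketch: read back the partition state instead of trusting the write.
# $1 - cgroup directory
# $2 - expected state, e.g. "root"
partition_state_is()
{
    actual=$(cat "$1/cpuset.cpus.partition") || return 2
    [ "$actual" = "$2" ]
}

# Poll the state a fixed number of times and report the first deviation.
# $3 - number of polls
# $4 - delay between polls in seconds
watch_partition_state()
{
    tries=$3
    while [ "$tries" -gt 0 ]; do
        partition_state_is "$1" "$2" || {
            echo "state changed: $(cat "$1/cpuset.cpus.partition")"
            return 1
        }
        tries=$((tries - 1))
        sleep "$4"
    done
    return 0
}
```

With this style, a hotplug event and a bad cpuset.cpus write are handled the same way: both show up as a state such as "root invalid" on the next read.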

Thanks.

--
tejun

2021-07-16 18:46:39

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 7/5/21 1:51 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Mon, Jun 28, 2021 at 09:06:50AM -0400, Waiman Long wrote:
>> The main reason for doing this is because normal cpuset control file actions
>> are under the direct control of the cpuset code. So it is up to us to decide
>> whether to grant it or deny it. Hotplug, on the other hand, is not under the
>> control of cpuset code. It can't deny a hotplug operation. This is the main
>> reason why the partition root error state was added in the first place.
> I have a difficult time convincing myself that this difference justifies the
> behavior difference and it keeps bothering me that there is a state which
> can be reached through one path but rejected by the other. I'll continue
> below.
>
>> Normally, users can set cpuset.cpus to whatever value they want even though
>> they are not actually granted. However, turning on partition root is under
>> more strict control. You can't turn on partition root if the CPUs requested
>> cannot actually be granted. The problem with setting the state to just
>> partition error is that users may not be aware that the partition creation
>> operation fails.  We can't assume all users will do the proper error
>> checking. I would rather let them know the operation fails rather than
>> relying on them doing the proper check afterward.
>>
>> Yes, I agree that it is a different philosophy than the original cpuset
>> code, but I thought one reason of doing cgroup v2 is to simplify the
>> interface and make it a bit more error-proof. Since partition root creation
>> is a relatively rare operation, we can afford to make it more strict than
>> the other operations.
> So, IMO, one of the reasons why cgroup1 interface was such a mess was
> because each piece of interaction was designed ad-hoc without regard to the
> overall consistency. One person feels a particular way of interacting with
> the interface is "correct" and does it that way and another person does
> another part in a different way. In the end, we ended up with a messy
> patchwork.
>
> One problematic aspect of cpuset in cgroup1 was the handling of failure
> modes, which was caused by the same exact approach - we wanted the interface
> to reject invalid configurations outright even though we didn't have the
> ability to prevent those configurations from occurring through other paths,
> which makes the failure mode more subtle by further obscuring them.
>
> I think a better approach would be having a clear signal and mechanism to
> watch the state and explicitly requiring users to verify and monitor the
> state transitions.

Sorry for the late reply; I was busy with other work.

I agree with you in principle. However, the reason there are more
restrictions on enabling a partition is that I want to avoid forcing
users to always read back cpuset.cpus.partition to see if the
operation succeeded instead of just getting an error from the operation.
The former approach is more error-prone. If you don't want changes in
existing behavior, I can relax the checking and allow the cpuset to
become an invalid partition root if an illegal operation happens.

Also, there is now another cpuset patch to extend CPU isolation to cgroup
v1 [1]. I think it is better suited to the cgroup v2 partition scheme, but
cgroup v1 is still quite heavily used out there.

Please let me know what you want me to do and I will send out a v3 version.

Thanks a lot!
Longman

2021-07-16 19:03:21

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 7/16/21 2:44 PM, Waiman Long wrote:
> On 7/5/21 1:51 PM, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Mon, Jun 28, 2021 at 09:06:50AM -0400, Waiman Long wrote:
>>> The main reason for doing this is because normal cpuset control file
>>> actions
>>> are under the direct control of the cpuset code. So it is up to us
>>> to decide
>>> whether to grant it or deny it. Hotplug, on the other hand, is not
>>> under the
>>> control of cpuset code. It can't deny a hotplug operation. This is
>>> the main
>>> reason why the partition root error state was added in the first place.
>> I have a difficult time convincing myself that this difference
>> justifies the
>> behavior difference and it keeps bothering me that there is a state
>> which
>> can be reached through one path but rejected by the other. I'll continue
>> below.
>>
>>> Normally, users can set cpuset.cpus to whatever value they want even
>>> though
>>> they are not actually granted. However, turning on partition root is
>>> under
>>> more strict control. You can't turn on partition root if the CPUs
>>> requested
>>> cannot actually be granted. The problem with setting the state to just
>>> partition error is that users may not be aware that the partition
>>> creation
>>> operation fails.  We can't assume all users will do the proper error
>>> checking. I would rather let them know the operation fails rather than
>>> relying on them doing the proper check afterward.
>>>
>>> Yes, I agree that it is a different philosophy than the original cpuset
>>> code, but I thought one reason of doing cgroup v2 is to simplify the
>>> interface and make it a bit more error-proof. Since partition root
>>> creation
>>> is a relatively rare operation, we can afford to make it more strict
>>> than
>>> the other operations.
>> So, IMO, one of the reasons why cgroup1 interface was such a mess was
>> because each piece of interaction was designed ad-hoc without regard
>> to the
>> overall consistency. One person feels a particular way of interacting
>> with
>> the interface is "correct" and does it that way and another person does
>> another part in a different way. In the end, we ended up with a messy
>> patchwork.
>>
>> One problematic aspect of cpuset in cgroup1 was the handling of failure
>> modes, which was caused by the same exact approach - we wanted the
>> interface
>> to reject invalid configurations outright even though we didn't have the
>> ability to prevent those configurations from occurring through other
>> paths,
>> which makes the failure mode more subtle by further obscuring them.
>>
>> I think a better approach would be having a clear signal and
>> mechanism to
>> watch the state and explicitly requiring users to verify and monitor the
>> state transitions.
>
> Sorry for the late reply as I was busy with other works.
>
> I agree with you on principle. However, the reason why there are more
> restrictions on enabling partition is because I want to avoid forcing
> the users to always read back cpuset.cpus.partition to see if the
> operation succeeds instead of just getting an error from the
> operation. The former approach is more error prone. If you don't want
> changes in existing behavior, I can relax the checking and allow them
> to become an invalid partition if an illegal operation happens.
>
> Also there is now another cpuset patch to extend cpu isolation to
> cgroup v1 [1]. I think it is better suit to the cgroup v2 partition
> scheme, but cgroup v1 is still quite heavily out there.
>
> Please let me know what you want me to do and I will send out a v3
> version.

Note that the current cpuset partition implementation already has
some restrictions on when a partition can be enabled. However, I missed
some corner cases in the original implementation that allow certain
cpuset operations to make a partition invalid. I tried to plug those
holes in this patchset. However, if maintaining backward compatibility
is more important, I can leave those holes and update the documentation
to make sure that people check cpuset.cpus.partition to confirm whether
their operation succeeded.

Cheers,
Longman

2021-07-16 20:10:19

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 7/16/21 2:59 PM, Waiman Long wrote:
> On 7/16/21 2:44 PM, Waiman Long wrote:
>> On 7/5/21 1:51 PM, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Mon, Jun 28, 2021 at 09:06:50AM -0400, Waiman Long wrote:
>>>> The main reason for doing this is because normal cpuset control
>>>> file actions
>>>> are under the direct control of the cpuset code. So it is up to us
>>>> to decide
>>>> whether to grant it or deny it. Hotplug, on the other hand, is not
>>>> under the
>>>> control of cpuset code. It can't deny a hotplug operation. This is
>>>> the main
>>>> reason why the partition root error state was added in the first
>>>> place.
>>> I have a difficult time convincing myself that this difference
>>> justifies the
>>> behavior difference and it keeps bothering me that there is a state
>>> which
>>> can be reached through one path but rejected by the other. I'll
>>> continue
>>> below.
>>>
>>>> Normally, users can set cpuset.cpus to whatever value they want
>>>> even though
>>>> they are not actually granted. However, turning on partition root
>>>> is under
>>>> more strict control. You can't turn on partition root if the CPUs
>>>> requested
>>>> cannot actually be granted. The problem with setting the state to just
>>>> partition error is that users may not be aware that the partition
>>>> creation
>>>> operation fails.  We can't assume all users will do the proper error
>>>> checking. I would rather let them know the operation fails rather than
>>>> relying on them doing the proper check afterward.
>>>>
>>>> Yes, I agree that it is a different philosophy than the original
>>>> cpuset
>>>> code, but I thought one reason of doing cgroup v2 is to simplify the
>>>> interface and make it a bit more error-proof. Since partition root
>>>> creation
>>>> is a relatively rare operation, we can afford to make it more
>>>> strict than
>>>> the other operations.
>>> So, IMO, one of the reasons why cgroup1 interface was such a mess was
>>> because each piece of interaction was designed ad-hoc without regard
>>> to the
>>> overall consistency. One person feels a particular way of
>>> interacting with
>>> the interface is "correct" and does it that way and another person does
>>> another part in a different way. In the end, we ended up with a messy
>>> patchwork.
>>>
>>> One problematic aspect of cpuset in cgroup1 was the handling of failure
>>> modes, which was caused by the same exact approach - we wanted the
>>> interface
>>> to reject invalid configurations outright even though we didn't have
>>> the
>>> ability to prevent those configurations from occurring through other
>>> paths,
>>> which makes the failure mode more subtle by further obscuring them.
>>>
>>> I think a better approach would be having a clear signal and
>>> mechanism to
>>> watch the state and explicitly requiring users to verify and monitor
>>> the
>>> state transitions.
>>
>> Sorry for the late reply as I was busy with other works.
>>
>> I agree with you on principle. However, the reason why there are more
>> restrictions on enabling partition is because I want to avoid forcing
>> the users to always read back cpuset.cpus.partition to see if the
>> operation succeeds instead of just getting an error from the
>> operation. The former approach is more error prone. If you don't want
>> changes in existing behavior, I can relax the checking and allow them
>> to become an invalid partition if an illegal operation happens.
>>
>> Also there is now another cpuset patch to extend cpu isolation to
>> cgroup v1 [1]. I think it is better suit to the cgroup v2 partition
>> scheme, but cgroup v1 is still quite heavily out there.
>>
>> Please let me know what you want me to do and I will send out a v3
>> version.
>
> Note that the current cpuset partition implementation have implemented
> some restrictions on when a partition can be enabled. However, I
> missed some corner cases in the original implementation that allow
> certain cpuset operations to make a partition invalid. I tried to plug
> those holes in this patchset. However, if maintaining backward
> compatibility is more important, I can leave those holes and update
> the documentation to make sure that people check cpuset.cpus.partition
> to confirm if their operation succeeds.

I just realized that a partition root sets the CPU_EXCLUSIVE bit, so
changes to cpuset.cpus that break the exclusivity rule are not allowed
anyway. This patchset just adds additional checks so that cpuset.cpus
changes that break the partition root rules will not be allowed. I can
remove those additional checks from this patchset and instead allow
cpuset.cpus changes that break the partition root rules to make the
partition invalid. However, I still want invalid changes to
cpuset.cpus.partition to be disallowed.

Cheers,
Longman

2021-07-16 20:48:33

by Tejun Heo

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

Hello, Waiman.

On Fri, Jul 16, 2021 at 04:08:15PM -0400, Waiman Long wrote:
> > > I agree with you on principle. However, the reason why there are
> > > more restrictions on enabling partition is because I want to avoid
> > > forcing the users to always read back cpuset.cpus.partition to see
> > > if the operation succeeds instead of just getting an error from the
> > > operation. The former approach is more error prone. If you don't
> > > want changes in existing behavior, I can relax the checking and
> > > allow them to become an invalid partition if an illegal operation
> > > happens.
> > >
> > > Also there is now another cpuset patch to extend cpu isolation to
> > > cgroup v1 [1]. I think it is better suit to the cgroup v2 partition
> > > scheme, but cgroup v1 is still quite heavily out there.
> > >
> > > Please let me know what you want me to do and I will send out a v3
> > > version.
> >
> > Note that the current cpuset partition implementation have implemented
> > some restrictions on when a partition can be enabled. However, I missed
> > some corner cases in the original implementation that allow certain
> > cpuset operations to make a partition invalid. I tried to plug those
> > holes in this patchset. However, if maintaining backward compatibility
> > is more important, I can leave those holes and update the documentation
> > to make sure that people check cpuset.cpus.partition to confirm if their
> > operation succeeds.
>
> I just realize that partition root set the CPU_EXCLUSIVE bit. So changes to
> cpuset.cpus that break exclusivity rule is not allowed anyway. This patchset
> is just adding additional checks so that cpuset.cpus changes that break the
> partition root rules will not be allowed. I can remove those additional
> checks for this patchset and allow cpuset.cpus changes that break the
> partition root rules to make it invalid instead. However, I still want
> invalid changes to cpuset.cpus.partition to be disallowed.

So, I get the instinct to disallow these operations and it'd make sense if
the conditions weren't reachable otherwise. However, I'm afraid what users
eventually get is a false sense of security rather than any actual guarantee.

Inconsistencies like this cause actual usability hazards - e.g. imagine a
system config script which sets up an exclusive cpuset and let's say that the
use case is fine with degraded operation when the target cores are offline
(e.g. energy save mode w/ only low power cores online). Let's say this
script runs in late stages during boot and has been reliable. However, at
some point, there are changes in the boot sequence and now there's a low but
non-trivial chance that the system would already be in a low power state when
the script runs. Now the script will fail sporadically and the whole thing
would be pretty awkward to debug.
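
The ordering hazard can be shown with a small toy model in Python. This
is only an illustration: the class, the CPU numbers, and the "strict"
switch are made up for the example and do not mirror the kernel's data
structures or its actual validity rules.

```python
class ToyPartition:
    """Toy model only: not the kernel's data structures or rules."""

    def __init__(self, cpus, online, strict):
        self.cpus = set(cpus)        # CPUs the partition asks for
        self.online = set(online)    # CPUs currently online
        self.strict = strict         # True: reject instead of going invalid
        self.state = "member"

    def _usable(self):
        return bool(self.cpus & self.online)

    def enable(self):
        """Model of a control file write asking for a partition root."""
        if self._usable():
            self.state = "root"
        elif self.strict:
            raise OSError("no requested CPU is online")
        else:
            self.state = "root invalid"

    def hotplug(self, online):
        """Model of a hotplug event; it can never be rejected."""
        self.online = set(online)
        if self.state == "root" and not self._usable():
            self.state = "root invalid"


def boot(order, strict):
    """Run 'enable' and 'offline' (the big cores go away) in a given order."""
    p = ToyPartition(cpus={4, 5}, online={0, 1, 4, 5}, strict=strict)
    for step in order:
        if step == "enable":
            p.enable()
        else:
            p.hotplug({0, 1})
    return p.state
```

With `strict=True`, `boot(["enable", "offline"], True)` ends in
"root invalid" while `boot(["offline", "enable"], True)` raises, so the
same end condition is either a state or an error depending purely on
timing. With `strict=False`, both orders end in "root invalid" and the
script only has to check the state.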

I'd much prefer to have an explicit interface to confirm the eventual state
and a way to monitor state transitions (without polling). An invalid state
is an inherent part of cpuset configuration. I'd much rather have that
really explicit in the interface even if that means a bit of extra work at
configuration time.

Thanks.

--
tejun

2021-07-16 21:13:11

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 7/16/21 4:46 PM, Tejun Heo wrote:
> Hello, Waiman.
>
> On Fri, Jul 16, 2021 at 04:08:15PM -0400, Waiman Long wrote:
>>>> I agree with you on principle. However, the reason why there are
>>>> more restrictions on enabling partition is because I want to avoid
>>>> forcing the users to always read back cpuset.cpus.partition to see
>>>> if the operation succeeds instead of just getting an error from the
>>>> operation. The former approach is more error prone. If you don't
>>>> want changes in existing behavior, I can relax the checking and
>>>> allow them to become an invalid partition if an illegal operation
>>>> happens.
>>>>
>>>> Also there is now another cpuset patch to extend cpu isolation to
>>>> cgroup v1 [1]. I think it is better suit to the cgroup v2 partition
>>>> scheme, but cgroup v1 is still quite heavily out there.
>>>>
>>>> Please let me know what you want me to do and I will send out a v3
>>>> version.
>>> Note that the current cpuset partition implementation have implemented
>>> some restrictions on when a partition can be enabled. However, I missed
>>> some corner cases in the original implementation that allow certain
>>> cpuset operations to make a partition invalid. I tried to plug those
>>> holes in this patchset. However, if maintaining backward compatibility
>>> is more important, I can leave those holes and update the documentation
>>> to make sure that people check cpuset.cpus.partition to confirm if their
>>> operation succeeds.
>> I just realize that partition root set the CPU_EXCLUSIVE bit. So changes to
>> cpuset.cpus that break exclusivity rule is not allowed anyway. This patchset
>> is just adding additional checks so that cpuset.cpus changes that break the
>> partition root rules will not be allowed. I can remove those additional
>> checks for this patchset and allow cpuset.cpus changes that break the
>> partition root rules to make it invalid instead. However, I still want
>> invalid changes to cpuset.cpus.partition to be disallowed.
> So, I get the instinct to disallow these operations and it'd make sense if
> the conditions aren't reachable otherwise. However, I'm afraid what users
> eventually get is false sense of security rather than any actual guarantee.
>
> Inconsistencies like this cause actual usability hazards - e.g. imagine a
> system config script which sets up exclusive cpuset and let's say that the
> use case is fine with degraded operation when the target cores are offline
> (e.g. energy save mode w/ only low power cores online). Let's say this
> script runs in late stages during boot and has been reliable. However, at
> some point, there are changes in boot sequence and now there's low but
> non-trivial chance that the system would already be in low power state when
> the script runs. Now the script will fail sporadically and the whole thing
> would be pretty awkward to debug.
>
> I'd much prefer to have an explicit interface to confirm the eventual state
> and a way to monitor state transitions (without polling). An invalid state
> is an inherent part of cpuset configuration. I'd much rather have that
> really explicit in the interface even if that means a bit of extra work at
> configuration time.

Are you suggesting that we add a cpuset.cpus.events file that allows
processes to be notified when an event (e.g. hotplug) changes a
partition root to an invalid partition, or when an explicit change to a
partition root fails? Would that be enough to satisfy your requirement?

Cheers,
Longman

2021-07-16 21:19:48

by Tejun Heo

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

Hello,

On Fri, Jul 16, 2021 at 05:12:17PM -0400, Waiman Long wrote:
> Are you suggesting that we add a cpuset.cpus.events file that allows
> processes to be notified if an event (e.g. hotplug) that changes a partition
> root to invalid partition happens or when explicit change to a partition
> root fails? Will that be enough to satisfy your requirement?

Yeah, something like that or make the current state file generate events on
state transitions.

Thanks.

--
tejun

2021-07-16 21:30:53

by Waiman Long

Subject: Re: [PATCH v2 2/6] cgroup/cpuset: Clarify the use of invalid partition root

On 7/16/21 5:18 PM, Tejun Heo wrote:
> Hello,
>
> On Fri, Jul 16, 2021 at 05:12:17PM -0400, Waiman Long wrote:
>> Are you suggesting that we add a cpuset.cpus.events file that allows
>> processes to be notified if an event (e.g. hotplug) that changes a partition
>> root to invalid partition happens or when explicit change to a partition
>> root fails? Will that be enough to satisfy your requirement?
> Yeah, something like that or make the current state file generate events on
> state transitions.


Sure. I will change the patch to make cpuset.cpus.partition generate an
event when its state changes. Thanks for the suggestion. It definitely
makes it better.
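
For reference, a userspace watcher could then look roughly like the
Python sketch below. It assumes the file wakes up poll() waiters with
POLLPRI when the state changes, the way other cgroup v2 event-generating
files do; that behavior is what is being proposed here, not something
this sketch can guarantee. The function names are hypothetical.

```python
import os
import select
import time


def read_state(fd):
    """Re-read the whole file from offset 0."""
    os.lseek(fd, 0, os.SEEK_SET)
    return os.read(fd, 256).decode().strip()


def wait_for_transition(fd, last_state, timeout=None):
    """Block until the state differs from last_state.

    Returns the new state, or None if `timeout` seconds pass first.
    """
    poller = select.poll()
    poller.register(fd, select.POLLPRI | select.POLLERR)
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        # Read before sleeping so a transition that happened since the
        # last read is not missed.
        state = read_state(fd)
        if state != last_state:
            return state
        if deadline is None:
            poller.poll()
        else:
            remaining = deadline - time.monotonic()
            if remaining <= 0:
                return None
            poller.poll(max(1, int(remaining * 1000)))
```

A caller would open cpuset.cpus.partition in the cgroup of interest with
os.open(..., os.O_RDONLY), read the initial state with read_state(), and
then call wait_for_transition() in a loop instead of polling on a timer.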

Cheers,
Longman