2022-09-01 21:19:21

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 00/10] cgroup/cpuset: cpu partition code fixes & enhancements

v12:
- Change patch 1 to enable update_tasks_cpumask() for top_cpuset except
for percpu kthreads.
- Add 2 more patches to make exclusivity rule violations invalidate the
partition and its siblings instead of failing the change to make it
consistent with other cpuset changes.
- Update documentation and test script accordingly.

v11:
- Fix incorrect spacing in patch 7 and include documentation suggestions
by Michal.
- Move partition_is_populated() check to the last one in list of
conditions to be checked.

v10:
- Relax constraints for changes made to "cpuset.cpus"
and "cpuset.cpus.partition" as suggested. Now almost all changes
are allowed.
- Add patch 1 to signal that we may need to do additional work in
the future to relax the constraint that tasks' cpumask may need
some adjustment if child partitions are present.
- Add patch 2 for miscellaneous cleanups.

The first patch fixes the problem that tasks in the top_cpuset
will not have its cpus_mask properly set to reflect the reduced
set of cpus available in the top_cpuset when a partition is enabled.

This patchset also includes the following enhancements to the cpuset
v2 partition code.

1) Allow partitions that have no task to have empty effective cpus.
2) Relax the constraints on what changes are allowed in cpuset.cpus
and cpuset.cpus.partition. However, the partition remain invalid
until the constraints of a valid partition root is satisfied.
3) Add a new "isolated" partition type for partitions with no load
balancing which is available in v1 but not yet in v2.
4) Allow the reading of cpuset.cpus.partition to include a reason
string as to why the partition remain invalid.

In addition, the cgroup-v2.rst documentation file is updated and a self
test is added to verify the correctness the partition code.

Waiman Long (10):
cgroup/cpuset: Enable update_tasks_cpumask() on top_cpuset
cgroup/cpuset: Miscellaneous cleanups & add helper functions
cgroup/cpuset: Allow no-task partition to have empty
cpuset.cpus.effective
cgroup/cpuset: Relax constraints to partition & cpus changes
cgroup/cpuset: Add a new isolated cpus.partition type
cgroup/cpuset: Show invalid partition reason string
cgroup/cpuset: Relocate a code block in validate_change()
cgroup/cpuset: Make partition invalid if cpumask change violates
exclusivity rule
cgroup/cpuset: Update description of cpuset.cpus.partition in
cgroup-v2.rst
kselftest/cgroup: Add cpuset v2 partition root state test

Documentation/admin-guide/cgroup-v2.rst | 150 ++--
kernel/cgroup/cpuset.c | 817 ++++++++++++------
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 5 +-
.../selftests/cgroup/test_cpuset_prs.sh | 674 +++++++++++++++
tools/testing/selftests/cgroup/wait_inotify.c | 87 ++
6 files changed, 1385 insertions(+), 349 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

--
2.31.1


2022-09-01 21:21:19

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 10/10] kselftest/cgroup: Add cpuset v2 partition root state test

Add a test script test_cpuset_prs.sh with a helper program wait_inotify
for exercising the cpuset v2 partition root state code.

Signed-off-by: Waiman Long <[email protected]>
---
tools/testing/selftests/cgroup/.gitignore | 1 +
tools/testing/selftests/cgroup/Makefile | 5 +-
.../selftests/cgroup/test_cpuset_prs.sh | 674 ++++++++++++++++++
tools/testing/selftests/cgroup/wait_inotify.c | 87 +++
4 files changed, 765 insertions(+), 2 deletions(-)
create mode 100755 tools/testing/selftests/cgroup/test_cpuset_prs.sh
create mode 100644 tools/testing/selftests/cgroup/wait_inotify.c

diff --git a/tools/testing/selftests/cgroup/.gitignore b/tools/testing/selftests/cgroup/.gitignore
index 306ee1b01e72..c4a57e69f749 100644
--- a/tools/testing/selftests/cgroup/.gitignore
+++ b/tools/testing/selftests/cgroup/.gitignore
@@ -5,3 +5,4 @@ test_freezer
test_kmem
test_kill
test_cpu
+wait_inotify
diff --git a/tools/testing/selftests/cgroup/Makefile b/tools/testing/selftests/cgroup/Makefile
index 478217cc1371..3d263747d2ad 100644
--- a/tools/testing/selftests/cgroup/Makefile
+++ b/tools/testing/selftests/cgroup/Makefile
@@ -1,10 +1,11 @@
# SPDX-License-Identifier: GPL-2.0
CFLAGS += -Wall -pthread

-all:
+all: ${HELPER_PROGS}

TEST_FILES := with_stress.sh
-TEST_PROGS := test_stress.sh
+TEST_PROGS := test_stress.sh test_cpuset_prs.sh
+TEST_GEN_FILES := wait_inotify
TEST_GEN_PROGS = test_memcontrol
TEST_GEN_PROGS += test_kmem
TEST_GEN_PROGS += test_core
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
new file mode 100755
index 000000000000..526d2c42d870
--- /dev/null
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -0,0 +1,674 @@
+#!/bin/bash
+# SPDX-License-Identifier: GPL-2.0
+#
+# Test for cpuset v2 partition root state (PRS)
+#
+# The sched verbose flag is set, if available, so that the console log
+# can be examined for the correct setting of scheduling domain.
+#
+
+skip_test() {
+ echo "$1"
+ echo "Test SKIPPED"
+ exit 0
+}
+
+[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
+
+# Set sched verbose flag, if available
+[[ -d /sys/kernel/debug/sched ]] && echo Y > /sys/kernel/debug/sched/verbose
+
+# Get wait_inotify location
+WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
+
+# Find cgroup v2 mount point
+CGROUP2=$(mount -t cgroup2 | head -1 | awk -e '{print $3}')
+[[ -n "$CGROUP2" ]] || skip_test "Cgroup v2 mount point not found!"
+
+CPUS=$(lscpu | grep "^CPU(s)" | sed -e "s/.*:[[:space:]]*//")
+[[ $CPUS -lt 8 ]] && skip_test "Test needs at least 8 cpus available!"
+
+# Set verbose flag and delay factor
+PROG=$1
+VERBOSE=
+DELAY_FACTOR=1
+while [[ "$1" = -* ]]
+do
+ case "$1" in
+ -v) VERBOSE=1
+ break
+ ;;
+ -d) DELAY_FACTOR=$2
+ shift
+ break
+ ;;
+ *) echo "Usage: $PROG [-v] [-d <delay-factor>"
+ exit
+ ;;
+ esac
+ shift
+done
+
+cd $CGROUP2
+echo +cpuset > cgroup.subtree_control
+[[ -d test ]] || mkdir test
+cd test
+
+# Pause in ms
+pause()
+{
+ DELAY=$1
+ LOOP=0
+ while [[ $LOOP -lt $DELAY_FACTOR ]]
+ do
+ sleep $DELAY
+ ((LOOP++))
+ done
+ return 0
+}
+
+console_msg()
+{
+ MSG=$1
+ echo "$MSG"
+ echo "" > /dev/console
+ echo "$MSG" > /dev/console
+ pause 0.01
+}
+
+test_partition()
+{
+ EXPECTED_VAL=$1
+ echo $EXPECTED_VAL > cpuset.cpus.partition
+ [[ $? -eq 0 ]] || exit 1
+ ACTUAL_VAL=$(cat cpuset.cpus.partition)
+ [[ $ACTUAL_VAL != $EXPECTED_VAL ]] && {
+ echo "cpuset.cpus.partition: expect $EXPECTED_VAL, found $EXPECTED_VAL"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+test_effective_cpus()
+{
+ EXPECTED_VAL=$1
+ ACTUAL_VAL=$(cat cpuset.cpus.effective)
+ [[ "$ACTUAL_VAL" != "$EXPECTED_VAL" ]] && {
+ echo "cpuset.cpus.effective: expect '$EXPECTED_VAL', found '$EXPECTED_VAL'"
+ echo "Test FAILED"
+ exit 1
+ }
+}
+
+# Adding current process to cgroup.procs as a test
+test_add_proc()
+{
+ OUTSTR="$1"
+ ERRMSG=$((echo $$ > cgroup.procs) |& cat)
+ echo $ERRMSG | grep -q "$OUTSTR"
+ [[ $? -ne 0 ]] && {
+ echo "cgroup.procs: expect '$OUTSTR', got '$ERRMSG'"
+ echo "Test FAILED"
+ exit 1
+ }
+ echo $$ > $CGROUP2/cgroup.procs # Move out the task
+}
+
+#
+# Testing the new "isolated" partition root type
+#
+test_isolated()
+{
+ echo 2-3 > cpuset.cpus
+ TYPE=$(cat cpuset.cpus.partition)
+ [[ $TYPE = member ]] || echo member > cpuset.cpus.partition
+
+ console_msg "Change from member to root"
+ test_partition root
+
+ console_msg "Change from root to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to member"
+ test_partition member
+
+ console_msg "Change from member to isolated"
+ test_partition isolated
+
+ console_msg "Change from isolated to root"
+ test_partition root
+
+ console_msg "Change from root to member"
+ test_partition member
+
+ #
+ # Testing partition root with no cpu
+ #
+ console_msg "Distribute all cpus to child partition"
+ echo +cpuset > cgroup.subtree_control
+ test_partition root
+
+ mkdir A1
+ cd A1
+ echo 2-3 > cpuset.cpus
+ test_partition root
+ test_effective_cpus 2-3
+ cd ..
+ test_effective_cpus ""
+
+ console_msg "Moving task to partition test"
+ test_add_proc "No space left"
+ cd A1
+ test_add_proc ""
+ cd ..
+
+ console_msg "Shrink and expand child partition"
+ cd A1
+ echo 2 > cpuset.cpus
+ cd ..
+ test_effective_cpus 3
+ cd A1
+ echo 2-3 > cpuset.cpus
+ cd ..
+ test_effective_cpus ""
+
+ # Cleaning up
+ console_msg "Cleaning up"
+ echo $$ > $CGROUP2/cgroup.procs
+ [[ -d A1 ]] && rmdir A1
+}
+
+#
+# Cpuset controller state transition test matrix.
+#
+# Cgroup test hierarchy
+#
+# test -- A1 -- A2 -- A3
+# \- B1
+#
+# P<v> = set cpus.partition (0:member, 1:root, 2:isolated, -1:root invalid)
+# C<l> = add cpu-list
+# S<p> = use prefix in subtree_control
+# T = put a task into cgroup
+# O<c>-<v> = Write <v> to CPU online file of <c>
+#
+SETUP_A123_PARTITIONS="C1-3:P1:S+ C2-3:P1:S+ C3:P1"
+TEST_MATRIX=(
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ " S+ C0-1 . . C2-3 S+ C4-5 . . 0 A2:0-1"
+ " S+ C0-1 . . C2-3 P1 . . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C0-1:P1 . . 0 "
+ " S+ C0-1 . . C2-3 P1:S+ C1:P1 . . 0 "
+ " S+ C0-1:S+ . . C2-3 . . . P1 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . . 0 "
+ " S+ C0-1:P1 . . C2-3 S+ C1:P1 . P1 0 "
+ " S+ C0-1:P1 . . C2-3 C4-5 . . . 0 A1:4-5"
+ " S+ C0-1:P1 . . C2-3 S+:C4-5 . . . 0 A1:4-5"
+ " S+ C0-1 . . C2-3:P1 . . . C2 0 "
+ " S+ C0-1 . . C2-3:P1 . . . C4-5 0 B1:4-5"
+ " S+ C0-3:P1:S+ C2-3:P1 . . . . . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . C1-3 . . . 0 A1:1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 P0 . . 0 A1:3,A2:3 A1:P1,A2:P0"
+ " S+ C2-3:P1:S+ C2:P1 . . C2-4 . . . 0 A1:3-4,A2:2"
+ " S+ C2-3:P1:S+ C3:P1 . . C3 . . C0-2 0 A1:,B1:0-2 A1:P1,A2:P1"
+ " S+ $SETUP_A123_PARTITIONS . C2-3 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # CPU offlining cases:
+ " S+ C0-1 . . C2-3 S+ C4-5 . O2-0 0 A1:0-1,B1:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 . . . 0 A1:0-1,A2:3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O2-0 O2-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 . . . 0 A1:0,A2:2-3"
+ " S+ C0-3:P1:S+ C2-3:P1 . . O1-0 O1-1 . . 0 A1:0-1,A2:2-3"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O3-0 O3-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P2 . . O2-0 O2-1 . . 0 A1:2,A2:3 A1:P1,A2:P2"
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0 . . . 0 A1:,A2:3 A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . O3-0 . . . 0 A1:2,A2: A1:P1,A2:P1"
+ " S+ C2-3:P1:S+ C3:P1 . . T:O2-0 . . . 0 A1:3,A2:3 A1:P1,A2:P-1"
+ " S+ C2-3:P1:S+ C3:P1 . . . T:O3-0 . . 0 A1:2,A2:2 A1:P1,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . O1-0 . . . 0 A1:,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O2-0 . . . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . O3-0 . . . 0 A1:1,A2:2,A3: A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 . . 0 A1:1,A2:3,A3:3 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 . 0 A1:1,A2:2,A3:2 A1:P1,A2:P1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O1-1 . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . T:O2-0 O2-1 . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . . . T:O3-0 O3-1 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O1-1 . 0 A1:1,A2:,A3:3 A1:P1,A2:P1,A3:P1"
+ " S+ $SETUP_A123_PARTITIONS . T:O1-0 O2-0 O2-1 . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ #
+ # Incorrect change to cpuset.cpus invalidates partition root
+ #
+ # Adding CPUs to partition root that are not in parent's
+ # cpuset.cpus is allowed, but those extra CPUs are ignored.
+ " S+ C2-3:P1:S+ C3:P1 . . . C2-4 . . 0 A1:,A2:2-3 A1:P1,A2:P1"
+
+ # Taking away all CPUs from parent or itself if there are tasks
+ # will make the partition invalid.
+ " S+ C2-3:P1:S+ C3:P1 . . T C2-3 . . 0 A1:2-3,A2:2-3 A1:P1,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:C2-3 . . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P-1,A3:P-1"
+ " S+ $SETUP_A123_PARTITIONS . T:C2-3:C1-3 . . . 0 A1:1,A2:2,A3:3 A1:P1,A2:P1,A3:P1"
+
+ # Changing a partition root to member makes child partitions invalid
+ " S+ C2-3:P1:S+ C3:P1 . . P0 . . . 0 A1:2-3,A2:3 A1:P0,A2:P-1"
+ " S+ $SETUP_A123_PARTITIONS . C2-3 P0 . . 0 A1:2-3,A2:2-3,A3:3 A1:P1,A2:P0,A3:P-1"
+
+ # cpuset.cpus can contains cpus not in parent's cpuset.cpus as long
+ # as they overlap.
+ " S+ C2-3:P1:S+ . . . . C3-4:P1 . . 0 A1:2,A2:3 A1:P1,A2:P1"
+
+ # Deletion of CPUs distributed to child cgroup is allowed.
+ " S+ C0-1:P1:S+ C1 . C2-3 C4-5 . . . 0 A1:4-5,A2:4-5"
+
+ # To become a valid partition root, cpuset.cpus must overlap parent's
+ # cpuset.cpus.
+ " S+ C0-1:P1 . . C2-3 S+ C4-5:P1 . . 0 A1:0-1,A2:0-1 A1:P1,A2:P-1"
+
+ # Enabling partition with child cpusets is allowed
+ " S+ C0-1:S+ C1 . C2-3 P1 . . . 0 A1:0-1,A2:1 A1:P1"
+
+ # A partition root with non-partition root parent is invalid, but it
+ # can be made valid if its parent becomes a partition root too.
+ " S+ C0-1:S+ C1 . C2-3 . P2 . . 0 A1:0-1,A2:1 A1:P0,A2:P-2"
+ " S+ C0-1:S+ C1:P2 . C2-3 P1 . . . 0 A1:0,A2:1 A1:P1,A2:P2"
+
+ # A non-exclusive cpuset.cpus change will invalidate partition and its siblings
+ " S+ C0-1:P1 . . C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P0"
+ " S+ C0-1:P1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P-1,B1:P-1"
+ " S+ C0-1 . . P1:C2-3 C0-2 . . . 0 A1:0-2,B1:2-3 A1:P0,B1:P-1"
+
+ # test old-A1 old-A2 old-A3 old-B1 new-A1 new-A2 new-A3 new-B1 fail ECPUs Pstate
+ # ---- ------ ------ ------ ------ ------ ------ ------ ------ ---- ----- ------
+ # Failure cases:
+
+ # A task cannot be added to a partition with no cpu
+ " S+ C2-3:P1:S+ C3:P1 . . O2-0:T . . . 1 A1:,A2:3 A1:P1,A2:P1"
+)
+
+#
+# Write to the cpu online file
+# $1 - <c>-<v> where <c> = cpu number, <v> value to be written
+#
+write_cpu_online()
+{
+ CPU=${1%-*}
+ VAL=${1#*-}
+ CPUFILE=//sys/devices/system/cpu/cpu${CPU}/online
+ if [[ $VAL -eq 0 ]]
+ then
+ OFFLINE_CPUS="$OFFLINE_CPUS $CPU"
+ else
+ [[ -n "$OFFLINE_CPUS" ]] && {
+ OFFLINE_CPUS=$(echo $CPU $CPU $OFFLINE_CPUS | fmt -1 |\
+ sort | uniq -u)
+ }
+ fi
+ echo $VAL > $CPUFILE
+ pause 0.01
+}
+
+#
+# Set controller state
+# $1 - cgroup directory
+# $2 - state
+# $3 - showerr
+#
+# The presence of ":" in state means transition from one to the next.
+#
+set_ctrl_state()
+{
+ TMPMSG=/tmp/.msg_$$
+ CGRP=$1
+ STATE=$2
+ SHOWERR=${3}${VERBOSE}
+ CTRL=${CTRL:=$CONTROLLER}
+ HASERR=0
+ REDIRECT="2> $TMPMSG"
+ [[ -z "$STATE" || "$STATE" = '.' ]] && return 0
+
+ rm -f $TMPMSG
+ for CMD in $(echo $STATE | sed -e "s/:/ /g")
+ do
+ TFILE=$CGRP/cgroup.procs
+ SFILE=$CGRP/cgroup.subtree_control
+ PFILE=$CGRP/cpuset.cpus.partition
+ CFILE=$CGRP/cpuset.cpus
+ S=$(expr substr $CMD 1 1)
+ if [[ $S = S ]]
+ then
+ PREFIX=${CMD#?}
+ COMM="echo ${PREFIX}${CTRL} > $SFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = C ]]
+ then
+ CPUS=${CMD#?}
+ COMM="echo $CPUS > $CFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = P ]]
+ then
+ VAL=${CMD#?}
+ case $VAL in
+ 0) VAL=member
+ ;;
+ 1) VAL=root
+ ;;
+ 2) VAL=isolated
+ ;;
+ *)
+ echo "Invalid partition state - $VAL"
+ exit 1
+ ;;
+ esac
+ COMM="echo $VAL > $PFILE"
+ eval $COMM $REDIRECT
+ elif [[ $S = O ]]
+ then
+ VAL=${CMD#?}
+ write_cpu_online $VAL
+ elif [[ $S = T ]]
+ then
+ COMM="echo 0 > $TFILE"
+ eval $COMM $REDIRECT
+ fi
+ RET=$?
+ [[ $RET -ne 0 ]] && {
+ [[ -n "$SHOWERR" ]] && {
+ echo "$COMM"
+ cat $TMPMSG
+ }
+ HASERR=1
+ }
+ pause 0.01
+ rm -f $TMPMSG
+ done
+ return $HASERR
+}
+
+set_ctrl_state_noerr()
+{
+ CGRP=$1
+ STATE=$2
+ [[ -d $CGRP ]] || mkdir $CGRP
+ set_ctrl_state $CGRP $STATE 1
+ [[ $? -ne 0 ]] && {
+ echo "ERROR: Failed to set $2 to cgroup $1!"
+ exit 1
+ }
+}
+
+online_cpus()
+{
+ [[ -n "OFFLINE_CPUS" ]] && {
+ for C in $OFFLINE_CPUS
+ do
+ write_cpu_online ${C}-1
+ done
+ }
+}
+
+#
+# Return 1 if the list of effective cpus isn't the same as the initial list.
+#
+reset_cgroup_states()
+{
+ echo 0 > $CGROUP2/cgroup.procs
+ online_cpus
+ rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
+ set_ctrl_state . S-
+ pause 0.01
+}
+
+dump_states()
+{
+ for DIR in A1 A1/A2 A1/A2/A3 B1
+ do
+ ECPUS=$DIR/cpuset.cpus.effective
+ PRS=$DIR/cpuset.cpus.partition
+ [[ -e $ECPUS ]] && echo "$ECPUS: $(cat $ECPUS)"
+ [[ -e $PRS ]] && echo "$PRS: $(cat $PRS)"
+ done
+}
+
+#
+# Check effective cpus
+# $1 - check string, format: <cgroup>:<cpu-list>[,<cgroup>:<cpu-list>]*
+#
+check_effective_cpus()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ CPUS=$2
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+ FILE=$CGRP/cpuset.cpus.effective
+ [[ -e $FILE ]] || return 1
+ [[ $CPUS = $(cat $FILE) ]] || return 1
+ done
+}
+
+#
+# Check cgroup states
+# $1 - check string, format: <cgroup>:<state>[,<cgroup>:<state>]*
+#
+check_cgroup_states()
+{
+ CHK_STR=$1
+ for CHK in $(echo $CHK_STR | sed -e "s/,/ /g")
+ do
+ set -- $(echo $CHK | sed -e "s/:/ /g")
+ CGRP=$1
+ STATE=$2
+ FILE=
+ EVAL=$(expr substr $STATE 2 2)
+ [[ $CGRP = A2 ]] && CGRP=A1/A2
+ [[ $CGRP = A3 ]] && CGRP=A1/A2/A3
+
+ case $STATE in
+ P*) FILE=$CGRP/cpuset.cpus.partition
+ ;;
+ *) echo "Unknown state: $STATE!"
+ exit 1
+ ;;
+ esac
+ VAL=$(cat $FILE)
+
+ case "$VAL" in
+ member) VAL=0
+ ;;
+ root) VAL=1
+ ;;
+ isolated)
+ VAL=2
+ ;;
+ "root invalid"*)
+ VAL=-1
+ ;;
+ "isolated invalid"*)
+ VAL=-2
+ ;;
+ esac
+ [[ $EVAL != $VAL ]] && return 1
+ done
+ return 0
+}
+
+#
+# Run cpuset state transition test
+# $1 - test matrix name
+#
+# This test is somewhat fragile as delays (sleep x) are added in various
+# places to make sure state changes are fully propagated before the next
+# action. These delays may need to be adjusted if running in a slower machine.
+#
+run_state_test()
+{
+ TEST=$1
+ CONTROLLER=cpuset
+ CPULIST=0-6
+ I=0
+ eval CNT="\${#$TEST[@]}"
+
+ reset_cgroup_states
+ echo $CPULIST > cpuset.cpus
+ echo root > cpuset.cpus.partition
+ console_msg "Running state transition test ..."
+
+ while [[ $I -lt $CNT ]]
+ do
+ echo "Running test $I ..." > /dev/console
+ eval set -- "\${$TEST[$I]}"
+ ROOT=$1
+ OLD_A1=$2
+ OLD_A2=$3
+ OLD_A3=$4
+ OLD_B1=$5
+ NEW_A1=$6
+ NEW_A2=$7
+ NEW_A3=$8
+ NEW_B1=$9
+ RESULT=${10}
+ ECPUS=${11}
+ STATES=${12}
+
+ set_ctrl_state_noerr . $ROOT
+ set_ctrl_state_noerr A1 $OLD_A1
+ set_ctrl_state_noerr A1/A2 $OLD_A2
+ set_ctrl_state_noerr A1/A2/A3 $OLD_A3
+ set_ctrl_state_noerr B1 $OLD_B1
+ RETVAL=0
+ set_ctrl_state A1 $NEW_A1; ((RETVAL += $?))
+ set_ctrl_state A1/A2 $NEW_A2; ((RETVAL += $?))
+ set_ctrl_state A1/A2/A3 $NEW_A3; ((RETVAL += $?))
+ set_ctrl_state B1 $NEW_B1; ((RETVAL += $?))
+
+ [[ $RETVAL -ne $RESULT ]] && {
+ echo "Test $TEST[$I] failed result check!"
+ eval echo \"\${$TEST[$I]}\"
+ dump_states
+ online_cpus
+ exit 1
+ }
+
+ [[ -n "$ECPUS" && "$ECPUS" != . ]] && {
+ check_effective_cpus $ECPUS
+ [[ $? -ne 0 ]] && {
+ echo "Test $TEST[$I] failed effective CPU check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ [[ -n "$STATES" ]] && {
+ check_cgroup_states $STATES
+ [[ $? -ne 0 ]] && {
+ echo "FAILED: Test $TEST[$I] failed states check!"
+ eval echo \"\${$TEST[$I]}\"
+ echo
+ dump_states
+ online_cpus
+ exit 1
+ }
+ }
+
+ reset_cgroup_states
+ #
+ # Check to see if effective cpu list changes
+ #
+ pause 0.05
+ NEWLIST=$(cat cpuset.cpus.effective)
+ [[ $NEWLIST != $CPULIST ]] && {
+ echo "Effective cpus changed to $NEWLIST after test $I!"
+ exit 1
+ }
+ [[ -n "$VERBOSE" ]] && echo "Test $I done."
+ ((I++))
+ done
+ echo "All $I tests of $TEST PASSED."
+
+ echo member > cpuset.cpus.partition
+}
+
+#
+# Wait for inotify event for the given file and read it
+# $1: cgroup file to wait for
+# $2: file to store the read result
+#
+wait_inotify()
+{
+ CGROUP_FILE=$1
+ OUTPUT_FILE=$2
+
+ $WAIT_INOTIFY $CGROUP_FILE
+ cat $CGROUP_FILE > $OUTPUT_FILE
+}
+
+#
+# Test if inotify events are properly generated when going into and out of
+# invalid partition state.
+#
+test_inotify()
+{
+ ERR=0
+ PRS=/tmp/.prs_$$
+ [[ -f $WAIT_INOTIFY ]] || {
+ echo "wait_inotify not found, inotify test SKIPPED."
+ return
+ }
+
+ pause 0.01
+ echo 1 > cpuset.cpus
+ echo 0 > cgroup.procs
+ echo root > cpuset.cpus.partition
+ pause 0.01
+ rm -f $PRS
+ wait_inotify $PWD/cpuset.cpus.partition $PRS &
+ pause 0.01
+ set_ctrl_state . "O1-0"
+ pause 0.01
+ check_cgroup_states ".:P-1"
+ if [[ $? -ne 0 ]]
+ then
+ echo "FAILED: Inotify test - partition not invalid"
+ ERR=1
+ elif [[ ! -f $PRS ]]
+ then
+ echo "FAILED: Inotify test - event not generated"
+ ERR=1
+ kill %1
+ elif [[ $(cat $PRS) != "root invalid"* ]]
+ then
+ echo "FAILED: Inotify test - incorrect state"
+ cat $PRS
+ ERR=1
+ fi
+ online_cpus
+ echo member > cpuset.cpus.partition
+ echo 0 > ../cgroup.procs
+ if [[ $ERR -ne 0 ]]
+ then
+ exit 1
+ else
+ echo "Inotify test PASSED"
+ fi
+}
+
+run_state_test TEST_MATRIX
+test_isolated
+test_inotify
+echo "All tests PASSED."
+cd ..
+rmdir test
diff --git a/tools/testing/selftests/cgroup/wait_inotify.c b/tools/testing/selftests/cgroup/wait_inotify.c
new file mode 100644
index 000000000000..e11b431e1b62
--- /dev/null
+++ b/tools/testing/selftests/cgroup/wait_inotify.c
@@ -0,0 +1,87 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Wait until an inotify event on the given cgroup file.
+ */
+#include <linux/limits.h>
+#include <sys/inotify.h>
+#include <sys/mman.h>
+#include <sys/ptrace.h>
+#include <sys/stat.h>
+#include <sys/types.h>
+#include <errno.h>
+#include <fcntl.h>
+#include <poll.h>
+#include <stdio.h>
+#include <stdlib.h>
+#include <string.h>
+#include <unistd.h>
+
+static const char usage[] = "Usage: %s [-v] <cgroup_file>\n";
+static char *file;
+static int verbose;
+
+static inline void fail_message(char *msg)
+{
+ fprintf(stderr, msg, file);
+ exit(1);
+}
+
+int main(int argc, char *argv[])
+{
+ char *cmd = argv[0];
+ int c, fd;
+ struct pollfd fds = { .events = POLLIN, };
+
+ while ((c = getopt(argc, argv, "v")) != -1) {
+ switch (c) {
+ case 'v':
+ verbose++;
+ break;
+ }
+ argv++, argc--;
+ }
+
+ if (argc != 2) {
+ fprintf(stderr, usage, cmd);
+ return -1;
+ }
+ file = argv[1];
+ fd = open(file, O_RDONLY);
+ if (fd < 0)
+ fail_message("Cgroup file %s not found!\n");
+ close(fd);
+
+ fd = inotify_init();
+ if (fd < 0)
+ fail_message("inotify_init() fails on %s!\n");
+ if (inotify_add_watch(fd, file, IN_MODIFY) < 0)
+ fail_message("inotify_add_watch() fails on %s!\n");
+ fds.fd = fd;
+
+ /*
+ * poll waiting loop
+ */
+ for (;;) {
+ int ret = poll(&fds, 1, 10000);
+
+ if (ret < 0) {
+ if (errno == EINTR)
+ continue;
+ perror("poll");
+ exit(1);
+ }
+ if ((ret > 0) && (fds.revents & POLLIN))
+ break;
+ }
+ if (verbose) {
+ struct inotify_event events[10];
+ long len;
+
+ usleep(1000);
+ len = read(fd, events, sizeof(events));
+ printf("Number of events read = %ld\n",
+ len/sizeof(struct inotify_event));
+ }
+ close(fd);
+ return 0;
+}
--
2.31.1

2022-09-01 21:21:35

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 04/10] cgroup/cpuset: Relax constraints to partition & cpus changes

Currently, enabling a partition root is only allowed if all the
constraints of a valid partition are satisfied. Even changes to
"cpuset.cpus" may not be allowed in some cases. Moreover, there are
limits to changes made to a parent cpuset if it is a valid partition
root. This is contrary to the general cgroup v2 philosophy.

This patch relaxes the constraints of changing the state of "cpuset.cpus"
and "cpuset.cpus.partition". Now all valid changes ("member" or "root")
to "cpuset.cpus.partition" are allowed even if there are child cpusets
underneath it.

Trying to make a cpuset a partition root, however, will cause its state
to become invalid if the following constraints of a valid partition
root are not satisfied.

1) The "cpuset.cpus" is non-empty and exclusive.
2) The parent cpuset is a valid partition root.
3) The "cpuset.cpus" overlaps parent's "cpuset.cpus".

Similarly, almost all changes to "cpuset.cpus" are allowed with the
exception that if the underlying CS_CPU_EXCLUSIVE flag is set, the
exclusivity rule will still apply.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 409 ++++++++++++++++++++++-------------------
1 file changed, 215 insertions(+), 194 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index cdde9ef05ad2..7f64bb1a7bef 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1222,13 +1222,15 @@ enum subparts_cmd {
partcmd_update, /* Update parent's subparts_cpus */
};

+static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
+ int turning_on);
/**
* update_parent_subparts_cpumask - update subparts_cpus mask of parent cpuset
* @cpuset: The cpuset that requests change in partition root state
* @cmd: Partition root state change command
* @newmask: Optional new cpumask for partcmd_update
* @tmp: Temporary addmask and delmask
- * Return: 0, 1 or an error code
+ * Return: 0 or -1 (error)
*
* For partcmd_enable, the cpuset is being transformed from a non-partition
* root to a partition root. The cpus_allowed mask of the given cpuset will
@@ -1239,28 +1241,22 @@ enum subparts_cmd {
* For partcmd_disable, the cpuset is being transformed from a partition
* root back to a non-partition root. Any CPUs in cpus_allowed that are in
* parent's subparts_cpus will be taken away from that cpumask and put back
- * into parent's effective_cpus. 0 should always be returned.
+ * into parent's effective_cpus. 0 will always be returned.
*
- * For partcmd_update, if the optional newmask is specified, the cpu
- * list is to be changed from cpus_allowed to newmask. Otherwise,
- * cpus_allowed is assumed to remain the same. The cpuset should either
- * be a partition root or an invalid partition root. The partition root
- * state may change if newmask is NULL and none of the requested CPUs can
- * be granted by the parent. The function will return 1 if changes to
- * parent's subparts_cpus and effective_cpus happen or 0 otherwise.
- * Error code should only be returned when newmask is non-NULL.
+ * For partcmd_update, if the optional newmask is specified, the cpu list is
+ * to be changed from cpus_allowed to newmask. Otherwise, cpus_allowed is
+ * assumed to remain the same. The cpuset should either be a valid or invalid
+ * partition root. The partition root state may change from valid to invalid
+ * or vice versa. An error code will only be returned if transitioning from
+ * invalid to valid violates the exclusivity rule.
*
* The partcmd_enable and partcmd_disable commands are used by
- * update_prstate(). The partcmd_update command is used by
- * update_cpumasks_hier() with newmask NULL and update_cpumask() with
- * newmask set.
- *
- * The checking is more strict when enabling partition root than the
- * other two commands.
+ * update_prstate(). An error code may be returned and the caller will check
+ * for error.
*
- * Because of the implicit cpu exclusive nature of a partition root,
- * cpumask changes that violates the cpu exclusivity rule will not be
- * permitted when checked by validate_change().
+ * The partcmd_update command is used by update_cpumasks_hier() with newmask
+ * NULL and update_cpumask() with newmask set. The callers won't check for
+ * error and so partition_root_state will be updated directly.
*/
static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
struct cpumask *newmask,
@@ -1282,14 +1278,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
if (!is_partition_valid(parent) ||
(newmask && cpumask_empty(newmask)) ||
(!newmask && cpumask_empty(cs->cpus_allowed)))
- return -EINVAL;
-
- /*
- * Enabling/disabling partition root is not allowed if there are
- * online children.
- */
- if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
- return -EBUSY;
+ return -1;

/*
* new_prs will only be changed for the partcmd_update command.
@@ -1298,79 +1287,98 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
old_prs = new_prs = cs->partition_root_state;
if (cmd == partcmd_enable) {
/*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus.
+ * Enabling partition root is not allowed if cpus_allowed
+ * doesn't overlap parent's cpus_allowed.
*/
- if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
- return -EINVAL;
+ if (!cpumask_intersects(cs->cpus_allowed, parent->cpus_allowed))
+ return -1;

/*
* A parent can be left with no CPU as long as there is no
- * task directly associated with the parent partition. For
- * such a parent, no new task can be moved into it.
+ * task directly associated with the parent partition.
*/
- if (cpumask_equal(cs->cpus_allowed, parent->effective_cpus) &&
+ if (!cpumask_intersects(cs->cpus_allowed, parent->effective_cpus) &&
partition_is_populated(parent, cs))
- return -EINVAL;
+ return -1;

cpumask_copy(tmp->addmask, cs->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
- deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+ /*
+ * Need to remove cpus from parent's subparts_cpus for valid
+ * partition root.
+ */
+ deleting = !is_prs_invalid(old_prs) &&
+ cpumask_and(tmp->delmask, cs->cpus_allowed,
parent->subparts_cpus);
} else if (newmask) {
/*
* partcmd_update with newmask:
*
+ * Compute add/delete mask to/from subparts_cpus
+ *
* delmask = cpus_allowed & ~newmask & parent->subparts_cpus
- * addmask = newmask & parent->effective_cpus
+ * addmask = newmask & parent->cpus_allowed
* & ~parent->subparts_cpus
*/
cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
deleting = cpumask_and(tmp->delmask, tmp->delmask,
parent->subparts_cpus);

- cpumask_and(tmp->addmask, newmask, parent->effective_cpus);
+ cpumask_and(tmp->addmask, newmask, parent->cpus_allowed);
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty
- * and there are tasks in the parent.
+ * Make partition invalid if parent's effective_cpus could
+ * become empty and there are tasks in the parent.
*/
if (adding &&
- cpumask_equal(parent->effective_cpus, tmp->addmask) &&
+ cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+ !cpumask_intersects(tmp->delmask, cpu_active_mask) &&
partition_is_populated(parent, cs)) {
- if (!deleting)
- return -EINVAL;
- /*
- * As some of the CPUs in subparts_cpus might have
- * been offlined, we need to compute the real delmask
- * to confirm that.
- */
- if (!cpumask_and(tmp->addmask, tmp->delmask,
- cpu_active_mask))
- return -EINVAL;
- cpumask_copy(tmp->addmask, parent->effective_cpus);
+ part_error = true;
+ adding = false;
+ deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+ parent->subparts_cpus);
}
} else {
/*
* partcmd_update w/o newmask:
*
- * addmask = cpus_allowed & parent->effective_cpus
+ * delmask = cpus_allowed & parent->subparts_cpus
+ * addmask = cpus_allowed & parent->cpus_allowed
+ * & ~parent->subparts_cpus
*
- * Note that parent's subparts_cpus may have been
- * pre-shrunk in case there is a change in the cpu list.
- * So no deletion is needed.
+ * This gets invoked either due to a hotplug event or from
+ * update_cpumasks_hier(). This can cause the state of a
+ * partition root to transition from valid to invalid or vice
+ * versa. So we still need to compute the addmask and delmask.
+
+ * A partition error happens when:
+ * 1) Cpuset is valid partition, but parent does not distribute
+ * out any CPUs.
+ * 2) Parent has tasks and all its effective CPUs will have
+ * to be distributed out.
*/
- adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
- parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
- partition_is_populated(parent, cs);
+ cpumask_and(tmp->addmask, cs->cpus_allowed,
+ parent->cpus_allowed);
+ adding = cpumask_andnot(tmp->addmask, tmp->addmask,
+ parent->subparts_cpus);
+ if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) ||
+ (adding &&
+ cpumask_subset(parent->effective_cpus, tmp->addmask) &&
+ partition_is_populated(parent, cs))) {
+ part_error = true;
+ adding = false;
+ }
+
+ if (part_error && is_partition_valid(cs) &&
+ parent->nr_subparts_cpus)
+ deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+ parent->subparts_cpus);
}

if (cmd == partcmd_update) {
- int prev_prs = cs->partition_root_state;
-
/*
* Check for possible transition between PRS_ROOT
* and PRS_INVALID_ROOT.
@@ -1385,27 +1393,23 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
new_prs = PRS_ROOT;
break;
}
- /*
- * Set part_error if previously in invalid state.
- */
- part_error = is_prs_invalid(prev_prs);
- }
-
- if (!part_error && is_prs_invalid(new_prs))
- return 0; /* Nothing need to be done */
-
- if (is_prs_invalid(new_prs)) {
- /*
- * Remove all its cpus from parent's subparts_cpus.
- */
- adding = false;
- deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
- parent->subparts_cpus);
}

if (!adding && !deleting && (new_prs == old_prs))
return 0;

+ /*
+ * Transitioning between invalid to valid or vice versa may require
+ * changing CS_CPU_EXCLUSIVE.
+ */
+ if (old_prs != new_prs) {
+ if (is_prs_invalid(old_prs) && !is_cpu_exclusive(cs) &&
+ (update_flag(CS_CPU_EXCLUSIVE, cs, 1) < 0))
+ return -1;
+ if (is_prs_invalid(new_prs) && is_cpu_exclusive(cs))
+ update_flag(CS_CPU_EXCLUSIVE, cs, 0);
+ }
+
/*
* Change the parent's subparts_cpus.
* Newly added CPUs will be removed from effective_cpus and
@@ -1435,15 +1439,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
cs->partition_root_state = new_prs;

spin_unlock_irq(&callback_lock);
+
+ if (adding || deleting)
+ update_tasks_cpumask(parent);
+
notify_partition_change(cs, old_prs);

- return cmd == partcmd_update;
+ return 0;
}

/*
* update_cpumasks_hier - Update effective cpumasks and tasks in the subtree
* @cs: the cpuset to consider
* @tmp: temp variables for calculating effective_cpus & partition setup
+ * @force: don't skip any descendant cpusets if set
*
* When configured cpumask is changed, the effective cpumasks of this cpuset
* and all its descendants need to be updated.
@@ -1452,7 +1461,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
*
* Called with cpuset_rwsem held
*/
-static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
+static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp,
+ bool force)
{
struct cpuset *cp;
struct cgroup_subsys_state *pos_css;
@@ -1462,6 +1472,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
rcu_read_lock();
cpuset_for_each_descendant_pre(cp, pos_css, cs) {
struct cpuset *parent = parent_cs(cp);
+ bool update_parent = false;

compute_effective_cpumask(tmp->new_cpus, cp, parent);

@@ -1489,9 +1500,9 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)

/*
* Skip the whole subtree if the cpumask remains the same
- * and has no partition root state.
+ * and has no partition root state and force flag not set.
*/
- if (!cp->partition_root_state &&
+ if (!cp->partition_root_state && !force &&
cpumask_equal(tmp->new_cpus, cp->effective_cpus)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
@@ -1507,33 +1518,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
old_prs = new_prs = cp->partition_root_state;
if ((cp != cs) && old_prs) {
switch (parent->partition_root_state) {
- case PRS_MEMBER:
- /*
- * If parent is not a partition root or an
- * invalid partition root, clear its state
- * and its CS_CPU_EXCLUSIVE flag.
- */
- WARN_ON_ONCE(!is_partition_invalid(cp));
- new_prs = PRS_MEMBER;
-
- /*
- * clear_bit() is an atomic operation and
- * readers aren't interested in the state
- * of CS_CPU_EXCLUSIVE anyway. So we can
- * just update the flag without holding
- * the callback_lock.
- */
- clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
- break;
-
case PRS_ROOT:
- if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
- update_tasks_cpumask(parent);
+ update_parent = true;
break;

- case PRS_INVALID_ROOT:
+ default:
/*
- * When parent is invalid, it has to be too.
+ * When parent is not a partition root or is
+ * invalid, child partition roots become
+ * invalid too.
*/
new_prs = PRS_INVALID_ROOT;
break;
@@ -1544,41 +1537,43 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
rcu_read_unlock();

+ if (update_parent) {
+ update_parent_subparts_cpumask(cp, partcmd_update, NULL,
+ tmp);
+ /*
+ * The cpuset partition_root_state may become
+ * invalid. Capture it.
+ */
+ new_prs = cp->partition_root_state;
+ }
+
spin_lock_irq(&callback_lock);

- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
+ /*
+ * Put all active subparts_cpus back to effective_cpus.
+ */
+ cpumask_or(tmp->new_cpus, tmp->new_cpus,
+ cp->subparts_cpus);
+ cpumask_and(tmp->new_cpus, tmp->new_cpus,
+ cpu_active_mask);
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
- } else if (cp->nr_subparts_cpus) {
+ }
+
+ cpumask_copy(cp->effective_cpus, tmp->new_cpus);
+ if (cp->nr_subparts_cpus) {
/*
* Make sure that effective_cpus & subparts_cpus
* are mutually exclusive.
- *
- * In the unlikely event that effective_cpus
- * becomes empty. we clear cp->nr_subparts_cpus and
- * let its child partition roots to compete for
- * CPUs again.
*/
cpumask_andnot(cp->effective_cpus, cp->effective_cpus,
cp->subparts_cpus);
- if (cpumask_empty(cp->effective_cpus)) {
- cpumask_copy(cp->effective_cpus, tmp->new_cpus);
- cpumask_clear(cp->subparts_cpus);
- cp->nr_subparts_cpus = 0;
- } else if (!cpumask_subset(cp->subparts_cpus,
- tmp->new_cpus)) {
- cpumask_andnot(cp->subparts_cpus,
- cp->subparts_cpus, tmp->new_cpus);
- cp->nr_subparts_cpus
- = cpumask_weight(cp->subparts_cpus);
- }
}

- if (new_prs != old_prs)
- cp->partition_root_state = new_prs;
-
+ cp->partition_root_state = new_prs;
spin_unlock_irq(&callback_lock);
+
notify_partition_change(cp, old_prs);

WARN_ON(!is_in_v2_mode() &&
@@ -1639,7 +1634,7 @@ static void update_sibling_cpumasks(struct cpuset *parent, struct cpuset *cs,
continue;

rcu_read_unlock();
- update_cpumasks_hier(sibling, tmp);
+ update_cpumasks_hier(sibling, tmp, false);
rcu_read_lock();
css_put(&sibling->css);
}
@@ -1699,27 +1694,36 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
#endif

if (cs->partition_root_state) {
- /* Cpumask of a partition root cannot be empty */
- if (cpumask_empty(trialcs->cpus_allowed))
- return -EINVAL;
- if (update_parent_subparts_cpumask(cs, partcmd_update,
- trialcs->cpus_allowed, &tmp) < 0)
- return -EINVAL;
+ update_parent_subparts_cpumask(cs, partcmd_update,
+ trialcs->cpus_allowed, &tmp);
}

+ compute_effective_cpumask(trialcs->effective_cpus, trialcs,
+ parent_cs(cs));
spin_lock_irq(&callback_lock);
cpumask_copy(cs->cpus_allowed, trialcs->cpus_allowed);

/*
- * Make sure that subparts_cpus is a subset of cpus_allowed.
+ * Make sure that subparts_cpus, if not empty, is a subset of
+ * cpus_allowed. Clear subparts_cpus if partition not valid or
+ * empty effective cpus with tasks.
*/
if (cs->nr_subparts_cpus) {
- cpumask_and(cs->subparts_cpus, cs->subparts_cpus, cs->cpus_allowed);
- cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+ if (!is_partition_valid(cs) ||
+ (cpumask_subset(trialcs->effective_cpus, cs->subparts_cpus) &&
+ partition_is_populated(cs, NULL))) {
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ } else {
+ cpumask_and(cs->subparts_cpus, cs->subparts_cpus,
+ cs->cpus_allowed);
+ cs->nr_subparts_cpus = cpumask_weight(cs->subparts_cpus);
+ }
}
spin_unlock_irq(&callback_lock);

- update_cpumasks_hier(cs, &tmp);
+ /* effective_cpus will be updated here */
+ update_cpumasks_hier(cs, &tmp, false);

if (cs->partition_root_state) {
struct cpuset *parent = parent_cs(cs);
@@ -2105,7 +2109,7 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
*/
static int update_prstate(struct cpuset *cs, int new_prs)
{
- int err, old_prs = cs->partition_root_state;
+ int err = 0, old_prs = cs->partition_root_state;
struct cpuset *parent = parent_cs(cs);
struct tmpmasks tmpmask;

@@ -2113,24 +2117,25 @@ static int update_prstate(struct cpuset *cs, int new_prs)
return 0;

/*
- * Cannot force a partial or invalid partition root to a full
- * partition root.
+ * For a previously invalid partition root, leave it at being
+ * invalid if new_prs is not "member".
*/
if (new_prs && is_prs_invalid(old_prs))
- return -EINVAL;
+ return 0;

if (alloc_cpumasks(NULL, &tmpmask))
return -ENOMEM;

- err = -EINVAL;
if (!old_prs) {
/*
* Turning on partition root requires setting the
* CS_CPU_EXCLUSIVE bit implicitly as well and cpus_allowed
- * cannot be NULL.
+ * cannot be empty.
*/
- if (cpumask_empty(cs->cpus_allowed))
+ if (cpumask_empty(cs->cpus_allowed)) {
+ err = 1;
goto out;
+ }

err = update_flag(CS_CPU_EXCLUSIVE, cs, 1);
if (err)
@@ -2144,19 +2149,22 @@ static int update_prstate(struct cpuset *cs, int new_prs)
}
} else {
/*
- * Turning off partition root will clear the
- * CS_CPU_EXCLUSIVE bit.
+ * Switching back to member is always allowed even if it
+ * disables child partitions.
*/
- if (is_prs_invalid(old_prs)) {
- update_flag(CS_CPU_EXCLUSIVE, cs, 0);
- err = 0;
- goto out;
- }
+ update_parent_subparts_cpumask(cs, partcmd_disable, NULL,
+ &tmpmask);

- err = update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, &tmpmask);
- if (err)
- goto out;
+ /*
+ * If there are child partitions, they will all become invalid.
+ */
+ if (unlikely(cs->nr_subparts_cpus)) {
+ spin_lock_irq(&callback_lock);
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ compute_effective_cpumask(cs->effective_cpus, cs, parent);
+ spin_unlock_irq(&callback_lock);
+ }

/* Turning off CS_CPU_EXCLUSIVE will not return error */
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
@@ -2169,15 +2177,24 @@ static int update_prstate(struct cpuset *cs, int new_prs)

rebuild_sched_domains_locked();
out:
- if (!err) {
- spin_lock_irq(&callback_lock);
- cs->partition_root_state = new_prs;
- spin_unlock_irq(&callback_lock);
- notify_partition_change(cs, old_prs);
- }
+ /*
+ * Make partition invalid if an error happen
+ */
+ if (err)
+ new_prs = PRS_INVALID_ROOT;
+ spin_lock_irq(&callback_lock);
+ cs->partition_root_state = new_prs;
+ spin_unlock_irq(&callback_lock);
+ /*
+ * Update child cpusets, if present.
+ * Force update if switching back to member.
+ */
+ if (!list_empty(&cs->css.children))
+ update_cpumasks_hier(cs, &tmpmask, !new_prs);

+ notify_partition_change(cs, old_prs);
free_cpumasks(NULL, &tmpmask);
- return err;
+ return 0;
}

/*
@@ -3244,12 +3261,31 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)

/*
* In the unlikely event that a partition root has empty
- * effective_cpus with tasks or its parent becomes invalid, we
- * have to transition it to the invalid state.
+ * effective_cpus with tasks, we will have to invalidate child
+ * partitions, if present, by setting nr_subparts_cpus to 0 to
+ * reclaim their cpus.
*/
- if (is_partition_valid(cs) &&
- ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
- is_partition_invalid(parent))) {
+ if (cs->nr_subparts_cpus && is_partition_valid(cs) &&
+ cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) {
+ spin_lock_irq(&callback_lock);
+ cs->nr_subparts_cpus = 0;
+ cpumask_clear(cs->subparts_cpus);
+ spin_unlock_irq(&callback_lock);
+ compute_effective_cpumask(&new_cpus, cs, parent);
+ }
+
+ /*
+ * Force the partition to become invalid if either one of
+ * the following conditions hold:
+ * 1) empty effective cpus but not valid empty partition.
+ * 2) parent is invalid or doesn't grant any cpus to child
+ * partitions.
+ */
+ if (is_partition_valid(cs) && (!parent->nr_subparts_cpus ||
+ (cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)))) {
+ int old_prs;
+
+ update_parent_subparts_cpumask(cs, partcmd_disable, NULL, tmp);
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
@@ -3258,40 +3294,25 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
compute_effective_cpumask(&new_cpus, cs, parent);
}

- /*
- * Force the partition to become invalid if either one of
- * the following conditions hold:
- * 1) empty effective cpus but not valid empty partition.
- * 2) parent is invalid or doesn't grant any cpus to child
- * partitions.
- */
- if (is_partition_invalid(parent) ||
- (cpumask_empty(&new_cpus) &&
- partition_is_populated(cs, NULL))) {
- int old_prs;
-
- update_parent_subparts_cpumask(cs, partcmd_disable,
- NULL, tmp);
- old_prs = cs->partition_root_state;
- if (!is_prs_invalid(old_prs)) {
- spin_lock_irq(&callback_lock);
- make_partition_invalid(cs);
- spin_unlock_irq(&callback_lock);
- notify_partition_change(cs, old_prs);
- }
+ old_prs = cs->partition_root_state;
+ if (is_partition_valid(cs)) {
+ spin_lock_irq(&callback_lock);
+ make_partition_invalid(cs);
+ spin_unlock_irq(&callback_lock);
+ notify_partition_change(cs, old_prs);
}
cpuset_force_rebuild();
}

/*
* On the other hand, an invalid partition root may be transitioned
- * back to a regular one or a partition root with no CPU allocated
- * from the parent may change to invalid.
+ * back to a regular one.
*/
- if (is_partition_valid(parent) && (is_partition_invalid(cs) ||
- !cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
- update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
- cpuset_force_rebuild();
+ else if (is_partition_valid(parent) && is_partition_invalid(cs)) {
+ update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp);
+ if (is_partition_valid(cs))
+ cpuset_force_rebuild();
+ }

update_tasks:
cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
--
2.31.1

2022-09-01 21:22:15

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 02/10] cgroup/cpuset: Miscellaneous cleanups & add helper functions

The partition root state (PRS) macro names do not currently match the
external names. Change them to match the external names and add helper
functions to read or change the state.

Shorten the cpuset argument of update_parent_subparts_cpumask() to cs
to match other cpuset functions.

Remove the new_prs argument from notify_partition_change() as the
cs->partition_root_state has already been set to new_prs before it
is called.

There is no functional change.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 169 ++++++++++++++++++++++-------------------
1 file changed, 90 insertions(+), 79 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 50bf837571ac..68404e488504 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -176,20 +176,18 @@ struct cpuset {
/*
* Partition root states:
*
- * 0 - not a partition root
- *
+ * 0 - member (not a partition root)
* 1 - partition root
- *
* -1 - invalid partition root
- * None of the cpus in cpus_allowed can be put into the parent's
- * subparts_cpus. In this case, the cpuset is not a real partition
- * root anymore. However, the CPU_EXCLUSIVE bit will still be set
- * and the cpuset can be restored back to a partition root if the
- * parent cpuset can give more CPUs back to this child cpuset.
*/
-#define PRS_DISABLED 0
-#define PRS_ENABLED 1
-#define PRS_ERROR -1
+#define PRS_MEMBER 0
+#define PRS_ROOT 1
+#define PRS_INVALID_ROOT -1
+
+static inline bool is_prs_invalid(int prs_state)
+{
+ return prs_state < 0;
+}

/*
* Temporary cpumasks for working with partitions that are passed among
@@ -269,25 +267,38 @@ static inline int is_spread_slab(const struct cpuset *cs)
return test_bit(CS_SPREAD_SLAB, &cs->flags);
}

-static inline int is_partition_root(const struct cpuset *cs)
+static inline int is_partition_valid(const struct cpuset *cs)
{
return cs->partition_root_state > 0;
}

+static inline int is_partition_invalid(const struct cpuset *cs)
+{
+ return cs->partition_root_state < 0;
+}
+
+/*
+ * Callers should hold callback_lock to modify partition_root_state.
+ */
+static inline void make_partition_invalid(struct cpuset *cs)
+{
+ cs->partition_root_state = PRS_INVALID_ROOT;
+}
+
/*
* Send notification event of whenever partition_root_state changes.
*/
-static inline void notify_partition_change(struct cpuset *cs,
- int old_prs, int new_prs)
+static inline void notify_partition_change(struct cpuset *cs, int old_prs)
{
- if (old_prs != new_prs)
- cgroup_file_notify(&cs->partition_file);
+ if (old_prs == cs->partition_root_state)
+ return;
+ cgroup_file_notify(&cs->partition_file);
}

static struct cpuset top_cpuset = {
.flags = ((1 << CS_ONLINE) | (1 << CS_CPU_EXCLUSIVE) |
(1 << CS_MEM_EXCLUSIVE)),
- .partition_root_state = PRS_ENABLED,
+ .partition_root_state = PRS_ROOT,
};

/**
@@ -876,7 +887,7 @@ static int generate_sched_domains(cpumask_var_t **domains,
csa[csn++] = cp;

/* skip @cp's subtree if not a partition root */
- if (!is_partition_root(cp))
+ if (!is_partition_valid(cp))
pos_css = css_rightmost_descendant(pos_css);
}
rcu_read_unlock();
@@ -1082,7 +1093,7 @@ static void rebuild_sched_domains_locked(void)
if (top_cpuset.nr_subparts_cpus) {
rcu_read_lock();
cpuset_for_each_descendant_pre(cs, pos_css, &top_cpuset) {
- if (!is_partition_root(cs)) {
+ if (!is_partition_valid(cs)) {
pos_css = css_rightmost_descendant(pos_css);
continue;
}
@@ -1216,11 +1227,11 @@ enum subparts_cmd {
* cpumask changes that violates the cpu exclusivity rule will not be
* permitted when checked by validate_change().
*/
-static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
+static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
struct cpumask *newmask,
struct tmpmasks *tmp)
{
- struct cpuset *parent = parent_cs(cpuset);
+ struct cpuset *parent = parent_cs(cs);
int adding; /* Moving cpus from effective_cpus to subparts_cpus */
int deleting; /* Moving cpus from subparts_cpus to effective_cpus */
int old_prs, new_prs;
@@ -1233,16 +1244,16 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
* The new cpumask, if present, or the current cpus_allowed must
* not be empty.
*/
- if (!is_partition_root(parent) ||
+ if (!is_partition_valid(parent) ||
(newmask && cpumask_empty(newmask)) ||
- (!newmask && cpumask_empty(cpuset->cpus_allowed)))
+ (!newmask && cpumask_empty(cs->cpus_allowed)))
return -EINVAL;

/*
* Enabling/disabling partition root is not allowed if there are
* online children.
*/
- if ((cmd != partcmd_update) && css_has_online_children(&cpuset->css))
+ if ((cmd != partcmd_update) && css_has_online_children(&cs->css))
return -EBUSY;

/*
@@ -1251,20 +1262,21 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
* CPU will be left after that.
*/
if ((cmd == partcmd_enable) &&
- (!cpumask_subset(cpuset->cpus_allowed, parent->effective_cpus) ||
- cpumask_equal(cpuset->cpus_allowed, parent->effective_cpus)))
+ (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
+ cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
return -EINVAL;

/*
* A cpumask update cannot make parent's effective_cpus become empty.
+ * new_prs will only be changed for the partcmd_update command.
*/
adding = deleting = false;
- old_prs = new_prs = cpuset->partition_root_state;
+ old_prs = new_prs = cs->partition_root_state;
if (cmd == partcmd_enable) {
- cpumask_copy(tmp->addmask, cpuset->cpus_allowed);
+ cpumask_copy(tmp->addmask, cs->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
- deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+ deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
parent->subparts_cpus);
} else if (newmask) {
/*
@@ -1274,7 +1286,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
* addmask = newmask & parent->effective_cpus
* & ~parent->subparts_cpus
*/
- cpumask_andnot(tmp->delmask, cpuset->cpus_allowed, newmask);
+ cpumask_andnot(tmp->delmask, cs->cpus_allowed, newmask);
deleting = cpumask_and(tmp->delmask, tmp->delmask,
parent->subparts_cpus);

@@ -1308,44 +1320,44 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
* pre-shrunk in case there is a change in the cpu list.
* So no deletion is needed.
*/
- adding = cpumask_and(tmp->addmask, cpuset->cpus_allowed,
+ adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
parent->effective_cpus);
part_error = cpumask_equal(tmp->addmask,
parent->effective_cpus);
}

if (cmd == partcmd_update) {
- int prev_prs = cpuset->partition_root_state;
+ int prev_prs = cs->partition_root_state;

/*
- * Check for possible transition between PRS_ENABLED
- * and PRS_ERROR.
+ * Check for possible transition between PRS_ROOT
+ * and PRS_INVALID_ROOT.
*/
- switch (cpuset->partition_root_state) {
- case PRS_ENABLED:
+ switch (cs->partition_root_state) {
+ case PRS_ROOT:
if (part_error)
- new_prs = PRS_ERROR;
+ new_prs = PRS_INVALID_ROOT;
break;
- case PRS_ERROR:
+ case PRS_INVALID_ROOT:
if (!part_error)
- new_prs = PRS_ENABLED;
+ new_prs = PRS_ROOT;
break;
}
/*
* Set part_error if previously in invalid state.
*/
- part_error = (prev_prs == PRS_ERROR);
+ part_error = is_prs_invalid(prev_prs);
}

- if (!part_error && (new_prs == PRS_ERROR))
+ if (!part_error && is_prs_invalid(new_prs))
return 0; /* Nothing need to be done */

- if (new_prs == PRS_ERROR) {
+ if (is_prs_invalid(new_prs)) {
/*
* Remove all its cpus from parent's subparts_cpus.
*/
adding = false;
- deleting = cpumask_and(tmp->delmask, cpuset->cpus_allowed,
+ deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
parent->subparts_cpus);
}

@@ -1378,10 +1390,10 @@ static int update_parent_subparts_cpumask(struct cpuset *cpuset, int cmd,
parent->nr_subparts_cpus = cpumask_weight(parent->subparts_cpus);

if (old_prs != new_prs)
- cpuset->partition_root_state = new_prs;
+ cs->partition_root_state = new_prs;

spin_unlock_irq(&callback_lock);
- notify_partition_change(cpuset, old_prs, new_prs);
+ notify_partition_change(cs, old_prs);

return cmd == partcmd_update;
}
@@ -1446,15 +1458,14 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
old_prs = new_prs = cp->partition_root_state;
if ((cp != cs) && old_prs) {
switch (parent->partition_root_state) {
- case PRS_DISABLED:
+ case PRS_MEMBER:
/*
* If parent is not a partition root or an
* invalid partition root, clear its state
* and its CS_CPU_EXCLUSIVE flag.
*/
- WARN_ON_ONCE(cp->partition_root_state
- != PRS_ERROR);
- new_prs = PRS_DISABLED;
+ WARN_ON_ONCE(!is_partition_invalid(cp));
+ new_prs = PRS_MEMBER;

/*
* clear_bit() is an atomic operation and
@@ -1466,16 +1477,16 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
clear_bit(CS_CPU_EXCLUSIVE, &cp->flags);
break;

- case PRS_ENABLED:
+ case PRS_ROOT:
if (update_parent_subparts_cpumask(cp, partcmd_update, NULL, tmp))
update_tasks_cpumask(parent);
break;

- case PRS_ERROR:
+ case PRS_INVALID_ROOT:
/*
* When parent is invalid, it has to be too.
*/
- new_prs = PRS_ERROR;
+ new_prs = PRS_INVALID_ROOT;
break;
}
}
@@ -1487,7 +1498,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
spin_lock_irq(&callback_lock);

cpumask_copy(cp->effective_cpus, tmp->new_cpus);
- if (cp->nr_subparts_cpus && (new_prs != PRS_ENABLED)) {
+ if (cp->nr_subparts_cpus && !is_partition_valid(cp)) {
cp->nr_subparts_cpus = 0;
cpumask_clear(cp->subparts_cpus);
} else if (cp->nr_subparts_cpus) {
@@ -1519,7 +1530,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
cp->partition_root_state = new_prs;

spin_unlock_irq(&callback_lock);
- notify_partition_change(cp, old_prs, new_prs);
+ notify_partition_change(cp, old_prs);

WARN_ON(!is_in_v2_mode() &&
!cpumask_equal(cp->cpus_allowed, cp->effective_cpus));
@@ -1535,7 +1546,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
if (!cpumask_empty(cp->cpus_allowed) &&
is_sched_load_balance(cp) &&
(!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) ||
- is_partition_root(cp)))
+ is_partition_valid(cp)))
need_rebuild_sched_domains = true;

rcu_read_lock();
@@ -2035,10 +2046,11 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
return err;
}

-/*
+/**
* update_prstate - update partition_root_state
- * cs: the cpuset to update
- * new_prs: new partition root state
+ * @cs: the cpuset to update
+ * @new_prs: new partition root state
+ * Return: 0 if successful, < 0 if error
*
* Call with cpuset_rwsem held.
*/
@@ -2055,7 +2067,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* Cannot force a partial or invalid partition root to a full
* partition root.
*/
- if (new_prs && (old_prs == PRS_ERROR))
+ if (new_prs && is_prs_invalid(old_prs))
return -EINVAL;

if (alloc_cpumasks(NULL, &tmpmask))
@@ -2086,7 +2098,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
* Turning off partition root will clear the
* CS_CPU_EXCLUSIVE bit.
*/
- if (old_prs == PRS_ERROR) {
+ if (is_prs_invalid(old_prs)) {
update_flag(CS_CPU_EXCLUSIVE, cs, 0);
err = 0;
goto out;
@@ -2112,7 +2124,7 @@ static int update_prstate(struct cpuset *cs, int new_prs)
spin_lock_irq(&callback_lock);
cs->partition_root_state = new_prs;
spin_unlock_irq(&callback_lock);
- notify_partition_change(cs, old_prs, new_prs);
+ notify_partition_change(cs, old_prs);
}

free_cpumasks(NULL, &tmpmask);
@@ -2604,13 +2616,13 @@ static int sched_partition_show(struct seq_file *seq, void *v)
struct cpuset *cs = css_cs(seq_css(seq));

switch (cs->partition_root_state) {
- case PRS_ENABLED:
+ case PRS_ROOT:
seq_puts(seq, "root\n");
break;
- case PRS_DISABLED:
+ case PRS_MEMBER:
seq_puts(seq, "member\n");
break;
- case PRS_ERROR:
+ case PRS_INVALID_ROOT:
seq_puts(seq, "root invalid\n");
break;
}
@@ -2630,9 +2642,9 @@ static ssize_t sched_partition_write(struct kernfs_open_file *of, char *buf,
* Convert "root" to ENABLED, and convert "member" to DISABLED.
*/
if (!strcmp(buf, "root"))
- val = PRS_ENABLED;
+ val = PRS_ROOT;
else if (!strcmp(buf, "member"))
- val = PRS_DISABLED;
+ val = PRS_MEMBER;
else
return -EINVAL;

@@ -2931,7 +2943,7 @@ static void cpuset_css_offline(struct cgroup_subsys_state *css)
cpus_read_lock();
percpu_down_write(&cpuset_rwsem);

- if (is_partition_root(cs))
+ if (is_partition_valid(cs))
update_prstate(cs, 0);

if (!cgroup_subsys_on_dfl(cpuset_cgrp_subsys) &&
@@ -3176,11 +3188,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)

/*
* In the unlikely event that a partition root has empty
- * effective_cpus or its parent becomes erroneous, we have to
- * transition it to the erroneous state.
+ * effective_cpus or its parent becomes invalid, we have to
+ * transition it to the invalid state.
*/
- if (is_partition_root(cs) && (cpumask_empty(&new_cpus) ||
- (parent->partition_root_state == PRS_ERROR))) {
+ if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
+ is_partition_invalid(parent))) {
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
cs->nr_subparts_cpus = 0;
@@ -3195,30 +3207,29 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
* the current partition and let the child partitions
* fight for available CPUs.
*/
- if ((parent->partition_root_state == PRS_ERROR) ||
+ if (is_partition_invalid(parent) ||
cpumask_empty(&new_cpus)) {
int old_prs;

update_parent_subparts_cpumask(cs, partcmd_disable,
NULL, tmp);
old_prs = cs->partition_root_state;
- if (old_prs != PRS_ERROR) {
+ if (!is_prs_invalid(old_prs)) {
spin_lock_irq(&callback_lock);
- cs->partition_root_state = PRS_ERROR;
+ make_partition_invalid(cs);
spin_unlock_irq(&callback_lock);
- notify_partition_change(cs, old_prs, PRS_ERROR);
+ notify_partition_change(cs, old_prs);
}
}
cpuset_force_rebuild();
}

/*
- * On the other hand, an erroneous partition root may be transitioned
+ * On the other hand, an invalid partition root may be transitioned
* back to a regular one or a partition root with no CPU allocated
- * from the parent may change to erroneous.
+ * from the parent may change to invalid.
*/
- if (is_partition_root(parent) &&
- ((cs->partition_root_state == PRS_ERROR) ||
+ if (is_partition_valid(parent) && (is_partition_invalid(cs) ||
!cpumask_intersects(&new_cpus, parent->subparts_cpus)) &&
update_parent_subparts_cpumask(cs, partcmd_update, NULL, tmp))
cpuset_force_rebuild();
--
2.31.1

2022-09-01 21:23:31

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 03/10] cgroup/cpuset: Allow no-task partition to have empty cpuset.cpus.effective

Currently, a partition root cannot have empty "cpuset.cpus.effective".
As a result, a parent partition root cannot distribute out all its
CPUs to child partitions with no CPUs left. However in most cases,
there shouldn't be any tasks associated with intermediate nodes of the
default hierarchy. So the current rule is too restrictive and can waste
valuable CPU resource.

To address this issue, we are now allowing a partition to have empty
"cpuset.cpus.effective" as long as it has no task. Since cpuset is
threaded, no-internal-process rule does not apply. So it is possible
to have tasks in a partition root with child sub-partitions even though
that should be uncommon.

A parent partition with no task can now have all its CPUs distributed out
to its child partitions. The top cpuset always have some house-keeping
tasks running and so its list of effective cpu can't be empty.

Once a partition with empty "cpuset.cpus.effective" is formed, no
new task can be moved into it until "cpuset.cpus.effective" becomes
non-empty.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 109 +++++++++++++++++++++++++++++++----------
1 file changed, 84 insertions(+), 25 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 68404e488504..cdde9ef05ad2 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -416,6 +416,41 @@ static inline bool is_in_v2_mode(void)
(cpuset_cgrp_subsys.root->flags & CGRP_ROOT_CPUSET_V2_MODE);
}

+/**
+ * partition_is_populated - check if partition has tasks
+ * @cs: partition root to be checked
+ * @excluded_child: a child cpuset to be excluded in task checking
+ * Return: true if there are tasks, false otherwise
+ *
+ * It is assumed that @cs is a valid partition root. @excluded_child should
+ * be non-NULL when this cpuset is going to become a partition itself.
+ */
+static inline bool partition_is_populated(struct cpuset *cs,
+ struct cpuset *excluded_child)
+{
+ struct cgroup_subsys_state *css;
+ struct cpuset *child;
+
+ if (cs->css.cgroup->nr_populated_csets)
+ return true;
+ if (!excluded_child && !cs->nr_subparts_cpus)
+ return cgroup_is_populated(cs->css.cgroup);
+
+ rcu_read_lock();
+ cpuset_for_each_child(child, css, cs) {
+ if (child == excluded_child)
+ continue;
+ if (is_partition_valid(child))
+ continue;
+ if (cgroup_is_populated(child->css.cgroup)) {
+ rcu_read_unlock();
+ return true;
+ }
+ }
+ rcu_read_unlock();
+ return false;
+}
+
/*
* Return in pmask the portion of a task's cpusets's cpus_allowed that
* are online and are capable of running the task. If none are found,
@@ -1257,22 +1292,27 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
return -EBUSY;

/*
- * Enabling partition root is not allowed if not all the CPUs
- * can be granted from parent's effective_cpus or at least one
- * CPU will be left after that.
- */
- if ((cmd == partcmd_enable) &&
- (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus) ||
- cpumask_equal(cs->cpus_allowed, parent->effective_cpus)))
- return -EINVAL;
-
- /*
- * A cpumask update cannot make parent's effective_cpus become empty.
* new_prs will only be changed for the partcmd_update command.
*/
adding = deleting = false;
old_prs = new_prs = cs->partition_root_state;
if (cmd == partcmd_enable) {
+ /*
+ * Enabling partition root is not allowed if not all the CPUs
+ * can be granted from parent's effective_cpus.
+ */
+ if (!cpumask_subset(cs->cpus_allowed, parent->effective_cpus))
+ return -EINVAL;
+
+ /*
+ * A parent can be left with no CPU as long as there is no
+ * task directly associated with the parent partition. For
+ * such a parent, no new task can be moved into it.
+ */
+ if (cpumask_equal(cs->cpus_allowed, parent->effective_cpus) &&
+ partition_is_populated(parent, cs))
+ return -EINVAL;
+
cpumask_copy(tmp->addmask, cs->cpus_allowed);
adding = true;
} else if (cmd == partcmd_disable) {
@@ -1294,10 +1334,12 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
/*
- * Return error if the new effective_cpus could become empty.
+ * Return error if the new effective_cpus could become empty
+ * and there are tasks in the parent.
*/
if (adding &&
- cpumask_equal(parent->effective_cpus, tmp->addmask)) {
+ cpumask_equal(parent->effective_cpus, tmp->addmask) &&
+ partition_is_populated(parent, cs)) {
if (!deleting)
return -EINVAL;
/*
@@ -1322,8 +1364,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
*/
adding = cpumask_and(tmp->addmask, cs->cpus_allowed,
parent->effective_cpus);
- part_error = cpumask_equal(tmp->addmask,
- parent->effective_cpus);
+ part_error = cpumask_equal(tmp->addmask, parent->effective_cpus) &&
+ partition_is_populated(parent, cs);
}

if (cmd == partcmd_update) {
@@ -1425,9 +1467,15 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)

/*
* If it becomes empty, inherit the effective mask of the
- * parent, which is guaranteed to have some CPUs.
+ * parent, which is guaranteed to have some CPUs unless
+ * it is a partition root that has explicitly distributed
+ * out all its CPUs.
*/
if (is_in_v2_mode() && cpumask_empty(tmp->new_cpus)) {
+ if (is_partition_valid(cp) &&
+ cpumask_equal(cp->cpus_allowed, cp->subparts_cpus))
+ goto update_parent_subparts;
+
cpumask_copy(tmp->new_cpus, parent->effective_cpus);
if (!cp->use_parent_ecpus) {
cp->use_parent_ecpus = true;
@@ -1449,6 +1497,7 @@ static void update_cpumasks_hier(struct cpuset *cs, struct tmpmasks *tmp)
continue;
}

+update_parent_subparts:
/*
* update_parent_subparts_cpumask() should have been called
* for cs already in update_cpumask(). We should also call
@@ -2254,6 +2303,12 @@ static int cpuset_can_attach(struct cgroup_taskset *tset)
(cpumask_empty(cs->cpus_allowed) || nodes_empty(cs->mems_allowed)))
goto out_unlock;

+ /*
+ * Task cannot be moved to a cpuset with empty effective cpus.
+ */
+ if (cpumask_empty(cs->effective_cpus))
+ goto out_unlock;
+
cgroup_taskset_for_each(task, css, tset) {
ret = task_can_attach(task, cs->effective_cpus);
if (ret)
@@ -3119,7 +3174,8 @@ hotplug_update_tasks(struct cpuset *cs,
struct cpumask *new_cpus, nodemask_t *new_mems,
bool cpus_updated, bool mems_updated)
{
- if (cpumask_empty(new_cpus))
+ /* A partition root is allowed to have empty effective cpus */
+ if (cpumask_empty(new_cpus) && !is_partition_valid(cs))
cpumask_copy(new_cpus, parent_cs(cs)->effective_cpus);
if (nodes_empty(*new_mems))
*new_mems = parent_cs(cs)->effective_mems;
@@ -3188,10 +3244,11 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)

/*
* In the unlikely event that a partition root has empty
- * effective_cpus or its parent becomes invalid, we have to
- * transition it to the invalid state.
+ * effective_cpus with tasks or its parent becomes invalid, we
+ * have to transition it to the invalid state.
*/
- if (is_partition_valid(cs) && (cpumask_empty(&new_cpus) ||
+ if (is_partition_valid(cs) &&
+ ((cpumask_empty(&new_cpus) && partition_is_populated(cs, NULL)) ||
is_partition_invalid(parent))) {
if (cs->nr_subparts_cpus) {
spin_lock_irq(&callback_lock);
@@ -3202,13 +3259,15 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
}

/*
- * If the effective_cpus is empty because the child
- * partitions take away all the CPUs, we can keep
- * the current partition and let the child partitions
- * fight for available CPUs.
+ * Force the partition to become invalid if either one of
+ * the following conditions hold:
+ * 1) empty effective cpus but not valid empty partition.
+ * 2) parent is invalid or doesn't grant any cpus to child
+ * partitions.
*/
if (is_partition_invalid(parent) ||
- cpumask_empty(&new_cpus)) {
+ (cpumask_empty(&new_cpus) &&
+ partition_is_populated(cs, NULL))) {
int old_prs;

update_parent_subparts_cpumask(cs, partcmd_disable,
--
2.31.1

2022-09-01 21:25:08

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 09/10] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Update Documentation/admin-guide/cgroup-v2.rst on the newly introduced
"isolated" cpuset partition type as well as other changes made in other
cpuset patches.

Signed-off-by: Waiman Long <[email protected]>
---
Documentation/admin-guide/cgroup-v2.rst | 150 +++++++++++++-----------
1 file changed, 84 insertions(+), 66 deletions(-)

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index be4a77baf784..76b3ea9fd5c5 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2186,74 +2186,92 @@ Cpuset Interface Files
It accepts only the following input values when written to.

======== ================================
- "root" a partition root
- "member" a non-root member of a partition
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
======== ================================

- When set to be a partition root, the current cgroup is the
- root of a new partition or scheduling domain that comprises
- itself and all its descendants except those that are separate
- partition roots themselves and their descendants. The root
- cgroup is always a partition root.
-
- There are constraints on where a partition root can be set.
- It can only be set in a cgroup if all the following conditions
- are true.
-
- 1) The "cpuset.cpus" is not empty and the list of CPUs are
- exclusive, i.e. they are not shared by any of its siblings.
- 2) The parent cgroup is a partition root.
- 3) The "cpuset.cpus" is also a proper subset of the parent's
- "cpuset.cpus.effective".
- 4) There is no child cgroups with cpuset enabled. This is for
- eliminating corner cases that have to be handled if such a
- condition is allowed.
-
- Setting it to partition root will take the CPUs away from the
- effective CPUs of the parent cgroup. Once it is set, this
- file cannot be reverted back to "member" if there are any child
- cgroups with cpuset enabled.
-
- A parent partition cannot distribute all its CPUs to its
- child partitions. There must be at least one cpu left in the
- parent partition.
-
- Once becoming a partition root, changes to "cpuset.cpus" is
- generally allowed as long as the first condition above is true,
- the change will not take away all the CPUs from the parent
- partition and the new "cpuset.cpus" value is a superset of its
- children's "cpuset.cpus" values.
-
- Sometimes, external factors like changes to ancestors'
- "cpuset.cpus" or cpu hotplug can cause the state of the partition
- root to change. On read, the "cpuset.sched.partition" file
- can show the following values.
-
- ============== ==============================
- "member" Non-root member of a partition
- "root" Partition root
- "root invalid" Invalid partition root
- ============== ==============================
-
- It is a partition root if the first 2 partition root conditions
- above are true and at least one CPU from "cpuset.cpus" is
- granted by the parent cgroup.
-
- A partition root can become invalid if none of CPUs requested
- in "cpuset.cpus" can be granted by the parent cgroup or the
- parent cgroup is no longer a partition root itself. In this
- case, it is not a real partition even though the restriction
- of the first partition root condition above will still apply.
- The cpu affinity of all the tasks in the cgroup will then be
- associated with CPUs in the nearest ancestor partition.
-
- An invalid partition root can be transitioned back to a
- real partition root if at least one of the requested CPUs
- can now be granted by its parent. In this case, the cpu
- affinity of all the tasks in the formerly invalid partition
- will be associated to the CPUs of the newly formed partition.
- Changing the partition state of an invalid partition root to
- "member" is always allowed even if child cpusets are present.
+ The root cgroup is always a partition root and its state
+ cannot be changed. All other non-root cgroups start out as
+ "member".
+
+ When set to "root", the current cgroup is the root of a new
+ partition or scheduling domain that comprises itself and all
+ its descendants except those that are separate partition roots
+ themselves and their descendants.
+
+ When set to "isolated", the CPUs in that partition root will
+ be in an isolated state without any load balancing from the
+ scheduler. Tasks placed in such a partition with multiple
+ CPUs should be carefully distributed and bound to each of the
+ individual CPUs for optimal performance.
+
+ The value shown in "cpuset.cpus.effective" of a partition root
+ is the CPUs that the partition root can dedicate to a potential
+ new child partition root. The new child subtracts available
+ CPUs from its parent "cpuset.cpus.effective".
+
+ A partition root ("root" or "isolated") can be in one of the
+ two possible states - valid or invalid. An invalid partition
+ root is in a degraded state where some state information may
+ be retained, but behaves more like a "member".
+
+ All possible state transitions among "member", "root" and
+ "isolated" are allowed.
+
+ On read, the "cpuset.cpus.partition" file can show the following
+ values.
+
+ ====================== ==============================
+ "member" Non-root member of a partition
+ "root" Partition root
+ "isolated" Partition root without load balancing
+ "root invalid (<reason>)" Invalid partition root
+ "isolated invalid (<reason>)" Invalid isolated partition root
+ ====================== ==============================
+
+ In the case of an invalid partition root, a descriptive string on
+ why the partition is invalid is included within parentheses.
+
+ For a partition root to become valid, the following conditions
+ must be met.
+
+ 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
+ are not shared by any of its siblings (exclusivity rule).
+ 2) The parent cgroup is a valid partition root.
+ 3) The "cpuset.cpus" is not empty and must contain at least
+ one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
+ 4) The "cpuset.cpus.effective" cannot be empty unless there is
+ no task associated with this partition.
+
+ External events like hotplug or changes to "cpuset.cpus" can
+ cause a valid partition root to become invalid and vice versa.
+ Note that a task cannot be moved to a cgroup with empty
+ "cpuset.cpus.effective".
+
+ For a valid partition root with the sibling cpu exclusivity
+ rule enabled, changes made to "cpuset.cpus" that violate the
+ exclusivity rule will invalidate the partition as well as its
+ sibiling partitions with conflicting cpuset.cpus values. So
+ care must be taking in changing "cpuset.cpus".
+
+ A valid non-root parent partition may distribute out all its CPUs
+ to its child partitions when there is no task associated with it.
+
+ Care must be taken to change a valid partition root to
+ "member" as all its child partitions, if present, will become
+ invalid causing disruption to tasks running in those child
+ partitions. These inactivated partitions could be recovered if
+ their parent is switched back to a partition root with a proper
+ set of "cpuset.cpus".
+
+ Poll and inotify events are triggered whenever the state of
+ "cpuset.cpus.partition" changes. That includes changes caused
+ by write to "cpuset.cpus.partition", cpu hotplug or other
+ changes that modify the validity status of the partition.
+ This will allow user space agents to monitor unexpected changes
+ to "cpuset.cpus.partition" without the need to do continuous
+ polling.


Device controller
--
2.31.1

2022-09-01 21:40:24

by Waiman Long

[permalink] [raw]
Subject: [PATCH v12 08/10] cgroup/cpuset: Make partition invalid if cpumask change violates exclusivity rule

Currently, changes in "cpust.cpus" of a partition root is not allowed if
it violates the sibling cpu exclusivity rule when the check is done
in the validate_change() function. That is inconsistent with the
other cpuset changes that are always allowed but may make a partition
invalid.

Update the cpuset code to allow cpumask change even if it violates the
sibling cpu exclusivity rule, but invalidate the partition instead
just like the other changes. However, other sibling partitions with
conflicting cpumask will also be invalidated in order to not violating
the exclusivity rule. This behavior is specific to this partition
rule violation.

Note that a previous commit has made sibling cpu exclusivity rule check
the last check of validate_change(). So if -EINVAL is returned, we can
be sure that sibling cpu exclusivity rule violation is the only rule
that is broken.

Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 69 ++++++++++++++++++++++++++++++++++++------
1 file changed, 60 insertions(+), 9 deletions(-)

diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index b0d921d03f62..6baa977a71ba 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1256,6 +1256,7 @@ enum subparts_cmd {
partcmd_enable, /* Enable partition root */
partcmd_disable, /* Disable partition root */
partcmd_update, /* Update parent's subparts_cpus */
+ partcmd_invalidate, /* Make partition invalid */
};

static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
@@ -1286,13 +1287,17 @@ static int update_flag(cpuset_flagbits_t bit, struct cpuset *cs,
* or vice versa. An error code will only be returned if transitioning from
* invalid to valid violates the exclusivity rule.
*
+ * For partcmd_invalidate, the current partition will be made invalid.
+ *
* The partcmd_enable and partcmd_disable commands are used by
* update_prstate(). An error code may be returned and the caller will check
* for error.
*
* The partcmd_update command is used by update_cpumasks_hier() with newmask
- * NULL and update_cpumask() with newmask set. The callers won't check for
- * error and so partition_root_state and prs_error will be updated directly.
+ * NULL and update_cpumask() with newmask set. The partcmd_invalidate is used
+ * by update_cpumask() with NULL newmask. In both cases, the callers won't
+ * check for error and so partition_root_state and prs_error will be updated
+ * directly.
*/
static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
struct cpumask *newmask,
@@ -1320,7 +1325,8 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
return PERR_CPUSEMPTY;

/*
- * new_prs will only be changed for the partcmd_update command.
+ * new_prs will only be changed for the partcmd_update and
+ * partcmd_invalidate commands.
*/
adding = deleting = false;
old_prs = new_prs = cs->partition_root_state;
@@ -1350,6 +1356,20 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
deleting = !is_prs_invalid(old_prs) &&
cpumask_and(tmp->delmask, cs->cpus_allowed,
parent->subparts_cpus);
+ } else if (cmd == partcmd_invalidate) {
+ if (is_prs_invalid(old_prs))
+ return 0;
+
+ /*
+ * Make the current partition invalid. It is assumed that
+ * invalidation is caused by violating cpu exclusivity rule.
+ */
+ deleting = cpumask_and(tmp->delmask, cs->cpus_allowed,
+ parent->subparts_cpus);
+ if (old_prs > 0) {
+ new_prs = -old_prs;
+ part_error = PERR_NOTEXCL;
+ }
} else if (newmask) {
/*
* partcmd_update with newmask:
@@ -1403,6 +1423,7 @@ static int update_parent_subparts_cpumask(struct cpuset *cs, int cmd,
parent->cpus_allowed);
adding = cpumask_andnot(tmp->addmask, tmp->addmask,
parent->subparts_cpus);
+
if ((is_partition_valid(cs) && !parent->nr_subparts_cpus) ||
(adding &&
cpumask_subset(parent->effective_cpus, tmp->addmask) &&
@@ -1709,6 +1730,7 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
{
int retval;
struct tmpmasks tmp;
+ bool invalidate = false;

/* top_cpuset.cpus_allowed tracks cpu_online_mask; it's read-only */
if (cs == &top_cpuset)
@@ -1736,10 +1758,6 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
if (cpumask_equal(cs->cpus_allowed, trialcs->cpus_allowed))
return 0;

- retval = validate_change(cs, trialcs);
- if (retval < 0)
- return retval;
-
#ifdef CONFIG_CPUMASK_OFFSTACK
/*
* Use the cpumasks in trialcs for tmpmasks when they are pointers
@@ -1750,9 +1768,42 @@ static int update_cpumask(struct cpuset *cs, struct cpuset *trialcs,
tmp.new_cpus = trialcs->cpus_allowed;
#endif

+ retval = validate_change(cs, trialcs);
+
+ if ((retval == -EINVAL) && cgroup_subsys_on_dfl(cpuset_cgrp_subsys)) {
+ struct cpuset *cp, *parent;
+ struct cgroup_subsys_state *css;
+
+ /*
+ * The -EINVAL error code indicates that partition sibling
+ * CPU exclusivity rule has been violated. We still allow
+ * the cpumask change to proceed while invalidating the
+ * partition. However, any conflicting sibling partitions
+ * have to be marked as invalid too.
+ */
+ invalidate = true;
+ rcu_read_lock();
+ parent = parent_cs(cs);
+ cpuset_for_each_child(cp, css, parent)
+ if (is_partition_valid(cp) &&
+ cpumask_intersects(trialcs->cpus_allowed, cp->cpus_allowed)) {
+ rcu_read_unlock();
+ update_parent_subparts_cpumask(cp, partcmd_invalidate, NULL, &tmp);
+ rcu_read_lock();
+ }
+ rcu_read_unlock();
+ retval = 0;
+ }
+ if (retval < 0)
+ return retval;
+
if (cs->partition_root_state) {
- update_parent_subparts_cpumask(cs, partcmd_update,
- trialcs->cpus_allowed, &tmp);
+ if (invalidate)
+ update_parent_subparts_cpumask(cs, partcmd_invalidate,
+ NULL, &tmp);
+ else
+ update_parent_subparts_cpumask(cs, partcmd_update,
+ trialcs->cpus_allowed, &tmp);
}

compute_effective_cpumask(trialcs->effective_cpus, trialcs,
--
2.31.1

2022-09-04 02:52:11

by kernel test robot

[permalink] [raw]
Subject: Re: [PATCH v12 09/10] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

Hi Waiman,

I love your patch! Perhaps something to improve:

[auto build test WARNING on linus/master]
[also build test WARNING on v6.0-rc3 next-20220901]
[cannot apply to tj-cgroup/for-next]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Waiman-Long/cgroup-cpuset-cpu-partition-code-fixes-enhancements/20220902-050019
base: https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git 42e66b1cc3a070671001f8a1e933a80818a192bf
reproduce:
# https://github.com/intel-lab-lkp/linux/commit/dce03e1a3eb6fce8d7c849c8daeff91ec9a47fc8
git remote add linux-review https://github.com/intel-lab-lkp/linux
git fetch --no-tags linux-review Waiman-Long/cgroup-cpuset-cpu-partition-code-fixes-enhancements/20220902-050019
git checkout dce03e1a3eb6fce8d7c849c8daeff91ec9a47fc8
make menuconfig
# enable CONFIG_COMPILE_TEST, CONFIG_WARN_MISSING_DOCUMENTS, CONFIG_WARN_ABI_ERRORS
make htmldocs

If you fix the issue, kindly add following tag where applicable
Reported-by: kernel test robot <[email protected]>

All warnings (new ones prefixed by >>):

>> Documentation/admin-guide/cgroup-v2.rst:2191: WARNING: Malformed table.

vim +2191 Documentation/admin-guide/cgroup-v2.rst

2091
2092 cpuset.cpus
2093 A read-write multiple values file which exists on non-root
2094 cpuset-enabled cgroups.
2095
2096 It lists the requested CPUs to be used by tasks within this
2097 cgroup. The actual list of CPUs to be granted, however, is
2098 subjected to constraints imposed by its parent and can differ
2099 from the requested CPUs.
2100
2101 The CPU numbers are comma-separated numbers or ranges.
2102 For example::
2103
2104 # cat cpuset.cpus
2105 0-4,6,8-10
2106
2107 An empty value indicates that the cgroup is using the same
2108 setting as the nearest cgroup ancestor with a non-empty
2109 "cpuset.cpus" or all the available CPUs if none is found.
2110
2111 The value of "cpuset.cpus" stays constant until the next update
2112 and won't be affected by any CPU hotplug events.
2113
2114 cpuset.cpus.effective
2115 A read-only multiple values file which exists on all
2116 cpuset-enabled cgroups.
2117
2118 It lists the onlined CPUs that are actually granted to this
2119 cgroup by its parent. These CPUs are allowed to be used by
2120 tasks within the current cgroup.
2121
2122 If "cpuset.cpus" is empty, the "cpuset.cpus.effective" file shows
2123 all the CPUs from the parent cgroup that can be available to
2124 be used by this cgroup. Otherwise, it should be a subset of
2125 "cpuset.cpus" unless none of the CPUs listed in "cpuset.cpus"
2126 can be granted. In this case, it will be treated just like an
2127 empty "cpuset.cpus".
2128
2129 Its value will be affected by CPU hotplug events.
2130
2131 cpuset.mems
2132 A read-write multiple values file which exists on non-root
2133 cpuset-enabled cgroups.
2134
2135 It lists the requested memory nodes to be used by tasks within
2136 this cgroup. The actual list of memory nodes granted, however,
2137 is subjected to constraints imposed by its parent and can differ
2138 from the requested memory nodes.
2139
2140 The memory node numbers are comma-separated numbers or ranges.
2141 For example::
2142
2143 # cat cpuset.mems
2144 0-1,3
2145
2146 An empty value indicates that the cgroup is using the same
2147 setting as the nearest cgroup ancestor with a non-empty
2148 "cpuset.mems" or all the available memory nodes if none
2149 is found.
2150
2151 The value of "cpuset.mems" stays constant until the next update
2152 and won't be affected by any memory nodes hotplug events.
2153
2154 Setting a non-empty value to "cpuset.mems" causes memory of
2155 tasks within the cgroup to be migrated to the designated nodes if
2156 they are currently using memory outside of the designated nodes.
2157
2158 There is a cost for this memory migration. The migration
2159 may not be complete and some memory pages may be left behind.
2160 So it is recommended that "cpuset.mems" should be set properly
2161 before spawning new tasks into the cpuset. Even if there is
2162 a need to change "cpuset.mems" with active tasks, it shouldn't
2163 be done frequently.
2164
2165 cpuset.mems.effective
2166 A read-only multiple values file which exists on all
2167 cpuset-enabled cgroups.
2168
2169 It lists the onlined memory nodes that are actually granted to
2170 this cgroup by its parent. These memory nodes are allowed to
2171 be used by tasks within the current cgroup.
2172
2173 If "cpuset.mems" is empty, it shows all the memory nodes from the
2174 parent cgroup that will be available to be used by this cgroup.
2175 Otherwise, it should be a subset of "cpuset.mems" unless none of
2176 the memory nodes listed in "cpuset.mems" can be granted. In this
2177 case, it will be treated just like an empty "cpuset.mems".
2178
2179 Its value will be affected by memory nodes hotplug events.
2180
2181 cpuset.cpus.partition
2182 A read-write single value file which exists on non-root
2183 cpuset-enabled cgroups. This flag is owned by the parent cgroup
2184 and is not delegatable.
2185
2186 It accepts only the following input values when written to.
2187
2188 ======== ================================
2189 "member" Non-root member of a partition
2190 "root" Partition root
> 2191 "isolated" Partition root without load balancing
2192 ======== ================================
2193
2194 The root cgroup is always a partition root and its state
2195 cannot be changed. All other non-root cgroups start out as
2196 "member".
2197
2198 When set to "root", the current cgroup is the root of a new
2199 partition or scheduling domain that comprises itself and all
2200 its descendants except those that are separate partition roots
2201 themselves and their descendants.
2202
2203 When set to "isolated", the CPUs in that partition root will
2204 be in an isolated state without any load balancing from the
2205 scheduler. Tasks placed in such a partition with multiple
2206 CPUs should be carefully distributed and bound to each of the
2207 individual CPUs for optimal performance.
2208
2209 The value shown in "cpuset.cpus.effective" of a partition root
2210 is the CPUs that the partition root can dedicate to a potential
2211 new child partition root. The new child subtracts available
2212 CPUs from its parent "cpuset.cpus.effective".
2213
2214 A partition root ("root" or "isolated") can be in one of the
2215 two possible states - valid or invalid. An invalid partition
2216 root is in a degraded state where some state information may
2217 be retained, but behaves more like a "member".
2218
2219 All possible state transitions among "member", "root" and
2220 "isolated" are allowed.
2221
2222 On read, the "cpuset.cpus.partition" file can show the following
2223 values.
2224
2225 ====================== ==============================
2226 "member" Non-root member of a partition
2227 "root" Partition root
2228 "isolated" Partition root without load balancing
2229 "root invalid (<reason>)" Invalid partition root
2230 "isolated invalid (<reason>)" Invalid isolated partition root
2231 ====================== ==============================
2232
2233 In the case of an invalid partition root, a descriptive string on
2234 why the partition is invalid is included within parentheses.
2235
2236 For a partition root to become valid, the following conditions
2237 must be met.
2238
2239 1) The "cpuset.cpus" is exclusive with its siblings , i.e. they
2240 are not shared by any of its siblings (exclusivity rule).
2241 2) The parent cgroup is a valid partition root.
2242 3) The "cpuset.cpus" is not empty and must contain at least
2243 one of the CPUs from parent's "cpuset.cpus", i.e. they overlap.
2244 4) The "cpuset.cpus.effective" cannot be empty unless there is
2245 no task associated with this partition.
2246
2247 External events like hotplug or changes to "cpuset.cpus" can
2248 cause a valid partition root to become invalid and vice versa.
2249 Note that a task cannot be moved to a cgroup with empty
2250 "cpuset.cpus.effective".
2251
2252 For a valid partition root with the sibling cpu exclusivity
2253 rule enabled, changes made to "cpuset.cpus" that violate the
2254 exclusivity rule will invalidate the partition as well as its
2255 sibiling partitions with conflicting cpuset.cpus values. So
2256 care must be taking in changing "cpuset.cpus".
2257
2258 A valid non-root parent partition may distribute out all its CPUs
2259 to its child partitions when there is no task associated with it.
2260
2261 Care must be taken to change a valid partition root to
2262 "member" as all its child partitions, if present, will become
2263 invalid causing disruption to tasks running in those child
2264 partitions. These inactivated partitions could be recovered if
2265 their parent is switched back to a partition root with a proper
2266 set of "cpuset.cpus".
2267
2268 Poll and inotify events are triggered whenever the state of
2269 "cpuset.cpus.partition" changes. That includes changes caused
2270 by write to "cpuset.cpus.partition", cpu hotplug or other
2271 changes that modify the validity status of the partition.
2272 This will allow user space agents to monitor unexpected changes
2273 to "cpuset.cpus.partition" without the need to do continuous
2274 polling.
2275
2276

--
0-DAY CI Kernel Test Service
https://01.org/lkp

2022-09-04 08:54:38

by Bagas Sanjaya

[permalink] [raw]
Subject: Re: [PATCH v12 09/10] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst

On Thu, Sep 01, 2022 at 04:57:44PM -0400, Waiman Long wrote:
> It accepts only the following input values when written to.
>
> ======== ================================
> - "root" a partition root
> - "member" a non-root member of a partition
> + "member" Non-root member of a partition
> + "root" Partition root
> + "isolated" Partition root without load balancing
> ======== ================================
>
><snipped>
> + On read, the "cpuset.cpus.partition" file can show the following
> + values.
> +
> + ====================== ==============================
> + "member" Non-root member of a partition
> + "root" Partition root
> + "isolated" Partition root without load balancing
> + "root invalid (<reason>)" Invalid partition root
> + "isolated invalid (<reason>)" Invalid isolated partition root
> + ====================== ==============================
> +

These tables above produced htmldocs warnings:

Documentation/admin-guide/cgroup-v2.rst:2191: WARNING: Malformed table.
Text in column margin in table line 4.

======== ================================
"member" Non-root member of a partition
"root" Partition root
"isolated" Partition root without load balancing
======== ================================
Documentation/admin-guide/cgroup-v2.rst:2229: WARNING: Malformed table.
Text in column margin in table line 5.

====================== ==============================
"member" Non-root member of a partition
"root" Partition root
"isolated" Partition root without load balancing
"root invalid (<reason>)" Invalid partition root
"isolated invalid (<reason>)" Invalid isolated partition root
====================== ==============================

I have applied the fixup:

---- >8 ----

diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
index 76b3ea9fd5c560..77b6faecf066cb 100644
--- a/Documentation/admin-guide/cgroup-v2.rst
+++ b/Documentation/admin-guide/cgroup-v2.rst
@@ -2185,11 +2185,11 @@ Cpuset Interface Files

It accepts only the following input values when written to.

- ======== ================================
+ ========== ================================
"member" Non-root member of a partition
"root" Partition root
"isolated" Partition root without load balancing
- ======== ================================
+ ========== ================================

The root cgroup is always a partition root and its state
cannot be changed. All other non-root cgroups start out as
@@ -2222,13 +2222,13 @@ Cpuset Interface Files
On read, the "cpuset.cpus.partition" file can show the following
values.

- ====================== ==============================
+ ============================= =====================================
"member" Non-root member of a partition
"root" Partition root
"isolated" Partition root without load balancing
"root invalid (<reason>)" Invalid partition root
"isolated invalid (<reason>)" Invalid isolated partition root
- ====================== ==============================
+ ============================= =====================================

In the case of an invalid partition root, a descriptive string on
why the partition is invalid is included within parentheses.

Thanks.

--
An old man doll... just what I always wanted! - Clara


Attachments:
(No filename) (3.48 kB)
signature.asc (235.00 B)
Download all attachments

2022-09-04 21:02:44

by Tejun Heo

[permalink] [raw]
Subject: Re: [PATCH v12 00/10] cgroup/cpuset: cpu partition code fixes & enhancements

On Thu, Sep 01, 2022 at 04:57:35PM -0400, Waiman Long wrote:
> v12:
> - Change patch 1 to enable update_tasks_cpumask() for top_cpuset except
> for percpu kthreads.
> - Add 2 more patches to make exclusivity rule violations invalidate the
> partition and its siblings instead of failing the change to make it
> consistent with other cpuset changes.
> - Update documentation and test script accordingly.

Applied to cgroup/for-6.1 with the doc tables fixed.

Thanks a lot for sticking with it. This looks great.

--
tejun

2022-09-06 16:11:17

by Waiman Long

[permalink] [raw]
Subject: Re: [PATCH v12 09/10] cgroup/cpuset: Update description of cpuset.cpus.partition in cgroup-v2.rst


On 9/4/22 04:20, Bagas Sanjaya wrote:
> On Thu, Sep 01, 2022 at 04:57:44PM -0400, Waiman Long wrote:
>> It accepts only the following input values when written to.
>>
>> ======== ================================
>> - "root" a partition root
>> - "member" a non-root member of a partition
>> + "member" Non-root member of a partition
>> + "root" Partition root
>> + "isolated" Partition root without load balancing
>> ======== ================================
>>
>> <snipped>
>> + On read, the "cpuset.cpus.partition" file can show the following
>> + values.
>> +
>> + ====================== ==============================
>> + "member" Non-root member of a partition
>> + "root" Partition root
>> + "isolated" Partition root without load balancing
>> + "root invalid (<reason>)" Invalid partition root
>> + "isolated invalid (<reason>)" Invalid isolated partition root
>> + ====================== ==============================
>> +
> These tables above produced htmldocs warnings:
>
> Documentation/admin-guide/cgroup-v2.rst:2191: WARNING: Malformed table.
> Text in column margin in table line 4.
>
> ======== ================================
> "member" Non-root member of a partition
> "root" Partition root
> "isolated" Partition root without load balancing
> ======== ================================
> Documentation/admin-guide/cgroup-v2.rst:2229: WARNING: Malformed table.
> Text in column margin in table line 5.
>
> ====================== ==============================
> "member" Non-root member of a partition
> "root" Partition root
> "isolated" Partition root without load balancing
> "root invalid (<reason>)" Invalid partition root
> "isolated invalid (<reason>)" Invalid isolated partition root
> ====================== ==============================
>
> I have applied the fixup:
>
> ---- >8 ----
>
> diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst
> index 76b3ea9fd5c560..77b6faecf066cb 100644
> --- a/Documentation/admin-guide/cgroup-v2.rst
> +++ b/Documentation/admin-guide/cgroup-v2.rst
> @@ -2185,11 +2185,11 @@ Cpuset Interface Files
>
> It accepts only the following input values when written to.
>
> - ======== ================================
> + ========== ================================
> "member" Non-root member of a partition
> "root" Partition root
> "isolated" Partition root without load balancing
> - ======== ================================
> + ========== ================================
>
> The root cgroup is always a partition root and its state
> cannot be changed. All other non-root cgroups start out as
> @@ -2222,13 +2222,13 @@ Cpuset Interface Files
> On read, the "cpuset.cpus.partition" file can show the following
> values.
>
> - ====================== ==============================
> + ============================= =====================================
> "member" Non-root member of a partition
> "root" Partition root
> "isolated" Partition root without load balancing
> "root invalid (<reason>)" Invalid partition root
> "isolated invalid (<reason>)" Invalid isolated partition root
> - ====================== ==============================
> + ============================= =====================================
>
> In the case of an invalid partition root, a descriptive string on
> why the partition is invalid is included within parentheses.
>
> Thanks.

Thanks for the fixes. Will apply that.

Cheers,
Longman