This patch series includes miscellaneous updates to the cpuset and its
testing code.
Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset:
Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
Patches 3-4 are for handling corner cases when dealing with
task_cpu_possible_mask().
Waiman Long (5):
cgroup/cpuset: Skip task update if hotplug doesn't affect current
cpuset
cgroup/cpuset: Include offline CPUs when tasks' cpumasks in top_cpuset
are updated
cgroup/cpuset: Find another usable CPU if none found in current cpuset
cgroup/cpuset: Add CONFIG_DEBUG_CPUSETS config for cpuset testing
cgroup/cpuset: Minor updates to test_cpuset_prs.sh
init/Kconfig | 5 +
kernel/cgroup/cpuset.c | 155 +++++++++++++++++-
.../selftests/cgroup/test_cpuset_prs.sh | 25 +--
3 files changed, 165 insertions(+), 20 deletions(-)
--
2.31.1
If a hotplug event doesn't affect the current cpuset, there is no point
in calling hotplug_update_tasks() or hotplug_update_tasks_legacy(). So
just skip it.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 3 +++
1 file changed, 3 insertions(+)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 636f1c682ac0..a801abad3bac 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3508,6 +3508,8 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
update_tasks:
cpus_updated = !cpumask_equal(&new_cpus, cs->effective_cpus);
mems_updated = !nodes_equal(new_mems, cs->effective_mems);
+ if (!cpus_updated && !mems_updated)
+ goto unlock; /* Hotplug doesn't affect this cpuset */
if (mems_updated)
check_insane_mems_config(&new_mems);
@@ -3519,6 +3521,7 @@ static void cpuset_hotplug_update_tasks(struct cpuset *cs, struct tmpmasks *tmp)
hotplug_update_tasks_legacy(cs, &new_cpus, &new_mems,
cpus_updated, mems_updated);
+unlock:
percpu_up_write(&cpuset_rwsem);
}
--
2.31.1
Since commit 431c69fac05b ("cpuset: Honour task_cpu_possible_mask()
in guarantee_online_cpus()"), task_cpu_possible_mask() is used within
the cpuset code. However, it is hard to find an arm64 system that can
actually make task_cpu_possible_mask() return a different cpu mask. As a
result, it is hard to exercise the correctness of the code that handles
the exception cases arising from task_cpu_possible_mask().
To help exercise those code paths, we need a way to force
task_cpu_possible_mask() to return a different cpu mask. This patch adds
a new CONFIG_DEBUG_CPUSETS config option to enable some debug code to do
just that. The idea is to create a debugfs file "debug_cpu_possible_mask"
that holds the cpumask to be returned by task_cpu_possible_mask() when
a task whose name starts with the special prefix "cstest" is passed in
as the argument. Userspace testing code is then able to exercise the
different code paths that are affected by task_cpu_possible_mask().
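As an illustration, a test could drive this roughly as follows. This is
only a sketch: it assumes a kernel built with CONFIG_DEBUG_CPUSETS=y,
debugfs mounted at /sys/kernel/debug, an existing cpuset-enabled cgroup
"test" under a cgroup v2 mount at /sys/fs/cgroup, and an architecture
where task_cpu_possible_mask() normally returns cpu_possible_mask. The
"cstest-sleep" binary name is just a hypothetical way to get a comm
starting with the "cstest" prefix.

  # Pretend that only CPUs 0-1 are possible for "cstest*" tasks
  echo 0-1 > /sys/kernel/debug/debug_cpu_possible_mask

  # Start a task whose comm starts with the "cstest" prefix
  cp /bin/sleep /tmp/cstest-sleep
  /tmp/cstest-sleep 300 &
  PID=$!
  echo $PID > /sys/fs/cgroup/test/cgroup.procs

  # Rewriting cpuset.cpus makes update_tasks_cpumask() consult
  # task_cpu_possible_mask(), which now returns the debug mask
  echo 0-3 > /sys/fs/cgroup/test/cpuset.cpus
  grep Cpus_allowed_list /proc/$PID/status    # expect 0-1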
Signed-off-by: Waiman Long <[email protected]>
---
init/Kconfig | 5 +++
kernel/cgroup/cpuset.c | 76 ++++++++++++++++++++++++++++++++++++++++++
2 files changed, 81 insertions(+)
diff --git a/init/Kconfig b/init/Kconfig
index 18f0bf50c468..2abaa830aff0 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1140,6 +1140,11 @@ config PROC_PID_CPUSET
depends on CPUSETS
default y
+config DEBUG_CPUSETS
+ bool "Enable cpuset debugging"
+ depends on CPUSETS && DEBUG_FS
+ default n
+
config CGROUP_DEVICE
bool "Device controller"
help
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index aa8225daf1d3..45051ebb6606 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -220,6 +220,29 @@ static inline bool is_prs_invalid(int prs_state)
return prs_state < 0;
}
+#ifdef CONFIG_DEBUG_CPUSETS
+static struct cpumask debug_cpu_possible_mask;
+
+/*
+ * Debug code for exercising code paths involving task_cpu_possible_mask()
+ */
+static inline const struct cpumask *
+__task_cpu_possible_mask(struct task_struct *p)
+{
+ const struct cpumask *mask = task_cpu_possible_mask(p);
+
+ if (mask != cpu_possible_mask)
+ return mask;
+ else if (!strncmp(p->comm, "cstest", 6))
+ return &debug_cpu_possible_mask;
+ else
+ return cpu_possible_mask;
+}
+
+#undef task_cpu_possible_mask
+#define task_cpu_possible_mask(p) __task_cpu_possible_mask(p)
+#endif /* CONFIG_DEBUG_CPUSETS */
+
/*
* Temporary cpumasks for working with partitions that are passed among
* functions to avoid memory allocation in inner functions.
@@ -4139,3 +4162,56 @@ void cpuset_task_status_allowed(struct seq_file *m, struct task_struct *task)
seq_printf(m, "Mems_allowed_list:\t%*pbl\n",
nodemask_pr_args(&task->mems_allowed));
}
+
+#ifdef CONFIG_DEBUG_CPUSETS
+#include <linux/debugfs.h>
+
+/*
+ * Add a debugfs file "debug_cpu_possible_mask" that allows the user to set
+ * a debug mask for testing.
+ */
+static ssize_t read_debug_mask(struct file *file, char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ char buf[80];
+ int len;
+
+ len = snprintf(buf, sizeof(buf) - 1, "%*pbl\n",
+ cpumask_pr_args(&debug_cpu_possible_mask));
+ return simple_read_from_buffer(user_buf, count, ppos, buf, len);
+}
+
+static ssize_t write_debug_mask(struct file *file, const char __user *user_buf,
+ size_t count, loff_t *ppos)
+{
+ unsigned int len;
+ char buf[80];
+ int retval = 0;
+
+ len = min(count, sizeof(buf) - 1);
+ if (copy_from_user(buf, user_buf, len))
+ return -EFAULT;
+
+ buf[len] = '\0'; /* ensure NUL-termination for cpulist_parse() */
+ if (!*buf)
+ cpumask_clear(&debug_cpu_possible_mask);
+ else
+ retval = cpulist_parse(buf, &debug_cpu_possible_mask);
+
+ return (retval < 0) ? retval : count;
+}
+
+static const struct file_operations fops_debug_mask = {
+ .read = read_debug_mask,
+ .write = write_debug_mask,
+ .llseek = default_llseek,
+};
+
+static int __init create_debug_cpu_possible_mask(void)
+{
+ cpumask_copy(&debug_cpu_possible_mask, cpu_possible_mask);
+ debugfs_create_file("debug_cpu_possible_mask", 0600, NULL, NULL,
+ &fops_debug_mask);
+ return 0;
+}
+late_initcall(create_debug_cpu_possible_mask);
+#endif /* CONFIG_DEBUG_CPUSETS */
--
2.31.1
Similar to commit 3fb906e7fabb ("cgroup/cpuset: Don't filter offline
CPUs in cpuset_cpus_allowed() for top cpuset tasks"), the whole set of
possible CPUs including offline ones should be used for setting cpumasks
for tasks in the top cpuset when a cpuset partition is modified.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 22 +++++++++++++---------
1 file changed, 13 insertions(+), 9 deletions(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index a801abad3bac..bbf57dcb2f68 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1209,7 +1209,8 @@ void rebuild_sched_domains(void)
*
* Iterate through each task of @cs updating its cpus_allowed to the
* effective cpuset's. As this function is called with cpuset_rwsem held,
- * cpuset membership stays stable.
+ * cpuset membership stays stable. For top_cpuset, task_cpu_possible_mask()
+ * is used instead of effective_cpus.
*/
static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
{
@@ -1219,15 +1220,18 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
css_task_iter_start(&cs->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
- /*
- * Percpu kthreads in top_cpuset are ignored
- */
- if (top_cs && (task->flags & PF_KTHREAD) &&
- kthread_is_per_cpu(task))
- continue;
+ const struct cpumask *possible_mask = task_cpu_possible_mask(task);
- cpumask_and(new_cpus, cs->effective_cpus,
- task_cpu_possible_mask(task));
+ if (top_cs) {
+ /*
+ * Percpu kthreads in top_cpuset are ignored
+ */
+ if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
+ continue;
+ cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
+ } else {
+ cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
+ }
set_cpus_allowed_ptr(task, new_cpus);
}
css_task_iter_end(&it);
--
2.31.1
On a system with asymmetric CPUs, a restricted task is one that can run
only on a selected subset of available CPUs. When a CPU goes offline or
when "cpuset.cpus" is changed, it is possible that a restricted task
may not have any runnable CPUs left in the current cpuset even if there
are still some CPUs in effective_cpus. In this case, the restricted task
cannot be run at all.
There are several ways we may be able to handle this situation. Treating
it like empty effective_cpus is probably too disruptive and is unfair to
the normal tasks. So it is better to have some special handling for these
restricted tasks. One possibility is to move the restricted tasks up the
cpuset hierarchy, but it is tricky to do it right. Another solution is
to assign other usable CPUs to these tasks. This patch implements the
latter alternative by finding one usable CPU by walking up the cpuset
hierarchy and printing an informational message to let the users know
that these restricted tasks are running in a cpuset with no usable CPU.
Signed-off-by: Waiman Long <[email protected]>
---
kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++-
1 file changed, 55 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index bbf57dcb2f68..aa8225daf1d3 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void)
cpus_read_unlock();
}
+/*
+ * Find a usable effective (online) CPU up the cpuset hierarchy and return it.
+ */
+static int find_usable_cpu(struct cpuset *cs, struct cpumask *new_cpus,
+ const struct cpumask *possible_mask)
+{
+ struct cpuset *parent;
+ unsigned long flags;
+ int cpu;
+
+ /*
+ * When offlining cpu, some effective_cpus may not be up to date.
+ * So check cpu_online_mask to be sure.
+ */
+ parent = parent_cs(cs);
+ while (parent &&
+ (!cpumask_and(new_cpus, parent->effective_cpus, possible_mask) ||
+ !cpumask_and(new_cpus, new_cpus, cpu_online_mask)))
+ parent = parent_cs(parent);
+
+ /* Fall back to all possible online cpus, if necessary */
+ if (!parent)
+ cpumask_and(new_cpus, possible_mask, cpu_online_mask);
+
+ /* cpumask_any_distribute() has to be called with preemption disabled */
+ local_irq_save(flags);
+ cpu = cpumask_any_distribute(new_cpus);
+ local_irq_restore(flags);
+
+ return cpu;
+}
+
/**
* update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
* @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
@@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
struct task_struct *task;
bool top_cs = cs == &top_cpuset;
+ percpu_rwsem_assert_held(&cpuset_rwsem);
css_task_iter_start(&cs->css, 0, &it);
while ((task = css_task_iter_next(&it))) {
const struct cpumask *possible_mask = task_cpu_possible_mask(task);
@@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
} else {
cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
}
- set_cpus_allowed_ptr(task, new_cpus);
+ /*
+ * On systems with asymmetric CPUs, it is possible that
+ * cpumask will become empty or set_cpus_allowed_ptr() will
+ * return an error even if we still have CPUs in
+ * effective_cpus. In this case, we find a usable CPU by walking
+ * up the cpuset hierarchy and use that for this particular
+ * task with an informational message about the change in the
+ * hope that the users will adjust "cpuset.cpus" accordingly.
+ */
+ if (cpumask_empty(new_cpus) ||
+ set_cpus_allowed_ptr(task, new_cpus)) {
+ char name[80];
+ int cpu;
+
+ cpu = find_usable_cpu(cs, new_cpus, possible_mask);
+ cpumask_clear(new_cpus);
+ cpumask_set_cpu(cpu, new_cpus);
+ WARN_ON_ONCE(set_cpus_allowed_ptr(task, new_cpus));
+ cgroup_name(cs->css.cgroup, name, sizeof(name));
+ pr_info("cpuset: Restricted task %s(%d) in cpuset %s is forced to run on outside CPU %d\n",
+ task->comm, task->pid, name, cpu);
+ }
}
css_task_iter_end(&it);
}
--
2.31.1
This patch makes the following minor updates to the cpuset partition
testing script test_cpuset_prs.sh.
- Remove online_cpus function call as it will be called anyway on exit
in cleanup.
- Make the enabling of sched/verbose debugfs flag conditional on the
"-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset
partition operations are likely to be slowed down by enabling that.
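For example (a sketch of the resulting usage; note that option parsing
stops at "-v", so an explicit "-d" has to come before it):

  # verbose run: enables sched/verbose and bumps DELAY_FACTOR to 2
  ./test_cpuset_prs.sh -v

  # explicit delay factor, then verbose
  ./test_cpuset_prs.sh -d 4 -v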
Signed-off-by: Waiman Long <[email protected]>
---
.../selftests/cgroup/test_cpuset_prs.sh | 25 +++++++++++--------
1 file changed, 14 insertions(+), 11 deletions(-)
diff --git a/tools/testing/selftests/cgroup/test_cpuset_prs.sh b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
index 75c100de90ff..2b5215cc599f 100755
--- a/tools/testing/selftests/cgroup/test_cpuset_prs.sh
+++ b/tools/testing/selftests/cgroup/test_cpuset_prs.sh
@@ -15,13 +15,6 @@ skip_test() {
[[ $(id -u) -eq 0 ]] || skip_test "Test must be run as root!"
-# Set sched verbose flag, if available
-if [[ -d /sys/kernel/debug/sched ]]
-then
- # Used to restore the original setting during cleanup
- SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose)
- echo Y > /sys/kernel/debug/sched/verbose
-fi
# Get wait_inotify location
WAIT_INOTIFY=$(cd $(dirname $0); pwd)/wait_inotify
@@ -37,10 +30,14 @@ CPUS=$(lscpu | grep "^CPU(s):" | sed -e "s/.*:[[:space:]]*//")
PROG=$1
VERBOSE=
DELAY_FACTOR=1
+SCHED_DEBUG=
while [[ "$1" = -* ]]
do
case "$1" in
-v) VERBOSE=1
+ # Enabling sched/verbose can slow things down
+ [[ $DELAY_FACTOR -eq 1 ]] &&
+ DELAY_FACTOR=2
break
;;
-d) DELAY_FACTOR=$2
@@ -54,6 +51,14 @@ do
shift
done
+# Set the sched verbose flag, if available, when the "-v" option is specified
+if [[ -n "$VERBOSE" && -d /sys/kernel/debug/sched ]]
+then
+ # Used to restore the original setting during cleanup
+ SCHED_DEBUG=$(cat /sys/kernel/debug/sched/verbose)
+ echo Y > /sys/kernel/debug/sched/verbose
+fi
+
cd $CGROUP2
echo +cpuset > cgroup.subtree_control
[[ -d test ]] || mkdir test
@@ -65,7 +70,8 @@ cleanup()
rmdir A1/A2/A3 A1/A2 A1 B1 > /dev/null 2>&1
cd ..
rmdir test > /dev/null 2>&1
- echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose
+ [[ -n "$SCHED_DEBUG" ]] &&
+ echo "$SCHED_DEBUG" > /sys/kernel/debug/sched/verbose
}
# Pause in ms
@@ -571,7 +577,6 @@ run_state_test()
echo "Test $TEST[$I] failed result check!"
eval echo \"\${$TEST[$I]}\"
dump_states
- online_cpus
exit 1
}
@@ -582,7 +587,6 @@ run_state_test()
eval echo \"\${$TEST[$I]}\"
echo
dump_states
- online_cpus
exit 1
}
}
@@ -594,7 +598,6 @@ run_state_test()
eval echo \"\${$TEST[$I]}\"
echo
dump_states
- online_cpus
exit 1
}
}
--
2.31.1
On 3/7/23 01:38, Waiman Long wrote:
> This patch makes the following minor updates to the cpuset partition
> testing script test_cpuset_prs.sh.
>
> - Remove online_cpus function call as it will be called anyway on exit
> in cleanup.
> - Make the enabling of sched/verbose debugfs flag conditional on the
> "-v" verbose option and set DELAY_FACTOR to 2 in this case as cpuset
> partition operations are likely to be slowed down by enabling that.
>
> Signed-off-by: Waiman Long <[email protected]>
Reviewed-by: Kamalesh Babulal <[email protected]>
On Mon, Mar 06, 2023 at 03:08:45PM -0500, Waiman Long <[email protected]> wrote:
> If a hotplug event doesn't affect the current cpuset, there is no point
> in calling hotplug_update_tasks() or hotplug_update_tasks_legacy(). So
> just skip it.
This skips "insane" modification of cs->cpus_allowed in
hotplug_update_tasks_legacy() but assuming cs->cpus_allowed is kept in
sync with cs->effective_cpus on v1, it is OK to skip the update based
only on the effective_cpus check.
Hence,
> kernel/cgroup/cpuset.c | 3 +++
> 1 file changed, 3 insertions(+)
Reviewed-by: Michal Koutný <[email protected]>
Hello Waiman.
On Mon, Mar 06, 2023 at 03:08:46PM -0500, Waiman Long <[email protected]> wrote:
> - /*
> - * Percpu kthreads in top_cpuset are ignored
> - */
> - if (top_cs && (task->flags & PF_KTHREAD) &&
> - kthread_is_per_cpu(task))
> - continue;
> + const struct cpumask *possible_mask = task_cpu_possible_mask(task);
>
> - cpumask_and(new_cpus, cs->effective_cpus,
> - task_cpu_possible_mask(task));
> + if (top_cs) {
> + /*
> + * Percpu kthreads in top_cpuset are ignored
> + */
> + if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
> + continue;
> + cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
> + } else {
> + cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
> + }
I'm wrapping my head around this slightly.
1) I'd suggest swapping the args of cpumask_and() to have possible_mask
consistently first.
2) Then I'm wondering whether two branches are truly different when
effective_cpus := cpus_allowed - subparts_cpus
top_cpuset.cpus_allowed == possible_mask (1)
IOW, can you see a difference in what affinities are set to eligible
top_cpuset tasks before and after this patch upon CPU hotplug?
(Hm, (1) holds only in v2. So is this a fix for v1 only?)
Thanks,
Michal
Hello.
On Mon, Mar 06, 2023 at 03:08:47PM -0500, Waiman Long <[email protected]> wrote:
> On a system with asymmetric CPUs, a restricted task is one that can run
> only on a selected subset of available CPUs. When a CPU goes offline or
> when "cpuset.cpus" is changed, it is possible that a restricted task
> may not have any runnable CPUs left in the current cpuset even if there
> are still some CPUs in effective_cpus. In this case, the restricted task
> cannot be run at all.
>
> There are several ways we may be able to handle this situation. Treating
> it like empty effective_cpus is probably too disruptive and is unfair to
> the normal tasks. So it is better to have some special handling for these
> restricted tasks. One possibility is to move the restricted tasks up the
> cpuset hierarchy, but it is tricky to do it right. Another solution is
> to assign other usable CPUs to these tasks. This patch implements the
> latter alternative by finding one usable CPU by walking up the cpuset
> hierarchy and printing an informational message to let the users know
> that these restricted tasks are running in a cpuset with no usable CPU.
>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 55 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
> index bbf57dcb2f68..aa8225daf1d3 100644
> --- a/kernel/cgroup/cpuset.c
> +++ b/kernel/cgroup/cpuset.c
> @@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void)
> cpus_read_unlock();
> }
>
> [...]
> /**
> * update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
> * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
> @@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
> struct task_struct *task;
> bool top_cs = cs == &top_cpuset;
>
> + percpu_rwsem_assert_held(&cpuset_rwsem);
> css_task_iter_start(&cs->css, 0, &it);
> while ((task = css_task_iter_next(&it))) {
> const struct cpumask *possible_mask = task_cpu_possible_mask(task);
> @@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
> } else {
> cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
> }
> - set_cpus_allowed_ptr(task, new_cpus);
> + /*
> + * On systems with asymmetric CPUs, it is possible that
> + * cpumask will become empty or set_cpus_allowed_ptr() will
> + * return an error even if we still have CPUs in
> + * effective_cpus. In this case, we find a usable CPU by walking
> + * up the cpuset hierarchy and use that for this particular
> + * task with an informational message about the change in the
> + * hope that the users will adjust "cpuset.cpus" accordingly.
> + */
> + if (cpumask_empty(new_cpus) ||
> + set_cpus_allowed_ptr(task, new_cpus)) {
IIUC, cpumask_empty(new_cpus) here implies
cpumask_empty(cs->effective_cpus) but that shouldn't happen (cs should
inherit a non-empty mask from an ancestor). Am I missing/forgetting anything?
This thus covers the case when p->user_cpus_ptr is incompatible with
hotplug or cpuset.cpus allowance and a different affinity must be
chosen. But doesn't that mean that the task would run _out_ of
cs->effective_cpus?
I guess that's unavoidable on asymmetric CPU archs but not on SMPs.
Shouldn't the solution distinguish between the two? (I.e. never run out
of effective_cpus on SMP.)
Thanks,
Michal
On 3/14/23 12:50, Michal Koutný wrote:
> On Mon, Mar 06, 2023 at 03:08:45PM -0500, Waiman Long <[email protected]> wrote:
>> If a hotplug event doesn't affect the current cpuset, there is no point
>> in calling hotplug_update_tasks() or hotplug_update_tasks_legacy(). So
>> just skip it.
> This skips "insane" modification of cs->cpus_allowed in
> hotplug_update_tasks_legacy() but assuming cs->cpus_allowed is kept in
> sync with cs->effective_cpus on v1, it is OK to skip the update based
> only on the effective_cpus check.
Yes, effective_cpus is equivalent to cpus_allowed in v1 unless you mount
the cpuset with the cpuset_v2_mode flag, which behaves more like v2
where effective_cpus is still the key.
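For reference, that mode is selected at mount time, e.g. (the mount
point here is illustrative):

  mount -t cgroup -o cpuset,cpuset_v2_mode cpuset /sys/fs/cgroup/cpuset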
Cheers,
Longman
On 3/14/23 13:34, Michal Koutný wrote:
> Hello Waiman.
>
> On Mon, Mar 06, 2023 at 03:08:46PM -0500, Waiman Long <[email protected]> wrote:
>> - /*
>> - * Percpu kthreads in top_cpuset are ignored
>> - */
>> - if (top_cs && (task->flags & PF_KTHREAD) &&
>> - kthread_is_per_cpu(task))
>> - continue;
>> + const struct cpumask *possible_mask = task_cpu_possible_mask(task);
>>
>> - cpumask_and(new_cpus, cs->effective_cpus,
>> - task_cpu_possible_mask(task));
>> + if (top_cs) {
>> + /*
>> + * Percpu kthreads in top_cpuset are ignored
>> + */
>> + if ((task->flags & PF_KTHREAD) && kthread_is_per_cpu(task))
>> + continue;
>> + cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
>> + } else {
>> + cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
>> + }
> I'm wrapping my head around this slightly.
> 1) I'd suggest swapping args in of cpumask_and() to have possible_mask
> consistently first.
I don't quite understand what you meant by "swapping args". It is
effectively new_cpus = cs->effective_cpus ∩ possible_mask. What is the
point of swapping cs->effective_cpus and possible_mask?
> 2) Then I'm wondering whether two branches are truly different when
> effective_cpus := cpus_allowed - subparts_cpus
> top_cpuset.cpus_allowed == possible_mask (1)
effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some
of the CPUs are offline as effective_cpus contains only online CPUs.
subparts_cpus can include offline cpus too. That is why I chose that
expression. I will add a comment to clarify that.
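For illustration, with cpus_allowed = 0-3, subparts_cpus = 3 and CPU 2
offline, effective_cpus is 0-1 while "cpus_allowed - subparts_cpus" is
0-2, i.e. the latter keeps the offline CPU 2 in the mask.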
>
> IOW, can you see a difference in what affinities are set to eligible
> top_cpuset tasks before and after this patch upon CPU hotplug?
> (Hm, (1) holds only in v2. So is this a fix for v1 only?)
This is due to the fact that cpu hotplug code currently doesn't update
the cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset
can rely on the hotplug code to update the cpu affinity appropriately.
For the tasks in the top cpuset, we have to make sure that all the
offline CPUs are included.
Cheers,
Longman
On 3/14/23 14:17, Michal Koutný wrote:
> Hello.
>
> On Mon, Mar 06, 2023 at 03:08:47PM -0500, Waiman Long <[email protected]> wrote:
>> On a system with asymmetric CPUs, a restricted task is one that can run
>> only on a selected subset of available CPUs. When a CPU goes offline or
>> when "cpuset.cpus" is changed, it is possible that a restricted task
>> may not have any runnable CPUs left in the current cpuset even if there
>> are still some CPUs in effective_cpus. In this case, the restricted task
>> cannot be run at all.
>>
>> There are several ways we may be able to handle this situation. Treating
>> it like empty effective_cpus is probably too disruptive and is unfair to
>> the normal tasks. So it is better to have some special handling for these
>> restricted tasks. One possibility is to move the restricted tasks up the
>> cpuset hierarchy, but it is tricky to do it right. Another solution is
>> to assign other usable CPUs to these tasks. This patch implements the
>> latter alternative by finding one usable CPU by walking up the cpuset
>> hierarchy and printing an informational message to let the users know
>> that these restricted tasks are running in a cpuset with no usable CPU.
>>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> kernel/cgroup/cpuset.c | 56 +++++++++++++++++++++++++++++++++++++++++-
>> 1 file changed, 55 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
>> index bbf57dcb2f68..aa8225daf1d3 100644
>> --- a/kernel/cgroup/cpuset.c
>> +++ b/kernel/cgroup/cpuset.c
>> @@ -1202,6 +1202,38 @@ void rebuild_sched_domains(void)
>> cpus_read_unlock();
>> }
>>
>> [...]
>> /**
>> * update_tasks_cpumask - Update the cpumasks of tasks in the cpuset.
>> * @cs: the cpuset in which each task's cpus_allowed mask needs to be changed
>> @@ -1218,6 +1250,7 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
>> struct task_struct *task;
>> bool top_cs = cs == &top_cpuset;
>>
>> + percpu_rwsem_assert_held(&cpuset_rwsem);
>> css_task_iter_start(&cs->css, 0, &it);
>> while ((task = css_task_iter_next(&it))) {
>> const struct cpumask *possible_mask = task_cpu_possible_mask(task);
>> @@ -1232,7 +1265,28 @@ static void update_tasks_cpumask(struct cpuset *cs, struct cpumask *new_cpus)
>> } else {
>> cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
>> }
>> - set_cpus_allowed_ptr(task, new_cpus);
>> + /*
>> + * On systems with asymmetric CPUs, it is possible that
>> + * cpumask will become empty or set_cpus_allowed_ptr() will
>> + * return an error even if we still have CPUs in
>> + * effective_cpus. In this case, we find a usable CPU by walking
>> + * up the cpuset hierarchy and use that for this particular
>> + * task with an informational message about the change in the
>> + * hope that the users will adjust "cpuset.cpus" accordingly.
>> + */
>> + if (cpumask_empty(new_cpus) ||
>> + set_cpus_allowed_ptr(task, new_cpus)) {
> IIUC, cpumask_empty(new_cpus) here implies
> cpumask_empty(cs->effective_cpus) but that shouldn't happen (cs should
> inherit a non-empty mask from an ancestor). Am I missing/forgetting anything?
>
> This thus covers the case when p->user_cpus_ptr is incompatible with
> hotplug or cpuset.cpus allowance and a different affinity must be
> chosen. But doesn't that mean that the task would run _out_ of
> cs->effective_cpus?
> I guess that's unavoidable on asymmetric CPU archs but not on SMPs.
> Shouldn't the solution distinguish between the two? (I.e. never run out
> of effective_cpus on SMP.)
Some arm64 systems can have asymmetric CPUs where certain tasks are only
runnable on a selected subset of CPUs. This information is not captured
in the cpuset. As a result, task_cpu_possible_mask() may return a mask
that has no overlap with effective_cpus causing new_cpus to become
empty. It took me a while to understand the implication of that. I will
elaborate this point a bit more in the patch.
Cheers,
Longman
On Tue, Mar 14, 2023 at 03:02:53PM -0400, Waiman Long <[email protected]> wrote:
> > > + cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
> > > + } else {
> > > + cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
> > > + }
> > I'm wrapping my head around this slightly.
> > 1) I'd suggest swapping the args of cpumask_and() to have possible_mask
> > consistently first.
> I don't quite understand what you meant by "swapping args". It is effectively
> new_cpus = cs->effective_cpus ∩ possible_mask. What is the point of swapping
> cs->effective_cpus and possible_mask?
No effect except better readability (possible_mask comes first in the
other branch above too, left hand arg as the "base" that's modified).
> > 2) Then I'm wondering whether two branches are truly different when
> > effective_cpus := cpus_allowed - subparts_cpus
> > top_cpuset.cpus_allowed == possible_mask (1)
> effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some of
> the CPUs are offline as effective_cpus contains only online CPUs.
> subparts_cpus can include offline cpus too. That is why I chose that
> expression. I will add a comment to clarify that.
I see now that it returns offlined cpus to top cpuset's tasks.
> >
> > IOW, can you see a difference in what affinities are set to eligible
> > top_cpuset tasks before and after this patch upon CPU hotplug?
> > (Hm, (1) holds only in v2. So is this a fix for v1 only?)
>
> This is due to the fact that cpu hotplug code currently doesn't update the
> cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset can
> rely on the hotplug code to update the cpu affinity appropriately.
Oh, I mistook this for hotplug changing behavior but it's actually for
updating top_cpuset when one of its children becomes a partition root.
IIUC, top cpuset + hotplug has been treated specially because
hotplug must have taken care of affinity regardless of cpuset.
v1 allowed modification of top cpuset's mask (not sure if that
worked), v2 won't allow direct top cpuset's mask modification
but indirectly via partition root children.
So this is a continuation of 3fb906e7fabb ("cgroup/cpuset: Don't filter
offline CPUs in cpuset_cpus_allowed() for top cpuset tasks") to ensure
hotplug offline/online cycle won't overwrite top_cpuset tasks'
affinities (when partitions change during the offline period).
This looks correct in this regard then.
(I wish it were simpler but that's for a different/broader top
cpuset+hotplug approach.)
Thanks,
Michal
On 3/15/23 06:06, Michal Koutný wrote:
> On Tue, Mar 14, 2023 at 03:02:53PM -0400, Waiman Long <[email protected]> wrote:
>>>> + cpumask_andnot(new_cpus, possible_mask, cs->subparts_cpus);
>>>> + } else {
>>>> + cpumask_and(new_cpus, cs->effective_cpus, possible_mask);
>>>> + }
>>> I'm wrapping my head around this slightly.
>>> 1) I'd suggest swapping the args of cpumask_and() to have possible_mask
>>> consistently first.
>> I don't quite understand what you meant by "swapping args". It is effectively
>> new_cpus = cs->effective_cpus ∩ possible_mask. What is the point of swapping
>> cs->effective_cpus and possible_mask?
> No effect except better readability (possible_mask comes first in the
> other branch above too, left hand arg as the "base" that's modified).
OK, I got it now. I will swap it as suggested.
>
>>> 2) Then I'm wondering whether two branches are truly different when
>>> effective_cpus := cpus_allowed - subparts_cpus
>>> top_cpuset.cpus_allowed == possible_mask (1)
>> effective_cpus may not be equal to "cpus_allowed - subparts_cpus" if some of
>> the CPUs are offline as effective_cpus contains only online CPUs.
>> subparts_cpus can include offline cpus too. That is why I chose that
>> expression. I will add a comment to clarify that.
> I see now that it returns offlined cpus to top cpuset's tasks.
>
>>> IOW, can you see a difference in what affinities are set to eligible
>>> top_cpuset tasks before and after this patch upon CPU hotplug?
>>> (Hm, (1) holds only in v2. So is this a fix for v1 only?)
>> This is due to the fact that cpu hotplug code currently doesn't update the
>> cpu affinity of tasks in the top cpuset. Tasks not in the top cpuset can
>> rely on the hotplug code to update the cpu affinity appropriately.
> Oh, I mistook this for hotplug changing behavior but it's actually for
> updating top_cpuset when one of its children becomes a partition root.
>
> IIUC, top cpuset + hotplug has been treated specially because
> hotplug must have taken care of affinity regardless of cpuset.
> v1 allowed modification of top cpuset's mask (not sure if that
> worked), v2 won't allow direct top cpuset's mask modification
> but indirectly via partition root children.
>
> So this is a continuation of 3fb906e7fabb ("cgroup/cpuset: Don't filter
> offline CPUs in cpuset_cpus_allowed() for top cpuset tasks") to ensure
> hotplug offline/online cycle won't overwrite top_cpuset tasks'
> affinities (when partitions change during the offline period).
> This looks correct in this regard then.
> (I wish it were simpler but that's for a different/broader top
> cpuset+hotplug approach.)
You can't change the content of "cpuset.cpus" in the top cpuset (both v1
& v2). I believe the CPU hotplug does not attempt to update the cpumask
of tasks in the top cpuset mainly due to potential locking issues, as
hotplug is triggered by a hardware event. A partition change, however, is a
userspace event, so there shouldn't be any locking implications other
than the fact that per-cpu kthreads should not have their cpumasks changed.
To be consistent with commit 3fb906e7fabb ("cgroup/cpuset: Don't filter
offline CPUs in cpuset_cpus_allowed() for top cpuset tasks"), similar
logic needs to be applied in the latter case.
Cheers,
Longman
Hi Waiman,
On Mon, Mar 06, 2023 at 03:08:44PM -0500, Waiman Long wrote:
> This patch series includes miscellaneous updates to the cpuset and its
> testing code.
>
> Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset:
> Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
>
> Patches 3-4 are for handling corner cases when dealing with
> task_cpu_possible_mask().
Thanks for cc'ing me on these. I ran my arm64 asymmetric tests and, fwiw,
I get the same results as vanilla -rc2, so that's good.
One behaviour that persists (and which I thought might be addressed by this
series) is the following. Imagine a 4-CPU system with CPUs 0-1 being 64-bit
only. If I configure a parent cpuset with 'cpuset.cpus' of "0-2" and a
child cpuset with 'cpuset.cpus' of "0-1", then attaching a 32-bit task
to the child cpuset will result in an affinity mask of 4. If I then change
'cpuset.cpus' of the parent cpuset to "0-1,3", the affinity mask of the
task remains at '4' whereas it might be nice to update it to '8', in-line
with the new affinity mask of the parent cpuset.
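Concretely, the sequence looks something like this (a sketch; $PID32
stands for a hypothetical 32-bit task, and cgroup v2 is assumed to be
mounted at /sys/fs/cgroup):

  echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
  mkdir /sys/fs/cgroup/parent
  echo +cpuset > /sys/fs/cgroup/parent/cgroup.subtree_control
  mkdir /sys/fs/cgroup/parent/child
  echo 0-2 > /sys/fs/cgroup/parent/cpuset.cpus
  echo 0-1 > /sys/fs/cgroup/parent/child/cpuset.cpus
  echo $PID32 > /sys/fs/cgroup/parent/child/cgroup.procs
  taskset -p $PID32    # mask is 4 (CPU 2, outside the child's cpuset.cpus)
  echo 0-1,3 > /sys/fs/cgroup/parent/cpuset.cpus
  taskset -p $PID32    # still 4, not updated to 8 (CPU 3)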
Anyway, I'm not complaining (this is certainly _not_ a regression), but
I thought I'd highlight it in case you were aiming to address this with
your changes.
Cheers,
Will
On 3/15/23 12:24, Will Deacon wrote:
> Hi Waiman,
>
> On Mon, Mar 06, 2023 at 03:08:44PM -0500, Waiman Long wrote:
>> This patch series includes miscellaneous updates to the cpuset and its
>> testing code.
>>
>> Patch 2 is actually a follow-up of commit 3fb906e7fabb ("cgroup/cpuset:
>> Don't filter offline CPUs in cpuset_cpus_allowed() for top cpuset tasks").
>>
>> Patches 3-4 are for handling corner cases when dealing with
>> task_cpu_possible_mask().
> Thanks for cc'ing me on these. I ran my arm64 asymmetric tests and, fwiw,
> I get the same results as vanilla -rc2, so that's good.
>
> One behaviour that persists (and which I thought might be addressed by this
> series) is the following. Imagine a 4-CPU system with CPUs 0-1 being 64-bit
> only. If I configure a parent cpuset with 'cpuset.cpus' of "0-2" and a
> child cpuset with 'cpuset.cpus' of "0-1", then attaching a 32-bit task
> to the child cpuset will result in an affinity mask of 4. If I then change
> 'cpuset.cpus' of the parent cpuset to "0-1,3", the affinity mask of the
> task remains at '4' whereas it might be nice to update it to '8', in-line
> with the new affinity mask of the parent cpuset.
>
> Anyway, I'm not complaining (this is certainly _not_ a regression), but
> I thought I'd highlight it in case you were aiming to address this with
> your changes.
I believe it is because changes in the parent cpuset alone won't cause the
tasks in the child cpuset to be re-evaluated unless they cause a change
in the effective_cpus of the child cpuset, which doesn't happen here. We
currently don't track how many tasks in the child cpusets are using the
parent's cpumask due to a lack of runnable CPUs in the child cpuset. We
can only fix this if we track those special tasks. It is fixable,
but I don't know if it is a problem worth fixing.
Cheers,
Longman
On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long <[email protected]> wrote:
> Some arm64 systems can have asymmetric CPUs where certain tasks are only
> runnable on a selected subset of CPUs.
Ah, I'm catching up.
> This information is not captured in the cpuset. As a result,
> task_cpu_possible_mask() may return a mask that has no overlap with
> effective_cpus causing new_cpus to become empty.
I can see that historically, there was an approach of terminating
unaccommodable tasks:
94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores")
the removal of killing had been made possible with
df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
That gives two other alternatives to affinity modification:
2) kill such tasks (not unlike OOM upon memory.max reduction),
3) reject cpuset reduction (violates cgroup v2 delegation).
What do you think about 2)?
Michal
On 3/17/23 08:27, Michal Koutný wrote:
> On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long <[email protected]> wrote:
>> Some arm64 systems can have asymmetric CPUs where certain tasks are only
>> runnable on a selected subset of CPUs.
> Ah, I'm catching up.
>
>> This information is not captured in the cpuset. As a result,
> > task_cpu_possible_mask() may return a mask that has no overlap with
>> effective_cpus causing new_cpus to become empty.
> I can see that historically, there was an approach of terminating
> unaccommodable tasks:
> 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores")
> the removal of killing had been made possible with
> df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
>
> That gives two other alternatives to affinity modification:
> 2) kill such tasks (not unlike OOM upon memory.max reduction),
> 3) reject cpuset reduction (violates cgroup v2 delegation).
>
> What do you think about 2)?
Yes, killing it is one possible solution.
(3) doesn't work if the affinity change is due to hot cpu removal. So
that leaves this patch or (2) as the only alternative. I would like to
hear what Will and Tejun think about it.
I am going to remove this patch from the series for the time being.
Thanks,
Longman
On Fri, Mar 17, 2023 at 10:59:26AM -0400, Waiman Long wrote:
> On 3/17/23 08:27, Michal Koutný wrote:
> > On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long <[email protected]> wrote:
> > > Some arm64 systems can have asymmetric CPUs where certain tasks are only
> > > runnable on a selected subset of CPUs.
> > Ah, I'm catching up.
> >
> > > This information is not captured in the cpuset. As a result,
> > > task_cpu_possible_mask() may return a mask that has no overlap with
> > > effective_cpus causing new_cpus to become empty.
> > I can see that historically, there was an approach of terminating
> > unaccommodable tasks:
> > 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores")
> > the removal of killing had been made possible with
> > df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
> >
> > That gives two other alternatives to affinity modification:
> > 2) kill such tasks (not unlike OOM upon memory.max reduction),
> > 3) reject cpuset reduction (violates cgroup v2 delegation).
> >
> > What do you think about 2)?
>
> Yes, killing it is one possible solution.
>
> (3) doesn't work if the affinity change is due to hot cpu removal. So that
> leaves this patch or (2) as the only alternative. I would like to hear what
> Will and Tejun think about it.
The main constraint from the Android side (the lucky ecosystem where these
SoCs tend to show up) is that existing userspace (including 32-bit binaries)
continues to function without modification. So approaches such as killing
tasks or rejecting system calls tend not to work as well, since you
inevitably get divergent behaviour leading to functional breakage rather
than e.g. performance anomalies.
Having said that, the behaviour we currently have in mainline seems to
be alright, so please don't go out of your way to accommodate these SoCs.
I'm mainly just concerned about introducing any regressions, which is why
I ran my tests on this series.
Cheers,
Will
On 3/24/23 10:32, Will Deacon wrote:
> On Fri, Mar 17, 2023 at 10:59:26AM -0400, Waiman Long wrote:
>> On 3/17/23 08:27, Michal Koutný wrote:
>>> On Tue, Mar 14, 2023 at 04:22:06PM -0400, Waiman Long <[email protected]> wrote:
>>>> Some arm64 systems can have asymmetric CPUs where certain tasks are only
>>>> runnable on a selected subset of CPUs.
>>> Ah, I'm catching up.
>>>
>>>> This information is not captured in the cpuset. As a result,
>>>> task_cpu_possible_mask() may return a mask that has no overlap with
>>>> effective_cpus causing new_cpus to become empty.
>>> I can see that historically, there was an approach of terminating
>>> unaccommodable tasks:
>>> 94f9c00f6460 ("arm64: Remove logic to kill 32-bit tasks on 64-bit-only cores")
>>> the removal of killing had been made possible with
>>> df950811f4a8 ("arm64: Prevent offlining first CPU with 32-bit EL0 on mismatched system").
>>>
>>> That gives two other alternatives to affinity modification:
>>> 2) kill such tasks (not unlike OOM upon memory.max reduction),
>>> 3) reject cpuset reduction (violates cgroup v2 delegation).
>>>
>>> What do you think about 2)?
>> Yes, killing it is one possible solution.
>>
>> (3) doesn't work if the affinity change is due to hot cpu removal. So that
>> leaves this patch or (2) as the only alternative. I would like to hear what
>> Will and Tejun think about it.
> The main constraint from the Android side (the lucky ecosystem where these
> SoCs tend to show up) is that existing userspace (including 32-bit binaries)
> continues to function without modification. So approaches such as killing
> tasks or rejecting system calls tend not to work as well, since you
> inevitably get divergent behaviour leading to functional breakage rather
> than e.g. performance anomalies.
>
> Having said that, the behaviour we currently have in mainline seems to
> be alright, so please don't go out of your way to accommodate these SoCs.
> I'm mainly just concerned about introducing any regressions, which is why
> I ran my tests on this series.
I agree that killing it may be too draconian. I am withholding this
patch for now.
Thanks,
Longman
On Fri, Mar 24, 2023 at 02:32:50PM +0000, Will Deacon <[email protected]> wrote:
> So approaches such as killing tasks or rejecting system calls tend not
> to work as well, since you inevitably get divergent behaviour leading
> to functional breakage rather than e.g. performance anomalies.
What about a temporary performance drop from 100% to 0%, i.e. freezing the
tasks for the duration of the mismatched affinity config?
> Having said that, the behaviour we currently have in mainline seems to
> be alright, so please don't go out of your way to accommodate these SoCs.
I see. (Just wondering what you think about the fourth option above.)
Thanks,
Michal
On 3/24/23 14:19, Michal Koutný wrote:
> On Fri, Mar 24, 2023 at 02:32:50PM +0000, Will Deacon <[email protected]> wrote:
>> So approaches such as killing tasks or rejecting system calls tend not
>> to work as well, since you inevitably get divergent behaviour leading
>> to functional breakage rather than e.g. performance anomalies.
> What about a temporary performance drop from 100% to 0%, i.e. freezing the
> tasks for the duration of the mismatched affinity config?
Freezing the tasks can be a lot of extra work. I would prefer something
simpler.
Without this patch, I believe it will lead to an empty cpumask, which will
cause the scheduler to pick a fallback cpu. It looks like the fallback
code may be able to pick the right cpu, or it may panic the system
(less likely).
Cheers,
Longman