If a process is limited by taskset (i.e. cpuset) to run only on cpu N,
and cpu N is then offlined via hotplug, the process will be assigned the
current value of its cpuset cgroup's effective_cpus field in a call to
do_set_cpus_allowed() in cpuset_cpus_allowed_fallback(). This argument's
value does not make sense for this case, because
task_cs(tsk)->effective_cpus is modified by cpuset_hotplug_workfn()
to reflect the new value of cpu_active_mask after cpu N is removed from
the mask. While this may make sense for the cgroup affinity mask, it
does not make sense on a per-task basis: a task that was previously
restricted to cpu N alone will be allowed on every cpu _except_ cpu N
after that cpu is offlined and onlined again via hotplug.
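The clobbering happens on the hotplug path. A rough sketch, paraphrased
from cpuset_hotplug_workfn() and its helpers in kernel/cgroup/cpuset.c
(not a verbatim excerpt):

	/*
	 * On offline, a cpuset's effective_cpus is recomputed against
	 * the cpus that remain active, so it can no longer contain cpu
	 * N; in legacy mode, cpus_allowed is overwritten to match. Any
	 * task that later falls back to this mask therefore inherits
	 * "every active cpu except N".
	 */
	cpumask_and(&new_cpus, cs->cpus_allowed, parent_cs(cs)->effective_cpus);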
Pre-patch behavior:
$ grep Cpus /proc/$$/status
Cpus_allowed: ff
Cpus_allowed_list: 0-7
$ taskset -p 4 $$
pid 19202's current affinity mask: f
pid 19202's new affinity mask: 4
$ grep Cpus /proc/self/status
Cpus_allowed: 04
Cpus_allowed_list: 2
# echo off > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed: 0b
Cpus_allowed_list: 0-1,3
# echo on > /sys/devices/system/cpu/cpu2/online
$ grep Cpus /proc/$$/status
Cpus_allowed: 0b
Cpus_allowed_list: 0-1,3
On a patched system, the final grep produces the following
output instead:
$ grep Cpus /proc/$$/status
Cpus_allowed: ff
Cpus_allowed_list: 0-7
This patch changes the above behavior by instead resetting the mask to
task_cs(tsk)->cpus_allowed by default, and to cpu_possible_mask in legacy
mode.
This fallback mechanism is triggered only after _every_ other valid avenue
has been exhausted; it is the last resort before calling BUG().
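For reference, that chain in select_fallback_rq() (kernel/sched/core.c)
looks roughly like the following. This is a paraphrase of the upstream
code, not a verbatim excerpt:

	/* "No more Mr. Nice Guy.": escalate through the fallback states */
	switch (state) {
	case cpuset:
		if (IS_ENABLED(CONFIG_CPUSETS)) {
			/* the fallback changed by this patch */
			cpuset_cpus_allowed_fallback(p);
			state = possible;
			break;
		}
		/* fall through */
	case possible:
		do_set_cpus_allowed(p, cpu_possible_mask);
		state = fail;
		break;
	case fail:
		BUG();
		break;
	}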
Signed-off-by: Joel Savitz <[email protected]>
---
kernel/cgroup/cpuset.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 4834c4214e9c..6c9deb2cc687 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -3255,10 +3255,23 @@ void cpuset_cpus_allowed(struct task_struct *tsk, struct cpumask *pmask)
spin_unlock_irqrestore(&callback_lock, flags);
}
+/**
+ * cpuset_cpus_allowed_fallback - final fallback before complete catastrophe.
+ * @tsk: pointer to task_struct with which the scheduler is struggling
+ *
+ * Description: If the scheduler cannot find an allowed cpu in
+ * tsk->cpus_allowed, we fall back to task_cs(tsk)->cpus_allowed. In legacy
+ * mode, however, that value is kept equal to task_cs(tsk)->effective_cpus,
+ * which may not contain a sane cpumask after an event such as cpu hotplug,
+ * so we fall back to cpu_possible_mask instead. This is the scheduler's
+ * absolute last resort, used only after _every_ other avenue has been
+ * exhausted.
+ */
void cpuset_cpus_allowed_fallback(struct task_struct *tsk)
{
rcu_read_lock();
- do_set_cpus_allowed(tsk, task_cs(tsk)->effective_cpus);
+ do_set_cpus_allowed(tsk, is_in_v2_mode() ?
+ task_cs(tsk)->cpus_allowed : cpu_possible_mask);
rcu_read_unlock();
/*
--
2.18.1
On 04/09/2019 04:40 PM, Joel Savitz wrote:
> If a process is limited by taskset (i.e. cpuset) to run only on cpu N,
> and cpu N is then offlined via hotplug, the process will be assigned the
> current value of its cpuset cgroup's effective_cpus field in a call to
> do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> [...]
Acked-by: Waiman Long <[email protected]>
On Tue, Apr 09, 2019 at 04:40:03PM -0400 Joel Savitz wrote:
> If a process is limited by taskset (i.e. cpuset) to run only on cpu N,
> and cpu N is then offlined via hotplug, the process will be assigned the
> current value of its cpuset cgroup's effective_cpus field in a call to
> do_set_cpus_allowed() in cpuset_cpus_allowed_fallback().
> [...]
Fwiw,
Acked-by: Phil Auld <[email protected]>
--
On Tue, Apr 09, 2019 at 04:40:03PM -0400, Joel Savitz <[email protected]> wrote:
> $ grep Cpus /proc/$$/status
> Cpus_allowed: ff
> Cpus_allowed_list: 0-7
(a)
> $ taskset -p 4 $$
> pid 19202's current affinity mask: f
> pid 19202's new affinity mask: 4
>
> $ grep Cpus /proc/self/status
> Cpus_allowed: 04
> Cpus_allowed_list: 2
>
> # echo off > /sys/devices/system/cpu/cpu2/online
> $ grep Cpus /proc/$$/status
> Cpus_allowed: 0b
> Cpus_allowed_list: 0-1,3
I'm confused about where this value comes from; I must be missing something.
Joel, is the task in question put into a cpuset with 0xf CPUs only (at
point (a))? Or are the CPUs 4-7 offlined as well?
Thanks,
Michal
On Tue, May 21, 2019 at 10:35 AM Michal Koutný <[email protected]> wrote:
> > $ grep Cpus /proc/$$/status
> > Cpus_allowed: ff
> > Cpus_allowed_list: 0-7
>
> (a)
>
> > $ taskset -p 4 $$
> > pid 19202's current affinity mask: f
> I'm confused about where this value comes from; I must be missing something.
>
> Joel, is the task in question put into a cpuset with 0xf CPUs only (at
> point (a))? Or are the CPUs 4-7 offlined as well?
Good point.
It is a bit ambiguous, but I performed no action on the task's cpuset
nor did I offline any cpus at point (a).
After a bit of research, I am fairly certain that the observed
discrepancy is due to the differing mechanisms used to read the mask.
The first mechanism, `grep Cpus /proc/$$/status`, has its value
populated from the expression (task->cpus_allowed) in
fs/proc/array.c:task_cpus_allowed(), whereas the taskset utility
(https://github.com/karelzak/util-linux/blob/master/schedutils/taskset.c)
uses sched_getaffinity(2), which reports the expression
(task->cpus_allowed & cpu_active_mask) computed in
kernel/sched/core.c:sched_getaffinity().
I do not know whether there is an explicit reason for this discrepancy
or whether the two mechanisms were simply built independently, perhaps
for different purposes.
I think the /proc/$$/status value is intended to simply reflect the
user-specified policy stating which cpus the task is allowed to run on
without consideration for hardware state, whereas the taskset value is
representative of the cpus that the task can actually be run on given
the restriction policy specified by the user via the cpuset mechanism.
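Concretely, the two reads boil down to roughly the following
(paraphrased from the sources of this era, not verbatim):

	/* fs/proc/array.c: /proc/<pid>/status reports the raw policy mask */
	seq_printf(m, "Cpus_allowed:\t%*pb\n",
		   cpumask_pr_args(&task->cpus_allowed));

	/* kernel/sched/core.c: sched_getaffinity(2) masks out inactive cpus */
	cpumask_and(mask, &p->cpus_allowed, cpu_active_mask);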
By the way, I posted a v2 of this patch that correctly handles cgroup
v2 behavior.
Thanks for digging through this.
On Fri, May 24, 2019 at 11:33:55AM -0400, Joel Savitz <[email protected]> wrote:
> It is a bit ambiguous, but I performed no action on the task's cpuset
> nor did I offline any cpus at point (a).
So did you do any operation that left you with
cpu_active_mask & 0xf0 == 0
?
(If so, I think the demo code should be made without it to avoid the
confusion.)
Regardless, the demo code should also specify in what cpuset it happens
(for the v2 case).
> I think the /proc/$$/status value is intended to simply reflect the
> user-specified policy stating which cpus the task is allowed to run on
> without consideration for hardware state, whereas the taskset value is
> representative of the cpus that the task can actually be run on given
> the restriction policy specified by the user via the cpuset mechanism.
Yes, it seems to me to be somewhat analogous to effective_cpus vs
cpus_allowed in the v2 cpuset.
> By the way, I posted a v2 of this patch that correctly handles cgroup
> v2 behavior.
I see the original version made the state = cpuset step in
select_fallback_rq() mostly redundant. The split on the v2 hierarchy in
the v2 patch makes some sense. Although, on v1 we will lose the "no
longer affine to..." message (which is what happens in your demo IIUC).
Michal
I just did a quick test on a patched kernel to check on that "no
longer affine to..." message:
# nproc
64
# taskset -p 4 $$
pid 2261's current affinity mask: ffffffffffffffff
pid 2261's new affinity mask: 4
# echo off > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask: fffffffffffffffb
# echo on > /sys/devices/system/cpu/cpu2/online
# taskset -p $$
pid 2261's current affinity mask: ffffffffffffffff
# dmesg | tail -5
[ 143.996375] process 2261 (bash) no longer affine to cpu2
[ 143.996657] IRQ 114: no longer affine to CPU2
[ 144.007472] IRQ 227: no longer affine to CPU2
[ 144.013460] smpboot: CPU 2 is now offline
[ 162.685519] smpboot: Booting Node 0 Processor 2 APIC 0x4
dmesg output is observably the same on patched and unpatched kernels
in this case.
The only difference in output is that on an unpatched kernel, the last
`taskset -p $$` outputs:

pid 2274's current affinity mask: fffffffffffffffb

which is the behavior this patch aims to change. This case, which I
believe is generalizable, demonstrates that the "no longer affine
to..." output is retained on a kernel with this patch.
Best,
Joel Savitz
On Tue, May 28, 2019 at 02:10:37PM +0200, Michal Koutný <[email protected]> wrote:
> Although, on v1 we will lose the "no longer affine to..." message
> (which is what happens in your demo IIUC).
FWIW, I was wrong; I was off by one 'state' transition, so the patch
doesn't cause a change in messaging (not tested, though).