It was found that any change to the current cpuset hierarchy may reset
the cpus_allowed list of the tasks in the affected cpusets to the
default cpuset value even if those tasks had their cpu affinity
explicitly set by the user beforehand. That is especially easy to
trigger in a cgroup v2 environment, where writing "+cpuset" to the root
cgroup's cgroup.subtree_control file resets the cpu affinity of all the
processes in the system.

That is especially problematic in a nohz_full environment, where the
tasks running on the nohz_full CPUs usually have their cpu affinity
explicitly set and will behave incorrectly if that affinity changes.

Fix this problem by adding a flag to the task structure to indicate
that a task has had its cpu affinity explicitly set, and make the
cpuset code leave such a task's cpus_allowed list unchanged unless the
user-chosen cpu list is no longer a subset of the cpus_allowed list of
the cpuset itself.

With that change in place, it was verified that tasks that have their
cpu affinity explicitly set are not affected by changes made to the v2
cgroup.subtree_control files.

Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched.h | 1 +
kernel/cgroup/cpuset.c | 18 ++++++++++++++++--
kernel/sched/core.c | 1 +
3 files changed, 18 insertions(+), 2 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index c46f3a63b758..60ae022fa842 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -815,6 +815,7 @@ struct task_struct {
 	unsigned int			policy;
 	int				nr_cpus_allowed;
+	int				cpus_affinity_set;
 	const cpumask_t			*cpus_ptr;
 	cpumask_t			*user_cpus_ptr;
 	cpumask_t			cpus_mask;
diff --git a/kernel/cgroup/cpuset.c b/kernel/cgroup/cpuset.c
index 71a418858a5e..c47757c61f39 100644
--- a/kernel/cgroup/cpuset.c
+++ b/kernel/cgroup/cpuset.c
@@ -704,6 +704,20 @@ static int validate_change(struct cpuset *cur, struct cpuset *trial)
return ret;
}
+/*
+ * Don't change the cpus_allowed list if cpu affinity has been explicitly
+ * set before, unless that list is no longer a subset of the new cpu list.
+ */
+static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
+				       const struct cpumask *new_mask)
+{
+	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
+		return 0;
+
+	p->cpus_affinity_set = 0;
+	return set_cpus_allowed_ptr(p, new_mask);
+}
+
 #ifdef CONFIG_SMP
 /*
  * Helper routine for generate_sched_domains().
@@ -1130,7 +1144,7 @@ static void update_tasks_cpumask(struct cpuset *cs)
 	css_task_iter_start(&cs->css, 0, &it);
 	while ((task = css_task_iter_next(&it)))
-		set_cpus_allowed_ptr(task, cs->effective_cpus);
+		cpuset_set_cpus_allowed_ptr(task, cs->effective_cpus);
 	css_task_iter_end(&it);
 }
@@ -2303,7 +2317,7 @@ static void cpuset_attach(struct cgroup_taskset *tset)
 		 * can_attach beforehand should guarantee that this doesn't
 		 * fail. TODO: have a better way to handle failure here
 		 */
-		WARN_ON_ONCE(set_cpus_allowed_ptr(task, cpus_attach));
+		WARN_ON_ONCE(cpuset_set_cpus_allowed_ptr(task, cpus_attach));
 		cpuset_change_task_nodemask(task, &cpuset_attach_nodemask_to);
 		cpuset_update_task_spread_flag(cs, task);
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index da0bf6fe9ecd..ab8ea6fa92db 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -8034,6 +8034,7 @@ __sched_setaffinity(struct task_struct *p, const struct cpumask *mask)
 	if (retval)
 		goto out_free_new_mask;
+	p->cpus_affinity_set = 1;
 	cpuset_cpus_allowed(p, cpus_allowed);
 	if (!cpumask_subset(new_mask, cpus_allowed)) {
 		/*
--
2.31.1
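
As a minimal userspace model of the rule the patch implements, with
plain unsigned-long bitmasks standing in for kernel cpumasks (the names
below are illustrative, not kernel API):

#include <assert.h>
#include <stdbool.h>

struct task {
	unsigned long mask;	/* models p->cpus_mask */
	bool affinity_set;	/* models the new p->cpus_affinity_set */
};

/* Models the patched update: skip the reset while the user-set mask
 * still fits inside the cpuset's new effective mask. */
static void cpuset_update(struct task *t, unsigned long new_mask)
{
	if (t->affinity_set && (t->mask & ~new_mask) == 0)
		return;			/* keep the user's choice */
	t->affinity_set = false;
	t->mask = new_mask;		/* models set_cpus_allowed_ptr() */
}

int main(void)
{
	/* taskset -pc 2-3: cpus 2-3 == bits 2-3 == 0xc */
	struct task t = { .mask = 0xc, .affinity_set = true };

	cpuset_update(&t, 0xf);	/* cpuset covers cpus 0-3: 0xc kept */
	assert(t.mask == 0xc && t.affinity_set);
	cpuset_update(&t, 0x3);	/* cpuset shrinks to cpus 0-1: reset */
	assert(t.mask == 0x3 && !t.affinity_set);
	return 0;
}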
Hello.
On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <[email protected]> wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks had their cpu affinity
> explicitly set by the user beforehand.
I'm surprised this went so long unnoticed / unreported.
Could it be users relied on that implicit affinity reset?
> That is especially easy to trigger in a cgroup v2 environment, where
> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
> resets the cpu affinity of all the processes in the system.
This should apply only to tasks that were extracted out of the root
cgroup, no? (OK, those are all processes practically.)
(Even without your second patch, the scope should be limited because of
src_cset==dst_cset check in cgroup_migrate_prepare_dst().)
> That is especially problematic in a nohz_full environment, where the
> tasks running on the nohz_full CPUs usually have their cpu affinity
> explicitly set and will behave incorrectly if that affinity changes.
One could also argue that for such processes, the cgroup hierarchy
should be configured first, and only then should they start and set
their own affinity.
> Fix this problem by adding a flag to the task structure to indicate
> that a task has had its cpu affinity explicitly set, and make the
> cpuset code leave such a task's cpus_allowed list unchanged unless the
> user-chosen cpu list is no longer a subset of the cpus_allowed list of
> the cpuset itself.
I'm uneasy with the occasional revert of this flag, i.e. a task that
set its affinity would sometimes have it overwritten and sometimes
not (which might have been relied on, especially with writes into
cpuset.cpus).
(But I have no better answer than the counter-argument above since
there's no easier way to detect the implicit migrations.)
Also, is there a similar effect with memory binding?
Thanks,
Michal
On 7/28/22 10:44, Michal Koutný wrote:
> Hello.
>
> On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long <[email protected]> wrote:
>> It was found that any change to the current cpuset hierarchy may reset
>> the cpus_allowed list of the tasks in the affected cpusets to the
>> default cpuset value even if those tasks had their cpu affinity
>> explicitly set by the user beforehand.
> I'm surprised this went so long unnoticed / unreported.
>
> Could it be users relied on that implicit affinity reset?
As said, it is more easily triggered in a cgroup v2 environment.
Systemd, in a cgroup v2 environment, will write "+cpuset" to the root
cgroup's subtree_control file when a new container is instantiated. I
don't know why it is doing that, but it causes problems as it resets all
the cpus_allowed list assignments. Cgroup v1 doesn't have this problem.
>> That is especially easy to trigger in a cgroup v2 environment, where
>> writing "+cpuset" to the root cgroup's cgroup.subtree_control file
>> resets the cpu affinity of all the processes in the system.
> This should apply only to tasks that were extracted out of the root
> cgroup, no? (OK, those are all processes practically.)
The reset is done on all cgroups in a particular subtree. In the case of
cgroup root, it is all the processes in the system.
>
> (Even without your second patch, the scope should be limited because of
> src_cset==dst_cset check in cgroup_migrate_prepare_dst().)
>
>> That is especially problematic in a nohz_full environment, where the
>> tasks running on the nohz_full CPUs usually have their cpu affinity
>> explicitly set and will behave incorrectly if that affinity changes.
> One could also argue that for such processes, the cgroup hierarchy
> should be configured first, and only then should they start and set
> their own affinity.
>
>> Fix this problem by adding a flag to the task structure to indicate
>> that a task has had its cpu affinity explicitly set, and make the
>> cpuset code leave such a task's cpus_allowed list unchanged unless the
>> user-chosen cpu list is no longer a subset of the cpus_allowed list of
>> the cpuset itself.
> I'm uneasy with the occasional revert of this flag, i.e. a task that
> set its affinity would sometimes have it overwritten and sometimes
> not (which might have been relied on, especially with writes into
> cpuset.cpus).
> (But I have no better answer than the counter-argument above since
> there's no easier way to detect the implicit migrations.)
I also thought about that. Another possible alternative is to use the
intersection of the cpuset's cpu list and the task's own cpu list as
long as it is not empty. Reducing the number of cpus assigned to a task
may produce some unexpected behavior too.
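
A sketch of that alternative in the same spirit, with plain bitmasks
and hypothetical names: keep the intersection while it is non-empty,
fall back to the cpuset's list otherwise.

/* Use the intersection of the task's cpu list and the cpuset's cpu
 * list as long as it is not empty; otherwise reset to the cpuset's
 * full effective list. */
static unsigned long intersect_or_reset(unsigned long task_mask,
					unsigned long cpuset_mask)
{
	unsigned long common = task_mask & cpuset_mask;

	return common ? common : cpuset_mask;
}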
>
> Also, is there a similar effect with memory binding?
I think so, but memory binding is less frequently used and its effect is
less noticeable.
Cheers,
Longman
>
> Thanks,
> Michal
>
On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <[email protected]> wrote:
> Cgroup v1 doesn't have this problem.
v1 analogy would be:
echo 2-3 >$dst/cpuset.cpus
# job runs in $dst
# one task T in $dst sets affinity just to one cpu
# I rethink my config, I want to allow $dst more space
echo 2-5 >$dst/cpuset.cpus
Most tasks in $dst happily utilize the new cpus but it breaks affinity
for T -- this must have been broken forever.
(Or I'd argue that per-thread affinities are just recommendations; if I
have a task for a nohz CPU, I should enforce its placement with cpuset
from the beginning.)
Michal
On 7/28/22 11:23, Michal Koutný wrote:
> On Thu, Jul 28, 2022 at 10:59:01AM -0400, Waiman Long <[email protected]> wrote:
>> Cgroup v1 doesn't have this problem.
> v1 analogy would be:
>
> echo 2-3 >$dst/cpuset.cpus
> # job runs in $dst
> # one task T in $dst sets affinity just to one cpu
> # I rethink my config, I want to allow $dst more space
> echo 2-5 >$dst/cpuset.cpus
>
> Most tasks in $dst happily utilize the new cpus but it breaks affinity
> for T -- this must have been broken forever.
>
> (Or I'd argue that per-thread affinities are just recommendations; if I
> have a task for a nohz CPU, I should enforce its placement with cpuset
> from the beginning.)
I should have clarified that what I meant is that systemd in a cgroup v1
environment doesn't cause this cpu list reset to happen. It doesn't mean
that cgroup v1 has no similar problem. Sorry for the confusion.
Cheers,
Longman
Hello,
On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long wrote:
> It was found that any change to the current cpuset hierarchy may reset
> the cpus_allowed list of the tasks in the affected cpusets to the
> default cpuset value even if those tasks had their cpu affinity
> explicitly set by the user beforehand. That is especially easy to
> trigger in a cgroup v2 environment, where writing "+cpuset" to the
> root cgroup's cgroup.subtree_control file resets the cpu affinity of
> all the processes in the system.
>
> That is especially problematic in a nohz_full environment, where the
> tasks running on the nohz_full CPUs usually have their cpu affinity
> explicitly set and will behave incorrectly if that affinity changes.
>
> Fix this problem by adding a flag to the task structure to indicate
> that a task has had its cpu affinity explicitly set, and make the
> cpuset code leave such a task's cpus_allowed list unchanged unless the
> user-chosen cpu list is no longer a subset of the cpus_allowed list of
> the cpuset itself.
>
> With that change in place, it was verified that tasks that have their
> cpu affinity explicitly set are not affected by changes made to the
> v2 cgroup.subtree_control files.
I think the underlying cause here is cpuset overwriting the cpumask the user
configured but that's a longer discussion.
> +/*
> + * Don't change the cpus_allowed list if cpu affinity has been explicitly
> + * set before, unless that list is no longer a subset of the new cpu list.
> + */
> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
> +				       const struct cpumask *new_mask)
> +{
> +	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
> +		return 0;
> +
> +	p->cpus_affinity_set = 0;
> +	return set_cpus_allowed_ptr(p, new_mask);
> +}
I wonder whether the more predictable behavior would be to never reset
the cpumask when it's a subset of the new_mask. Also, shouldn't this
check p->cpus_mask instead of p->cpus_ptr?
Thanks.
--
tejun
On 28/07/22 10:59, Waiman Long wrote:
> On 7/28/22 10:44, Michal Koutný wrote:
>> This should apply only to tasks that were extracted out of the root
>> cgroup, no? (OK, those are all processes practically.)
> The reset is done on all cgroups in a particular subtree. In the case of
> cgroup root, it is all the processes in the system.
>>
I've been briefly playing with this; tasks in the cgroup root don't seem
affected on my end (QEMU + buildroot + latest tip/sched/core):
$ mount -t cgroup2 none /sys/fs/cgroup
$ /root/loop.sh &
$ PID=$!
$ taskset -pc 2-3 $PID
pid 177's current affinity list: 0-3
pid 177's new affinity list: 2,3
$ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
$ taskset -pc $PID
pid 177's current affinity list: 2,3
However tasks extracted out as mentioned by Michal definitely are:
$ mount -t cgroup2 none /sys/fs/cgroup
$ /root/loop.sh &
$ PID=$!
$ taskset -pc 2-3 $PID
pid 172's current affinity list: 0-3
pid 172's new affinity list: 2,3
$ mkdir /sys/fs/cgroup/foobar
$ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
$ taskset -pc $PID
pid 172's current affinity list: 2,3
$ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
$ taskset -pc $PID
pid 172's current affinity list: 0-3
IIUC this is just what happens anytime a task gets migrated to a new
cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
+cpuset migrates it to the /foobar one.
Does that match what you're seeing?
On 7/28/22 12:50, Valentin Schneider wrote:
> On 28/07/22 10:59, Waiman Long wrote:
>> On 7/28/22 10:44, Michal Koutný wrote:
>>> This should apply only to tasks that were extracted out of the root
>>> cgroup, no? (OK, those are all processes practically.)
>> The reset is done on all cgroups in a particular subtree. In the case of
>> cgroup root, it is all the processes in the system.
> I've been briefly playing with this; tasks in the cgroup root don't seem
> affected on my end (QEMU + buildroot + latest tip/sched/core):
>
> $ mount -t cgroup2 none /sys/fs/cgroup
> $ /root/loop.sh &
> $ PID=$!
> $ taskset -pc 2-3 $PID
> pid 177's current affinity list: 0-3
> pid 177's new affinity list: 2,3
> $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> $ taskset -pc $PID
> pid 177's current affinity list: 2,3
>
> However tasks extracted out as mentioned by Michal definitely are:
>
> $ mount -t cgroup2 none /sys/fs/cgroup
> $ /root/loop.sh &
> $ PID=$!
> $ taskset -pc 2-3 $PID
> pid 172's current affinity list: 0-3
> pid 172's new affinity list: 2,3
> $ mkdir /sys/fs/cgroup/foobar
> $ echo $PID > /sys/fs/cgroup/foobar/cgroup.procs
> $ taskset -pc $PID
> pid 172's current affinity list: 2,3
> $ echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
> $ taskset -pc $PID
> pid 172's current affinity list: 0-3
>
> IIUC this is just what happens anytime a task gets migrated to a new
> cpuset. Initially loop.sh remains attached to the root cpuset, and the echo
> +cpuset migrates it to the /foobar one.
>
> Does that match what you're seeing?
>
Yes. echo "+cpuset" to subtree_control means tasks in the child cgroups
will move to new cpusets. Those new cpusets will have the same cpu lists
as the parent unless the cpuset.cpus files are explicitly written to.
This patch will ensure that tasks that have explicitly set their cpu
affinity won't be affected by this change.
Cheers,
Longman
On 7/28/22 13:23, Tejun Heo wrote:
> Hello,
>
> On Wed, Jul 27, 2022 at 08:58:14PM -0400, Waiman Long wrote:
>> It was found that any change to the current cpuset hierarchy may reset
>> the cpus_allowed list of the tasks in the affected cpusets to the
>> default cpuset value even if those tasks had their cpu affinity
>> explicitly set by the user beforehand. That is especially easy to
>> trigger in a cgroup v2 environment, where writing "+cpuset" to the
>> root cgroup's cgroup.subtree_control file resets the cpu affinity of
>> all the processes in the system.
>>
>> That is especially problematic in a nohz_full environment, where the
>> tasks running on the nohz_full CPUs usually have their cpu affinity
>> explicitly set and will behave incorrectly if that affinity changes.
>>
>> Fix this problem by adding a flag to the task structure to indicate
>> that a task has had its cpu affinity explicitly set, and make the
>> cpuset code leave such a task's cpus_allowed list unchanged unless the
>> user-chosen cpu list is no longer a subset of the cpus_allowed list of
>> the cpuset itself.
>>
>> With that change in place, it was verified that tasks that have their
>> cpu affinity explicitly set are not affected by changes made to the
>> v2 cgroup.subtree_control files.
> I think the underlying cause here is cpuset overwriting the cpumask the user
> configured but that's a longer discussion.
>
>> +/*
>> + * Don't change the cpus_allowed list if cpu affinity has been explicitly
>> + * set before, unless that list is no longer a subset of the new cpu list.
>> + */
>> +static int cpuset_set_cpus_allowed_ptr(struct task_struct *p,
>> +				       const struct cpumask *new_mask)
>> +{
>> +	if (p->cpus_affinity_set && cpumask_subset(p->cpus_ptr, new_mask))
>> +		return 0;
>> +
>> +	p->cpus_affinity_set = 0;
>> +	return set_cpus_allowed_ptr(p, new_mask);
>> +}
> I wonder whether the more predictable behavior would be to never reset
> the cpumask when it's a subset of the new_mask.
There can be a counter-argument that if a user finds out that there are
not enough cpus in a cpuset to meet its performance target, one can
always increase the number of cpus in the cpuset. Generalizing this
behavior to all tasks, irrespective of whether they have explicitly set
cpu affinity before, will disallow this use case.
> Also, shouldn't this check
> p->cpus_mask instead of p->cpus_ptr?
You are right. I should have used cpus_mask instead. Will send out a v2.
Cheers,
Longman
Hello,
On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
> There can be a counter-argument that if a user finds out that there are
> not enough cpus in a cpuset to meet its performance target, one can
> always increase the number of cpus in the cpuset. Generalizing this
> behavior to all tasks, irrespective of whether they have explicitly set
> cpu affinity before, will disallow this use case.
This is nasty. The real solution here is separating out what the user
requested and the mask that cpuset (or cpu hotplug) needs to apply on
top, i.e. remember what the user requested in a separate cpumask and
compute the intersection into p->cpus_mask whenever something changes
and apply fallbacks on that final mask. Multiple parties updating the
same variable is never gonna lead to anything consistent and we're
patching up for whatever the immediate use case seems to need at the
moment. That said, I'm not necessarily against patching it up but if
you're interested in delving into it deeper, that'd be great.
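
In toy form, the scheme described above would keep the request and the
effective mask separate and recompute one from the other (illustrative
names only, not what the kernel does today):

struct task_masks {
	unsigned long requested;	/* what the user asked for, never clobbered */
	unsigned long effective;	/* what the scheduler actually uses */
};

/* On any cpuset or hotplug change, recompute the effective mask from
 * the remembered request; fall back when the intersection is empty. */
static void recompute_effective(struct task_masks *t, unsigned long cpuset_mask)
{
	unsigned long common = t->requested & cpuset_mask;

	t->effective = common ? common : cpuset_mask;
}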
Thanks.
--
tejun
On 7/28/22 15:02, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
>> There can be a counter-argument that if a user finds out that there are
>> not enough cpus in a cpuset to meet its performance target, one can
>> always increase the number of cpus in the cpuset. Generalizing this
>> behavior to all tasks, irrespective of whether they have explicitly set
>> cpu affinity before, will disallow this use case.
> This is nasty.
That is a nasty example, I know. There may be users depending on the
existing behavior even if they don't know it. So I am a bit hesitant to
change the default behavior like that. On the other hand, tasks that
have explicitly set their cpu affinity certainly don't want unexpected
changes to it.
> The real solution here is separating out what the user requested
> and the mask that cpuset (or cpu hotplug) needs to apply on top, i.e.
> remember what the user requested in a separate cpumask and compute the
> intersection into p->cpus_mask whenever something changes and apply
> fallbacks on that final mask. Multiple parties updating the same variable is
> never gonna lead to anything consistent and we're patching up for whatever
> the immediate use case seems to need at the moment. That said, I'm not
> necessarily against patching it up but if you're interested in delving into
> it deeper, that'd be great.
I believe the current code already restricts what cpu affinity a user
can request by limiting it to the cpus allowed by the current cpuset.
Hotplug is another issue that may need to be addressed. I will update my
patch to make it handle hotplug in a more graceful way.
Thanks,
Longman
Hello,
On Thu, Jul 28, 2022 at 03:21:26PM -0400, Waiman Long wrote:
> On 7/28/22 15:02, Tejun Heo wrote:
> > On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
> > > There can be a counter-argument that if a user finds out that there are
> > > not enough cpus in a cpuset to meet its performance target, one can always
> > > increase the number of cpus in the cpuset. Generalizing this behavior to
> > > all tasks, irrespective of whether they have explicitly set cpu affinity
> > > before, will disallow this use case.
> > This is nasty.
>
> That is a nasty example, I know. There may be users depending on the
> existing behavior even if they don't know it. So I am a bit hesitant to
> change the default behavior like that. On the other hand, tasks that have
> explicitly set their cpu affinity certainly don't want unexpected
> changes to it.
Yeah, I hear you. I'm on the same page.
> > The real solution here is separating out what the user requested
> > and the mask that cpuset (or cpu hotplug) needs to apply on top, i.e.
> > remember what the user requested in a separate cpumask and compute the
> > intersection into p->cpus_mask whenever something changes and apply
> > fallbacks on that final mask. Multiple parties updating the same variable is
> > never gonna lead to anything consistent and we're patching up for whatever
> > the immediate use case seems to need at the moment. That said, I'm not
> > necessarily against patching it up but if you're interested in delving into
> > it deeper, that'd be great.
>
> I believe the current code already restricts what cpu affinity a user
> can request by limiting it to the cpus allowed by the current cpuset.
> Hotplug is another issue that may need to be addressed. I will update my
> patch to make it handle hotplug in a more graceful way.
So, the patch you proposed is making the code remember one special aspect of
user requested configuration - whether it configured it or not, and trying
to preserve that particular state as cpuset state changes. It addresses the
immediate problem but it is a very partial approach. Let's say a task wanna
be affined to one logical thread of each core and set its mask to 0x5555.
Now, let's say cpuset got enabled and enforced 0xff and affined the task to
0xff. After a while, the cgroup got more cpus allocated and its cpuset now
has 0xfff. Ideally, what should happen is the task now having the effective
mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
on which way we decide to misbehave.
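
Spelled out in binary, the ideal outcome is just the intersection of
the remembered request with the grown cpuset mask:

  0x5555 = 0101 0101 0101 0101   (the task's original request)
  0x0fff = 0000 1111 1111 1111   (the cpuset after growing)
  ------------------------------ AND
  0x0555 = 0000 0101 0101 0101   (the ideal effective mask)

Both 0xf55 and 0x55 diverge from it because the original request was
overwritten along the way.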
Thanks.
--
tejun
On 7/28/22 16:44, Tejun Heo wrote:
> Hello,
>
> On Thu, Jul 28, 2022 at 03:21:26PM -0400, Waiman Long wrote:
>> On 7/28/22 15:02, Tejun Heo wrote:
>>> On Thu, Jul 28, 2022 at 02:57:28PM -0400, Waiman Long wrote:
>>>> There can be a counter-argument that if a user finds out that there are
>>>> not enough cpus in a cpuset to meet its performance target, one can always
>>>> increase the number of cpus in the cpuset. Generalizing this behavior to
>>>> all tasks, irrespective of whether they have explicitly set cpu affinity
>>>> before, will disallow this use case.
>>> This is nasty.
>> That is a nasty example, I know. There may be users depending on the
>> existing behavior even if they don't know it. So I am a bit hesitant to
>> change the default behavior like that. On the other hand, tasks that have
>> explicitly set their cpu affinity certainly don't want unexpected
>> changes to it.
> Yeah, I hear you. I'm on the same page.
>
>>> The real solution here is separating out what the user requested
>>> and the mask that cpuset (or cpu hotplug) needs to apply on top, i.e.
>>> remember what the user requested in a separate cpumask and compute the
>>> intersection into p->cpus_mask whenever something changes and apply
>>> fallbacks on that final mask. Multiple parties updating the same variable is
>>> never gonna lead to anything consistent and we're patching up for whatever
>>> the immediate use case seems to need at the moment. That said, I'm not
>>> necessarily against patching it up but if you're interested in delving into
>>> it deeper, that'd be great.
>> I believe the current code already restricts what cpu affinity a user
>> can request by limiting it to the cpus allowed by the current cpuset.
>> Hotplug is another issue that may need to be addressed. I will update my
>> patch to make it handle hotplug in a more graceful way.
> So, the patch you proposed is making the code remember one special aspect of
> user requested configuration - whether it configured it or not, and trying
> to preserve that particular state as cpuset state changes. It addresses the
> immediate problem but it is a very partial approach. Let's say a task wanna
> be affined to one logical thread of each core and set its mask to 0x5555.
> Now, let's say cpuset got enabled and enforced 0xff and affined the task to
> 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
> has 0xfff. Ideally, what should happen is the task now having the effective
> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
> on which way we decide to misbehave.
OK, I see what you want to accomplish. To fully address this issue, we
will need to have a new cpumask variable in the task structure which
will be allocated if sched_setaffinity() is ever called. I can rework my
patch to use this approach.
Thanks,
Longman
Hello, Waiman.
On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
> > So, the patch you proposed is making the code remember one special aspect of
> > user requested configuration - whether it configured it or not, and trying
> > to preserve that particular state as cpuset state changes. It addresses the
> > immediate problem but it is a very partial approach. Let's say a task wanna
> > be affined to one logical thread of each core and set its mask to 0x5555.
> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
> > has 0xfff. Ideally, what should happen is the task now having the effective
> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
> > on which way we decide to misbehave.
>
> OK, I see what you want to accomplish. To fully address this issue, we
> will need to have a new cpumask variable in the task structure which
> will be allocated if sched_setaffinity() is ever called. I can rework my
> patch to use this approach.
Yeah, we'd need to track what user requested separately from the currently
effective cpumask. Let's make sure that the scheduler folks are on board
before committing to the idea tho. Peter, Ingo, what do you guys think?
Thanks.
--
tejun
On 28/07/22 11:39, Tejun Heo wrote:
> Hello, Waiman.
>
> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>> > So, the patch you proposed is making the code remember one special aspect of
>> > user requested configuration - whether it configured it or not, and trying
>> > to preserve that particular state as cpuset state changes. It addresses the
>> > immediate problem but it is a very partial approach. Let's say a task wanna
>> > be affined to one logical thread of each core and set its mask to 0x5555.
>> > Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>> > 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>> > has 0xfff. Ideally, what should happen is the task now having the effective
>> > mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>> > on which way we decide to misbehave.
>>
>> OK, I see what you want to accomplish. To fully address this issue, we
>> will need to have a new cpumask variable in the task structure which
>> will be allocated if sched_setaffinity() is ever called. I can rework my
>> patch to use this approach.
>
> Yeah, we'd need to track what user requested separately from the currently
> effective cpumask. Let's make sure that the scheduler folks are on board
> before committing to the idea tho. Peter, Ingo, what do you guys think?
>
FWIW on a runtime overhead side of things I think it'll be OK as that
should be just an extra mask copy in sched_setaffinity() and a subset
check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
less clear (when, if ever, do we clear the user-defined mask? Will it keep
haunting us even after moving a task to a disjoint cpuset partition?).
There's also if/how that new mask should be exposed, because attaching a
task to a cpuset will now yield a not-necessarily-obvious affinity -
e.g. in the thread affinity example above, if the initial affinity setting
was done ages ago by some system tool, IMO the user needs a way to be able
to expect/understand the result of 0x555 rather than 0xfff.
While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
then again that one is for "special" hardware...
> Thanks.
>
> --
> tejun
On 7/29/22 10:15, Valentin Schneider wrote:
> On 28/07/22 11:39, Tejun Heo wrote:
>> Hello, Waiman.
>>
>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>> So, the patch you proposed is making the code remember one special aspect of
>>>> user requested configuration - whether it configured it or not, and trying
>>>> to preserve that particular state as cpuset state changes. It addresses the
>>>> immediate problem but it is a very partial approach. Let's say a task wanna
>>>> be affined to one logical thread of each core and set its mask to 0x5555.
>>>> Now, let's say cpuset got enabled and enforced 0xff and affined the task to
>>>> 0xff. After a while, the cgroup got more cpus allocated and its cpuset now
>>>> has 0xfff. Ideally, what should happen is the task now having the effective
>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55 depending
>>>> on which way we decide to misbehave.
>>> OK, I see what you want to accomplish. To fully address this issue, we
>>> will need to have a new cpumask variable in the task structure which
>>> will be allocated if sched_setaffinity() is ever called. I can rework my
>>> patch to use this approach.
>> Yeah, we'd need to track what user requested separately from the currently
>> effective cpumask. Let's make sure that the scheduler folks are on board
>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>
> FWIW on a runtime overhead side of things I think it'll be OK as that
> should be just an extra mask copy in sched_setaffinity() and a subset
> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a bit
> less clear (when, if ever, do we clear the user-defined mask? Will it keep
> haunting us even after moving a task to a disjoint cpuset partition?).
The runtime overhead should be minimal. It is the behavioral side that
we should be careful about. It is a change in existing behavior and we
don't want to cause surprises for the users. Currently, a task that sets
its cpu affinity explicitly will have its affinity reset whenever there
is any change to the cpuset it belongs to or a hotplug event touches any
cpu in the current cpuset. The new behavior we are proposing here is
that it will try its best to keep the cpu affinity that the user
requested within the constraint of the current cpuset as well as the cpu
hotplug state.
>
> There's also if/how that new mask should be exposed, because attaching a
> task to a cpuset will now yield a not-necessarily-obvious affinity -
> e.g. in the thread affinity example above, if the initial affinity setting
> was done ages ago by some system tool, IMO the user needs a way to be able
> to expect/understand the result of 0x555 rather than 0xfff.
Users can use sched_getaffinity(2) to retrieve the current cpu affinity.
It is up to users to set another one if they don't like the current one.
I don't think we need to return what the previously requested cpu
affinity was. They are supposed to know that, or they can set their own
if they don't like it.
Cheers,
Longman
>
> While I'm saying this, I don't think anything exposes p->user_cpus_ptr, but
> then again that one is for "special" hardware...
>
>> Thanks.
>>
>> --
>> tejun
On 7/29/22 10:50, Waiman Long wrote:
> On 7/29/22 10:15, Valentin Schneider wrote:
>> On 28/07/22 11:39, Tejun Heo wrote:
>>> Hello, Waiman.
>>>
>>> On Thu, Jul 28, 2022 at 05:04:19PM -0400, Waiman Long wrote:
>>>>> So, the patch you proposed is making the code remember one special
>>>>> aspect of
>>>>> user requested configuration - whether it configured it or not,
>>>>> and trying
>>>>> to preserve that particular state as cpuset state changes. It
>>>>> addresses the
>>>>> immediate problem but it is a very partial approach. Let's say a
>>>>> task wanna
>>>>> be affined to one logical thread of each core and set its mask to
>>>>> 0x5555.
>>>>> Now, let's say cpuset got enabled and enforced 0xff and affined
>>>>> the task to
>>>>> 0xff. After a while, the cgroup got more cpus allocated and its
>>>>> cpuset now
>>>>> has 0xfff. Ideally, what should happen is the task now having the
>>>>> effective
>>>>> mask of 0x555. In practice, tho, it either would get 0xf55 or 0x55
>>>>> depending
>>>>> on which way we decide to misbehave.
>>>> OK, I see what you want to accomplish. To fully address this issue,
>>>> we will
>>>> need to have a new cpumask variable in the task structure which
>>>> will be
>>>> allocated if sched_setaffinity() is ever called. I can rework my
>>>> patch to
>>>> use this approach.
>>> Yeah, we'd need to track what user requested separately from the
>>> currently
>>> effective cpumask. Let's make sure that the scheduler folks are on
>>> board
>>> before committing to the idea tho. Peter, Ingo, what do you guys think?
>>>
>> FWIW on a runtime overhead side of things I think it'll be OK as that
>> should be just an extra mask copy in sched_setaffinity() and a subset
>> check / cpumask_and() in set_cpus_allowed_ptr(). The policy side is a
>> bit
>> less clear (when, if ever, do we clear the user-defined mask? Will it
>> keep
>> haunting us even after moving a task to a disjoint cpuset partition?).
>
> The runtime overhead should be minimal. It is the behavioral side that
> we should be careful about. It is a change in existing behavior and we
> don't want to cause surprises for the users. Currently, a task that sets
> its cpu affinity explicitly will have its affinity reset whenever
> there is any change to the cpuset it belongs to or a hotplug event
> touches any cpu in the current cpuset. The new behavior we are proposing
> here is that it will try its best to keep the cpu affinity that the
> user requested within the constraint of the current cpuset as well as
> the cpu hotplug state.
>
>
>>
>> There's also if/how that new mask should be exposed, because attaching a
>> task to a cpuset will now yield a not-necessarily-obvious affinity -
>> e.g. in the thread affinity example above, if the initial affinity
>> setting
>> was done ages ago by some system tool, IMO the user needs a way to be
>> able
>> to expect/understand the result of 0x555 rather than 0xfff.
>
> Users can use sched_getaffinity(2) to retrieve the current cpu
> affinity. It is up to users to set another one if they don't like the
> current one. I don't think we need to return what the previously
> requested cpu affinity was. They are supposed to know that, or they can
> set their own if they don't like it.
Looking at Will's series that introduced user_cpus_ptr, I think we can
overlay our proposal on top of that. So calling sched_setaffinity() will
also update user_cpus_ptr. We may still need a flag to indicate whether
user_cpus_ptr is set up because of sched_setaffinity() or due to a call
to force_compatible_cpus_allowed_ptr() from arm64 arch code. That will
make our work easier as some of the infrastructure is already there. I
am looking forward to your feedback.
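
A rough sketch of how that overlay might fit the toy model used earlier
in the thread; how user_cpus_ptr would actually be managed is for the
eventual patch to decide, and the names here are hypothetical:

#include <stdbool.h>

struct task_model {
	unsigned long *user_mask;	/* models p->user_cpus_ptr */
	unsigned long mask;		/* models p->cpus_mask */
	bool from_setaffinity;		/* the extra flag mentioned above */
};

/* sched_setaffinity() would record the request... */
static void record_affinity(struct task_model *t, unsigned long *storage,
			    unsigned long requested, unsigned long cpuset_mask)
{
	*storage = requested;
	t->user_mask = storage;
	t->from_setaffinity = true;
	t->mask = requested & cpuset_mask;
}

/* ...and cpuset changes would reapply it rather than clobber it. */
static void cpuset_changed(struct task_model *t, unsigned long cpuset_mask)
{
	unsigned long common = 0;

	if (t->user_mask && t->from_setaffinity)
		common = *t->user_mask & cpuset_mask;
	t->mask = common ? common : cpuset_mask;
}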
Thanks,
Longman