The user_cpus_ptr field was originally added by commit b90ca8badbd1
("sched: Introduce task_struct::user_cpus_ptr to track requested
affinity"). It was used only by the arm64 arch due to its possible
asymmetric CPU setup.
Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
cpumask"), task_struct::user_cpus_ptr has been repurposed to store the
user-requested CPU affinity specified via sched_setaffinity().
This results in a performance regression on an arm64 system booted
with "allow_mismatched_32bit_el0" on the command line. The arch code will
(amongst other things) call force_compatible_cpus_allowed_ptr() and
relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
will always result in a __sched_setaffinity() call, whether or not there
was a previous force_compatible_cpus_allowed_ptr() call.
In order to fix this regression, a new scheduler flag,
task_struct::cpus_allowed_restricted, is added to track whether
force_compatible_cpus_allowed_ptr() has been called before. This
patch also updates the comments in force_compatible_cpus_allowed_ptr()
and relax_compatible_cpus_allowed_ptr() and handles their interaction
with sched_setaffinity().
Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
Reported-by: Will Deacon <[email protected]>
Signed-off-by: Waiman Long <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++---------
kernel/sched/sched.h | 2 ++
3 files changed, 42 insertions(+), 9 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 853d08f7562b..f93f62a1f858 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -886,6 +886,9 @@ struct task_struct {
unsigned sched_contributes_to_load:1;
unsigned sched_migrated:1;
+ /* restrict_cpus_allowed_ptr() bit, serialized by scheduler locks */
+ unsigned cpus_allowed_restricted:1;
+
/* Force alignment to the next boundary: */
unsigned :0;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index bb1ee6d7bdde..48234dc9005b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2999,15 +2999,40 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
struct rq *rq;
rq = task_rq_lock(p, &rf);
+
+ if (ctx->flags & (SCA_CLR_RESTRICT | SCA_SET_RESTRICT)) {
+ p->cpus_allowed_restricted = 0;
+ } else if (p->cpus_allowed_restricted) {
+ /*
+ * If force_compatible_cpus_allowed_ptr() has been called,
+ * we can't extend cpumask to beyond what is in cpus_mask.
+ */
+ if (!cpumask_and(rq->scratch_mask, ctx->new_mask,
+ &p->cpus_mask)) {
+ task_rq_unlock(rq, p, &rf);
+ return -EINVAL;
+ }
+
+ /*
+ * Note that we don't need to do further user_cpus_ptr
+ * masking below as cpus_mask should be a subset of
+ * user_cpus_ptr if set.
+ */
+ ctx->new_mask = rq->scratch_mask;
+ }
+
/*
* Masking should be skipped if SCA_USER or any of the SCA_MIGRATE_*
- * flags are set.
+ * flags are set or when cpus_allowed_restricted flag has been set.
*/
- if (p->user_cpus_ptr &&
+ if (p->user_cpus_ptr && !p->cpus_allowed_restricted &&
!(ctx->flags & (SCA_USER | SCA_MIGRATE_ENABLE | SCA_MIGRATE_DISABLE)) &&
cpumask_and(rq->scratch_mask, ctx->new_mask, p->user_cpus_ptr))
ctx->new_mask = rq->scratch_mask;
+ if (ctx->flags & SCA_SET_RESTRICT)
+ p->cpus_allowed_restricted = 1;
+
return __set_cpus_allowed_ptr_locked(p, ctx, rq, &rf);
}
@@ -3025,8 +3050,8 @@ EXPORT_SYMBOL_GPL(set_cpus_allowed_ptr);
/*
* Change a given task's CPU affinity to the intersection of its current
* affinity mask and @subset_mask, writing the resulting mask to @new_mask.
- * If user_cpus_ptr is defined, use it as the basis for restricting CPU
- * affinity or use cpu_online_mask instead.
+ * The cpus_allowed_restricted bit is set to indicate to a later
+ * relax_compatible_cpus_allowed_ptr() call to relax the cpumask.
*
* If the resulting mask is empty, leave the affinity unchanged and return
* -EINVAL.
@@ -3037,7 +3062,7 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
{
struct affinity_context ac = {
.new_mask = new_mask,
- .flags = 0,
+ .flags = SCA_SET_RESTRICT,
};
struct rq_flags rf;
struct rq *rq;
@@ -3069,9 +3094,8 @@ static int restrict_cpus_allowed_ptr(struct task_struct *p,
/*
* Restrict the CPU affinity of task @p so that it is a subset of
- * task_cpu_possible_mask() and point @p->user_cpus_ptr to a copy of the
- * old affinity mask. If the resulting mask is empty, we warn and walk
- * up the cpuset hierarchy until we find a suitable mask.
+ * task_cpu_possible_mask(). If the resulting mask is empty, we warn
+ * and walk up the cpuset hierarchy until we find a suitable mask.
*/
void force_compatible_cpus_allowed_ptr(struct task_struct *p)
{
@@ -3126,10 +3150,14 @@ void relax_compatible_cpus_allowed_ptr(struct task_struct *p)
{
struct affinity_context ac = {
.new_mask = task_user_cpus(p),
- .flags = 0,
+ .flags = SCA_CLR_RESTRICT,
};
int ret;
+ /* Return if no previous force_compatible_cpus_allowed_ptr() call */
+ if (!data_race(p->cpus_allowed_restricted))
+ return;
+
/*
* Try to restore the old affinity mask with __sched_setaffinity().
* Cpuset masking will be done there too.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 771f8ddb7053..adcef29d5479 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2293,6 +2293,8 @@ extern struct task_struct *pick_next_task_idle(struct rq *rq);
#define SCA_MIGRATE_DISABLE 0x02
#define SCA_MIGRATE_ENABLE 0x04
#define SCA_USER 0x08
+#define SCA_CLR_RESTRICT 0x10
+#define SCA_SET_RESTRICT 0x20
#ifdef CONFIG_SMP
--
2.31.1
Hi Waiman,
[+Thorsten given where we are in the release cycle]
On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
> The user_cpus_ptr field was originally added by commit b90ca8badbd1
> ("sched: Introduce task_struct::user_cpus_ptr to track requested
> affinity"). It was used only by arm64 arch due to possible asymmetric
> CPU setup.
>
> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
> requested cpu affinity specified in the sched_setaffinity().
>
> This results in a performance regression in an arm64 system when booted
> with "allow_mismatched_32bit_el0" on the command-line. The arch code will
> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
> will always result in a __sched_setaffinity() call whether there is a
> previous force_compatible_cpus_allowed_ptr() call or not.
I'd argue it's more than just a performance regression -- the affinity
masks are set incorrectly, which is a user visible thing
(i.e. sched_getaffinity() gives unexpected values).
> In order to fix this regression, a new scheduler flag
> task_struct::cpus_allowed_restricted is now added to track if
> force_compatible_cpus_allowed_ptr() has been called before or not. This
> patch also updates the comments in force_compatible_cpus_allowed_ptr()
> and relax_compatible_cpus_allowed_ptr() and handles their interaction
> with sched_setaffinity().
>
> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
> Reported-by: Will Deacon <[email protected]>
> Signed-off-by: Waiman Long <[email protected]>
> ---
> include/linux/sched.h | 3 +++
> kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++---------
> kernel/sched/sched.h | 2 ++
> 3 files changed, 42 insertions(+), 9 deletions(-)
I find this pretty invasive, but I guess it's up to Peter and Ingo.
It also doesn't solve the whole problem for me; see below.
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 853d08f7562b..f93f62a1f858 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -886,6 +886,9 @@ struct task_struct {
> unsigned sched_contributes_to_load:1;
> unsigned sched_migrated:1;
>
> + /* restrict_cpus_allowed_ptr() bit, serialized by scheduler locks */
> + unsigned cpus_allowed_restricted:1;
> +
> /* Force alignment to the next boundary: */
> unsigned :0;
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index bb1ee6d7bdde..48234dc9005b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2999,15 +2999,40 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> struct rq *rq;
>
> rq = task_rq_lock(p, &rf);
> +
> + if (ctx->flags & (SCA_CLR_RESTRICT | SCA_SET_RESTRICT)) {
> + p->cpus_allowed_restricted = 0;
I don't think this is ever called on the SCA_SET_RESTRICT path, as
restrict_cpus_allowed_ptr() calls __set_cpus_allowed_ptr_locked() directly.
In my testing, we see a failure in the following sequence:
1. A 64-bit task has an affinity of 0xf
2. It exec()s a 32-bit program and is forcefully restricted to the set
of 32-bit-capable cores. Let's say that new mask is 0xc
3. The 32-bit task now exec()s a 64-bit program again
And now we're still stuck with 0xc after step 3 whereas we should restore
0xf.
> + } else if (p->cpus_allowed_restricted) {
> + /*
> + * If force_compatible_cpus_allowed_ptr() has been called,
> + * we can't extend cpumask to beyond what is in cpus_mask.
> + */
> + if (!cpumask_and(rq->scratch_mask, ctx->new_mask,
> + &p->cpus_mask)) {
> + task_rq_unlock(rq, p, &rf);
> + return -EINVAL;
> + }
Why is this masking actually needed? __sched_setaffinity() already
takes into account the task_cpu_possible_mask(), which is why I asked you
before [1] about cases where the saved affinity is not simply a superset
of the effective affinity.
Thanks,
Will
[1] https://lore.kernel.org/r/20230120175931.GA22417@willie-the-truck
On 1/24/23 14:48, Will Deacon wrote:
> Hi Waiman,
>
> [+Thorsten given where we are in the release cycle]
>
> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>> affinity"). It was used only by arm64 arch due to possible asymmetric
>> CPU setup.
>>
>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>> requested cpu affinity specified in the sched_setaffinity().
>>
>> This results in a performance regression in an arm64 system when booted
>> with "allow_mismatched_32bit_el0" on the command-line. The arch code will
>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>> will always result in a __sched_setaffinity() call whether there is a
>> previous force_compatible_cpus_allowed_ptr() call or not.
> I'd argue it's more than just a performance regression -- the affinity
> masks are set incorrectly, which is a user visible thing
> (i.e. sched_getaffinity() gives unexpected values).
>
>> In order to fix this regression, a new scheduler flag
>> task_struct::cpus_allowed_restricted is now added to track if
>> force_compatible_cpus_allowed_ptr() has been called before or not. This
>> patch also updates the comments in force_compatible_cpus_allowed_ptr()
>> and relax_compatible_cpus_allowed_ptr() and handles their interaction
>> with sched_setaffinity().
>>
>> Fixes: 8f9ea86fdf99 ("sched: Always preserve the user requested cpumask")
>> Reported-by: Will Deacon <[email protected]>
>> Signed-off-by: Waiman Long <[email protected]>
>> ---
>> include/linux/sched.h | 3 +++
>> kernel/sched/core.c | 46 ++++++++++++++++++++++++++++++++++---------
>> kernel/sched/sched.h | 2 ++
>> 3 files changed, 42 insertions(+), 9 deletions(-)
> I find this pretty invasive, but I guess it's up to Peter and Ingo.
> It also doesn't the whole problem for me; see below.
>
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 853d08f7562b..f93f62a1f858 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -886,6 +886,9 @@ struct task_struct {
>> unsigned sched_contributes_to_load:1;
>> unsigned sched_migrated:1;
>>
>> + /* restrict_cpus_allowed_ptr() bit, serialized by scheduler locks */
>> + unsigned cpus_allowed_restricted:1;
>> +
>> /* Force alignment to the next boundary: */
>> unsigned :0;
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index bb1ee6d7bdde..48234dc9005b 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -2999,15 +2999,40 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
>> struct rq *rq;
>>
>> rq = task_rq_lock(p, &rf);
>> +
>> + if (ctx->flags & (SCA_CLR_RESTRICT | SCA_SET_RESTRICT)) {
>> + p->cpus_allowed_restricted = 0;
> I don't think this is ever called on the SCA_SET_RESTRICT path, as
> restrict_cpus_allowed_ptr() calls __set_cpus_allowed_ptr_locked() directly.
> In my testing, we see a failure in the following sequence:
>
> 1. A 64-bit task has an affinity of 0xf
> 2. It exec()s a 32-bit program and is forcefully restricted to the set
> of 32-bit-capable cores. Let's say that new mask is 0xc
> 3. The 32-bit task now exec()s a 64-bit program again
>
> And now we're still stuck with 0xc after step 3 whereas we should restore
> 0xf.
I am sorry that I missed it. You are right. Setting the
cpus_allowed_restricted bit should be done directly in
restrict_cpus_allowed_ptr().
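Roughly, what I have in mind for v3 is something like the sketch below
(illustrative only, not the final patch; the signature and locking mirror
the existing restrict_cpus_allowed_ptr(), and the deadline-task check is
omitted for brevity):

static int restrict_cpus_allowed_ptr(struct task_struct *p,
				     struct cpumask *new_mask,
				     const struct cpumask *subset_mask)
{
	struct affinity_context ac = {
		.new_mask  = new_mask,
		.flags     = 0,
	};
	struct rq_flags rf;
	struct rq *rq;

	rq = task_rq_lock(p, &rf);

	if (!cpumask_and(new_mask, &p->cpus_mask, subset_mask)) {
		task_rq_unlock(rq, p, &rf);
		return -EINVAL;
	}

	/* Remember that the affinity has been forcefully restricted. */
	p->cpus_allowed_restricted = 1;

	return __set_cpus_allowed_ptr_locked(p, &ac, rq, &rf);
}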
>> + } else if (p->cpus_allowed_restricted) {
>> + /*
>> + * If force_compatible_cpus_allowed_ptr() has been called,
>> + * we can't extend cpumask to beyond what is in cpus_mask.
>> + */
>> + if (!cpumask_and(rq->scratch_mask, ctx->new_mask,
>> + &p->cpus_mask)) {
>> + task_rq_unlock(rq, p, &rf);
>> + return -EINVAL;
>> + }
> Why is this masking actually needed? __sched_setaffinity() already
> takes into account the task_cpu_possible_mask(), which is why I asked you
> before [1] about cases where the saved affinity is not simply a superset
> of the effective affinity.
I kind of overlooked the use of task_cpu_possible_mask() in
__set_cpus_allowed_ptr_locked(). So we don't really need that masking.
That makes the patch even simpler, then. I will send out a v3.
Cheers,
Longman
On 1/24/23 14:48, Will Deacon wrote:
> Hi Waiman,
>
> [+Thorsten given where we are in the release cycle]
>
> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>> affinity"). It was used only by arm64 arch due to possible asymmetric
>> CPU setup.
>>
>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>> requested cpu affinity specified in the sched_setaffinity().
>>
>> This results in a performance regression in an arm64 system when booted
>> with "allow_mismatched_32bit_el0" on the command-line. The arch code will
>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>> will always result in a __sched_setaffinity() call whether there is a
>> previous force_compatible_cpus_allowed_ptr() call or not.
> I'd argue it's more than just a performance regression -- the affinity
> masks are set incorrectly, which is a user visible thing
> (i.e. sched_getaffinity() gives unexpected values).
Can you elaborate a bit more on what you mean by getting unexpected
sched_getaffinity() results? You mean the result is wrong after a
relax_compatible_cpus_allowed_ptr(), right?
sched_getaffinity() just returns whatever is in cpus_mask. Normally, that
should be whatever cpus are allowed by the current cpuset, unless
sched_setaffinity() has been called before. So after a call to
relax_compatible_cpus_allowed_ptr(), it should revert back to the
cpus allowed by the cpuset. If sched_setaffinity() has been called,
it should revert back to the intersection of the current cpuset and
user_cpus_ptr.
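For reference, here is a minimal userspace checker (illustrative only,
not part of the patch) that reports the mask the same way "taskset -p"
does; it is just a thin wrapper around sched_getaffinity(2):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(int argc, char **argv)
{
	pid_t pid = argc > 1 ? (pid_t)atoi(argv[1]) : 0;	/* 0 == self */
	cpu_set_t set;
	unsigned long mask = 0;
	int cpu;

	if (sched_getaffinity(pid, sizeof(set), &set)) {
		perror("sched_getaffinity");
		return 1;
	}

	/* Fold the first 64 CPUs into a hex mask, like taskset -p prints. */
	for (cpu = 0; cpu < 64; cpu++)
		if (CPU_ISSET(cpu, &set))
			mask |= 1UL << cpu;

	printf("pid %d's current affinity mask: %lx\n", pid, mask);
	return 0;
}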
Cheers,
Longman
On Tue, Jan 24, 2023 at 03:08:09PM -0500, Waiman Long wrote:
> On 1/24/23 14:48, Will Deacon wrote:
> > On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
> > > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > > index 853d08f7562b..f93f62a1f858 100644
> > > --- a/include/linux/sched.h
> > > +++ b/include/linux/sched.h
> > > @@ -886,6 +886,9 @@ struct task_struct {
> > > unsigned sched_contributes_to_load:1;
> > > unsigned sched_migrated:1;
> > > + /* restrict_cpus_allowed_ptr() bit, serialized by scheduler locks */
> > > + unsigned cpus_allowed_restricted:1;
> > > +
> > > /* Force alignment to the next boundary: */
> > > unsigned :0;
> > > diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> > > index bb1ee6d7bdde..48234dc9005b 100644
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -2999,15 +2999,40 @@ static int __set_cpus_allowed_ptr(struct task_struct *p,
> > > struct rq *rq;
> > > rq = task_rq_lock(p, &rf);
> > > +
> > > + if (ctx->flags & (SCA_CLR_RESTRICT | SCA_SET_RESTRICT)) {
> > > + p->cpus_allowed_restricted = 0;
> > I don't think this is ever called on the SCA_SET_RESTRICT path, as
> > restrict_cpus_allowed_ptr() calls __set_cpus_allowed_ptr_locked() directly.
> > In my testing, we see a failure in the following sequence:
> >
> > 1. A 64-bit task has an affinity of 0xf
> > 2. It exec()s a 32-bit program and is forcefully restricted to the set
> > of 32-bit-capable cores. Let's say that new mask is 0xc
> > 3. The 32-bit task now exec()s a 64-bit program again
> >
> > And now we're still stuck with 0xc after step 3 whereas we should restore
> > 0xf.
> I am sorry that missed it. You are right. For setting the
> cpus_allowed_restricted bit, it should be done directly in
> restrict_cpus_allowed_ptr().
> > > + } else if (p->cpus_allowed_restricted) {
> > > + /*
> > > + * If force_compatible_cpus_allowed_ptr() has been called,
> > > + * we can't extend cpumask to beyond what is in cpus_mask.
> > > + */
> > > + if (!cpumask_and(rq->scratch_mask, ctx->new_mask,
> > > + &p->cpus_mask)) {
> > > + task_rq_unlock(rq, p, &rf);
> > > + return -EINVAL;
> > > + }
> > Why is this masking actually needed? __sched_setaffinity() already
> > takes into account the task_cpu_possible_mask(), which is why I asked you
> > before [1] about cases where the saved affinity is not simply a superset
> > of the effective affinity.
>
> I kind of overlook the use of task_cpu_possible_mask() in
> __set_cpus_allowed_ptr_locked. So we don't really need that masking. That
> make the patch even simpler then. I will send out a v3.
Thanks; I'll give it a spin when I see it.
Will
On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
> On 1/24/23 14:48, Will Deacon wrote:
> > On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
> > > The user_cpus_ptr field was originally added by commit b90ca8badbd1
> > > ("sched: Introduce task_struct::user_cpus_ptr to track requested
> > > affinity"). It was used only by arm64 arch due to possible asymmetric
> > > CPU setup.
> > >
> > > Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
> > > cpumask"), task_struct::user_cpus_ptr is repurposed to store user
> > > requested cpu affinity specified in the sched_setaffinity().
> > >
> > > This results in a performance regression in an arm64 system when booted
> > > with "allow_mismatched_32bit_el0" on the command-line. The arch code will
> > > (amongst other things) calls force_compatible_cpus_allowed_ptr() and
> > > relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
> > > task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
> > > will always result in a __sched_setaffinity() call whether there is a
> > > previous force_compatible_cpus_allowed_ptr() call or not.
> > I'd argue it's more than just a performance regression -- the affinity
> > masks are set incorrectly, which is a user visible thing
> > (i.e. sched_getaffinity() gives unexpected values).
>
> Can your elaborate a bit more on what you mean by getting unexpected
> sched_getaffinity() results? You mean the result is wrong after a
> relax_compatible_cpus_allowed_ptr(). Right?
Yes, as in the original report. If, on a 4-CPU system, I do the following
with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
# for c in `seq 1 3`; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
# yes > /dev/null &
[1] 334
# taskset -p 334
pid 334's current affinity mask: 1
# for c in `seq 1 3`; do echo 1 > /sys/devices/system/cpu/cpu$c/online; done
# taskset -p 334
pid 334's current affinity mask: f
but with v6.2-rc5 that last taskset invocation gives:
pid 334's current affinity mask: 1
so, yes, the performance definitely regresses, but that's because the
affinity mask is wrong!
Will
On 1/26/23 11:11, Will Deacon wrote:
> On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
>> On 1/24/23 14:48, Will Deacon wrote:
>>> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>>>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>>>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>>>> affinity"). It was used only by arm64 arch due to possible asymmetric
>>>> CPU setup.
>>>>
>>>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
>>>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>>>> requested cpu affinity specified in the sched_setaffinity().
>>>>
>>>> This results in a performance regression in an arm64 system when booted
>>>> with "allow_mismatched_32bit_el0" on the command-line. The arch code will
>>>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>>>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
>>>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>>>> will always result in a __sched_setaffinity() call whether there is a
>>>> previous force_compatible_cpus_allowed_ptr() call or not.
>>> I'd argue it's more than just a performance regression -- the affinity
>>> masks are set incorrectly, which is a user visible thing
>>> (i.e. sched_getaffinity() gives unexpected values).
>> Can your elaborate a bit more on what you mean by getting unexpected
>> sched_getaffinity() results? You mean the result is wrong after a
>> relax_compatible_cpus_allowed_ptr(). Right?
> Yes, as in the original report. If, on a 4-CPU system, I do the following
> with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
>
> # for c in `seq 1 3`; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
> # yes > /dev/null &
> [1] 334
> # taskset -p 334
> pid 334's current affinity mask: 1
> # for c in `seq 1 3`; do echo 1 > /sys/devices/system/cpu/cpu$c/online; done
> # taskset -p 334
> pid 334's current affinity mask: f
>
> but with v6.2-rc5 that last taskset invocation gives:
>
> pid 334's current affinity mask: 1
>
> so, yes, the performance definitely regresses, but that's because the
> affinity mask is wrong!
I see what you mean now. Hotplug doesn't work quite well now because
user_cpus_ptr has been repurposed to store the value set by
sched_setaffinity(), not the previous cpus_mask from before
force_compatible_cpus_allowed_ptr().
One possible solution is to modify the hotplug-related code to check for
the cpus_allowed_restricted bit and, if set, check task_cpu_possible_mask()
to see if the cpu can be added back to its cpus_mask. I will take a
further look at that later.
Cheers,
Longman
On 1/26/23 15:49, Waiman Long wrote:
> On 1/26/23 11:11, Will Deacon wrote:
>> On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
>>> On 1/24/23 14:48, Will Deacon wrote:
>>>> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>>>>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>>>>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>>>>> affinity"). It was used only by arm64 arch due to possible asymmetric
>>>>> CPU setup.
>>>>>
>>>>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
>>>>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>>>>> requested cpu affinity specified in the sched_setaffinity().
>>>>>
>>>>> This results in a performance regression in an arm64 system when
>>>>> booted
>>>>> with "allow_mismatched_32bit_el0" on the command-line. The arch
>>>>> code will
>>>>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>>>>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a
>>>>> 64-bit
>>>>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>>>>> will always result in a __sched_setaffinity() call whether there is a
>>>>> previous force_compatible_cpus_allowed_ptr() call or not.
>>>> I'd argue it's more than just a performance regression -- the affinity
>>>> masks are set incorrectly, which is a user visible thing
>>>> (i.e. sched_getaffinity() gives unexpected values).
>>> Can your elaborate a bit more on what you mean by getting unexpected
>>> sched_getaffinity() results? You mean the result is wrong after a
>>> relax_compatible_cpus_allowed_ptr(). Right?
>> Yes, as in the original report. If, on a 4-CPU system, I do the
>> following
>> with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
>>
>> # for c in `seq 1 3`; do echo 0 >
>> /sys/devices/system/cpu/cpu$c/online; done
>> # yes > /dev/null &
>> [1] 334
>> # taskset -p 334
>> pid 334's current affinity mask: 1
>> # for c in `seq 1 3`; do echo 1 >
>> /sys/devices/system/cpu/cpu$c/online; done
>> # taskset -p 334
>> pid 334's current affinity mask: f
>>
>> but with v6.2-rc5 that last taskset invocation gives:
>>
>> pid 334's current affinity mask: 1
>>
>> so, yes, the performance definitely regresses, but that's because the
>> affinity mask is wrong!
>
> I see what you mean now. Hotplug doesn't work quite well now because
> user_cpus_ptr has been repurposed to store the value set of
> sched_setaffinity() but not the previous cpus_mask before
> force_compatible_cpus_allowed_ptr().
>
> One possible solution is to modify the hotplug related code to check
> for the cpus_allowed_restricted, and if set, check
> task_cpu_possible_mask() to see if the cpu can be added back to its
> cpus_mask. I will take a further look at that later.
Wait, I think the cpuset hotplug code should be able to restore the
right cpumask since task_cpu_possible_mask() is used there. Is cpuset
enabled? Does the test work without allow_mismatched_32bit_el0?
I think there may be a bug somewhere.
Cheers,
Longman
On 1/26/23 15:58, Waiman Long wrote:
> On 1/26/23 15:49, Waiman Long wrote:
>> On 1/26/23 11:11, Will Deacon wrote:
>>> On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
>>>> On 1/24/23 14:48, Will Deacon wrote:
>>>>> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>>>>>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>>>>>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>>>>>> affinity"). It was used only by arm64 arch due to possible
>>>>>> asymmetric
>>>>>> CPU setup.
>>>>>>
>>>>>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user
>>>>>> requested
>>>>>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>>>>>> requested cpu affinity specified in the sched_setaffinity().
>>>>>>
>>>>>> This results in a performance regression in an arm64 system when
>>>>>> booted
>>>>>> with "allow_mismatched_32bit_el0" on the command-line. The arch
>>>>>> code will
>>>>>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>>>>>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a
>>>>>> 64-bit
>>>>>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>>>>>> will always result in a __sched_setaffinity() call whether there
>>>>>> is a
>>>>>> previous force_compatible_cpus_allowed_ptr() call or not.
>>>>> I'd argue it's more than just a performance regression -- the
>>>>> affinity
>>>>> masks are set incorrectly, which is a user visible thing
>>>>> (i.e. sched_getaffinity() gives unexpected values).
>>>> Can your elaborate a bit more on what you mean by getting unexpected
>>>> sched_getaffinity() results? You mean the result is wrong after a
>>>> relax_compatible_cpus_allowed_ptr(). Right?
>>> Yes, as in the original report. If, on a 4-CPU system, I do the
>>> following
>>> with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
>>>
>>> # for c in `seq 1 3`; do echo 0 >
>>> /sys/devices/system/cpu/cpu$c/online; done
>>> # yes > /dev/null &
>>> [1] 334
>>> # taskset -p 334
>>> pid 334's current affinity mask: 1
>>> # for c in `seq 1 3`; do echo 1 >
>>> /sys/devices/system/cpu/cpu$c/online; done
>>> # taskset -p 334
>>> pid 334's current affinity mask: f
>>>
>>> but with v6.2-rc5 that last taskset invocation gives:
>>>
>>> pid 334's current affinity mask: 1
>>>
>>> so, yes, the performance definitely regresses, but that's because the
>>> affinity mask is wrong!
>>
>> I see what you mean now. Hotplug doesn't work quite well now because
>> user_cpus_ptr has been repurposed to store the value set of
>> sched_setaffinity() but not the previous cpus_mask before
>> force_compatible_cpus_allowed_ptr().
>>
>> One possible solution is to modify the hotplug related code to check
>> for the cpus_allowed_restricted, and if set, check
>> task_cpu_possible_mask() to see if the cpu can be added back to its
>> cpus_mask. I will take a further look at that later.
>
> Wait, I think the cpuset hotplug code should be able to restore the
> right cpumask since task_cpu_possible_mask() is used there. Is cpuset
> enabled? Does the test works without allow_mismatched_32bit_el0?
BTW, if the test result is from running on a kernel built with the v2
patch, that is an unexpected result. It should be fixed in the v3 patch.
Cheers,
Longman
On Thu, Jan 26, 2023 at 08:56:51PM -0500, Waiman Long wrote:
> On 1/26/23 15:58, Waiman Long wrote:
> > On 1/26/23 15:49, Waiman Long wrote:
> > > On 1/26/23 11:11, Will Deacon wrote:
> > > > On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
> > > > > On 1/24/23 14:48, Will Deacon wrote:
> > > > > > On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
> > > > > > > The user_cpus_ptr field was originally added by commit b90ca8badbd1
> > > > > > > ("sched: Introduce task_struct::user_cpus_ptr to track requested
> > > > > > > affinity"). It was used only by arm64 arch due to
> > > > > > > possible asymmetric
> > > > > > > CPU setup.
> > > > > > >
> > > > > > > Since commit 8f9ea86fdf99 ("sched: Always preserve
> > > > > > > the user requested
> > > > > > > cpumask"), task_struct::user_cpus_ptr is repurposed to store user
> > > > > > > requested cpu affinity specified in the sched_setaffinity().
> > > > > > >
> > > > > > > This results in a performance regression in an arm64
> > > > > > > system when booted
> > > > > > > with "allow_mismatched_32bit_el0" on the
> > > > > > > command-line. The arch code will
> > > > > > > (amongst other things) calls force_compatible_cpus_allowed_ptr() and
> > > > > > > relax_compatible_cpus_allowed_ptr() when exec()'ing
> > > > > > > a 32-bit or a 64-bit
> > > > > > > task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
> > > > > > > will always result in a __sched_setaffinity() call
> > > > > > > whether there is a
> > > > > > > previous force_compatible_cpus_allowed_ptr() call or not.
> > > > > > I'd argue it's more than just a performance regression
> > > > > > -- the affinity
> > > > > > masks are set incorrectly, which is a user visible thing
> > > > > > (i.e. sched_getaffinity() gives unexpected values).
> > > > > Can your elaborate a bit more on what you mean by getting unexpected
> > > > > sched_getaffinity() results? You mean the result is wrong after a
> > > > > relax_compatible_cpus_allowed_ptr(). Right?
> > > > Yes, as in the original report. If, on a 4-CPU system, I do the
> > > > following
> > > > with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
> > > >
> > > > # for c in `seq 1 3`; do echo 0 >
> > > > /sys/devices/system/cpu/cpu$c/online; done
> > > > # yes > /dev/null &
> > > > [1] 334
> > > > # taskset -p 334
> > > > pid 334's current affinity mask: 1
> > > > # for c in `seq 1 3`; do echo 1 >
> > > > /sys/devices/system/cpu/cpu$c/online; done
> > > > # taskset -p 334
> > > > pid 334's current affinity mask: f
> > > >
> > > > but with v6.2-rc5 that last taskset invocation gives:
> > > >
> > > > pid 334's current affinity mask: 1
> > > >
> > > > so, yes, the performance definitely regresses, but that's because the
> > > > affinity mask is wrong!
> > >
> > > I see what you mean now. Hotplug doesn't work quite well now because
> > > user_cpus_ptr has been repurposed to store the value set of
> > > sched_setaffinity() but not the previous cpus_mask before
> > > force_compatible_cpus_allowed_ptr().
> > >
> > > One possible solution is to modify the hotplug related code to check
> > > for the cpus_allowed_restricted, and if set, check
> > > task_cpu_possible_mask() to see if the cpu can be added back to its
> > > cpus_mask. I will take a further look at that later.
> >
> > Wait, I think the cpuset hotplug code should be able to restore the
> > right cpumask since task_cpu_possible_mask() is used there. Is cpuset
> > enabled? Does the test works without allow_mismatched_32bit_el0?
>
> BTW, if the test result is from running on a kernel built with the v2 patch,
> it is the unexpected result. That should be fixed in the v3 patch.
The failure listed above is on vanilla v6.2-rc5. Your v2 has other issues,
as described in:
https://lore.kernel.org/r/20230124194805.GA27257@willie-the-truck
Will
On 1/26/23 11:11, Will Deacon wrote:
> On Tue, Jan 24, 2023 at 03:24:36PM -0500, Waiman Long wrote:
>> On 1/24/23 14:48, Will Deacon wrote:
>>> On Fri, Jan 20, 2023 at 09:17:49PM -0500, Waiman Long wrote:
>>>> The user_cpus_ptr field was originally added by commit b90ca8badbd1
>>>> ("sched: Introduce task_struct::user_cpus_ptr to track requested
>>>> affinity"). It was used only by arm64 arch due to possible asymmetric
>>>> CPU setup.
>>>>
>>>> Since commit 8f9ea86fdf99 ("sched: Always preserve the user requested
>>>> cpumask"), task_struct::user_cpus_ptr is repurposed to store user
>>>> requested cpu affinity specified in the sched_setaffinity().
>>>>
>>>> This results in a performance regression in an arm64 system when booted
>>>> with "allow_mismatched_32bit_el0" on the command-line. The arch code will
>>>> (amongst other things) calls force_compatible_cpus_allowed_ptr() and
>>>> relax_compatible_cpus_allowed_ptr() when exec()'ing a 32-bit or a 64-bit
>>>> task respectively. Now a call to relax_compatible_cpus_allowed_ptr()
>>>> will always result in a __sched_setaffinity() call whether there is a
>>>> previous force_compatible_cpus_allowed_ptr() call or not.
>>> I'd argue it's more than just a performance regression -- the affinity
>>> masks are set incorrectly, which is a user visible thing
>>> (i.e. sched_getaffinity() gives unexpected values).
>> Can your elaborate a bit more on what you mean by getting unexpected
>> sched_getaffinity() results? You mean the result is wrong after a
>> relax_compatible_cpus_allowed_ptr(). Right?
> Yes, as in the original report. If, on a 4-CPU system, I do the following
> with v6.1 and "allow_mismatched_32bit_el0" on the kernel cmdline:
>
> # for c in `seq 1 3`; do echo 0 > /sys/devices/system/cpu/cpu$c/online; done
> # yes > /dev/null &
> [1] 334
> # taskset -p 334
> pid 334's current affinity mask: 1
> # for c in `seq 1 3`; do echo 1 > /sys/devices/system/cpu/cpu$c/online; done
> # taskset -p 334
> pid 334's current affinity mask: f
>
> but with v6.2-rc5 that last taskset invocation gives:
>
> pid 334's current affinity mask: 1
>
> so, yes, the performance definitely regresses, but that's because the
> affinity mask is wrong!
Are you using cgroup v1 or v2? Is your process in the root
cgroup/cpuset or in a child cgroup/cpuset under root?
If you are using cgroup v1 in a child cpuset, cpuset.cpus works more
like cpuset.cpus_effective. IOW, with an offline and then online hotplug
event, the cpu will be permanently lost from the cpuset. So the above is
an expected result. If you are using cgroup v2, the cpuset should be able
to recover the lost cpu after the online event. If your process is in the
root cpuset, the cpu will not be lost either. Alternatively, if you mount
the v1 cpuset with the "cpuset_v2_mode" flag, it will behave more like
v2 and recover the lost cpu.
I ran a similar cpu offline/online test with cgroup v1 and v2 and
confirm that v1 has the above behavior and v2 is fine.
I believe what you have observed above may not be related to my sched
patch as I can't see how it will affect cpu hotplug at all.
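In case it helps, one quick way to check which hierarchy is mounted
(illustrative only; it assumes the usual /sys/fs/cgroup mount point) is:

#include <stdio.h>
#include <sys/vfs.h>
#include <linux/magic.h>

int main(void)
{
	struct statfs fs;

	if (statfs("/sys/fs/cgroup", &fs)) {
		perror("statfs");
		return 1;
	}

	/* cgroup v2 mounts cgroup2fs at the root; v1 typically uses a tmpfs. */
	if ((unsigned long)fs.f_type == CGROUP2_SUPER_MAGIC)
		printf("cgroup v2 (unified hierarchy)\n");
	else
		printf("cgroup v1 (or a hybrid setup)\n");

	return 0;
}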
Cheers,
Longman