2023-01-11 11:40:29

by Florian Weimer

Subject: rseq CPU ID not correct on 6.0 kernels for pinned threads

The glibc test suite contains a test that verifies that sched_getcpu
returns the expected CPU number for a thread that is pinned (via
sched_setaffinity) to a specific CPU. There are other threads running
which attempt to de-schedule the pinned thread from its CPU. I believe
the test is correctly doing what it is expected to do; it is invalid
only if one believes that it is okay for the kernel to disregard the
affinity mask for scheduling decisions.

These days, we use the cpu_id rseq field as the return value of
sched_getcpu if the kernel has rseq support (which it has in these
cases).
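
As a rough illustration, the core of the check looks something like
this (a minimal sketch, not the actual glibc test; pinning to CPU 1 is
an arbitrary choice and assumes that CPU exists and is allowed):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int
main (void)
{
  /* Pin the current thread to CPU 1.  */
  cpu_set_t set;
  CPU_ZERO (&set);
  CPU_SET (1, &set);
  if (sched_setaffinity (0, sizeof (set), &set) != 0)
    {
      perror ("sched_setaffinity");
      return 1;
    }

  /* With rseq support, sched_getcpu returns the cpu_id field of the
     thread's registered rseq area; the test expects it to match the
     pinned CPU even while other threads generate load.  */
  printf ("pinned to CPU 1, sched_getcpu reports %d\n", sched_getcpu ());
  return 0;
}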

This test has started failing sporadically for us, some time around
kernel 6.0. I see failure occasionally on a Fedora builder, it runs:

Linux buildvm-x86-26.iad2.fedoraproject.org 6.0.15-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 21 18:33:23 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

I think I've seen it on the x86-64 builder only, but that might just be
an accident.

The failing tests log this output:

=====FAIL: nptl/tst-thread-affinity-pthread.out=====
info: Detected CPU set size (in bits): 64
info: Maximum test CPU: 5
error: Pinned thread 1 ran on impossible cpu 0
error: Pinned thread 0 ran on impossible cpu 0
info: Main thread ran on 4 CPU(s) of 6 available CPU(s)
info: Other threads ran on 6 CPU(s)
=====FAIL: nptl/tst-thread-affinity-pthread2.out=====
info: Detected CPU set size (in bits): 64
info: Maximum test CPU: 5
error: Pinned thread 1 ran on impossible cpu 1
error: Pinned thread 2 ran on impossible cpu 0
error: Pinned thread 3 ran on impossible cpu 3
info: Main thread ran on 5 CPU(s) of 6 available CPU(s)
info: Other threads ran on 6 CPU(s)

I also encountered one local failure, but it is rare. Maybe it's
load-related. There shouldn't be any CPU unplug or anything like that
involved here.

I am not entirely sure if something is changing CPU affinities from
outside the process (which would be quite wrong, but not a kernel bug).
But in the past, our glibc test has detected real rseq cpu_id
brokenness, so I'm leaning towards that as the cause this time, too.

Thanks,
Florian


2023-01-11 15:01:38

by Mathieu Desnoyers

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

On 2023-01-11 06:26, Florian Weimer wrote:
> The glibc test suite contains a test that verifies that sched_getcpu
> returns the expected CPU number for a thread that is pinned (via
> sched_setaffinity) to a specific CPU. There are other threads running
> which attempt to de-schedule the pinned thread from its CPU. I believe
> the test is correctly doing what it is expected to do; it is invalid
> only if one believes that it is okay for the kernel to disregard the
> affinity mask for scheduling decisions.
>
> These days, we use the cpu_id rseq field as the return value of
> sched_getcpu if the kernel has rseq support (which it has in these
> cases).
>
> This test has started failing sporadically for us, some time around
> kernel 6.0. I see failure occasionally on a Fedora builder, it runs:
>
> Linux buildvm-x86-26.iad2.fedoraproject.org 6.0.15-300.fc37.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 21 18:33:23 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
>
> I think I've seen it on the x86-64 builder only, but that might just be
> an accident.
>
> The failing tests log this output:
>
> =====FAIL: nptl/tst-thread-affinity-pthread.out=====
> info: Detected CPU set size (in bits): 64
> info: Maximum test CPU: 5
> error: Pinned thread 1 ran on impossible cpu 0
> error: Pinned thread 0 ran on impossible cpu 0
> info: Main thread ran on 4 CPU(s) of 6 available CPU(s)
> info: Other threads ran on 6 CPU(s)
> =====FAIL: nptl/tst-thread-affinity-pthread2.out=====
> info: Detected CPU set size (in bits): 64
> info: Maximum test CPU: 5
> error: Pinned thread 1 ran on impossible cpu 1
> error: Pinned thread 2 ran on impossible cpu 0
> error: Pinned thread 3 ran on impossible cpu 3
> info: Main thread ran on 5 CPU(s) of 6 available CPU(s)
> info: Other threads ran on 6 CPU(s)
>
> I also encountered one local failure, but it is rare. Maybe it's
> load-related. There shouldn't be any CPU unplug or anything like that
> involved here.
>
> I am not entirely sure if something is changing CPU affinities from
> outside the process (which would be quite wrong, but not a kernel bug).
> But in the past, our glibc test has detected real rseq cpu_id
> brokenness, so I'm leaning towards that as the cause this time, too.

It can be caused by rseq failing to update the cpu number field on
return to userspace. This could be validated by printing the regular
getcpu vdso value and/or the value returned by the getcpu system call
when the error is triggered, and seeing whether it matches the rseq cpu
id value.
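
For instance, the test's error path could dump all three values.
A rough sketch; it assumes glibc 2.35+ for <sys/rseq.h>, __rseq_offset
and __rseq_size, a compiler providing __builtin_thread_pointer, and
the helper name is only illustrative:

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <sys/rseq.h>      /* __rseq_offset, __rseq_size, struct rseq */
#include <sys/syscall.h>
#include <unistd.h>

/* Print the CPU as seen by the raw getcpu system call (bypassing both
   the vDSO and rseq), by sched_getcpu, and by the rseq area itself.  */
static void
report_cpu_views (void)
{
  unsigned int sys_cpu, sys_node;
  if (syscall (SYS_getcpu, &sys_cpu, &sys_node, NULL) != 0)
    {
      perror ("getcpu");
      return;
    }
  int rseq_cpu = -1;
  if (__rseq_size > 0)          /* rseq area actually registered.  */
    {
      struct rseq *rs
        = (struct rseq *) ((char *) __builtin_thread_pointer ()
                           + __rseq_offset);
      rseq_cpu = (int) rs->cpu_id;
    }
  fprintf (stderr,
           "getcpu syscall: %u (node %u), sched_getcpu: %d, "
           "rseq cpu_id: %d\n",
           sys_cpu, sys_node, sched_getcpu (), rseq_cpu);
}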

It can also be caused by scheduler failure to take the affinity into
account.

As you also point out, it can also be caused by some other task
modifying the affinity of your task concurrently. You could print
the result of sched_getaffinity on error to get a better idea of
the expected vs actual mask.
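
Something along these lines, for example (a sketch; the helper name is
only illustrative):

#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Print the thread's current affinity mask so the expected singleton
   set can be compared with what the kernel actually reports.  */
static void
print_affinity (void)
{
  cpu_set_t set;
  if (sched_getaffinity (0, sizeof (set), &set) != 0)
    {
      perror ("sched_getaffinity");
      return;
    }
  fprintf (stderr, "CPU affinity mask:");
  for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
    if (CPU_ISSET (cpu, &set))
      fprintf (stderr, " %d", cpu);
  fputc ('\n', stderr);
}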

Lastly, it could be caused by CPU hotplug which would set all bits
in the affinity mask as a fallback. As you mention it should not be
the cause there.

Can you share your kernel configuration ?

Thanks,

Mathieu

>
> Thanks,
> Florian
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-01-11 19:43:04

by Mathieu Desnoyers

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

On 2023-01-11 09:52, Mathieu Desnoyers wrote:
> On 2023-01-11 06:26, Florian Weimer wrote:
>> The glibc test suite contains a test that verifies that sched_getcpu
>> returns the expected CPU number for a thread that is pinned (via
>> sched_setaffinity) to a specific CPU.  There are other threads running
>> which attempt to de-schedule the pinned thread from its CPU.  I believe
>> the test is correctly doing what it is expected to do; it is invalid
>> only if one believes that it is okay for the kernel to disregard the
>> affinity mask for scheduling decisions.
>>
>> These days, we use the cpu_id rseq field as the return value of
>> sched_getcpu if the kernel has rseq support (which it has in these
>> cases).
>>
>> This test has started failing sporadically for us, some time around
>> kernel 6.0.  I see failure occasionally on a Fedora builder, it runs:
>>
>> Linux buildvm-x86-26.iad2.fedoraproject.org 6.0.15-300.fc37.x86_64 #1
>> SMP PREEMPT_DYNAMIC Wed Dec 21 18:33:23 UTC 2022 x86_64 x86_64 x86_64
>> GNU/Linux
>>
>> I think I've seen it on the x86-64 builder only, but that might just be
>> an accident.
>>
>> The failing tests log this output:
>>
>> =====FAIL: nptl/tst-thread-affinity-pthread.out=====
>> info: Detected CPU set size (in bits): 64
>> info: Maximum test CPU: 5
>> error: Pinned thread 1 ran on impossible cpu 0
>> error: Pinned thread 0 ran on impossible cpu 0
>> info: Main thread ran on 4 CPU(s) of 6 available CPU(s)
>> info: Other threads ran on 6 CPU(s)
>> =====FAIL: nptl/tst-thread-affinity-pthread2.out=====
>> info: Detected CPU set size (in bits): 64
>> info: Maximum test CPU: 5
>> error: Pinned thread 1 ran on impossible cpu 1
>> error: Pinned thread 2 ran on impossible cpu 0
>> error: Pinned thread 3 ran on impossible cpu 3
>> info: Main thread ran on 5 CPU(s) of 6 available CPU(s)
>> info: Other threads ran on 6 CPU(s)
>>
>> I also encountered one local failure, but it is rare.  Maybe it's
>> load-related.  There shouldn't be any CPU unplug or anything like that
>> involved here.
>>
>> I am not entirely sure if something is changing CPU affinities from
>> outside the process (which would be quite wrong, but not a kernel bug).
>> But in the past, our glibc test has detected real rseq cpu_id
>> brokenness, so I'm leaning towards that as the cause this time, too.
>
> It can be caused by rseq failing to update the cpu number field on
> return to userspace. This could be validated by printing the regular
> getcpu vdso value and/or the value returned by the getcpu system call
> when the error is triggered, and seeing whether it matches the rseq cpu
> id value.
>
> It can also be caused by scheduler failure to take the affinity into
> account.
>
> As you also point out, it can also be caused by some other task
> modifying the affinity of your task concurrently. You could print
> the result of sched_getaffinity on error to get a better idea of
> the expected vs actual mask.
>
> Lastly, it could be caused by CPU hotplug which would set all bits
> in the affinity mask as a fallback. As you mention it should not be
> the cause there.
>
> Can you share your kernel configuration ?

Also, can you provide more information about the cpufreq driver and
governor used in your system ? e.g. output of

cpupower frequency-info

and also output of

sysctl kernel.sched_energy_aware

Is this on a physical machine or in a virtual machine ?

Thanks,

Mathieu

>
> Thanks,
>
> Mathieu
>
>>
>> Thanks,
>> Florian
>>
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-01-11 22:34:13

by Mathieu Desnoyers

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

On 2023-01-11 14:31, Mathieu Desnoyers wrote:
> On 2023-01-11 09:52, Mathieu Desnoyers wrote:
>> On 2023-01-11 06:26, Florian Weimer wrote:
>>> The glibc test suite contains a test that verifies that sched_getcpu
>>> returns the expected CPU number for a thread that is pinned (via
>>> sched_setaffinity) to a specific CPU.  There are other threads running
>>> which attempt to de-schedule the pinned thread from its CPU.  I believe
>>> the test is correctly doing what it is expected to do; it is invalid
>>> only if one believes that it is okay for the kernel to disregard the
>>> affinity mask for scheduling decisions.
>>>
>>> These days, we use the cpu_id rseq field as the return value of
>>> sched_getcpu if the kernel has rseq support (which it has in these
>>> cases).
>>>
>>> This test has started failing sporadically for us, some time around
>>> kernel 6.0.  I see failure occasionally on a Fedora builder, it runs:
>>>
>>> Linux buildvm-x86-26.iad2.fedoraproject.org 6.0.15-300.fc37.x86_64 #1
>>> SMP PREEMPT_DYNAMIC Wed Dec 21 18:33:23 UTC 2022 x86_64 x86_64 x86_64
>>> GNU/Linux
>>>
>>> I think I've seen it on the x86-64 builder only, but that might just be
>>> an accident.
>>>
>>> The failing tests log this output:
>>>
>>> =====FAIL: nptl/tst-thread-affinity-pthread.out=====
>>> info: Detected CPU set size (in bits): 64
>>> info: Maximum test CPU: 5
>>> error: Pinned thread 1 ran on impossible cpu 0
>>> error: Pinned thread 0 ran on impossible cpu 0
>>> info: Main thread ran on 4 CPU(s) of 6 available CPU(s)
>>> info: Other threads ran on 6 CPU(s)
>>> =====FAIL: nptl/tst-thread-affinity-pthread2.out=====
>>> info: Detected CPU set size (in bits): 64
>>> info: Maximum test CPU: 5
>>> error: Pinned thread 1 ran on impossible cpu 1
>>> error: Pinned thread 2 ran on impossible cpu 0
>>> error: Pinned thread 3 ran on impossible cpu 3
>>> info: Main thread ran on 5 CPU(s) of 6 available CPU(s)
>>> info: Other threads ran on 6 CPU(s)
>>>
>>> I also encountered one local failure, but it is rare.  Maybe it's
>>> load-related.  There shouldn't be any CPU unplug or anything like that
>>> involved here.
>>>
>>> I am not entirely sure if something is changing CPU affinities from
>>> outside the process (which would be quite wrong, but not a kernel bug).
>>> But in the past, our glibc test has detected real rseq cpu_id
>>> brokenness, so I'm leaning towards that as the cause this time, too.
>>
>> It can be caused by rseq failing to update the cpu number field on
>> return to userspace. This could be validated by printing the regular
>> getcpu vdso value and/or the value returned by the getcpu system call
>> when the error is triggered, and seeing whether it matches the rseq
>> cpu id value.
>>
>> It can also be caused by scheduler failure to take the affinity into
>> account.
>>
>> As you also point out, it can also be caused by some other task
>> modifying the affinity of your task concurrently. You could print
>> the result of sched_getaffinity on error to get a better idea of
>> the expected vs actual mask.
>>
>> Lastly, it could be caused by CPU hotplug which would set all bits
>> in the affinity mask as a fallback. As you mention it should not be
>> the cause there.
>>
>> Can you share your kernel configuration ?
>
> Also, can you provide more information about the cpufreq driver and
> governor used in your system ? e.g. output of
>
> cpupower frequency-info
>
> and also output of
>
> sysctl kernel.sched_energy_aware
>
> Is this on a physical machine or in a virtual machine ?

And one more thing: can you reproduce with a CONFIG_RSEQ=n kernel, or
when disabling rseq with the glibc.pthread.rseq glibc tunable ?
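
For the tunable route, something like this should do it (assuming the
test binary is run directly; the path is only illustrative):

GLIBC_TUNABLES=glibc.pthread.rseq=0 ./nptl/tst-thread-affinity-pthread

With rseq registration disabled, sched_getcpu falls back to the
vDSO/getcpu path, so a failure that persists would point away from the
rseq cpu_id field.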

Thanks,

Mathieu

>
> Thanks,
>
> Mathieu
>
>>
>> Thanks,
>>
>> Mathieu
>>
>>>
>>> Thanks,
>>> Florian
>>>
>>
>

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-01-12 17:17:20

by Florian Weimer

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

* Mathieu Desnoyers:

> As you also point out, it can also be caused by some other task
> modifying the affinity of your task concurrently. You could print
> the result of sched_getaffinity on error to get a better idea of
> the expected vs actual mask.
>
> Lastly, it could be caused by CPU hotplug which would set all bits
> in the affinity mask as a fallback. As you mention it should not be
> the cause there.
>
> Can you share your kernel configuration ?

Attached.

cpupower frequency-info says:

analyzing CPU 0:
driver: intel_cpufreq
CPUs which run at the same hardware frequency: 0
CPUs which need to have their frequency coordinated by software: 0
maximum transition latency: 20.0 us
hardware limits: 800 MHz - 4.60 GHz
available cpufreq governors: conservative ondemand userspace powersave performance schedutil
current policy: frequency should be within 800 MHz and 4.60 GHz.
The governor "schedutil" may decide which speed to use
within this range.
current CPU frequency: Unable to call hardware
current CPU frequency: 3.20 GHz (asserted by call to kernel)
boost state support:
Supported: yes
Active: yes

And I have: kernel.sched_energy_aware = 1

> Is this on a physical machine or in a virtual machine ?

I think it happened on both.

I added additional error reporting to the test (running on kernel
6.0.18-300.fc37.x86_64), and it seems that there is something that is
mucking with affinity masks:

info: Detected CPU set size (in bits): 64
info: Maximum test CPU: 19
error: Pinned thread 17 ran on impossible cpu 7
info: getcpu reported CPU 7, node 0
info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
error: Pinned thread 3 ran on impossible cpu 13
info: getcpu reported CPU 13, node 0
info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
info: Main thread ran on 2 CPU(s) of 20 available CPU(s)
info: Other threads ran on 20 CPU(s)

For each of these threads, the affinity mask should be a singleton set.
Now I need to find out if there is a process that changes affinity
settings.

Thanks,
Florian


Attachments:
config (169.56 kB)

2023-01-12 21:31:39

by Mathieu Desnoyers

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

On 2023-01-12 11:33, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> As you also point out, it can also be caused by some other task
>> modifying the affinity of your task concurrently. You could print
>> the result of sched_getaffinity on error to get a better idea of
>> the expected vs actual mask.
>>
>> Lastly, it could be caused by CPU hotplug which would set all bits
>> in the affinity mask as a fallback. As you mention it should not be
>> the cause there.
>>
>> Can you share your kernel configuration ?
>
> Attached.
>
> cpupower frequency-info says:
>
> analyzing CPU 0:
> driver: intel_cpufreq
> CPUs which run at the same hardware frequency: 0
> CPUs which need to have their frequency coordinated by software: 0
> maximum transition latency: 20.0 us
> hardware limits: 800 MHz - 4.60 GHz
> available cpufreq governors: conservative ondemand userspace powersave performance schedutil
> current policy: frequency should be within 800 MHz and 4.60 GHz.
> The governor "schedutil" may decide which speed to use
> within this range.
> current CPU frequency: Unable to call hardware
> current CPU frequency: 3.20 GHz (asserted by call to kernel)
> boost state support:
> Supported: yes
> Active: yes
>
> And I have: kernel.sched_energy_aware = 1
>
>> Is this on a physical machine or in a virtual machine ?
>
> I think it happened on both.
>
> I added additional error reporting to the test (running on kernel
> 6.0.18-300.fc37.x86_64), and it seems that there is something that is
> mucking with affinity masks:
>
> info: Detected CPU set size (in bits): 64
> info: Maximum test CPU: 19
> error: Pinned thread 17 ran on impossible cpu 7
> info: getcpu reported CPU 7, node 0
> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
> error: Pinned thread 3 ran on impossible cpu 13
> info: getcpu reported CPU 13, node 0
> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
> info: Main thread ran on 2 CPU(s) of 20 available CPU(s)
> info: Other threads ran on 20 CPU(s)
>
> For each of these threads, the affinity mask should be a singleton set.
> Now I need to find out if there is a process that changes affinity
> settings.

If it's not cpu hotunplug, then perhaps something like systemd modifies
the AllowedCPUs of your cpuset concurrently ?
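
One way to check would be to look at the cgroup the test runs in while
it executes, roughly (cgroup v2 assumed, placeholders to be filled in):

cat /proc/<test pid>/cgroup
cat /sys/fs/cgroup/<cgroup path from above>/cpuset.cpus.effective
systemctl show -p AllowedCPUs <enclosing slice or service>

If cpuset.cpus.effective changes under the test, that would explain the
widened affinity masks.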

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
https://www.efficios.com

2023-01-13 16:41:12

by Waiman Long

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

On 1/13/23 11:06, Florian Weimer wrote:
> * Mathieu Desnoyers:
>
>> On 2023-01-12 11:33, Florian Weimer wrote:
>>> * Mathieu Desnoyers:
>>>
>>>> As you also point out, it can also be caused by some other task
>>>> modifying the affinity of your task concurrently. You could print
>>>> the result of sched_getaffinity on error to get a better idea of
>>>> the expected vs actual mask.
>>>>
>>>> Lastly, it could be caused by CPU hotplug which would set all bits
>>>> in the affinity mask as a fallback. As you mention it should not be
>>>> the cause there.
>>>>
>>>> Can you share your kernel configuration ?
>>> Attached.
>>> cpupower frequency-info says:
>>> analyzing CPU 0:
>>> driver: intel_cpufreq
>>> CPUs which run at the same hardware frequency: 0
>>> CPUs which need to have their frequency coordinated by software: 0
>>> maximum transition latency: 20.0 us
>>> hardware limits: 800 MHz - 4.60 GHz
>>> available cpufreq governors: conservative ondemand userspace powersave performance schedutil
>>> current policy: frequency should be within 800 MHz and 4.60 GHz.
>>> The governor "schedutil" may decide which speed to use
>>> within this range.
>>> current CPU frequency: Unable to call hardware
>>> current CPU frequency: 3.20 GHz (asserted by call to kernel)
>>> boost state support:
>>> Supported: yes
>>> Active: yes
>>> And I have: kernel.sched_energy_aware = 1
>>>
>>>> Is this on a physical machine or in a virtual machine ?
>>> I think it happened on both.
>>> I added additional error reporting to the test (running on kernel
>>> 6.0.18-300.fc37.x86_64), and it seems that there is something that is
>>> mucking with affinity masks:
>>> info: Detected CPU set size (in bits): 64
>>> info: Maximum test CPU: 19
>>> error: Pinned thread 17 ran on impossible cpu 7
>>> info: getcpu reported CPU 7, node 0
>>> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
>>> error: Pinned thread 3 ran on impossible cpu 13
>>> info: getcpu reported CPU 13, node 0
>>> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
>>> info: Main thread ran on 2 CPU(s) of 20 available CPU(s)
>>> info: Other threads ran on 20 CPU(s)
>>> For each of these threads, the affinity mask should be a singleton
>>> set.
>>> Now I need to find out if there is a process that changes affinity
>>> settings.
>> If it's not cpu hotunplug, then perhaps something like systemd
>> modifies the AllowedCPUs of your cpuset concurrently ?
> It's probably just this kernel bug:
>
> commit da019032819a1f09943d3af676892ec8c627668e
> Author: Waiman Long <[email protected]>
> Date: Thu Sep 22 14:00:39 2022 -0400
>
> sched: Enforce user requested affinity
>
> It was found that the user requested affinity via sched_setaffinity()
> can be easily overwritten by other kernel subsystems without an easy way
> to reset it back to what the user requested. For example, any change
> to the current cpuset hierarchy may reset the cpumask of the tasks in
> the affected cpusets to the default cpuset value even if those tasks
> have pre-existing user requested affinity. That is especially easy to
> trigger under a cgroup v2 environment where writing "+cpuset" to the
> root cgroup's cgroup.subtree_control file will reset the cpus affinity
> of all the processes in the system.
>
> That is problematic in a nohz_full environment where the tasks running
> in the nohz_full CPUs usually have their cpus affinity explicitly set
> and will behave incorrectly if cpus affinity changes.
>
> Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
> and use it to restrict the given cpumask unless there is no overlap. In
> that case, it will fallback to the given one. The SCA_USER flag is
> reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
> masking should be skipped. In addition, masking should also be skipped
> if any of the SCA_MIGRATE_* flag is set.
>
> All callers of set_cpus_allowed_ptr() will be affected by this change.
> A scratch cpumask is added to percpu runqueues structure for doing
> additional masking when user_cpus_ptr is set.
>
> Signed-off-by: Waiman Long <[email protected]>
> Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
> Link: https://lkml.kernel.org/r/[email protected]
>
> I don't think it's been merged into any stable kernels yet?

This patch will be in the v6.2 kernel. Since it is not marked as a fix,
it won't go into a stable kernel by default.

Cheers,
Longman

2023-01-13 17:22:22

by Florian Weimer

Subject: Re: rseq CPU ID not correct on 6.0 kernels for pinned threads

* Mathieu Desnoyers:

> On 2023-01-12 11:33, Florian Weimer wrote:
>> * Mathieu Desnoyers:
>>
>>> As you also point out, it can also be caused by some other task
>>> modifying the affinity of your task concurrently. You could print
>>> the result of sched_getaffinity on error to get a better idea of
>>> the expected vs actual mask.
>>>
>>> Lastly, it could be caused by CPU hotplug which would set all bits
>>> in the affinity mask as a fallback. As you mention it should not be
>>> the cause there.
>>>
>>> Can you share your kernel configuration ?
>> Attached.
>> cpupower frequency-info says:
>> analyzing CPU 0:
>> driver: intel_cpufreq
>> CPUs which run at the same hardware frequency: 0
>> CPUs which need to have their frequency coordinated by software: 0
>> maximum transition latency: 20.0 us
>> hardware limits: 800 MHz - 4.60 GHz
>> available cpufreq governors: conservative ondemand userspace powersave performance schedutil
>> current policy: frequency should be within 800 MHz and 4.60 GHz.
>> The governor "schedutil" may decide which speed to use
>> within this range.
>> current CPU frequency: Unable to call hardware
>> current CPU frequency: 3.20 GHz (asserted by call to kernel)
>> boost state support:
>> Supported: yes
>> Active: yes
>> And I have: kernel.sched_energy_aware = 1
>>
>>> Is this on a physical machine or in a virtual machine ?
>> I think it happened on both.
>> I added additional error reporting to the test (running on kernel
>> 6.0.18-300.fc37.x86_64), and it seems that there is something that is
>> mucking with affinity masks:
>> info: Detected CPU set size (in bits): 64
>> info: Maximum test CPU: 19
>> error: Pinned thread 17 ran on impossible cpu 7
>> info: getcpu reported CPU 7, node 0
>> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
>> error: Pinned thread 3 ran on impossible cpu 13
>> info: getcpu reported CPU 13, node 0
>> info: CPU affinity mask: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
>> info: Main thread ran on 2 CPU(s) of 20 available CPU(s)
>> info: Other threads ran on 20 CPU(s)
>> For each of these threads, the affinity mask should be a singleton
>> set.
>> Now I need to find out if there is a process that changes affinity
>> settings.
>
> If it's not cpu hotunplug, then perhaps something like systemd
> modifies the AllowedCPUs of your cpuset concurrently ?

It's probably just this kernel bug:

commit da019032819a1f09943d3af676892ec8c627668e
Author: Waiman Long <[email protected]>
Date: Thu Sep 22 14:00:39 2022 -0400

sched: Enforce user requested affinity

It was found that the user requested affinity via sched_setaffinity()
can be easily overwritten by other kernel subsystems without an easy way
to reset it back to what the user requested. For example, any change
to the current cpuset hierarchy may reset the cpumask of the tasks in
the affected cpusets to the default cpuset value even if those tasks
have pre-existing user requested affinity. That is especially easy to
trigger under a cgroup v2 environment where writing "+cpuset" to the
root cgroup's cgroup.subtree_control file will reset the cpus affinity
of all the processes in the system.

That is problematic in a nohz_full environment where the tasks running
in the nohz_full CPUs usually have their cpus affinity explicitly set
and will behave incorrectly if cpus affinity changes.

Fix this problem by looking at user_cpus_ptr in __set_cpus_allowed_ptr()
and use it to restrict the given cpumask unless there is no overlap. In
that case, it will fallback to the given one. The SCA_USER flag is
reused to indicate intent to set user_cpus_ptr and so user_cpus_ptr
masking should be skipped. In addition, masking should also be skipped
if any of the SCA_MIGRATE_* flag is set.

All callers of set_cpus_allowed_ptr() will be affected by this change.
A scratch cpumask is added to percpu runqueues structure for doing
additional masking when user_cpus_ptr is set.

Signed-off-by: Waiman Long <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lkml.kernel.org/r/[email protected]

I don't think it's been merged into any stable kernels yet?
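
If that is the issue, the behaviour described in the commit message
should be easy to demonstrate on a kernel without the fix, roughly
(untested sketch, cgroup v2, as root):

taskset -cp 2 $$
echo +cpuset > /sys/fs/cgroup/cgroup.subtree_control
taskset -cp $$

On an affected kernel the second taskset call should show the affinity
widened back to all CPUs; with the commit applied, the pinned mask
should survive.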

Thanks,
Florian