2020-10-20 13:16:38

by Mathieu Desnoyers

Subject: Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

----- On Oct 19, 2020, at 11:24 PM, Xing Zhengjun [email protected] wrote:

> On 10/7/2020 10:50 PM, Mathieu Desnoyers wrote:
>> ----- On Oct 2, 2020, at 4:33 AM, Rong Chen [email protected] wrote:
>>
>>> Greeting,
>>>
>>> FYI, we noticed a -37.0% regression of will-it-scale.per_thread_ops due to
>>> commit:
>>>
>>>
>>> commit: bdfcae11403e5099769a7c8dc3262e3c4193edef ("[RFC PATCH 2/3] sched:
>>> membarrier: cover kthread_use_mm (v3)")
>>> url:
>>> https://github.com/0day-ci/linux/commits/Mathieu-Desnoyers/Membarrier-updates/20200925-012549
>>> base: https://git.kernel.org/cgit/linux/kernel/git/tip/tip.git
>>> 848785df48835eefebe0c4eb5da7690690b0a8b7
>>>
>>> in testcase: will-it-scale
>>> on test machine: 104 threads Skylake with 192G memory
>>> with following parameters:
>>>
>>> nr_task: 50%
>>> mode: thread
>>> test: context_switch1
>>> cpufreq_governor: performance
>>> ucode: 0x2006906
>>>
>>> test-description: Will It Scale takes a testcase and runs it from 1 through to n
>>> parallel copies to see if the testcase will scale. It builds both a process and
>>> threads based test in order to see any differences between the two.
>>> test-url: https://github.com/antonblanchard/will-it-scale
>>>
>>
>> Hi,
>>
>> I would like to report what I suspect is a random thread placement issue in
>> the context_switch1 test used by the 0day bot when running on a machine with
>> hyperthreading enabled.
>>
>> AFAIU the test code uses hwloc for thread placement, which should
>> theoretically ensure that each thread is placed on the same processing unit,
>> core and numa node between runs.
>>
>> We can find the test code here:
>>
>> https://github.com/antonblanchard/will-it-scale/blob/master/tests/context_switch1.c
>>
>> And the main file containing thread setup is here:
>>
>> https://github.com/antonblanchard/will-it-scale/blob/master/main.c
>>
>> AFAIU, the test is started without the "-m" switch, which therefore affinitizes
>> tasks on cores rather than on processing units (SMT threads).
>>
>> When testcase() creates the child thread with new_task(), it basically issues:
>>
>> pthread_create(&threads[nr_threads++], NULL, func, arg);
>>
>> passing a NULL pthread_attr_t, and not executing any pre_trampoline on the
>> child. The pre_trampoline would have issued hwloc_set_thread_cpubind if it
>> were executed on the child, but it's not. Therefore, we expect the cpu
>> affinity mask of the parent to be copied on clone and used by the child.
>>
>> A quick test on a machine with hyperthreading enabled shows that the cpu
>> affinity mask for the parent and child has two bits set:
>>
>> taskset -p 1868607
>> pid 1868607's current affinity mask: 10001
>> taskset -p 1868606
>> pid 1868606's current affinity mask: 10001
>>
>> So AFAIU the placement of the parent and child will be random on either the same
>> processing unit, or on separate processing units within the same core.
>>
>> I suspect this randomness can significantly affect the performance number
>> between runs, and trigger unwarranted performance regression warnings.
>>
>> Thanks,
>>
>> Mathieu
>>
> Yes, the randomness may happen in some special cases. But in 0-day, we
> test multiple times (>=3), and the report is the average number.
> For this case, we tested 4 times and it is stable; the variation is ± 2%.
> So I don't think the -37.0% regression is caused by the randomness.
>
> 0/stats.json: "will-it-scale.per_thread_ops": 105228,
> 1/stats.json: "will-it-scale.per_thread_ops": 100443,
> 2/stats.json: "will-it-scale.per_thread_ops": 98786,
> 3/stats.json: "will-it-scale.per_thread_ops": 102821,
>
> c2daff748f0ea954           bdfcae11403e5099769a7c8dc32
> ----------------           ---------------------------
>          %stddev      %change          %stddev
>              \            |                \
>     161714 ±  2%      -37.0%      101819 ±  2%  will-it-scale.per_thread_ops

Whether this specific instance of the test is indeed a performance regression
is not the point of this discussion.

What I am pointing out here is that the test needs fixing because it generates
noise due to random thread placement. The issue is whether we, as kernel
maintainers, can trust the results of those tests.

So on one hand, you can fix the test. This is simple to do: make sure the thread
affinity does not allow for this randomness on SMT.
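
For instance, here is a minimal sketch of the kind of pinning I have in mind
(untested, and the processing unit numbers are placeholders that would need to
come from the actual topology, e.g. through hwloc): each thread of a pair binds
itself to its own PU before entering the benchmark loop, so the placement is
identical for every run.

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

/* Pin the calling thread to a single processing unit (SMT thread). */
static void bind_self_to_pu(int pu)
{
        cpu_set_t set;

        CPU_ZERO(&set);
        CPU_SET(pu, &set);
        if (pthread_setaffinity_np(pthread_self(), sizeof(set), &set) != 0)
                fprintf(stderr, "failed to pin thread to PU %d\n", pu);
}

static void *child_func(void *arg)
{
        bind_self_to_pu(*(int *) arg);  /* e.g. the SMT sibling of the parent's PU */
        /* ... benchmark loop ... */
        return NULL;
}

static void run_pair(int parent_pu, int child_pu)       /* placeholder PU numbers */
{
        pthread_t child;

        bind_self_to_pu(parent_pu);
        pthread_create(&child, NULL, child_func, &child_pu);
        /* ... benchmark loop ... */
        pthread_join(child, NULL);
}

AFAIU, another option would be to start the test with its "-m" switch so tasks
get affinitized on processing units rather than on cores, as long as that
matches the intended placement.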

But you seem to argue that the test does not need to be fixed, because the 0day
infrastructure in which it runs will compensate for this randomness. I really
doubt that.

If you indeed choose to argue that the test does not need fixing, then here is the
statistical analysis I am looking for:

- With the 4 runs, what are the odds that the average result for one class significantly
differs from the other class due to this randomness alone ? The odds may be small, but
they are certainly not zero.
- Based on those odds, and on the number of performance regression tests performed by 0day
each year, how frequently does 0day end up spamming kernel developers with spurious reports
because of this randomness ?
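
To illustrate the first point with a deliberately over-simplified toy model:
if each run independently ended up entirely in one of the two placement regimes
(both threads on the same PU vs. on sibling PUs) with probability 1/2, the odds
that all 4 runs of one commit land in one regime while all 4 runs of the other
commit land in the other regime would be 2 * (1/2)^8 = 1/128, i.e. about 0.8%.
The real runs involve many thread pairs (one per copy of the testcase), so the
actual odds depend on how correlated the per-pair placements are, which is
exactly the kind of analysis I would expect before trusting the noise to
average out.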

That being said, I would find it much more productive for us to work together on fixing
the test rather than on justifying why it can stay broken. Let me know if you have
specific questions on how to fix the test, and I'll be happy to help out.

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com


2020-10-22 07:59:05

by Xing Zhengjun

Subject: Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression



On 10/20/2020 9:14 PM, Mathieu Desnoyers wrote:
> ----- On Oct 19, 2020, at 11:24 PM, Xing Zhengjun [email protected] wrote:
>
>> On 10/7/2020 10:50 PM, Mathieu Desnoyers wrote:
>>> [...]
>> Yes, the randomness may happen in some special cases. But in 0-day, we
>> test multiple times (>=3), and the report is the average number.
>> For this case, we tested 4 times and it is stable; the variation is ± 2%.
>> So I don't think the -37.0% regression is caused by the randomness.
>>
>> 0/stats.json: "will-it-scale.per_thread_ops": 105228,
>> 1/stats.json: "will-it-scale.per_thread_ops": 100443,
>> 2/stats.json: "will-it-scale.per_thread_ops": 98786,
>> 3/stats.json: "will-it-scale.per_thread_ops": 102821,
>>
>> c2daff748f0ea954           bdfcae11403e5099769a7c8dc32
>> ----------------           ---------------------------
>>          %stddev      %change          %stddev
>>              \            |                \
>>     161714 ±  2%      -37.0%      101819 ±  2%  will-it-scale.per_thread_ops
>
> Whether this specific instance of the test is indeed a performance regression
> is not the point of this discussion.
>
> What I am pointing out here is that the test needs fixing because it generates
> noise due to random thread placement. The issue is whether we, as kernel
> maintainers, can trust the results of those tests.
>
> So on one hand, you can fix the test. This is simple to do: make sure the thread
> affinity does not allow for this randomness on SMT.
>
> But you seem to argue that the test does not need to be fixed, because the 0day
> infrastructure in which it runs will compensate for this randomness. I really
> doubt that.
>
> If you indeed choose to argue that the test does not need fixing, then here is the
> statistical analysis I am looking for:
>
> - With the 4 runs, what are the odds that the average result for one class significantly
> differs from the other class due to this randomness alone ? The odds may be small, but
> they are certainly not zero.

If 4 runs are not enough, how many runs do you think would be enough? In
fact, I have re-tested it more than 10 times, and the test result is
almost the same.
=========================================================================================
tbox_group/testcase/rootfs/kconfig/compiler/nr_task/mode/test/cpufreq_governor/ucode/debug-setup:

lkp-skl-fpga01/will-it-scale/debian-10.4-x86_64-20200603.cgz/x86_64-rhel-8.3/gcc-9/50%/thread/context_switch1/performance/0x2006906/test2

commit:
c2daff748f0ea954746e8e3465998b1090be7c30
bdfcae11403e5099769a7c8dc3262e3c4193edef

c2daff748f0ea954           bdfcae11403e5099769a7c8dc32
----------------           ---------------------------
         %stddev      %change          %stddev
             \            |                \
    161582            -37.2%      101435        will-it-scale.per_thread_ops
   8402288            -37.2%     5274649        will-it-scale.workload


> - Based on those odds, and on the number of performance regression tests performed by 0day
> each year, how frequently does 0day end up spamming kernel developers with spurious reports
> because of this randomness ?
>
> That being said, I would find it much more productive for us to work together on fixing
> the test rather than on justifying why it can stay broken. Let me know if you have
> specific questions on how to fix the test, and I'll be happy to help out.
>
> Thanks,
>
> Mathieu
>
In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
you think the will-it-scale benchmark has some issues, you can
contribute your ideas and help improve it, and later we will update the
will-it-scale benchmark to the new version.
For this test case, if we bind the workload to a specific CPU, then it
will hide the scheduler balance issue. In the real world, we seldom bind
the CPU...

--
Zhengjun Xing

2020-10-22 17:17:26

by Mathieu Desnoyers

Subject: Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

----- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun [email protected] wrote:
[...]
> In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
> you think the will-it-scale benchmark has some issues, you can
> contribute your ideas and help improve it, and later we will update the
> will-it-scale benchmark to the new version.

This is why I CC'd the maintainer of the will-it-scale github project, Anton Blanchard.
My main intent is to report this issue to him, but I have not heard back from him yet.
Is this project maintained ? Let me try to add his ozlabs.org address in CC.

> For this test case, if we bind the workload to a specific CPU, then it
> will hide the scheduler balance issue. In the real world, we seldom bind
> the CPU...

When you say that you bind the workload to a specific CPU, is that done
outside of the will-it-scale testsuite, thus limiting the entire testsuite
to a single CPU, or do you expect that internally the will-it-scale
context_switch1 test gets affined to a single specific CPU/core/hardware
thread through use of hwloc ?

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com

2020-10-23 08:01:52

by Xing Zhengjun

Subject: Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression



On 10/22/2020 9:19 PM, Mathieu Desnoyers wrote:
> ----- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun [email protected] wrote:
> [...]
>> In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
>> you think the will-it-scale benchmark has some issues, you can
>> contribute your ideas and help improve it, and later we will update the
>> will-it-scale benchmark to the new version.
>
> This is why I CC'd the maintainer of the will-it-scale github project, Anton Blanchard.
> My main intent is to report this issue to him, but I have not heard back from him yet.
> Is this project maintained ? Let me try to add his ozlabs.org address in CC.
>
>> For this test case, if we bind the workload to a specific CPU, then it
>> will hide the scheduler balance issue. In the real world, we seldom bind
>> the CPU...
>
> When you say that you bind the workload to a specific CPU, is that done
> outside of the will-it-scale testsuite, thus limiting the entire testsuite
> to a single CPU, or do you expect that internally the will-it-scale
> context_switch1 test gets affined to a single specific CPU/core/hardware
> thread through use of hwloc ?
>
The latter one.

> Thanks,
>
> Mathieu
>

--
Zhengjun Xing

2020-10-23 15:49:40

by Mathieu Desnoyers

Subject: Re: [LKP] Re: [sched] bdfcae1140: will-it-scale.per_thread_ops -37.0% regression

----- On Oct 23, 2020, at 1:37 AM, Xing Zhengjun [email protected] wrote:

> On 10/22/2020 9:19 PM, Mathieu Desnoyers wrote:
>> ----- On Oct 21, 2020, at 9:54 PM, Xing Zhengjun [email protected]
>> wrote:
>> [...]
>>> In fact, 0-day just copies the will-it-scale benchmark from GitHub. If
>>> you think the will-it-scale benchmark has some issues, you can
>>> contribute your ideas and help improve it, and later we will update the
>>> will-it-scale benchmark to the new version.
>>
>> This is why I CC'd the maintainer of the will-it-scale github project, Anton
>> Blanchard.
>> My main intent is to report this issue to him, but I have not heard back from
>> him yet.
>> Is this project maintained ? Let me try to add his ozlabs.org address in CC.
>>
>>> For this test case, if we bind the workload to a specific CPU, then it
>>> will hide the scheduler balance issue. In the real world, we seldom bind
>>> the CPU...
>>
>> When you say that you bind the workload to a specific CPU, is that done
>> outside of the will-it-scale testsuite, thus limiting the entire testsuite
>> to a single CPU, or do you expect that internally the will-it-scale
>> context_switch1 test gets affined to a single specific CPU/core/hardware
>> thread through use of hwloc ?
>>
> The latter one.

Yeah, that's not currently true due to the affinity issue I pointed out. Both threads
used in that test case are free to be scheduled either on the same HW thread, or on
two distinct HW threads on the same core.

Depending on the choice made by the scheduler for each individual test run,
this can lead to very large variation in the benchmark results.
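
What I would expect instead is that each of the two threads gets bound to its
own single PU within the chosen core (or both to the same PU, if that is the
intended workload), so the placement is identical across runs. Here is a rough
sketch of the kind of hwloc-based binding I have in mind (untested; error
handling and the usual hwloc_topology_init()/hwloc_topology_load() setup are
omitted, and the core/PU indices are placeholders):

#include <hwloc.h>
#include <pthread.h>

/*
 * Bind the calling thread to the pu_idx-th processing unit (SMT thread)
 * of the core_idx-th core, instead of leaving it with a two-PU mask.
 */
static void bind_to_pu_of_core(hwloc_topology_t topology,
                               unsigned core_idx, unsigned pu_idx)
{
        hwloc_obj_t core, pu;

        core = hwloc_get_obj_by_type(topology, HWLOC_OBJ_CORE, core_idx);
        pu = hwloc_get_obj_inside_cpuset_by_type(topology, core->cpuset,
                                                 HWLOC_OBJ_PU, pu_idx);
        hwloc_set_thread_cpubind(topology, pthread_self(), pu->cpuset, 0);
}

The parent of each pair would call this with pu_idx 0 and the child with
pu_idx 1 (or both with 0 to measure same-PU context switching), which removes
the run-to-run variation described above.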

Thanks,

Mathieu

--
Mathieu Desnoyers
EfficiOS Inc.
http://www.efficios.com