LinuxLists.cc - [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

2018-04-02 03:23:46

Subject: [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

Greeting,

FYI, we noticed a -9.9% regression of unixbench.score due to commit:

commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: unixbench
on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
with following parameters:

runtime: 300s
nr_task: 100%
test: execl

test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
test-url: https://github.com/kdlucas/byte-unixbench

Details are as below:
-------------------------------------------------------------------------------------------------->

To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
gcc-7/x86_64-rhel-7.2/100%/debian-x86_64-2016-08-31.cgz/300s/nhm-white/execl/unixbench

commit:
a07630b8b2 ("sched/cpufreq/schedutil: Use util_est for OPP selection")
d519329f72 ("sched/fair: Update util_est only on util_avg updates")

a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
---------------- --------------------------
%stddev %change %stddev
\ | \
4626 -9.9% 4167 unixbench.score
3495362 ? 4% +70.4% 5957769 ? 2% unixbench.time.involuntary_context_switches
2.866e+08 -11.6% 2.534e+08 unixbench.time.minor_page_faults
666.75 -9.7% 602.25 unixbench.time.percent_of_cpu_this_job_got
1830 -9.7% 1653 unixbench.time.system_time
395.13 -5.2% 374.58 unixbench.time.user_time
8611715 -58.9% 3537314 ? 3% unixbench.time.voluntary_context_switches
6639375 -9.1% 6033775 unixbench.workload
26025 +3849.3% 1027825 interrupts.CAL:Function_call_interrupts
4856 ? 14% -27.4% 3523 ? 11% slabinfo.filp.active_objs
3534356 -8.8% 3223918 softirqs.RCU
77929 -11.2% 69172 vmstat.system.cs
19489 ? 2% +7.5% 20956 vmstat.system.in
9.05 ? 9% +11.0% 10.05 ? 8% boot-time.dhcp
131.63 ? 4% +8.6% 142.89 ? 7% boot-time.idle
9.07 ? 9% +11.0% 10.07 ? 8% boot-time.kernel_boot
76288 ? 3% -12.8% 66560 ? 3% meminfo.DirectMap4k
16606 -13.1% 14433 meminfo.Inactive
16515 -13.2% 14341 meminfo.Inactive(anon)
11.87 ? 5% +7.8 19.63 ? 4% mpstat.cpu.idle%
0.07 ? 35% -0.0 0.04 ? 17% mpstat.cpu.soft%
68.91 -6.1 62.82 mpstat.cpu.sys%
29291570 +325.4% 1.246e+08 cpuidle.C1.time
8629105 -36.1% 5513780 cpuidle.C1.usage
668733 ? 12% +11215.3% 75668902 ? 2% cpuidle.C1E.time
9763 ? 12% +16572.7% 1627882 ? 2% cpuidle.C1E.usage
1.834e+08 ? 9% +23.1% 2.258e+08 ? 11% cpuidle.C3.time
222674 ? 8% +133.4% 519690 ? 6% cpuidle.C3.usage
4129 -13.3% 3581 proc-vmstat.nr_inactive_anon
4129 -13.3% 3581 proc-vmstat.nr_zone_inactive_anon
2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_hit
2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_local
6625 -10.9% 5905 proc-vmstat.pgactivate
2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgalloc_normal
2.936e+08 -12.6% 2.566e+08 proc-vmstat.pgfault
2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgfree
2850 -15.3% 2413 turbostat.Avg_MHz
8629013 -36.1% 5513569 turbostat.C1
1.09 +3.5 4.61 turbostat.C1%
9751 ? 12% +16593.0% 1627864 ? 2% turbostat.C1E
0.03 ? 19% +2.8 2.80 turbostat.C1E%
222574 ? 8% +133.4% 519558 ? 6% turbostat.C3
6.84 ? 8% +1.5 8.34 ? 10% turbostat.C3%
2.82 ? 7% +250.3% 9.87 ? 2% turbostat.CPU%c1
6552773 ? 3% +23.8% 8111699 ? 2% turbostat.IRQ
2.02 ? 11% +28.3% 2.58 ? 9% turbostat.Pkg%pc3
7.635e+11 -12.5% 6.682e+11 perf-stat.branch-instructions
3.881e+10 -12.9% 3.381e+10 perf-stat.branch-misses
2.09 -0.3 1.77 ? 4% perf-stat.cache-miss-rate%
1.551e+09 -15.1% 1.316e+09 ? 4% perf-stat.cache-misses
26177920 -10.5% 23428188 perf-stat.context-switches
1.99 -2.8% 1.93 perf-stat.cpi
7.553e+12 -14.7% 6.446e+12 perf-stat.cpu-cycles
522523 ? 2% +628.3% 3805664 perf-stat.cpu-migrations
2.425e+10 ? 4% -14.3% 2.078e+10 perf-stat.dTLB-load-misses
1.487e+12 -11.3% 1.319e+12 perf-stat.dTLB-loads
1.156e+10 ? 3% -7.7% 1.066e+10 perf-stat.dTLB-store-misses
6.657e+11 -11.1% 5.915e+11 perf-stat.dTLB-stores
0.15 +0.0 0.15 perf-stat.iTLB-load-miss-rate%
5.807e+09 -11.0% 5.166e+09 perf-stat.iTLB-load-misses
3.799e+12 -12.1% 3.34e+12 perf-stat.iTLB-loads
3.803e+12 -12.2% 3.338e+12 perf-stat.instructions
654.99 -1.4% 646.07 perf-stat.instructions-per-iTLB-miss
0.50 +2.8% 0.52 perf-stat.ipc
2.754e+08 -11.6% 2.435e+08 perf-stat.minor-faults
1.198e+08 ? 7% +73.1% 2.074e+08 ? 4% perf-stat.node-stores
2.754e+08 -11.6% 2.435e+08 perf-stat.page-faults
572928 -3.4% 553258 perf-stat.path-length

unixbench.score

4800 +-+------------------------------------------------------------------+
|+ + + |
4700 +-+ + + :+ +. :+ + + |
| + + + +. : + + + + + + + .+++++ .+ +|
4600 +-+ +++ :+++ + ++: : :+ +++ ++.++++ + ++++ ++ |
| + + + ++ ++ + |
4500 +-+ |
| |
4400 +-+ |
| |
4300 +-+ |
O |
4200 +-O O O OOOO OO OOO OOOO OOOO O O |
|O OO OOOOO O O OO O O O O O OO |
4100 +-+------------------------------------------------------------------+

unixbench.workload

9e+06 +-+---------------------------------------------------------------+
| : |
8.5e+06 +-+ : |
| : |
8e+06 +-+ : |
| :: |
7.5e+06 +-+ : : + |
| +: : : + |
7e+06 +-+ + + :: : :: + + : + + + + + |
|:+ + + : :: : : :: : :+ : : ::+ :+ .+ :+ ++ ++ + ++ ::++|
6.5e+06 +-O+ +++ ++++ +++ + ++ +.+ + ++ + + + + + + + +.+++ + |
O O O O O O O |
6e+06 +O+OOO O OOOOOOOO OOOO OO OOOOOOOOO O O O OO |
| O |
5.5e+06 +-+---------------------------------------------------------------+

[*] bisect-good sample
[O] bisect-bad sample

Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.

Thanks,
Xiaolong

Attachments:

(No filename) (9.69 kB)
config-4.16.0-rc6-00081-gd519329 (168.65 kB)
job-script (6.74 kB)
job.yaml (4.58 kB)
reproduce (37.00 B)
Download all attachments

2018-04-03 11:45:36

by Patrick Bellasi

[permalink] [raw]

Subject: Re: [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

Hi Xiaolong,

On 02-Apr 11:20, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -9.9% regression of unixbench.score due to commit:

thanks for the report, I'll try to reproduce it locally to better
understand what's going on.

Meanwhile, I'm a little puzzled about some of the following
numbers... likely looking at the code it should be more clear.
But, maybe someone already knows the response.

> commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

To give a bit of context, this patch is the last of a small series
introducing util_est:
https://lkml.org/lkml/2018/3/9/245
where we actually enable it by turning of by default the corresponding SCHED_FEAT.
Thus, all the effect of util_est are visible just after this very last patch.

I'm not surprised that, if there are issues related to util_est, they
manifest at this stage.

> in testcase: unixbench
> on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
> with following parameters:
>
> runtime: 300s
> nr_task: 100%
> test: execl
>
> test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> test-url: https://github.com/kdlucas/byte-unixbench

AFAIU, this benchmark is composed of 12 different test cases:
https://github.com/intel/lkp-tests/blob/master/jobs/unixbench.yaml
and it's reporting a regression for "only" 1 of those tests (execl).

Is that correct?

> Details are as below:
> -------------------------------------------------------------------------------------------------->
>
>
> To reproduce:
>
> git clone https://github.com/intel/lkp-tests.git
> cd lkp-tests
> bin/lkp install job.yaml # job file is attached in this email
> bin/lkp run job.yaml

Will try to give it a run in the next days.

>
> =========================================================================================
> compiler/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
> gcc-7/x86_64-rhel-7.2/100%/debian-x86_64-2016-08-31.cgz/300s/nhm-white/execl/unixbench
>
> commit:
> a07630b8b2 ("sched/cpufreq/schedutil: Use util_est for OPP selection")
> d519329f72 ("sched/fair: Update util_est only on util_avg updates")
>
> a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
> ---------------- --------------------------
> %stddev %change %stddev
> \ | \
> 4626 -9.9% 4167 unixbench.score

Is this overall score a composition of the following scores?

In general, it would be nice to see in the following metrics which one
is considered "the lower the better" or "the higher the better".

> 3495362 ? 4% +70.4% 5957769 ? 2% unixbench.time.involuntary_context_switches

The above seems to indicate an increase in contention which generated
scheduler enforced context switches. But, AFAIU, the test under
analysis generates just one single task which keep execl itself.
I can't see how util_est can this generated an increase in context
switches? Will investigate better.

> 2.866e+08 -11.6% 2.534e+08 unixbench.time.minor_page_faults
> 666.75 -9.7% 602.25 unixbench.time.percent_of_cpu_this_job_got
> 1830 -9.7% 1653 unixbench.time.system_time
> 395.13 -5.2% 374.58 unixbench.time.user_time

When I read "system_time" and "user_time" I'm expecting it to be a
"the lower the better" metric.
Thus, if that's the case, by just looking at these last two metrics:
doesn't that means that the test is actually completing faster with
util_est enabled?

> 8611715 -58.9% 3537314 ? 3% unixbench.time.voluntary_context_switches
> 6639375 -9.1% 6033775 unixbench.workload

What the above metric represents?

> 26025 +3849.3% 1027825 interrupts.CAL:Function_call_interrupts

What this metric represents?

This is a big variation which I cannot easily see as related to
util_est overheads... unless these are interrupts related to tasks
migrations, which potentially can affect LB and WKP code paths.

> 4856 ? 14% -27.4% 3523 ? 11% slabinfo.filp.active_objs
> 3534356 -8.8% 3223918 softirqs.RCU
> 77929 -11.2% 69172 vmstat.system.cs
> 19489 ? 2% +7.5% 20956 vmstat.system.in
> 9.05 ? 9% +11.0% 10.05 ? 8% boot-time.dhcp
> 131.63 ? 4% +8.6% 142.89 ? 7% boot-time.idle
> 9.07 ? 9% +11.0% 10.07 ? 8% boot-time.kernel_boot
> 76288 ? 3% -12.8% 66560 ? 3% meminfo.DirectMap4k
> 16606 -13.1% 14433 meminfo.Inactive
> 16515 -13.2% 14341 meminfo.Inactive(anon)
> 11.87 ? 5% +7.8 19.63 ? 4% mpstat.cpu.idle%
> 0.07 ? 35% -0.0 0.04 ? 17% mpstat.cpu.soft%
> 68.91 -6.1 62.82 mpstat.cpu.sys%

The following:

> 29291570 +325.4% 1.246e+08 cpuidle.C1.time
> 8629105 -36.1% 5513780 cpuidle.C1.usage
> 668733 ? 12% +11215.3% 75668902 ? 2% cpuidle.C1E.time
> 9763 ? 12% +16572.7% 1627882 ? 2% cpuidle.C1E.usage
> 1.834e+08 ? 9% +23.1% 2.258e+08 ? 11% cpuidle.C3.time
> 222674 ? 8% +133.4% 519690 ? 6% cpuidle.C3.usage

are other really big variations which metrics represents, AFAIU, the
following IDLE states:
- C1 : Core doesn't work, clocks stopped
- C1E : Core doesn't work, clocks stopped and voltage lowered
- C3 : L1 and L2 copied to L3, all core clocks stopped

Thus, it seems to me that, with the patches in, we are more likely to
sleep longer. Which suggests a more aggressive race-to-idle policy:
running at higher frequencies to complete faster and sleep
longer and deeper.

This would kind-of match with the idea of completing faster, but I'm
not completely sure... however, if that should be the case, again I
would say that's a benefit, not a regression.

> 4129 -13.3% 3581 proc-vmstat.nr_inactive_anon
> 4129 -13.3% 3581 proc-vmstat.nr_zone_inactive_anon
> 2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_hit
> 2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_local
> 6625 -10.9% 5905 proc-vmstat.pgactivate
> 2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgalloc_normal
> 2.936e+08 -12.6% 2.566e+08 proc-vmstat.pgfault
> 2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgfree

All the following metrics:

> 2850 -15.3% 2413 turbostat.Avg_MHz
> 8629013 -36.1% 5513569 turbostat.C1
> 1.09 +3.5 4.61 turbostat.C1%
> 9751 ? 12% +16593.0% 1627864 ? 2% turbostat.C1E
> 0.03 ? 19% +2.8 2.80 turbostat.C1E%
> 222574 ? 8% +133.4% 519558 ? 6% turbostat.C3
> 6.84 ? 8% +1.5 8.34 ? 10% turbostat.C3%
> 2.82 ? 7% +250.3% 9.87 ? 2% turbostat.CPU%c1
> 6552773 ? 3% +23.8% 8111699 ? 2% turbostat.IRQ
> 2.02 ? 11% +28.3% 2.58 ? 9% turbostat.Pkg%pc3

maybe they can help to understand better what's going on with respect
to the race-to-idle theory...

> 7.635e+11 -12.5% 6.682e+11 perf-stat.branch-instructions
> 3.881e+10 -12.9% 3.381e+10 perf-stat.branch-misses
> 2.09 -0.3 1.77 ? 4% perf-stat.cache-miss-rate%
> 1.551e+09 -15.1% 1.316e+09 ? 4% perf-stat.cache-misses
> 26177920 -10.5% 23428188 perf-stat.context-switches
> 1.99 -2.8% 1.93 perf-stat.cpi
> 7.553e+12 -14.7% 6.446e+12 perf-stat.cpu-cycles

This:

> 522523 ? 2% +628.3% 3805664 perf-stat.cpu-migrations

is another good point. With util_est we are affecting LB and WKP
paths... but, give the specific execl test, not entirely sure how we
can affect migrations using util_est.

Have to check better these two points:
- do we reset PELT after an execl?
- do we trigger a possible task migration after and execl

> 2.425e+10 ? 4% -14.3% 2.078e+10 perf-stat.dTLB-load-misses
> 1.487e+12 -11.3% 1.319e+12 perf-stat.dTLB-loads
> 1.156e+10 ? 3% -7.7% 1.066e+10 perf-stat.dTLB-store-misses
> 6.657e+11 -11.1% 5.915e+11 perf-stat.dTLB-stores
> 0.15 +0.0 0.15 perf-stat.iTLB-load-miss-rate%
> 5.807e+09 -11.0% 5.166e+09 perf-stat.iTLB-load-misses
> 3.799e+12 -12.1% 3.34e+12 perf-stat.iTLB-loads
> 3.803e+12 -12.2% 3.338e+12 perf-stat.instructions
> 654.99 -1.4% 646.07 perf-stat.instructions-per-iTLB-miss
> 0.50 +2.8% 0.52 perf-stat.ipc
> 2.754e+08 -11.6% 2.435e+08 perf-stat.minor-faults
> 1.198e+08 ? 7% +73.1% 2.074e+08 ? 4% perf-stat.node-stores
> 2.754e+08 -11.6% 2.435e+08 perf-stat.page-faults
> 572928 -3.4% 553258 perf-stat.path-length
>
>
>
> unixbench.score
>
> 4800 +-+------------------------------------------------------------------+
> |+ + + |
> 4700 +-+ + + :+ +. :+ + + |
> | + + + +. : + + + + + + + .+++++ .+ +|
> 4600 +-+ +++ :+++ + ++: : :+ +++ ++.++++ + ++++ ++ |
> | + + + ++ ++ + |
> 4500 +-+ |
> | |
> 4400 +-+ |
> | |
> 4300 +-+ |
> O |
> 4200 +-O O O OOOO OO OOO OOOO OOOO O O |
> |O OO OOOOO O O OO O O O O O OO |
> 4100 +-+------------------------------------------------------------------+
>
>
> unixbench.workload
>
> 9e+06 +-+---------------------------------------------------------------+
> | : |
> 8.5e+06 +-+ : |
> | : |
> 8e+06 +-+ : |
> | :: |
> 7.5e+06 +-+ : : + |
> | +: : : + |
> 7e+06 +-+ + + :: : :: + + : + + + + + |
> |:+ + + : :: : : :: : :+ : : ::+ :+ .+ :+ ++ ++ + ++ ::++|
> 6.5e+06 +-O+ +++ ++++ +++ + ++ +.+ + ++ + + + + + + + +.+++ + |
> O O O O O O O |
> 6e+06 +O+OOO O OOOOOOOO OOOO OO OOOOOOOOO O O O OO |
> | O |
> 5.5e+06 +-+---------------------------------------------------------------+
>
>
>
> [*] bisect-good sample
> [O] bisect-bad sample
>
>
>
> Disclaimer:
> Results have been estimated based on internal Intel analysis and are provided
> for informational purposes only. Any difference in system hardware or software
> design or configuration may affect actual performance.
>
>
> Thanks,
> Xiaolong

[...]

> #
> # CPU Frequency scaling
> #
> CONFIG_CPU_FREQ=y
> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
> CONFIG_CPU_FREQ_GOV_COMMON=y
> # CONFIG_CPU_FREQ_STAT is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set

The governor in use is not schedutil... thus util_est could effect the
test just because of signals tracking overheads, of because of the way
we affect tasks placement in WK and LB paths... which can be
correlated to the impact on task migrations and preemption...

> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
> CONFIG_CPU_FREQ_GOV_POWERSAVE=y
> CONFIG_CPU_FREQ_GOV_USERSPACE=y
> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>
> #
> # CPU frequency scaling drivers
> #
> CONFIG_X86_INTEL_PSTATE=y
> CONFIG_X86_PCC_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ=m
> CONFIG_X86_ACPI_CPUFREQ_CPB=y
> CONFIG_X86_POWERNOW_K8=m
> CONFIG_X86_AMD_FREQ_SENSITIVITY=m
> # CONFIG_X86_SPEEDSTEP_CENTRINO is not set
> CONFIG_X86_P4_CLOCKMOD=m
>

--
#include <best/regards.h>

Patrick Bellasi

2018-04-04 06:18:06

by kernel test robot

[permalink] [raw]

Subject: Re: [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

On 04/03, Patrick Bellasi wrote:
>Hi Xiaolong,
>
>On 02-Apr 11:20, kernel test robot wrote:
>>
>> Greeting,
>>
>> FYI, we noticed a -9.9% regression of unixbench.score due to commit:
>
>thanks for the report, I'll try to reproduce it locally to better
>understand what's going on.

Thanks for your attention and feedback.

>
>Meanwhile, I'm a little puzzled about some of the following
>numbers... likely looking at the code it should be more clear.
>But, maybe someone already knows the response.
>
>> commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
>> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
>To give a bit of context, this patch is the last of a small series
>introducing util_est:
> https://lkml.org/lkml/2018/3/9/245
>where we actually enable it by turning of by default the corresponding SCHED_FEAT.
>Thus, all the effect of util_est are visible just after this very last patch.
>
>I'm not surprised that, if there are issues related to util_est, they
>manifest at this stage.
>
>> in testcase: unixbench
>> on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
>> with following parameters:
>>
>> runtime: 300s
>> nr_task: 100%
>> test: execl
>>
>> test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
>> test-url: https://github.com/kdlucas/byte-unixbench
>
>AFAIU, this benchmark is composed of 12 different test cases:
> https://github.com/intel/lkp-tests/blob/master/jobs/unixbench.yaml
>and it's reporting a regression for "only" 1 of those tests (execl).
>
>Is that correct?

Yes, this regression is only regarding to the execl test in unixbench.

>
>> Details are as below:
>> -------------------------------------------------------------------------------------------------->
>>
>>
>> To reproduce:
>>
>> git clone https://github.com/intel/lkp-tests.git
>> cd lkp-tests
>> bin/lkp install job.yaml # job file is attached in this email
>> bin/lkp run job.yaml
>
>Will try to give it a run in the next days.
>
>>
>> =========================================================================================
>> compiler/kconfig/nr_task/rootfs/runtime/tbox_group/test/testcase:
>> gcc-7/x86_64-rhel-7.2/100%/debian-x86_64-2016-08-31.cgz/300s/nhm-white/execl/unixbench
>>
>> commit:
>> a07630b8b2 ("sched/cpufreq/schedutil: Use util_est for OPP selection")
>> d519329f72 ("sched/fair: Update util_est only on util_avg updates")
>>
>> a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
>> ---------------- --------------------------
>> %stddev %change %stddev
>> \ | \
>> 4626 -9.9% 4167 unixbench.score
>
>Is this overall score a composition of the following scores?
>

Actually, the unixbench.score the main performance indicator of unixbench, it is
obtained directly from the raw output of unixbench as below:

------------------------------------------------------------------------
Benchmark Run: Sat Mar 31 2018 08:51:09 - 08:56:48
8 CPUs in system; running 8 parallel copies of tests

Execl Throughput 17946.9 lps (30.0 s, 7 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
Execl Throughput 43.0 17946.9 4173.7
========
System Benchmarks Index Score (Partial Only) 4173.7

>In general, it would be nice to see in the following metrics which one
>is considered "the lower the better" or "the higher the better".

It's a good suggestion, we'll work to make this metrics change more readable.

>
>> 3495362 ? 4% +70.4% 5957769 ? 2% unixbench.time.involuntary_context_switches
>
>The above seems to indicate an increase in contention which generated
>scheduler enforced context switches. But, AFAIU, the test under
>analysis generates just one single task which keep execl itself.
>I can't see how util_est can this generated an increase in context
>switches? Will investigate better.
>
>> 2.866e+08 -11.6% 2.534e+08 unixbench.time.minor_page_faults
>> 666.75 -9.7% 602.25 unixbench.time.percent_of_cpu_this_job_got
>> 1830 -9.7% 1653 unixbench.time.system_time
>> 395.13 -5.2% 374.58 unixbench.time.user_time
>
>When I read "system_time" and "user_time" I'm expecting it to be a
>"the lower the better" metric.
>Thus, if that's the case, by just looking at these last two metrics:
>doesn't that means that the test is actually completing faster with
>util_est enabled?

It seems so.

>
>> 8611715 -58.9% 3537314 ? 3% unixbench.time.voluntary_context_switches
>> 6639375 -9.1% 6033775 unixbench.workload
>
>What the above metric represents?

unixbench.time.voluntary_context_switches is obtained through time (GNU time 1.7), we use
it to record the time metrics while benchmark running, the whole output is like:

Command being timed: "/lkp/lkp/src/tests/unixbench"
User time (seconds): 377.38
System time (seconds): 1664.25
Percent of CPU this job got: 601%
Elapsed (wall clock) time (h:mm:ss or m:ss): 5:39.39
Average shared text size (kbytes): 0
Average unshared data size (kbytes): 0
Average stack size (kbytes): 0
Average total size (kbytes): 0
Maximum resident set size (kbytes): 8956
Average resident set size (kbytes): 0
Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 255131129
Voluntary context switches: 3520290
Involuntary context switches: 6137488
Swaps: 0
File system inputs: 0
File system outputs: 0
Socket messages sent: 0
Socket messages received: 0
Signals delivered: 0
Page size (bytes): 4096
Exit status: 0

unixbench.workload is defined as the total operations in unixbench for
all processes/threads.

>
>> 26025 +3849.3% 1027825 interrupts.CAL:Function_call_interrupts
>
>What this metric represents?

This metric is obtain through /proc/interrupts, we would take snapshot of this
file before and after the test, then sum up the Function call interrupts for
all cpus during the test.

Thanks,
Xiaolong
>
>This is a big variation which I cannot easily see as related to
>util_est overheads... unless these are interrupts related to tasks
>migrations, which potentially can affect LB and WKP code paths.
>
>> 4856 ? 14% -27.4% 3523 ? 11% slabinfo.filp.active_objs
>> 3534356 -8.8% 3223918 softirqs.RCU
>> 77929 -11.2% 69172 vmstat.system.cs
>> 19489 ? 2% +7.5% 20956 vmstat.system.in
>> 9.05 ? 9% +11.0% 10.05 ? 8% boot-time.dhcp
>> 131.63 ? 4% +8.6% 142.89 ? 7% boot-time.idle
>> 9.07 ? 9% +11.0% 10.07 ? 8% boot-time.kernel_boot
>> 76288 ? 3% -12.8% 66560 ? 3% meminfo.DirectMap4k
>> 16606 -13.1% 14433 meminfo.Inactive
>> 16515 -13.2% 14341 meminfo.Inactive(anon)
>> 11.87 ? 5% +7.8 19.63 ? 4% mpstat.cpu.idle%
>> 0.07 ? 35% -0.0 0.04 ? 17% mpstat.cpu.soft%
>> 68.91 -6.1 62.82 mpstat.cpu.sys%
>
>The following:
>
>> 29291570 +325.4% 1.246e+08 cpuidle.C1.time
>> 8629105 -36.1% 5513780 cpuidle.C1.usage
>> 668733 ? 12% +11215.3% 75668902 ? 2% cpuidle.C1E.time
>> 9763 ? 12% +16572.7% 1627882 ? 2% cpuidle.C1E.usage
>> 1.834e+08 ? 9% +23.1% 2.258e+08 ? 11% cpuidle.C3.time
>> 222674 ? 8% +133.4% 519690 ? 6% cpuidle.C3.usage
>
>are other really big variations which metrics represents, AFAIU, the
>following IDLE states:
>- C1 : Core doesn't work, clocks stopped
>- C1E : Core doesn't work, clocks stopped and voltage lowered
>- C3 : L1 and L2 copied to L3, all core clocks stopped
>
>Thus, it seems to me that, with the patches in, we are more likely to
>sleep longer. Which suggests a more aggressive race-to-idle policy:
>running at higher frequencies to complete faster and sleep
>longer and deeper.
>
>This would kind-of match with the idea of completing faster, but I'm
>not completely sure... however, if that should be the case, again I
>would say that's a benefit, not a regression.
>
>
>> 4129 -13.3% 3581 proc-vmstat.nr_inactive_anon
>> 4129 -13.3% 3581 proc-vmstat.nr_zone_inactive_anon
>> 2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_hit
>> 2.333e+08 -12.2% 2.049e+08 proc-vmstat.numa_local
>> 6625 -10.9% 5905 proc-vmstat.pgactivate
>> 2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgalloc_normal
>> 2.936e+08 -12.6% 2.566e+08 proc-vmstat.pgfault
>> 2.392e+08 -12.1% 2.102e+08 proc-vmstat.pgfree
>
>All the following metrics:
>
>> 2850 -15.3% 2413 turbostat.Avg_MHz
>> 8629013 -36.1% 5513569 turbostat.C1
>> 1.09 +3.5 4.61 turbostat.C1%
>> 9751 ? 12% +16593.0% 1627864 ? 2% turbostat.C1E
>> 0.03 ? 19% +2.8 2.80 turbostat.C1E%
>> 222574 ? 8% +133.4% 519558 ? 6% turbostat.C3
>> 6.84 ? 8% +1.5 8.34 ? 10% turbostat.C3%
>> 2.82 ? 7% +250.3% 9.87 ? 2% turbostat.CPU%c1
>> 6552773 ? 3% +23.8% 8111699 ? 2% turbostat.IRQ
>> 2.02 ? 11% +28.3% 2.58 ? 9% turbostat.Pkg%pc3
>
>maybe they can help to understand better what's going on with respect
>to the race-to-idle theory...
>
>
>> 7.635e+11 -12.5% 6.682e+11 perf-stat.branch-instructions
>> 3.881e+10 -12.9% 3.381e+10 perf-stat.branch-misses
>> 2.09 -0.3 1.77 ? 4% perf-stat.cache-miss-rate%
>> 1.551e+09 -15.1% 1.316e+09 ? 4% perf-stat.cache-misses
>> 26177920 -10.5% 23428188 perf-stat.context-switches
>> 1.99 -2.8% 1.93 perf-stat.cpi
>> 7.553e+12 -14.7% 6.446e+12 perf-stat.cpu-cycles
>
>This:
>
>> 522523 ? 2% +628.3% 3805664 perf-stat.cpu-migrations
>
>is another good point. With util_est we are affecting LB and WKP
>paths... but, give the specific execl test, not entirely sure how we
>can affect migrations using util_est.
>
>Have to check better these two points:
>- do we reset PELT after an execl?
>- do we trigger a possible task migration after and execl
>
>
>> 2.425e+10 ? 4% -14.3% 2.078e+10 perf-stat.dTLB-load-misses
>> 1.487e+12 -11.3% 1.319e+12 perf-stat.dTLB-loads
>> 1.156e+10 ? 3% -7.7% 1.066e+10 perf-stat.dTLB-store-misses
>> 6.657e+11 -11.1% 5.915e+11 perf-stat.dTLB-stores
>> 0.15 +0.0 0.15 perf-stat.iTLB-load-miss-rate%
>> 5.807e+09 -11.0% 5.166e+09 perf-stat.iTLB-load-misses
>> 3.799e+12 -12.1% 3.34e+12 perf-stat.iTLB-loads
>> 3.803e+12 -12.2% 3.338e+12 perf-stat.instructions
>> 654.99 -1.4% 646.07 perf-stat.instructions-per-iTLB-miss
>> 0.50 +2.8% 0.52 perf-stat.ipc
>> 2.754e+08 -11.6% 2.435e+08 perf-stat.minor-faults
>> 1.198e+08 ? 7% +73.1% 2.074e+08 ? 4% perf-stat.node-stores
>> 2.754e+08 -11.6% 2.435e+08 perf-stat.page-faults
>> 572928 -3.4% 553258 perf-stat.path-length
>>
>>
>>
>> unixbench.score
>>
>> 4800 +-+------------------------------------------------------------------+
>> |+ + + |
>> 4700 +-+ + + :+ +. :+ + + |
>> | + + + +. : + + + + + + + .+++++ .+ +|
>> 4600 +-+ +++ :+++ + ++: : :+ +++ ++.++++ + ++++ ++ |
>> | + + + ++ ++ + |
>> 4500 +-+ |
>> | |
>> 4400 +-+ |
>> | |
>> 4300 +-+ |
>> O |
>> 4200 +-O O O OOOO OO OOO OOOO OOOO O O |
>> |O OO OOOOO O O OO O O O O O OO |
>> 4100 +-+------------------------------------------------------------------+
>>
>>
>> unixbench.workload
>>
>> 9e+06 +-+---------------------------------------------------------------+
>> | : |
>> 8.5e+06 +-+ : |
>> | : |
>> 8e+06 +-+ : |
>> | :: |
>> 7.5e+06 +-+ : : + |
>> | +: : : + |
>> 7e+06 +-+ + + :: : :: + + : + + + + + |
>> |:+ + + : :: : : :: : :+ : : ::+ :+ .+ :+ ++ ++ + ++ ::++|
>> 6.5e+06 +-O+ +++ ++++ +++ + ++ +.+ + ++ + + + + + + + +.+++ + |
>> O O O O O O O |
>> 6e+06 +O+OOO O OOOOOOOO OOOO OO OOOOOOOOO O O O OO |
>> | O |
>> 5.5e+06 +-+---------------------------------------------------------------+
>>
>>
>>
>> [*] bisect-good sample
>> [O] bisect-bad sample
>>
>>
>>
>> Disclaimer:
>> Results have been estimated based on internal Intel analysis and are provided
>> for informational purposes only. Any difference in system hardware or software
>> design or configuration may affect actual performance.
>>
>>
>> Thanks,
>> Xiaolong
>
>[...]
>
>> #
>> # CPU Frequency scaling
>> #
>> CONFIG_CPU_FREQ=y
>> CONFIG_CPU_FREQ_GOV_ATTR_SET=y
>> CONFIG_CPU_FREQ_GOV_COMMON=y
>> # CONFIG_CPU_FREQ_STAT is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_POWERSAVE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_USERSPACE is not set
>
>The governor in use is not schedutil... thus util_est could effect the
>test just because of signals tracking overheads, of because of the way
>we affect tasks placement in WK and LB paths... which can be
>correlated to the impact on task migrations and preemption...
>
>> CONFIG_CPU_FREQ_DEFAULT_GOV_ONDEMAND=y
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_CONSERVATIVE is not set
>> # CONFIG_CPU_FREQ_DEFAULT_GOV_SCHEDUTIL is not set
>> CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
>> CONFIG_CPU_FREQ_GOV_POWERSAVE=y
>> CONFIG_CPU_FREQ_GOV_USERSPACE=y
>> CONFIG_CPU_FREQ_GOV_ONDEMAND=y
>> CONFIG_CPU_FREQ_GOV_CONSERVATIVE=y
>> # CONFIG_CPU_FREQ_GOV_SCHEDUTIL is not set
>>
>> #
>> # CPU frequency scaling drivers
>> #
>> CONFIG_X86_INTEL_PSTATE=y
>> CONFIG_X86_PCC_CPUFREQ=m
>> CONFIG_X86_ACPI_CPUFREQ=m
>> CONFIG_X86_ACPI_CPUFREQ_CPB=y
>> CONFIG_X86_POWERNOW_K8=m
>> CONFIG_X86_AMD_FREQ_SENSITIVITY=m
>> # CONFIG_X86_SPEEDSTEP_CENTRINO is not set
>> CONFIG_X86_P4_CLOCKMOD=m
>>
>
>--
>#include <best/regards.h>
>
>Patrick Bellasi

2018-10-24 06:43:19

by Aaron Lu

[permalink] [raw]

Subject: Re: [LKP] [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

On Mon, Apr 02, 2018 at 11:20:00AM +0800, Ye, Xiaolong wrote:
>
> Greeting,
>
> FYI, we noticed a -9.9% regression of unixbench.score due to commit:
>
>
> commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: unixbench
> on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
> with following parameters:
>
> runtime: 300s
> nr_task: 100%
> test: execl
>
> test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> test-url: https://github.com/kdlucas/byte-unixbench

I tested this workload on different machines with this commit
d519329f72a6f36bc4f2b85452 and its parent a07630b8b2c16f82, I also
tested with v4.19-rc8 to see if the regression is gone -
the performance drop is there with v4.19-rc8 and with different
machines so I assume this regression is not solved yet.
Here are detailed data:

cmdline used to run this workload:
./Run execl -c $nr_cpu -i 30

lkp-ivb-d04(Ivybridge desktop, 4 CPUs)
a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
---------------- --------------------------
%stddev %change %stddev
\ | \
2924 -8.7% 2669 unixbench.score

a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 2924.3333333333335,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 2669.5333333333333,
v4.19-rc8/avg.json: "unixbench.score": 2611.733333333333,

lkp-hsw-d01(Haswell desktop, 8 CPUs)
a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
---------------- --------------------------
%stddev %change %stddev
\ | \
7090 -9.4% 6421 unixbench.score

a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 7090.400000000001,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 6421.233333333333,
v4.19-rc8/avg.json: "unixbench.score": 6431.933333333333,

lkp-skl-2sp2(2 sockets Skylake, 80 CPUs)
a07630b8b2c16f82 d519329f72a6f36bc4f2b85452
---------------- --------------------------
%stddev %change %stddev
\ | \
9657 ? 2% -3.3% 9334 unixbench.score

a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 9657.6,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 9334.9,
v4.19-rc8/avg.json: "unixbench.score": 9278.333333333334,

Please let me know if you need other information, thanks.

2018-10-24 17:02:38

by Patrick Bellasi

[permalink] [raw]

Subject: Re: [LKP] [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

On 24-Oct 14:41, Aaron Lu wrote:
> On Mon, Apr 02, 2018 at 11:20:00AM +0800, Ye, Xiaolong wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -9.9% regression of unixbench.score due to commit:
> >
> >
> > commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > in testcase: unixbench
> > on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
> > with following parameters:
> >
> > runtime: 300s
> > nr_task: 100%
> > test: execl
> >
> > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > test-url: https://github.com/kdlucas/byte-unixbench

Hi Aaron,

> I tested this workload on different machines with this commit
> d519329f72a6f36bc4f2b85452 and its parent a07630b8b2c16f82, I also
> tested with v4.19-rc8 to see if the regression is gone -
> the performance drop is there with v4.19-rc8 and with different
> machines so I assume this regression is not solved yet.
>
> Here are detailed data:
>
> cmdline used to run this workload:
> ./Run execl -c $nr_cpu -i 30

I had a better look into this issue and found that something like this
could be the cure for the execl throughput regression:

---8<---
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 908c9cdae2f0..c34d41b542fc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6258,8 +6258,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
* covered by the following code when estimated utilization is
* enabled.
*/
- if (sched_feat(UTIL_EST))
- util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
+ if (sched_feat(UTIL_EST)) {
+ unsigned int estimated =
+ READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+ if (unlikely(current == p)) {
+ estimated -= min_t(unsigned int, estimated,
+ (_task_util_est(p) | UTIL_AVG_UNCHANGED));
+ }
+
+ util = max(util, estimated);
+ }

/*
* Utilization (estimated) can exceed the CPU capacity, thus let's
---8<---

I'll test this better on a machine on my side and send out a proper
patch by tomorrow.

> Please let me know if you need other information, thanks.

Would be nice if you can test the above on your side too.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi

2018-10-25 08:59:08

by Aaron Lu

[permalink] [raw]

Subject: Re: [LKP] [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

On Wed, Oct 24, 2018 at 06:01:37PM +0100, Patrick Bellasi wrote:
> On 24-Oct 14:41, Aaron Lu wrote:
> > On Mon, Apr 02, 2018 at 11:20:00AM +0800, Ye, Xiaolong wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -9.9% regression of unixbench.score due to commit:
> > >
> > >
> > > commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> > > in testcase: unixbench
> > > on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
> > > with following parameters:
> > >
> > > runtime: 300s
> > > nr_task: 100%
> > > test: execl
> > >
> > > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > > test-url: https://github.com/kdlucas/byte-unixbench
>
> Hi Aaron,
>
> > I tested this workload on different machines with this commit
> > d519329f72a6f36bc4f2b85452 and its parent a07630b8b2c16f82, I also
> > tested with v4.19-rc8 to see if the regression is gone -
> > the performance drop is there with v4.19-rc8 and with different
> > machines so I assume this regression is not solved yet.
> >
> > Here are detailed data:
> >
> > cmdline used to run this workload:
> > ./Run execl -c $nr_cpu -i 30
>
> I had a better look into this issue and found that something like this
> could be the cure for the execl throughput regression:

Good news, yes they are!

> ---8<---
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 908c9cdae2f0..c34d41b542fc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6258,8 +6258,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> * covered by the following code when estimated utilization is
> * enabled.
> */
> - if (sched_feat(UTIL_EST))
> - util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
> + if (sched_feat(UTIL_EST)) {
> + unsigned int estimated =
> + READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +
> + if (unlikely(current == p)) {
> + estimated -= min_t(unsigned int, estimated,
> + (_task_util_est(p) | UTIL_AVG_UNCHANGED));
> + }
> +
> + util = max(util, estimated);
> + }
>
> /*
> * Utilization (estimated) can exceed the CPU capacity, thus let's
> ---8<---
>
> I'll test this better on a machine on my side and send out a proper
> patch by tomorrow.
>
> > Please let me know if you need other information, thanks.
>
> Would be nice if you can test the above on your side too.
>

commit cbcb74a95c5af32f9127a102feca323139ba2c49 is the commit I made
from your diff and it restored performance for the two desktops. the
result on the skylake server isn't quite stable so I think the
performance gap is due to noise.

lkp-ivb-d04:
cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 2946.0,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 2669.5333333333333,
a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 2924.3333333333335,

lkp-hsw-d01:
cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 7013.533333333333,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 6421.233333333333,
a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 7090.400000000001,

lkp-skl-2sp2:
cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 9347.02,
d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 9362.76,
a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 9520.86,

2018-10-25 09:31:50

by Patrick Bellasi

[permalink] [raw]

Subject: Re: [LKP] [lkp-robot] [sched/fair] d519329f72: unixbench.score -9.9% regression

On 25-Oct 16:56, Aaron Lu wrote:
> On Wed, Oct 24, 2018 at 06:01:37PM +0100, Patrick Bellasi wrote:
> > On 24-Oct 14:41, Aaron Lu wrote:
> > > On Mon, Apr 02, 2018 at 11:20:00AM +0800, Ye, Xiaolong wrote:
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -9.9% regression of unixbench.score due to commit:
> > > >
> > > >
> > > > commit: d519329f72a6f36bc4f2b85452640cfe583b4f81 ("sched/fair: Update util_est only on util_avg updates")
> > > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > > >
> > > > in testcase: unixbench
> > > > on test machine: 8 threads Intel(R) Core(TM) i7 CPU 870 @ 2.93GHz with 6G memory
> > > > with following parameters:
> > > >
> > > > runtime: 300s
> > > > nr_task: 100%
> > > > test: execl
> > > >
> > > > test-description: UnixBench is the original BYTE UNIX benchmark suite aims to test performance of Unix-like system.
> > > > test-url: https://github.com/kdlucas/byte-unixbench
> >
> > Hi Aaron,
> >
> > > I tested this workload on different machines with this commit
> > > d519329f72a6f36bc4f2b85452 and its parent a07630b8b2c16f82, I also
> > > tested with v4.19-rc8 to see if the regression is gone -
> > > the performance drop is there with v4.19-rc8 and with different
> > > machines so I assume this regression is not solved yet.
> > >
> > > Here are detailed data:
> > >
> > > cmdline used to run this workload:
> > > ./Run execl -c $nr_cpu -i 30
> >
> > I had a better look into this issue and found that something like this
> > could be the cure for the execl throughput regression:
>
> Good news, yes they are!

Great.

> > ---8<---
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 908c9cdae2f0..c34d41b542fc 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6258,8 +6258,17 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
> > * covered by the following code when estimated utilization is
> > * enabled.
> > */
> > - if (sched_feat(UTIL_EST))
> > - util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));
> > + if (sched_feat(UTIL_EST)) {
> > + unsigned int estimated =
> > + READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +
> > + if (unlikely(current == p)) {
> > + estimated -= min_t(unsigned int, estimated,
> > + (_task_util_est(p) | UTIL_AVG_UNCHANGED));
> > + }
> > +
> > + util = max(util, estimated);
> > + }
> >
> > /*
> > * Utilization (estimated) can exceed the CPU capacity, thus let's
> > ---8<---
> >
> > I'll test this better on a machine on my side and send out a proper
> > patch by tomorrow.
> >
> > > Please let me know if you need other information, thanks.
> >
> > Would be nice if you can test the above on your side too.
> >
>
> commit cbcb74a95c5af32f9127a102feca323139ba2c49 is the commit I made
> from your diff and it restored performance for the two desktops. the
> result on the skylake server isn't quite stable so I think the
> performance gap is due to noise.
>
> lkp-ivb-d04:
> cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 2946.0,
> d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 2669.5333333333333,
> a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 2924.3333333333335,
>
> lkp-hsw-d01:
> cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 7013.533333333333,
> d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 6421.233333333333,
> a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 7090.400000000001,
>
> lkp-skl-2sp2:
> cbcb74a95c5af32f9127a102feca323139ba2c49/avg.json: "unixbench.score": 9347.02,
> d519329f72a6f36bc4f2b85452640cfe583b4f81/avg.json: "unixbench.score": 9362.76,
> a07630b8b2c16f82fd5b71d890079f4dd7599c1d/avg.json: "unixbench.score": 9520.86,

I've measured a ~15% speedup with the patch in wrt v4.19.0 on a
40 CPUs Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz.

################################################################################
### Mainline (v4.19.0):

Benchmark Run: Wed Oct 24 2018 16:13:34 - 16:19:04
40 CPUs in system; running 40 parallel copies of tests

Execl Throughput 48136.5 lps (29.9 s, 7 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
Execl Throughput 43.0 48136.5 11194.5
========
System Benchmarks Index Score (Partial Only) 11194.5

################################################################################
### Mainline (v4.19.0) + patch:

Benchmark Run: Wed Oct 24 2018 16:29:56 - 16:35:26
40 CPUs in system; running 40 parallel copies of tests

Execl Throughput 55373.6 lps (29.9 s, 7 samples)

System Benchmarks Partial Index BASELINE RESULT INDEX
Execl Throughput 43.0 55373.6 12877.6
========
System Benchmarks Index Score (Partial Only) 12877.6

################################################################################

Thanks again reporting and testing on your side, I'll post soon a proper
patch.

Cheers Patrick

--
#include <best/regards.h>

Patrick Bellasi