2021-10-18 12:45:11

by Borislav Petkov

Subject: Re: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Mon, Oct 18, 2021 at 03:43:46PM +0800, JY Ni wrote:
> Precondition: do tests on an Intel CPX server. CPU information of my
> test machine is in backup part.

My machine:

processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
stepping : 4

That's a SKYLAKE_X.
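
To check a model number against the model defines in arch/x86/include/asm/intel-family.h, which are written in hex, something like the sketch below may help. It assumes awk is available; cpu_model is a made-up helper name, and the input is just the cpuinfo excerpt from this mail:

```shell
#!/bin/bash
# Print family and model of the first CPU found in cpuinfo-style input,
# showing the model in hex as well (hypothetical helper, not a kernel tool).
cpu_model() {
    awk -F':' '
        /^cpu family/   { gsub(/[ \t]/, "", $2); family = $2 }
        /^model[ \t]*:/ { gsub(/[ \t]/, "", $2)
                          printf "family %d model %d (0x%02X)\n", family, $2, $2
                          exit }
    '
}

# Values copied from the excerpt above.
cpu_model <<'EOF'
processor : 0
vendor_id : GenuineIntel
cpu family : 6
model : 106
stepping : 4
EOF
# prints: family 6 model 106 (0x6A)
```

On a live x86 machine the same function can be fed from /proc/cpuinfo instead of the here-document.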

I ran

./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j96 bzImage

on -rc6, building allmodconfig each of the 10 times.

pre-build-kernel.sh is

---
#!/bin/bash

make -s clean
echo 3 > /proc/sys/vm/drop_caches
---

Results are below but to me that's all "in the noise" with around one
percent if I can trust the stddev. Which is not even close to 40%.

So basically you're wasting your time.

5.15-rc6
--------

# ./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j96 bzImage

Performance counter stats for 'make -s -j96 bzImage' (5 runs):

      3,072,392.92 msec task-clock       # 51.109 CPUs utilized  ( +- 0.05% )
         1,351,534      context-switches # 440.257 /sec          ( +- 0.99% )
           224,862      cpu-migrations   # 73.248 /sec           ( +- 1.39% )
        85,073,723      page-faults      # 27.712 K/sec          ( +- 0.01% )
 8,743,357,421,495      cycles           # 2.848 GHz             ( +- 0.06% )
 7,643,946,991,468      instructions     # 0.88 insn per cycle   ( +- 0.00% )
 1,705,128,638,240      branches         # 555.440 M/sec         ( +- 0.00% )
    37,637,576,027      branch-misses    # 2.21% of all branches ( +- 0.03% )
22,511,903,971,150      slots            # 7.333 G/sec           ( +- 0.03% )
 7,377,211,958,188      topdown-retiring # 32.5% retiring        ( +- 0.02% )
 3,145,247,374,138      topdown-bad-spec # 13.9% bad speculation ( +- 0.27% )
 8,018,664,899,041      topdown-fe-bound # 35.2% frontend bound  ( +- 0.07% )
 4,167,103,609,622      topdown-be-bound # 18.3% backend bound   ( +- 0.09% )

60.114 +- 0.112 seconds time elapsed ( +- 0.19% )



5.15-rc6 + patch
----------------

Performance counter stats for 'make -s -j96 bzImage' (5 runs):

      3,033,250.65 msec task-clock       # 51.243 CPUs utilized  ( +- 0.05% )
         1,329,033      context-switches # 438.210 /sec          ( +- 0.64% )
           225,550      cpu-migrations   # 74.369 /sec           ( +- 1.36% )
        85,080,938      page-faults      # 28.053 K/sec          ( +- 0.00% )
 8,629,663,367,477      cycles           # 2.845 GHz             ( +- 0.05% )
 7,696,237,813,803      instructions     # 0.89 insn per cycle   ( +- 0.00% )
 1,709,909,494,107      branches         # 563.793 M/sec         ( +- 0.00% )
    37,719,552,337      branch-misses    # 2.21% of all branches ( +- 0.02% )
22,214,249,023,820      slots            # 7.325 G/sec           ( +- 0.06% )
 7,412,342,725,008      topdown-retiring # 33.0% retiring        ( +- 0.01% )
 3,141,090,408,028      topdown-bad-spec # 14.1% bad speculation ( +- 0.17% )
 7,996,077,873,517      topdown-fe-bound # 35.6% frontend bound  ( +- 0.03% )
 3,862,154,886,962      topdown-be-bound # 17.3% backend bound   ( +- 0.28% )

59.193 +- 0.302 seconds time elapsed ( +- 0.51% )
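
For reference, the gap between the two kernels can be worked out directly from the elapsed times and cycle counts reported above. A minimal sketch, assuming awk is available; rel_change is a made-up helper name and not part of perf:

```shell
#!/bin/bash
# Relative change, in percent, from a baseline value ($1) to a patched value ($2).
# Positive means the patched run used less time or fewer cycles.
rel_change() {
    awk -v base="$1" -v new="$2" 'BEGIN { printf "%.2f\n", (base - new) / base * 100 }'
}

rel_change 60.114 59.193                  # elapsed seconds, prints 1.53
rel_change 8743357421495 8629663367477    # cycles, prints 1.30
```

Both figures land between one and two percent, which is the scale being compared against the claimed 40%.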

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2021-10-18 14:49:34

by Luming Yu

Subject: Re: Re: [PATCH] perf: optimize clear page in Intel specified model with movq instruction

On Mon, Oct 18, 2021 at 8:43 PM Borislav Petkov wrote:
>
> On Mon, Oct 18, 2021 at 03:43:46PM +0800, JY Ni wrote:
> > Precondition: do tests on an Intel CPX server. CPU information of my
> > test machine is in backup part.
>
> My machine:
>
> processor : 0
> vendor_id : GenuineIntel
> cpu family : 6
> model : 106
> stepping : 4
>
> That's a SKYLAKE_X.
>
> I ran
>
> ./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j96 bzImage
>
> on -rc6, building allmodconfig each of the 10 times.
>
> pre-build-kernel.sh is
>
> ---
> #!/bin/bash
>
> make -s clean
> echo 3 > /proc/sys/vm/drop_caches
> ---
>
> Results are below but to me that's all "in the noise" with around one
> percent if I can trust the stddev. Which is not even close to 40%.
>
> So basically you're wasting your time.
>
> 5.15-rc6
> --------
>
> # ./tools/perf/perf stat --repeat 5 --sync --pre=/root/bin/pre-build-kernel.sh -- make -s -j96 bzImage
>
> Performance counter stats for 'make -s -j96 bzImage' (5 runs):
>
>       3,072,392.92 msec task-clock       # 51.109 CPUs utilized  ( +- 0.05% )
>          1,351,534      context-switches # 440.257 /sec          ( +- 0.99% )
>            224,862      cpu-migrations   # 73.248 /sec           ( +- 1.39% )
>         85,073,723      page-faults      # 27.712 K/sec          ( +- 0.01% )
>  8,743,357,421,495      cycles           # 2.848 GHz             ( +- 0.06% )
>  7,643,946,991,468      instructions     # 0.88 insn per cycle   ( +- 0.00% )
>  1,705,128,638,240      branches         # 555.440 M/sec         ( +- 0.00% )
>     37,637,576,027      branch-misses    # 2.21% of all branches ( +- 0.03% )
> 22,511,903,971,150      slots            # 7.333 G/sec           ( +- 0.03% )
>  7,377,211,958,188      topdown-retiring # 32.5% retiring        ( +- 0.02% )
>  3,145,247,374,138      topdown-bad-spec # 13.9% bad speculation ( +- 0.27% )
>  8,018,664,899,041      topdown-fe-bound # 35.2% frontend bound  ( +- 0.07% )
>  4,167,103,609,622      topdown-be-bound # 18.3% backend bound   ( +- 0.09% )
>
> 60.114 +- 0.112 seconds time elapsed ( +- 0.19% )
>
>
>
> 5.15-rc6 + patch
> ----------------
>
> Performance counter stats for 'make -s -j96 bzImage' (5 runs):
>
>       3,033,250.65 msec task-clock       # 51.243 CPUs utilized  ( +- 0.05% )
>          1,329,033      context-switches # 438.210 /sec          ( +- 0.64% )
>            225,550      cpu-migrations   # 74.369 /sec           ( +- 1.36% )
>         85,080,938      page-faults      # 28.053 K/sec          ( +- 0.00% )
>  8,629,663,367,477      cycles           # 2.845 GHz             ( +- 0.05% )
>  7,696,237,813,803      instructions     # 0.89 insn per cycle   ( +- 0.00% )
>  1,709,909,494,107      branches         # 563.793 M/sec         ( +- 0.00% )
>     37,719,552,337      branch-misses    # 2.21% of all branches ( +- 0.02% )
> 22,214,249,023,820      slots            # 7.325 G/sec           ( +- 0.06% )
>  7,412,342,725,008      topdown-retiring # 33.0% retiring        ( +- 0.01% )
>  3,141,090,408,028      topdown-bad-spec # 14.1% bad speculation ( +- 0.17% )
>  7,996,077,873,517      topdown-fe-bound # 35.6% frontend bound  ( +- 0.03% )
>  3,862,154,886,962      topdown-be-bound # 17.3% backend bound   ( +- 0.28% )
>
> 59.193 +- 0.302 seconds time elapsed ( +- 0.51% )

I'm trying to reproduce the difference, and I noticed that time and perf stat
can report different elapsed times for the same job.

Also, jiayu.ni's time difference was best at 32 jobs and worst at 96 jobs.

[linux-5.15-rc6]# time make -s bzImage -j96

real 1m8.922s
user 55m25.750s
sys 7m30.666s

[linux-5.15-rc6]# make -s clean

[linux-5.15-rc6]# perf stat make -s bzImage -j96
..
61.461679693 seconds time elapsed


2756.927852000 seconds user
369.365209000 seconds sys
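
To put the two outputs on the same scale, the `time` values can be converted to plain seconds. A minimal sketch, assuming awk is available and the XmY.YYYs format that bash prints; to_seconds is a made-up helper name:

```shell
#!/bin/bash
# Convert a bash `time` value such as "55m25.750s" to seconds.
to_seconds() {
    awk -v t="$1" 'BEGIN {
        split(t, parts, "m")       # parts[1] = minutes, parts[2] = "25.750s"
        sub(/s$/, "", parts[2])    # drop the trailing "s"
        printf "%.3f\n", parts[1] * 60 + parts[2]
    }'
}

to_seconds 1m8.922s      # real, prints 68.922
to_seconds 55m25.750s    # user, prints 3325.750
to_seconds 7m30.666s     # sys,  prints 450.666
```

That gives 68.922 s real, 3325.750 s user and 450.666 s sys for the time run, against 61.462 s elapsed, 2756.928 s user and 369.365 s sys for the perf stat run.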

If the kbuild times that jiayu.ni shared are not solid enough proof for the
optimization to be accepted, we can try other clear_page-heavy workloads.

>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette