2020-11-17 02:45:31

by kernel test robot

Subject: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression


Greeting,

FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:


commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master


in testcase: phoronix-test-suite
on test machine: 96 threads Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 192G memory
with following parameters:

test: npb-1.3.1
option_a: FT.A
cpufreq_governor: performance
ucode: 0x5002f01

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp run job.yaml

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/rootfs/tbox_group/test/testcase/ucode:
gcc-9/performance/x86_64-rhel-8.3/FT.A/debian-x86_64-phoronix/lkp-csl-2sp8/npb-1.3.1/phoronix-test-suite/0x5002f01

commit:
3faa52c03f ("mm/gup: track FOLL_PIN pages")
47e29d32af ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")

3faa52c03f440d1b 47e29d32afba11b13efb51f0315
---------------- ---------------------------
       fail:runs  %reproduction    fail:runs
           |             |             |
          1:4          -25%            :4     kmsg.Spurious_LAPIC_timer_interrupt_on_cpu
         %stddev     %change         %stddev
             \          |                \
      4585 ±  2%     -45.0%       2522        phoronix-test-suite.npb.FT.A.total_mop_s
      1223 ±  4%     +40.2%       1714        phoronix-test-suite.time.percent_of_cpu_this_job_got
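
For readers who want to check the %change column, the figures can be recomputed from the two means in the comparison table above. This is a minimal sketch, not part of the lkp tooling; the helper name `percent_change` is an assumption made for illustration.

```python
def percent_change(base: float, new: float) -> float:
    """Relative change of `new` versus `base`, in percent."""
    return (new - base) / base * 100.0


# Rounded means copied from the comparison table (parent commit vs. 47e29d32af).
mop_s = percent_change(4585, 2522)   # phoronix-test-suite.npb.FT.A.total_mop_s
cpu = percent_change(1223, 1714)     # time.percent_of_cpu_this_job_got

print(f"total_mop_s:                 {mop_s:+.1f}%")
print(f"percent_of_cpu_this_job_got: {cpu:+.1f}%")
```

Because the table prints rounded means, a recomputed value can differ from the reported %change in the last digit.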



phoronix-test-suite.npb.FT.A.total_mop_s

6500 +--------------------------------------------------------------------+
| .+. .+. .+. |
6000 |.+ +.+.+.++.+.+.+.+.+.+.+ +.+.++ +.+.+.+.+.+.+.+.+.++.+ |
5500 |-+ : |
| : |
5000 |-+ : |
4500 |-+ +.+.+.|
| |
4000 |-+ |
3500 |-+ |
| |
3000 |-+ |
2500 |-+ O O O |
| O O O O O OO O O O O O O O O O O |
2000 +--------------------------------------------------------------------+


[*] bisect-good sample
[O] bisect-bad sample



Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


Thanks,
Oliver Sang


Attachments:
(No filename) (4.09 kB)
config-5.6.0-05655-g47e29d32afba11 (158.87 kB)
job-script (7.35 kB)
job.yaml (5.00 kB)
reproduce (305.00 B)

2020-11-17 03:38:13

by John Hubbard

Subject: Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression


On 11/16/20 6:48 PM, kernel test robot wrote:
>
> Greeting,
>
> FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
>

That's a huge slowdown...

>
> commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master

...but that commit happened in April, 2020. Surely if this were a serious issue we
would have some other indication...is this worth following up on?? I'm inclined to
ignore it, honestly.

thanks,
--
John Hubbard
NVIDIA

2020-11-18 13:52:30

by Jan Kara

Subject: Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

On Mon 16-11-20 19:35:31, John Hubbard wrote:
>
> On 11/16/20 6:48 PM, kernel test robot wrote:
> >
> > Greeting,
> >
> > FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> >
>
> That's a huge slowdown...
>
> >
> > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
> > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>
> ...but that commit happened in April, 2020. Surely if this were a serious
> issue we would have some other indication...is this worth following up
> on?? I'm inclined to ignore it, honestly.

Why this was detected so late is a fair question, although it doesn't quite
invalidate the report... The NPB benchmark appears to be a supercomputing
benchmark, so conceivably it could be heavily using THPs. The question is
why it would be a heavy user of pinning as well, but even that is imaginable
considering that MPI is in use etc.

So maybe it is worth trying to reproduce this because heavy THP + pinning
users might be indeed rare and only those would show regressions in THP
pinning performance...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
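
One way to probe the THP + pinning hypothesis above on a test machine is to look at the THP mode, the AnonHugePages figure in /proc/meminfo, and the nr_foll_pin_acquired / nr_foll_pin_released counters that the kernel's pin_user_pages documentation describes in /proc/vmstat. The following is a minimal sketch under those assumptions; the function names are hypothetical, and it only reads counters, it does not prove where the slowdown comes from.

```python
def active_thp_mode(enabled_text: str) -> str:
    """Return the bracketed (active) mode, e.g. 'madvise' from 'always [madvise] never'."""
    for token in enabled_text.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    raise ValueError("no bracketed THP mode found")


def read_counters(text: str, wanted: set) -> dict:
    """Pick selected 'name value' or 'Name: value kB' lines out of procfs text."""
    found = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].rstrip(":") in wanted:
            found[parts[0].rstrip(":")] = int(parts[1])
    return found


def snapshot() -> dict:
    """Read the live values; run before and after the benchmark and compare."""
    with open("/sys/kernel/mm/transparent_hugepage/enabled") as f:
        result = {"thp_mode": active_thp_mode(f.read())}
    with open("/proc/meminfo") as f:
        result.update(read_counters(f.read(), {"AnonHugePages"}))
    with open("/proc/vmstat") as f:
        # Assumed to exist only on kernels that include the FOLL_PIN tracking commit.
        result.update(read_counters(
            f.read(), {"nr_foll_pin_acquired", "nr_foll_pin_released"}))
    return result
```

Calling `snapshot()` before and after a benchmark run and diffing the two dictionaries would show whether the pin counters move at all while huge pages are in use.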

2020-11-18 18:21:15

by Dan Williams

Subject: Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

On Wed, Nov 18, 2020 at 5:51 AM Jan Kara <[email protected]> wrote:
>
> On Mon 16-11-20 19:35:31, John Hubbard wrote:
> >
> > On 11/16/20 6:48 PM, kernel test robot wrote:
> > >
> > > Greeting,
> > >
> > > FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > >
> >
> > That's a huge slowdown...
> >
> > >
> > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> >
> > ...but that commit happened in April, 2020. Surely if this were a serious
> > issue we would have some other indication...is this worth following up
> > on?? I'm inclined to ignore it, honestly.
>
> Why this was detected so late is a fair question although it doesn't quite
> invalidate the report...

I don't know what specifically happened in this case, perhaps someone
from the lkp team can comment? However, the myth / contention that
"surely someone else would have noticed by now" is why the lkp project
was launched. Kernels regressed without much complaint, and it wasn't
until much later in the process, around the time enterprise distros
rebased to new kernels, that end users started filing performance
regression reports. Given -stable kernel releases, 6-7 months is still
faster than many end user upgrade cycles to new kernel baselines.

2020-11-18 19:37:15

by John Hubbard

Subject: Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

On 11/18/20 10:17 AM, Dan Williams wrote:
> On Wed, Nov 18, 2020 at 5:51 AM Jan Kara <[email protected]> wrote:
>>
>> On Mon 16-11-20 19:35:31, John Hubbard wrote:
>>>
>>> On 11/16/20 6:48 PM, kernel test robot wrote:
>>>>
>>>> Greeting,
>>>>
>>>> FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
>>>>
>>>
>>> That's a huge slowdown...
>>>
>>>>
>>>> commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>
>>> ...but that commit happened in April, 2020. Surely if this were a serious
>>> issue we would have some other indication...is this worth following up
>>> on?? I'm inclined to ignore it, honestly.
>>
>> Why this was detected so late is a fair question although it doesn't quite
>> invalidate the report...
>
> I don't know what specifically happened in this case, perhaps someone
> from the lkp team can comment? However, the myth / contention that
> "surely someone else would have noticed by now" is why the lkp project
> was launched. Kernels regressed without much complaint and it wasn't
> until much later in the process, around the time enterprise distros
> rebased to new kernels, did end users start filing performance loss
> regression reports. Given -stable kernel releases, 6-7 months is still
> faster than many end user upgrade cycles to new kernel baselines.
>

I see, thanks for explaining. I'll take a peek, then.

thanks,
--
John Hubbard
NVIDIA

2020-11-20 02:30:36

by kernel test robot

Subject: Re: [mm/gup] 47e29d32af: phoronix-test-suite.npb.FT.A.total_mop_s -45.0% regression

On Wed, Nov 18, 2020 at 10:17:27AM -0800, Dan Williams wrote:
> On Wed, Nov 18, 2020 at 5:51 AM Jan Kara <[email protected]> wrote:
> >
> > On Mon 16-11-20 19:35:31, John Hubbard wrote:
> > >
> > > On 11/16/20 6:48 PM, kernel test robot wrote:
> > > >
> > > > Greeting,
> > > >
> > > > FYI, we noticed a -45.0% regression of phoronix-test-suite.npb.FT.A.total_mop_s due to commit:
> > > >
> > >
> > > That's a huge slowdown...
> > >
> > > >
> > > > commit: 47e29d32afba11b13efb51f03154a8cf22fb4360 ("mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages")
> > > > https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
> > >
> > > ...but that commit happened in April, 2020. Surely if this were a serious
> > > issue we would have some other indication...is this worth following up
> > > on?? I'm inclined to ignore it, honestly.
> >
> > Why this was detected so late is a fair question although it doesn't quite
> > invalidate the report...
>
> I don't know what specifically happened in this case, perhaps someone
> from the lkp team can comment?

- some extra phoronix test suites are being enabled/fixed gradually, so
coverage improves over time
- we scan the kernel releases from within the year to baseline performance;
this may trigger a bisection if a release has regressed and not recovered.

With this continuous effort, 0-day ci can detect the changes on mainline.
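
The "regressed and not recovered" check described above can be pictured roughly as follows. This is an illustrative sketch only, not the actual 0-day CI logic: the function name, the 10% threshold, and the release labels and scores are all hypothetical.

```python
def first_unrecovered_regression(scores, threshold=0.10):
    """Find the first release that regressed and never recovered.

    `scores` is a list of (release, score) pairs, oldest first, where a
    higher score is better. A release counts as regressed when its score
    falls more than `threshold` below the previous release, and as
    unrecovered when no later release climbs back above that floor.
    Returns the release label, or None if nothing qualifies.
    """
    for i in range(1, len(scores)):
        floor = scores[i - 1][1] * (1.0 - threshold)
        if all(score < floor for _, score in scores[i:]):
            return scores[i][0]
    return None


# Hypothetical per-release benchmark scores (not measured data).
history = [("rel-A", 100.0), ("rel-B", 98.0), ("rel-C", 60.0), ("rel-D", 62.0)]
print(first_unrecovered_regression(history))
```

In such a scheme the returned release and its predecessor would bound the commit range handed to a bisection.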

> However, the myth / contention that
> "surely someone else would have noticed by now" is why the lkp project
> was launched. Kernels regressed without much complaint and it wasn't
> until much later in the process, around the time enterprise distros
> rebased to new kernels, did end users start filing performance loss
> regression reports. Given -stable kernel releases, 6-7 months is still
> faster than many end user upgrade cycles to new kernel baselines.