2022-03-04 13:27:58

by kernel test robot

Subject: [perf vendor events] 3f5f0df7bf: perf-sanity-tests.perf_all_metrics_test.fail



Greetings,

FYI, we noticed the following commit (built with gcc-9):

commit: 3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537 ("perf vendor events: Update metrics for Skylake")
https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master

in testcase: perf-sanity-tests
version: perf-x86_64-fb184c4af9b9-1_20220302
with the following parameters:

perf_compiler: clang
ucode: 0xec



on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 32G memory

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):




If you fix the issue, kindly add the following tag:
Reported-by: kernel test robot <[email protected]>



2022-03-02 19:01:56 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-func-3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537/tools/perf/perf test 89
89: perf all metricgroups test : Ok
2022-03-02 19:02:05 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-func-3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537/tools/perf/perf test 90
90: perf all metrics test : FAILED!
2022-03-02 19:07:00 sudo /usr/src/perf_selftests-x86_64-rhel-8.3-func-3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537/tools/perf/perf test 91
91: perf all PMU test : Ok



To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if you come across any failure that blocks the test,
# please remove the ~/.lkp and /lkp directories to run from a clean state.



---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation

Thanks,
Oliver Sang


Attachments:
(No filename) (1.95 kB)
config-5.17.0-rc3-00104-g3f5f0df7bf0f (168.17 kB)
job-script (5.52 kB)
dmesg.xz (37.39 kB)
perf-sanity-tests (41.05 kB)
job.yaml (4.70 kB)
reproduce (10.79 kB)

2022-03-04 20:09:36

by Ian Rogers

Subject: Re: [perf vendor events] 3f5f0df7bf: perf-sanity-tests.perf_all_metrics_test.fail

On Fri, Mar 4, 2022 at 12:33 AM kernel test robot <[email protected]> wrote:
>
>
>
> Greeting,
>
> FYI, we noticed the following commit (built with gcc-9):
>
> commit: 3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537 ("perf vendor events: Update metrics for Skylake")
> https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
>
> in testcase: perf-sanity-tests
> version: perf-x86_64-fb184c4af9b9-1_20220302
> with following parameters:
>
> perf_compiler: clang
> ucode: 0xec
>
>
>
> on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 32G memory
>
> caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):

Hi,

Thanks for the report! There is no information in the test output that
I can use to diagnose the issue. Could you add the -v option to perf
test so that I can see the cause, rather than just pass/fail?
At the time of filing the update I didn't have access to a Skylake
machine (just SkylakeX), but this test was run as detailed in the
commit message:
https://lore.kernel.org/lkml/[email protected]/
Knowing the test, I suspect there may be a bad event on Skylake, but I
can't confirm this because I lack the hardware and/or the test output.
The issue may also be how the test was run, such as not as root or not
in a container. There is a further issue with this test: a metric
(e.g. number of vector ops) that measures something a simple benchmark
doesn't generate counts for can fail the test, because the test checks
whether the metric is reported. For example, there may be no vector
ops within the simple benchmark.

Thanks,
Ian


2022-04-13 16:18:36

by Carel Si

Subject: Re: [LKP] Re: [perf vendor events] 3f5f0df7bf: perf-sanity-tests.perf_all_metrics_test.fail

Hi,

On Fri, Mar 04, 2022 at 10:10:53AM -0800, Ian Rogers wrote:
> On Fri, Mar 4, 2022 at 12:33 AM kernel test robot <[email protected]> wrote:
> >
> >
> >
> > Greeting,
> >
> > FYI, we noticed the following commit (built with gcc-9):
> >
> > commit: 3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537 ("perf vendor events: Update metrics for Skylake")
> > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> >
> > in testcase: perf-sanity-tests
> > version: perf-x86_64-fb184c4af9b9-1_20220302
> > with following parameters:
> >
> > perf_compiler: clang
> > ucode: 0xec
> >
> >
> >
> > on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 32G memory
> >
> > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
>
> Hi,
>
> Thanks for the report! There is no information in the test output that
> I can diagnose the issue with, could you add the -v option to perf
> test so that I can see what the cause is, rather than just pass/fail.

We added the '-v' option and found that 3f5f0df7bf fails when testing
'Branching_Overhead' [1] and 'IpArith_Scalar_SP' [2]. Details are
attached in perf-sanity-tests.xz.

[1]

Testing Branching_Overhead
Metric 'Branching_Overhead' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 459.468 usec (+- 0.265 usec)
Average num. events: 44.000 (+- 0.000)
Average time per event 10.442 usec
Average data synthesis took: 486.181 usec (+- 0.272 usec)
Average num. events: 296.000 (+- 0.000)
Average time per event 1.643 usec

Performance counter stats for 'perf bench internals synthesize':

<not counted> BR_INST_RETIRED.NEAR_CALL (0.00%)
<not counted> BR_INST_RETIRED.NEAR_TAKEN (0.00%)
<not counted> BR_INST_RETIRED.NOT_TAKEN (0.00%)
<not counted> BR_INST_RETIRED.CONDITIONAL (0.00%)
<not counted> CPU_CLK_UNHALTED.THREAD (0.00%)
9772951660 ns duration_time

9.772951660 seconds time elapsed

4.343887000 seconds user
5.248839000 seconds sys


Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog

[2]

Testing IpArith_Scalar_SP
Metric 'IpArith_Scalar_SP' not printed in:
# Running 'internals/synthesize' benchmark:
Computing performance of single threaded perf event synthesis by
synthesizing events on the perf process itself:
Average synthesis took: 458.601 usec (+- 0.257 usec)
Average num. events: 44.000 (+- 0.000)
Average time per event 10.423 usec
Average data synthesis took: 486.297 usec (+- 0.306 usec)
Average num. events: 296.000 (+- 0.000)
Average time per event 1.643 usec

Performance counter stats for 'perf bench internals synthesize':

108854260048 INST_RETIRED.ANY
0 FP_ARITH_INST_RETIRED.SCALAR_SINGLE
9750270760 ns duration_time

9.750270760 seconds time elapsed

4.288438000 seconds user
5.323337000 seconds sys

Thanks

> At the time of filing the update I didn't have access to a Skylake
> machine (just SkylakeX) but this test was ran as detailed in the
> commit message:
> https://lore.kernel.org/lkml/[email protected]/
> Knowing the test, I suspect there may be a bad event on Skylake, but
> can't confirm this because I lack the hardware and/or the test output.
> The issue may also be how the test was run, such as not as root, not
> in a container. There is a further issue with this test that metrics
> (e.g. number of vector ops) that measure things that a simple
> benchmark doesn't cause counts for can fail the test, as the test is
> checking if the metric is reported - for example, there may be no
> vector ops within the simple benchmark.
>
> Thanks,
> Ian
>


Attachments:
(No filename) (5.96 kB)
perf-sanity-tests.xz (57.64 kB)

2022-04-14 02:08:07

by Ian Rogers

Subject: Re: [LKP] Re: [perf vendor events] 3f5f0df7bf: perf-sanity-tests.perf_all_metrics_test.fail

On Wed, Apr 13, 2022 at 12:06 AM Carel Si <[email protected]> wrote:
>
> Hi,
>
> On Fri, Mar 04, 2022 at 10:10:53AM -0800, Ian Rogers wrote:
> > On Fri, Mar 4, 2022 at 12:33 AM kernel test robot <[email protected]> wrote:
> > >
> > >
> > >
> > > Greeting,
> > >
> > > FYI, we noticed the following commit (built with gcc-9):
> > >
> > > commit: 3f5f0df7bf0f8c48d33d43454fc0b7d0f3ab9537 ("perf vendor events: Update metrics for Skylake")
> > > https://git.kernel.org/cgit/linux/kernel/git/next/linux-next.git master
> > >
> > > in testcase: perf-sanity-tests
> > > version: perf-x86_64-fb184c4af9b9-1_20220302
> > > with following parameters:
> > >
> > > perf_compiler: clang
> > > ucode: 0xec
> > >
> > >
> > >
> > > on test machine: 8 threads 1 sockets Intel(R) Core(TM) i7-7700 CPU @ 3.60GHz with 32G memory
> > >
> > > caused below changes (please refer to attached dmesg/kmsg for entire log/backtrace):
> >
> > Hi,
> >
> > Thanks for the report! There is no information in the test output that
> > I can diagnose the issue with, could you add the -v option to perf
> > test so that I can see what the cause is, rather than just pass/fail.
>
> We Added '-v' option, found out that 3f5f0df7bf failed at testing
> 'Branching_Overhead' [1] and 'IpArith_Scalar_SP' [2], details attached
> in perf-sanity-tests.xz
>
> [1]
>
> Testing Branching_Overhead
> Metric 'Branching_Overhead' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 459.468 usec (+- 0.265 usec)
> Average num. events: 44.000 (+- 0.000)
> Average time per event 10.442 usec
> Average data synthesis took: 486.181 usec (+- 0.272 usec)
> Average num. events: 296.000 (+- 0.000)
> Average time per event 1.643 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> <not counted> BR_INST_RETIRED.NEAR_CALL (0.00%)
> <not counted> BR_INST_RETIRED.NEAR_TAKEN (0.00%)
> <not counted> BR_INST_RETIRED.NOT_TAKEN (0.00%)
> <not counted> BR_INST_RETIRED.CONDITIONAL (0.00%)
> <not counted> CPU_CLK_UNHALTED.THREAD (0.00%)
> 9772951660 ns duration_time
>
> 9.772951660 seconds time elapsed
>
> 4.343887000 seconds user
> 5.248839000 seconds sys
>
>
> Some events weren't counted. Try disabling the NMI watchdog:
> echo 0 > /proc/sys/kernel/nmi_watchdog
> perf stat ...
> echo 1 > /proc/sys/kernel/nmi_watchdog

So the failure here is that the NMI watchdog on your machine uses a
performance counter, which means the group of events doesn't have
sufficient counters to compute the metric. There are a couple of known
issues here:

1) We create metric groups as weak groups, so perf_event_open should
fail for the group of events above and perf should then fall back to
not grouping the events. Something is wrong in the kernel PMU code,
meaning this isn't happening. Perhaps Kan can take a look? I'll provide
more details below.
2) Ideally we wouldn't use a performance counter for the NMI watchdog:
https://lore.kernel.org/lkml/1558660583-28561-1-git-send-email-ricardo.neri-calderon@linux.intel.com/

We could expand the test here:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/tree/tools/perf/tests/shell/stat_all_metrics.sh?h=perf/core#n18
so that NMI watchdog failures are reported as a skip rather than a failure.
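
A minimal sketch of such a check, assuming the test has the full perf
stat output for one metric in a variable. The helper name
classify_metric_output and the pass/skip/fail result words are
hypothetical; the only string taken from the output above is perf's
"Try disabling the NMI watchdog" hint. This is not the code in
stat_all_metrics.sh.

```shell
#!/bin/sh
# Hypothetical helper, not part of stat_all_metrics.sh.
# $1: metric name
# $2: captured output of a perf stat run for that metric
# Prints "pass" if the metric name appears in the output, "skip" if perf
# printed its NMI watchdog hint instead, and "fail" otherwise.
classify_metric_output() {
  metric=$1
  output=$2
  case $output in
    *"$metric"*) echo pass ;;
    *"Try disabling the NMI watchdog"*) echo skip ;;
    *) echo fail ;;
  esac
}
```

The metric name is checked first so that a run which prints the metric
still passes even if the hint is also present.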


Demonstration that a failing group on Skylake does not break up a weak
group (tested on a SkylakeX):
1) No group works:
$ perf stat -e 'BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,CPU_CLK_UNHALTED.THREAD' -a sleep 1

Performance counter stats for 'system wide':

        7,979,997      BR_INST_RETIRED.NEAR_CALL      (79.98%)
       45,462,860      BR_INST_RETIRED.NEAR_TAKEN     (80.04%)
       54,698,502      BR_INST_RETIRED.NOT_TAKEN      (80.05%)
       78,865,520      BR_INST_RETIRED.CONDITIONAL    (80.04%)
    1,104,280,963      CPU_CLK_UNHALTED.THREAD        (79.89%)

1.001761717 seconds time elapsed

2) Hard group fails:
$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,CPU_CLK_UNHALTED.THREAD}' -a sleep 1

Performance counter stats for 'system wide':

    <not counted>      BR_INST_RETIRED.NEAR_CALL      (0.00%)
    <not counted>      BR_INST_RETIRED.NEAR_TAKEN     (0.00%)
    <not counted>      BR_INST_RETIRED.NOT_TAKEN      (0.00%)
    <not counted>      BR_INST_RETIRED.CONDITIONAL    (0.00%)
    <not counted>      CPU_CLK_UNHALTED.THREAD        (0.00%)

1.001565418 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog

3) Weak group doesn't fall back to no group:
$ perf stat -e '{BR_INST_RETIRED.NEAR_CALL,BR_INST_RETIRED.NEAR_TAKEN,BR_INST_RETIRED.NOT_TAKEN,BR_INST_RETIRED.CONDITIONAL,CPU_CLK_UNHALTED.THREAD}:W' -a sleep 1

Performance counter stats for 'system wide':

    <not counted>      BR_INST_RETIRED.NEAR_CALL      (0.00%)
    <not counted>      BR_INST_RETIRED.NEAR_TAKEN     (0.00%)
    <not counted>      BR_INST_RETIRED.NOT_TAKEN      (0.00%)
    <not counted>      BR_INST_RETIRED.CONDITIONAL    (0.00%)
    <not counted>      CPU_CLK_UNHALTED.THREAD        (0.00%)

1.001690318 seconds time elapsed

Some events weren't counted. Try disabling the NMI watchdog:
echo 0 > /proc/sys/kernel/nmi_watchdog
perf stat ...
echo 1 > /proc/sys/kernel/nmi_watchdog
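
The three runs above differ only in how the event list is wrapped, so
they can be scripted from one list. A small sketch, assuming a
hypothetical helper named event_spec (the name and the none/hard/weak
mode words are made up for illustration); the {...} and {...}:W forms
are the ones used in the commands above.

```shell
#!/bin/sh
# Hypothetical helper: build the perf stat -e argument for one grouping mode.
# $1: mode, one of none | hard | weak
# $2: comma-separated event list
#   none -> A,B,C       (no group)
#   hard -> {A,B,C}     (group)
#   weak -> {A,B,C}:W   (weak group)
event_spec() {
  mode=$1
  events=$2
  case $mode in
    none) printf '%s\n' "$events" ;;
    hard) printf '{%s}\n' "$events" ;;
    weak) printf '{%s}:W\n' "$events" ;;
    *) echo "unknown mode: $mode" >&2; return 1 ;;
  esac
}

# Example use (needs perf and suitable privileges, so left commented out):
# for mode in none hard weak; do
#   perf stat -e "$(event_spec "$mode" "$events")" -a sleep 1
# done
```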


> [2]
>
> Testing IpArith_Scalar_SP
> Metric 'IpArith_Scalar_SP' not printed in:
> # Running 'internals/synthesize' benchmark:
> Computing performance of single threaded perf event synthesis by
> synthesizing events on the perf process itself:
> Average synthesis took: 458.601 usec (+- 0.257 usec)
> Average num. events: 44.000 (+- 0.000)
> Average time per event 10.423 usec
> Average data synthesis took: 486.297 usec (+- 0.306 usec)
> Average num. events: 296.000 (+- 0.000)
> Average time per event 1.643 usec
>
> Performance counter stats for 'perf bench internals synthesize':
>
> 108854260048 INST_RETIRED.ANY
> 0 FP_ARITH_INST_RETIRED.SCALAR_SINGLE
> 9750270760 ns duration_time
>
> 9.750270760 seconds time elapsed
>
> 4.288438000 seconds user
> 5.323337000 seconds sys

I believe this failure case is now reported as a skip. The relevant fix was:
https://git.kernel.org/pub/scm/linux/kernel/git/acme/linux.git/commit/tools/perf/tests/shell/stat_all_metrics.sh?h=perf/core&id=00236a2dc8a3768fdc689380d2e93b96cc971bd7
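
To illustrate the kind of condition involved, here is a rough sketch
that detects a counter line with a count of exactly 0, as in the
FP_ARITH_INST_RETIRED.SCALAR_SINGLE line above. The helper name
has_zero_count and the regular expression are assumptions for
illustration only; this is not the content of the commit linked above.

```shell
#!/bin/sh
# Hypothetical helper, not taken from the linked commit.
# $1: captured perf stat output
# Succeeds (exit status 0) if some line starts with optional whitespace,
# a lone "0", whitespace, and then the start of an event name.
# Lines such as "0.500 seconds time elapsed" do not match because the
# "0" is followed by "." rather than whitespace.
has_zero_count() {
  printf '%s\n' "$1" | grep -Eq '^[[:space:]]*0[[:space:]]+[A-Za-z_]'
}
```

A test loop could then treat "metric not printed and has_zero_count
succeeds" as a skip instead of a failure.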

Thanks,
Ian

> Thanks
>
> > At the time of filing the update I didn't have access to a Skylake
> > machine (just SkylakeX) but this test was ran as detailed in the
> > commit message:
> > https://lore.kernel.org/lkml/[email protected]/
> > Knowing the test, I suspect there may be a bad event on Skylake, but
> > can't confirm this because I lack the hardware and/or the test output.
> > The issue may also be how the test was run, such as not as root, not
> > in a container. There is a further issue with this test that metrics
> > (e.g. number of vector ops) that measure things that a simple
> > benchmark doesn't cause counts for can fail the test, as the test is
> > checking if the metric is reported - for example, there may be no
> > vector ops within the simple benchmark.
> >
> > Thanks,
> > Ian
> >