2021-04-04 08:38:12

by Pratik R. Sampat

Subject: [RFC v3 0/2] CPU-Idle latency selftest framework

Changelog
RFC v2-->v3

Based on comments by Doug Smythies,
1. Changed the commit log to reflect that the test must be run as super user.
2. Added a comment specifying a method to run the test bash script
without recompiling.
3. Enable all the idle states after the experiments are completed so
that the system is in a coherent state after the tests have run.
4. Correct the return status of a CPU that cannot be off-lined.

RFC v2: https://lkml.org/lkml/2021/4/1/615
---
A kernel module + userspace driver to estimate the wakeup latency
caused by going into stop states. The motivation behind this program is
to find significant deviations from the advertised latency and residency
values.

The patchset measures latencies for two kinds of events: IPIs and Timers.
As this is a software-only mechanism, there will be additional latencies from
the kernel-firmware-hardware interactions. To account for that, the
program also measures a baseline latency on a 100 percent loaded CPU,
and the measured latencies should be viewed relative to that baseline.

To achieve this, we introduce a kernel module and expose its control
knobs through the debugfs interface that the selftests can engage with.
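
For reference, a typical invocation might look like the sketch below
(run as super user; the kselftest target name and the -v flag are taken
from the file layout and later discussion in this thread, so treat it as
illustrative rather than definitive):

# Run through the kselftest framework:
make -C tools/testing/selftests TARGETS=cpuidle run_tests
# Or run the driver script directly; -v prints per-CPU latencies:
cd tools/testing/selftests/cpuidle && sudo ./cpuidle.sh -v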

The kernel module provides the following interfaces within
/sys/kernel/debug/latency_test/:

IPI test:
ipi_cpu_dest = Destination CPU for the IPI
ipi_cpu_src = Origin of the IPI
ipi_latency_ns = Measured latency time in ns
Timeout test:
timeout_cpu_src = CPU on which the timer is to be queued
timeout_expected_ns = Timer duration
timeout_diff_ns = Difference between the actual and the expected timer duration
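
As an illustration, driving these knobs by hand might look like the
sketch below (the selftest script automates this; whether writing
ipi_cpu_dest / timeout_cpu_src is what actually kicks off a measurement
is an assumption here):

# Hand-driven sketch of the debugfs knobs; run as root
DBGFS=/sys/kernel/debug/latency_test
# IPI test: wake CPU 4 from CPU 0 and read the measured latency
echo 0 > $DBGFS/ipi_cpu_src
echo 4 > $DBGFS/ipi_cpu_dest        # assumed to trigger the measurement
cat $DBGFS/ipi_latency_ns
# Timeout test: queue a 10ms timer on CPU 4 and read the deviation
echo 10000000 > $DBGFS/timeout_expected_ns
echo 4 > $DBGFS/timeout_cpu_src     # assumed to arm the timer
cat $DBGFS/timeout_diff_ns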

Sample output on a POWER9 system is as follows:
# --IPI Latency Test---
# Baseline Average IPI latency(ns): 3114
# Observed Average IPI latency(ns) - State0: 3265
# Observed Average IPI latency(ns) - State1: 3507
# Observed Average IPI latency(ns) - State2: 3739
# Observed Average IPI latency(ns) - State3: 3807
# Observed Average IPI latency(ns) - State4: 17070
# Observed Average IPI latency(ns) - State5: 1038174
# Observed Average IPI latency(ns) - State6: 1068784
#
# --Timeout Latency Test--
# Baseline Average timeout diff(ns): 1420
# Observed Average timeout diff(ns) - State0: 1640
# Observed Average timeout diff(ns) - State1: 1764
# Observed Average timeout diff(ns) - State2: 1715
# Observed Average timeout diff(ns) - State3: 1845
# Observed Average timeout diff(ns) - State4: 16581
# Observed Average timeout diff(ns) - State5: 939977
# Observed Average timeout diff(ns) - State6: 1073024


Things to keep in mind:

1. This kernel module + bash driver does not guarantee idleness on a
core when the IPI or the Timer is armed. It only invokes sleep and
hopes that the core is idle once the IPI/Timer is fired at it.
Hence this program must be run on a completely idle system for best
results.

2. Even on a completely idle system, there may be book-keeping or
jitter tasks that run on the core we want to be idle. This can create
outliers in the latency measurement. Thankfully, these outliers
should be large enough to weed out easily.

3. A userspace-only selftest variant was also sent out as an RFC, based on
suggestions on the previous patchset to simplify the kernel
complexity. However, a userspace-only approach had more noise in
the latency measurement due to userspace-kernel interactions,
which led to run-to-run variance and a less accurate test.
Another downside of a userspace program is that it
takes orders of magnitude longer to complete a full system test
compared to the kernel framework.
RFC patch: https://lkml.org/lkml/2020/9/2/356

4. For Intel systems, the Timer-based latencies don't directly give a
measure of idle wakeup latencies. This is because of a hardware
optimization mechanism that pre-arms a CPU when a timer is set to
wake it up. That doesn't make this metric useless for Intel systems;
it just means that it measures IPI/Timer response latency rather
than idle wakeup latency.
(Source: https://lkml.org/lkml/2020/9/2/610)
As a solution to this problem, a hardware-based latency analyzer has been
devised by Artem Bityutskiy from Intel.
https://youtu.be/Opk92aQyvt0?t=8266
https://intel.github.io/wult/

Pratik Rajesh Sampat (2):
cpuidle: Extract IPI based and timer based wakeup latency from idle states
selftest/cpuidle: Add support for cpuidle latency measurement

drivers/cpuidle/Makefile | 1 +
drivers/cpuidle/test-cpuidle_latency.c | 157 ++++++++++
lib/Kconfig.debug | 10 +
tools/testing/selftests/Makefile | 1 +
tools/testing/selftests/cpuidle/Makefile | 6 +
tools/testing/selftests/cpuidle/cpuidle.sh | 326 +++++++++++++++++++++
tools/testing/selftests/cpuidle/settings | 2 +
7 files changed, 503 insertions(+)
create mode 100644 drivers/cpuidle/test-cpuidle_latency.c
create mode 100644 tools/testing/selftests/cpuidle/Makefile
create mode 100755 tools/testing/selftests/cpuidle/cpuidle.sh
create mode 100644 tools/testing/selftests/cpuidle/settings

--
2.17.1


2021-04-09 05:25:47

by Doug Smythies

Subject: Re: [RFC v3 0/2] CPU-Idle latency selftest framework

Hi Pratik,

I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
The core to cpu mappings are:
core 0 has cpus 0 and 6
core 1 has cpus 1 and 7
core 2 has cpus 2 and 8
core 3 has cpus 3 and 9
core 4 has cpus 4 and 10
core 5 has cpus 5 and 11

By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
Wouldn't it make more sense to test each core once?
With the source CPU always 0, I think the results from the
destination CPUs 0 and 6, both on core 0, bias the results, at
least in the deeper idle states. They don't make much difference in
the shallow states. Myself, I wouldn't include them in the results
(the "suggest" values below are the averages recomputed without
destination CPUs 0 and 6).
Example, where I used the -v option for all CPUs:

--IPI Latency Test---
--Baseline IPI Latency measurement: CPU Busy--
SRC_CPU DEST_CPU IPI_Latency(ns)
0 0 101
0 1 790
0 2 609
0 3 595
0 4 737
0 5 759
0 6 780
0 7 741
0 8 574
0 9 681
0 10 527
0 11 552
Baseline Avg IPI latency(ns): 620 <<<< suggest 656 here
---Enabling state: 0---
SRC_CPU DEST_CPU IPI_Latency(ns)
0 0 76
0 1 471
0 2 420
0 3 462
0 4 454
0 5 468
0 6 453
0 7 473
0 8 380
0 9 483
0 10 492
0 11 454
Expected IPI latency(ns): 0
Observed Avg IPI latency(ns) - State 0: 423 <<<<< suggest 456 here
---Enabling state: 1---
SRC_CPU DEST_CPU IPI_Latency(ns)
0 0 112
0 1 866
0 2 663
0 3 851
0 4 1090
0 5 1314
0 6 1941
0 7 1458
0 8 687
0 9 802
0 10 1041
0 11 1284
Expected IPI latency(ns): 1000
Observed Avg IPI latency(ns) - State 1: 1009 <<<< suggest 1006 here
---Enabling state: 2---
SRC_CPU DEST_CPU IPI_Latency(ns)
0 0 75
0 1 16362
0 2 16785
0 3 19650
0 4 17356
0 5 17606
0 6 2217
0 7 17958
0 8 17332
0 9 16615
0 10 17382
0 11 17423
Expected IPI latency(ns): 120000
Observed Avg IPI latency(ns) - State 2: 14730 <<<< suggest 17447 here
---Enabling state: 3---
SRC_CPU DEST_CPU IPI_Latency(ns)
0 0 103
0 1 17416
0 2 17961
0 3 16651
0 4 17867
0 5 17726
0 6 2178
0 7 16620
0 8 20951
0 9 16567
0 10 17131
0 11 17563
Expected IPI latency(ns): 1034000
Observed Avg IPI latency(ns) - State 3: 14894 <<<< suggest 17645 here

Hope this helps.

... Doug

2021-04-09 07:46:52

by Pratik R. Sampat

Subject: Re: [RFC v3 0/2] CPU-Idle latency selftest framework

Hello Doug,

On 09/04/21 10:53 am, Doug Smythies wrote:
> Hi Pratik,
>
> I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
> The core to cpu mappings are:
> core 0 has cpus 0 and 6
> core 1 has cpus 1 and 7
> core 2 has cpus 2 and 8
> core 3 has cpus 3 and 9
> core 4 has cpus 4 and 10
> core 5 has cpus 5 and 11
>
> By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> Wouldn't it make more sense to test each core once?

Ideally it would be better to run on all the CPUs; however, on larger systems
that I'm testing on, with hundreds of cores and a high thread count, the
execution time increases while not particularly bringing any additional
information to the table.

That is why it made sense to run on only one of the threads of each core to make
the experiment faster while preserving accuracy.

To handle various thread topologies it may be worthwhile to parse
/sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
use this information to run only once per physical core, rather than
assuming the topology.
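
Something along the lines of this rough, untested sketch is what I have
in mind (it only derives the CPU list; the sibling lists may be formatted
as "0,6" or as a range like "0-1", hence the two cuts):

# Pick one representative CPU per physical core: keep a CPU only if it
# is the first entry of its own thread_siblings_list
test_cpus=""
for cpu_path in /sys/devices/system/cpu/cpu[0-9]*; do
    cpu=${cpu_path##*/cpu}
    first=$(cut -d, -f1 "$cpu_path/topology/thread_siblings_list" | cut -d- -f1)
    [ "$cpu" = "$first" ] && test_cpus="$test_cpus $cpu"
done
echo "CPUs selected for the test:$test_cpus"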

What are your thoughts on a mechanism like this?

> With the source CPU always 0, I think the results from the
> destination CPUs 0 and 6, on core 0 bias the results, at
> least in the deeper idle states. They don't make much difference in
> the shallow states. Myself, I wouldn't include them in the results.

I agree, the CPU0->CPU0 same-core interaction is causing a bias. I could omit that
observation while computing the average.

In the verbose mode I'll omit all the threads of CPU0's core, and in the default
(quick) mode just CPU0's latency can be omitted while computing the average.

Thank you,
Pratik

> Example, where I used the -v option for all CPUs:
>
> --IPI Latency Test---
> --Baseline IPI Latency measurement: CPU Busy--
> SRC_CPU DEST_CPU IPI_Latency(ns)
> 0 0 101
> 0 1 790
> 0 2 609
> 0 3 595
> 0 4 737
> 0 5 759
> 0 6 780
> 0 7 741
> 0 8 574
> 0 9 681
> 0 10 527
> 0 11 552
> Baseline Avg IPI latency(ns): 620 <<<< suggest 656 here
> ---Enabling state: 0---
> SRC_CPU DEST_CPU IPI_Latency(ns)
> 0 0 76
> 0 1 471
> 0 2 420
> 0 3 462
> 0 4 454
> 0 5 468
> 0 6 453
> 0 7 473
> 0 8 380
> 0 9 483
> 0 10 492
> 0 11 454
> Expected IPI latency(ns): 0
> Observed Avg IPI latency(ns) - State 0: 423 <<<<< suggest 456 here
> ---Enabling state: 1---
> SRC_CPU DEST_CPU IPI_Latency(ns)
> 0 0 112
> 0 1 866
> 0 2 663
> 0 3 851
> 0 4 1090
> 0 5 1314
> 0 6 1941
> 0 7 1458
> 0 8 687
> 0 9 802
> 0 10 1041
> 0 11 1284
> Expected IPI latency(ns): 1000
> Observed Avg IPI latency(ns) - State 1: 1009 <<<< suggest 1006 here
> ---Enabling state: 2---
> SRC_CPU DEST_CPU IPI_Latency(ns)
> 0 0 75
> 0 1 16362
> 0 2 16785
> 0 3 19650
> 0 4 17356
> 0 5 17606
> 0 6 2217
> 0 7 17958
> 0 8 17332
> 0 9 16615
> 0 10 17382
> 0 11 17423
> Expected IPI latency(ns): 120000
> Observed Avg IPI latency(ns) - State 2: 14730 <<<< suggest 17447 here
> ---Enabling state: 3---
> SRC_CPU DEST_CPU IPI_Latency(ns)
> 0 0 103
> 0 1 17416
> 0 2 17961
> 0 3 16651
> 0 4 17867
> 0 5 17726
> 0 6 2178
> 0 7 16620
> 0 8 20951
> 0 9 16567
> 0 10 17131
> 0 11 17563
> Expected IPI latency(ns): 1034000
> Observed Avg IPI latency(ns) - State 3: 14894 <<<< suggest 17645 here
>
> Hope this helps.
>
> ... Doug

2021-04-09 14:28:18

by Doug Smythies

Subject: Re: [RFC v3 0/2] CPU-Idle latency selftest framework

On Fri, Apr 9, 2021 at 12:43 AM Pratik Sampat <[email protected]> wrote:
> On 09/04/21 10:53 am, Doug Smythies wrote:
> > I tried V3 on an Intel i5-10600K processor with 6 cores and 12 CPUs.
> > The core to cpu mappings are:
> > core 0 has cpus 0 and 6
> > core 1 has cpus 1 and 7
> > core 2 has cpus 2 and 8
> > core 3 has cpus 3 and 9
> > core 4 has cpus 4 and 10
> > core 5 has cpus 5 and 11
> >
> > By default, it will test CPUs 0,2,4,6,8,10 on cores 0,2,4,0,2,4.
> > Wouldn't it make more sense to test each core once?
>
> Ideally it would be better to run on all the CPUs; however, on larger systems
> that I'm testing on, with hundreds of cores and a high thread count, the
> execution time increases while not particularly bringing any additional
> information to the table.
>
> That is why it made sense to run on only one of the threads of each core to make
> the experiment faster while preserving accuracy.
>
> To handle various thread topologies it may be worthwhile to parse
> /sys/devices/system/cpu/cpuX/topology/thread_siblings_list for each core and
> use this information to run only once per physical core, rather than
> assuming the topology.
>
> What are your thoughts on a mechanism like this?

Yes, seems like a good solution.

... Doug

2023-09-25 11:20:36

by Aboorva Devarajan

Subject: Re: [RFC v3 0/2] CPU-Idle latency selftest framework

On Mon, 2023-09-11 at 11:06 +0530, Aboorva Devarajan wrote:

CC'ing CPUidle lists and maintainers,

Patch Summary:

The patchset introduces a kernel module and userspace driver designed
for estimating the wakeup latency experienced when waking up from
various CPU idle states. It primarily measures latencies related to two
types of events: Inter-Processor Interrupts (IPIs) and Timers.

Background:

Initially, these patches were introduced as a generic self-test.
However, it was later discovered that Intel platforms incorporate
timer-based wakeup optimizations. These optimizations allow CPUs to
perform a pre-wakeup, which limits the effectiveness of latency
observation in certain scenarios because it only measures the optimized
wakeup latency [1].

Therefore, in this RFC, the self-test is specifically integrated into
PowerPC, as it has been tested and used in PowerPC so far.

Another proposal is to introduce these patches as a generic cpuidle IPI
and timer wake-up test. While this method may not give us an exact
measurement of latency variations at the hardware level, it can still
help us assess this metric from a software observability standpoint.

Looking forward to hearing what you think and any suggestions you may
have regarding this. Thanks.

[1]
https://lore.kernel.org/linux-pm/[email protected]/T/#m5c004b9b1a918f669e91b3d0f33e2e3500923234

> Changelog: v2 -> v3
>
> * Minimal code refactoring
> * Rebased on v6.6-rc1
>
> RFC v1:
> https://lore.kernel.org/all/[email protected]/
>
> RFC v2:
> https://lore.kernel.org/all/[email protected]/
>
> Other related RFC:
> https://lore.kernel.org/all/[email protected]/
>
> Userspace selftest:
> https://lkml.org/lkml/2020/9/2/356
>
> ----
>
> A kernel module + userspace driver to estimate the wakeup latency
> caused by going into stop states. The motivation behind this program is
> to find significant deviations behind advertised latency and residency
> values.
>
> The patchset measures latencies for two kinds of events. IPIs and Timers
> As this is a software-only mechanism, there will be additional latencies
> of the kernel-firmware-hardware interactions. To account for that, the
> program also measures a baseline latency on a 100 percent loaded CPU
> and the latencies achieved must be in view relative to that.
>
> To achieve this, we introduce a kernel module and expose its control
> knobs through the debugfs interface that the selftests can engage with.
>
> The kernel module provides the following interfaces within
> /sys/kernel/debug/powerpc/latency_test/ for,
>
> IPI test:
> ipi_cpu_dest = Destination CPU for the IPI
> ipi_cpu_src = Origin of the IPI
> ipi_latency_ns = Measured latency time in ns
> Timeout test:
> timeout_cpu_src = CPU on which the timer to be queued
> timeout_expected_ns = Timer duration
> timeout_diff_ns = Difference of actual duration vs expected timer
>
> Sample output is as follows:
>
> # --IPI Latency Test---
> # Baseline Avg IPI latency(ns): 2720
> # Observed Avg IPI latency(ns) - State snooze: 2565
> # Observed Avg IPI latency(ns) - State stop0_lite: 3856
> # Observed Avg IPI latency(ns) - State stop0: 3670
> # Observed Avg IPI latency(ns) - State stop1: 3872
> # Observed Avg IPI latency(ns) - State stop2: 17421
> # Observed Avg IPI latency(ns) - State stop4: 1003922
> # Observed Avg IPI latency(ns) - State stop5: 1058870
> #
> # --Timeout Latency Test--
> # Baseline Avg timeout diff(ns): 1435
> # Observed Avg timeout diff(ns) - State snooze: 1709
> # Observed Avg timeout diff(ns) - State stop0_lite: 2028
> # Observed Avg timeout diff(ns) - State stop0: 1954
> # Observed Avg timeout diff(ns) - State stop1: 1895
> # Observed Avg timeout diff(ns) - State stop2: 14556
> # Observed Avg timeout diff(ns) - State stop4: 873988
> # Observed Avg timeout diff(ns) - State stop5: 959137
>
> Aboorva Devarajan (2):
> powerpc/cpuidle: cpuidle wakeup latency based on IPI and timer events
> powerpc/selftest: Add support for cpuidle latency measurement
>
> arch/powerpc/Kconfig.debug | 10 +
> arch/powerpc/kernel/Makefile | 1 +
> arch/powerpc/kernel/test_cpuidle_latency.c | 154 ++++++
> tools/testing/selftests/powerpc/Makefile | 1 +
> .../powerpc/cpuidle_latency/.gitignore | 2 +
> .../powerpc/cpuidle_latency/Makefile | 6 +
> .../cpuidle_latency/cpuidle_latency.sh | 443 ++++++++++++++++++
> .../powerpc/cpuidle_latency/settings | 1 +
> 8 files changed, 618 insertions(+)
> create mode 100644 arch/powerpc/kernel/test_cpuidle_latency.c
> create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/.gitignore
> create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/Makefile
> create mode 100755 tools/testing/selftests/powerpc/cpuidle_latency/cpuidle_latency.sh
> create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/settings
>

2023-10-12 04:51:40

by Aboorva Devarajan

Subject: Re: [RFC v3 0/2] CPU-Idle latency selftest framework

On Mon, 2023-09-25 at 10:36 +0530, Aboorva Devarajan wrote:

Gentle ping to check if there are any feedback or comments on this
patch-set.

Thanks
Aboorva

> On Mon, 2023-09-11 at 11:06 +0530, Aboorva Devarajan wrote:
>
> CC'ing CPUidle lists and maintainers,
>
> Patch Summary:
>
> The patchset introduces a kernel module and userspace driver designed
> for estimating the wakeup latency experienced when waking up from
> various CPU idle states. It primarily measures latencies related to two
> types of events: Inter-Processor Interrupts (IPIs) and Timers.
>
> Background:
>
> Initially, these patches were introduced as a generic self-test.
> However, it was later discovered that Intel platforms incorporate
> timer-based wakeup optimizations. These optimizations allow CPUs to
> perform a pre-wakeup, which limits the effectiveness of latency
> observation in certain scenarios because it only measures the optimized
> wakeup latency [1].
>
> Therefore, in this RFC, the self-test is specifically integrated into
> PowerPC, as it has been tested and used in PowerPC so far.
>
> Another proposal is to introduce these patches as a generic cpuidle IPI
> and timer wake-up test. While this method may not give us an exact
> measurement of latency variations at the hardware level, it can still
> help us assess this metric from a software observability standpoint.
>
> Looking forward to hearing what you think and any suggestions you may
> have regarding this. Thanks.
>
> [1]
> https://lore.kernel.org/linux-pm/[email protected]/T/#m5c004b9b1a918f669e91b3d0f33e2e3500923234
>
> > Changelog: v2 -> v3
> >
> > * Minimal code refactoring
> > * Rebased on v6.6-rc1
> >
> > RFC v1:
> > https://lore.kernel.org/all/[email protected]/
> >
> > RFC v2:
> > https://lore.kernel.org/all/[email protected]/
> >
> > Other related RFC:
> > https://lore.kernel.org/all/[email protected]/
> >
> > Userspace selftest:
> > https://lkml.org/lkml/2020/9/2/356
> >
> > ----
> >
> > A kernel module + userspace driver to estimate the wakeup latency
> > caused by going into stop states. The motivation behind this program is
> > to find significant deviations behind advertised latency and residency
> > values.
> >
> > The patchset measures latencies for two kinds of events. IPIs and Timers
> > As this is a software-only mechanism, there will be additional latencies
> > of the kernel-firmware-hardware interactions. To account for that, the
> > program also measures a baseline latency on a 100 percent loaded CPU
> > and the latencies achieved must be in view relative to that.
> >
> > To achieve this, we introduce a kernel module and expose its control
> > knobs through the debugfs interface that the selftests can engage with.
> >
> > The kernel module provides the following interfaces within
> > /sys/kernel/debug/powerpc/latency_test/ for,
> >
> > IPI test:
> > ipi_cpu_dest = Destination CPU for the IPI
> > ipi_cpu_src = Origin of the IPI
> > ipi_latency_ns = Measured latency time in ns
> > Timeout test:
> > timeout_cpu_src = CPU on which the timer to be queued
> > timeout_expected_ns = Timer duration
> > timeout_diff_ns = Difference of actual duration vs expected timer
> >
> > Sample output is as follows:
> >
> > # --IPI Latency Test---
> > # Baseline Avg IPI latency(ns): 2720
> > # Observed Avg IPI latency(ns) - State snooze: 2565
> > # Observed Avg IPI latency(ns) - State stop0_lite: 3856
> > # Observed Avg IPI latency(ns) - State stop0: 3670
> > # Observed Avg IPI latency(ns) - State stop1: 3872
> > # Observed Avg IPI latency(ns) - State stop2: 17421
> > # Observed Avg IPI latency(ns) - State stop4: 1003922
> > # Observed Avg IPI latency(ns) - State stop5: 1058870
> > #
> > # --Timeout Latency Test--
> > # Baseline Avg timeout diff(ns): 1435
> > # Observed Avg timeout diff(ns) - State snooze: 1709
> > # Observed Avg timeout diff(ns) - State stop0_lite: 2028
> > # Observed Avg timeout diff(ns) - State stop0: 1954
> > # Observed Avg timeout diff(ns) - State stop1: 1895
> > # Observed Avg timeout diff(ns) - State stop2: 14556
> > # Observed Avg timeout diff(ns) - State stop4: 873988
> > # Observed Avg timeout diff(ns) - State stop5: 959137
> >
> > Aboorva Devarajan (2):
> > powerpc/cpuidle: cpuidle wakeup latency based on IPI and timer events
> > powerpc/selftest: Add support for cpuidle latency measurement
> >
> > arch/powerpc/Kconfig.debug | 10 +
> > arch/powerpc/kernel/Makefile | 1 +
> > arch/powerpc/kernel/test_cpuidle_latency.c | 154 ++++++
> > tools/testing/selftests/powerpc/Makefile | 1 +
> > .../powerpc/cpuidle_latency/.gitignore | 2 +
> > .../powerpc/cpuidle_latency/Makefile | 6 +
> > .../cpuidle_latency/cpuidle_latency.sh | 443 ++++++++++++++++++
> > .../powerpc/cpuidle_latency/settings | 1 +
> > 8 files changed, 618 insertions(+)
> > create mode 100644 arch/powerpc/kernel/test_cpuidle_latency.c
> > create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/.gitignore
> > create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/Makefile
> > create mode 100755 tools/testing/selftests/powerpc/cpuidle_latency/cpuidle_latency.sh
> > create mode 100644 tools/testing/selftests/powerpc/cpuidle_latency/settings
> >