2020-10-06 08:40:19

by Mel Gorman

Subject: ACPI _CST introduced performance regressions on Haswell

Hi Rafael,

Numerous workload regressions have bisected repeatedly to the commit
6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems") but only on
a set of Haswell machines that all have the same CPU.

CPU(s): 48
On-line CPU(s) list: 0-47
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 63
Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
Stepping: 2
CPU MHz: 1200.359
CPU max MHz: 3100.0000
CPU min MHz: 1200.0000

As well as being bisected in mainline, backporting the patch to a
distribution kernel also showed the same type of problem, so the patch
is definitely suspicious. An example comparison showing the performance
before CST was enabled and on recent mainline kernels is as follows:

netperf UDP_STREAM
5.5.0 5.5.0-rc2 5.5.0-rc2 5.6.0 5.9.0-rc8
vanilla sle15-sp2-pre-cst sle15-sp2-enable-cst vanilla vanilla
Hmean send-64 203.21 ( 0.00%) 206.43 * 1.58%* 176.89 * -12.95%* 181.18 * -10.84%* 194.45 * -4.31%*
Hmean send-128 401.40 ( 0.00%) 414.19 * 3.19%* 355.84 * -11.35%* 364.13 * -9.29%* 387.83 * -3.38%*
Hmean send-256 786.69 ( 0.00%) 799.70 ( 1.65%) 700.65 * -10.94%* 719.82 * -8.50%* 756.40 * -3.85%*
Hmean send-1024 3059.57 ( 0.00%) 3106.57 * 1.54%* 2659.62 * -13.07%* 2793.58 * -8.69%* 3006.95 * -1.72%*
Hmean send-2048 5976.66 ( 0.00%) 6102.64 ( 2.11%) 5249.34 * -12.17%* 5392.04 * -9.78%* 5805.02 * -2.87%*
Hmean send-3312 9145.09 ( 0.00%) 9304.85 * 1.75%* 8197.25 * -10.36%* 8398.36 * -8.17%* 9120.88 ( -0.26%)
Hmean send-4096 10871.63 ( 0.00%) 11129.76 * 2.37%* 9667.68 * -11.07%* 9929.70 * -8.66%* 10863.41 ( -0.08%)
Hmean send-8192 17747.35 ( 0.00%) 17969.19 ( 1.25%) 15652.91 * -11.80%* 16081.20 * -9.39%* 17316.13 * -2.43%*
Hmean send-16384 29187.16 ( 0.00%) 29418.75 * 0.79%* 26296.64 * -9.90%* 27028.18 * -7.40%* 26941.26 * -7.69%*
Hmean recv-64 203.21 ( 0.00%) 206.43 * 1.58%* 176.89 * -12.95%* 181.18 * -10.84%* 194.45 * -4.31%*
Hmean recv-128 401.40 ( 0.00%) 414.19 * 3.19%* 355.84 * -11.35%* 364.13 * -9.29%* 387.83 * -3.38%*
Hmean recv-256 786.69 ( 0.00%) 799.70 ( 1.65%) 700.65 * -10.94%* 719.82 * -8.50%* 756.40 * -3.85%*
Hmean recv-1024 3059.57 ( 0.00%) 3106.57 * 1.54%* 2659.62 * -13.07%* 2793.58 * -8.69%* 3006.95 * -1.72%*
Hmean recv-2048 5976.66 ( 0.00%) 6102.64 ( 2.11%) 5249.34 * -12.17%* 5392.00 * -9.78%* 5805.02 * -2.87%*
Hmean recv-3312 9145.09 ( 0.00%) 9304.85 * 1.75%* 8197.25 * -10.36%* 8398.36 * -8.17%* 9120.88 ( -0.26%)
Hmean recv-4096 10871.63 ( 0.00%) 11129.76 * 2.37%* 9667.68 * -11.07%* 9929.70 * -8.66%* 10863.38 ( -0.08%)
Hmean recv-8192 17747.35 ( 0.00%) 17969.19 ( 1.25%) 15652.91 * -11.80%* 16081.20 * -9.39%* 17315.96 * -2.43%*
Hmean recv-16384 29187.13 ( 0.00%) 29418.72 * 0.79%* 26296.63 * -9.90%* 27028.18 * -7.40%* 26941.23 * -7.69%*

pre-cst is just before commit 6d4f08a6776 ("intel_idle: Use ACPI _CST on
server systems") and enable-cst is the commit. It was not fixed by 5.6 or
5.9-rc8. A lot of bisections ended up here including kernel compilation,
tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.

What I don't understand is why. The C-state exit latencies are
identical before and after the patch:

/sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
/sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
/sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
/sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
/sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
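
For reference, these sysfs values correspond to the exit_latency fields
of the built-in Haswell server table in intel_idle.c. An abridged sketch
of that table (based on the 5.9-era layout; C3, C6 and deeper states are
elided for brevity):

/*
 * Abridged sketch of hsw_cstates from intel_idle.c (5.9-era). The sysfs
 * entries above map onto it as: state1 = C1 (2us), state2 = C1E (10us),
 * state3 = C3 (33us), state4 = C6 (133us).
 */
static struct cpuidle_state hsw_cstates[] __initdata = {
	{
		.name = "C1",
		.desc = "MWAIT 0x00",
		.flags = MWAIT2flg(0x00),
		.exit_latency = 2,
		.target_residency = 2,
		.enter = &intel_idle, },
	{
		.name = "C1E",
		.desc = "MWAIT 0x01",
		.flags = MWAIT2flg(0x01) | CPUIDLE_FLAG_ALWAYS_ENABLE,
		.exit_latency = 10,
		.target_residency = 20,
		.enter = &intel_idle, },
	/* ... C3 (33/100), C6 (133/400) and deeper states elided ... */
};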

Perf profiles did not show anything interesting. A diff of
/sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
also showed nothing interesting. Any idea why exactly this patch shows up
as being hazardous on Haswell in particular?

--
Mel Gorman
SUSE Labs


2020-10-06 16:02:01

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

Hi Mel,

On 10/6/2020 10:36 AM, Mel Gorman wrote:
> Hi Rafael,
>
> Numerous workload regressions have bisected repeatedly to the commit
> 6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems") but only on
> a set of Haswell machines that all have the same CPU.
>
> CPU(s): 48
> On-line CPU(s) list: 0-47
> Thread(s) per core: 2
> Core(s) per socket: 12
> Socket(s): 2
> NUMA node(s): 2
> Vendor ID: GenuineIntel
> CPU family: 6
> Model: 63
> Model name: Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz
> Stepping: 2
> CPU MHz: 1200.359
> CPU max MHz: 3100.0000
> CPU min MHz: 1200.0000
>
> As well as being bisected in mainline, backporting the patch to a
> distribution kernel also showed the same type of problem, so the patch
> is definitely suspicious. An example comparison showing the performance
> before CST was enabled and on recent mainline kernels is as follows:
>
> netperf UDP_STREAM
> 5.5.0 5.5.0-rc2 5.5.0-rc2 5.6.0 5.9.0-rc8
> vanilla sle15-sp2-pre-cst sle15-sp2-enable-cst vanilla vanilla
> Hmean send-64 203.21 ( 0.00%) 206.43 * 1.58%* 176.89 * -12.95%* 181.18 * -10.84%* 194.45 * -4.31%*
> Hmean send-128 401.40 ( 0.00%) 414.19 * 3.19%* 355.84 * -11.35%* 364.13 * -9.29%* 387.83 * -3.38%*
> Hmean send-256 786.69 ( 0.00%) 799.70 ( 1.65%) 700.65 * -10.94%* 719.82 * -8.50%* 756.40 * -3.85%*
> Hmean send-1024 3059.57 ( 0.00%) 3106.57 * 1.54%* 2659.62 * -13.07%* 2793.58 * -8.69%* 3006.95 * -1.72%*
> Hmean send-2048 5976.66 ( 0.00%) 6102.64 ( 2.11%) 5249.34 * -12.17%* 5392.04 * -9.78%* 5805.02 * -2.87%*
> Hmean send-3312 9145.09 ( 0.00%) 9304.85 * 1.75%* 8197.25 * -10.36%* 8398.36 * -8.17%* 9120.88 ( -0.26%)
> Hmean send-4096 10871.63 ( 0.00%) 11129.76 * 2.37%* 9667.68 * -11.07%* 9929.70 * -8.66%* 10863.41 ( -0.08%)
> Hmean send-8192 17747.35 ( 0.00%) 17969.19 ( 1.25%) 15652.91 * -11.80%* 16081.20 * -9.39%* 17316.13 * -2.43%*
> Hmean send-16384 29187.16 ( 0.00%) 29418.75 * 0.79%* 26296.64 * -9.90%* 27028.18 * -7.40%* 26941.26 * -7.69%*
> Hmean recv-64 203.21 ( 0.00%) 206.43 * 1.58%* 176.89 * -12.95%* 181.18 * -10.84%* 194.45 * -4.31%*
> Hmean recv-128 401.40 ( 0.00%) 414.19 * 3.19%* 355.84 * -11.35%* 364.13 * -9.29%* 387.83 * -3.38%*
> Hmean recv-256 786.69 ( 0.00%) 799.70 ( 1.65%) 700.65 * -10.94%* 719.82 * -8.50%* 756.40 * -3.85%*
> Hmean recv-1024 3059.57 ( 0.00%) 3106.57 * 1.54%* 2659.62 * -13.07%* 2793.58 * -8.69%* 3006.95 * -1.72%*
> Hmean recv-2048 5976.66 ( 0.00%) 6102.64 ( 2.11%) 5249.34 * -12.17%* 5392.00 * -9.78%* 5805.02 * -2.87%*
> Hmean recv-3312 9145.09 ( 0.00%) 9304.85 * 1.75%* 8197.25 * -10.36%* 8398.36 * -8.17%* 9120.88 ( -0.26%)
> Hmean recv-4096 10871.63 ( 0.00%) 11129.76 * 2.37%* 9667.68 * -11.07%* 9929.70 * -8.66%* 10863.38 ( -0.08%)
> Hmean recv-8192 17747.35 ( 0.00%) 17969.19 ( 1.25%) 15652.91 * -11.80%* 16081.20 * -9.39%* 17315.96 * -2.43%*
> Hmean recv-16384 29187.13 ( 0.00%) 29418.72 * 0.79%* 26296.63 * -9.90%* 27028.18 * -7.40%* 26941.23 * -7.69%*
>
> pre-cst is just before commit 6d4f08a6776 ("intel_idle: Use ACPI _CST on
> server systems") and enable-cst is the commit. It was not fixed by 5.6 or
> 5.9-rc8. A lot of bisections ended up here including kernel compilation,
> tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.
>
> What I don't understand is why. The C-state exit latencies are
> identical before and after the patch:
>
> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
>
> Perf profiles did not show anything interesting. A diff of
> /sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
> also showed nothing interesting. Any idea why exactly this patch shows up
> as being hazardous on Haswell in particular?
>
Presumably, some of the idle states are disabled by default on the
affected machines.

Can you check the disable and default_status attributes of each state
before and after the commit in question?

Cheers!


2020-10-06 19:05:39

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Tue, Oct 06, 2020 at 06:00:18PM +0200, Rafael J. Wysocki wrote:
> > server systems") and enable-cst is the commit. It was not fixed by 5.6 or
> > 5.9-rc8. A lot of bisections ended up here including kernel compilation,
> > tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.
> >
> > What I don't understand is why. The C-state exit latencies are
> > identical before and after the patch:
> >
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> > /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> > /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> > /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
> > /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
> >
> > Perf profiles did not show anything interesting. A diff of
> > /sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
> > also showed nothing interesting. Any idea why exactly this patch shows up
> > as being hazardous on Haswell in particular?
> >
> Presumably, some of the idle states are disabled by default on the affected
> machines.
>
> Can you check the disable and default_status attributes of each state before
> and after the commit in question?
>

# grep . pre-cst/cpuidle/state*/disable
pre-cst/cpuidle/state0/disable:0
pre-cst/cpuidle/state1/disable:0
pre-cst/cpuidle/state2/disable:0
pre-cst/cpuidle/state3/disable:0
pre-cst/cpuidle/state4/disable:0
# grep . enable-cst/cpuidle/state*/disable
enable-cst/cpuidle/state0/disable:0
enable-cst/cpuidle/state1/disable:0
enable-cst/cpuidle/state2/disable:0
enable-cst/cpuidle/state3/disable:0
enable-cst/cpuidle/state4/disable:0
# grep . pre-cst/cpuidle/state*/default_status
pre-cst/cpuidle/state0/default_status:enabled
pre-cst/cpuidle/state1/default_status:enabled
pre-cst/cpuidle/state2/default_status:enabled
pre-cst/cpuidle/state3/default_status:enabled
pre-cst/cpuidle/state4/default_status:enabled

After the commit, the default_status file does not appear in /sys

--
Mel Gorman
SUSE Labs

2020-10-06 21:20:18

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Tue, Oct 06, 2020 at 09:29:24PM +0200, Rafael J. Wysocki wrote:
> > After the commit, the default_status file does not appear in /sys
> >
> Something is amiss, then, because the commit doesn't affect the presence of
> this file.
>

This was cleared up in another mail.

> The only thing it does is to set the use_acpi flag for several processor
> models in intel_idle.c.
>
> It can be effectively reversed by removing all of the ".use_acpi = true,"
> lines from intel_idle.c.
>
> In particular, please check if changing the value of use_acpi in struct
> idle_cpu_hsx from 'true' to 'false' alone (without reverting the commit)
> makes the issue go away in 5.9-rc8 (the default_status file should be
> present regardless).
>

Thanks. I applied the following

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 9a810e4a7946..6478347669a9 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1044,7 +1044,7 @@ static const struct idle_cpu idle_cpu_hsw __initconst = {
 static const struct idle_cpu idle_cpu_hsx __initconst = {
 	.state_table = hsw_cstates,
 	.disable_promotion_to_c1e = true,
-	.use_acpi = true,
+	.use_acpi = false,
 };

netperf UDP_STREAM
pre enable enable 5.9-rc8 5.9-rc8
cst cst cst-no-hsx-acpi vanilla no-hsx-acpi
Hmean send-64 203.96 ( 0.00%) 179.23 * -12.13%* 201.04 ( -1.44%) 203.24 ( -0.36%) 233.43 * 14.45%*
Hmean send-128 403.66 ( 0.00%) 355.99 * -11.81%* 402.28 ( -0.34%) 387.65 * -3.97%* 461.47 * 14.32%*
Hmean send-256 784.39 ( 0.00%) 697.78 * -11.04%* 782.15 ( -0.29%) 758.49 * -3.30%* 895.31 * 14.14%*
Hmean recv-64 203.96 ( 0.00%) 179.23 * -12.13%* 201.04 ( -1.44%) 203.24 ( -0.36%) 233.43 * 14.45%*
Hmean recv-128 403.66 ( 0.00%) 355.99 * -11.81%* 402.28 ( -0.34%) 387.65 * -3.97%* 461.47 * 14.32%*
Hmean recv-256 784.39 ( 0.00%) 697.78 * -11.04%* 782.15 ( -0.29%) 758.49 * -3.30%* 895.28 * 14.14%*

This is a more limited run to save time but is enough to illustrate
the point.

pre-cst is just before your patch
enable-cst is your patch that was bisected
enable-cst-no-hsx-acpi is your patch with use_acpi disabled
5.9-rc8-vanilla is what it sounds like
5.9-rc8-no-hsx-acpi disables use_acpi

The enable-cst-no-hsx-acpi result indicates that use_acpi was the issue for
Haswell (at least on these machines). Looking just at 5.9-rc8-vanilla might
have been misleading because its performance is not far off the baseline
due to unrelated changes that mostly offset the performance penalty.

The key question is -- how appropriate would it be to disable acpi for
Haswell? Would that be generally safe or could it hide other surprises?

--
Mel Gorman
SUSE Labs

2020-10-06 22:12:51

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On 10/6/2020 9:03 PM, Mel Gorman wrote:
> On Tue, Oct 06, 2020 at 06:00:18PM +0200, Rafael J. Wysocki wrote:
>>> server systems") and enable-cst is the commit. It was not fixed by 5.6 or
>>> 5.9-rc8. A lot of bisections ended up here including kernel compilation,
>>> tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.
>>>
>>> What I don't understand is why. The C-state exit latencies are
>>> identical before and after the patch:
>>>
>>> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
>>> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
>>> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
>>> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
>>> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
>>>
>>> Perf profiles did not show anything interesting. A diff of
>>> /sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
>>> also showed nothing interesting. Any idea why exactly this patch shows up
>>> as being hazardous on Haswell in particular?
>>>
>> Presumably, some of the idle states are disabled by default on the affected
>> machines.
>>
>> Can you check the disable and default_status attributes of each state before
>> and after the commit in question?
>>
> # grep . pre-cst/cpuidle/state*/disable
> pre-cst/cpuidle/state0/disable:0
> pre-cst/cpuidle/state1/disable:0
> pre-cst/cpuidle/state2/disable:0
> pre-cst/cpuidle/state3/disable:0
> pre-cst/cpuidle/state4/disable:0
> # grep . enable-cst/cpuidle/state*/disable
> enable-cst/cpuidle/state0/disable:0
> enable-cst/cpuidle/state1/disable:0
> enable-cst/cpuidle/state2/disable:0
> enable-cst/cpuidle/state3/disable:0
> enable-cst/cpuidle/state4/disable:0
> # grep . pre-cst/cpuidle/state*/default_status
> pre-cst/cpuidle/state0/default_status:enabled
> pre-cst/cpuidle/state1/default_status:enabled
> pre-cst/cpuidle/state2/default_status:enabled
> pre-cst/cpuidle/state3/default_status:enabled
> pre-cst/cpuidle/state4/default_status:enabled
>
> After the commit, the default_status file does not appear in /sys
>
Something is amiss, then, because the commit doesn't affect the presence
of this file.

The only thing it does is to set the use_acpi flag for several processor
models in intel_idle.c.

It can be effectively reversed by removing all of the ".use_acpi =
true," lines from intel_idle.c.

In particular, please check if changing the value of use_acpi in struct
idle_cpu_hsx from 'true' to 'false' alone (without reverting the commit)
makes the issue go away in 5.9-rc8 (the default_status file should be
present regardless).

Cheers!


2020-10-06 22:25:57

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Tue, Oct 06, 2020 at 08:03:22PM +0100, Mel Gorman wrote:
> On Tue, Oct 06, 2020 at 06:00:18PM +0200, Rafael J. Wysocki wrote:
> > > server systems") and enable-cst is the commit. It was not fixed by 5.6 or
> > > 5.9-rc8. A lot of bisections ended up here including kernel compilation,
> > > tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.
> > >
> > > What I don't understand is why. The C-state exit latencies are
> > > identical before and after the patch:
> > >
> > > /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
> > > /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
> > > /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
> > > /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
> > > /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
> > >
> > > Perf profiles did not show anything interesting. A diff of
> > > /sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
> > > also showed nothing interesting. Any idea why exactly this patch shows up
> > > as being hazardous on Haswell in particular?
> > >
> > Presumably, some of the idle states are disabled by default on the affected
> > machines.
> >
> > Can you check the disable and default_status attributes of each state before
> > and after the commit in question?
> >
>
> # grep . pre-cst/cpuidle/state*/disable

Sorry, this is a second attempt after realising the first results made
no sense at all. It turns out I fat-fingered setting up the enable-cst
kernel the second time to collect what you asked for, and the patch was
not applied at all.

# grep . pre-cst/cpuidle/state*/disable
pre-cst/cpuidle/state0/disable:0
pre-cst/cpuidle/state1/disable:0
pre-cst/cpuidle/state2/disable:0
pre-cst/cpuidle/state3/disable:0
pre-cst/cpuidle/state4/disable:0
# grep . pre-cst/cpuidle/state*/default_status
pre-cst/cpuidle/state0/default_status:enabled
pre-cst/cpuidle/state1/default_status:enabled
pre-cst/cpuidle/state2/default_status:enabled
pre-cst/cpuidle/state3/default_status:enabled
pre-cst/cpuidle/state4/default_status:enabled
# grep . enable-cst/cpuidle/state*/disable
enable-cst/cpuidle/state0/disable:0
enable-cst/cpuidle/state1/disable:0
enable-cst/cpuidle/state2/disable:0
enable-cst/cpuidle/state3/disable:1
enable-cst/cpuidle/state4/disable:1
# grep . enable-cst/cpuidle/state*/default_status
enable-cst/cpuidle/state0/default_status:enabled
enable-cst/cpuidle/state1/default_status:enabled
enable-cst/cpuidle/state2/default_status:enabled
enable-cst/cpuidle/state3/default_status:disabled
enable-cst/cpuidle/state4/default_status:disabled

That looks like C3 and C6 are disabled after the patch.

# grep . enable-cst/cpuidle/state*/name
enable-cst/cpuidle/state0/name:POLL
enable-cst/cpuidle/state1/name:C1
enable-cst/cpuidle/state2/name:C1E
enable-cst/cpuidle/state3/name:C3
enable-cst/cpuidle/state4/name:C6

--
Mel Gorman
SUSE Labs

2020-10-07 16:19:31

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On 10/6/2020 11:18 PM, Mel Gorman wrote:
> On Tue, Oct 06, 2020 at 09:29:24PM +0200, Rafael J. Wysocki wrote:
>>> After the commit, the default_status file does not appear in /sys
>>>
>> Something is amiss, then, because the commit doesn't affect the presence of
>> this file.
>>
> This was cleared up in another mail.
>
>> The only thing it does is to set the use_acpi flag for several processor
>> models in intel_idle.c.
>>
>> It can be effectively reversed by removing all of the ".use_acpi = true,"
>> lines from intel_idle.c.
>>
>> In particular, please check if changing the value of use_acpi in struct
>> idle_cpu_hsx from 'true' to 'false' alone (without reverting the commit)
>> makes the issue go away in 5.9-rc8 (the default_status file should be
>> present regardless).
>>
> Thanks. I applied the following
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 9a810e4a7946..6478347669a9 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -1044,7 +1044,7 @@ static const struct idle_cpu idle_cpu_hsw __initconst = {
>  static const struct idle_cpu idle_cpu_hsx __initconst = {
>  	.state_table = hsw_cstates,
>  	.disable_promotion_to_c1e = true,
> -	.use_acpi = true,
> +	.use_acpi = false,
>  };
>
> netperf UDP_STREAM
> pre enable enable 5.9-rc8 5.9-rc8
> cst cst cst-no-hsx-acpi vanilla no-hsx-acpi
> Hmean send-64 203.96 ( 0.00%) 179.23 * -12.13%* 201.04 ( -1.44%) 203.24 ( -0.36%) 233.43 * 14.45%*
> Hmean send-128 403.66 ( 0.00%) 355.99 * -11.81%* 402.28 ( -0.34%) 387.65 * -3.97%* 461.47 * 14.32%*
> Hmean send-256 784.39 ( 0.00%) 697.78 * -11.04%* 782.15 ( -0.29%) 758.49 * -3.30%* 895.31 * 14.14%*
> Hmean recv-64 203.96 ( 0.00%) 179.23 * -12.13%* 201.04 ( -1.44%) 203.24 ( -0.36%) 233.43 * 14.45%*
> Hmean recv-128 403.66 ( 0.00%) 355.99 * -11.81%* 402.28 ( -0.34%) 387.65 * -3.97%* 461.47 * 14.32%*
> Hmean recv-256 784.39 ( 0.00%) 697.78 * -11.04%* 782.15 ( -0.29%) 758.49 * -3.30%* 895.28 * 14.14%*
>
> This is a more limited run to save time but is enough to illustrate
> the point.
>
> pre-cst is just before your patch
> enable-cst is your patch that was bisected
> enable-cst-no-hsx-acpi is your patch with use_acpi disabled
> 5.9-rc8-vanilla is what it sounds like
> 5.9-rc8-no-hsx-acpi disables use_acpi
>
> The enable-cst-no-hsx-acpi result indicates that use_acpi was the issue for
> Haswell (at least on these machines). Looking just at 5.9-rc8-vanilla might
> have been misleading because its performance is not far off the baseline
> due to unrelated changes that mostly offset the performance penalty.
>
> The key question is -- how appropriate would it be to disable acpi for
> Haswell? Would that be generally safe or could it hide other surprises?
>
It should be safe, but let's try to do something more fine-grained.

There is the CPUIDLE_FLAG_ALWAYS_ENABLE flag that is set for C1E. Can
you please try to set it for C6 in hsw_cstates instead of clearing
use_acpi in idle_cpu_hsx and retest?
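
For reference, the change being suggested amounts to one extra flag on
the C6 entry in hsw_cstates. A minimal sketch against the 5.9-era table
(values match the sysfs output earlier in the thread):

	/*
	 * Sketch only: keep C6 enabled even when ACPI _CST would disable
	 * it by default, mirroring how C1E is already treated.
	 */
	{
		.name = "C6",
		.desc = "MWAIT 0x20",
		.flags = MWAIT2flg(0x20) | CPUIDLE_FLAG_TLB_FLUSHED |
			 CPUIDLE_FLAG_ALWAYS_ENABLE,
		.exit_latency = 133,
		.target_residency = 400,
		.enter = &intel_idle,
		.enter_s2idle = intel_idle_s2idle, },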


2020-10-07 16:21:14

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On 10/6/2020 9:47 PM, Mel Gorman wrote:
> On Tue, Oct 06, 2020 at 08:03:22PM +0100, Mel Gorman wrote:
>> On Tue, Oct 06, 2020 at 06:00:18PM +0200, Rafael J. Wysocki wrote:
>>>> server systems") and enable-cst is the commit. It was not fixed by 5.6 or
>>>> 5.9-rc8. A lot of bisections ended up here including kernel compilation,
>>>> tbench, syscall entry/exit microbenchmark, hackbench, Java workloads etc.
>>>>
>>>> What I don't understand is why. The C-state exit latencies are
>>>> identical before and after the patch:
>>>>
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state0/latency:0
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state1/latency:2
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state2/latency:10
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state3/latency:33
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state4/latency:133
>>>>
>>>> Perf profiles did not show anything interesting. A diff of
>>>> /sys/devices/system/cpu/cpu0/cpuidle/state0/ before and after the patch
>>>> also showed nothing interesting. Any idea why exactly this patch shows up
>>>> as being hazardous on Haswell in particular?
>>>>
>>> Presumably, some of the idle states are disabled by default on the affected
>>> machines.
>>>
>>> Can you check the disable and default_status attributes of each state before
>>> and after the commit in question?
>>>
>> # grep . pre-cst/cpuidle/state*/disable
> Sorry, this is a second attempt after realising the first results made
> no sense at all. It turns out I fat-fingered setting up the enable-cst
> kernel the second time to collect what you asked for, and the patch was
> not applied at all.
>
> # grep . pre-cst/cpuidle/state*/disable
> pre-cst/cpuidle/state0/disable:0
> pre-cst/cpuidle/state1/disable:0
> pre-cst/cpuidle/state2/disable:0
> pre-cst/cpuidle/state3/disable:0
> pre-cst/cpuidle/state4/disable:0
> # grep . pre-cst/cpuidle/state*/default_status
> pre-cst/cpuidle/state0/default_status:enabled
> pre-cst/cpuidle/state1/default_status:enabled
> pre-cst/cpuidle/state2/default_status:enabled
> pre-cst/cpuidle/state3/default_status:enabled
> pre-cst/cpuidle/state4/default_status:enabled
> # grep . enable-cst/cpuidle/state*/disable
> enable-cst/cpuidle/state0/disable:0
> enable-cst/cpuidle/state1/disable:0
> enable-cst/cpuidle/state2/disable:0
> enable-cst/cpuidle/state3/disable:1
> enable-cst/cpuidle/state4/disable:1
> # grep . enable-cst/cpuidle/state*/default_status
> enable-cst/cpuidle/state0/default_status:enabled
> enable-cst/cpuidle/state1/default_status:enabled
> enable-cst/cpuidle/state2/default_status:enabled
> enable-cst/cpuidle/state3/default_status:disabled
> enable-cst/cpuidle/state4/default_status:disabled
>
> That looks like C3 and C6 are disabled after the patch.
>
> # grep . enable-cst/cpuidle/state*/name
> enable-cst/cpuidle/state0/name:POLL
> enable-cst/cpuidle/state1/name:C1
> enable-cst/cpuidle/state2/name:C1E
> enable-cst/cpuidle/state3/name:C3
> enable-cst/cpuidle/state4/name:C6
>
That's kind of unexpected and there may be two reasons for that.

First off, the MWAIT hints in the ACPI tables for C3 and C6 may be
different from the ones in the intel_idle internal table.

Second, the ACPI tables may only be listing C1.

Can you send me the acpidump output from the affected machine, please?
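
Both hypotheses come down to how intel_idle matches its built-in states
against _CST: a built-in state whose MWAIT hint is not found in the
extracted ACPI data is treated as off by default. An abridged sketch of
that check (based on the 5.9-era intel_idle.c):

/*
 * Abridged sketch: a state is disabled by default when _CST data was
 * extracted but does not list the state's MWAIT hint. Differing hints
 * for C3/C6, or a _CST listing only C1, would disable both states.
 */
static bool intel_idle_off_by_default(u32 mwait_hint)
{
	int cstate, limit;

	/* With no usable _CST data, nothing is disabled by default. */
	if (!acpi_state_table.count)
		return false;

	limit = min_t(int, CPUIDLE_STATE_MAX, acpi_state_table.count);
	for (cstate = 1; cstate < limit; cstate++) {
		if (acpi_state_table.states[cstate].address == mwait_hint)
			return false;
	}

	return true;
}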


2020-10-07 20:00:31

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Wed, Oct 07, 2020 at 05:40:43PM +0200, Rafael J. Wysocki wrote:
> > # grep . enable-cst/cpuidle/state*/disable
> > enable-cst/cpuidle/state0/disable:0
> > enable-cst/cpuidle/state1/disable:0
> > enable-cst/cpuidle/state2/disable:0
> > enable-cst/cpuidle/state3/disable:1
> > enable-cst/cpuidle/state4/disable:1
> > # grep . enable-cst/cpuidle/state*/default_status
> > enable-cst/cpuidle/state0/default_status:enabled
> > enable-cst/cpuidle/state1/default_status:enabled
> > enable-cst/cpuidle/state2/default_status:enabled
> > enable-cst/cpuidle/state3/default_status:disabled
> > enable-cst/cpuidle/state4/default_status:disabled
> >
> > That looks like C3 and C6 are disabled after the patch.
> >
> > # grep . enable-cst/cpuidle/state*/name
> > enable-cst/cpuidle/state0/name:POLL
> > enable-cst/cpuidle/state1/name:C1
> > enable-cst/cpuidle/state2/name:C1E
> > enable-cst/cpuidle/state3/name:C3
> > enable-cst/cpuidle/state4/name:C6
> >
>
> That's kind of unexpected and there may be two reasons for that.
>
> First off, the MWAIT hints in the ACPI tables for C3 and C6 may be different
> from the ones in the intel_idle internal table.
>
> Second, the ACPI tables may only be listing C1.
>
> Can you send me the acpidump output from the affected machine, please?
>

Attached.

Thanks.

--
Mel Gorman
SUSE Labs


Attachments:
acpidump.xz (90.55 kB)

2020-10-08 10:47:22

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Wed, Oct 07, 2020 at 05:45:30PM +0200, Rafael J. Wysocki wrote:
> > pre-cst is just before your patch
> > enable-cst is your patch that was bisected
> > enable-cst-no-hsx-acpi is your patch with use_acpi disabled
> > 5.9-rc8-vanilla is what it sounds like
> > 5.9-rc8-no-hsx-acpi disables use_acpi
> >
> > The enable-cst-no-hsx-acpi result indicates that use_acpi was the issue for
> > Haswell (at least on these machines). Looking just at 5.9-rc8-vanilla might
> > have been misleading because its performance is not far off the baseline
> > due to unrelated changes that mostly offset the performance penalty.
> >
> > The key question is -- how appropriate would it be to disable acpi for
> > Haswell? Would that be generally safe or could it hide other surprises?
> >
> It should be safe, but let's try to do something more fine-grained.
>
> There is the CPUIDLE_FLAG_ALWAYS_ENABLE flag that is set for C1E. Can you
> please try to set it for C6 in hsw_cstates instead of clearing use_acpi in
> idle_cpu_hsx and retest?
>

Performance-wise, always enabling C6 helps but it may be specific to
this workload. Looking across all tested kernels, I get:

netperf-udp
5.5.0 5.5.0-rc2 5.5.0-rc2 5.9.0-rc8 5.9.0-rc8 5.9.0-rc8
vanilla pre-cst enable-cst vanilla disable-acpi enable-c6
Hmean send-64 196.31 ( 0.00%) 208.56 * 6.24%* 181.15 * -7.72%* 199.84 * 1.80%* 235.09 * 19.76%* 234.79 * 19.60%*
Hmean send-128 391.75 ( 0.00%) 408.13 * 4.18%* 359.92 * -8.12%* 396.81 ( 1.29%) 469.44 * 19.83%* 465.55 * 18.84%*
Hmean send-256 776.38 ( 0.00%) 798.39 * 2.84%* 707.31 * -8.90%* 781.63 ( 0.68%) 917.19 * 18.14%* 905.06 * 16.57%*
Hmean send-1024 3019.64 ( 0.00%) 3099.00 * 2.63%* 2756.32 * -8.72%* 3017.06 ( -0.09%) 3509.84 * 16.23%* 3532.85 * 17.00%*
Hmean send-2048 5790.31 ( 0.00%) 6209.53 * 7.24%* 5394.42 * -6.84%* 5846.11 ( 0.96%) 6861.93 * 18.51%* 6852.08 * 18.34%*
Hmean send-3312 8909.98 ( 0.00%) 9483.92 * 6.44%* 8332.35 * -6.48%* 9047.52 * 1.54%* 10677.93 * 19.84%* 10509.41 * 17.95%*
Hmean send-4096 10517.63 ( 0.00%) 11044.19 * 5.01%* 9851.70 * -6.33%* 10914.24 * 3.77%* 12719.58 * 20.94%* 12731.06 * 21.04%*
Hmean send-8192 17355.48 ( 0.00%) 18344.50 * 5.70%* 15844.38 * -8.71%* 17690.46 ( 1.93%) 20777.97 * 19.72%* 20220.24 * 16.51%*
Hmean send-16384 28585.78 ( 0.00%) 28950.90 ( 1.28%) 25946.88 * -9.23%* 26643.69 * -6.79%* 30891.89 * 8.07%* 30701.46 * 7.40%*

The difference between always using ACPI and force-enabling C6 is
negligible in this case, but more on that later.

netperf-udp
5.9.0-rc8 5.9.0-rc8
disable-acpi enable-c6
Hmean send-64 235.09 ( 0.00%) 234.79 ( -0.13%)
Hmean send-128 469.44 ( 0.00%) 465.55 ( -0.83%)
Hmean send-256 917.19 ( 0.00%) 905.06 ( -1.32%)
Hmean send-1024 3509.84 ( 0.00%) 3532.85 ( 0.66%)
Hmean send-2048 6861.93 ( 0.00%) 6852.08 ( -0.14%)
Hmean send-3312 10677.93 ( 0.00%) 10509.41 * -1.58%*
Hmean send-4096 12719.58 ( 0.00%) 12731.06 ( 0.09%)
Hmean send-8192 20777.97 ( 0.00%) 20220.24 * -2.68%*
Hmean send-16384 30891.89 ( 0.00%) 30701.46 ( -0.62%)

The default status and enabled states differ.

For 5.9-rc8 vanilla, the default and disabled status for cstates are

./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:1
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:disabled

For use_acpi == false, all c-states are enabled

./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:enabled
./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled

Force enabling C6

./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled

Note that, as expected, C3 remains disabled when only C6 is forced
(state3 == C3, state4 == C6). This particular workload does not appear
to care, as it does not remain idle for long, but the exit latency
difference between C3 and C6 is large. A workload that idles for short
durations somewhere between the C1E and C3 exit latencies could take a
larger penalty exiting from C6 if the deeper C-state is selected for
idling.

./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/residency:20
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/residency:100
./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/residency:400

--
Mel Gorman
SUSE Labs

2020-10-08 18:49:12

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Thu, Oct 08, 2020 at 07:15:46PM +0200, Rafael J. Wysocki wrote:
> > Force enabling C6
> >
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled
> >
> > Note that, as expected, C3 remains disabled when only C6 is forced
> > (state3 == C3, state4 == C6). This particular workload does not appear
> > to care, as it does not remain idle for long, but the exit latency
> > difference between C3 and C6 is large. A workload that idles for short
> > durations somewhere between the C1E and C3 exit latencies could take a
> > larger penalty exiting from C6 if the deeper C-state is selected for
> > idling.
> >
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/residency:20
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/residency:100
> > ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/residency:400
> >
>
> If you are worried that C6 might be used instead of C3 in some cases, this
> is not going to happen.
>

Ok, so it goes in the C1E direction instead. I've lost track of how the
C-state is selected based on predictions about the future; it has
changed a bit over time.

> In all cases in which C3 would have been used had it not been disabled, C1E
> will be used instead.
>
> Which BTW indicates that using C1E more often adds a lot of latency to the
> workload (if C3 and C6 are both disabled, C1E is used in all cases in which
> one of them would have been used).

Which is weird. From the exit latency alone, I'd think it would be faster
to use C1E instead of C3. It implies that using C1E instead of C3/C6 has
some other side-effect on Haswell. At one point, there was plenty of advice
on disabling C1E but very little concrete information on what impact it
has exactly and why it might cause problems that other c-states avoid.

> With C6 enabled, that state is used at
> least sometimes (so C1E is used less often), but PC6 doesn't seem to be
> really used - it looks like only core C6 is entered, which may be why C6
> adds less latency than C1E (and analogously for C3).
>

At the moment, I'm happy with either solution but mostly because I can't
tell what other trade-offs should be considered :/

--
Mel Gorman
SUSE Labs

2020-10-08 20:28:02

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On 10/8/2020 11:09 AM, Mel Gorman wrote:
> On Wed, Oct 07, 2020 at 05:45:30PM +0200, Rafael J. Wysocki wrote:
>>> pre-cst is just before your patch
>>> enable-cst is your patch that was bisected
>>> enable-cst-no-hsx-acpi is your patch with use_acpi disabled
>>> 5.9-rc8-vanilla is what it sounds like
>>> 5.9-rc8-no-hsx-acpi disables use_acpi
>>>
>>> The enable-cst-no-hsx-acpi result indicates that use_acpi was the issue for
>>> Haswell (at least on these machines). Looking just at 5.9-rc8-vanilla might
>>> have been misleading because its performance is not far off the baseline
>>> due to unrelated changes that mostly offset the performance penalty.
>>>
>>> The key question is -- how appropriate would it be to disable acpi for
>>> Haswell? Would that be generally safe or could it hide other surprises?
>>>
>> It should be safe, but let's try to do something more fine-grained.
>>
>> There is the CPUIDLE_FLAG_ALWAYS_ENABLE flag that is set for C1E. Can you
>> please try to set it for C6 in hsw_cstates instead of clearing use_acpi in
>> idle_cpu_hsx and retest?
>>
> Performance-wise, always enabling C6 helps but it may be specific to
> this workload. Looking across all tested kernels, I get:
>
> netperf-udp
> 5.5.0 5.5.0-rc2 5.5.0-rc2 5.9.0-rc8 5.9.0-rc8 5.9.0-rc8
> vanilla pre-cst enable-cst vanilla disable-acpi enable-c6
> Hmean send-64 196.31 ( 0.00%) 208.56 * 6.24%* 181.15 * -7.72%* 199.84 * 1.80%* 235.09 * 19.76%* 234.79 * 19.60%*
> Hmean send-128 391.75 ( 0.00%) 408.13 * 4.18%* 359.92 * -8.12%* 396.81 ( 1.29%) 469.44 * 19.83%* 465.55 * 18.84%*
> Hmean send-256 776.38 ( 0.00%) 798.39 * 2.84%* 707.31 * -8.90%* 781.63 ( 0.68%) 917.19 * 18.14%* 905.06 * 16.57%*
> Hmean send-1024 3019.64 ( 0.00%) 3099.00 * 2.63%* 2756.32 * -8.72%* 3017.06 ( -0.09%) 3509.84 * 16.23%* 3532.85 * 17.00%*
> Hmean send-2048 5790.31 ( 0.00%) 6209.53 * 7.24%* 5394.42 * -6.84%* 5846.11 ( 0.96%) 6861.93 * 18.51%* 6852.08 * 18.34%*
> Hmean send-3312 8909.98 ( 0.00%) 9483.92 * 6.44%* 8332.35 * -6.48%* 9047.52 * 1.54%* 10677.93 * 19.84%* 10509.41 * 17.95%*
> Hmean send-4096 10517.63 ( 0.00%) 11044.19 * 5.01%* 9851.70 * -6.33%* 10914.24 * 3.77%* 12719.58 * 20.94%* 12731.06 * 21.04%*
> Hmean send-8192 17355.48 ( 0.00%) 18344.50 * 5.70%* 15844.38 * -8.71%* 17690.46 ( 1.93%) 20777.97 * 19.72%* 20220.24 * 16.51%*
> Hmean send-16384 28585.78 ( 0.00%) 28950.90 ( 1.28%) 25946.88 * -9.23%* 26643.69 * -6.79%* 30891.89 * 8.07%* 30701.46 * 7.40%*
>
> The difference between always using ACPI and force-enabling C6 is
> negligible in this case, but more on that later.
>
> netperf-udp
> 5.9.0-rc8 5.9.0-rc8
> disable-acpi enable-c6
> Hmean send-64 235.09 ( 0.00%) 234.79 ( -0.13%)
> Hmean send-128 469.44 ( 0.00%) 465.55 ( -0.83%)
> Hmean send-256 917.19 ( 0.00%) 905.06 ( -1.32%)
> Hmean send-1024 3509.84 ( 0.00%) 3532.85 ( 0.66%)
> Hmean send-2048 6861.93 ( 0.00%) 6852.08 ( -0.14%)
> Hmean send-3312 10677.93 ( 0.00%) 10509.41 * -1.58%*
> Hmean send-4096 12719.58 ( 0.00%) 12731.06 ( 0.09%)
> Hmean send-8192 20777.97 ( 0.00%) 20220.24 * -2.68%*
> Hmean send-16384 30891.89 ( 0.00%) 30701.46 ( -0.62%)
>
> The default status and enabled states differ.
>
> For 5.9-rc8 vanilla, the default and disabled status for cstates are
>
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:1
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
> ./5.9.0-rc8-vanilla/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:disabled
>
> For use_acpi == false, all c-states are enabled
>
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:0
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:enabled
> ./5.9.0-rc8-disable-acpi/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled
>
> Force enabling C6
>
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled
>
> Note that, as expected, C3 remains disabled when only C6 is forced
> (state3 == C3, state4 == C6). This particular workload does not appear
> to care, as it does not remain idle for long, but the exit latency
> difference between C3 and C6 is large. A workload that idles for short
> durations somewhere between the C1E and C3 exit latencies could take a
> larger penalty exiting from C6 if the deeper C-state is selected for
> idling.
>
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/residency:20
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/residency:100
> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/residency:400
>
If you are worried that C6 might be used instead of C3 in some cases,
this is not going to happen.

In all cases in which C3 would have been used had it not been disabled,
C1E will be used instead.

Which BTW indicates that using C1E more often adds a lot of latency to
the workload (if C3 and C6 are both disabled, C1E is used in all cases
in which one of them would have been used). With C6 enabled, that state
is used at least sometimes (so C1E is used less often), but PC6 doesn't
seem to be really used - it looks like only core C6 is entered, which
may be why C6 adds less latency than C1E (and analogously for C3).
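
The fallback described here follows from how the cpuidle governors pick
states: disabled states are skipped outright, so with C3 and C6 disabled
the deepest state that can be chosen is C1E. A simplified sketch of the
selection logic (loosely modelled on the menu governor; not the actual
kernel code):

/*
 * Simplified sketch of idle-state selection. Disabled states (C3/C6
 * here) are never candidates, so any idle period that would have used
 * them falls back to the deepest remaining enabled state, i.e. C1E.
 */
static int pick_idle_state(struct cpuidle_driver *drv,
			   struct cpuidle_device *dev,
			   u64 predicted_us, u64 latency_req_us)
{
	int i, idx = 0;

	for (i = 1; i < drv->state_count; i++) {
		struct cpuidle_state *s = &drv->states[i];

		if (dev->states_usage[i].disable)
			continue;	/* skip disabled C3/C6 */
		if (s->target_residency > predicted_us)
			break;		/* too deep for the predicted idle time */
		if (s->exit_latency > latency_req_us)
			break;		/* would violate the latency constraint */

		idx = i;		/* deepest state that still fits */
	}

	return idx;
}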


2020-10-14 06:23:09

by Wysocki, Rafael J

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On 10/8/2020 7:34 PM, Mel Gorman wrote:
> On Thu, Oct 08, 2020 at 07:15:46PM +0200, Rafael J. Wysocki wrote:
>>> Force enabling C6
>>>
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/disable:0
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/disable:0
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/disable:0
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/disable:1
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/disable:0
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/default_status:enabled
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/default_status:enabled
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/default_status:enabled
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/default_status:disabled
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/default_status:enabled
>>>
>>> Note that, as expected, C3 remains disabled when only C6 is forced
>>> (state3 == C3, state4 == C6). This particular workload does not appear
>>> to care, as it does not remain idle for long, but the exit latency
>>> difference between C3 and C6 is large. A workload that idles for short
>>> durations somewhere between the C1E and C3 exit latencies could take a
>>> larger penalty exiting from C6 if the deeper C-state is selected for
>>> idling.
>>>
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state0/residency:0
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state1/residency:2
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state2/residency:20
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state3/residency:100
>>> ./5.9.0-rc8-enable-c6/iter-0/sys/devices/system/cpu/cpu0/cpuidle/state4/residency:400
>>>
>> If you are worried that C6 might be used instead of C3 in some cases, this
>> is not going to happen.
>>
> Ok, so it goes in the C1E direction instead. I've lost track of how the
> C-state is selected based on predictions about the future; it has
> changed a bit over time.
>
>> In all cases in which C3 would have been used had it not been disabled, C1E
>> will be used instead.
>>
>> Which BTW indicates that using C1E more often adds a lot of latency to the
>> workload (if C3 and C6 are both disabled, C1E is used in all cases in which
>> one of them would have been used).
> Which is weird. From the exit latency alone, I'd think it would be faster
> to use C1E instead of C3. It implies that using C1E instead of C3/C6 has
> some other side-effect on Haswell. At one point, there was plenty of advice
> on disabling C1E but very little concrete information on what impact it
> has exactly and why it might cause problems that other c-states avoid.
>
>> With C6 enabled, that state is used at
>> least sometimes (so C1E is used less often), but PC6 doesn't seem to be
>> really used - it looks like only core C6 is entered, which may be why C6
>> adds less latency than C1E (and analogously for C3).
>>
> At the moment, I'm happy with either solution but mostly because I can't
> tell what other trade-offs should be considered :/
>
I talked to Len and Srinivas about this and my theory above didn't survive.

The most likely reason why you see a performance drop after enabling the
ACPI support (which effectively causes C3 and C6 to be disabled by
default on the affected machines) is that the benchmarks in question
require sufficiently high one-CPU performance and the CPUs cannot reach
high enough one-core turbo P-states without the other CPUs going into C6.

Inspection of the ACPI tables you sent me indicates that there is a BIOS
switch in that system allowing C6 to be enabled. Would it be possible
to check whether or not there is a BIOS setup option to change that setting?

Also, I need to know what happens if that system is started with
intel_idle disabled. That is, what idle states are visible in sysfs in
that configuration (what their names and descriptions are in particular)
and whether or not the issue is still present then.

In addition to that, can you please run the benchmark on that system
under turbostat both with unmodified intel_idle and with intel_idle
disabled and post the turbostat output in each of those cases?

Cheers!


2020-10-15 07:25:37

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Tue, Oct 13, 2020 at 08:55:26PM +0200, Rafael J. Wysocki wrote:
> > > With C6 enabled, that state is used at
> > > least sometimes (so C1E is used less often), but PC6 doesn't seem to be
> > > really used - it looks like core C6 only is entered and which may be why C6
> > > adds less latency than C1E (and analogously for C3).
> > >
> > At the moment, I'm happy with either solution but mostly because I can't
> > tell what other trade-offs should be considered :/
> >
>
> I talked to Len and Srinivas about this and my theory above didn't survive.
>
> The most likely reason why you see a performance drop after enabling the
> ACPI support (which effectively causes C3 and C6 to be disabled by default
> on the affected machines) is that the benchmarks in question require
> sufficiently high one-CPU performance and the CPUs cannot reach high enough
> one-core turbo P-states without the other CPUs going into C6.
>

That makes sense. It can also explain anomalies like server/client
pairs split across NUMA nodes with no other activity sometimes
performing better, because turbo states matter more than memory locality
for some benchmarks.

> Inspection of the ACPI tables you sent me indicates that there is a BIOS
> switch in that system allowing C6 to be enabled. Would it be possible to
> check whether or not there is a BIOS setup option to change that setting?
>

Yes, it's well hidden but it's there. If the profile is made custom,
then the p-states can be selected, and "custom" by default enables C6
but not C3 (there is a note saying that it's not recommended for that
CPU). If I then switch it back to the normal profile, the c-states are
not restored, so this is a one-way trip even if you disable the c-state
in custom, reboot, switch back, and reboot. The same happens if the
machine is reset to "optimal default settings". Yey for BIOS developers.

This means I have a limited number of attempts to do something about
this. Two machines can no longer reproduce the problem reliably.

However, that also suggests a "solution" is to enable the state on each
of these machines, discard the historical results and carry on. In
practice, if reports are received either upstream or in distros about
strange behaviour on old machines when upgrading, then first check the
c-states and the CPU generation. Given long enough, the machines with
that specific CPU and a crappy BIOS will phase out :/

> Also, I need to know what happens if that system is started with intel_idle
> disabled. That is, what idle states are visible in sysfs in that
> configuration (what their names and descriptions are in particular) and
> whether or not the issue is still present then.
>

Kernel name c-state exit latency disabled? default?
----------- ------ ------------ --------- --------
5.9-vanilla POLL latency:0 disabled:0 default:enabled
5.9-vanilla C1 latency:2 disabled:0 default:enabled
5.9-vanilla C1E latency:10 disabled:0 default:enabled
5.9-vanilla C3 latency:33 disabled:1 default:disabled
5.9-vanilla C6 latency:133 disabled:0 default:enabled
5.9-intel_idle-disabled POLL latency:0 disabled:0 default:enabled
5.9-intel_idle-disabled C1 latency:1 disabled:0 default:enabled
5.9-intel_idle-disabled C2 latency:41 disabled:0 default:enabled
5.9-acpi-disable POLL latency:0 disabled:0 default:enabled
5.9-acpi-disable C1 latency:2 disabled:0 default:enabled
5.9-acpi-disable C1E latency:10 disabled:0 default:enabled
5.9-acpi-disable C3 latency:33 disabled:0 default:enabled
5.9-acpi-disable C6 latency:133 disabled:0 default:enabled
5.9-custom-powerprofile POLL latency:0 disabled:0 default:enabled
5.9-custom-powerprofile C1 latency:2 disabled:0 default:enabled
5.9-custom-powerprofile C1E latency:10 disabled:0 default:enabled
5.9-custom-powerprofile C3 latency:33 disabled:1 default:disabled
5.9-custom-powerprofile C6 latency:133 disabled:0 default:enabled

In this case, the test results are similar. I vaguely recall the BIOS
was reconfigured on the second machine for unrelated reasons and it's no
longer able to reproduce the problem properly. As the results show little
difference, I didn't include the turbostat figures. The summary from
mmtests showed this:

                        5.9                 5.9          5.9                 5.9
                    vanilla intel_idle-disabled acpi-disable custom-powerprofile
Hmean Avg_MHz 8.31 8.29 8.35 8.26
Hmean Busy% 0.58 0.56 0.58 0.57
Hmean CPU%c1 3.19 40.81 3.14 3.18
Hmean CPU%c3 0.00 0.00 0.00 0.00
Hmean CPU%c6 92.24 0.00 92.21 92.20
Hmean CPU%c7 0.00 0.00 0.00 0.00
Hmean PkgWatt 47.98 0.00 47.95 47.68

i.e., the average time in C6 was high on the vanilla kernel, whereas it
would not have been when the problem was originally reproduced (I
clearly broke this test machine in a way I can't "fix"). Disabling
intel_idle kept it mostly in the C1 state.

I'll try a third machine tomorrow but even if I reproduce the problem,
I won't be able to do it again because ... yey, BIOS developers.

--
Mel Gorman
SUSE Labs

2020-10-15 23:11:04

by Mel Gorman

Subject: Re: ACPI _CST introduced performance regressions on Haswell

> Yes, it's well hidden but it's there. If the profile is made custom,
> then the p-states can be selected, and "custom" by default enables C6
> but not C3 (there is a note saying that it's not recommended for that
> CPU). If I then switch it back to the normal profile, the c-states are
> not restored, so this is a one-way trip even if you disable the c-state
> in custom, reboot, switch back, and reboot. The same happens if the
> machine is reset to "optimal default settings". Yey for BIOS developers.
>
> This means I have a limited number of attempts to do something about
> this. Two machines can no longer reproduce the problem reliably.
>

Turns out I didn't even have that. On another machine (same model,
same CPU, different BIOS that cannot be updated), enabling the C6 state
still did not enable it on boot and dmesg complained about _CST not
being usable. This is weird because one would expect that if _CST was
unusable, it would be the same as use_acpi == false.

This could potentially happen if the ACPI tables are unsuitable due to
bad FFH information for a lower c-state. If _CST is not found or not
usable, should acpi_state_table.count be reset to go back to the old
behaviour?

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 13600c403035..3b84f8631b40 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1261,6 +1261,7 @@ static bool intel_idle_acpi_cst_extract(void)
 		return true;
 	}

+	acpi_state_table.count = 0;
 	pr_debug("ACPI _CST not found or not usable\n");
 	return false;
 }

--
Mel Gorman
SUSE Labs

2020-10-16 15:33:17

by Rafael J. Wysocki

Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Thu, Oct 15, 2020 at 8:34 PM Mel Gorman <[email protected]> wrote:
>
> > Yes, it's well hidden but it's there. If the profile is made custom,
> > then the p-states can be selected, and "custom" by default enables C6
> > but not C3 (there is a note saying that it's not recommended for that
> > CPU). If I then switch it back to the normal profile, the c-states are
> > not restored, so this is a one-way trip even if you disable the c-state
> > in custom, reboot, switch back, and reboot. The same happens if the
> > machine is reset to "optimal default settings". Yey for BIOS developers.
> >
> > This means I have a limited number of attempts to do something about
> > this. Two machines can no longer reproduce the problem reliably.
> >
>
> Turns out I didn't even have that. On another machine (same model,
> same CPU, different BIOS that cannot be updated), enabling the C6 state
> still did not enable it on boot and dmesg complained about _CST not
> being usable. This is weird because one would expect that if _CST was
> unusable, it would be the same as use_acpi == false.
>
> This could potentially happen if the ACPI tables are unsuitable due to
> bad FFH information for a lower c-state. If _CST is not found or not
> usable, should acpi_state_table.count be reset to go back to the old
> behaviour?

Yes, it should, although I would reset it in intel_idle_cst_usable()
right away before returning 'false'.

I can send a patch to do the above, or you can submit your patch quoted
below, as it works too.

> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 13600c403035..3b84f8631b40 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -1261,6 +1261,7 @@ static bool intel_idle_acpi_cst_extract(void)
>  		return true;
>  	}
> 
> +	acpi_state_table.count = 0;
>  	pr_debug("ACPI _CST not found or not usable\n");
>  	return false;
>  }
>
> --

2020-10-16 16:20:06

by Rafael J. Wysocki

[permalink] [raw]
Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Fri, Oct 16, 2020 at 4:09 PM Mel Gorman <[email protected]> wrote:
>
> On Fri, Oct 16, 2020 at 03:41:12PM +0200, Rafael J. Wysocki wrote:
> > > Turns out I didn't even have that. On another machine (same model,
> > > same CPU, different BIOS that cannot be updated), enabling the C6 state
> > > still did not enable it on boot and dmesg complained about _CST not
> > > being usable. This is weird because one would expect that if _CST was
> > > unusable, it would be the same as use_acpi == false.
> > >
> > > This could potentially happen if the ACPI tables are unsuitable due to
> > > bad FFH information for a lower c-state. If _CST is not found or not
> > > usable, should acpi_state_table.count be reset to go back to the old
> > > behaviour?
> >
> > Yes, it should, although I would reset it in intel_idle_cst_usable()
> > right away before returning 'false'.
> >
>
> Good stuff.
>
> > I can send a patch to do the above, or you can submit your patch quoted
> > below, as it works too.
> >
>
> I'm happy to go with your alternative if you prefer. For a finish,
> I decided it was worth reporting whenever the _CST is ignored,
> regardless of the reason. It performs roughly the same as setting
> use_acpi = false on the affected machines.
>
> ---8<---
> intel_idle: Ignore _CST if control cannot be taken from the platform
>
> Commit e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems")
> avoids enabling c-states that have been disabled by the platform, with
> the exception of C1E.
>
> Unfortunately, BIOS implementations are not always consistent in how
> capabilities are advertised, and control cannot always be handed over.
> If control cannot be handed over, intel_idle reports "ACPI _CST not
> found or not usable" but does not clear acpi_state_table.count, meaning
> the information is still partially used.
>
> This patch ignores the ACPI information if _CST control cannot be
> requested from the platform. This was only observed on a number of
> Haswell platforms that had identical CPUs but not identical BIOS
> versions. While this problem may be rare overall, 24 separate test
> cases bisected to this specific commit across 4 separate test machines.
> If the situation occurs, the kernel behaves as it did before commit
> e6d4f08a6776 and uses any c-states that are discovered.
>
> The affected test cases all involved a small number of processes --
> the exec microbenchmark, the pipe microbenchmark, the git test suite,
> netperf, tbench with one client and a system call microbenchmark. Each
> case benefits from being able to use turbo boost, which is prevented if
> the lower c-states are unavailable. This may mask real regressions
> specific to older hardware, so it is worth addressing.
>
> C-state status before and after the patch
>
> 5.9.0-vanilla POLL latency:0 disabled:0 default:enabled
> 5.9.0-vanilla C1 latency:2 disabled:0 default:enabled
> 5.9.0-vanilla C1E latency:10 disabled:0 default:enabled
> 5.9.0-vanilla C3 latency:33 disabled:1 default:disabled
> 5.9.0-vanilla C6 latency:133 disabled:1 default:disabled
> 5.9.0-ignore-cst-v1r1 POLL latency:0 disabled:0 default:enabled
> 5.9.0-ignore-cst-v1r1 C1 latency:2 disabled:0 default:enabled
> 5.9.0-ignore-cst-v1r1 C1E latency:10 disabled:0 default:enabled
> 5.9.0-ignore-cst-v1r1 C3 latency:33 disabled:0 default:enabled
> 5.9.0-ignore-cst-v1r1 C6 latency:133 disabled:0 default:enabled
>
> Patch enables C3/C6.
>
> Netperf UDP_STREAM
>
> netperf-udp
> 5.5.0 5.9.0
> vanilla ignore-cst-v1r1
> Hmean send-64 193.41 ( 0.00%) 226.54 * 17.13%*
> Hmean send-128 392.16 ( 0.00%) 450.54 * 14.89%*
> Hmean send-256 769.94 ( 0.00%) 881.85 * 14.53%*
> Hmean send-1024 2994.21 ( 0.00%) 3468.95 * 15.85%*
> Hmean send-2048 5725.60 ( 0.00%) 6628.99 * 15.78%*
> Hmean send-3312 8468.36 ( 0.00%) 10288.02 * 21.49%*
> Hmean send-4096 10135.46 ( 0.00%) 12387.57 * 22.22%*
> Hmean send-8192 17142.07 ( 0.00%) 19748.11 * 15.20%*
> Hmean send-16384 28539.71 ( 0.00%) 30084.45 * 5.41%*
>
> Fixes: e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems")
> Signed-off-by: Mel Gorman <[email protected]>
> ---
> drivers/idle/intel_idle.c | 7 +++----
> 1 file changed, 3 insertions(+), 4 deletions(-)
>
> diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
> index 9a810e4a7946..4af2d3f2c8aa 100644
> --- a/drivers/idle/intel_idle.c
> +++ b/drivers/idle/intel_idle.c
> @@ -1212,14 +1212,13 @@ static bool __init intel_idle_acpi_cst_extract(void)
>  		if (!intel_idle_cst_usable())
>  			continue;
> 
> -		if (!acpi_processor_claim_cst_control()) {
> -			acpi_state_table.count = 0;
> -			return false;
> -		}
> +		if (!acpi_processor_claim_cst_control())
> +			break;
> 
>  		return true;
>  	}
> 
> +	acpi_state_table.count = 0;
>  	pr_debug("ACPI _CST not found or not usable\n");
>  	return false;
>  }

Applied as a fix for 5.10-rc1, thanks!

2020-10-16 18:43:59

by Mel Gorman

[permalink] [raw]
Subject: Re: ACPI _CST introduced performance regressions on Haswell

On Fri, Oct 16, 2020 at 03:41:12PM +0200, Rafael J. Wysocki wrote:
> > Turns out I didn't even have that. On another machine (same model,
> > same CPU, different BIOS that cannot be updated), enabling the C6 state
> > still did not enable it on boot and dmesg complained about _CST not
> > being usable. This is weird because one would expect that if _CST was
> > unusable, it would be the same as use_acpi == false.
> >
> > This could potentially happen if the ACPI tables are unsuitable due to
> > bad FFH information for a lower c-state. If _CST is not found or not
> > usable, should acpi_state_table.count be reset to go back to the old
> > behaviour?
>
> Yes, it should, although I would reset it in intel_idle_cst_usable()
> right away before returning 'false'.
>

Good stuff.

> I can send a patch to do the above, or you can submit your patch quoted
> below, as it works too.
>

I'm happy to go with your alternative if you prefer. For a finish,
I decided it was worth reporting whenever the _CST is ignored,
regardless of the reason. It performs roughly the same as setting
use_acpi = false on the affected machines.

---8<---
intel_idle: Ignore _CST if control cannot be taken from the platform

Commit e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems")
avoids enabling c-states that have been disabled by the platform, with
the exception of C1E.

Unfortunately, BIOS implementations are not always consistent in how
capabilities are advertised, and control cannot always be handed over.
If control cannot be handed over, intel_idle reports "ACPI _CST not
found or not usable" but does not clear acpi_state_table.count, meaning
the information is still partially used.

This patch ignores the ACPI information if _CST control cannot be
requested from the platform. This was only observed on a number of
Haswell platforms that had identical CPUs but not identical BIOS
versions. While this problem may be rare overall, 24 separate test
cases bisected to this specific commit across 4 separate test machines.
If the situation occurs, the kernel behaves as it did before commit
e6d4f08a6776 and uses any c-states that are discovered.

The affected test cases all involved a small number of processes --
the exec microbenchmark, the pipe microbenchmark, the git test suite,
netperf, tbench with one client and a system call microbenchmark. Each
case benefits from being able to use turbo boost, which is prevented if
the lower c-states are unavailable. This may mask real regressions
specific to older hardware, so it is worth addressing.

C-state status before and after the patch

5.9.0-vanilla POLL latency:0 disabled:0 default:enabled
5.9.0-vanilla C1 latency:2 disabled:0 default:enabled
5.9.0-vanilla C1E latency:10 disabled:0 default:enabled
5.9.0-vanilla C3 latency:33 disabled:1 default:disabled
5.9.0-vanilla C6 latency:133 disabled:1 default:disabled
5.9.0-ignore-cst-v1r1 POLL latency:0 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C1 latency:2 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C1E latency:10 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C3 latency:33 disabled:0 default:enabled
5.9.0-ignore-cst-v1r1 C6 latency:133 disabled:0 default:enabled

Patch enables C3/C6.

Netperf UDP_STREAM

netperf-udp
5.5.0 5.9.0
vanilla ignore-cst-v1r1
Hmean send-64 193.41 ( 0.00%) 226.54 * 17.13%*
Hmean send-128 392.16 ( 0.00%) 450.54 * 14.89%*
Hmean send-256 769.94 ( 0.00%) 881.85 * 14.53%*
Hmean send-1024 2994.21 ( 0.00%) 3468.95 * 15.85%*
Hmean send-2048 5725.60 ( 0.00%) 6628.99 * 15.78%*
Hmean send-3312 8468.36 ( 0.00%) 10288.02 * 21.49%*
Hmean send-4096 10135.46 ( 0.00%) 12387.57 * 22.22%*
Hmean send-8192 17142.07 ( 0.00%) 19748.11 * 15.20%*
Hmean send-16384 28539.71 ( 0.00%) 30084.45 * 5.41%*

Fixes: e6d4f08a6776 ("intel_idle: Use ACPI _CST on server systems")
Signed-off-by: Mel Gorman <[email protected]>
---
drivers/idle/intel_idle.c | 7 +++----
1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index 9a810e4a7946..4af2d3f2c8aa 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1212,14 +1212,13 @@ static bool __init intel_idle_acpi_cst_extract(void)
 		if (!intel_idle_cst_usable())
 			continue;
 
-		if (!acpi_processor_claim_cst_control()) {
-			acpi_state_table.count = 0;
-			return false;
-		}
+		if (!acpi_processor_claim_cst_control())
+			break;
 
 		return true;
 	}
 
+	acpi_state_table.count = 0;
 	pr_debug("ACPI _CST not found or not usable\n");
 	return false;
 }