Currently cpuidle assumes worst-case C-state parameters, and so C6
is described with PC6 parameters, which is worst case for requesting
CC6. When PC6 is enabled, this is appropriate. But if PC6 is disabled
in BIOS, the exit latency and target_residency should be adjusted
accordingly.
Exit latency:
The C6 exit latency is measured when woken up from CC6/PC6. In the past,
if PC6 is disabled, CPU would be demoted to CC6/PC3, which is close to
the latency from CC6/PC6 and there is no need to update the C6 exit latency.
However on newer platform there is no CC3/PC3 anymore, then the C6 exit
latency with PC6 disabled should be CC6/PC0.
Target residency:
With PC6 disabled and C3/PC3 supported, the OS requests C3 if idle
duration is within [CC6, PC6) target_residency. On new CPU generations
with C3/PC3 deprecated, the OS would request C1E. This would cause
low energy-efficiency. In summary, the question is, should we lower
the bar to request C6 when PC6 is disabled? The answer is yes.
To fill this gap, check if PC6 is disabled in the BIOS in the
MSR_PKG_CST_CONFIG_CONTROL(0xe2). If so, use CC6/PC0 parameters as the
new exit latency. Meanwhile, update target_residency to 3 times of the new
exit latency. This is consistent with how intel_idle driver uses _CST to
calculate the target_residency. The consequence is that, the OS would
be more offen to choose C6 over C1E when PC6 is disabled. The new exit
latency of CC6/PC0 was from wult[1] result, which was measured via NIC
wakeup from 99.99th latency.
Before the change:
PC6 enabled Y N N
has C3/PC3 N N Y
idle duration [CC6,PC6) [CC6,PC6) [CC6,PC6)
CPU request C1E C1E C3
After the change:
PC6 enabled Y N N
has C3/PC3 N N Y
idle duration [CC6,PC6) [CC6,PC6) [CC6,PC6)
CPU request C1E *C6* C3
*C6* rather than C1E is chosen.
There is concern that if we should introduce a more generic solution
rather than optimizing on each platforms. However consider the
code complexity and different PC6 bit interpretation on different
platforms, tune the code per platform seems to be an acceptable trade-off.
[1] https://intel.github.io/wult/
Suggested-by: Len Brown <[email protected]>
Signed-off-by: Chen Yu <[email protected]>
---
drivers/idle/intel_idle.c | 33 +++++++++++++++++++++++++++++++++
1 file changed, 33 insertions(+)
diff --git a/drivers/idle/intel_idle.c b/drivers/idle/intel_idle.c
index ec1b9d306ba6..e6c543b5ee1d 100644
--- a/drivers/idle/intel_idle.c
+++ b/drivers/idle/intel_idle.c
@@ -1484,6 +1484,36 @@ static void __init sklh_idle_state_table_update(void)
skl_cstates[6].flags |= CPUIDLE_FLAG_UNUSABLE; /* C9-SKL */
}
+/**
+ * skx_idle_state_table_update - Adjust the Sky Lake/Cascade Lake
+ * idle states table.
+ */
+static void __init skx_idle_state_table_update(void)
+{
+ unsigned long long msr;
+
+ rdmsrl(MSR_PKG_CST_CONFIG_CONTROL, msr);
+
+ /*
+ * 000b: C0/C1 (no package C-state support)
+ * 001b: C2
+ * 010b: C6 (non-retention)
+ * 011b: C6 (retention)
+ * 111b: No Package C state limits.
+ */
+ if ((msr & 0x7) < 2) {
+ /*
+ * Uses the CC6 + PC0 latency and 3 times of
+ * latency for target_residency if the PC6
+ * is disabled in BIOS. This is consistent
+ * with how intel_idle driver uses _CST
+ * to set the target_residency.
+ */
+ skx_cstates[2].exit_latency = 92;
+ skx_cstates[2].target_residency = 276;
+ }
+}
+
static bool __init intel_idle_verify_cstate(unsigned int mwait_hint)
{
unsigned int mwait_cstate = MWAIT_HINT2CSTATE(mwait_hint) + 1;
@@ -1515,6 +1545,9 @@ static void __init intel_idle_init_cstates_icpu(struct cpuidle_driver *drv)
case INTEL_FAM6_SKYLAKE:
sklh_idle_state_table_update();
break;
+ case INTEL_FAM6_SKYLAKE_X:
+ skx_idle_state_table_update();
+ break;
}
for (cstate = 0; cstate < CPUIDLE_STATE_MAX; ++cstate) {
--
2.25.1
On Thu, 2021-05-27 at 12:56 +0800, Chen Yu wrote:
... snip ...
> Exit latency:
> The C6 exit latency is measured when woken up from CC6/PC6. In the past,
> if PC6 is disabled, CPU would be demoted to CC6/PC3, which is close to
> the latency from CC6/PC6 and there is no need to update the C6 exit latency.
> However on newer platform there is no CC3/PC3 anymore, then the C6 exit
> latency with PC6 disabled should be CC6/PC0.
>
> Target residency:
> With PC6 disabled and C3/PC3 supported, the OS requests C3 if idle
> duration is within [CC6, PC6) target_residency. On new CPU generations
> with C3/PC3 deprecated, the OS would request C1E. This would cause
> low energy-efficiency. In summary, the question is, should we lower
> the bar to request C6 when PC6 is disabled? The answer is yes.
... snip ...
Hi Yu,
Thanks for this patch, it is very actual and helpful.
Comments about the commit message below.
This patch is specifically about SKX. It also covers CLX and CPX,
because they have the same ID.
Now, this platforms do not have C3 and PC3. So I would avoid talking
about these states
in the commit message. Why making a simple thing more complex?
Here are all the SKX C-states.
1. Linux-level C-states (linux can ask for): C1, C1E, C6.
2. HW-level C-states (HW supports under the hood): C1, C1E, CC6, PC2,
PC6.
Here is the story of this patch in my understanding.
1. C6 maps to CC6 and PC6.
2. CC6 is "shallower" than PC6.
3. Linux assumes worst case - PC6.
4. Many datacenters and users disable PC6.
5. We can optimize intel_idle in this case: adjust C6 latency and
target residency to match (faster) CC6.
That's it.
Then may be it is worth mentioning that CC6 vs PC2 difference is not
really measurable, so
the adjustment is only for PC6.
Artem.
On Thu, May 27, 2021 at 11:25:18AM +0300, Artem Bityutskiy wrote:
> On Thu, 2021-05-27 at 12:56 +0800, Chen Yu wrote:
>
> ... snip ...
>
> > Exit latency:
> > The C6 exit latency is measured when woken up from CC6/PC6. In the past,
> > if PC6 is disabled, CPU would be demoted to CC6/PC3, which is close to
> > the latency from CC6/PC6 and there is no need to update the C6 exit latency.
> > However on newer platform there is no CC3/PC3 anymore, then the C6 exit
> > latency with PC6 disabled should be CC6/PC0.
> >
> > Target residency:
> > With PC6 disabled and C3/PC3 supported, the OS requests C3 if idle
> > duration is within [CC6, PC6) target_residency. On new CPU generations
> > with C3/PC3 deprecated, the OS would request C1E. This would cause
> > low energy-efficiency. In summary, the question is, should we lower
> > the bar to request C6 when PC6 is disabled? The answer is yes.
> ... snip ...
>
> Hi Yu,
>
Hi Artem,
> Thanks for this patch, it is very actual and helpful.
>
> Comments about the commit message below.
>
> This patch is specifically about SKX. It also covers CLX and CPX,
> because they have the same ID.
>
> Now, this platforms do not have C3 and PC3. So I would avoid talking
> about these states
> in the commit message. Why making a simple thing more complex?
>
I agree, in theory this issue is irrelevant of the C3/PC3, however I was
thinking the issue would become more significant to user on platform without
C3/PC3, and this increase the necessity of this patch. But let me
refine the commit log to only mention C3 a little just for future reference.
> Here are all the SKX C-states.
>
> 1. Linux-level C-states (linux can ask for): C1, C1E, C6.
> 2. HW-level C-states (HW supports under the hood): C1, C1E, CC6, PC2,
> PC6.
>
And PC0?
> Here is the story of this patch in my understanding.
>
> 1. C6 maps to CC6 and PC6.
> 2. CC6 is "shallower" than PC6.
> 3. Linux assumes worst case - PC6.
> 4. Many datacenters and users disable PC6.
> 5. We can optimize intel_idle in this case: adjust C6 latency and
> target residency to match (faster) CC6.
>
> That's it.
>
Okay.
> Then may be it is worth mentioning that CC6 vs PC2 difference is not
> really measurable, so
> the adjustment is only for PC6.
>
Okay.
thanks,
Chenyu
> Artem.
>