2022-09-23 20:58:25

by K Prateek Nayak

[permalink] [raw]
Subject: Re: [PATCH] ACPI: processor idle: Practically limit "Dummy wait" workaround to old Intel systems

Hello Dave,

On 9/23/2022 12:17 AM, Dave Hansen wrote:
> Old, circa 2002 chipsets have a bug: they don't go idle when they are
> supposed to. So, a workaround was added to slow the CPU down and
> ensure that the CPU waits a bit for the chipset to actually go idle.
> This workaround is ancient and has been in place in some form since
> the original kernel ACPI implementation.
>
> But, this workaround is very painful on modern systems. The "inl()"
> can take thousands of cycles (see Link: for some more detailed
> numbers and some fun kernel archaeology).
>
> First and foremost, modern systems should not be using this code.
> Typical Intel systems have not used it in over a decade because it is
> horribly inferior to MWAIT-based idle.
>
> Despite this, people do seem to be tripping over this workaround on
> AMD system today.
>
> Limit the "dummy wait" workaround to Intel systems. Keep Modern AMD
> systems from tripping over the workaround. Remotely modern Intel
> systems use intel_idle instead of this code and will, in practice,
> remain unaffected by the dummy wait.

I've run 30 runs of tbench with 128 clients on a dual socket Zen3 system
(2 x 64C/128T) and do not see any massive regression like I used to when
we were hitting the dummy wait issue:

Kernel : baseline baseline + C2 disabled baseline + this patch

Min (MB/s) : 2215.06 33072.10 (+1393.05%) 30519.60 (+1277.82%)
Max (MB/s) : 32938.80 34399.10 32699.30
Median (MB/s) : 32191.80 33476.60 31418.90
AMean (MB/s) : 22448.55 33649.27 (+49.89%) 31545.93 (+40.52%)
AMean Stddev : 17526.70 680.14 1095.39
AMean CoefVar : 78.07% 2.02% 3.47%

The range is well within the variation we've normally seen with tbench
on the test platform.

>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Len Brown <[email protected]>
> Cc: Mario Limonciello <[email protected]>
> Cc: Peter Zijlstra (Intel) <[email protected]>
> Cc: Borislav Petkov <[email protected]>

Can you please add a cc to stable?

Cc: [email protected]

> Suggested-by: Rafael J. Wysocki <[email protected]>
> Reported-by: K Prateek Nayak <[email protected]>
> Link: https://lore.kernel.org/all/[email protected]/
> ---
> drivers/acpi/processor_idle.c | 23 ++++++++++++++++++++---
> 1 file changed, 20 insertions(+), 3 deletions(-)
>
> diff --git a/drivers/acpi/processor_idle.c b/drivers/acpi/processor_idle.c
> index 16a1663d02d4..9f40917c49ef 100644
> --- a/drivers/acpi/processor_idle.c
> +++ b/drivers/acpi/processor_idle.c
> @@ -531,10 +531,27 @@ static void wait_for_freeze(void)
> /* No delay is needed if we are in guest */
> if (boot_cpu_has(X86_FEATURE_HYPERVISOR))
> return;
> + /*
> + * Modern (>=Nehalem) Intel systems use ACPI via intel_idle,
> + * not this code. Assume that any Intel systems using this
> + * are ancient and may need the dummy wait. This also assumes
> + * that the motivating chipset issue was Intel-only.
> + */
> + if (boot_cpu_data.x86_vendor != X86_VENDOR_INTEL)

Based on Andreas's comment, this problem is not limited to Intel chipsets
and affects at least the AMD Athlon on VIA chipset (circa 2006)
(https://lore.kernel.org/lkml/[email protected]/)
To be on safer side, the exception could be made for AMD Fam 17h+ and also
Hygon as pointed out by Peter, where we know the dummy wait is unnecessary.
Extending the condition you proposed, we can have:

if (boot_cpu_data.x86_vendor == X86_VENDOR_HYGON ||
((boot_cpu_data.x86_vendor == X86_VENDOR_AMD) &&
(boot_cpu_data.x86_model >= 0x17)))
return;

It is not pretty by any means which is why we can use a x86_BUG_STPCLK to
limit the dummy op to only affected processors. This way, the x86 vendor
check and family check can be avoided in the acpi code. A v2 has been sent
out tackling the problem this way:
https://lore.kernel.org/lkml/[email protected]/

> + return;
> #endif
> - /* Dummy wait op - must do something useless after P_LVL2 read
> - because chipsets cannot guarantee that STPCLK# signal
> - gets asserted in time to freeze execution properly. */
> + /*
> + * Dummy wait op - must do something useless after P_LVL2 read
> + * because chipsets cannot guarantee that STPCLK# signal gets
> + * asserted in time to freeze execution properly
> + *
> + * This workaround has been in place since the original ACPI
> + * implementation was merged, circa 2002.
> + *
> + * If a profile is pointing to this instruction, please first
> + * consider moving your system to a more modern idle
> + * mechanism.
> + */
> inl(acpi_gbl_FADT.xpm_timer_block.address);
> }
>

The patch, as it is, solves the problem we've seen on the newer AMD
platforms with large core density that use IOPORT based C-states.

Tested-by: K Prateek Nayak <[email protected]>
--
Thanks and Regards,
Prateek