2023-10-04 15:30:51

by René Rebe

Subject: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

Hello everyone,

during cross-compiling our “Embedded” Linux distribution T2 (https://t2sde.org) I have observed random illegal-instruction build errors ever since we got ourselves a Ryzen 7950X on launch day a year ago:

vendor_id : AuthenticAMD
cpu family : 25
model : 97
model name : AMD Ryzen 9 7950X 16-Core Processor
stepping : 2
microcode : 0xa601203

Initially I thought it must surely be some early system instability, and that some DDR5 AGESA and microcode updates would eventually take care of it. Month after month passed and so far no BIOS update has helped. So I finally started to investigate this over the last months: I ran the DDR5 memory at base clock, then disabled Precision Boost, and in the end ran the CPU and RAM even below the advertised base clock. The pseudo-random illegal instructions in some gcc instances were still observed, and I started to realize that 99% of them were nearly identical, around gcc user-space address 0xc0e0c0 (sometimes slightly off, like 0xc08aaf):

during GIMPLE pass: switchlower
../src/intel/vulkan/anv_nir_lower_ubo_loads.c: In function 'lower_ubo_load_instr':
../src/intel/vulkan/anv_nir_lower_ubo_loads.c:28:1: internal compiler error: Illegal instruction
28 | lower_ubo_load_instr(nir_builder *b, nir_instr *instr, UNUSED void *_data)
| ^~~~~~~~~~~~~~~~~~~~
0x1435c95 internal_error(char const*, ...)
???:0
0xc0e0c0 tree_switch_conversion::switch_decision_tree::try_switch_expansion(vec<tree_switch_conversion::cluster*, va_heap, vl_ptr>&)
???:0
0xc0eb69 tree_switch_conversion::switch_decision_tree::analyze_switch_statement()
???:0

0x0000000000c0e0b6 <+246>: cmp $0xffffffff,%r14d
0x0000000000c0e0ba <+250>: je 0xc0e190 <_ZN22tree_switch_conversion20switch_decision_tree20try_switch_expansionER3vecIPNS_7clusterE7va_heap6vl_ptrE+464>
0x0000000000c0e0c0 <+256>: mov %rax,%rcx # <----- HERE -----!
0x0000000000c0e0c3 <+259>: cmpb $0x0,0xa8(%r13)
0x0000000000c0e0cb <+267>: jne 0xc0e060 <_ZN22tree_switch_conversion20switch_decision_tree20try_switch_expansionER3vecIPNS_7clusterE7va_heap6vl_ptrE+160>
0x0000000000c0e0cd <+269>: mov 0xa0(%r13),%rsi
0x0000000000c0e0d4 <+276>: mov $0x8,%eax

The illegal instructions only occur sometimes, so rebuilding a package is usually successful.
To rule out any software-inherited instability, I booted a bit-identical copy of the system (made with rsync) on a Ryzen 5950X and could build the whole system using the identical kernel and gcc binaries without any such spurious illegal instructions.

This appeared to mostly show up with gcc as cross-compiled for sparc64, but that should not matter, as this is just generic x86-64(-v1) code whose instruction sequence, memory access, and I/O pattern could likely appear in any other sufficiently sophisticated and complex user-space program.

Trying to narrow this down further, and to find out whether it is just one defective core, I patched the kernel to show the likely CPU. Not sure if this is the most reliable approach, but this is the patch:

--- linux-6.5/arch/x86/kernel/traps.c.orig 2023-10-02 11:53:47.413623693 +0200
+++ linux-6.5/arch/x86/kernel/traps.c 2023-10-02 11:53:58.580624927 +0200
@@ -294,8 +294,12 @@
static inline void handle_invalid_op(struct pt_regs *regs)
#endif
{
+ void __user *addr = error_get_trap_addr(regs);
+ int cpu = raw_smp_processor_id();
+ printk("INVALID OPCODE: %lx likely on CPU %d (core %d, socket %d)\n",
+ (unsigned long)addr, cpu, topology_core_id(cpu), topology_physical_package_id(cpu));
do_error_trap(regs, 0, "invalid opcode", X86_TRAP_UD, SIGILL,
- ILL_ILLOPN, error_get_trap_addr(regs));
+ ILL_ILLOPN, addr);
}

This showed that cores across all CCXs are affected:

[ 1901.688448] INVALID OPCODE: c0e0c0 likely on CPU 26 (core 10, socket 0)
[ 1930.529211] INVALID OPCODE: c0e0c0 likely on CPU 21 (core 5, socket 0)
[ 1971.898911] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
[ 2006.781557] INVALID OPCODE: c0e0c0 likely on CPU 19 (core 3, socket 0)
[ 2054.672900] INVALID OPCODE: c0e0c0 likely on CPU 30 (core 14, socket 0)
[ 2097.180969] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
[ 2140.558150] INVALID OPCODE: c0e0c0 likely on CPU 23 (core 7, socket 0)
[ 2168.601674] INVALID OPCODE: c0e0c0 likely on CPU 15 (core 15, socket 0)


I sorted the result:

# dmesg | grep INVALID | sed 's/.*://' | cut -d ' ' -f 6 | sort -n | uniq -c
4 0
2 1
2 2
2 3
5 4
5 5
3 7
4 8
4 9
2 10
4 11
3 12
5 13
5 14
2 15
6 16
8 17
2 18
5 19
5 20
3 21
7 22
5 23
3 24
2 25
2 26
3 27
4 28
4 29
6 30
5 31
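
The pipeline above picks the CPU number by field position, which depends on the exact spacing of the log line. As a minimal sketch (an assumption, not what was actually run for this report), the same log lines can be tallied per physical core instead, which makes it easier to see whether both SMT siblings of a core are hit. The function name count_by_core is hypothetical; the sample lines are copied from the dmesg output quoted earlier.

```shell
# Tally "INVALID OPCODE" log lines (read from stdin) per physical core id.
# Matches on the "(core N," text printed by the debug patch rather than on
# a fixed field position.
count_by_core() {
    sed -n 's/.*INVALID OPCODE:.*(core \([0-9][0-9]*\),.*/\1/p' | sort -n | uniq -c
}

# Demo on three of the log lines quoted above: one hit on core 10,
# two hits on core 11.
count_by_core <<'EOF'
[ 1901.688448] INVALID OPCODE: c0e0c0 likely on CPU 26 (core 10, socket 0)
[ 1971.898911] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
[ 2097.180969] INVALID OPCODE: c0e0c0 likely on CPU 27 (core 11, socket 0)
EOF
```

On a live system the input would come from dmesg, for example: dmesg | count_by_core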

While discussing this issue with some other folks and kernel developers, it was suggested that it could be TLB related, and we realized we were booting with mitigations=off for slightly higher whole-system compilation performance. I can report that without mitigations=off these spurious illegal instructions do not appear. Disabling SMT makes the problem disappear, too.

So I iterated over all the mitigation options and found spectre_v2_user=off to be enough to make this bug appear reproducibly when loading most cores with this sparc64-t2-linux-gcc.

Now the good news is: running with modern security mitigations enabled hides what looks to me like a Zen 4 SMT sibling processor state corruption bug or mis-speculation. However, I would argue that non-malicious user-mode programs should not exhibit spurious illegal instructions with an operating system running in a classic, high-performance mode without any special security mitigations in place.

As this is very reproducible with GCC for sparc64 for me, I created an initrd with a pre-processed source file (from Mesa IIRC), set up to boot into a loop running sparc64-t2-linux-gcc on all cores (all grouped in usr/local), so that others can test how widespread this issue is:

https://dl.t2sde.org/amd-zen4-smt-c0fefe/

Boot with:
spectre_v2_user=off or mitigations=off
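
Before running the reproducer it may help to confirm that the running kernel was really booted with one of the two options named above. The following is a small sketch under that assumption; the function name cmdline_enables_repro is hypothetical, and it only inspects the command-line string, not the mitigation state the kernel ended up selecting.

```shell
# Succeeds if a kernel command line (passed as one string) contains
# spectre_v2_user=off or mitigations=off as a separate word.
cmdline_enables_repro() {
    case " $1 " in
        *" spectre_v2_user=off "*|*" mitigations=off "*) return 0 ;;
    esac
    return 1
}

# Usage on a running system:
#   cmdline_enables_repro "$(cat /proc/cmdline)" && echo "reproducer conditions set"
```

The state the kernel actually chose is reported in /sys/devices/system/cpu/vulnerabilities/spectre_v2; the exact wording of that file varies between kernel versions.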

It is even reproducible in qemu/kvm running on a host booted with spectre_v2_user=off:
qemu-system-x86_64 --enable-kvm -smp 32 -cpu host -m 4G -kernel vmlinuz-6.5.5-t2 -initrd initrd-6.5.5-t2.gz -nographic -append "console=ttyS0"

To test on your system with chroot:
mkdir bug; cd bug; gunzip -c ../initrd-6.5.5-t2.gz | cpio -i
chroot . usr/local/init

With this reduced test case, illegal instructions appear within an average of just 5 seconds on my Ryzen 7950X.
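
For testing outside the provided initrd, the idea of looping a command until it hits an illegal instruction can be sketched as follows. This is not the init script shipped in the initrd; the function name runs_until_sigill and the example compiler invocation are hypothetical. It relies on the shell convention that a child killed by SIGILL (signal 4 on Linux) is reported with exit status 128 + 4 = 132.

```shell
# Re-run a command until it dies from SIGILL (exit status 132) or until
# max_runs attempts have been made. Prints the number of runs it took
# and succeeds if SIGILL was seen; fails otherwise.
runs_until_sigill() {
    max_runs=$1; shift
    n=0
    while [ "$n" -lt "$max_runs" ]; do
        n=$((n + 1))
        "$@" >/dev/null 2>&1 && status=0 || status=$?
        if [ "$status" -eq 132 ]; then
            echo "$n"
            return 0
        fi
    done
    return 1
}

# Usage (file name and flags are placeholders, not taken from the initrd):
#   runs_until_sigill 1000 sparc64-t2-linux-gcc -O2 -c test.i -o /dev/null
```

To load all SMT siblings at once, one such loop per logical CPU would have to run in parallel, since the problem was only observed under load with SMT enabled.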

To rule out that this is some random Linux kernel config and optimization fluke, I built the kernel with clang and gcc, without any change in the result, and also downloaded the latest Intel Clear Linux kernel binary to double-check that it is affected in the same way, and sure enough it is.

After all this research, to me this looks like a Zen 4 CPU bug, but any other comments, hints, and patches are welcome!

I realize AMD has newer microcode for Epyc server CPUs. If this is already fixed in some newer microcode, it would really be amazing (hint) if AMD would release microcode updates for $999 consumer CPUs in a more timely manner, and not only for high-end server SKUs via linux-firmware, ...

Thank you so much,

René Rebe

--
ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
http://exactcode.com | http://exactscan.com | http://ocrkit.com


2023-10-04 22:26:02

by Borislav Petkov

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

On Wed, Oct 04, 2023 at 05:29:32PM +0200, René Rebe wrote:
> during cross compiling our “Embedded” Linux Distribution T2 (https://t2sde.org) I observers some random illegal instruction build errors since we got ourselves a Ryzen 7950x on launch day a year ago:

Thanks for reporting. I'm looking into it.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-06 09:21:41

by René Rebe

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

Hi,

> On 5. Oct 2023, at 00:25, Borislav Petkov <[email protected]> wrote:
>
> On Wed, Oct 04, 2023 at 05:29:32PM +0200, René Rebe wrote:
>> during cross compiling our “Embedded” Linux Distribution T2 (https://t2sde.org) I observers some random illegal instruction build errors since we got ourselves a Ryzen 7950x on launch day a year ago:
>
> Thanks for reporting. I'm looking into it.

Thank you Borislav, were you able to reproduce this on a Zen 4 system you have access to?

Thanks,
René

--
ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
http://exactcode.com | http://exactscan.com | http://ocrkit.com

2023-10-06 09:33:12

by Borislav Petkov

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

On Fri, Oct 06, 2023 at 11:21:13AM +0200, René Rebe wrote:
> Thank you Borislav, were you able to reproduce this on Zen 4 you have
> access to?

I'm still working on it and I'll have something soon.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-10 08:41:33

by Borislav Petkov

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

On Fri, Oct 06, 2023 at 11:32:44AM +0200, Borislav Petkov wrote:
> I'm still working on it and I'll have something soon.

Ok, try this below and see whether it fixes your reproducer.

Thx.

---
From: "Borislav Petkov (AMD)" <[email protected]>
Date: Sat, 7 Oct 2023 12:57:02 +0200
Subject: [PATCH] x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs

Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Cc: <[email protected]>
---
arch/x86/include/asm/msr-index.h | 9 +++++++--
arch/x86/kernel/cpu/amd.c | 8 ++++++++
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d111350197f..b37abb55e948 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -637,12 +637,17 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e

-/* Fam 17h MSRs */
-#define MSR_F17H_IRPERF 0xc00000e9
+/* Zen4 */
+#define MSR_ZEN4_BP_CFG 0xc001102e
+#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

+/* Zen 2 */
#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)

+/* Fam 17h MSRs */
+#define MSR_F17H_IRPERF 0xc00000e9
+
/* Fam 16h MSRs */
#define MSR_F16H_L2I_PERF_CTL 0xc0010230
#define MSR_F16H_L2I_PERF_CTR 0xc0010231
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 03ef962a6992..ece2b5b7b0fe 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -80,6 +80,10 @@ static const int amd_div0[] =
AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x17, 0x00, 0x0, 0x2f, 0xf),
AMD_MODEL_RANGE(0x17, 0x50, 0x0, 0x5f, 0xf));

+static const int amd_erratum_1485[] =
+ AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x19, 0x10, 0x0, 0x1f, 0xf),
+ AMD_MODEL_RANGE(0x19, 0x60, 0x0, 0xaf, 0xf));
+
static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu, const int *erratum)
{
int osvw_id = *erratum++;
@@ -1149,6 +1153,10 @@ static void init_amd(struct cpuinfo_x86 *c)
pr_notice_once("AMD Zen1 DIV0 bug detected. Disable SMT for full protection.\n");
setup_force_cpu_bug(X86_BUG_DIV0);
}
+
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
+ cpu_has_amd_erratum(c, amd_erratum_1485))
+ msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
}

#ifdef CONFIG_X86_32
--
2.42.0.rc0.25.ga82fb66fed25
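
The two AMD_MODEL_RANGE entries in the patch can be checked by hand against the CPU from the original report (cpu family 25, model 97, stepping 2). The sketch below assumes that each range is inclusive and is compared on (model << 4) | stepping with an exact family match; the function name in_erratum_1485_range is hypothetical, and the OSVW part of cpu_has_amd_erratum() is not modeled.

```shell
# Succeeds if (family, model, stepping), given in decimal as printed by
# /proc/cpuinfo, falls into one of the two ranges in amd_erratum_1485[]:
#   family 0x19, model 0x10 stepping 0x0 .. model 0x1f stepping 0xf
#   family 0x19, model 0x60 stepping 0x0 .. model 0xaf stepping 0xf
in_erratum_1485_range() {
    [ "$1" -eq $((0x19)) ] || return 1
    ms=$(( ($2 << 4) | $3 ))
    if [ "$ms" -ge $((0x100)) ] && [ "$ms" -le $((0x1ff)) ]; then return 0; fi
    if [ "$ms" -ge $((0x600)) ] && [ "$ms" -le $((0xaff)) ]; then return 0; fi
    return 1
}

# The CPU from the original report: family 25 (0x19), model 97 (0x61), stepping 2.
if in_erratum_1485_range 25 97 2; then
    echo "family 25 model 97 stepping 2: covered"
else
    echo "family 25 model 97 stepping 2: not covered"
fi
```

Model 97 is 0x61, so the combined value is 0x612, which lies inside the second range (0x600 to 0xaff).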

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-10 21:19:36

by René Rebe

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

Hi Borislav,


> On 10. Oct 2023, at 10:39, Borislav Petkov <[email protected]> wrote:
>
> On Fri, Oct 06, 2023 at 11:32:44AM +0200, Borislav Petkov wrote:
>> I'm still working on it and I'll have something soon.
>
> Ok, try this below and see whether it fixes your reproducer.

After the first day of testing, the patch appears to have prevented
the spurious #UD exception from appearing again.

Tested-by: René Rebe <[email protected]>

> Thx.
>
> ---
> From: "Borislav Petkov (AMD)" <[email protected]>
> Date: Sat, 7 Oct 2023 12:57:02 +0200
> Subject: [PATCH] x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs
>
> Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
> cause an #UD exception. The performance impact of the fix is negligible.
>
> Signed-off-by: Borislav Petkov (AMD) <[email protected]>
> Cc: <[email protected]>
> ---
> arch/x86/include/asm/msr-index.h | 9 +++++++--
> arch/x86/kernel/cpu/amd.c | 8 ++++++++
> 2 files changed, 15 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
> index 1d111350197f..b37abb55e948 100644
> --- a/arch/x86/include/asm/msr-index.h
> +++ b/arch/x86/include/asm/msr-index.h
> @@ -637,12 +637,17 @@
> /* AMD Last Branch Record MSRs */
> #define MSR_AMD64_LBR_SELECT 0xc000010e
>
> -/* Fam 17h MSRs */
> -#define MSR_F17H_IRPERF 0xc00000e9
> +/* Zen4 */
> +#define MSR_ZEN4_BP_CFG 0xc001102e
> +#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5
>
> +/* Zen 2 */
> #define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
> #define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
>
> +/* Fam 17h MSRs */
> +#define MSR_F17H_IRPERF 0xc00000e9
> +
> /* Fam 16h MSRs */
> #define MSR_F16H_L2I_PERF_CTL 0xc0010230
> #define MSR_F16H_L2I_PERF_CTR 0xc0010231
> diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
> index 03ef962a6992..ece2b5b7b0fe 100644
> --- a/arch/x86/kernel/cpu/amd.c
> +++ b/arch/x86/kernel/cpu/amd.c
> @@ -80,6 +80,10 @@ static const int amd_div0[] =
> AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x17, 0x00, 0x0, 0x2f, 0xf),
> AMD_MODEL_RANGE(0x17, 0x50, 0x0, 0x5f, 0xf));
>
> +static const int amd_erratum_1485[] =
> + AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x19, 0x10, 0x0, 0x1f, 0xf),
> + AMD_MODEL_RANGE(0x19, 0x60, 0x0, 0xaf, 0xf));
> +
> static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu, const int *erratum)
> {
> int osvw_id = *erratum++;
> @@ -1149,6 +1153,10 @@ static void init_amd(struct cpuinfo_x86 *c)
> pr_notice_once("AMD Zen1 DIV0 bug detected. Disable SMT for full protection.\n");
> setup_force_cpu_bug(X86_BUG_DIV0);
> }
> +
> + if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
> + cpu_has_amd_erratum(c, amd_erratum_1485))
> + msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
> }
>
> #ifdef CONFIG_X86_32
> --
> 2.42.0.rc0.25.ga82fb66fed25
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

--
ExactCODE GmbH, Lietzenburger Str. 42, DE-10789 Berlin
http://exactcode.com | http://exactscan.com | http://ocrkit.com

2023-10-11 08:59:38

by Borislav Petkov

Subject: Re: [RFC] AMD Zen4 CPU bug? Spurious SMT Sibling Invalid Opcode Speculation

On Tue, Oct 10, 2023 at 11:18:57PM +0200, René Rebe wrote:
> On the first day the patch so far appears to have prevented
> the spurious #UD exception to appear again.
>
> Tested-by: René Rebe <[email protected]>

Thanks for reporting and testing!

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-11 09:24:02

by tip-bot2 for Borislav Petkov (AMD)

Subject: [tip: x86/urgent] x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs

The following commit has been merged into the x86/urgent branch of tip:

Commit-ID: f454b18e07f518bcd0c05af17a2239138bff52de
Gitweb: https://git.kernel.org/tip/f454b18e07f518bcd0c05af17a2239138bff52de
Author: Borislav Petkov (AMD) <[email protected]>
AuthorDate: Sat, 07 Oct 2023 12:57:02 +02:00
Committer: Borislav Petkov (AMD) <[email protected]>
CommitterDate: Wed, 11 Oct 2023 11:00:11 +02:00

x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs

Fix erratum #1485 on Zen4 parts where running with STIBP disabled can
cause an #UD exception. The performance impact of the fix is negligible.

Reported-by: René Rebe <[email protected]>
Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Tested-by: René Rebe <[email protected]>
Cc: <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/msr-index.h | 9 +++++++--
arch/x86/kernel/cpu/amd.c | 8 ++++++++
2 files changed, 15 insertions(+), 2 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 1d11135..b37abb5 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -637,12 +637,17 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e

-/* Fam 17h MSRs */
-#define MSR_F17H_IRPERF 0xc00000e9
+/* Zen4 */
+#define MSR_ZEN4_BP_CFG 0xc001102e
+#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

+/* Zen 2 */
#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)

+/* Fam 17h MSRs */
+#define MSR_F17H_IRPERF 0xc00000e9
+
/* Fam 16h MSRs */
#define MSR_F16H_L2I_PERF_CTL 0xc0010230
#define MSR_F16H_L2I_PERF_CTR 0xc0010231
diff --git a/arch/x86/kernel/cpu/amd.c b/arch/x86/kernel/cpu/amd.c
index 03ef962..ece2b5b 100644
--- a/arch/x86/kernel/cpu/amd.c
+++ b/arch/x86/kernel/cpu/amd.c
@@ -80,6 +80,10 @@ static const int amd_div0[] =
AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x17, 0x00, 0x0, 0x2f, 0xf),
AMD_MODEL_RANGE(0x17, 0x50, 0x0, 0x5f, 0xf));

+static const int amd_erratum_1485[] =
+ AMD_LEGACY_ERRATUM(AMD_MODEL_RANGE(0x19, 0x10, 0x0, 0x1f, 0xf),
+ AMD_MODEL_RANGE(0x19, 0x60, 0x0, 0xaf, 0xf));
+
static bool cpu_has_amd_erratum(struct cpuinfo_x86 *cpu, const int *erratum)
{
int osvw_id = *erratum++;
@@ -1149,6 +1153,10 @@ static void init_amd(struct cpuinfo_x86 *c)
pr_notice_once("AMD Zen1 DIV0 bug detected. Disable SMT for full protection.\n");
setup_force_cpu_bug(X86_BUG_DIV0);
}
+
+ if (!cpu_has(c, X86_FEATURE_HYPERVISOR) &&
+ cpu_has_amd_erratum(c, amd_erratum_1485))
+ msr_set_bit(MSR_ZEN4_BP_CFG, MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT);
}

#ifdef CONFIG_X86_32
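
On a kernel that carries this commit, one way to check that the workaround took effect would be to read MSR 0xc001102e and test bit 5. The sketch below only covers the bit test; the function name bit_is_set is hypothetical. The usage line assumes the rdmsr tool from msr-tools is installed, that the msr module is loaded, and that rdmsr prints bare hexadecimal by default. Note that the patch skips the write when running under a hypervisor.

```shell
# Succeeds if bit number $2 is set in the integer $1
# (decimal, or hexadecimal with a 0x prefix).
bit_is_set() {
    [ $(( ($1 >> $2) & 1 )) -eq 1 ]
}

# Usage as root (read-only; assumes msr-tools and "modprobe msr"):
#   bit_is_set "0x$(rdmsr -p 0 0xc001102e)" 5 && echo "BTB fix bit set on CPU 0"
```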

2023-10-11 21:28:54

by Ingo Molnar

Subject: Re: [tip: x86/urgent] x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs


* tip-bot2 for Borislav Petkov (AMD) <[email protected]> wrote:

> /* AMD Last Branch Record MSRs */
> #define MSR_AMD64_LBR_SELECT 0xc000010e
>
> +/* Zen4 */
> +#define MSR_ZEN4_BP_CFG 0xc001102e
> +#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5
>
> +/* Zen 2 */
> #define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
> #define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
>
> +/* Fam 17h MSRs */
> +#define MSR_F17H_IRPERF 0xc00000e9

Yeah, so these latest AMD MSR definitions in <asm/msr-index.h> are pretty
confused, they list MSRs in the following order:

Zen 4
Zen 2
Fam 19h // resolution in tip:master
Fam 17h

where perf/core added a Fam 19h section a couple of days ago ...

While in reality:

Zen 2 == Fam 17h
Zen 4 == Fam 19h

So it's confusing to list these separately and out of order.

So in resolving the conflict in perf/core I updated this section to read:

/* Fam 19h (Zen 4) MSRs */
#define MSR_F19H_UMC_PERF_CTL 0xc0010800
#define MSR_F19H_UMC_PERF_CTR 0xc0010801

#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

/* Fam 17h (Zen 2) MSRs */
#define MSR_F17H_IRPERF 0xc00000e9

#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)

This doesn't change the definitions themselves, only merges the comments
and the sections, (to keep the Git conflict resolution non-evil), but
arguably once perf/core goes upstream, we should probably unify the naming
to follow the existing nomenclature, which is, starting at around F15H, the
following:

MSR_F15H_
MSR_F16H_
MSR_F17H_
MSR_F19H_

Or are the MSRs named ZEN2 and ZEN4 in AMD SDMs, which we should follow?

Anyway, something to keep in mind.

Thanks,

Ingo

2023-10-12 07:42:07

by Borislav Petkov

Subject: Re: [tip: x86/urgent] x86/cpu: Fix AMD erratum #1485 on Zen4-based CPUs

On Wed, Oct 11, 2023 at 11:28:26PM +0200, Ingo Molnar wrote:
> While in reality:
>
> Zen 2 == Fam 17h
> Zen 4 == Fam 19h

If only it were that easy...

family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.

>
> So it's confusing to list these separately and out of order.
>
> So in resolving the conflict in perf/core I updated this section to read:
>
> /* Fam 19h (Zen 4) MSRs */

That's wrong.

> #define MSR_F19H_UMC_PERF_CTL 0xc0010800
> #define MSR_F19H_UMC_PERF_CTR 0xc0010801
>
> #define MSR_ZEN4_BP_CFG 0xc001102e
> #define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5
>
> /* Fam 17h (Zen 2) MSRs */

Ditto.

> This doesn't change the definitions themselves, only merges the comments
> and the sections, (to keep the Git conflict resolution non-evil), but
> arguably once perf/core goes upstream, we should probably unify the naming
> to follow the existing nomenclature, which is, starting at around F15H, the
> following:
>
> MSR_F15H_
> MSR_F16H_
> MSR_F17H_
> MSR_F19H_
>
> Or are the MSRs named ZEN2 and ZEN4 in AMD SDMs, which we should follow?

See above. The MSRs are per Zen generation while the family is per
family. Yes, it is confusing. :-\

IOW, you want to have this as the end product:

/* Zen4 */
#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

/* Fam 19h MSRs */
#define MSR_F19H_UMC_PERF_CTL 0xc0010800
#define MSR_F19H_UMC_PERF_CTR 0xc0010801

/* Zen 2 */
#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)

/* Fam 17h MSRs */
#define MSR_F17H_IRPERF 0xc00000e9

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2023-10-12 18:12:14

by Ingo Molnar

Subject: [PATCH] x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 enumerations


* Borislav Petkov <[email protected]> wrote:

> On Wed, Oct 11, 2023 at 11:28:26PM +0200, Ingo Molnar wrote:
> > While in reality:
> >
> > Zen 2 == Fam 17h
> > Zen 4 == Fam 19h
>
> If only were that easy...
>
> family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.
>
...
> See above. The MSRs are per Zen generation while the family is per
> family. Yes, it is confusing. :-\

Fun!

> IOW, you want to have this as the end product:
>
> /* Zen4 */
> #define MSR_ZEN4_BP_CFG 0xc001102e
> #define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5
>
> /* Fam 19h MSRs */
> #define MSR_F19H_UMC_PERF_CTL 0xc0010800
> #define MSR_F19H_UMC_PERF_CTR 0xc0010801
>
> /* Zen 2 */
> #define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
> #define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
>
> /* Fam 17h MSRs */
> #define MSR_F17H_IRPERF 0xc00000e9

Ok, thanks - I've distilled your enumeration order into the separate
patch below - there's more commits in perf/core meanwhile, and maybe
it isn't even bad there's a bit of a spotlight on the naming
scheme here.

I've turned your above grouping & comments into a patch, created a
changelog and added your SOB, see the perf/core commit below.
Lemme know if that's not OK to you.

Thanks,

Ingo

=============>
From: Borislav Petkov <[email protected]>
Date: Thu, 12 Oct 2023 20:01:59 +0200
Subject: [PATCH] x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations

The comments introduced in <asm/msr-index.h> in the merge conflict fixup in:

8f4156d58713 ("Merge branch 'x86/urgent' into perf/core, to resolve conflict")

... aren't right: AMD naming schemes are more complex than implied,
family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.

So there's indeed four separate MSR namespaces for:

MSR_F17H_
MSR_F19H_
MSR_ZEN2_
MSR_ZEN4_

... and the namespaces cannot be merged.

Fix it up. No change in functionality.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/msr-index.h | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0ad9ba8baa8a..f8b502867dd1 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -637,18 +637,20 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e

-/* Fam 19h (Zen 4) MSRs */
-#define MSR_F19H_UMC_PERF_CTL 0xc0010800
-#define MSR_F19H_UMC_PERF_CTR 0xc0010801
-
-#define MSR_ZEN4_BP_CFG 0xc001102e
+/* Zen4 */
+#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

-/* Fam 17h (Zen 2) MSRs */
-#define MSR_F17H_IRPERF 0xc00000e9
+/* Fam 19h MSRs */
+#define MSR_F19H_UMC_PERF_CTL 0xc0010800
+#define MSR_F19H_UMC_PERF_CTR 0xc0010801

-#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
-#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
+/* Zen 2 */
+#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
+#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
+
+/* Fam 17h MSRs */
+#define MSR_F17H_IRPERF 0xc00000e9

/* Fam 16h MSRs */
#define MSR_F16H_L2I_PERF_CTL 0xc0010230

2023-10-12 18:21:10

by tip-bot2 for Borislav Petkov

Subject: [tip: perf/core] x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations

The following commit has been merged into the perf/core branch of tip:

Commit-ID: deedec0a152a3d7fa5b04ef9431aeb71802835b5
Gitweb: https://git.kernel.org/tip/deedec0a152a3d7fa5b04ef9431aeb71802835b5
Author: Borislav Petkov <[email protected]>
AuthorDate: Thu, 12 Oct 2023 20:01:59 +02:00
Committer: Ingo Molnar <[email protected]>
CommitterDate: Thu, 12 Oct 2023 20:10:39 +02:00

x86/cpu: Fix the AMD Fam 17h, Fam 19h, Zen2 and Zen4 MSR enumerations

The comments introduced in <asm/msr-index.h> in the merge conflict fixup in:

8f4156d58713 ("Merge branch 'x86/urgent' into perf/core, to resolve conflict")

... aren't right: AMD naming schemes are more complex than implied,
family 0x17 is Zen1 and 2, family 0x19 is spread around Zen 3 and 4.

So there's indeed four separate MSR namespaces for:

MSR_F17H_
MSR_F19H_
MSR_ZEN2_
MSR_ZEN4_

... and the namespaces cannot be merged.

Fix it up. No change in functionality.

Signed-off-by: Borislav Petkov (AMD) <[email protected]>
Signed-off-by: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/include/asm/msr-index.h | 20 +++++++++++---------
1 file changed, 11 insertions(+), 9 deletions(-)

diff --git a/arch/x86/include/asm/msr-index.h b/arch/x86/include/asm/msr-index.h
index 0ad9ba8..f8b5028 100644
--- a/arch/x86/include/asm/msr-index.h
+++ b/arch/x86/include/asm/msr-index.h
@@ -637,18 +637,20 @@
/* AMD Last Branch Record MSRs */
#define MSR_AMD64_LBR_SELECT 0xc000010e

-/* Fam 19h (Zen 4) MSRs */
-#define MSR_F19H_UMC_PERF_CTL 0xc0010800
-#define MSR_F19H_UMC_PERF_CTR 0xc0010801
-
-#define MSR_ZEN4_BP_CFG 0xc001102e
+/* Zen4 */
+#define MSR_ZEN4_BP_CFG 0xc001102e
#define MSR_ZEN4_BP_CFG_SHARED_BTB_FIX_BIT 5

-/* Fam 17h (Zen 2) MSRs */
-#define MSR_F17H_IRPERF 0xc00000e9
+/* Fam 19h MSRs */
+#define MSR_F19H_UMC_PERF_CTL 0xc0010800
+#define MSR_F19H_UMC_PERF_CTR 0xc0010801

-#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
-#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
+/* Zen 2 */
+#define MSR_ZEN2_SPECTRAL_CHICKEN 0xc00110e3
+#define MSR_ZEN2_SPECTRAL_CHICKEN_BIT BIT_ULL(1)
+
+/* Fam 17h MSRs */
+#define MSR_F17H_IRPERF 0xc00000e9

/* Fam 16h MSRs */
#define MSR_F16H_L2I_PERF_CTL 0xc0010230