This is a follow up on:
https://lore.kernel.org/lkml/[email protected]
Late microcode loading is desired by enterprise users. Late loading is
problematic as it requires detailed knowledge about the change and an
analysis of whether this change modifies something which is already in use
by the kernel. Large enterprise customers have engineering teams and access
to deep technical vendor support. The regular admin does not have such
resources, so the kernel has always tainted itself after late loading.
Intel recently defined a previously reserved field in the microcode
header to hold the minimal microcode revision which must be running on the
CPU to make the load safe. This field is 0 in all older microcode
revisions, which the kernel assumes to be unsafe. Minimal revision checking
can be enforced via Kconfig or kernel command line. When enforced, the
kernel refuses to load an unsafe revision. The default loads unsafe
revisions like before and taints the kernel. If a safe revision is loaded
the kernel is not tainted.
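The decision described above can be sketched roughly as follows. This is a
minimal userspace illustration, not the code from the series: the struct,
the enum and both function names are hypothetical, and the header is
reduced to the two fields the check needs. It assumes "safe" means the
field is non-zero and the running revision is at least the minimal one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, reduced header: only what this sketch needs. */
struct ucode_hdr_sketch {
	uint32_t rev;     /* revision of the microcode image to load */
	uint32_t min_rev; /* minimal running revision for a safe load, 0 = not set */
};

/* Hypothetical outcome of a late-load request. */
enum late_load_result {
	LATE_LOAD_REJECTED,   /* enforcement on, image not known to be safe */
	LATE_LOAD_OK_TAINTED, /* loaded like before, kernel gets tainted */
	LATE_LOAD_OK_CLEAN,   /* loaded, no taint */
};

/* An image with min_rev == 0 predates the field and is assumed unsafe. */
static bool late_load_is_safe(const struct ucode_hdr_sketch *hdr,
			      uint32_t running_rev)
{
	assert(hdr != NULL);
	return hdr->min_rev != 0 && running_rev >= hdr->min_rev;
}

/*
 * enforce_minrev stands in for the Kconfig / command line switch:
 * when set, unsafe images are refused instead of loaded with a taint.
 */
static enum late_load_result
late_load_decide(const struct ucode_hdr_sketch *hdr, uint32_t running_rev,
		 bool enforce_minrev)
{
	if (late_load_is_safe(hdr, running_rev))
		return LATE_LOAD_OK_CLEAN;

	return enforce_minrev ? LATE_LOAD_REJECTED : LATE_LOAD_OK_TAINTED;
}
```

With enforcement off, an image whose field is 0 still loads and taints the
kernel, which matches the previous default behaviour.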
But that does not solve the other known problems with late loading:
  - Late loading on current Intel CPUs is unsafe vs. NMI when
    hyperthreading is enabled. If an NMI hits the secondary sibling while
    the primary loads the microcode, the machine can crash.
  - Soft-offlined SMT siblings which are playing dead with MWAIT can cause
    damage too when the microcode update modifies MWAIT. That's a
    realistic scenario in the context of 'nosmt' mitigations. :(
Neither the core code nor the Intel specific code handles any of this at all.
While trying to implement this, I stumbled over dysfunctional, horribly
complex and redundant code, which I decided to clean up first so the new
functionality can be added on a clean slate.
So the series has several sections:
  1) Move the 32-bit early loading after paging is enabled
  2) Cleanup of the Intel specific code
  3) Implementation of proper core control logic to handle the NMI safe
     requirements
  4) Support for minimal revision check in the core and the Intel specific
     parts.
Changes vs. v3:
  - Rebased on v6.6-rc1
  - Remove the early load magic which was required for physical address
    mode from the AMD code.
  - Address the review comments from Borislav, which are mostly about
    naming, comments and change logs. No functional changes vs. v3.
The series is also available from git:
git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git ucode-v4
Thanks,
tglx
---
 Documentation/admin-guide/kernel-parameters.txt |    5
 arch/x86/Kconfig                                |   25
 arch/x86/include/asm/apic.h                     |    5
 arch/x86/include/asm/cpu.h                      |   20
 arch/x86/include/asm/microcode.h                |   19
 arch/x86/kernel/Makefile                        |    1
 arch/x86/kernel/apic/apic_flat_64.c             |    2
 arch/x86/kernel/apic/ipi.c                      |    8
 arch/x86/kernel/apic/x2apic_cluster.c           |    1
 arch/x86/kernel/apic/x2apic_phys.c              |    1
 arch/x86/kernel/cpu/common.c                    |   12
 arch/x86/kernel/cpu/microcode/amd.c             |  129 +---
 arch/x86/kernel/cpu/microcode/core.c            |  637 ++++++++++++++--------
 arch/x86/kernel/cpu/microcode/intel.c           |  682 +++++++-----------------
 arch/x86/kernel/cpu/microcode/internal.h        |   32 -
 arch/x86/kernel/head32.c                        |    6
 arch/x86/kernel/head_32.S                       |   10
 arch/x86/kernel/nmi.c                           |    9
 arch/x86/kernel/smpboot.c                       |   12
 drivers/platform/x86/intel/ifs/load.c           |    8
 include/linux/cpuhotplug.h                      |    1
 21 files changed, 788 insertions(+), 837 deletions(-)
Hi Thomas,
> ...
>
> The series is also available from git:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/tglx/devel.git ucode-v4
> ...
Test Result (same as ucode-v3)
------------------------------
Tested 'ucode-v4' on an Intel Sapphire Rapids server; both early load
and late load worked well. For more details, please refer to the test below:
Tested Machine
--------------
Intel Sapphire Rapids server with 2 sockets, each containing 48 cores,
resulting in a total of 192 threads.
Microcodes
----------
a) Microcode revision of CPU : 0xab000130
b) Microcode revision in the initramfs : 0xab000140 // for early load
c) Microcode revision in /lib/firmware/intel-ucode/* : 0xab000160 // for late load
[ Microcode b) & c) headers both contain minirev 0x2b0000a1. ]
Dmesg log
---------
// Early load OK.
[ 0.000000] microcode: updated early: 0xab000130 -> 0xab000140, date = 2022-11-04
...
[ 20.215653] microcode: Microcode Update Driver: v2.2.
...
// Late load OK.
[ 27.596822] microcode: Updated to revision 0xab000160, date = 2022-11-16
[ 27.606848] microcode: load: updated on 96 primary CPUs with 96 siblings
[ 27.614789] microcode: revision: 0xab000140 -> 0xab000160
Thanks!
-Qiuxu
On Sun, Oct 08, 2023 at 04:54:56PM +0800, Qiuxu Zhuo wrote:
> Test Result (same as ucode-v3)
> ------------------------------
> Tested 'ucode-v4' on an Intel Sapphire Rapids server; both early load
> and late load worked well. For more details, please refer to the test below:
Thanks.
I've found a couple of issues and once I'm done with my testing, I'll
push tip:x86/microcode and you could run it then to make sure it all is
still ok.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
> From: Borislav Petkov <[email protected]>
> ...
> > Test Result (same as ucode-v3)
> > ------------------------------
> > Tested 'ucode-v4' on an Intel Sapphire Rapids server; both early load
> > and late load worked well. For more details, please refer to the test below:
>
> Thanks.
>
> I've found a couple of issues and once I'm done with my testing, I'll push
> tip:x86/microcode and you could run it then to make sure it all is still ok.
Hi Boris,
OK. I'll re-run the test once you push the code to tip:x86/microcode.
-Qiuxu
Hi Boris,
> From: Borislav Petkov <[email protected]>
> ...
> > Test Result (same as ucode-v3)
> > ------------------------------
> > Tested 'ucode-v4' on an Intel Sapphire Rapids server; both early load
> > and late load worked well. For more details, please refer to the test below:
>
> Thanks.
>
> I've found a couple of issues and once I'm done with my testing, I'll push
> tip:x86/microcode and you could run it then to make sure it all is still ok.
Test Result (same as ucode-v4)
------------------------------
Tested tip:x86/microcode (top commit 9975802d3f74) on an Intel Sapphire
Rapids server; both early load and late load worked well. For more
details, please refer to the test below:
Tested Machine
--------------
Intel Sapphire Rapids server with 2 sockets, each containing 48 cores,
resulting in a total of 192 threads.
Microcodes
----------
a) Microcode revision of CPU : 0xab000130
b) Microcode revision in the initramfs : 0xab000140 // for early load
c) Microcode revision in /lib/firmware/intel-ucode/* : 0xab000160 // for late load
[ Microcode b) & c) headers both contain minirev 0x2b0000a1. ]
Dmesg log
---------
// Early load OK.
[ 0.000000] microcode: updated early: 0xab000130 -> 0xab000140, date = 2022-11-04
...
[ 20.261926] microcode: Microcode Update Driver: v2.2.
...
// Late load OK.
[ 27.400858] microcode: Updated to revision 0xab000160, date = 2022-11-16
[ 27.409978] microcode: load: updated on 96 primary CPUs with 96 siblings
[ 27.409997] microcode: revision: 0xab000140 -> 0xab000160
cpuinfo
-------
cat /proc/cpuinfo | grep -m1 microcode
microcode : 0xab000160
Thanks!
-Qiuxu
On Tue, Oct 10, 2023 at 08:00:27AM +0000, Zhuo, Qiuxu wrote:
> Test Result (same as ucode-v4)
> ------------------------------
> Tested tip:x86/microcode (top commit 9975802d3f74) on an Intel Sapphire
> Rapids server; both early load and late load worked well. For more
> details, please refer to the test below:
Thanks!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette