2024-02-22 18:42:50

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 00/34] [RFC] x86: Rework system-wide configuration masquerading as per-cpu data

tl;dr: This tries to be more regimented in how system-wide x86 processor
configuration data is initialized. It does that by moving fields out of
the per-cpu 'struct cpuinfo_x86' and into a new structure. It also
requires that the boot CPU set this data up *once* and that it then be
left alone.

This is a more comprehensive approach to avoid the random tinkering
in patches like these:

https://lore.kernel.org/all/[email protected]/

--

'struct cpuinfo_x86' is a mess. At a high level, it can be thought
of as a CPU enumeration cache that (among other things) avoids using
CPUID constantly. There's a copy of it for each CPU, but often the
'boot_cpu_data' version is the only one that counts. Its fields are
written in essentially random order, including via a wholesale memcpy()
from 'boot_cpu_data' into the secondary processors' copies.

This series focuses on a subset of the 'cpuinfo_x86' fields that must
be consistent across all CPUs:
* The virtual and physical sizes supported by the CPU
* The number of those physical bits that the caches comprehend
* The size of a cacheline that CLFLUSH works on
* The actual alignment that matters to the caches (can be different
from that CLFLUSH size)

There are a few random folks cc'd here because their subsystem reads
one or more of these things in an x86 #ifdef of some kind.

There's a mishmash of ways to obtain these from CPUID, plus fixups
for errata and general processor wonkiness. There are also defaults
for these for processors that, for instance, don't even _have_ CPUID
support. I think virt/phys_bits can be set and overwritten no fewer
than five times!

Let's bring some order to the chaos.

First, create some abstractions so that there are no longer direct
explicit accesses to some 'boot_cpu_data' fields. This also provides
a way to do sanity checking so nothing consumes garbage.

Second, separate the address configuration inputs from the resulting
values. The inputs are provided in early boot by the boot processor and
stashed in 'bsp_addr_config' (a 'struct x86_addr_config').

Third, remove global, shared configuration from 'struct cpuinfo_x86'.
Put it in a new structure: 'struct x86_sys_config'.

This creates a simple set of rules:
1. The boot CPU populates 'bsp_addr_config' and nothing ever writes
to it again
2. get_cpu_address_sizes() reads 'bsp_addr_config' and writes
'x86_config'
3. 'bsp_addr_config' is never read again
4. 'x86_config' is never written again

The centralized helpers also provide a place to enforce those rules.
The series finishes by doing exactly that: spitting out a warning when
the rules are violated. This warning mechanism works in
*very* early boot and is a little unusual. It could use some more
eyeballs (Subject: Enforce read-only x86_config state).
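
As a rough illustration of the direction (a minimal sketch only; the
real patches use an early-boot-safe warning mechanism rather than a
plain WARN_ON_ONCE()):

	static inline u8 x86_phys_bits(void)
	{
		/* Rules 2-4: reads must come after get_cpu_address_sizes() */
		WARN_ON_ONCE(!x86_config.phys_bits);
		return x86_config.phys_bits;
	}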

=== FAQ ===

== Why are both 'bsp_addr_config' and 'x86_config' needed? ==

Having defined, separate lifetimes helps enforce the new rules.

Once everything is in place, 'bsp_addr_config' is only used at boot and
can be marked __init. Modpost can help find users of it that sneak in
after boot. Secondary CPU startup code can't be __init (because of
CPU hotplug), which helps ensure that 'bsp_addr_config' can only be
referenced from boot CPU code.

'x86_config' can also be __ro_after_init, which reduces the chance of
it being tweaked after boot and growing into a little mini version of
'boot_cpu_data' with all the same problems.
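
Sketch of where the declarations could plausibly end up once the whole
series is applied (the __initdata/__ro_after_init markings are the goal
state, not what every intermediate patch uses):

	/* Inputs: written once by the boot CPU, read by get_cpu_address_sizes() */
	struct x86_addr_config bsp_addr_config __initdata;

	/* Outputs: written once by get_cpu_address_sizes(), read-only afterwards */
	struct x86_sys_config x86_config __ro_after_init;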

--

Ideas for future work (probably in another series):

* Consolidate defaults in x86_config accessors and
get_cpu_address_sizes() down to one copy
* Can we do something more aggressive with x86_clflush_size()? Is
it ever anything other than 64 bytes on 64-bit?

Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Daniel Vetter <[email protected]>
Cc: Juergen Gross <[email protected]>


2024-02-22 18:42:57

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 01/34] x86/xen: Explain why calling get_cpu_cap() so early is a hack


From: Dave Hansen <[email protected]>

xen_start_kernel() is some of the first C code run on Xen PV systems.
It precedes normally-very-early things like setup_arch() and the
processor initialization code.

That means that 'boot_cpu_data' is garbage. It has not even
established the utter basics, like whether the CPU supports the CPUID
instruction. Unfortunately, get_cpu_cap() requires exactly that
information.

Nevertheless, xen_start_kernel() calls get_cpu_cap(). It works
out in practice because it is only looking for the NX bit, which
comes from an extended CPUID leaf that doesn't depend on
c->cpuid_level being set. This also implicitly assumes that
Xen PV guests support CPUID.
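
Roughly speaking, the part of get_cpu_cap() that makes this work looks
something like this (sketch only, not the actual code):

	/* Probe the max extended leaf directly; c->cpuid_level is not consulted: */
	eax = cpuid_eax(0x80000000);
	c->extended_cpuid_level = eax;

	/* The NX bit lives in leaf 0x80000001 EDX: */
	if (eax >= 0x80000001)
		c->x86_capability[CPUID_8000_0001_EDX] = cpuid_edx(0x80000001);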

Leave the hack in place, but at least explain some of what is
going on.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Juergen Gross <[email protected]>
---

b/arch/x86/xen/enlighten_pv.c | 6 +++++-
1 file changed, 5 insertions(+), 1 deletion(-)

diff -puN arch/x86/xen/enlighten_pv.c~xen-explain1 arch/x86/xen/enlighten_pv.c
--- a/arch/x86/xen/enlighten_pv.c~xen-explain1 2024-02-22 10:08:48.404451146 -0800
+++ b/arch/x86/xen/enlighten_pv.c 2024-02-22 10:08:48.404451146 -0800
@@ -1372,7 +1372,11 @@ asmlinkage __visible void __init xen_sta
/* Get mfn list */
xen_build_dynamic_phys_to_machine();

- /* Work out if we support NX */
+ /*
+ * This is a hack. 'boot_cpu_data' is not filled out enough
+ * for get_cpu_cap() to function fully. But it _can_ fill out
+ * the leaves where NX is. Gross, but it works.
+ */
get_cpu_cap(&boot_cpu_data);
x86_configure_nx();

_

2024-02-22 18:43:10

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 02/34] x86/xen: Remove early "debug" physical address lookups


From: Dave Hansen <[email protected]>

The __pa() facility is subject to debugging checks if CONFIG_DEBUG_VIRTUAL=y.
One of those debugging checks is whether the physical address is valid
on the platform. That information is normally available via CPUID. But
the __pa() code currently looks it up in 'boot_cpu_data' which is not
fully set up in early Xen PV boot.

The Xen PV code currently tries to get this info with
get_cpu_address_sizes(), which also depends on 'boot_cpu_data' being at
least somewhat set up. The result is that c->x86_phys_bits gets a
sane-looking value, but not one that has anything to do with the
hardware. In other words, the CONFIG_DEBUG_VIRTUAL checks are performed
with what amounts to garbage inputs.
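
For reference, the check that ends up consuming the bogus value boils
down to this (simplified from arch/x86/mm/physaddr.h):

	static inline int phys_addr_valid(resource_size_t addr)
	{
		/* A made-up x86_phys_bits makes this pass or fail arbitrarily */
		return !(addr >> boot_cpu_data.x86_phys_bits);
	}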

Garbage checks are worse than no check at all. Move over to the
"nodebug" variant to axe the checks.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Juergen Gross <[email protected]>
---

b/arch/x86/xen/enlighten_pv.c | 2 +-
b/arch/x86/xen/mmu_pv.c | 2 +-
2 files changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/xen/enlighten_pv.c~xen-no-early-__pa arch/x86/xen/enlighten_pv.c
--- a/arch/x86/xen/enlighten_pv.c~xen-no-early-__pa 2024-02-22 10:08:48.868469363 -0800
+++ b/arch/x86/xen/enlighten_pv.c 2024-02-22 10:08:48.872469519 -0800
@@ -1452,7 +1452,7 @@ asmlinkage __visible void __init xen_sta
boot_params.hdr.type_of_loader = (9 << 4) | 0;
boot_params.hdr.ramdisk_image = initrd_start;
boot_params.hdr.ramdisk_size = xen_start_info->mod_len;
- boot_params.hdr.cmd_line_ptr = __pa(xen_start_info->cmd_line);
+ boot_params.hdr.cmd_line_ptr = __pa_nodebug(xen_start_info->cmd_line);
boot_params.hdr.hardware_subarch = X86_SUBARCH_XEN;

if (!xen_initial_domain()) {
diff -puN arch/x86/xen/mmu_pv.c~xen-no-early-__pa arch/x86/xen/mmu_pv.c
--- a/arch/x86/xen/mmu_pv.c~xen-no-early-__pa 2024-02-22 10:08:48.872469519 -0800
+++ b/arch/x86/xen/mmu_pv.c 2024-02-22 10:08:48.872469519 -0800
@@ -2006,7 +2006,7 @@ void __init xen_reserve_special_pages(vo
{
phys_addr_t paddr;

- memblock_reserve(__pa(xen_start_info), PAGE_SIZE);
+ memblock_reserve(__pa_nodebug(xen_start_info), PAGE_SIZE);
if (xen_start_info->store_mfn) {
paddr = PFN_PHYS(mfn_to_pfn(xen_start_info->store_mfn));
memblock_reserve(paddr, PAGE_SIZE);
_

2024-02-22 18:43:21

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 03/34] x86/pci: Assume that clflush size is always provided


From: Dave Hansen <[email protected]>

The early boot code always sets _some_ clflush size. Use that fact
to avoid handling the case where it is not set.

There may have been a time when Xen PV called in here way too
early. But it now calls get_cpu_address_sizes() before calling
here, so it should also be safe.

Note: This series will eventually return sane defaults even very
early in boot. I believe this is safe now, but it becomes *really*
safe later on.
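
To sketch where that is headed (hypothetical at this point in the
series; neither the accessor nor the defaulting exists yet), the
eventual helper could hand back a sane default instead of whatever
happens to be lying around:

	static inline u8 x86_clflush_size(void)
	{
		if (!x86_config.clflush_size)
			return 64;	/* hypothetical early-boot default */
		return x86_config.clflush_size;
	}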

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/pci/common.c | 19 +++----------------
1 file changed, 3 insertions(+), 16 deletions(-)

diff -puN arch/x86/pci/common.c~x86-pci-clflush-size arch/x86/pci/common.c
--- a/arch/x86/pci/common.c~x86-pci-clflush-size 2024-02-22 10:08:49.356488521 -0800
+++ b/arch/x86/pci/common.c 2024-02-22 10:08:49.356488521 -0800
@@ -480,22 +480,9 @@ void pcibios_scan_root(int busnum)

void __init pcibios_set_cache_line_size(void)
{
- struct cpuinfo_x86 *c = &boot_cpu_data;
-
- /*
- * Set PCI cacheline size to that of the CPU if the CPU has reported it.
- * (For older CPUs that don't support cpuid, we se it to 32 bytes
- * It's also good for 386/486s (which actually have 16)
- * as quite a few PCI devices do not support smaller values.
- */
- if (c->x86_clflush_size > 0) {
- pci_dfl_cache_line_size = c->x86_clflush_size >> 2;
- printk(KERN_DEBUG "PCI: pci_cache_line_size set to %d bytes\n",
- pci_dfl_cache_line_size << 2);
- } else {
- pci_dfl_cache_line_size = 32 >> 2;
- printk(KERN_DEBUG "PCI: Unknown cacheline size. Setting to 32 bytes\n");
- }
+ pci_dfl_cache_line_size = boot_cpu_data.x86_clflush_size >> 2;
+ printk(KERN_DEBUG "PCI: pci_cache_line_size set to %d bytes\n",
+ pci_dfl_cache_line_size << 2);
}

int __init pcibios_init(void)
_

2024-02-22 18:43:51

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 05/34] x86/cpu: Move /proc per-cpu ->x86_phys_bits reference to global one


From: Dave Hansen <[email protected]>

This is one of the few references to the per-cpu version of
->x86_phys_bits. It is theoretically possible to have this value vary
from CPU to CPU. But in practice nobody builds systems that way.

Continue outputting the value in /proc, but read it from the global
configuration.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/proc.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/kernel/cpu/proc.c~proc-global-x86-phys-bits arch/x86/kernel/cpu/proc.c
--- a/arch/x86/kernel/cpu/proc.c~proc-global-x86-phys-bits 2024-02-22 10:08:50.584536732 -0800
+++ b/arch/x86/kernel/cpu/proc.c 2024-02-22 10:08:50.584536732 -0800
@@ -133,7 +133,7 @@ static int show_cpuinfo(struct seq_file
seq_printf(m, "clflush size\t: %u\n", c->x86_clflush_size);
seq_printf(m, "cache_alignment\t: %d\n", c->x86_cache_alignment);
seq_printf(m, "address sizes\t: %u bits physical, %u bits virtual\n",
- c->x86_phys_bits, c->x86_virt_bits);
+ x86_phys_bits(), c->x86_virt_bits);

seq_puts(m, "power management:");
for (i = 0; i < 32; i++) {
_

2024-02-22 18:43:50

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 04/34] x86/mm: Introduce physical address limit helper


From: Dave Hansen <[email protected]>

Modern processors enumerate the sizes of the physical and virtual address
spaces that they can handle. The kernel stashes this information for each
CPU in 'cpuinfo_x86'.

Like a lot of system-wide configuration, in practice the kernel only uses
this information from the boot CPU. That makes a lot of sense because
the kernel can't practically support two CPUs that differ in something
as fundamental as the size of their address spaces.

Having a per-cpu copy of this data is silly. It is, at best, a waste
of space to keep it around for the non-boot CPUs. At worst, it is yet
another bit of data that must be initialized in a particular order and
can be a source of bugs.

Introduce a helper to look up the number of supported physical address bits:

x86_phys_bits()

Replace most open-coded references to boot_cpu_data.x86_phys_bits.

This form is more compact and also opens the door to adding centralized
sanity checking and rule enforcement around this important system-wide
value.

Signed-off-by: Dave Hansen <[email protected]>
Cc: "Rafael J. Wysocki" <[email protected]>
Cc: Len Brown <[email protected]>
---

b/arch/x86/include/asm/kmsan.h | 2 +-
b/arch/x86/include/asm/mce.h | 2 +-
b/arch/x86/include/asm/processor.h | 5 +++++
b/arch/x86/kernel/cpu/mtrr/cleanup.c | 2 +-
b/arch/x86/kernel/cpu/mtrr/generic.c | 2 +-
b/arch/x86/kernel/cpu/mtrr/mtrr.c | 5 ++---
b/arch/x86/kernel/setup.c | 2 +-
b/arch/x86/kvm/cpuid.c | 2 +-
b/arch/x86/kvm/mmu.h | 10 +++++-----
b/arch/x86/kvm/mmu/spte.c | 2 +-
b/arch/x86/kvm/svm/svm.c | 2 +-
b/arch/x86/kvm/vmx/vmx.c | 10 +++++-----
b/arch/x86/kvm/vmx/vmx.h | 2 +-
b/arch/x86/mm/physaddr.h | 2 +-
b/drivers/acpi/acpi_fpdt.c | 2 +-
15 files changed, 28 insertions(+), 24 deletions(-)

diff -puN arch/x86/include/asm/kmsan.h~introduce-x86_phys_bits arch/x86/include/asm/kmsan.h
--- a/arch/x86/include/asm/kmsan.h~introduce-x86_phys_bits 2024-02-22 10:08:49.828507052 -0800
+++ b/arch/x86/include/asm/kmsan.h 2024-02-22 10:08:49.852507994 -0800
@@ -52,7 +52,7 @@ static inline void *arch_kmsan_get_meta_
static inline bool kmsan_phys_addr_valid(unsigned long addr)
{
if (IS_ENABLED(CONFIG_PHYS_ADDR_T_64BIT))
- return !(addr >> boot_cpu_data.x86_phys_bits);
+ return !(addr >> x86_phys_bits());
else
return true;
}
diff -puN arch/x86/include/asm/mce.h~introduce-x86_phys_bits arch/x86/include/asm/mce.h
--- a/arch/x86/include/asm/mce.h~introduce-x86_phys_bits 2024-02-22 10:08:49.828507052 -0800
+++ b/arch/x86/include/asm/mce.h 2024-02-22 10:08:49.852507994 -0800
@@ -89,7 +89,7 @@
#define MCI_MISC_ADDR_GENERIC 7 /* generic */

/* MCi_ADDR register defines */
-#define MCI_ADDR_PHYSADDR GENMASK_ULL(boot_cpu_data.x86_phys_bits - 1, 0)
+#define MCI_ADDR_PHYSADDR GENMASK_ULL(x86_phys_bits() - 1, 0)

/* CTL2 register defines */
#define MCI_CTL2_CMCI_EN BIT_ULL(30)
diff -puN arch/x86/include/asm/processor.h~introduce-x86_phys_bits arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~introduce-x86_phys_bits 2024-02-22 10:08:49.828507052 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:49.852507994 -0800
@@ -767,4 +767,9 @@ static inline void weak_wrmsr_fence(void
alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
}

+static inline u8 x86_phys_bits(void)
+{
+ return boot_cpu_data.x86_phys_bits;
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
diff -puN arch/x86/kernel/cpu/mtrr/cleanup.c~introduce-x86_phys_bits arch/x86/kernel/cpu/mtrr/cleanup.c
--- a/arch/x86/kernel/cpu/mtrr/cleanup.c~introduce-x86_phys_bits 2024-02-22 10:08:49.832507209 -0800
+++ b/arch/x86/kernel/cpu/mtrr/cleanup.c 2024-02-22 10:08:49.852507994 -0800
@@ -170,7 +170,7 @@ set_var_mtrr(unsigned int reg, unsigned
return;
}

- mask = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
+ mask = (1ULL << x86_phys_bits()) - 1;
mask &= ~((((u64)sizek) << 10) - 1);

base = ((u64)basek) << 10;
diff -puN arch/x86/kernel/cpu/mtrr/generic.c~introduce-x86_phys_bits arch/x86/kernel/cpu/mtrr/generic.c
--- a/arch/x86/kernel/cpu/mtrr/generic.c~introduce-x86_phys_bits 2024-02-22 10:08:49.832507209 -0800
+++ b/arch/x86/kernel/cpu/mtrr/generic.c 2024-02-22 10:08:49.852507994 -0800
@@ -660,7 +660,7 @@ static void __init print_mtrr_state(void
}
pr_info("MTRR variable ranges %sabled:\n",
mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED ? "en" : "dis");
- high_width = (boot_cpu_data.x86_phys_bits - (32 - PAGE_SHIFT) + 3) / 4;
+ high_width = (x86_phys_bits() - (32 - PAGE_SHIFT) + 3) / 4;

for (i = 0; i < num_var_ranges; ++i) {
if (mtrr_state.var_ranges[i].mask_lo & MTRR_PHYSMASK_V)
diff -puN arch/x86/kernel/cpu/mtrr/mtrr.c~introduce-x86_phys_bits arch/x86/kernel/cpu/mtrr/mtrr.c
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c~introduce-x86_phys_bits 2024-02-22 10:08:49.836507366 -0800
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c 2024-02-22 10:08:49.852507994 -0800
@@ -252,8 +252,7 @@ int mtrr_add_page(unsigned long base, un
return -EINVAL;
}

- if ((base | (base + size - 1)) >>
- (boot_cpu_data.x86_phys_bits - PAGE_SHIFT)) {
+ if ((base | (base + size - 1)) >> (x86_phys_bits() - PAGE_SHIFT)) {
pr_warn("base or size exceeds the MTRR width\n");
return -EINVAL;
}
@@ -556,7 +555,7 @@ void __init mtrr_bp_init(void)
const char *why = "(not available)";
unsigned long config, dummy;

- phys_hi_rsvd = GENMASK(31, boot_cpu_data.x86_phys_bits - 32);
+ phys_hi_rsvd = GENMASK(31, x86_phys_bits() - 32);

if (!generic_mtrrs && mtrr_state.enabled) {
/*
diff -puN arch/x86/kernel/setup.c~introduce-x86_phys_bits arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~introduce-x86_phys_bits 2024-02-22 10:08:49.836507366 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:08:49.852507994 -0800
@@ -813,7 +813,7 @@ void __init setup_arch(char **cmdline_p)
*/
early_reserve_memory();

- iomem_resource.end = (1ULL << boot_cpu_data.x86_phys_bits) - 1;
+ iomem_resource.end = (1ULL << x86_phys_bits()) - 1;
e820__memory_setup();
parse_setup_data();

diff -puN arch/x86/kvm/cpuid.c~introduce-x86_phys_bits arch/x86/kvm/cpuid.c
--- a/arch/x86/kvm/cpuid.c~introduce-x86_phys_bits 2024-02-22 10:08:49.836507366 -0800
+++ b/arch/x86/kvm/cpuid.c 2024-02-22 10:08:49.852507994 -0800
@@ -1236,7 +1236,7 @@ static inline int __do_cpuid_func(struct
* the HPAs do not affect GPAs.
*/
if (!tdp_enabled)
- g_phys_as = boot_cpu_data.x86_phys_bits;
+ g_phys_as = x86_phys_bits();
else if (!g_phys_as)
g_phys_as = phys_as;

diff -puN arch/x86/kvm/mmu.h~introduce-x86_phys_bits arch/x86/kvm/mmu.h
--- a/arch/x86/kvm/mmu.h~introduce-x86_phys_bits 2024-02-22 10:08:49.840507523 -0800
+++ b/arch/x86/kvm/mmu.h 2024-02-22 10:08:49.852507994 -0800
@@ -84,10 +84,10 @@ static inline gfn_t kvm_mmu_max_gfn(void
static inline u8 kvm_get_shadow_phys_bits(void)
{
/*
- * boot_cpu_data.x86_phys_bits is reduced when MKTME or SME are detected
- * in CPU detection code, but the processor treats those reduced bits as
- * 'keyID' thus they are not reserved bits. Therefore KVM needs to look at
- * the physical address bits reported by CPUID.
+ * x86_phys_bits() is reduced when MKTME or SME are detected in CPU
+ * detection code, but the processor treats those reduced bits as 'keyID'
+ * thus they are not reserved bits. Therefore KVM needs to look at the
+ * physical address bits reported by CPUID.
*/
if (likely(boot_cpu_data.extended_cpuid_level >= 0x80000008))
return cpuid_eax(0x80000008) & 0xff;
@@ -97,7 +97,7 @@ static inline u8 kvm_get_shadow_phys_bit
* custom CPUID. Proceed with whatever the kernel found since these features
* aren't virtualizable (SME/SEV also require CPUIDs higher than 0x80000008).
*/
- return boot_cpu_data.x86_phys_bits;
+ return x86_phys_bits();
}

void kvm_mmu_set_mmio_spte_mask(u64 mmio_value, u64 mmio_mask, u64 access_mask);
diff -puN arch/x86/kvm/mmu/spte.c~introduce-x86_phys_bits arch/x86/kvm/mmu/spte.c
--- a/arch/x86/kvm/mmu/spte.c~introduce-x86_phys_bits 2024-02-22 10:08:49.840507523 -0800
+++ b/arch/x86/kvm/mmu/spte.c 2024-02-22 10:08:49.852507994 -0800
@@ -468,7 +468,7 @@ void kvm_mmu_reset_all_pte_masks(void)
* the most significant bits of legal physical address space.
*/
shadow_nonpresent_or_rsvd_mask = 0;
- low_phys_bits = boot_cpu_data.x86_phys_bits;
+ low_phys_bits = x86_phys_bits();
if (boot_cpu_has_bug(X86_BUG_L1TF) &&
!WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
52 - SHADOW_NONPRESENT_OR_RSVD_MASK_LEN)) {
diff -puN arch/x86/kvm/svm/svm.c~introduce-x86_phys_bits arch/x86/kvm/svm/svm.c
--- a/arch/x86/kvm/svm/svm.c~introduce-x86_phys_bits 2024-02-22 10:08:49.840507523 -0800
+++ b/arch/x86/kvm/svm/svm.c 2024-02-22 10:08:49.852507994 -0800
@@ -5054,7 +5054,7 @@ static __init void svm_adjust_mmio_mask(
return;

enc_bit = cpuid_ebx(0x8000001f) & 0x3f;
- mask_bit = boot_cpu_data.x86_phys_bits;
+ mask_bit = x86_phys_bits();

/* Increment the mask bit if it is the same as the encryption bit */
if (enc_bit == mask_bit)
diff -puN arch/x86/kvm/vmx/vmx.c~introduce-x86_phys_bits arch/x86/kvm/vmx/vmx.c
--- a/arch/x86/kvm/vmx/vmx.c~introduce-x86_phys_bits 2024-02-22 10:08:49.844507680 -0800
+++ b/arch/x86/kvm/vmx/vmx.c 2024-02-22 10:08:49.852507994 -0800
@@ -8444,14 +8444,14 @@ static void __init vmx_setup_me_spte_mas
* kvm_get_shadow_phys_bits() returns shadow_phys_bits. Use
* the former to avoid exposing shadow_phys_bits.
*
- * On pre-MKTME system, boot_cpu_data.x86_phys_bits equals to
+ * On pre-MKTME system, x86_phys_bits() equals to
* shadow_phys_bits. On MKTME and/or TDX capable systems,
- * boot_cpu_data.x86_phys_bits holds the actual physical address
- * w/o the KeyID bits, and shadow_phys_bits equals to MAXPHYADDR
+ * x86_phys_bits() holds the actual physical address w/o the
+ * KeyID bits, and shadow_phys_bits equals to MAXPHYADDR
* reported by CPUID. Those bits between are KeyID bits.
*/
- if (boot_cpu_data.x86_phys_bits != kvm_get_shadow_phys_bits())
- me_mask = rsvd_bits(boot_cpu_data.x86_phys_bits,
+ if (x86_phys_bits() != kvm_get_shadow_phys_bits())
+ me_mask = rsvd_bits(x86_phys_bits(),
kvm_get_shadow_phys_bits() - 1);
/*
* Unlike SME, host kernel doesn't support setting up any
diff -puN arch/x86/kvm/vmx/vmx.h~introduce-x86_phys_bits arch/x86/kvm/vmx/vmx.h
--- a/arch/x86/kvm/vmx/vmx.h~introduce-x86_phys_bits 2024-02-22 10:08:49.844507680 -0800
+++ b/arch/x86/kvm/vmx/vmx.h 2024-02-22 10:08:49.852507994 -0800
@@ -721,7 +721,7 @@ static inline bool vmx_need_pf_intercept
if (!enable_ept)
return true;

- return allow_smaller_maxphyaddr && cpuid_maxphyaddr(vcpu) < boot_cpu_data.x86_phys_bits;
+ return allow_smaller_maxphyaddr && cpuid_maxphyaddr(vcpu) < x86_phys_bits();
}

static inline bool is_unrestricted_guest(struct kvm_vcpu *vcpu)
diff -puN arch/x86/mm/physaddr.h~introduce-x86_phys_bits arch/x86/mm/physaddr.h
--- a/arch/x86/mm/physaddr.h~introduce-x86_phys_bits 2024-02-22 10:08:49.848507837 -0800
+++ b/arch/x86/mm/physaddr.h 2024-02-22 10:08:49.852507994 -0800
@@ -4,7 +4,7 @@
static inline int phys_addr_valid(resource_size_t addr)
{
#ifdef CONFIG_PHYS_ADDR_T_64BIT
- return !(addr >> boot_cpu_data.x86_phys_bits);
+ return !(addr >> x86_phys_bits());
#else
return 1;
#endif
diff -puN drivers/acpi/acpi_fpdt.c~introduce-x86_phys_bits drivers/acpi/acpi_fpdt.c
--- a/drivers/acpi/acpi_fpdt.c~introduce-x86_phys_bits 2024-02-22 10:08:49.848507837 -0800
+++ b/drivers/acpi/acpi_fpdt.c 2024-02-22 10:08:49.852507994 -0800
@@ -151,7 +151,7 @@ static bool fpdt_address_valid(u64 addre
* On some systems the table contains invalid addresses
* with unsuppored high address bits set, check for this.
*/
- return !(address >> boot_cpu_data.x86_phys_bits);
+ return !(address >> x86_phys_bits());
}
#else
static bool fpdt_address_valid(u64 address)
_

2024-02-22 18:44:25

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 07/34] x86/mm: Introduce virtual address space limit helper


From: Dave Hansen <[email protected]>

This takes the same logic and approach used for the physical address
limit with x86_phys_bits() and extends them to the virtual address
space.

Introduce a system-wide helper for users to query the size of the
virtual address space: x86_virt_bits()

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/events/amd/brs.c | 2 +-
b/arch/x86/events/amd/lbr.c | 2 +-
b/arch/x86/events/intel/pt.c | 4 ++--
b/arch/x86/include/asm/processor.h | 5 +++++
b/arch/x86/kernel/cpu/proc.c | 2 +-
b/arch/x86/mm/maccess.c | 4 ++--
6 files changed, 12 insertions(+), 7 deletions(-)

diff -puN arch/x86/events/amd/brs.c~x86_virt_bits-func arch/x86/events/amd/brs.c
--- a/arch/x86/events/amd/brs.c~x86_virt_bits-func 2024-02-22 10:08:51.528573793 -0800
+++ b/arch/x86/events/amd/brs.c 2024-02-22 10:08:51.536574107 -0800
@@ -285,7 +285,7 @@ void amd_brs_drain(void)
struct perf_branch_entry *br = cpuc->lbr_entries;
union amd_debug_extn_cfg cfg;
u32 i, nr = 0, num, tos, start;
- u32 shift = 64 - boot_cpu_data.x86_virt_bits;
+ u32 shift = 64 - x86_virt_bits();

/*
* BRS event forced on PMC0,
diff -puN arch/x86/events/amd/lbr.c~x86_virt_bits-func arch/x86/events/amd/lbr.c
--- a/arch/x86/events/amd/lbr.c~x86_virt_bits-func 2024-02-22 10:08:51.528573793 -0800
+++ b/arch/x86/events/amd/lbr.c 2024-02-22 10:08:51.536574107 -0800
@@ -89,7 +89,7 @@ static __always_inline u64 amd_pmu_lbr_g

static __always_inline u64 sign_ext_branch_ip(u64 ip)
{
- u32 shift = 64 - boot_cpu_data.x86_virt_bits;
+ u32 shift = 64 - x86_virt_bits();

return (u64)(((s64)ip << shift) >> shift);
}
diff -puN arch/x86/events/intel/pt.c~x86_virt_bits-func arch/x86/events/intel/pt.c
--- a/arch/x86/events/intel/pt.c~x86_virt_bits-func 2024-02-22 10:08:51.528573793 -0800
+++ b/arch/x86/events/intel/pt.c 2024-02-22 10:08:51.536574107 -0800
@@ -1453,8 +1453,8 @@ static void pt_event_addr_filters_sync(s
* canonical addresses does not affect the result of the
* address filter.
*/
- msr_a = clamp_to_ge_canonical_addr(a, boot_cpu_data.x86_virt_bits);
- msr_b = clamp_to_le_canonical_addr(b, boot_cpu_data.x86_virt_bits);
+ msr_a = clamp_to_ge_canonical_addr(a, x86_virt_bits());
+ msr_b = clamp_to_le_canonical_addr(b, x86_virt_bits());
if (msr_b < msr_a)
msr_a = msr_b = 0;
}
diff -puN arch/x86/include/asm/processor.h~x86_virt_bits-func arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_virt_bits-func 2024-02-22 10:08:51.532573950 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:51.536574107 -0800
@@ -772,4 +772,9 @@ static inline u8 x86_phys_bits(void)
return boot_cpu_data.x86_phys_bits;
}

+static inline u8 x86_virt_bits(void)
+{
+ return boot_cpu_data.x86_virt_bits;
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
diff -puN arch/x86/kernel/cpu/proc.c~x86_virt_bits-func arch/x86/kernel/cpu/proc.c
--- a/arch/x86/kernel/cpu/proc.c~x86_virt_bits-func 2024-02-22 10:08:51.532573950 -0800
+++ b/arch/x86/kernel/cpu/proc.c 2024-02-22 10:08:51.536574107 -0800
@@ -133,7 +133,7 @@ static int show_cpuinfo(struct seq_file
seq_printf(m, "clflush size\t: %u\n", c->x86_clflush_size);
seq_printf(m, "cache_alignment\t: %d\n", c->x86_cache_alignment);
seq_printf(m, "address sizes\t: %u bits physical, %u bits virtual\n",
- x86_phys_bits(), c->x86_virt_bits);
+ x86_phys_bits(), x86_virt_bits());

seq_puts(m, "power management:");
for (i = 0; i < 32; i++) {
diff -puN arch/x86/mm/maccess.c~x86_virt_bits-func arch/x86/mm/maccess.c
--- a/arch/x86/mm/maccess.c~x86_virt_bits-func 2024-02-22 10:08:51.536574107 -0800
+++ b/arch/x86/mm/maccess.c 2024-02-22 10:08:51.536574107 -0800
@@ -20,10 +20,10 @@ bool copy_from_kernel_nofault_allowed(co
* is initialized. Needed for instruction decoding in early
* exception handlers.
*/
- if (!boot_cpu_data.x86_virt_bits)
+ if (!x86_virt_bits())
return true;

- return __is_canonical_address(vaddr, boot_cpu_data.x86_virt_bits);
+ return __is_canonical_address(vaddr, x86_virt_bits());
}
#else
bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
_

2024-02-22 18:45:30

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure


From: Dave Hansen <[email protected]>

Intel also does memory encryption and fiddles with the physical
address bits. detect_tme() is currently called for *each* CPU, but it
practically only does anything on the boot CPU because of 'mktme_status'.

Move it from the "each CPU" ->c_init() function to ->c_bsp_init() where
the whole thing only gets called once ever. This also necessitates moving
detect_tme() and its entourage around in the file.
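
For context, the distinction being relied upon looks roughly like this
(illustrative; the real initializer has more fields):

	static const struct cpu_dev intel_cpu_dev = {
		.c_vendor	= "Intel",
		.c_bsp_init	= bsp_init_intel,	/* boot CPU only, called once */
		.c_init		= init_intel,		/* called for every CPU */
	};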

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/intel.c | 174 +++++++++++++++++++++---------------------
1 file changed, 87 insertions(+), 87 deletions(-)

diff -puN arch/x86/kernel/cpu/intel.c~intel-move-TME-detection arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~intel-move-TME-detection 2024-02-22 10:08:53.820663775 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:53.824663932 -0800
@@ -324,9 +324,96 @@ static void early_init_intel(struct cpui
detect_ht_early(c);
}

+#define MSR_IA32_TME_ACTIVATE 0x982
+
+/* Helpers to access TME_ACTIVATE MSR */
+#define TME_ACTIVATE_LOCKED(x) (x & 0x1)
+#define TME_ACTIVATE_ENABLED(x) (x & 0x2)
+
+#define TME_ACTIVATE_POLICY(x) ((x >> 4) & 0xf) /* Bits 7:4 */
+#define TME_ACTIVATE_POLICY_AES_XTS_128 0
+
+#define TME_ACTIVATE_KEYID_BITS(x) ((x >> 32) & 0xf) /* Bits 35:32 */
+
+#define TME_ACTIVATE_CRYPTO_ALGS(x) ((x >> 48) & 0xffff) /* Bits 63:48 */
+#define TME_ACTIVATE_CRYPTO_AES_XTS_128 1
+
+/* Values for mktme_status (SW only construct) */
+#define MKTME_ENABLED 0
+#define MKTME_DISABLED 1
+#define MKTME_UNINITIALIZED 2
+static int mktme_status = MKTME_UNINITIALIZED;
+
+static void detect_tme(struct cpuinfo_x86 *c)
+{
+ u64 tme_activate, tme_policy, tme_crypto_algs;
+ int keyid_bits = 0, nr_keyids = 0;
+ static u64 tme_activate_cpu0 = 0;
+
+ rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
+
+ if (mktme_status != MKTME_UNINITIALIZED) {
+ if (tme_activate != tme_activate_cpu0) {
+ /* Broken BIOS? */
+ pr_err_once("x86/tme: configuration is inconsistent between CPUs\n");
+ pr_err_once("x86/tme: MKTME is not usable\n");
+ mktme_status = MKTME_DISABLED;
+
+ /* Proceed. We may need to exclude bits from x86_phys_bits. */
+ }
+ } else {
+ tme_activate_cpu0 = tme_activate;
+ }
+
+ if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
+ pr_info_once("x86/tme: not enabled by BIOS\n");
+ mktme_status = MKTME_DISABLED;
+ return;
+ }
+
+ if (mktme_status != MKTME_UNINITIALIZED)
+ goto detect_keyid_bits;
+
+ pr_info("x86/tme: enabled by BIOS\n");
+
+ tme_policy = TME_ACTIVATE_POLICY(tme_activate);
+ if (tme_policy != TME_ACTIVATE_POLICY_AES_XTS_128)
+ pr_warn("x86/tme: Unknown policy is active: %#llx\n", tme_policy);
+
+ tme_crypto_algs = TME_ACTIVATE_CRYPTO_ALGS(tme_activate);
+ if (!(tme_crypto_algs & TME_ACTIVATE_CRYPTO_AES_XTS_128)) {
+ pr_err("x86/mktme: No known encryption algorithm is supported: %#llx\n",
+ tme_crypto_algs);
+ mktme_status = MKTME_DISABLED;
+ }
+detect_keyid_bits:
+ keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
+ nr_keyids = (1UL << keyid_bits) - 1;
+ if (nr_keyids) {
+ pr_info_once("x86/mktme: enabled by BIOS\n");
+ pr_info_once("x86/mktme: %d KeyIDs available\n", nr_keyids);
+ } else {
+ pr_info_once("x86/mktme: disabled by BIOS\n");
+ }
+
+ if (mktme_status == MKTME_UNINITIALIZED) {
+ /* MKTME is usable */
+ mktme_status = MKTME_ENABLED;
+ }
+
+ /*
+ * KeyID bits effectively lower the number of physical address
+ * bits. Update cpuinfo_x86::x86_phys_bits accordingly.
+ */
+ c->x86_phys_bits -= keyid_bits;
+}
+
static void bsp_init_intel(struct cpuinfo_x86 *c)
{
resctrl_cpu_detect(c);
+
+ if (cpu_has(c, X86_FEATURE_TME))
+ detect_tme(c);
}

#ifdef CONFIG_X86_32
@@ -482,90 +569,6 @@ static void srat_detect_node(struct cpui
#endif
}

-#define MSR_IA32_TME_ACTIVATE 0x982
-
-/* Helpers to access TME_ACTIVATE MSR */
-#define TME_ACTIVATE_LOCKED(x) (x & 0x1)
-#define TME_ACTIVATE_ENABLED(x) (x & 0x2)
-
-#define TME_ACTIVATE_POLICY(x) ((x >> 4) & 0xf) /* Bits 7:4 */
-#define TME_ACTIVATE_POLICY_AES_XTS_128 0
-
-#define TME_ACTIVATE_KEYID_BITS(x) ((x >> 32) & 0xf) /* Bits 35:32 */
-
-#define TME_ACTIVATE_CRYPTO_ALGS(x) ((x >> 48) & 0xffff) /* Bits 63:48 */
-#define TME_ACTIVATE_CRYPTO_AES_XTS_128 1
-
-/* Values for mktme_status (SW only construct) */
-#define MKTME_ENABLED 0
-#define MKTME_DISABLED 1
-#define MKTME_UNINITIALIZED 2
-static int mktme_status = MKTME_UNINITIALIZED;
-
-static void detect_tme(struct cpuinfo_x86 *c)
-{
- u64 tme_activate, tme_policy, tme_crypto_algs;
- int keyid_bits = 0, nr_keyids = 0;
- static u64 tme_activate_cpu0 = 0;
-
- rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
-
- if (mktme_status != MKTME_UNINITIALIZED) {
- if (tme_activate != tme_activate_cpu0) {
- /* Broken BIOS? */
- pr_err_once("x86/tme: configuration is inconsistent between CPUs\n");
- pr_err_once("x86/tme: MKTME is not usable\n");
- mktme_status = MKTME_DISABLED;
-
- /* Proceed. We may need to exclude bits from x86_phys_bits. */
- }
- } else {
- tme_activate_cpu0 = tme_activate;
- }
-
- if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
- pr_info_once("x86/tme: not enabled by BIOS\n");
- mktme_status = MKTME_DISABLED;
- return;
- }
-
- if (mktme_status != MKTME_UNINITIALIZED)
- goto detect_keyid_bits;
-
- pr_info("x86/tme: enabled by BIOS\n");
-
- tme_policy = TME_ACTIVATE_POLICY(tme_activate);
- if (tme_policy != TME_ACTIVATE_POLICY_AES_XTS_128)
- pr_warn("x86/tme: Unknown policy is active: %#llx\n", tme_policy);
-
- tme_crypto_algs = TME_ACTIVATE_CRYPTO_ALGS(tme_activate);
- if (!(tme_crypto_algs & TME_ACTIVATE_CRYPTO_AES_XTS_128)) {
- pr_err("x86/mktme: No known encryption algorithm is supported: %#llx\n",
- tme_crypto_algs);
- mktme_status = MKTME_DISABLED;
- }
-detect_keyid_bits:
- keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
- nr_keyids = (1UL << keyid_bits) - 1;
- if (nr_keyids) {
- pr_info_once("x86/mktme: enabled by BIOS\n");
- pr_info_once("x86/mktme: %d KeyIDs available\n", nr_keyids);
- } else {
- pr_info_once("x86/mktme: disabled by BIOS\n");
- }
-
- if (mktme_status == MKTME_UNINITIALIZED) {
- /* MKTME is usable */
- mktme_status = MKTME_ENABLED;
- }
-
- /*
- * KeyID bits effectively lower the number of physical address
- * bits. Update cpuinfo_x86::x86_phys_bits accordingly.
- */
- c->x86_phys_bits -= keyid_bits;
-}
-
static void init_cpuid_fault(struct cpuinfo_x86 *c)
{
u64 msr;
@@ -702,9 +705,6 @@ static void init_intel(struct cpuinfo_x8

init_ia32_feat_ctl(c);

- if (cpu_has(c, X86_FEATURE_TME))
- detect_tme(c);
-
init_intel_misc_features(c);

split_lock_init();
_

2024-02-22 18:46:00

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 13/34] x86/boot: Use address reduction config to handle erratum


From: Dave Hansen <[email protected]>

Continue to chip away at sites that muck with ->x86_phys_bits. This
one is an oldie but a goodie:

af9c142de94e ("[PATCH] x86_64: Force correct address space size for MTRR on some 64bit Intel Xeons")

Evidently, the CPUs in question "report 40bit, but only have 36bits
of physical address space." Since there's now a handy way to reduce
the number of physical address bits, use that to go from 40->36.

This means there are now two (Intel) users of the address bits
reduction feature. There is no way a 2005-era CPU will ever have TME,
but it is still nice to be more explicit about the possibility of a
collision.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/intel.c | 27 +++++++++++++++++----------
1 file changed, 17 insertions(+), 10 deletions(-)

diff -puN arch/x86/kernel/cpu/intel.c~intel-phys-addr-errata arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~intel-phys-addr-errata 2024-02-22 10:08:54.772701150 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:54.776701306 -0800
@@ -242,11 +242,6 @@ static void early_init_intel(struct cpui
c->x86_cache_alignment = 128;
#endif

- /* CPUID workaround for 0F33/0F34 CPU */
- if (c->x86 == 0xF && c->x86_model == 0x3
- && (c->x86_stepping == 0x3 || c->x86_stepping == 0x4))
- c->x86_phys_bits = 36;
-
/*
* c->x86_power is 8000_0007 edx. Bit 8 is TSC runs at constant rate
* with P/T states and does not stop in deep C-states.
@@ -344,7 +339,7 @@ static void early_init_intel(struct cpui
#define MKTME_UNINITIALIZED 2
static int mktme_status = MKTME_UNINITIALIZED;

-static void detect_tme(struct cpuinfo_x86 *c)
+static int detect_tme(struct cpuinfo_x86 *c)
{
u64 tme_activate, tme_policy, tme_crypto_algs;
int keyid_bits = 0, nr_keyids = 0;
@@ -368,7 +363,7 @@ static void detect_tme(struct cpuinfo_x8
if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
pr_info_once("x86/tme: not enabled by BIOS\n");
mktme_status = MKTME_DISABLED;
- return;
+ return 0;
}

if (mktme_status != MKTME_UNINITIALIZED)
@@ -401,16 +396,28 @@ detect_keyid_bits:
mktme_status = MKTME_ENABLED;
}

- /* KeyID bits effectively lower the number of physical address bits */
- bsp_addr_config.phys_addr_reduction_bits = keyid_bits;
+ return keyid_bits;
}

static void bsp_init_intel(struct cpuinfo_x86 *c)
{
+ int keyid_bits = 0;
+
resctrl_cpu_detect(c);

if (cpu_has(c, X86_FEATURE_TME))
- detect_tme(c);
+ keyid_bits = detect_tme(c);
+
+ /* KeyID bits effectively lower the number of physical address bits */
+ bsp_addr_config.phys_addr_reduction_bits = keyid_bits;
+
+ /* CPUID workaround for 0F33/0F34 CPU */
+ if (c->x86 == 0xF && c->x86_model == 0x3
+ && (c->x86_stepping == 0x3 || c->x86_stepping == 0x4)) {
+ /* Warn if MKTME and this workaround collide: */
+ WARN_ON_ONCE(keyid_bits);
+ bsp_addr_config.phys_addr_reduction_bits = 4;
+ }
}

#ifdef CONFIG_X86_32
_

2024-02-22 18:46:58

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 16/34] x86/cpu: Move physical address limit out of cpuinfo_x86


From: Dave Hansen <[email protected]>

There are no longer any direct references to cpuinfo_x86->x86_phys_bits.
The only remaining references are to 'boot_cpu_data' via the
x86_phys_bits() helper.

This means the farce that x86_phys_bits is per-cpu data can end. Remove
it from cpuinfo_x86 and add it to a new global data structure:
'x86_config'. (Better names welcome)

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 16 ++++++++++++++--
b/arch/x86/kernel/cpu/common.c | 12 ++++++------
b/arch/x86/kernel/setup.c | 1 +
3 files changed, 21 insertions(+), 8 deletions(-)

diff -puN arch/x86/include/asm/processor.h~no-cpu-data-phys_bits arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~no-cpu-data-phys_bits 2024-02-22 10:08:56.220757996 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:56.228758310 -0800
@@ -118,7 +118,6 @@ struct cpuinfo_x86 {
__u32 vmx_capability[NVMXINTS];
#endif
__u8 x86_virt_bits;
- __u8 x86_phys_bits;
/* CPUID returned core id bits: */
__u8 x86_coreid_bits;
/* Max extended CPUID function supported: */
@@ -176,6 +175,19 @@ struct x86_addr_config {
u8 phys_addr_reduction_bits;
};

+/*
+ * System-wide configuration that is shared by all processors.
+ *
+ * Built in early_cpu_init() on the boot CPU and never
+ * modified after that.
+ */
+struct x86_sys_config {
+ /* Physical address bits supported by all processors */
+ u8 phys_bits;
+};
+
+extern struct x86_sys_config x86_config;
+
#define X86_VENDOR_INTEL 0
#define X86_VENDOR_CYRIX 1
#define X86_VENDOR_AMD 2
@@ -783,7 +795,7 @@ static inline void weak_wrmsr_fence(void

static inline u8 x86_phys_bits(void)
{
- return boot_cpu_data.x86_phys_bits;
+ return x86_config.phys_bits;
}

static inline u8 x86_virt_bits(void)
diff -puN arch/x86/kernel/cpu/common.c~no-cpu-data-phys_bits arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~no-cpu-data-phys_bits 2024-02-22 10:08:56.224758153 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:56.228758310 -0800
@@ -1107,27 +1107,27 @@ void get_cpu_address_sizes(struct cpuinf
cpuid(0x80000008, &eax, &ebx, &ecx, &edx);

c->x86_virt_bits = (eax >> 8) & 0xff;
- c->x86_phys_bits = eax & 0xff;
+ x86_config.phys_bits = eax & 0xff;
} else {
if (IS_ENABLED(CONFIG_X86_64)) {
+ x86_config.phys_bits = 36;
c->x86_clflush_size = 64;
- c->x86_phys_bits = 36;
c->x86_virt_bits = 48;
} else {
+ x86_config.phys_bits = 32;
c->x86_clflush_size = 32;
c->x86_virt_bits = 32;
- c->x86_phys_bits = 32;

if (cpu_has(c, X86_FEATURE_PAE) ||
cpu_has(c, X86_FEATURE_PSE36))
- c->x86_phys_bits = 36;
+ x86_config.phys_bits = 36;
}
}
- c->x86_cache_bits = c->x86_phys_bits;
+ c->x86_cache_bits = x86_config.phys_bits;
c->x86_cache_alignment = x86_clflush_size();

/* Do this last to avoid affecting ->x86_cache_bits. */
- c->x86_phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
+ x86_config.phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
}

static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
diff -puN arch/x86/kernel/setup.c~no-cpu-data-phys_bits arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~no-cpu-data-phys_bits 2024-02-22 10:08:56.224758153 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:08:56.228758310 -0800
@@ -132,6 +132,7 @@ struct cpuinfo_x86 boot_cpu_data __read_
EXPORT_SYMBOL(boot_cpu_data);

struct x86_addr_config bsp_addr_config;
+struct x86_sys_config x86_config __read_mostly;

#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
_

2024-02-22 18:47:05

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 17/34] x86/cpu: Move virtual address limit out of cpuinfo_x86


From: Dave Hansen <[email protected]>

There are no longer any direct references to
cpuinfo_x86->x86_virt_bits. The only remaining references are to
'boot_cpu_data' via the x86_virt_bits() helper.

This means the farce that x86_virt_bits is per-cpu data can end.
Remove it from cpuinfo_x86 and add it to 'x86_config'.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 6 +++---
b/arch/x86/kernel/cpu/common.c | 8 +++-----
b/arch/x86/mm/maccess.c | 9 +++++----
3 files changed, 11 insertions(+), 12 deletions(-)

diff -puN arch/x86/include/asm/processor.h~no-cpu-data-virt_bits arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~no-cpu-data-virt_bits 2024-02-22 10:08:56.748778725 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:56.752778882 -0800
@@ -117,7 +117,6 @@ struct cpuinfo_x86 {
#ifdef CONFIG_X86_VMX_FEATURE_NAMES
__u32 vmx_capability[NVMXINTS];
#endif
- __u8 x86_virt_bits;
/* CPUID returned core id bits: */
__u8 x86_coreid_bits;
/* Max extended CPUID function supported: */
@@ -182,8 +181,9 @@ struct x86_addr_config {
* modified after that.
*/
struct x86_sys_config {
- /* Physical address bits supported by all processors */
+ /* Address bits supported by all processors */
u8 phys_bits;
+ u8 virt_bits;
};

extern struct x86_sys_config x86_config;
@@ -800,7 +800,7 @@ static inline u8 x86_phys_bits(void)

static inline u8 x86_virt_bits(void)
{
- return boot_cpu_data.x86_virt_bits;
+ return x86_config.virt_bits;
}

static inline u8 x86_clflush_size(void)
diff -puN arch/x86/kernel/cpu/common.c~no-cpu-data-virt_bits arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~no-cpu-data-virt_bits 2024-02-22 10:08:56.748778725 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:56.752778882 -0800
@@ -1106,17 +1106,17 @@ void get_cpu_address_sizes(struct cpuinf
if (vp_bits_from_cpuid) {
cpuid(0x80000008, &eax, &ebx, &ecx, &edx);

- c->x86_virt_bits = (eax >> 8) & 0xff;
+ x86_config.virt_bits = (eax >> 8) & 0xff;
x86_config.phys_bits = eax & 0xff;
} else {
if (IS_ENABLED(CONFIG_X86_64)) {
x86_config.phys_bits = 36;
+ x86_config.virt_bits = 48;
c->x86_clflush_size = 64;
- c->x86_virt_bits = 48;
} else {
x86_config.phys_bits = 32;
+ x86_config.virt_bits = 32;
c->x86_clflush_size = 32;
- c->x86_virt_bits = 32;

if (cpu_has(c, X86_FEATURE_PAE) ||
cpu_has(c, X86_FEATURE_PSE36))
@@ -1827,11 +1827,9 @@ static void identify_cpu(struct cpuinfo_
c->topo.l2c_id = BAD_APICID;
#ifdef CONFIG_X86_64
c->x86_clflush_size = 64;
- c->x86_virt_bits = 48;
#else
c->cpuid_level = -1; /* CPUID not detected */
c->x86_clflush_size = 32;
- c->x86_virt_bits = 32;
#endif
c->x86_cache_alignment = x86_clflush_size();
memset(&c->x86_capability, 0, sizeof(c->x86_capability));
diff -puN arch/x86/mm/maccess.c~no-cpu-data-virt_bits arch/x86/mm/maccess.c
--- a/arch/x86/mm/maccess.c~no-cpu-data-virt_bits 2024-02-22 10:08:56.752778882 -0800
+++ b/arch/x86/mm/maccess.c 2024-02-22 10:08:56.752778882 -0800
@@ -16,11 +16,12 @@ bool copy_from_kernel_nofault_allowed(co
return false;

/*
- * Allow everything during early boot before 'x86_virt_bits'
- * is initialized. Needed for instruction decoding in early
- * exception handlers.
+ * Allow everything during early boot before 'virt_bits' is
+ * initialized. Needed for instruction decoding in early
+ * exception handlers. Avoid the helper because it may do
+ * error checking for uninitialized values.
*/
- if (!x86_virt_bits())
+ if (!x86_config.virt_bits)
return true;

return __is_canonical_address(vaddr, x86_virt_bits());
_

2024-02-22 18:47:10

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 18/34] x86/cpu/centaur: Move cache alignment override to BSP init


From: Dave Hansen <[email protected]>

The 'x86_cache_alignment' is yet another system-wide value which is stored
per-cpu. Stop initializing it per-cpu on (some) Centaur CPUs and move it
to a new BSP init function.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/centaur.c | 15 +++++++++------
1 file changed, 9 insertions(+), 6 deletions(-)

diff -puN arch/x86/kernel/cpu/centaur.c~centaur-c_early_init arch/x86/kernel/cpu/centaur.c
--- a/arch/x86/kernel/cpu/centaur.c~centaur-c_early_init 2024-02-22 10:08:57.268799139 -0800
+++ b/arch/x86/kernel/cpu/centaur.c 2024-02-22 10:08:57.268799139 -0800
@@ -61,12 +61,8 @@ static void init_c3(struct cpuinfo_x86 *
if (c->x86_model >= 6 && c->x86_model < 9)
set_cpu_cap(c, X86_FEATURE_3DNOW);
#endif
- if (c->x86 == 0x6 && c->x86_model >= 0xf) {
- c->x86_cache_alignment = x86_clflush_size() * 2;
- set_cpu_cap(c, X86_FEATURE_REP_GOOD);
- }
-
- if (c->x86 >= 7)
+ if ((c->x86 == 0x6 && c->x86_model >= 0xf) ||
+ (c->x86 >= 7))
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
}

@@ -217,6 +213,12 @@ static void init_centaur(struct cpuinfo_
init_ia32_feat_ctl(c);
}

+static void bsp_init_centaur(struct cpuinfo_x86 *c)
+{
+ if (c->x86 == 0x6 && c->x86_model >= 0xf)
+ c->x86_cache_alignment = x86_clflush_size() * 2;
+}
+
#ifdef CONFIG_X86_32
static unsigned int
centaur_size_cache(struct cpuinfo_x86 *c, unsigned int size)
@@ -241,6 +243,7 @@ static const struct cpu_dev centaur_cpu_
.c_vendor = "Centaur",
.c_ident = { "CentaurHauls" },
.c_early_init = early_init_centaur,
+ .c_bsp_init = bsp_init_centaur,
.c_init = init_centaur,
#ifdef CONFIG_X86_32
.legacy_cache_size = centaur_size_cache,
_

2024-02-22 18:47:44

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 20/34] x86/cpu: Remove superfluous cache alignment assignments


From: Dave Hansen <[email protected]>

Now that the boot CPU's ->x86_cache_alignment is authoritative
and is consistently established, these other copies only
serve to clobber it needlessly. Remove them.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/common.c | 2 --
1 file changed, 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~dup-cache-alignment arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~dup-cache-alignment 2024-02-22 10:08:58.260838084 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:58.260838084 -0800
@@ -956,7 +956,6 @@ void cpu_detect(struct cpuinfo_x86 *c)
if (cap0 & (1<<19)) {
if (c == &boot_cpu_data)
c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;
- c->x86_cache_alignment = x86_clflush_size();
}
}
}
@@ -1834,7 +1833,6 @@ static void identify_cpu(struct cpuinfo_
c->cpuid_level = -1; /* CPUID not detected */
c->x86_clflush_size = 32;
#endif
- c->x86_cache_alignment = x86_clflush_size();
memset(&c->x86_capability, 0, sizeof(c->x86_capability));
#ifdef CONFIG_X86_VMX_FEATURE_NAMES
memset(&c->vmx_capability, 0, sizeof(c->vmx_capability));
_

2024-02-22 18:48:17

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 22/34] x86/cpu: Move CLFLUSH size into global config


From: Dave Hansen <[email protected]>

x86_clflush_size is maintained per-cpu despite being global configuration.

Move it to 'x86_config'.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 4 ++--
b/arch/x86/kernel/cpu/common.c | 2 +-
2 files changed, 3 insertions(+), 3 deletions(-)

diff -puN arch/x86/include/asm/processor.h~bsp-clflush_size arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~bsp-clflush_size 2024-02-22 10:08:59.208875301 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:59.212875458 -0800
@@ -148,7 +148,6 @@ struct cpuinfo_x86 {
u64 ppin;
/* cpuid returned max cores value: */
u16 x86_max_cores;
- u16 x86_clflush_size;
/* number of cores as seen by the OS: */
u16 booted_cores;
/* Index into per_cpu list: */
@@ -191,6 +190,7 @@ struct x86_sys_config {
/* Address bits supported by all processors */
u8 phys_bits;
u8 virt_bits;
+ u16 clflush_size;
};

extern struct x86_sys_config x86_config;
@@ -812,7 +812,7 @@ static inline u8 x86_virt_bits(void)

static inline u8 x86_clflush_size(void)
{
- return boot_cpu_data.x86_clflush_size;
+ return x86_config.clflush_size;
}

#endif /* _ASM_X86_PROCESSOR_H */
diff -puN arch/x86/kernel/cpu/common.c~bsp-clflush_size arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~bsp-clflush_size 2024-02-22 10:08:59.208875301 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:59.212875458 -0800
@@ -1136,7 +1136,7 @@ void get_cpu_address_sizes(struct cpuinf
x86_config.phys_bits = 36;
}
}
- c->x86_clflush_size = detect_clflush_size(c);
+ x86_config.clflush_size = detect_clflush_size(c);

c->x86_cache_bits = x86_config.phys_bits;

_

2024-02-22 18:48:55

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 24/34] x86/cpu: Establish 'min_cache_bits' configuration


From: Dave Hansen <[email protected]>

Continue moving towards a setup where code never tweaks 'boot_cpu_data'.

Code must establish its intent in 'x86_addr_config', and later
code will use that configuration information to establish the
system-wide config.

The L1TF mitigation wants to tweak x86_cache_bits. Let it do this, but
move the code away from bugs.c so that it can easily be called earlier
in boot.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 6 +++++
b/arch/x86/kernel/cpu/bugs.c | 41 -------------------------------------
b/arch/x86/kernel/cpu/common.c | 2 +
b/arch/x86/kernel/cpu/intel.c | 40 ++++++++++++++++++++++++++++++++++++
4 files changed, 48 insertions(+), 41 deletions(-)

diff -puN arch/x86/include/asm/processor.h~bsp-min_cache_bits arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~bsp-min_cache_bits 2024-02-22 10:09:00.220915031 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:09:00.224915188 -0800
@@ -177,6 +177,12 @@ struct x86_addr_config {
* will take place at a more coarse granularity.
*/
u8 cache_align_mult;
+
+ /*
+ * Specify a floor for the number of bits that the CPU
+ * caches comprehend. Used only for L1TF mitigation.
+ */
+ u8 min_cache_bits;
};

/*
diff -puN arch/x86/kernel/cpu/bugs.c~bsp-min_cache_bits arch/x86/kernel/cpu/bugs.c
--- a/arch/x86/kernel/cpu/bugs.c~bsp-min_cache_bits 2024-02-22 10:09:00.220915031 -0800
+++ b/arch/x86/kernel/cpu/bugs.c 2024-02-22 10:09:00.224915188 -0800
@@ -2237,45 +2237,6 @@ EXPORT_SYMBOL_GPL(l1tf_mitigation);
enum vmx_l1d_flush_state l1tf_vmx_mitigation = VMENTER_L1D_FLUSH_AUTO;
EXPORT_SYMBOL_GPL(l1tf_vmx_mitigation);

-/*
- * These CPUs all support 44bits physical address space internally in the
- * cache but CPUID can report a smaller number of physical address bits.
- *
- * The L1TF mitigation uses the top most address bit for the inversion of
- * non present PTEs. When the installed memory reaches into the top most
- * address bit due to memory holes, which has been observed on machines
- * which report 36bits physical address bits and have 32G RAM installed,
- * then the mitigation range check in l1tf_select_mitigation() triggers.
- * This is a false positive because the mitigation is still possible due to
- * the fact that the cache uses 44bit internally. Use the cache bits
- * instead of the reported physical bits and adjust them on the affected
- * machines to 44bit if the reported bits are less than 44.
- */
-static void override_cache_bits(struct cpuinfo_x86 *c)
-{
- if (c->x86 != 6)
- return;
-
- switch (c->x86_model) {
- case INTEL_FAM6_NEHALEM:
- case INTEL_FAM6_WESTMERE:
- case INTEL_FAM6_SANDYBRIDGE:
- case INTEL_FAM6_IVYBRIDGE:
- case INTEL_FAM6_HASWELL:
- case INTEL_FAM6_HASWELL_L:
- case INTEL_FAM6_HASWELL_G:
- case INTEL_FAM6_BROADWELL:
- case INTEL_FAM6_BROADWELL_G:
- case INTEL_FAM6_SKYLAKE_L:
- case INTEL_FAM6_SKYLAKE:
- case INTEL_FAM6_KABYLAKE_L:
- case INTEL_FAM6_KABYLAKE:
- if (c->x86_cache_bits < 44)
- c->x86_cache_bits = 44;
- break;
- }
-}
-
static void __init l1tf_select_mitigation(void)
{
u64 half_pa;
@@ -2288,8 +2249,6 @@ static void __init l1tf_select_mitigatio
else if (cpu_mitigations_auto_nosmt())
l1tf_mitigation = L1TF_MITIGATION_FLUSH_NOSMT;

- override_cache_bits(&boot_cpu_data);
-
switch (l1tf_mitigation) {
case L1TF_MITIGATION_OFF:
case L1TF_MITIGATION_FLUSH_NOWARN:
diff -puN arch/x86/kernel/cpu/common.c~bsp-min_cache_bits arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~bsp-min_cache_bits 2024-02-22 10:09:00.220915031 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:09:00.228915345 -0800
@@ -1139,6 +1139,8 @@ void get_cpu_address_sizes(struct cpuinf
x86_config.clflush_size = detect_clflush_size(c);

c->x86_cache_bits = x86_config.phys_bits;
+ if (c->x86_cache_bits < bsp_addr_config.min_cache_bits)
+ c->x86_cache_bits = bsp_addr_config.min_cache_bits;

x86_config.cache_alignment = x86_clflush_size();
if (bsp_addr_config.cache_align_mult)
diff -puN arch/x86/kernel/cpu/intel.c~bsp-min_cache_bits arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~bsp-min_cache_bits 2024-02-22 10:09:00.224915188 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:09:00.228915345 -0800
@@ -395,6 +395,44 @@ detect_keyid_bits:
return keyid_bits;
}

+/*
+ * These CPUs all support 44bits physical address space internally in the
+ * cache but CPUID can report a smaller number of physical address bits.
+ *
+ * The L1TF mitigation uses the top most address bit for the inversion of
+ * non present PTEs. When the installed memory reaches into the top most
+ * address bit due to memory holes, which has been observed on machines
+ * which report 36bits physical address bits and have 32G RAM installed,
+ * then the mitigation range check in l1tf_select_mitigation() triggers.
+ * This is a false positive because the mitigation is still possible due to
+ * the fact that the cache uses 44bit internally. Use the cache bits
+ * instead of the reported physical bits and adjust them on the affected
+ * machines to 44bit if the reported bits are less than 44.
+ */
+static void set_min_cache_bits(struct cpuinfo_x86 *c)
+{
+ if (c->x86 != 6)
+ return;
+
+ switch (c->x86_model) {
+ case INTEL_FAM6_NEHALEM:
+ case INTEL_FAM6_WESTMERE:
+ case INTEL_FAM6_SANDYBRIDGE:
+ case INTEL_FAM6_IVYBRIDGE:
+ case INTEL_FAM6_HASWELL:
+ case INTEL_FAM6_HASWELL_L:
+ case INTEL_FAM6_HASWELL_G:
+ case INTEL_FAM6_BROADWELL:
+ case INTEL_FAM6_BROADWELL_G:
+ case INTEL_FAM6_SKYLAKE_L:
+ case INTEL_FAM6_SKYLAKE:
+ case INTEL_FAM6_KABYLAKE_L:
+ case INTEL_FAM6_KABYLAKE:
+ bsp_addr_config.min_cache_bits = 44;
+ break;
+ }
+}
+
static void bsp_init_intel(struct cpuinfo_x86 *c)
{
int keyid_bits = 0;
@@ -418,6 +456,8 @@ static void bsp_init_intel(struct cpuinf
/* Netburst reports 64 bytes clflush size, but does IO in 128 bytes */
if (c->x86 == 15)
bsp_addr_config.cache_align_mult = 2;
+
+ set_min_cache_bits(c);
}

#ifdef CONFIG_X86_32
_

2024-02-22 18:49:07

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 25/34] x86/cpu: Move cache bits to global config


From: Dave Hansen <[email protected]>

x86_cache_bits is established and stored per-cpu despite being system-wide.
Move it from 'struct cpuinfo_x86' to 'x86_config' and give it a helper.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 10 +++++++---
b/arch/x86/kernel/cpu/common.c | 8 ++++----
b/arch/x86/kvm/mmu/spte.c | 6 +++---
3 files changed, 14 insertions(+), 10 deletions(-)

diff -puN arch/x86/include/asm/processor.h~config-cache_bits arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~config-cache_bits 2024-02-22 10:09:00.768936544 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:09:00.772936701 -0800
@@ -154,8 +154,6 @@ struct cpuinfo_x86 {
/* Is SMT active on this core? */
bool smt_active;
u32 microcode;
- /* Address space bits used by the cache internally */
- u8 x86_cache_bits;
unsigned initialized : 1;
} __randomize_layout;

@@ -195,6 +193,7 @@ struct x86_sys_config {
/* Address bits supported by all processors */
u8 phys_bits;
u8 virt_bits;
+ u8 cache_bits;
u16 clflush_size;
int cache_alignment; /* in bytes */
};
@@ -241,7 +240,7 @@ extern void cpu_detect(struct cpuinfo_x8

static inline unsigned long long l1tf_pfn_limit(void)
{
- return BIT_ULL(boot_cpu_data.x86_cache_bits - 1 - PAGE_SHIFT);
+ return BIT_ULL(x86_config.cache_bits - 1 - PAGE_SHIFT);
}

extern void early_cpu_init(void);
@@ -816,6 +815,11 @@ static inline u8 x86_virt_bits(void)
return x86_config.virt_bits;
}

+static inline u8 x86_cache_bits(void)
+{
+ return x86_config.cache_bits;
+}
+
static inline u8 x86_clflush_size(void)
{
return x86_config.clflush_size;
diff -puN arch/x86/kernel/cpu/common.c~config-cache_bits arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~config-cache_bits 2024-02-22 10:09:00.768936544 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:09:00.772936701 -0800
@@ -1138,15 +1138,15 @@ void get_cpu_address_sizes(struct cpuinf
}
x86_config.clflush_size = detect_clflush_size(c);

- c->x86_cache_bits = x86_config.phys_bits;
- if (c->x86_cache_bits < bsp_addr_config.min_cache_bits)
- c->x86_cache_bits = bsp_addr_config.min_cache_bits;
+ x86_config.cache_bits = x86_config.phys_bits;
+ if (x86_config.cache_bits < bsp_addr_config.min_cache_bits)
+ x86_config.cache_bits = bsp_addr_config.min_cache_bits;

x86_config.cache_alignment = x86_clflush_size();
if (bsp_addr_config.cache_align_mult)
x86_config.cache_alignment *= bsp_addr_config.cache_align_mult;

- /* Do this last to avoid affecting ->x86_cache_bits. */
+ /* Do this last to avoid affecting '.cache_bits'. */
x86_config.phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
}

diff -puN arch/x86/kvm/mmu/spte.c~config-cache_bits arch/x86/kvm/mmu/spte.c
--- a/arch/x86/kvm/mmu/spte.c~config-cache_bits 2024-02-22 10:09:00.772936701 -0800
+++ b/arch/x86/kvm/mmu/spte.c 2024-02-22 10:09:00.772936701 -0800
@@ -470,12 +470,12 @@ void kvm_mmu_reset_all_pte_masks(void)
shadow_nonpresent_or_rsvd_mask = 0;
low_phys_bits = x86_phys_bits();
if (boot_cpu_has_bug(X86_BUG_L1TF) &&
- !WARN_ON_ONCE(boot_cpu_data.x86_cache_bits >=
+ !WARN_ON_ONCE(x86_cache_bits() >=
52 - SHADOW_NONPRESENT_OR_RSVD_MASK_LEN)) {
- low_phys_bits = boot_cpu_data.x86_cache_bits
+ low_phys_bits = x86_cache_bits()
- SHADOW_NONPRESENT_OR_RSVD_MASK_LEN;
shadow_nonpresent_or_rsvd_mask =
- rsvd_bits(low_phys_bits, boot_cpu_data.x86_cache_bits - 1);
+ rsvd_bits(low_phys_bits, x86_cache_bits() - 1);
}

shadow_nonpresent_or_rsvd_lower_gfn_mask =
_

2024-02-22 18:49:52

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 29/34] x86/xen: Remove extra get_cpu_address_sizes() call


From: Dave Hansen <[email protected]>

Thanks to the previous fixes, Xen PV should no longer be reading
'x86_config' at all. Even if something was missed, it should now get
sane values *and* be able to survive long enough to get a warning out
later in boot.

Remove the belt-and-suspenders call to get_cpu_address_sizes().

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/xen/enlighten_pv.c | 3 ---
1 file changed, 3 deletions(-)

diff -puN arch/x86/xen/enlighten_pv.c~xen-remove-early-addr_size-fun arch/x86/xen/enlighten_pv.c
--- a/arch/x86/xen/enlighten_pv.c~xen-remove-early-addr_size-fun 2024-02-22 10:09:02.753014433 -0800
+++ b/arch/x86/xen/enlighten_pv.c 2024-02-22 10:09:02.757014590 -0800
@@ -1386,9 +1386,6 @@ asmlinkage __visible void __init xen_sta
*/
xen_setup_gdt(0);

- /* Determine virtual and physical address sizes */
- get_cpu_address_sizes(&boot_cpu_data);
-
/* Let's presume PV guests always boot on vCPU with id 0. */
per_cpu(xen_vcpu_id, 0) = 0;

_

2024-02-22 18:50:49

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 31/34] x86/cpu/intel: Mark BSP init function as __init


From: Dave Hansen <[email protected]>

The boot CPU is obviously only initialized at boot, never during normal
system runtime, so its init code can be __init. This prevents a section
mismatch warning once 'bsp_addr_config' gets marked __initdata.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/intel.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/kernel/cpu/intel.c~bsp_init_intel-__init arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~bsp_init_intel-__init 2024-02-22 10:09:03.721052435 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:09:03.725052592 -0800
@@ -433,7 +433,7 @@ static void set_min_cache_bits(struct cp
}
}

-static void bsp_init_intel(struct cpuinfo_x86 *c)
+static void __init bsp_init_intel(struct cpuinfo_x86 *c)
{
int keyid_bits = 0;

_

2024-02-22 18:51:04

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 34/34] x86/cpu: Mark new boot CPU and config structures appropriately


From: Dave Hansen <[email protected]>

Just to recap the rules:

1. 'bsp_addr_config' is only used in early boot by the BSP.
2. 'x86_config' is only written once by the BSP and can be
used read-only during normal system operation

Mark 'bsp_addr_config' so it goes away after boot *and* non-init
references to it can be detected and warned about.

Mark 'x86_config' as __ro_after_init so it does not turn into the same
mess that 'boot_cpu_data' is.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/setup.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff -puN arch/x86/kernel/setup.c~mark-new-structs arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~mark-new-structs 2024-02-22 10:09:05.173109438 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:09:05.177109595 -0800
@@ -131,8 +131,8 @@ struct ist_info ist_info;
struct cpuinfo_x86 boot_cpu_data __read_mostly;
EXPORT_SYMBOL(boot_cpu_data);

-struct x86_addr_config bsp_addr_config;
-struct x86_sys_config x86_config __read_mostly;
+struct x86_addr_config bsp_addr_config __initdata;
+struct x86_sys_config x86_config __ro_after_init;

#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
_

2024-02-22 19:11:31

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 06/34] x86/boot: Use consistent value for iomem_resource.end


From: Dave Hansen <[email protected]>

The 'struct cpuinfo_x86' values (including 'boot_cpu_info') get
written and overwritten rather randomly. They are not stable
during early boot and readers end up getting a random mishmash
of hard-coded defaults or CPUID-provided values based on when
the values are read.

iomem_resource.end is one of these users. Because of where it
is called, it ended up seeing .x86_phys_bits==MAX_PHYSMEM_BITS
which is (mostly) a compile-time default. But
iomem_resource.end is never updated if the runtime CPUID
x86_phys_bits is lower.

Set iomem_resource.end to the compile-time value explicitly.
It does not need to be precise as this is mostly to ensure
that insane values can't be reserved in 'iomem_resource'.

Make MAX_PHYSMEM_BITS available outside of sparsemem
configurations by removing the #ifdef CONFIG_SPARSEMEM in the
header.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/sparsemem.h | 3 ---
b/arch/x86/kernel/setup.c | 10 +++++++++-
2 files changed, 9 insertions(+), 4 deletions(-)

diff -puN arch/x86/kernel/setup.c~iomem_resource_end arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~iomem_resource_end 2024-02-22 10:08:51.048554948 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:21:04.485531464 -0800
@@ -51,6 +51,7 @@
#include <asm/pci-direct.h>
#include <asm/prom.h>
#include <asm/proto.h>
+#include <asm/sparsemem.h>
#include <asm/thermal.h>
#include <asm/unwind.h>
#include <asm/vsyscall.h>
@@ -813,7 +814,14 @@ void __init setup_arch(char **cmdline_p)
*/
early_reserve_memory();

- iomem_resource.end = (1ULL << x86_phys_bits()) - 1;
+ /*
+ * This was too big before. It ended up getting MAX_PHYSMEM_BITS
+ * even if .x86_phys_bits was eventually lowered below that.
+ * But that was evidently harmless, so leave it too big, but
+ * set it explicitly to MAX_PHYSMEM_BITS instead of taking a
+ * trip through .x86_phys_bits.
+ */
+ iomem_resource.end = (1ULL << MAX_PHYSMEM_BITS) - 1;
e820__memory_setup();
parse_setup_data();

diff -puN arch/x86/include/asm/sparsemem.h~iomem_resource_end arch/x86/include/asm/sparsemem.h
--- a/arch/x86/include/asm/sparsemem.h~iomem_resource_end 2024-02-22 10:19:56.842831828 -0800
+++ b/arch/x86/include/asm/sparsemem.h 2024-02-22 10:20:21.207804806 -0800
@@ -4,7 +4,6 @@

#include <linux/types.h>

-#ifdef CONFIG_SPARSEMEM
/*
* generic non-linear memory support:
*
@@ -29,8 +28,6 @@
# define MAX_PHYSMEM_BITS (pgtable_l5_enabled() ? 52 : 46)
#endif

-#endif /* CONFIG_SPARSEMEM */
-
#ifndef __ASSEMBLY__
#ifdef CONFIG_NUMA_KEEP_MEMINFO
extern int phys_to_target_node(phys_addr_t start);
_

2024-02-22 19:42:07

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 28/34] x86/cpu: Return sane defaults for early x86_config reads


From: Dave Hansen <[email protected]>

*If* an unexpected user tries to read 'x86_config' before it is
C_FINALIZED and valid, it will likely get 0's or other garbage.
Those values often cause the user to crash before they can get
a nice message out to the console.

Provide some mostly random but unlikely-to-crash-the-caller
values.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 32 ++++++++++++++++++++++++++------
1 file changed, 26 insertions(+), 6 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x86_config-defaults arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_config-defaults 2024-02-22 10:09:02.280995903 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:09:02.280995903 -0800
@@ -813,37 +813,57 @@ static inline void weak_wrmsr_fence(void
alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
}

-static inline void read_x86_config(void)
+static inline int read_x86_config(void)
{
if (x86_config.conf_state == C_FINALIZED)
- return;
+ return 0;

/* Only record the first one: */
if (!x86_config.early_reader)
x86_config.early_reader = __builtin_return_address(0);
+
+ return -EINVAL;
}

+/*
+ * These _should_ not be called until x86_config is C_FINALIZED.
+ * In the case that they are, two things will happen:
+ * 1. The first caller will get the call site recorded in
+ * 'early_reader' which will trigger a warning later in
+ * boot.
+ * 2. A moderately sane hard-coded default value will be provided.
+ * The entire intent of this value is to let things limp along
+ * until the warning can spit out.
+ */
static inline u8 x86_phys_bits(void)
{
- read_x86_config();
+ if (read_x86_config())
+ return 52;
+
return x86_config.phys_bits;
}

static inline u8 x86_virt_bits(void)
{
- read_x86_config();
+ if (read_x86_config())
+ return IS_ENABLED(CONFIG_X86_64) ? 57 : 32;
+
return x86_config.virt_bits;
}

static inline u8 x86_cache_bits(void)
{
- read_x86_config();
+ if (read_x86_config())
+ return 52;
+
return x86_config.cache_bits;
}

static inline u8 x86_clflush_size(void)
{
- read_x86_config();
+ if (read_x86_config())
+ return 64;
+
return x86_config.clflush_size;
}

_

2024-02-22 19:55:58

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 10/34] x86/cpu/amd: Use new "address configuration" infrastructure


From: Dave Hansen <[email protected]>

The AMD memory encryption code is currently one of the 'boot_cpu_data'
field twiddlers. This has led to all kinds of interesting ordering
bugs at boot.

Move it away from random fiddling and over to 'bsp_addr_config'.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/amd.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff -puN arch/x86/kernel/cpu/amd.c~amd-phys_addr_reduction_bits arch/x86/kernel/cpu/amd.c
--- a/arch/x86/kernel/cpu/amd.c~amd-phys_addr_reduction_bits 2024-02-22 10:08:53.340644930 -0800
+++ b/arch/x86/kernel/cpu/amd.c 2024-02-22 10:08:53.344645087 -0800
@@ -622,7 +622,8 @@ static void early_detect_mem_encrypt(str
* will be a value above 32-bits this is still done for
* CONFIG_X86_32 so that accurate values are reported.
*/
- c->x86_phys_bits -= (cpuid_ebx(0x8000001f) >> 6) & 0x3f;
+ bsp_addr_config.phys_addr_reduction_bits =
+ (cpuid_ebx(0x8000001f) >> 6) & 0x3f;

if (IS_ENABLED(CONFIG_X86_32))
goto clear_all;
_

2024-02-22 20:44:30

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 23/34] x86/cpu: Move cache alignment configuration to global struct


From: Dave Hansen <[email protected]>

It is a tale as old as time: x86_cache_alignment is established and
stored per-cpu despite being system-wide. Move it to the global
configuration structure.

The other values have received _new_ wrappers. But this one already
had cache_line_size(). Just use that wrapper.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 4 ++--
b/arch/x86/kernel/cpu/common.c | 4 ++--
b/arch/x86/kernel/cpu/proc.c | 2 +-
3 files changed, 5 insertions(+), 5 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x86_config-x86_cache_alignment arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_config-x86_cache_alignment 2024-02-22 10:08:59.700894617 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:59.708894931 -0800
@@ -137,7 +137,6 @@ struct cpuinfo_x86 {
struct cpuinfo_topology topo;
/* in KB - valid for CPUS which support this call: */
unsigned int x86_cache_size;
- int x86_cache_alignment; /* In bytes */
/* Cache QoS architectural values, valid only on the BSP: */
int x86_cache_max_rmid; /* max index */
int x86_cache_occ_scale; /* scale to bytes */
@@ -191,6 +190,7 @@ struct x86_sys_config {
u8 phys_bits;
u8 virt_bits;
u16 clflush_size;
+ int cache_alignment; /* in bytes */
};

extern struct x86_sys_config x86_config;
@@ -229,7 +229,7 @@ DECLARE_PER_CPU_READ_MOSTLY(struct cpuin

extern const struct seq_operations cpuinfo_op;

-#define cache_line_size() (boot_cpu_data.x86_cache_alignment)
+#define cache_line_size() (x86_config.cache_alignment)

extern void cpu_detect(struct cpuinfo_x86 *c);

diff -puN arch/x86/kernel/cpu/common.c~x86_config-x86_cache_alignment arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x86_config-x86_cache_alignment 2024-02-22 10:08:59.704894774 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:59.708894931 -0800
@@ -1140,9 +1140,9 @@ void get_cpu_address_sizes(struct cpuinf

c->x86_cache_bits = x86_config.phys_bits;

- c->x86_cache_alignment = x86_clflush_size();
+ x86_config.cache_alignment = x86_clflush_size();
if (bsp_addr_config.cache_align_mult)
- c->x86_cache_alignment *= bsp_addr_config.cache_align_mult;
+ x86_config.cache_alignment *= bsp_addr_config.cache_align_mult;

/* Do this last to avoid affecting ->x86_cache_bits. */
x86_config.phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
diff -puN arch/x86/kernel/cpu/proc.c~x86_config-x86_cache_alignment arch/x86/kernel/cpu/proc.c
--- a/arch/x86/kernel/cpu/proc.c~x86_config-x86_cache_alignment 2024-02-22 10:08:59.704894774 -0800
+++ b/arch/x86/kernel/cpu/proc.c 2024-02-22 10:08:59.708894931 -0800
@@ -131,7 +131,7 @@ static int show_cpuinfo(struct seq_file
seq_printf(m, "TLB size\t: %d 4K pages\n", c->x86_tlbsize);
#endif
seq_printf(m, "clflush size\t: %u\n", x86_clflush_size());
- seq_printf(m, "cache_alignment\t: %d\n", c->x86_cache_alignment);
+ seq_printf(m, "cache_alignment\t: %d\n", x86_config.cache_alignment);
seq_printf(m, "address sizes\t: %u bits physical, %u bits virtual\n",
x86_phys_bits(), x86_virt_bits());

_

2024-02-22 20:44:33

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 15/34] x86/cpu: Remove default x86_phys_bits assignment


From: Dave Hansen <[email protected]>

This was presumably needed for *some* reason at some point. It is
useless now. Remove it.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/setup.c | 1 -
1 file changed, 1 deletion(-)

diff -puN arch/x86/kernel/setup.c~no-early-max_phys_bits-assign arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~no-early-max_phys_bits-assign 2024-02-22 10:08:55.736738995 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:08:55.740739152 -0800
@@ -752,7 +752,6 @@ void __init setup_arch(char **cmdline_p)
__flush_tlb_all();
#else
printk(KERN_INFO "Command line: %s\n", boot_command_line);
- boot_cpu_data.x86_phys_bits = MAX_PHYSMEM_BITS;
#endif

/*
_

2024-02-22 20:45:37

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 09/34] x86/cpu: Introduce address configuration structure


From: Dave Hansen <[email protected]>

Right now, 'boot_cpu_data' is established in a very ad hoc way. Random
bits of the initialization code write random things to it, either
blowing away state or tweaking it as they see fit. It's madness.

Add some more structure to the process.

Introduce an "address configuration" structure just for the boot CPU.
This will be used to establish system-wide address space configuration.
It is written while bringing up the boot CPU and then read *ONCE* to
establish the configuration.

Also introduce the first field: phys_addr_reduction_bits. This field
will be used on platforms with memory encryption hardware that reduces
the number of usable physical address bits below what the hardware
enumerates.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 14 ++++++++++++++
b/arch/x86/kernel/cpu/common.c | 3 +++
b/arch/x86/kernel/setup.c | 2 ++
3 files changed, 19 insertions(+)

diff -puN arch/x86/include/asm/processor.h~bsp-addr-info arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~bsp-addr-info 2024-02-22 10:08:52.824624673 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:52.828624830 -0800
@@ -163,6 +163,19 @@ struct cpuinfo_x86 {
unsigned initialized : 1;
} __randomize_layout;

+/*
+ * Must be written by the time ->c_bsp_init() completes.
+ * Consumed in get_cpu_address_sizes().
+ */
+struct x86_addr_config {
+ /*
+ * How many bits of the expected or enumerated physical
+ * address space are unavailable? Typically set on
+ * platforms that use memory encryption.
+ */
+ u8 phys_addr_reduction_bits;
+};
+
#define X86_VENDOR_INTEL 0
#define X86_VENDOR_CYRIX 1
#define X86_VENDOR_AMD 2
@@ -182,6 +195,7 @@ struct cpuinfo_x86 {
*/
extern struct cpuinfo_x86 boot_cpu_data;
extern struct cpuinfo_x86 new_cpu_data;
+extern struct x86_addr_config bsp_addr_config;

extern __u32 cpu_caps_cleared[NCAPINTS + NBUGINTS];
extern __u32 cpu_caps_set[NCAPINTS + NBUGINTS];
diff -puN arch/x86/kernel/cpu/common.c~bsp-addr-info arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~bsp-addr-info 2024-02-22 10:08:52.824624673 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:52.828624830 -0800
@@ -1125,6 +1125,9 @@ void get_cpu_address_sizes(struct cpuinf
}
c->x86_cache_bits = c->x86_phys_bits;
c->x86_cache_alignment = x86_clflush_size();
+
+ /* Do this last to avoid affecting ->x86_cache_bits. */
+ c->x86_phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
}

static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
diff -puN arch/x86/kernel/setup.c~bsp-addr-info arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~bsp-addr-info 2024-02-22 10:08:52.824624673 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:08:52.828624830 -0800
@@ -131,6 +131,8 @@ struct ist_info ist_info;
struct cpuinfo_x86 boot_cpu_data __read_mostly;
EXPORT_SYMBOL(boot_cpu_data);

+struct x86_addr_config bsp_addr_config;
+
#if !defined(CONFIG_X86_PAE) || defined(CONFIG_X86_64)
__visible unsigned long mmu_cr4_features __ro_after_init;
#else
_

2024-02-22 20:49:05

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 12/34] x86/cpu/intel: Actually use "address configuration" infrastructure for MKTME


From: Dave Hansen <[email protected]>

Now that the TME detection is only called once at boot, stop twiddling
'boot_cpu_data' directly and move over to 'bsp_addr_config'.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/intel.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)

diff -puN arch/x86/kernel/cpu/intel.c~intel-addr-reduce arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~intel-addr-reduce 2024-02-22 10:08:54.296682462 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:54.296682462 -0800
@@ -401,11 +401,8 @@ detect_keyid_bits:
mktme_status = MKTME_ENABLED;
}

- /*
- * KeyID bits effectively lower the number of physical address
- * bits. Update cpuinfo_x86::x86_phys_bits accordingly.
- */
- c->x86_phys_bits -= keyid_bits;
+ /* KeyID bits effectively lower the number of physical address bits */
+ bsp_addr_config.phys_addr_reduction_bits = keyid_bits;
}

static void bsp_init_intel(struct cpuinfo_x86 *c)
_

2024-02-22 20:49:14

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 21/34] x86/cpu: Consolidate CLFLUSH size setting.


From: Dave Hansen <[email protected]>

Right now, cpu_detect() establishes ->x86_clflush_size on each CPU
that supports CLFLUSH. As you might have guessed, that per-cpu
value is no longer used.

Move the setting to get_cpu_address_sizes() which is only called
on the boot CPU. Consolidate all of the ->x86_clflush_size logic
into a single helper.

This removes the (cap0 & (1<<19)) aka X86_FEATURE_CLFLUSH check before
reading the CLFLUSH size out of CPUID[1]. Those bits are presumably
either 0 or have actual data about CLFLUSH in them.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/common.c | 35 ++++++++++++++++++++++++-----------
1 file changed, 24 insertions(+), 11 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~later-clflush_size-detect arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~later-clflush_size-detect 2024-02-22 10:08:58.728856457 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:58.732856614 -0800
@@ -935,6 +935,27 @@ static void get_cpu_vendor(struct cpuinf
this_cpu = &default_cpu;
}

+static u16 detect_clflush_size(struct cpuinfo_x86 *c)
+{
+ u16 clflush_size = 0;
+
+ if (c->cpuid_level >= 0x00000001) {
+ u32 eax, ebx, ecx, edx;
+
+ cpuid(0x00000001, &eax, &ebx, &ecx, &edx);
+
+ clflush_size = ((ebx >> 8) & 0xff) * 8;
+ }
+
+ if (clflush_size)
+ return clflush_size;
+
+ /* Return some mostly sane defaults: */
+ if (IS_ENABLED(CONFIG_X86_64))
+ return 64;
+ return 32;
+}
+
void cpu_detect(struct cpuinfo_x86 *c)
{
/* Get vendor name */
@@ -952,11 +973,6 @@ void cpu_detect(struct cpuinfo_x86 *c)
c->x86 = x86_family(tfms);
c->x86_model = x86_model(tfms);
c->x86_stepping = x86_stepping(tfms);
-
- if (cap0 & (1<<19)) {
- if (c == &boot_cpu_data)
- c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;
- }
}
}

@@ -1111,17 +1127,17 @@ void get_cpu_address_sizes(struct cpuinf
if (IS_ENABLED(CONFIG_X86_64)) {
x86_config.phys_bits = 36;
x86_config.virt_bits = 48;
- c->x86_clflush_size = 64;
} else {
x86_config.phys_bits = 32;
x86_config.virt_bits = 32;
- c->x86_clflush_size = 32;

if (cpu_has(c, X86_FEATURE_PAE) ||
cpu_has(c, X86_FEATURE_PSE36))
x86_config.phys_bits = 36;
}
}
+ c->x86_clflush_size = detect_clflush_size(c);
+
c->x86_cache_bits = x86_config.phys_bits;

c->x86_cache_alignment = x86_clflush_size();
@@ -1827,11 +1843,8 @@ static void identify_cpu(struct cpuinfo_
c->topo.cu_id = 0xff;
c->topo.llc_id = BAD_APICID;
c->topo.l2c_id = BAD_APICID;
-#ifdef CONFIG_X86_64
- c->x86_clflush_size = 64;
-#else
+#ifndef CONFIG_X86_64
c->cpuid_level = -1; /* CPUID not detected */
- c->x86_clflush_size = 32;
#endif
memset(&c->x86_capability, 0, sizeof(c->x86_capability));
#ifdef CONFIG_X86_VMX_FEATURE_NAMES
_

2024-02-22 20:58:08

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 32/34] x86/cpu/amd: Move memory encryption detection


From: Dave Hansen <[email protected]>

Right now, the AMD memory encryption detection code is run on every CPU
despite the fact that it operates on 'bsp_addr_config' and calls
setup_clear_cpu_cap(), which only affects 'boot_cpu_data'.

Move it to bsp_init_amd() where it belongs and change its name to match.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/amd.c | 110 ++++++++++++++++++++++----------------------
1 file changed, 55 insertions(+), 55 deletions(-)

diff -puN arch/x86/kernel/cpu/amd.c~early_init_amd-__init arch/x86/kernel/cpu/amd.c
--- a/arch/x86/kernel/cpu/amd.c~early_init_amd-__init 2024-02-22 10:09:04.201071279 -0800
+++ b/arch/x86/kernel/cpu/amd.c 2024-02-22 10:09:04.201071279 -0800
@@ -468,8 +468,62 @@ static void early_init_amd_mc(struct cpu
#endif
}

-static void bsp_init_amd(struct cpuinfo_x86 *c)
+static void bsp_detect_mem_encrypt(struct cpuinfo_x86 *c)
{
+ u64 msr;
+
+ /*
+ * BIOS support is required for SME and SEV.
+ * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
+ * the SME physical address space reduction value.
+ * If BIOS has not enabled SME then don't advertise the
+ * SME feature (set in scattered.c).
+ * If the kernel has not enabled SME via any means then
+ * don't advertise the SME feature.
+ * For SEV: If BIOS has not enabled SEV then don't advertise the
+ * SEV and SEV_ES feature (set in scattered.c).
+ *
+ * In all cases, since support for SME and SEV requires long mode,
+ * don't advertise the feature under CONFIG_X86_32.
+ */
+ if (cpu_has(c, X86_FEATURE_SME) || cpu_has(c, X86_FEATURE_SEV)) {
+ /* Check if memory encryption is enabled */
+ rdmsrl(MSR_AMD64_SYSCFG, msr);
+ if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
+ goto clear_all;
+
+ /*
+ * Always adjust physical address bits. Even though this
+ * will be a value above 32-bits this is still done for
+ * CONFIG_X86_32 so that accurate values are reported.
+ */
+ bsp_addr_config.phys_addr_reduction_bits =
+ (cpuid_ebx(0x8000001f) >> 6) & 0x3f;
+
+ if (IS_ENABLED(CONFIG_X86_32))
+ goto clear_all;
+
+ if (!sme_me_mask)
+ setup_clear_cpu_cap(X86_FEATURE_SME);
+
+ rdmsrl(MSR_K7_HWCR, msr);
+ if (!(msr & MSR_K7_HWCR_SMMLOCK))
+ goto clear_sev;
+
+ return;
+
+clear_all:
+ setup_clear_cpu_cap(X86_FEATURE_SME);
+clear_sev:
+ setup_clear_cpu_cap(X86_FEATURE_SEV);
+ setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
+ }
+}
+
+static void __init bsp_init_amd(struct cpuinfo_x86 *c)
+{
+ bsp_detect_mem_encrypt(c);
+
if (cpu_has(c, X86_FEATURE_CONSTANT_TSC)) {

if (c->x86 > 0x10 ||
@@ -593,58 +647,6 @@ warn:
WARN_ONCE(1, "Family 0x%x, model: 0x%x??\n", c->x86, c->x86_model);
}

-static void early_detect_mem_encrypt(struct cpuinfo_x86 *c)
-{
- u64 msr;
-
- /*
- * BIOS support is required for SME and SEV.
- * For SME: If BIOS has enabled SME then adjust x86_phys_bits by
- * the SME physical address space reduction value.
- * If BIOS has not enabled SME then don't advertise the
- * SME feature (set in scattered.c).
- * If the kernel has not enabled SME via any means then
- * don't advertise the SME feature.
- * For SEV: If BIOS has not enabled SEV then don't advertise the
- * SEV and SEV_ES feature (set in scattered.c).
- *
- * In all cases, since support for SME and SEV requires long mode,
- * don't advertise the feature under CONFIG_X86_32.
- */
- if (cpu_has(c, X86_FEATURE_SME) || cpu_has(c, X86_FEATURE_SEV)) {
- /* Check if memory encryption is enabled */
- rdmsrl(MSR_AMD64_SYSCFG, msr);
- if (!(msr & MSR_AMD64_SYSCFG_MEM_ENCRYPT))
- goto clear_all;
-
- /*
- * Always adjust physical address bits. Even though this
- * will be a value above 32-bits this is still done for
- * CONFIG_X86_32 so that accurate values are reported.
- */
- bsp_addr_config.phys_addr_reduction_bits =
- (cpuid_ebx(0x8000001f) >> 6) & 0x3f;
-
- if (IS_ENABLED(CONFIG_X86_32))
- goto clear_all;
-
- if (!sme_me_mask)
- setup_clear_cpu_cap(X86_FEATURE_SME);
-
- rdmsrl(MSR_K7_HWCR, msr);
- if (!(msr & MSR_K7_HWCR_SMMLOCK))
- goto clear_sev;
-
- return;
-
-clear_all:
- setup_clear_cpu_cap(X86_FEATURE_SME);
-clear_sev:
- setup_clear_cpu_cap(X86_FEATURE_SEV);
- setup_clear_cpu_cap(X86_FEATURE_SEV_ES);
- }
-}
-
static void early_init_amd(struct cpuinfo_x86 *c)
{
u64 value;
@@ -715,8 +717,6 @@ static void early_init_amd(struct cpuinf
if (c->x86 == 0x16 && c->x86_model <= 0xf)
msr_set_bit(MSR_AMD64_LS_CFG, 15);

- early_detect_mem_encrypt(c);
-
/* Re-enable TopologyExtensions if switched off by BIOS */
if (c->x86 == 0x15 &&
(c->x86_model >= 0x10 && c->x86_model <= 0x6f) &&
_

2024-02-22 20:59:07

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 26/34] x86/cpu: Zap superfluous get_cpu_address_sizes()


From: Dave Hansen <[email protected]>

All of the 'x86_config' state is established by the boot CPU long
before generic_identify(). There is no need to call
get_cpu_address_sizes() again.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/common.c | 2 --
1 file changed, 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~zap-extra-get_cpu_address_sizes arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~zap-extra-get_cpu_address_sizes 2024-02-22 10:09:01.292957116 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:09:01.296957273 -0800
@@ -1770,8 +1770,6 @@ static void generic_identify(struct cpui

get_cpu_cap(c);

- get_cpu_address_sizes(c);
-
if (c->cpuid_level >= 0x00000001) {
c->topo.initial_apicid = (cpuid_ebx(1) >> 24) & 0xFF;
#ifdef CONFIG_X86_32
_

2024-02-22 21:00:01

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 30/34] x86/cpu/centaur: Mark BSP init function as __init


From: Dave Hansen <[email protected]>

The boot CPU is obviously only initialized at boot, never during normal
system runtime, so its init code can be __init. This prevents a section
mismatch warning once 'bsp_addr_config' gets marked __initdata.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/centaur.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff -puN arch/x86/kernel/cpu/centaur.c~bsp_init_centaur-__init arch/x86/kernel/cpu/centaur.c
--- a/arch/x86/kernel/cpu/centaur.c~bsp_init_centaur-__init 2024-02-22 10:09:03.237033434 -0800
+++ b/arch/x86/kernel/cpu/centaur.c 2024-02-22 10:09:03.237033434 -0800
@@ -213,7 +213,7 @@ static void init_centaur(struct cpuinfo_
init_ia32_feat_ctl(c);
}

-static void bsp_init_centaur(struct cpuinfo_x86 *c)
+static void __init bsp_init_centaur(struct cpuinfo_x86 *c)
{
if (c->x86 == 0x6 && c->x86_model >= 0xf)
bsp_addr_config.cache_align_mult = 2;
_

2024-02-22 21:20:02

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 27/34] x86/cpu: Enforce read-only x86_config state (lightly)


From: Dave Hansen <[email protected]>

Part of the reason that this all turned into such a mess is that there
were no rules around when 'struct cpuinfo_x86' was written. It was
(and is) just a free-for-all.

Establish that 'x86_config' has two phases in its lifetime: a
C_INITIALIZING phase where it can be written and a later C_FINALIZED
phase where it can only be read. It is simple to audit that this
state transition happens exactly where the comment says it should.

Check that the config is C_FINALIZED in each of the wrappers that
read an 'x86_config' value. If something reads too early, stash
some information about the caller so that a warning can be spit
out later.

This goofy stash-then-warn construct is necessary here because any
hapless readers are likely to be in a spot where they cannot easily
WARN() themselves, like the early Xen PV boot that's caused so many
problems.

This also moves one x86_clflush_size() reference over to a direct
x86_config.clflush_size read because otherwise it would trip
the !C_FINALIZED check.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 22 ++++++++++++++++++++++
b/arch/x86/kernel/cpu/common.c | 5 ++++-
b/arch/x86/kernel/setup.c | 7 +++++++
3 files changed, 33 insertions(+), 1 deletion(-)

diff -puN arch/x86/include/asm/processor.h~x86_config-finalize arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_config-finalize 2024-02-22 10:09:01.772975960 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:09:01.780976274 -0800
@@ -183,6 +183,11 @@ struct x86_addr_config {
u8 min_cache_bits;
};

+enum x86_sys_config_state {
+ C_INITIALIZING,
+ C_FINALIZED
+};
+
/*
* System-wide configuration that is shared by all processors.
*
@@ -190,6 +195,9 @@ struct x86_addr_config {
* modified after that.
*/
struct x86_sys_config {
+ enum x86_sys_config_state conf_state;
+ void *early_reader;
+
/* Address bits supported by all processors */
u8 phys_bits;
u8 virt_bits;
@@ -805,23 +813,37 @@ static inline void weak_wrmsr_fence(void
alternative("mfence; lfence", "", ALT_NOT(X86_FEATURE_APIC_MSRS_FENCE));
}

+static inline void read_x86_config(void)
+{
+ if (x86_config.conf_state == C_FINALIZED)
+ return;
+
+ /* Only record the first one: */
+ if (!x86_config.early_reader)
+ x86_config.early_reader = __builtin_return_address(0);
+}
+
static inline u8 x86_phys_bits(void)
{
+ read_x86_config();
return x86_config.phys_bits;
}

static inline u8 x86_virt_bits(void)
{
+ read_x86_config();
return x86_config.virt_bits;
}

static inline u8 x86_cache_bits(void)
{
+ read_x86_config();
return x86_config.cache_bits;
}

static inline u8 x86_clflush_size(void)
{
+ read_x86_config();
return x86_config.clflush_size;
}

diff -puN arch/x86/kernel/cpu/common.c~x86_config-finalize arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x86_config-finalize 2024-02-22 10:09:01.776976117 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:09:01.780976274 -0800
@@ -1114,6 +1114,9 @@ void get_cpu_address_sizes(struct cpuinf
u32 eax, ebx, ecx, edx;
bool vp_bits_from_cpuid = true;

+ WARN_ON_ONCE(x86_config.conf_state > C_INITIALIZING);
+ x86_config.conf_state = C_INITIALIZING;
+
if (!cpu_has(c, X86_FEATURE_CPUID) ||
(c->extended_cpuid_level < 0x80000008))
vp_bits_from_cpuid = false;
@@ -1142,7 +1145,7 @@ void get_cpu_address_sizes(struct cpuinf
if (x86_config.cache_bits < bsp_addr_config.min_cache_bits)
x86_config.cache_bits = bsp_addr_config.min_cache_bits;

- x86_config.cache_alignment = x86_clflush_size();
+ x86_config.cache_alignment = x86_config.clflush_size;
if (bsp_addr_config.cache_align_mult)
x86_config.cache_alignment *= bsp_addr_config.cache_align_mult;

diff -puN arch/x86/kernel/setup.c~x86_config-finalize arch/x86/kernel/setup.c
--- a/arch/x86/kernel/setup.c~x86_config-finalize 2024-02-22 10:09:01.776976117 -0800
+++ b/arch/x86/kernel/setup.c 2024-02-22 10:09:01.780976274 -0800
@@ -762,7 +762,14 @@ void __init setup_arch(char **cmdline_p)
olpc_ofw_detect();

idt_setup_early_traps();
+
early_cpu_init();
+
+ /* Ensure no readers snuck in before the config was finished: */
+ WARN_ONCE(x86_config.early_reader, "x86_config.early_reader: %pS\n",
+ x86_config.early_reader);
+ x86_config.conf_state = C_FINALIZED;
+
jump_label_init();
static_call_init();
early_ioremap_init();
_

2024-02-22 21:40:34

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 19/34] x86/cpu: Introduce cache alignment multiplier


From: Dave Hansen <[email protected]>

The kernel currently tracks two different but similar-sounding values:

x86_clflush_size
and
x86_cache_alignment

x86_clflush_size is literally the size of the cacheline which is also
the architectural granularity at which CLFLUSH operates.

But some weirdo CPUs like the Pentium 4 do some operations across
two cachelines. Their CLFLUSH still works at 'x86_clflush_size'
but they need some operations aligned to that double cacheline to
work well, thus 'x86_cache_alignment'.

Introduce and use a 'bsp_addr_config' field for these systems.

Note that the Intel Family 15 code actually did this in two different
sites, one for 32-bit and one for 64-bit. Unify them.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/include/asm/processor.h | 7 +++++++
b/arch/x86/kernel/cpu/centaur.c | 2 +-
b/arch/x86/kernel/cpu/common.c | 3 +++
b/arch/x86/kernel/cpu/intel.c | 10 ++++------
4 files changed, 15 insertions(+), 7 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x86_cache_alignment_mult arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_cache_alignment_mult 2024-02-22 10:08:57.732817356 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:57.740817669 -0800
@@ -172,6 +172,13 @@ struct x86_addr_config {
* platforms that use memory encryption.
*/
u8 phys_addr_reduction_bits;
+
+ /*
+ * "x86_clflush_size" is the size of an actual cacheline.
+ * Allow systems to specify a multiplier where alignment
+ * will take place at a more coarse granularity.
+ */
+ u8 cache_align_mult;
};

/*
diff -puN arch/x86/kernel/cpu/centaur.c~x86_cache_alignment_mult arch/x86/kernel/cpu/centaur.c
--- a/arch/x86/kernel/cpu/centaur.c~x86_cache_alignment_mult 2024-02-22 10:08:57.732817356 -0800
+++ b/arch/x86/kernel/cpu/centaur.c 2024-02-22 10:08:57.740817669 -0800
@@ -216,7 +216,7 @@ static void init_centaur(struct cpuinfo_
static void bsp_init_centaur(struct cpuinfo_x86 *c)
{
if (c->x86 == 0x6 && c->x86_model >= 0xf)
- c->x86_cache_alignment = x86_clflush_size() * 2;
+ bsp_addr_config.cache_align_mult = 2;
}

#ifdef CONFIG_X86_32
diff -puN arch/x86/kernel/cpu/common.c~x86_cache_alignment_mult arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x86_cache_alignment_mult 2024-02-22 10:08:57.736817513 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:57.740817669 -0800
@@ -1124,7 +1124,10 @@ void get_cpu_address_sizes(struct cpuinf
}
}
c->x86_cache_bits = x86_config.phys_bits;
+
c->x86_cache_alignment = x86_clflush_size();
+ if (bsp_addr_config.cache_align_mult)
+ c->x86_cache_alignment *= bsp_addr_config.cache_align_mult;

/* Do this last to avoid affecting ->x86_cache_bits. */
x86_config.phys_bits -= bsp_addr_config.phys_addr_reduction_bits;
diff -puN arch/x86/kernel/cpu/intel.c~x86_cache_alignment_mult arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~x86_cache_alignment_mult 2024-02-22 10:08:57.736817513 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:57.740817669 -0800
@@ -236,10 +236,6 @@ static void early_init_intel(struct cpui

#ifdef CONFIG_X86_64
set_cpu_cap(c, X86_FEATURE_SYSENTER32);
-#else
- /* Netburst reports 64 bytes clflush size, but does IO in 128 bytes */
- if (c->x86 == 15 && c->x86_cache_alignment == 64)
- c->x86_cache_alignment = 128;
#endif

/*
@@ -418,6 +414,10 @@ static void bsp_init_intel(struct cpuinf
WARN_ON_ONCE(keyid_bits);
bsp_addr_config.phys_addr_reduction_bits = 4;
}
+
+ /* Netburst reports 64 bytes clflush size, but does IO in 128 bytes */
+ if (c->x86 == 15)
+ bsp_addr_config.cache_align_mult = 2;
}

#ifdef CONFIG_X86_32
@@ -659,8 +659,6 @@ static void init_intel(struct cpuinfo_x8
set_cpu_bug(c, X86_BUG_MONITOR);

#ifdef CONFIG_X86_64
- if (c->x86 == 15)
- c->x86_cache_alignment = x86_clflush_size() * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
#else
_

2024-02-22 21:51:32

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 08/34] x86/cpu: Add CLFLUSH size helper


From: Dave Hansen <[email protected]>

This is probably starting to look familiar: The size of a cacheline is
fundamental system-wide information. Keeping it per-cpu is just silly.

Introduce a system-wide helper for looking up the cacheline size and
use it.

This does one slightly odd-looking thing: it stops setting
c->x86_clflush_size on all but the boot CPU. This is functionally OK
because all readers of the secondary CPU values also go away. It also
makes it explicit that 'boot_cpu_data' holds the one true system-wide
value.

Signed-off-by: Dave Hansen <[email protected]>
Cc: Mikulas Patocka <[email protected]>
Cc: Mike Snitzer <[email protected]>
Cc: Daniel Vetter <[email protected]>

---

b/arch/x86/include/asm/processor.h | 5 +++++
b/arch/x86/kernel/cpu/centaur.c | 2 +-
b/arch/x86/kernel/cpu/common.c | 9 +++++----
b/arch/x86/kernel/cpu/intel.c | 2 +-
b/arch/x86/kernel/cpu/proc.c | 2 +-
b/arch/x86/lib/usercopy_64.c | 7 +++----
b/arch/x86/mm/pat/set_memory.c | 2 +-
b/arch/x86/pci/common.c | 2 +-
b/drivers/gpu/drm/drm_cache.c | 4 ++--
b/drivers/gpu/drm/i915/i915_cmd_parser.c | 3 +--
b/drivers/gpu/drm/i915/i915_gem.c | 2 +-
b/drivers/md/dm-writecache.c | 2 +-
12 files changed, 23 insertions(+), 19 deletions(-)

diff -puN arch/x86/include/asm/processor.h~x86_clflush_size-func arch/x86/include/asm/processor.h
--- a/arch/x86/include/asm/processor.h~x86_clflush_size-func 2024-02-22 10:08:52.112596720 -0800
+++ b/arch/x86/include/asm/processor.h 2024-02-22 10:08:52.132597505 -0800
@@ -777,4 +777,9 @@ static inline u8 x86_virt_bits(void)
return boot_cpu_data.x86_virt_bits;
}

+static inline u8 x86_clflush_size(void)
+{
+ return boot_cpu_data.x86_clflush_size;
+}
+
#endif /* _ASM_X86_PROCESSOR_H */
diff -puN arch/x86/kernel/cpu/centaur.c~x86_clflush_size-func arch/x86/kernel/cpu/centaur.c
--- a/arch/x86/kernel/cpu/centaur.c~x86_clflush_size-func 2024-02-22 10:08:52.112596720 -0800
+++ b/arch/x86/kernel/cpu/centaur.c 2024-02-22 10:08:52.132597505 -0800
@@ -62,7 +62,7 @@ static void init_c3(struct cpuinfo_x86 *
set_cpu_cap(c, X86_FEATURE_3DNOW);
#endif
if (c->x86 == 0x6 && c->x86_model >= 0xf) {
- c->x86_cache_alignment = c->x86_clflush_size * 2;
+ c->x86_cache_alignment = x86_clflush_size() * 2;
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
}

diff -puN arch/x86/kernel/cpu/common.c~x86_clflush_size-func arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~x86_clflush_size-func 2024-02-22 10:08:52.112596720 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:52.132597505 -0800
@@ -954,8 +954,9 @@ void cpu_detect(struct cpuinfo_x86 *c)
c->x86_stepping = x86_stepping(tfms);

if (cap0 & (1<<19)) {
- c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;
- c->x86_cache_alignment = c->x86_clflush_size;
+ if (c == &boot_cpu_data)
+ c->x86_clflush_size = ((misc >> 8) & 0xff) * 8;
+ c->x86_cache_alignment = x86_clflush_size();
}
}
}
@@ -1123,7 +1124,7 @@ void get_cpu_address_sizes(struct cpuinf
}
}
c->x86_cache_bits = c->x86_phys_bits;
- c->x86_cache_alignment = c->x86_clflush_size;
+ c->x86_cache_alignment = x86_clflush_size();
}

static void identify_cpu_without_cpuid(struct cpuinfo_x86 *c)
@@ -1831,7 +1832,7 @@ static void identify_cpu(struct cpuinfo_
c->x86_phys_bits = 32;
c->x86_virt_bits = 32;
#endif
- c->x86_cache_alignment = c->x86_clflush_size;
+ c->x86_cache_alignment = x86_clflush_size();
memset(&c->x86_capability, 0, sizeof(c->x86_capability));
#ifdef CONFIG_X86_VMX_FEATURE_NAMES
memset(&c->vmx_capability, 0, sizeof(c->vmx_capability));
diff -puN arch/x86/kernel/cpu/intel.c~x86_clflush_size-func arch/x86/kernel/cpu/intel.c
--- a/arch/x86/kernel/cpu/intel.c~x86_clflush_size-func 2024-02-22 10:08:52.116596877 -0800
+++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:52.132597505 -0800
@@ -653,7 +653,7 @@ static void init_intel(struct cpuinfo_x8

#ifdef CONFIG_X86_64
if (c->x86 == 15)
- c->x86_cache_alignment = c->x86_clflush_size * 2;
+ c->x86_cache_alignment = x86_clflush_size() * 2;
if (c->x86 == 6)
set_cpu_cap(c, X86_FEATURE_REP_GOOD);
#else
diff -puN arch/x86/kernel/cpu/proc.c~x86_clflush_size-func arch/x86/kernel/cpu/proc.c
--- a/arch/x86/kernel/cpu/proc.c~x86_clflush_size-func 2024-02-22 10:08:52.116596877 -0800
+++ b/arch/x86/kernel/cpu/proc.c 2024-02-22 10:08:52.132597505 -0800
@@ -130,7 +130,7 @@ static int show_cpuinfo(struct seq_file
if (c->x86_tlbsize > 0)
seq_printf(m, "TLB size\t: %d 4K pages\n", c->x86_tlbsize);
#endif
- seq_printf(m, "clflush size\t: %u\n", c->x86_clflush_size);
+ seq_printf(m, "clflush size\t: %u\n", x86_clflush_size());
seq_printf(m, "cache_alignment\t: %d\n", c->x86_cache_alignment);
seq_printf(m, "address sizes\t: %u bits physical, %u bits virtual\n",
x86_phys_bits(), x86_virt_bits());
diff -puN arch/x86/lib/usercopy_64.c~x86_clflush_size-func arch/x86/lib/usercopy_64.c
--- a/arch/x86/lib/usercopy_64.c~x86_clflush_size-func 2024-02-22 10:08:52.120597034 -0800
+++ b/arch/x86/lib/usercopy_64.c 2024-02-22 10:08:52.132597505 -0800
@@ -27,13 +27,12 @@
*/
static void clean_cache_range(void *addr, size_t size)
{
- u16 x86_clflush_size = boot_cpu_data.x86_clflush_size;
- unsigned long clflush_mask = x86_clflush_size - 1;
+ unsigned long clflush_mask = x86_clflush_size() - 1;
void *vend = addr + size;
void *p;

for (p = (void *)((unsigned long)addr & ~clflush_mask);
- p < vend; p += x86_clflush_size)
+ p < vend; p += x86_clflush_size())
clwb(p);
}

@@ -65,7 +64,7 @@ long __copy_user_flushcache(void *dst, c
clean_cache_range(dst, size);
} else {
if (!IS_ALIGNED(dest, 8)) {
- dest = ALIGN(dest, boot_cpu_data.x86_clflush_size);
+ dest = ALIGN(dest, x86_clflush_size());
clean_cache_range(dst, 1);
}

diff -puN arch/x86/mm/pat/set_memory.c~x86_clflush_size-func arch/x86/mm/pat/set_memory.c
--- a/arch/x86/mm/pat/set_memory.c~x86_clflush_size-func 2024-02-22 10:08:52.120597034 -0800
+++ b/arch/x86/mm/pat/set_memory.c 2024-02-22 10:08:52.132597505 -0800
@@ -314,7 +314,7 @@ static unsigned long __cpa_addr(struct c

static void clflush_cache_range_opt(void *vaddr, unsigned int size)
{
- const unsigned long clflush_size = boot_cpu_data.x86_clflush_size;
+ const unsigned long clflush_size = x86_clflush_size();
void *p = (void *)((unsigned long)vaddr & ~(clflush_size - 1));
void *vend = vaddr + size;

diff -puN arch/x86/pci/common.c~x86_clflush_size-func arch/x86/pci/common.c
--- a/arch/x86/pci/common.c~x86_clflush_size-func 2024-02-22 10:08:52.120597034 -0800
+++ b/arch/x86/pci/common.c 2024-02-22 10:08:52.132597505 -0800
@@ -480,7 +480,7 @@ void pcibios_scan_root(int busnum)

void __init pcibios_set_cache_line_size(void)
{
- pci_dfl_cache_line_size = boot_cpu_data.x86_clflush_size >> 2;
+ pci_dfl_cache_line_size = x86_clflush_size() >> 2;
printk(KERN_DEBUG "PCI: pci_cache_line_size set to %d bytes\n",
pci_dfl_cache_line_size << 2);
}
diff -puN drivers/gpu/drm/drm_cache.c~x86_clflush_size-func drivers/gpu/drm/drm_cache.c
--- a/drivers/gpu/drm/drm_cache.c~x86_clflush_size-func 2024-02-22 10:08:52.124597191 -0800
+++ b/drivers/gpu/drm/drm_cache.c 2024-02-22 10:08:52.132597505 -0800
@@ -52,7 +52,7 @@ drm_clflush_page(struct page *page)
{
uint8_t *page_virtual;
unsigned int i;
- const int size = boot_cpu_data.x86_clflush_size;
+ const int size = x86_clflush_size();

if (unlikely(page == NULL))
return;
@@ -160,7 +160,7 @@ drm_clflush_virt_range(void *addr, unsig
{
#if defined(CONFIG_X86)
if (static_cpu_has(X86_FEATURE_CLFLUSH)) {
- const int size = boot_cpu_data.x86_clflush_size;
+ const int size = x86_clflush_size();
void *end = addr + length;

addr = (void *)(((unsigned long)addr) & -size);
diff -puN drivers/gpu/drm/i915/i915_cmd_parser.c~x86_clflush_size-func drivers/gpu/drm/i915/i915_cmd_parser.c
--- a/drivers/gpu/drm/i915/i915_cmd_parser.c~x86_clflush_size-func 2024-02-22 10:08:52.124597191 -0800
+++ b/drivers/gpu/drm/i915/i915_cmd_parser.c 2024-02-22 10:08:52.132597505 -0800
@@ -1203,8 +1203,7 @@ static u32 *copy_batch(struct drm_i915_g
*/
remain = length;
if (dst_needs_clflush & CLFLUSH_BEFORE)
- remain = round_up(remain,
- boot_cpu_data.x86_clflush_size);
+ remain = round_up(remain, x86_clflush_size());

ptr = dst;
x = offset_in_page(offset);
diff -puN drivers/gpu/drm/i915/i915_gem.c~x86_clflush_size-func drivers/gpu/drm/i915/i915_gem.c
--- a/drivers/gpu/drm/i915/i915_gem.c~x86_clflush_size-func 2024-02-22 10:08:52.124597191 -0800
+++ b/drivers/gpu/drm/i915/i915_gem.c 2024-02-22 10:08:52.132597505 -0800
@@ -696,7 +696,7 @@ i915_gem_shmem_pwrite(struct drm_i915_ge
*/
partial_cacheline_write = 0;
if (needs_clflush & CLFLUSH_BEFORE)
- partial_cacheline_write = boot_cpu_data.x86_clflush_size - 1;
+ partial_cacheline_write = x86_clflush_size() - 1;

user_data = u64_to_user_ptr(args->data_ptr);
remain = args->size;
diff -puN drivers/md/dm-writecache.c~x86_clflush_size-func drivers/md/dm-writecache.c
--- a/drivers/md/dm-writecache.c~x86_clflush_size-func 2024-02-22 10:08:52.128597348 -0800
+++ b/drivers/md/dm-writecache.c 2024-02-22 10:08:52.132597505 -0800
@@ -1229,7 +1229,7 @@ static void memcpy_flushcache_optimized(
*/
#ifdef CONFIG_X86
if (static_cpu_has(X86_FEATURE_CLFLUSHOPT) &&
- likely(boot_cpu_data.x86_clflush_size == 64) &&
+ likely(x86_clflush_size() == 64) &&
likely(size >= 768)) {
do {
memcpy((void *)dest, (void *)source, 64);
_

2024-02-22 22:26:35

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 33/34] x86/cpu: Make get_cpu_address_sizes() static and __init


From: Dave Hansen <[email protected]>

Now that all the other users are gone and the final user is in the same file
and __init, get_cpu_address_sizes() can be static and __init.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/common.c | 2 +-
b/arch/x86/kernel/cpu/cpu.h | 1 -
2 files changed, 1 insertion(+), 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~get_cpu_address_sizes-__init arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~get_cpu_address_sizes-__init 2024-02-22 10:09:04.677089966 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:09:04.681090123 -0800
@@ -1109,7 +1109,7 @@ void get_cpu_cap(struct cpuinfo_x86 *c)
apply_forced_caps(c);
}

-void get_cpu_address_sizes(struct cpuinfo_x86 *c)
+static void __init get_cpu_address_sizes(struct cpuinfo_x86 *c)
{
u32 eax, ebx, ecx, edx;
bool vp_bits_from_cpuid = true;
diff -puN arch/x86/kernel/cpu/cpu.h~get_cpu_address_sizes-__init arch/x86/kernel/cpu/cpu.h
--- a/arch/x86/kernel/cpu/cpu.h~get_cpu_address_sizes-__init 2024-02-22 10:09:04.677089966 -0800
+++ b/arch/x86/kernel/cpu/cpu.h 2024-02-22 10:09:04.681090123 -0800
@@ -64,7 +64,6 @@ static inline void tsx_ap_init(void) { }
extern void init_spectral_chicken(struct cpuinfo_x86 *c);

extern void get_cpu_cap(struct cpuinfo_x86 *c);
-extern void get_cpu_address_sizes(struct cpuinfo_x86 *c);
extern void cpu_detect_cache_sizes(struct cpuinfo_x86 *c);
extern void init_scattered_cpuid_features(struct cpuinfo_x86 *c);
extern void init_intel_cacheinfo(struct cpuinfo_x86 *c);
_

2024-02-22 22:45:04

by Dave Hansen

[permalink] [raw]
Subject: [RFC][PATCH 14/34] x86/cpu: Remove default physical address space settings


From: Dave Hansen <[email protected]>

At this point, the secondary CPUs' c->x86_phys_bits should be unused. The boot
CPU values were also established long before reaching here. This can only
serve to sow chaos. Remove the twiddling.

Signed-off-by: Dave Hansen <[email protected]>
---

b/arch/x86/kernel/cpu/common.c | 2 --
1 file changed, 2 deletions(-)

diff -puN arch/x86/kernel/cpu/common.c~no-default-bit-setting-phys arch/x86/kernel/cpu/common.c
--- a/arch/x86/kernel/cpu/common.c~no-default-bit-setting-phys 2024-02-22 10:08:55.256720151 -0800
+++ b/arch/x86/kernel/cpu/common.c 2024-02-22 10:08:55.260720308 -0800
@@ -1827,12 +1827,10 @@ static void identify_cpu(struct cpuinfo_
c->topo.l2c_id = BAD_APICID;
#ifdef CONFIG_X86_64
c->x86_clflush_size = 64;
- c->x86_phys_bits = 36;
c->x86_virt_bits = 48;
#else
c->cpuid_level = -1; /* CPUID not detected */
c->x86_clflush_size = 32;
- c->x86_phys_bits = 32;
c->x86_virt_bits = 32;
#endif
c->x86_cache_alignment = x86_clflush_size();
_

2024-02-23 11:33:23

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure

On Thu, Feb 22, 2024 at 10:39:41AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> Intel also does memory encryption and also fiddles with the physical
> address bits. This is currently called for *each* CPU, but practically
> only done on the boot CPU because of 'mktme_status'.
>
> Move it from the "each CPU" ->c_init() function to ->c_bsp_init() where
> the whole thing only gets called once ever. This also necessitates moving
> detect_tme() and its entourage around in the file.

The state machine around mktme_state doesn't make sense if we only call it
on boot CPU, so detect_tme() can be drastically simplified. I can do it on
top of the patchset.

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-02-23 11:42:55

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCH 12/34] x86/cpu/intel: Actually use "address configuration" infrastructure for MKTME

On Thu, Feb 22, 2024 at 10:39:42AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> Now that the TME detection is only called once at boot, stop twiddling
> 'boot_cpu_data' directly and move over to 'bsp_addr_config'.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> b/arch/x86/kernel/cpu/intel.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff -puN arch/x86/kernel/cpu/intel.c~intel-addr-reduce arch/x86/kernel/cpu/intel.c
> --- a/arch/x86/kernel/cpu/intel.c~intel-addr-reduce 2024-02-22 10:08:54.296682462 -0800
> +++ b/arch/x86/kernel/cpu/intel.c 2024-02-22 10:08:54.296682462 -0800
> @@ -401,11 +401,8 @@ detect_keyid_bits:
> mktme_status = MKTME_ENABLED;
> }
>
> - /*
> - * KeyID bits effectively lower the number of physical address
> - * bits. Update cpuinfo_x86::x86_phys_bits accordingly.
> - */
> - c->x86_phys_bits -= keyid_bits;
> + /* KeyID bits effectively lower the number of physical address bits */
> + bsp_addr_config.phys_addr_reduction_bits = keyid_bits;

Do we expect reduction_bits to stack? Like, can multiple features steal
physical bits? Maybe use "+= keyid_bits" here?

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-02-23 16:20:05

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 12/34] x86/cpu/intel: Actually use "address configuration" infrastructure for MKTME

On 2/23/24 03:41, Kirill A. Shutemov wrote:
>> - /*
>> - * KeyID bits effectively lower the number of physical address
>> - * bits. Update cpuinfo_x86::x86_phys_bits accordingly.
>> - */
>> - c->x86_phys_bits -= keyid_bits;
>> + /* KeyID bits effectively lower the number of physical address bits */
>> + bsp_addr_config.phys_addr_reduction_bits = keyid_bits;
> Do we expect reduction_bits to stack? Like can multiple features steal
> physical bits? Make use "+= keyid_bits" here?

Good question.

IMNHO, the idea that different "users" of these fields can be oblivious
to each other is the reason that this has gotten so bad.

I thought about interfering or stacking a bit when I was putting this
together. It's one of the reasons I chose to add the specific
'phys_addr_reduction_bits' field _instead_ of continuing to just munge
'phys_addr_bits'.

I want 'bsp_addr_config' to represent the *inputs* that eventually end
up in x86_config, and have the inputs distilled down to the output in
one (or very few) places.

There are thankfully very few users of this: Intel and AMD memory
encryption and one Intel CPU erratum for very old CPUs. So they can't
stack in practice.

If we ever need something to stack, I'd prefer that we add two fields,
maybe:

bsp_addr_config.enc_reduction_bits

and

bsp_addr_config.maxphyaddr_erratum_override

or something, then have any collision be resolved when 'bsp_addr_config'
is consulted, in *ONE* central place in the code.
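
For illustration only, here is a completely untested sketch of what that
one central spot could look like, using the two hypothetical field names
above (neither exists in this series, and the combination policy is made
up purely to show the shape):

static void __init resolve_phys_addr_reduction(void)
{
        u8 phys_bits = x86_config.phys_bits;

        /* Memory encryption (SME/MKTME) steals bits off the top: */
        phys_bits -= bsp_addr_config.enc_reduction_bits;

        /* An erratum override, if set, caps the final value: */
        if (bsp_addr_config.maxphyaddr_erratum_override)
                phys_bits = min(phys_bits,
                                bsp_addr_config.maxphyaddr_erratum_override);

        x86_config.phys_bits = phys_bits;
}

It would be called exactly once from get_cpu_address_sizes(), so any
interaction between the two inputs is resolved in a single, obvious spot
instead of being scattered through the vendor code.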

2024-02-23 16:53:58

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure

On 2/23/24 03:33, Kirill A. Shutemov wrote:
> On Thu, Feb 22, 2024 at 10:39:41AM -0800, Dave Hansen wrote:
>> From: Dave Hansen <[email protected]>
>>
>> Intel also does memory encryption and also fiddles with the physical
>> address bits. This is currently called for *each* CPU, but practically
>> only done on the boot CPU because of 'mktme_status'.
>>
>> Move it from the "each CPU" ->c_init() function to ->c_bsp_init() where
>> the whole thing only gets called once ever. This also necessitates moving
>> detect_tme() and its entourage around in the file.
> The state machine around mktme_state doesn't make sense if we only call it
> on boot CPU, so detect_tme() can be drastically simplified. I can do it on
> top of the patchset.

That would be great. Looking at it again, the (tme_activate !=
tme_activate_cpu0) block is total cruft now. It probably just needs to
get moved to secondary CPU startup.
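
If we did want to keep the cross-CPU sanity check, a rough, untested
sketch of what "moved to secondary CPU startup" could look like (the
function and variable names are made up; none of this is in the series):

/* Written once by the boot CPU when it runs detect_tme(): */
static u64 tme_activate_bsp __ro_after_init;

/* Called from the secondary CPU init path: */
static void check_tme_consistency(struct cpuinfo_x86 *c)
{
        u64 tme_activate;

        if (!cpu_has(c, X86_FEATURE_TME))
                return;

        rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
        if (tme_activate != tme_activate_bsp)
                pr_err_once("x86/tme: configuration is inconsistent between CPUs\n");
}

Whether the check is worth keeping at all is a separate question.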


2024-02-26 12:15:08

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure

On Fri, Feb 23, 2024 at 08:22:16AM -0800, Dave Hansen wrote:
> On 2/23/24 03:33, Kirill A. Shutemov wrote:
> > On Thu, Feb 22, 2024 at 10:39:41AM -0800, Dave Hansen wrote:
> >> From: Dave Hansen <[email protected]>
> >>
> >> Intel also does memory encryption and also fiddles with the physical
> >> address bits. This is currently called for *each* CPU, but practically
> >> only done on the boot CPU because of 'mktme_status'.
> >>
> >> Move it from the "each CPU" ->c_init() function to ->c_bsp_init() where
> >> the whole thing only gets called once ever. This also necessitates moving
> >> detect_tme() and its entourage around in the file.
> > The state machine around mktme_state doesn't make sense if we only call it
> > on boot CPU, so detect_tme() can be drastically simplified. I can do it on
> > top of the patchset.
>
> That would be great. Looking at it again, the (tme_activate !=
> tme_activate_cpu0) block is total cruft now. It probably just needs to
> get moved to secondary CPU startup.

I have never seen the check be useful. I think it can just be dropped.

The patch below makes detect_tme() only enumerate TME and MKTME. And
report number of keyid bits. Kernel doesn't care about anything else.

Any comments?

From 1080535093d21f025d46fd610de5ad788591f4b5 Mon Sep 17 00:00:00 2001
From: "Kirill A. Shutemov" <[email protected]>
Date: Mon, 26 Feb 2024 14:01:01 +0200
Subject: [PATCH] x86/cpu/intel: Simplify detect_tme()

The detect_tme() function is now only called by the boot CPU. The logic
for cross-checking TME configuration between CPUs is no longer used. It
has never identified a real problem and can be safely removed.

The kernel does not use MKTME and is not concerned with MKTME policy or
encryption algorithms. There is no need to check them.

Signed-off-by: Kirill A. Shutemov <[email protected]>
---
arch/x86/kernel/cpu/intel.c | 44 ++-----------------------------------
1 file changed, 2 insertions(+), 42 deletions(-)

diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 4192aa4576f4..60918b49344c 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -329,55 +329,20 @@ static void early_init_intel(struct cpuinfo_x86 *c)
#define TME_ACTIVATE_CRYPTO_ALGS(x) ((x >> 48) & 0xffff) /* Bits 63:48 */
#define TME_ACTIVATE_CRYPTO_AES_XTS_128 1

-/* Values for mktme_status (SW only construct) */
-#define MKTME_ENABLED 0
-#define MKTME_DISABLED 1
-#define MKTME_UNINITIALIZED 2
-static int mktme_status = MKTME_UNINITIALIZED;
-
static int detect_tme(struct cpuinfo_x86 *c)
{
- u64 tme_activate, tme_policy, tme_crypto_algs;
- int keyid_bits = 0, nr_keyids = 0;
- static u64 tme_activate_cpu0 = 0;
+ int keyid_bits, nr_keyids;
+ u64 tme_activate;

rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);

- if (mktme_status != MKTME_UNINITIALIZED) {
- if (tme_activate != tme_activate_cpu0) {
- /* Broken BIOS? */
- pr_err_once("x86/tme: configuration is inconsistent between CPUs\n");
- pr_err_once("x86/tme: MKTME is not usable\n");
- mktme_status = MKTME_DISABLED;
-
- /* Proceed. We may need to exclude bits from x86_phys_bits. */
- }
- } else {
- tme_activate_cpu0 = tme_activate;
- }
-
if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
pr_info_once("x86/tme: not enabled by BIOS\n");
- mktme_status = MKTME_DISABLED;
return 0;
}

- if (mktme_status != MKTME_UNINITIALIZED)
- goto detect_keyid_bits;
-
pr_info("x86/tme: enabled by BIOS\n");

- tme_policy = TME_ACTIVATE_POLICY(tme_activate);
- if (tme_policy != TME_ACTIVATE_POLICY_AES_XTS_128)
- pr_warn("x86/tme: Unknown policy is active: %#llx\n", tme_policy);
-
- tme_crypto_algs = TME_ACTIVATE_CRYPTO_ALGS(tme_activate);
- if (!(tme_crypto_algs & TME_ACTIVATE_CRYPTO_AES_XTS_128)) {
- pr_err("x86/mktme: No known encryption algorithm is supported: %#llx\n",
- tme_crypto_algs);
- mktme_status = MKTME_DISABLED;
- }
-detect_keyid_bits:
keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
nr_keyids = (1UL << keyid_bits) - 1;
if (nr_keyids) {
@@ -387,11 +352,6 @@ static int detect_tme(struct cpuinfo_x86 *c)
pr_info_once("x86/mktme: disabled by BIOS\n");
}

- if (mktme_status == MKTME_UNINITIALIZED) {
- /* MKTME is usable */
- mktme_status = MKTME_ENABLED;
- }
-
return keyid_bits;
}
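
For reference, pieced together from the hunks above, the simplified
function reads roughly like this (the interior of the nr_keyids branch is
unchanged by the patch and only summarized here):

static int detect_tme(struct cpuinfo_x86 *c)
{
	int keyid_bits, nr_keyids;
	u64 tme_activate;

	rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);

	if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
		pr_info_once("x86/tme: not enabled by BIOS\n");
		return 0;
	}

	pr_info("x86/tme: enabled by BIOS\n");

	keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
	nr_keyids = (1UL << keyid_bits) - 1;
	if (nr_keyids) {
		/* unchanged: report MKTME availability and the KeyID count */
	} else {
		pr_info_once("x86/mktme: disabled by BIOS\n");
	}

	return keyid_bits;
}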

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-02-27 11:00:09

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 06/34] x86/boot: Use consistent value for iomem_resource.end

On Thu, 2024-02-22 at 10:39 -0800, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> The 'struct cpuinfo_x86' values (including 'boot_cpu_info') get
> written and overwritten rather randomly. They are not stable
> during early boot and readers end up getting a random mishmash
> of hard-coded defaults or CPUID-provided values based on when
> the values are read.
>
> iomem_resource.end is one of these users. Because of where it
> is called, it ended up seeing .x86_phys_bits==MAX_PHYSMEM_BITS
> which is (mostly) a compile-time default. But
> iomem_resource.end is never updated if the runtime CPUID
> x86_phys_bits is lower.
>
> Set iomem_resource.end to the compile-time value explicitly.
> It does not need to be precise as this is mostly to ensure
> that insane values can't be reserved in 'iomem_resource'.
>
> Make MAX_PHYSMEM_BITS available outside of sparsemem
> configurations by removing the #ifdef CONFIG_SPARSEMEM in the
> header.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>
> b/arch/x86/include/asm/sparsemem.h | 3 ---
> b/arch/x86/kernel/setup.c | 10 +++++++++-
> 2 files changed, 9 insertions(+), 4 deletions(-)
>
> diff -puN arch/x86/kernel/setup.c~iomem_resource_end arch/x86/kernel/setup.c
> --- a/arch/x86/kernel/setup.c~iomem_resource_end 2024-02-22 10:08:51.048554948 -0800
> +++ b/arch/x86/kernel/setup.c 2024-02-22 10:21:04.485531464 -0800
> @@ -51,6 +51,7 @@
> #include <asm/pci-direct.h>
> #include <asm/prom.h>
> #include <asm/proto.h>
> +#include <asm/sparsemem.h>
> #include <asm/thermal.h>
> #include <asm/unwind.h>
> #include <asm/vsyscall.h>
> @@ -813,7 +814,14 @@ void __init setup_arch(char **cmdline_p)
> */
> early_reserve_memory();
>
> - iomem_resource.end = (1ULL << x86_phys_bits()) - 1;
> + /*
> + * This was too big before. It ended up getting MAX_PHYSMEM_BITS
> + * even if .x86_phys_bits was eventually lowered below that.
> + * But that was evidently harmless, so leave it too big, but
> + * set it explicitly to MAX_PHYSMEM_BITS instead of taking a
> + * trip through .x86_phys_bits.
> + */
> + iomem_resource.end = (1ULL << MAX_PHYSMEM_BITS) - 1;

Paolo's patchset to move MKTME keyid bits detection to early_cpu_init() was
merged to tip:x86/urgent, so it looks like it will land in Linus's tree
before this series:

https://lore.kernel.org/lkml/[email protected]/T/

Paolo's series actually moves the reduction of x86_phys_bits to before
setting iomem_resource.end here, so after rebasing the changelog/comment
doesn't seem to apply anymore.

Perhaps we can get rid of this patch and just set iomem_resource.end based on
x86_phys_bits()?

2024-02-27 11:06:01

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 04/34] x86/mm: Introduce physical address limit helper

On Thu, 2024-02-22 at 10:39 -0800, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> Modern processors enumerate the sizes of the physical and virtual address
> spaces that they can handle. The kernel stashes this information for each
> CPU in 'cpuinfo_x86'.
>
> Like a lot of system-wide configuration, in practice the kernel only uses
> this information from the boot CPU. That makes a lot of sense because
> the kernel can't practically support two CPUs with such different
> fundamental support as the size of the address spaces.
>
> Having a per-cpu copy of this data is silly. It is, at best, a waste
> of space to keep it around for the non-boot CPUs. At worst, it is yet
> another bit of data that must be initialized in a particular order and
> can be a source of bugs.
>
> Introduce a helper to look up the number of supported physical address bits:
>
> x86_phys_bits()
>
> Replace most open-coded references to boot_cpu_data.x86_phys_bits.
>
> This form is more compact and also opens up the door to adding some
> centralized checking and enforcing rules around this important system-
> wide value.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: "Rafael J. Wysocki" <[email protected]>
> Cc: Len Brown <[email protected]>
> ---
>

Reviewed-by: Kai Huang <[email protected]>
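
As a side note for anyone reading the thread without the patch body in
front of them, the helper under review is essentially a thin accessor over
the system-wide value. A minimal sketch (the field name and the checking
hook are assumptions, not quoted from the patch):

/* Sketch only -- not the actual patch hunk. */
static inline u8 x86_phys_bits(void)
{
	/* Central spot for any future sanity checking of the value. */
	return x86_config.phys_addr_bits;
}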

2024-02-27 11:06:15

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 05/34] x86/cpu: Move /proc per-cpu ->x86_phys_bits reference to global one

On Thu, 2024-02-22 at 10:39 -0800, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> This is one of the few references to the per-cpu version of
> ->x86_phys_bits. It is theoretically possible to have this value vary
> from CPU to CPU. But in practice nobody builds systems that way.
>
> Continue outputting the value in /proc, but read it from the global
> configuration.
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>

Reviewed-by: Kai Huang <[email protected]>

2024-02-27 11:11:47

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 07/34] x86/mm: Introduce virtual address space limit helper

On Thu, 2024-02-22 at 10:39 -0800, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> This uses the same logic and approach which were used for the physical
> address limits with x86_phys_bits() and extends them to the virtual
> address space.
>
> Introduce a system-wide helper for users to query the size of the
> virtual address space: x86_virt_bits()
>
> Signed-off-by: Dave Hansen <[email protected]>
> ---
>

Reviewed-by: Kai Huang <[email protected]>

2024-02-27 21:50:02

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure



On 27/02/2024 1:14 am, Kirill A. Shutemov wrote:
> On Fri, Feb 23, 2024 at 08:22:16AM -0800, Dave Hansen wrote:
>> On 2/23/24 03:33, Kirill A. Shutemov wrote:
>>> On Thu, Feb 22, 2024 at 10:39:41AM -0800, Dave Hansen wrote:
>>>> From: Dave Hansen <[email protected]>
>>>>
>>>> Intel also does memory encryption and also fiddles with the physical
>>>> address bits. This is currently called for *each* CPU, but practically
>>>> only done on the boot CPU because of 'mktme_status'.
>>>>
>>>> Move it from the "each CPU" ->c_init() function to ->c_bsp_init() where
>>>> the whole thing only gets called once ever. This also necessitates moving
>>>> detect_tme() and its entourage around in the file.
>>> The state machine around mktme_state doesn't make sense if we only call it
>>> on boot CPU, so detect_tme() can be drastically simplified. I can do it on
>>> top of the patchset.
>>
>> That would be great. Looking at it again, the (tme_activate !=
>> tme_activate_cpu0) block is total cruft now. It probably just needs to
>> get moved to secondary CPU startup.
>
> I have never seen the check be useful. I think it can just be dropped.
>
> The patch below makes detect_tme() only enumerate TME and MKTME and
> report the number of keyid bits. The kernel doesn't care about anything else.
>
> Any comments?
>
> From 1080535093d21f025d46fd610de5ad788591f4b5 Mon Sep 17 00:00:00 2001
> From: "Kirill A. Shutemov" <[email protected]>
> Date: Mon, 26 Feb 2024 14:01:01 +0200
> Subject: [PATCH] x86/cpu/intel: Simplify detect_tme()
>
> The detect_tme() function is now only called by the boot CPU. The logic
> for cross-checking TME configuration between CPUs is no longer used. It
> has never identified a real problem and can be safely removed.
>
> The kernel does not use MKTME and is not concerned with MKTME policy or
> encryption algorithms. There is no need to check them.
>
> Signed-off-by: Kirill A. Shutemov <[email protected]>
> ---
> arch/x86/kernel/cpu/intel.c | 44 ++-----------------------------------
> 1 file changed, 2 insertions(+), 42 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
> index 4192aa4576f4..60918b49344c 100644
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -329,55 +329,20 @@ static void early_init_intel(struct cpuinfo_x86 *c)
> #define TME_ACTIVATE_CRYPTO_ALGS(x) ((x >> 48) & 0xffff) /* Bits 63:48 */
> #define TME_ACTIVATE_CRYPTO_AES_XTS_128 1
>
> -/* Values for mktme_status (SW only construct) */
> -#define MKTME_ENABLED 0
> -#define MKTME_DISABLED 1
> -#define MKTME_UNINITIALIZED 2
> -static int mktme_status = MKTME_UNINITIALIZED;
> -
> static int detect_tme(struct cpuinfo_x86 *c)
> {
> - u64 tme_activate, tme_policy, tme_crypto_algs;
> - int keyid_bits = 0, nr_keyids = 0;
> - static u64 tme_activate_cpu0 = 0;
> + int keyid_bits, nr_keyids;
> + u64 tme_activate;
>
> rdmsrl(MSR_IA32_TME_ACTIVATE, tme_activate);
>
> - if (mktme_status != MKTME_UNINITIALIZED) {
> - if (tme_activate != tme_activate_cpu0) {
> - /* Broken BIOS? */
> - pr_err_once("x86/tme: configuration is inconsistent between CPUs\n");
> - pr_err_once("x86/tme: MKTME is not usable\n");
> - mktme_status = MKTME_DISABLED;
> -
> - /* Proceed. We may need to exclude bits from x86_phys_bits. */
> - }
> - } else {
> - tme_activate_cpu0 = tme_activate;
> - }
> -
> if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
> pr_info_once("x86/tme: not enabled by BIOS\n");

Since detect_tme() is now only called on the BSP, it seems we can change
this to pr_info(), which is used ...
> - mktme_status = MKTME_DISABLED;
> return 0;
> }
>
> - if (mktme_status != MKTME_UNINITIALIZED)
> - goto detect_keyid_bits;
> -
> pr_info("x86/tme: enabled by BIOS\n");

.. right here anyway.

>
> - tme_policy = TME_ACTIVATE_POLICY(tme_activate);
> - if (tme_policy != TME_ACTIVATE_POLICY_AES_XTS_128)
> - pr_warn("x86/tme: Unknown policy is active: %#llx\n", tme_policy);
> -
> - tme_crypto_algs = TME_ACTIVATE_CRYPTO_ALGS(tme_activate);
> - if (!(tme_crypto_algs & TME_ACTIVATE_CRYPTO_AES_XTS_128)) {
> - pr_err("x86/mktme: No known encryption algorithm is supported: %#llx\n",
> - tme_crypto_algs);
> - mktme_status = MKTME_DISABLED;
> - }
> -detect_keyid_bits:
> keyid_bits = TME_ACTIVATE_KEYID_BITS(tme_activate);
> nr_keyids = (1UL << keyid_bits) - 1;
> if (nr_keyids) {
> @@ -387,11 +352,6 @@ static int detect_tme(struct cpuinfo_x86 *c)
> pr_info_once("x86/mktme: disabled by BIOS\n");
> }
>
> - if (mktme_status == MKTME_UNINITIALIZED) {
> - /* MKTME is usable */
> - mktme_status = MKTME_ENABLED;
> - }
> -
> return keyid_bits;

Other than that the code change LGTM.

Reviewed-by: Kai Huang <[email protected]>

2024-02-28 00:08:19

by Huang, Kai

[permalink] [raw]
Subject: Re: [RFC][PATCH 09/34] x86/cpu: Introduce address configuration structure



On 23/02/2024 7:39 am, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> Right now, 'boot_cpu_data' is established in a very ad hoc way. Random
> bits of the initialization code write random things to it, either
> blowing away state or tweaking it as they see fit. It's madness.
>
> Add some more structure to the process.
>
> Introduce an "address configuration" structure just for the boot CPU.
> This will be used to establish system-wide address space configuration.
> It is written while bringing up the boot CPU and then read *ONCE* to
> establish the configuration.
>


From the above description, ...

> +struct x86_addr_config bsp_addr_config;
> +

.. this can be __ro_after_init?

Maybe I am missing something, but if it cannot be annotated with
__ro_after_init here, IMHO it would be helpful to call out in the changelog.


2024-02-28 14:31:44

by Zhang, Rui

[permalink] [raw]
Subject: Re: [RFC][PATCH 06/34] x86/boot: Use consistent value for iomem_resource.end

On Tue, 2024-02-27 at 10:59 +0000, Huang, Kai wrote:
> On Thu, 2024-02-22 at 10:39 -0800, Dave Hansen wrote:
> > From: Dave Hansen <[email protected]>
> >
> > The 'struct cpuinfo_x86' values (including 'boot_cpu_info') get
> > written and overwritten rather randomly.  They are not stable
> > during early boot and readers end up getting a random mishmash
> > of hard-coded defaults or CPUID-provided values based on when
> > the values are read.
> >
> > iomem_resource.end is one of these users.  Because of where it
> > is called, it ended up seeing .x86_phys_bits==MAX_PHYSMEM_BITS
> > which is (mostly) a compile-time default.  But
> > iomem_resource.end is never updated if the runtime CPUID
> > x86_phys_bits is lower.
> >
> > Set iomem_resource.end to the compile-time value explicitly.
> > It does not need to be precise as this is mostly to ensure
> > that insane values can't be reserved in 'iomem_resource'.
> >
> > Make MAX_PHYSMEM_BITS available outside of sparsemem
> > configurations by removing the #ifdef CONFIG_SPARSEMEM in the
> > header.
> >
> > Signed-off-by: Dave Hansen <[email protected]>
> > ---
> >
> >  b/arch/x86/include/asm/sparsemem.h |    3 ---
> >  b/arch/x86/kernel/setup.c          |   10 +++++++++-
> >  2 files changed, 9 insertions(+), 4 deletions(-)
> >
> > diff -puN arch/x86/kernel/setup.c~iomem_resource_end arch/x86/kernel/setup.c
> > --- a/arch/x86/kernel/setup.c~iomem_resource_end 2024-02-22 10:08:51.048554948 -0800
> > +++ b/arch/x86/kernel/setup.c 2024-02-22 10:21:04.485531464 -0800
> > @@ -51,6 +51,7 @@
> > #include <asm/pci-direct.h>
> > #include <asm/prom.h>
> > #include <asm/proto.h>
> > +#include <asm/sparsemem.h>
> > #include <asm/thermal.h>
> > #include <asm/unwind.h>
> > #include <asm/vsyscall.h>
> > @@ -813,7 +814,14 @@ void __init setup_arch(char **cmdline_p)
> > */
> > early_reserve_memory();
> >
> > - iomem_resource.end = (1ULL << x86_phys_bits()) - 1;
> > + /*
> > + * This was too big before. It ended up getting MAX_PHYSMEM_BITS
> > + * even if .x86_phys_bits was eventually lowered below that.
> > + * But that was evidently harmless, so leave it too big, but
> > + * set it explicitly to MAX_PHYSMEM_BITS instead of taking a
> > + * trip through .x86_phys_bits.
> > + */
> > + iomem_resource.end = (1ULL << MAX_PHYSMEM_BITS) - 1;
>
> Paolo's patchset to move MKTME keyid bits detection to early_cpu_init() was
> merged to tip:x86/urgent, so it looks like it will land in Linus's tree
> before this series:
>
> https://lore.kernel.org/lkml/[email protected]/T/
>
> Paolo's series actually moves the reduction of x86_phys_bits to before
> setting iomem_resource.end here, so after rebasing the changelog/comment
> doesn't seem to apply anymore.

My understanding is that the below order is always true,

	setup_arch()
		early_cpu_init()
			get_cpu_address_sizes()
		iomem_resource.end = (1ULL << x86_phys_bits()) - 1;

with or without the above patch.

>
> Perhaps we can get rid of this patch and just set iomem_resource.end
> based on x86_phys_bits()?
>
Agreed.

thanks,
rui

2024-02-28 15:21:02

by Kirill A. Shutemov

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure

On Wed, Feb 28, 2024 at 10:48:19AM +1300, Huang, Kai wrote:
> > if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
> > pr_info_once("x86/tme: not enabled by BIOS\n");
>
> Since detect_tme() is now only called on the BSP, it seems we can change
> this to pr_info(), which is used ...

Right.

Dave, do you want a new version? Or can you fix it up on apply?

--
Kiryl Shutsemau / Kirill A. Shutemov

2024-02-28 16:58:56

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 11/34] x86/cpu/intel: Prepare MKTME for "address configuration" infrastructure

On 2/28/24 07:20, Kirill A. Shutemov wrote:
> On Wed, Feb 28, 2024 at 10:48:19AM +1300, Huang, Kai wrote:
>>> if (!TME_ACTIVATE_LOCKED(tme_activate) || !TME_ACTIVATE_ENABLED(tme_activate)) {
>>> pr_info_once("x86/tme: not enabled by BIOS\n");
>> Since detect_tme() is now only called on the BSP, it seems we can change
>> this to pr_info(), which is used ...
> Right.
>
> Dave, do you want a new version? Or can you fix it up on apply?

I've already got a version of your patch pulled into my tree. I did the
conversion there. Thanks for asking!

2024-02-28 17:29:12

by Dave Hansen

[permalink] [raw]
Subject: Re: [RFC][PATCH 09/34] x86/cpu: Introduce address configuration structure

On 2/27/24 15:47, Huang, Kai wrote:
> From the above description, ...
>
>> +struct x86_addr_config bsp_addr_config;
>> +
>
> ... this can be __ro_after_init?
>
> Maybe I am missing something, but if it cannot be annotated with
> __ro_after_init here, IMHO it would be helpful to call out in the
> changelog.

It can actually be full-on __init ... eventually. It goes *AWAY* after boot.

I'll add a note in the changelog about why it can't be until later in the series.
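
Roughly, the eventual state being described would look something like this
(sketch only; the exact annotation is an assumption and lands later in the
series):

/*
 * Sketch of the end state: the inputs only need to live until the boot
 * CPU has distilled them, so rather than merely being __ro_after_init
 * they can become init-only data and be discarded after boot.
 */
struct x86_addr_config bsp_addr_config __initdata;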

2024-03-07 16:44:39

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH 01/34] x86/xen: Explain why calling get_cpu_cap() so early is a hack

On Thu, Feb 22, 2024 at 10:39:28AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> xen_start_kernel() some of the first C code run "Xen PV" systems.

That sentence reads weird.

> It precedes normally very early things like setup_arch() or the
> processor initialization code.
>
> That means that 'boot_cpu_data' is garbage. It has not even

s/is/contains/

> established the utter basics like if the CPU supports the CPUID
> instruction. Unfortunately get_cpu_cap() requires this exact
> information.
>
> Nevertheless xen_start_kernel() calls get_cpu_cap(). But it
> works out in practice because it's looking for the NX bit which
> comes from an extended CPUID leaf that doesn't depend on
> c->cpuid_level being set. This also implicitly assumes that
> Xen PV guests support CPUID.
>
> Leave the hack in place, but at least explain some of what is
> going on.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Juergen Gross <[email protected]>
> ---
>
> b/arch/x86/xen/enlighten_pv.c | 6 +++++-
> 1 file changed, 5 insertions(+), 1 deletion(-)
>
> diff -puN arch/x86/xen/enlighten_pv.c~xen-explain1 arch/x86/xen/enlighten_pv.c
> --- a/arch/x86/xen/enlighten_pv.c~xen-explain1 2024-02-22 10:08:48.404451146 -0800
> +++ b/arch/x86/xen/enlighten_pv.c 2024-02-22 10:08:48.404451146 -0800
> @@ -1372,7 +1372,11 @@ asmlinkage __visible void __init xen_sta
> /* Get mfn list */
> xen_build_dynamic_phys_to_machine();
>
> - /* Work out if we support NX */
> + /*
> + * This is a hack. 'boot_cpu_data' is not filled out enough
> + * for get_cpu_cap() to function fully. But it _can_ fill out
> + * the leaves where NX is. Gross, but it works.
> + */
> get_cpu_cap(&boot_cpu_data);
> x86_configure_nx();

If all it needs is to figure out whether it supports NX, just do that by
foot here and get rid of the get_cpu_cap(&boot_cpu_data) call
completely.

It won't be a hack anymore.
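
Something along these lines, as a rough, untested sketch (NX is
CPUID.80000001H:EDX bit 20; not a hunk from the series):

	/* Check for NX by foot instead of relying on get_cpu_cap(). */
	if (cpuid_eax(0x80000000) >= 0x80000001 &&
	    (cpuid_edx(0x80000001) & (1 << 20)))
		setup_force_cpu_cap(X86_FEATURE_NX);
	x86_configure_nx();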

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-07 17:02:24

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH 02/34] x86/xen: Remove early "debug" physical address lookups

On Thu, Feb 22, 2024 at 10:39:29AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The __pa() facility is subject to debugging checks if CONFIG_DEBUG_VIRTUAL=y.
> One of those debugging checks is whether the physical address is valid
> on the platform. That information is normally available via CPUID. But
> the __pa() code currently looks it up in 'boot_cpu_data' which is not
> fully set up in early Xen PV boot.
>
> The Xen PV code currently tries to get this info with
> get_cpu_address_sizes() which also depends on 'boot_cpu_data' to be at
> least somewhat set up. The result is that the c->x86_phys_bits gets a
> sane value, but not one that has anything to do with the hardware. In
> other words, the CONFIG_DEBUG_VIRTUAL checks are performed with what
> amounts to garbage inputs.
>
> Garbage checks are worse than no check at all. Move over to the
> "nodebug" variant to axe the checks.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Juergen Gross <[email protected]>
> ---
>
> b/arch/x86/xen/enlighten_pv.c | 2 +-
> b/arch/x86/xen/mmu_pv.c | 2 +-
> 2 files changed, 2 insertions(+), 2 deletions(-)

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

At least as far as I can follow the Xen boot flow, yap, this makes
sense.

But let's have Jürgen confirm first.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-08 08:59:58

by Borislav Petkov

[permalink] [raw]
Subject: Re: [RFC][PATCH 03/34] x86/pci: Assume that clflush size is always provided

On Thu, Feb 22, 2024 at 10:39:30AM -0800, Dave Hansen wrote:
>
> From: Dave Hansen <[email protected]>
>
> The early boot code always sets _some_ clflush size. Use that fact

s/clflush/CLFLUSH/g

as it is an insn mnemonic.

> to avoid handling the case where it is not set.
>
> There may have been a time when the Xen PV call in here was way too
> early. But it calls get_cpu_address_sizes() before calling here
> now. It should also be safe.
>
> Note: This series will eventually return sane defaults even very
> early in boot. I believe this is safe now, but it becomes *really*
> safe later on.

This should probably be ...

>
> Signed-off-by: Dave Hansen <[email protected]>
> ---

<--- ... here as when applied, "this series" means nothing.

With that:

Reviewed-by: Borislav Petkov (AMD) <[email protected]>

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2024-03-11 13:16:41

by Juergen Gross

[permalink] [raw]
Subject: Re: [RFC][PATCH 02/34] x86/xen: Remove early "debug" physical address lookups

On 22.02.24 19:39, Dave Hansen wrote:
> From: Dave Hansen <[email protected]>
>
> The __pa() facility is subject to debugging checks if CONFIG_DEBUG_VIRTUAL=y.
> One of those debugging checks is whether the physical address is valid
> on the platform. That information is normally available via CPUID. But
> the __pa() code currently looks it up in 'boot_cpu_data' which is not
> fully set up in early Xen PV boot.
>
> The Xen PV code currently tries to get this info with
> get_cpu_address_sizes() which also depends on 'boot_cpu_data' to be at
> least somewhat set up. The result is that the c->x86_phys_bits gets a
> sane value, but not one that has anything to do with the hardware. In
> other words, the CONFIG_DEBUG_VIRTUAL checks are performed with what
> amounts to garbage inputs.
>
> Garbage checks are worse than no check at all. Move over to the
> "nodebug" variant to axe the checks.
>
> Signed-off-by: Dave Hansen <[email protected]>
> Cc: Juergen Gross <[email protected]>

Reviewed-by: Juergen Gross <[email protected]>


Juergen