This series tries to fix the rather special case of PAT being available
without MTRRs (either because CONFIG_MTRR is not set, or because the
feature has been disabled, e.g. by a hypervisor).
The main use cases are Xen PV guests and SEV-SNP guests running under
Hyper-V.
Instead of trying to work around all the issues with if statements here
and there, just use the complete available infrastructure by setting up
a read-only MTRR state when needed.
In the Xen PV case the current MTRR MSR values can be read from the
hypervisor, while for the SEV-SNP case all that is needed is to set the
default caching mode to "WB".
I have added more cleanup which was discussed while looking into the
most recent failures.
Note that I couldn't test the Hyper-V related change (patch 3).
Running on bare metal and with Xen didn't show any problems with the
series applied.
It should be noted that patches 9+10 replace today's way of looking up
the MTRR cache type for a memory region: instead of evaluating the MTRR
register values on each lookup, a memory map with the cache types is
built once and then consulted. This should make the lookup much faster
and much easier to understand. A simplified sketch of the resulting
lookup is shown below.
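Conceptually the new lookup boils down to a single linear walk over a
sorted, non-overlapping array of address ranges. The sketch below is
for illustration only and is not the exact code of this series; struct
cache_range is a simplified stand-in for the series' struct cache_map,
and type_merge() stands for the helper of the same name added together
with the new lookup:

  struct cache_range {
          u64 start;
          u64 end;
          u8  type;
  };

  /*
   * Sketch: merge the cache modes of all map entries touched by
   * [start, end); holes between entries get the default type.
   */
  static u8 range_lookup(const struct cache_range *map, unsigned int n,
                         u64 start, u64 end, u8 def_type, u8 *uniform)
  {
          u8 type = MTRR_TYPE_INVALID;
          unsigned int i;

          *uniform = 1;

          for (i = 0; i < n && start < end; i++) {
                  if (start >= map[i].end)
                          continue;
                  if (start < map[i].start)
                          type = type_merge(type, def_type, uniform);
                  type = type_merge(type, map[i].type, uniform);
                  start = map[i].end;
          }
          if (start < end)
                  type = type_merge(type, def_type, uniform);

          return type;
  }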
Changes in V2:
- replaced former patches 1+2 with new patches 1-4, avoiding especially
the rather hacky approach of V1, while making all the MTRR type
conflict tests available for the Xen PV case
- updated patch 6 (was patch 4 in V1)
Changes in V3:
- dropped patch 5 of V2, as already applied
- split patch 1 of V2 into 2 patches
- new patches 6-10
- addressed comments
Changes in V4:
- addressed comments
Changes in V5:
- addressed comments
- some other small fixes
- new patches 3, 8 and 15
Juergen Gross (15):
x86/mtrr: split off physical address size calculation
x86/mtrr: optimize mtrr_calc_physbits()
x86/mtrr: replace some constants with defines
x86/mtrr: support setting MTRR state for software defined MTRRs
x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
x86/xen: set MTRR state when running as Xen PV initial domain
x86/mtrr: replace vendor tests in MTRR code
x86/mtrr: have only one set_mtrr() variant
x86/mtrr: allocate mtrr_value array dynamically
x86/mtrr: add get_effective_type() service function
x86/mtrr: construct a memory map with cache modes
x86/mtrr: use new cache_map in mtrr_type_lookup()
x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
x86/mm: only check uniform after calling mtrr_type_lookup()
x86/mtrr: remove unused code
arch/x86/include/asm/mtrr.h | 44 ++-
arch/x86/include/uapi/asm/mtrr.h | 6 +-
arch/x86/kernel/cpu/mshyperv.c | 4 +
arch/x86/kernel/cpu/mtrr/amd.c | 2 +-
arch/x86/kernel/cpu/mtrr/centaur.c | 11 +-
arch/x86/kernel/cpu/mtrr/cleanup.c | 6 +-
arch/x86/kernel/cpu/mtrr/cyrix.c | 2 +-
arch/x86/kernel/cpu/mtrr/generic.c | 578 +++++++++++++++++++----------
arch/x86/kernel/cpu/mtrr/mtrr.c | 146 ++++----
arch/x86/kernel/cpu/mtrr/mtrr.h | 7 +-
arch/x86/kernel/setup.c | 2 +
arch/x86/mm/pgtable.c | 24 +-
arch/x86/xen/enlighten_pv.c | 52 +++
13 files changed, 573 insertions(+), 311 deletions(-)
--
2.35.3
When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically
the hypervisor will reset the MTRR feature in CPUID data, resulting
in no MTRR memory type information being available for the kernel.
This has turned out to result in problems:
- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead
Solve those problems by adding support for setting a static MTRR state,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
will only be used by mtrr_type_lookup(), as in all other cases
mtrr_enabled() is being checked, which will return false. Accept the
overwrite call only for selected cases when running as a guest.
Disable X86_FEATURE_MTRR so that any attempted MTRR modifications are
simply refused.
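For illustration only (the real callers are the Hyper-V and Xen PV
patches later in this series), a guest platform hook would install such
a state roughly like below; the hook name is made up:

  #include <asm/mtrr.h>

  static void __init example_guest_init_platform(void)
  {
          /*
           * No variable ranges, default type WB for the whole address
           * space. Must be called from x86_init.hyper.init_platform(),
           * i.e. before mtrr_bp_init().
           */
          mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
  }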
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V2:
- new patch
V3:
- omit fixed MTRRs, as those are currently not needed
- disable X86_FEATURE_MTRR instead of testing it
- provide a stub for !CONFIG_MTRR (Michael Kelley)
- use cpu_feature_enabled() (Boris Petkov)
- add tests for mtrr_overwrite_state() being allowed (Boris Petkov)
V4:
- add test for hv_is_isolation_supported() (Michael Kelley)
V5:
- drop test for running as native (Boris Petkov)
- split large complex test into multiple simple ones (Boris Petkov)
- enhance test in mtrr_bp_init() (Boris Petkov)
---
arch/x86/include/asm/mtrr.h | 8 +++++
arch/x86/kernel/cpu/mtrr/generic.c | 58 +++++++++++++++++++++++++++++-
arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++
arch/x86/kernel/setup.c | 2 ++
4 files changed, 76 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index 4e59f7854950..6decb18e22ed 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -49,6 +49,8 @@
*/
# ifdef CONFIG_MTRR
void mtrr_bp_init(void);
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+ mtrr_type def_type);
extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
extern void mtrr_save_fixed_ranges(void *);
extern void mtrr_save_state(void);
@@ -66,6 +68,12 @@ void mtrr_disable(void);
void mtrr_enable(void);
void mtrr_generic_set_state(void);
# else
+static inline void mtrr_overwrite_state(struct mtrr_var_range *var,
+ unsigned int num_var,
+ mtrr_type def_type)
+{
+}
+
static inline u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform)
{
/*
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 9a12da76635c..0794f3f1cc27 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -8,10 +8,12 @@
#include <linux/init.h>
#include <linux/io.h>
#include <linux/mm.h>
-
+#include <linux/cc_platform.h>
#include <asm/processor-flags.h>
#include <asm/cacheinfo.h>
#include <asm/cpufeature.h>
+#include <asm/hypervisor.h>
+#include <asm/mshyperv.h>
#include <asm/tlbflush.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
@@ -241,6 +243,60 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
return mtrr_state.def_type;
}
+/**
+ * mtrr_overwrite_state - set static MTRR state
+ *
+ * Used to set MTRR state via different means (e.g. with data obtained from
+ * a hypervisor).
+ * Is allowed only for special cases when running virtualized. Must be called
+ * from the x86_init.hyper.init_platform() hook.
+ */
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+ mtrr_type def_type)
+{
+ unsigned int i;
+
+ /* Only allowed to be called once before mtrr_bp_init(). */
+ if (WARN_ON(mtrr_state_set))
+ return;
+
+ /* Only allowed when running virtualized. */
+ if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
+ return;
+
+ /*
+ * Only allowed for special virtualization cases:
+ * - when running as SEV-SNP guest
+ * - when running as Hyper-V isolated guest
+ * - when running as Xen PV guest
+ * - when running as TSX guest
+ */
+ if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
+ !hv_is_isolation_supported() &&
+ !cpu_feature_enabled(X86_FEATURE_XENPV) &&
+ !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
+ return;
+
+ /* Disable MTRR in order to disable MTRR modifications. */
+ setup_clear_cpu_cap(X86_FEATURE_MTRR);
+
+ if (var) {
+ if (num_var > MTRR_MAX_VAR_RANGES) {
+ pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
+ num_var);
+ num_var = MTRR_MAX_VAR_RANGES;
+ }
+ for (i = 0; i < num_var; i++)
+ mtrr_state.var_ranges[i] = var[i];
+ num_var_ranges = num_var;
+ }
+
+ mtrr_state.def_type = def_type;
+ mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
+
+ mtrr_state_set = 1;
+}
+
/**
* mtrr_type_lookup - look up memory type in MTRR
*
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 1beb38f7a7a3..1c19d67ddab3 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
const char *why = "(not available)";
unsigned int phys_addr;
+ if (!generic_mtrrs && mtrr_state.enabled) {
+ /* Software overwrite of MTRR state, only for generic case. */
+ mtrr_calc_physbits(true);
+ init_table();
+ pr_info("MTRRs set to read-only\n");
+
+ return;
+ }
+
phys_addr = mtrr_calc_physbits(generic_mtrrs);
if (generic_mtrrs) {
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 16babff771bd..0cccfeb67c3a 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -1037,6 +1037,8 @@ void __init setup_arch(char **cmdline_p)
/*
* VMware detection requires dmi to be available, so this
* needs to be done after dmi_setup(), for the boot CPU.
+ * For some guest types (Xen PV, SEV-SNP, TDX) it is required to be
+ * called before cache_bp_init() for setting up MTRR state.
*/
init_hypervisor_platform();
--
2.35.3
In order to avoid mappings using the UC- cache attribute, set the
MTRR state to use WB caching as the default.
This is needed in order to cope with the fact that PAT is enabled,
while MTRRs are not supported by the hypervisor.
Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V2:
- new patch
---
arch/x86/kernel/cpu/mshyperv.c | 4 ++++
1 file changed, 4 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index f36dc2f796c5..0a6cc3cf8919 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -34,6 +34,7 @@
#include <clocksource/hyperv_timer.h>
#include <asm/numa.h>
#include <asm/coco.h>
+#include <asm/mtrr.h>
/* Is Linux running as the root partition? */
bool hv_root_partition;
@@ -408,6 +409,9 @@ static void __init ms_hyperv_init_platform(void)
#ifdef CONFIG_SWIOTLB
swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
#endif
+
+ /* Set WB as the default cache mode. */
+ mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
}
/* Isolation VMs are unenlightened SEV-based VMs, thus this check: */
if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
--
2.35.3
When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.
This has changed recently, with PAT now being accepted as enabled while
MTRRs stay disabled. The result is that mtrr_type_lookup() no longer
accepts all memory type requests, but started to return WB even if UC-
was requested. This led to driver failures during initialization of
some devices.
In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.
In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.
Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross <[email protected]>
Reviewed-by: Boris Ostrovsky <[email protected]>
---
V2:
- new patch
V3:
- move the call of mtrr_overwrite_state() to xen_pv_init_platform()
V4:
- only call mtrr_overwrite_state() if any MTRR got from Xen
(Boris Ostrovsky)
---
arch/x86/xen/enlighten_pv.c | 52 +++++++++++++++++++++++++++++++++++++
1 file changed, 52 insertions(+)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 093b78c8bbec..fdaea02ab5ab 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -68,6 +68,7 @@
#include <asm/reboot.h>
#include <asm/hypervisor.h>
#include <asm/mach_traps.h>
+#include <asm/mtrr.h>
#include <asm/mwait.h>
#include <asm/pci_x86.h>
#include <asm/cpu.h>
@@ -119,6 +120,54 @@ static int __init parse_xen_msr_safe(char *str)
}
early_param("xen_msr_safe", parse_xen_msr_safe);
+/* Get MTRR settings from Xen and put them into mtrr_state. */
+static void __init xen_set_mtrr_data(void)
+{
+#ifdef CONFIG_MTRR
+ struct xen_platform_op op = {
+ .cmd = XENPF_read_memtype,
+ .interface_version = XENPF_INTERFACE_VERSION,
+ };
+ unsigned int reg;
+ unsigned long mask;
+ uint32_t eax, width;
+ static struct mtrr_var_range var[MTRR_MAX_VAR_RANGES] __initdata;
+
+ /* Get physical address width (only 64-bit cpus supported). */
+ width = 36;
+ eax = cpuid_eax(0x80000000);
+ if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
+ eax = cpuid_eax(0x80000008);
+ width = eax & 0xff;
+ }
+
+ for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) {
+ op.u.read_memtype.reg = reg;
+ if (HYPERVISOR_platform_op(&op))
+ break;
+
+ /*
+ * Only called in dom0, which has all RAM PFNs mapped at
+ * RAM MFNs, and all PCI space etc. is identity mapped.
+ * This means we can treat MFN == PFN regarding MTRR settings.
+ */
+ var[reg].base_lo = op.u.read_memtype.type;
+ var[reg].base_lo |= op.u.read_memtype.mfn << PAGE_SHIFT;
+ var[reg].base_hi = op.u.read_memtype.mfn >> (32 - PAGE_SHIFT);
+ mask = ~((op.u.read_memtype.nr_mfns << PAGE_SHIFT) - 1);
+ mask &= (1UL << width) - 1;
+ if (mask)
+ mask |= MTRR_MASK_VALID;
+ var[reg].mask_lo = mask;
+ var[reg].mask_hi = mask >> 32;
+ }
+
+ /* Only overwrite MTRR state if any MTRR could be got from Xen. */
+ if (reg)
+ mtrr_overwrite_state(var, reg, MTRR_TYPE_UNCACHABLE);
+#endif
+}
+
static void __init xen_pv_init_platform(void)
{
/* PV guests can't operate virtio devices without grants. */
@@ -135,6 +184,9 @@ static void __init xen_pv_init_platform(void)
/* pvclock is in shared info area */
xen_init_time_ops();
+
+ if (xen_initial_domain())
+ xen_set_mtrr_data();
}
static void __init xen_pv_guest_late_init(void)
--
2.35.3
Today there are two variants of set_mtrr(): one calling stop_machine()
and one calling stop_machine_cpuslocked().
The first one (set_mtrr()) has only one caller, and this caller always
runs with only one CPU online and with interrupts disabled.
Remove the first variant completely and replace the call of it with
a call of mtrr_if->set().
Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that
there is only one variant left.
Signed-off-by: Juergen Gross <[email protected]>
---
V5:
- new patch
---
arch/x86/kernel/cpu/mtrr/mtrr.c | 28 ++++++++--------------------
1 file changed, 8 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 46aae69d259e..4fa3d0f94f39 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -192,20 +192,8 @@ static inline int types_compatible(mtrr_type type1, mtrr_type type2)
* Note that the mechanism is the same for UP systems, too; all the SMP stuff
* becomes nops.
*/
-static void
-set_mtrr(unsigned int reg, unsigned long base, unsigned long size, mtrr_type type)
-{
- struct set_mtrr_data data = { .smp_reg = reg,
- .smp_base = base,
- .smp_size = size,
- .smp_type = type
- };
-
- stop_machine(mtrr_rendezvous_handler, &data, cpu_online_mask);
-}
-
-static void set_mtrr_cpuslocked(unsigned int reg, unsigned long base,
- unsigned long size, mtrr_type type)
+static void set_mtrr(unsigned int reg, unsigned long base, unsigned long size,
+ mtrr_type type)
{
struct set_mtrr_data data = { .smp_reg = reg,
.smp_base = base,
@@ -335,7 +323,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
/* Search for an empty MTRR */
i = mtrr_if->get_free_region(base, size, replace);
if (i >= 0) {
- set_mtrr_cpuslocked(i, base, size, type);
+ set_mtrr(i, base, size, type);
if (likely(replace < 0)) {
mtrr_usage_table[i] = 1;
} else {
@@ -343,7 +331,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
if (increment)
mtrr_usage_table[i]++;
if (unlikely(replace != i)) {
- set_mtrr_cpuslocked(replace, 0, 0, 0);
+ set_mtrr(replace, 0, 0, 0);
mtrr_usage_table[replace] = 0;
}
}
@@ -471,7 +459,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
goto out;
}
if (--mtrr_usage_table[reg] < 1)
- set_mtrr_cpuslocked(reg, 0, 0, 0);
+ set_mtrr(reg, 0, 0, 0);
error = reg;
out:
mutex_unlock(&mtrr_mutex);
@@ -601,9 +589,9 @@ static void mtrr_restore(void)
for (i = 0; i < num_var_ranges; i++) {
if (mtrr_value[i].lsize) {
- set_mtrr(i, mtrr_value[i].lbase,
- mtrr_value[i].lsize,
- mtrr_value[i].ltype);
+ mtrr_if->set(i, mtrr_value[i].lbase,
+ mtrr_value[i].lsize,
+ mtrr_value[i].ltype);
}
}
}
--
2.35.3
Modern CPUs all share the same MTRR interface implemented via
generic_mtrr_ops.
At several places in the MTRR code this generic interface is deduced
via is_cpu(INTEL) tests, which only work because X86_VENDOR_INTEL is 0
(the is_cpu() macro tests mtrr_if->vendor, which isn't explicitly set
in generic_mtrr_ops).
Fix that by replacing the is_cpu(INTEL) tests with tests of mtrr_if
being &generic_mtrr_ops.
The only other place where the .vendor member of struct mtrr_ops is
used is set_num_var_ranges(), where the number of MTRR registers is
determined depending on the vendor. This can easily be changed by
replacing .vendor with the static number of MTRR registers.
It should be noted that the test "is_cpu(HYGON)" never returned true,
as there is no struct mtrr_ops with that vendor information.
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V3:
- new patch
V4:
- use cpu_feature_enabled(X86_FEATURE_MTRR) for testing generic MTRRs
(Boris Petkov)
---
arch/x86/kernel/cpu/mtrr/amd.c | 2 +-
arch/x86/kernel/cpu/mtrr/centaur.c | 2 +-
arch/x86/kernel/cpu/mtrr/cleanup.c | 4 ++--
arch/x86/kernel/cpu/mtrr/cyrix.c | 2 +-
arch/x86/kernel/cpu/mtrr/generic.c | 2 +-
arch/x86/kernel/cpu/mtrr/mtrr.c | 8 +++-----
arch/x86/kernel/cpu/mtrr/mtrr.h | 4 +---
7 files changed, 10 insertions(+), 14 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/amd.c b/arch/x86/kernel/cpu/mtrr/amd.c
index eff6ac62c0ff..ef3e8e42b782 100644
--- a/arch/x86/kernel/cpu/mtrr/amd.c
+++ b/arch/x86/kernel/cpu/mtrr/amd.c
@@ -110,7 +110,7 @@ amd_validate_add_page(unsigned long base, unsigned long size, unsigned int type)
}
const struct mtrr_ops amd_mtrr_ops = {
- .vendor = X86_VENDOR_AMD,
+ .var_regs = 2,
.set = amd_set_mtrr,
.get = amd_get_mtrr,
.get_free_region = generic_get_free_region,
diff --git a/arch/x86/kernel/cpu/mtrr/centaur.c b/arch/x86/kernel/cpu/mtrr/centaur.c
index b8a74eddde83..4466ddeb0125 100644
--- a/arch/x86/kernel/cpu/mtrr/centaur.c
+++ b/arch/x86/kernel/cpu/mtrr/centaur.c
@@ -112,7 +112,7 @@ centaur_validate_add_page(unsigned long base, unsigned long size, unsigned int t
}
const struct mtrr_ops centaur_mtrr_ops = {
- .vendor = X86_VENDOR_CENTAUR,
+ .var_regs = 8,
.set = centaur_set_mcr,
.get = centaur_get_mcr,
.get_free_region = centaur_get_free_region,
diff --git a/arch/x86/kernel/cpu/mtrr/cleanup.c b/arch/x86/kernel/cpu/mtrr/cleanup.c
index ce45d7617874..0f27c38f3ff9 100644
--- a/arch/x86/kernel/cpu/mtrr/cleanup.c
+++ b/arch/x86/kernel/cpu/mtrr/cleanup.c
@@ -689,7 +689,7 @@ int __init mtrr_cleanup(unsigned address_bits)
int index_good;
int i;
- if (!is_cpu(INTEL) || enable_mtrr_cleanup < 1)
+ if (!cpu_feature_enabled(X86_FEATURE_MTRR) || enable_mtrr_cleanup < 1)
return 0;
rdmsr(MSR_MTRRdefType, def, dummy);
@@ -886,7 +886,7 @@ int __init mtrr_trim_uncached_memory(unsigned long end_pfn)
* Make sure we only trim uncachable memory on machines that
* support the Intel MTRR architecture:
*/
- if (!is_cpu(INTEL) || disable_mtrr_trim)
+ if (!cpu_feature_enabled(X86_FEATURE_MTRR) || disable_mtrr_trim)
return 0;
rdmsr(MSR_MTRRdefType, def, dummy);
diff --git a/arch/x86/kernel/cpu/mtrr/cyrix.c b/arch/x86/kernel/cpu/mtrr/cyrix.c
index 173b9e01e623..238dad57d4d6 100644
--- a/arch/x86/kernel/cpu/mtrr/cyrix.c
+++ b/arch/x86/kernel/cpu/mtrr/cyrix.c
@@ -235,7 +235,7 @@ static void cyrix_set_arr(unsigned int reg, unsigned long base,
}
const struct mtrr_ops cyrix_mtrr_ops = {
- .vendor = X86_VENDOR_CYRIX,
+ .var_regs = 8,
.set = cyrix_set_arr,
.get = cyrix_get_arr,
.get_free_region = cyrix_get_free_region,
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 0794f3f1cc27..5d60b46187f7 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -846,7 +846,7 @@ int generic_validate_add_page(unsigned long base, unsigned long size,
* For Intel PPro stepping <= 7
* must be 4 MiB aligned and not touch 0x70000000 -> 0x7003FFFF
*/
- if (is_cpu(INTEL) && boot_cpu_data.x86 == 6 &&
+ if (mtrr_if == &generic_mtrr_ops && boot_cpu_data.x86 == 6 &&
boot_cpu_data.x86_model == 1 &&
boot_cpu_data.x86_stepping <= 7) {
if (base & ((1 << (22 - PAGE_SHIFT)) - 1)) {
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 1c19d67ddab3..46aae69d259e 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -108,14 +108,12 @@ static int have_wrcomb(void)
/* This function returns the number of variable MTRRs */
static void __init set_num_var_ranges(bool use_generic)
{
- unsigned long config = 0, dummy;
+ unsigned long config, dummy;
if (use_generic)
rdmsr(MSR_MTRRcap, config, dummy);
- else if (is_cpu(AMD) || is_cpu(HYGON))
- config = 2;
- else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
- config = 8;
+ else
+ config = mtrr_if->var_regs;
num_var_ranges = config & MTRR_CONFIG_NUM_VAR_MASK;
}
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index 02eb5871492d..a3c362d3d5bf 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -13,7 +13,7 @@
extern unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
struct mtrr_ops {
- u32 vendor;
+ u32 var_regs;
void (*set)(unsigned int reg, unsigned long base,
unsigned long size, mtrr_type type);
void (*get)(unsigned int reg, unsigned long *base,
@@ -54,8 +54,6 @@ bool get_mtrr_state(void);
extern u64 size_or_mask, size_and_mask;
extern const struct mtrr_ops *mtrr_if;
-#define is_cpu(vnd) (mtrr_if && mtrr_if->vendor == X86_VENDOR_##vnd)
-
extern unsigned int num_var_ranges;
extern u64 mtrr_tom2;
extern struct mtrr_state_type mtrr_state;
--
2.35.3
After MTRR initialization construct a memory map with cache modes from
MTRR values. This will speed up lookups via mtrr_type_lookup(),
especially in case of overlapping MTRRs.
This will be needed when switching the semantics of the "uniform"
parameter of mtrr_type_lookup() from "only covered by one MTRR" to
"memory range has a uniform cache mode", which is what the callers
really want to know. Today this information is not easily available in
case MTRRs are not sorted by base address.
The map will be built in __initdata. When memory management is up, the
map will be moved to dynamically allocated memory, in order to avoid
the need for an overly large array. The size of this array is
calculated using the number of variable MTRR registers and the needed
size for fixed entries.
Only add the map creation and expansion for now. The lookup will be
added later.
When writing new MTRR entries in the running system, rebuild the map
inside the call from mtrr_rendezvous_handler() in order to avoid nasty
race conditions with concurrent lookups.
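For a rough idea of the sizes involved (assuming the usual limits of
MTRR_NUM_FIXED_RANGES == 88 and MTRR_MAX_VAR_RANGES == 256, and a
32-byte map entry):

  worst case __initdata array: (88 + 2 * 256) * 32 bytes = 19200 bytes (~19 kB)
  runtime copy with e.g. 8 variable MTRRs: (merged fixed entries + 2 * 8 + 1)
    * 32 bytes, i.e. typically well below 1 kB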
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V3:
- new patch
V5:
- fix setting of mtrr_tom2
- change cache_map .type and .fixed to bitfields (Boris Petkov)
- use memmove() (Boris Petkov)
- a lot of comments (Boris Petkov)
- rewrite setting of merge bools (Boris Petkov)
- mark mtrr_build_map() as __init
- add pr_info() (Boris Petkov)
---
arch/x86/kernel/cpu/mtrr/generic.c | 288 +++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mtrr/mtrr.c | 10 +-
arch/x86/kernel/cpu/mtrr/mtrr.h | 3 +
3 files changed, 300 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 005f07ebb3a3..fe8238832095 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -33,6 +33,38 @@ static struct fixed_range_block fixed_range_blocks[] = {
{}
};
+struct cache_map {
+ u64 start;
+ u64 end;
+ u64 flags;
+ u64 type:8;
+ u64 fixed:1;
+};
+
+/*
+ * CACHE_MAP_MAX is the maximum number of memory ranges in cache_map, where
+ * no 2 adjacent ranges have the same cache mode (those would be merged).
+ * The number is based on the worst case:
+ * - no two adjacent fixed MTRRs share the same cache mode
+ * - one variable MTRR is spanning a huge area with mode WB
+ * - 255 variable MTRRs with mode UC all overlap with the WB MTRR, creating 2
+ * additional ranges each (result like "ababababa...aba" with a = WB, b = UC),
+ * accounting for MTRR_MAX_VAR_RANGES * 2 - 1 range entries
+ * - a TOM2 area (even with overlapping an UC MTRR can't add 2 range entries
+ * to the possible maximum, as it always starts at 4GB, thus it can't be in
+ * the middle of that MTRR, unless that MTRR starts at 0, which would remove
+ * the initial "a" from the "abababa" pattern above)
+ * The map won't contain ranges with no matching MTRR (those fall back to the
+ * default cache mode).
+ */
+#define CACHE_MAP_MAX (MTRR_NUM_FIXED_RANGES + MTRR_MAX_VAR_RANGES * 2)
+
+static struct cache_map init_cache_map[CACHE_MAP_MAX] __initdata;
+static struct cache_map *cache_map __refdata = init_cache_map;
+static unsigned int cache_map_size = CACHE_MAP_MAX;
+static unsigned int cache_map_n;
+static unsigned int cache_map_fixed;
+
static unsigned long smp_changes_mask;
static int mtrr_state_set;
u64 mtrr_tom2;
@@ -78,6 +110,20 @@ static u64 get_mtrr_size(u64 mask)
return size;
}
+static u8 get_var_mtrr_state(unsigned int reg, u64 *start, u64 *size)
+{
+ struct mtrr_var_range *mtrr = mtrr_state.var_ranges + reg;
+
+ if (!(mtrr->mask_lo & MTRR_MASK_VALID))
+ return MTRR_TYPE_INVALID;
+
+ *start = (((u64)mtrr->base_hi) << 32) + (mtrr->base_lo & PAGE_MASK);
+ *size = get_mtrr_size((((u64)mtrr->mask_hi) << 32) +
+ (mtrr->mask_lo & PAGE_MASK));
+
+ return mtrr->base_lo & MTRR_BASE_TYPE_MASK;
+}
+
static u8 get_effective_type(u8 type1, u8 type2)
{
if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
@@ -242,6 +288,244 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
return mtrr_state.def_type;
}
+static void rm_map_entry_at(int idx)
+{
+ cache_map_n--;
+ memmove(cache_map + idx, cache_map + idx + 1,
+ sizeof(*cache_map) * (cache_map_n - idx));
+}
+
+/*
+ * Add an entry into cache_map at a specific index. Merges adjacent entries if
+ * appropriate. Return the number of merges for correcting the scan index
+ * (this is needed as merging will reduce the number of entries, which will
+ * result in skipping entries in future iterations if the scan index isn't
+ * corrected).
+ * Note that the corrected index can never go below -1 (resulting in being 0 in
+ * the next scan iteration), as "2" is returned only if the current index is
+ * larger than zero.
+ */
+static int add_map_entry_at(u64 start, u64 end, u8 type, int idx)
+{
+ bool merge_prev = false, merge_next = false;
+
+ if (start >= end)
+ return 0;
+
+ if (idx > 0) {
+ struct cache_map *prev = cache_map + idx - 1;
+
+ if (!prev->fixed && start == prev->end && type == prev->type)
+ merge_prev = true;
+ }
+
+ if (idx < cache_map_n) {
+ struct cache_map *next = cache_map + idx;
+
+ if (!next->fixed && end == next->start && type == next->type)
+ merge_next = true;
+ }
+
+ if (merge_prev && merge_next) {
+ cache_map[idx - 1].end = cache_map[idx].end;
+ rm_map_entry_at(idx);
+ return 2;
+ }
+ if (merge_prev) {
+ cache_map[idx - 1].end = end;
+ return 1;
+ }
+ if (merge_next) {
+ cache_map[idx].start = start;
+ return 1;
+ }
+
+ /* Sanity check: the array should NEVER be too small! */
+ if (cache_map_n == cache_map_size) {
+ WARN(1, "MTRR cache mode memory map exhausted!\n");
+ cache_map_n = cache_map_fixed;
+ return 0;
+ }
+
+ memmove(cache_map + idx + 1, cache_map + idx,
+ sizeof(*cache_map) * (cache_map_n - idx));
+
+ cache_map[idx].start = start;
+ cache_map[idx].end = end;
+ cache_map[idx].type = type;
+ cache_map[idx].fixed = 0;
+ cache_map_n++;
+
+ return 0;
+}
+
+/* Clear a part of an entry. Return 1 if start of entry is still valid. */
+static int clr_map_range_at(u64 start, u64 end, int idx)
+{
+ int ret = start != cache_map[idx].start;
+ u64 tmp;
+
+ if (start == cache_map[idx].start && end == cache_map[idx].end) {
+ rm_map_entry_at(idx);
+ } else if (start == cache_map[idx].start) {
+ cache_map[idx].start = end;
+ } else if (end == cache_map[idx].end) {
+ cache_map[idx].end = start;
+ } else {
+ tmp = cache_map[idx].end;
+ cache_map[idx].end = start;
+ add_map_entry_at(end, tmp, cache_map[idx].type, idx + 1);
+ }
+
+ return ret;
+}
+
+/*
+ * Add MTRR to the map. The current map is scanned and each part of the MTRR
+ * either overlapping with an existing entry or with a hole in the map is
+ * handled separately.
+ */
+static void add_map_entry(u64 start, u64 end, u8 type)
+{
+ u8 new_type, old_type;
+ u64 tmp;
+ int i;
+
+ for (i = 0; i < cache_map_n && start < end; i++) {
+ if (start >= cache_map[i].end)
+ continue;
+
+ if (start < cache_map[i].start) {
+ /* Region start has no overlap. */
+ tmp = min(end, cache_map[i].start);
+ i -= add_map_entry_at(start, tmp, type, i);
+ start = tmp;
+ continue;
+ }
+
+ new_type = get_effective_type(type, cache_map[i].type);
+ old_type = cache_map[i].type;
+
+ if (cache_map[i].fixed || new_type == old_type) {
+ /* Cut off start of new entry. */
+ start = cache_map[i].end;
+ continue;
+ }
+
+ /* Handle only overlapping part of region. */
+ tmp = min(end, cache_map[i].end);
+ i += clr_map_range_at(start, tmp, i);
+ i -= add_map_entry_at(start, tmp, new_type, i);
+ start = tmp;
+ }
+
+ /* Add rest of region after last map entry (rest might be empty). */
+ add_map_entry_at(start, end, type, i);
+}
+
+/* Add variable MTRRs to cache map. */
+static void map_add_var(void)
+{
+ u64 start, size;
+ unsigned int i;
+ u8 type;
+
+ /*
+ * Add AMD TOM2 MTRR. Can't be added in mtrr_build_map(), as it needs
+ * to be added again when rebuilding the map due to potentially having
+ * moved as a result of variable MTRRs for memory below 4GB.
+ */
+ if (mtrr_tom2) {
+ add_map_entry(BIT_ULL(32), mtrr_tom2, MTRR_TYPE_WRBACK);
+ cache_map[cache_map_n - 1].fixed = 1;
+ }
+
+ for (i = 0; i < num_var_ranges; i++) {
+ type = get_var_mtrr_state(i, &start, &size);
+ if (type != MTRR_TYPE_INVALID)
+ add_map_entry(start, start + size, type);
+ }
+}
+
+/* Rebuild map by replacing variable entries. */
+static void rebuild_map(void)
+{
+ cache_map_n = cache_map_fixed;
+
+ map_add_var();
+}
+
+static unsigned int __init get_cache_map_size(void)
+{
+ return cache_map_fixed + 2 * num_var_ranges + (mtrr_tom2 != 0);
+}
+
+/* Build the cache_map containing the cache modes per memory range. */
+void __init mtrr_build_map(void)
+{
+ u64 start, end, size;
+ unsigned int i;
+ u8 type;
+
+ if (!mtrr_state.enabled)
+ return;
+
+ /* Add fixed MTRRs, optimize for adjacent entries with same type. */
+ if (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED) {
+ /*
+ * Start with 64k size fixed entries, preset 1st one (hence the
+ * loop below is starting with index 1).
+ */
+ start = 0;
+ end = size = 0x10000;
+ type = mtrr_state.fixed_ranges[0];
+
+ for (i = 1; i < MTRR_NUM_FIXED_RANGES; i++) {
+ /* 8 64k entries, then 16 16k ones, rest 4k. */
+ if (i == 8 || i == 24)
+ size >>= 2;
+
+ if (mtrr_state.fixed_ranges[i] != type) {
+ add_map_entry(start, end, type);
+ start = end;
+ type = mtrr_state.fixed_ranges[i];
+ }
+ end += size;
+ }
+ add_map_entry(start, end, type);
+ }
+
+ /* Mark fixed, they take precedence. */
+ for (i = 0; i < cache_map_n; i++)
+ cache_map[i].fixed = 1;
+ cache_map_fixed = cache_map_n;
+
+ map_add_var();
+
+ pr_info("MTRR map: %u entries (%u fixed + %u variable; max %u), built from %u variable MTRRs\n",
+ cache_map_n, cache_map_fixed, cache_map_n - cache_map_fixed,
+ get_cache_map_size(), num_var_ranges + (mtrr_tom2 != 0));
+}
+
+/* Copy the cache_map from __initdata memory to dynamically allocated one. */
+void __init mtrr_copy_map(void)
+{
+ unsigned int new_size = get_cache_map_size();
+
+ if (!mtrr_state.enabled || !new_size) {
+ cache_map = NULL;
+ return;
+ }
+
+ mutex_lock(&mtrr_mutex);
+
+ cache_map = kcalloc(new_size, sizeof(*cache_map), GFP_KERNEL);
+ memmove(cache_map, init_cache_map, cache_map_n * sizeof(*cache_map));
+ cache_map_size = new_size;
+
+ mutex_unlock(&mtrr_mutex);
+}
+
/**
* mtrr_overwrite_state - set static MTRR state
*
@@ -834,6 +1118,10 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
cache_enable();
local_irq_restore(flags);
+
+ /* On the first CPU rebuild the cache mode memory map. */
+ if (smp_processor_id() == cpumask_first(cpu_online_mask))
+ rebuild_map();
}
int generic_validate_add_page(unsigned long base, unsigned long size,
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 76f5b5e1128b..d44e4c2670cc 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -65,7 +65,7 @@ static bool mtrr_enabled(void)
}
unsigned int mtrr_usage_table[MTRR_MAX_VAR_RANGES];
-static DEFINE_MUTEX(mtrr_mutex);
+DEFINE_MUTEX(mtrr_mutex);
u64 size_or_mask, size_and_mask;
@@ -660,6 +660,7 @@ void __init mtrr_bp_init(void)
/* Software overwrite of MTRR state, only for generic case. */
mtrr_calc_physbits(true);
init_table();
+ mtrr_build_map();
pr_info("MTRRs set to read-only\n");
return;
@@ -697,6 +698,7 @@ void __init mtrr_bp_init(void)
if (get_mtrr_state()) {
memory_caching_control |= CACHE_MTRR;
changed_by_mtrr_cleanup = mtrr_cleanup(phys_addr);
+ mtrr_build_map();
} else {
mtrr_if = NULL;
why = "by BIOS";
@@ -725,6 +727,12 @@ void mtrr_save_state(void)
static int __init mtrr_init_finialize(void)
{
+ /*
+ * Map might exist if mtrr_overwrite_state() has been called or if
+ * mtrr_enabled() returns true.
+ */
+ mtrr_copy_map();
+
if (!mtrr_enabled())
return 0;
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.h b/arch/x86/kernel/cpu/mtrr/mtrr.h
index a3c362d3d5bf..6246a1d8650b 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.h
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.h
@@ -53,6 +53,7 @@ bool get_mtrr_state(void);
extern u64 size_or_mask, size_and_mask;
extern const struct mtrr_ops *mtrr_if;
+extern struct mutex mtrr_mutex;
extern unsigned int num_var_ranges;
extern u64 mtrr_tom2;
@@ -61,6 +62,8 @@ extern struct mtrr_state_type mtrr_state;
void mtrr_state_warn(void);
const char *mtrr_attrib_to_str(int x);
void mtrr_wrmsr(unsigned, unsigned, unsigned);
+void mtrr_build_map(void);
+void mtrr_copy_map(void);
/* CPU specific mtrr_ops vectors. */
extern const struct mtrr_ops amd_mtrr_ops;
--
2.35.3
The mtrr_value[] array is a static variable, which is used only in a
few configurations. Consuming 6kB is ridiculous for this case,
especially as the array doesn't need to be that large and it can easily
be allocated dynamically.
The "few configurations" are all 32-bit ones, so put the code inside a
CONFIG_X86_32 #ifdef.
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V5:
- check for allocation failure (Kai Huang, Boris Petkov)
- add #ifdef
---
arch/x86/kernel/cpu/mtrr/mtrr.c | 21 +++++++++++++++------
1 file changed, 15 insertions(+), 6 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 4fa3d0f94f39..76f5b5e1128b 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -560,8 +560,10 @@ int arch_phys_wc_index(int handle)
}
EXPORT_SYMBOL_GPL(arch_phys_wc_index);
-/* The suspend/resume methods are only for CPU without MTRR. CPU using generic
- * MTRR driver doesn't require this
+#ifdef CONFIG_X86_32
+/*
+ * The suspend/resume methods are only for CPUs without MTRR. CPUs using generic
+ * MTRR driver don't require this.
*/
struct mtrr_value {
mtrr_type ltype;
@@ -569,12 +571,15 @@ struct mtrr_value {
unsigned long lsize;
};
-static struct mtrr_value mtrr_value[MTRR_MAX_VAR_RANGES];
+static struct mtrr_value *mtrr_value;
static int mtrr_save(void)
{
int i;
+ if (!mtrr_value)
+ return -ENOMEM;
+
for (i = 0; i < num_var_ranges; i++) {
mtrr_if->get(i, &mtrr_value[i].lbase,
&mtrr_value[i].lsize,
@@ -596,12 +601,11 @@ static void mtrr_restore(void)
}
}
-
-
static struct syscore_ops mtrr_syscore_ops = {
.suspend = mtrr_save,
.resume = mtrr_restore,
};
+#endif /* CONFIG_X86_32 */
int __initdata changed_by_mtrr_cleanup;
@@ -730,15 +734,20 @@ static int __init mtrr_init_finialize(void)
return 0;
}
+#ifdef CONFIG_X86_32
+ mtrr_value = kcalloc(num_var_ranges, sizeof(*mtrr_value), GFP_KERNEL);
+
/*
* The CPU has no MTRR and seems to not support SMP. They have
* specific drivers, we use a tricky method to support
- * suspend/resume for them.
+ * suspend/resume for them. In case above allocation failed we can't
+ * support suspend/resume (handled in mtrr_save()).
*
* TBD: is there any system with such CPU which supports
* suspend/resume? If no, we should remove the code.
*/
register_syscore_ops(&mtrr_syscore_ops);
+#endif
return 0;
}
--
2.35.3
Add a service function for obtaining the effective cache mode of
overlapping MTRR registers.
Make use of that function in check_type_overlap().
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V3:
- new patch
---
arch/x86/kernel/cpu/mtrr/generic.c | 39 +++++++++++++++---------------
1 file changed, 19 insertions(+), 20 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 5d60b46187f7..005f07ebb3a3 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -78,31 +78,30 @@ static u64 get_mtrr_size(u64 mask)
return size;
}
+static u8 get_effective_type(u8 type1, u8 type2)
+{
+ if (type1 == MTRR_TYPE_UNCACHABLE || type2 == MTRR_TYPE_UNCACHABLE)
+ return MTRR_TYPE_UNCACHABLE;
+
+ if ((type1 == MTRR_TYPE_WRBACK && type2 == MTRR_TYPE_WRTHROUGH) ||
+ (type1 == MTRR_TYPE_WRTHROUGH && type2 == MTRR_TYPE_WRBACK))
+ return MTRR_TYPE_WRTHROUGH;
+
+ if (type1 != type2)
+ return MTRR_TYPE_UNCACHABLE;
+
+ return type1;
+}
+
/*
* Check and return the effective type for MTRR-MTRR type overlap.
- * Returns 1 if the effective type is UNCACHEABLE, else returns 0
+ * Returns true if the effective type is UNCACHEABLE, else returns false
*/
-static int check_type_overlap(u8 *prev, u8 *curr)
+static bool check_type_overlap(u8 *prev, u8 *curr)
{
- if (*prev == MTRR_TYPE_UNCACHABLE || *curr == MTRR_TYPE_UNCACHABLE) {
- *prev = MTRR_TYPE_UNCACHABLE;
- *curr = MTRR_TYPE_UNCACHABLE;
- return 1;
- }
-
- if ((*prev == MTRR_TYPE_WRBACK && *curr == MTRR_TYPE_WRTHROUGH) ||
- (*prev == MTRR_TYPE_WRTHROUGH && *curr == MTRR_TYPE_WRBACK)) {
- *prev = MTRR_TYPE_WRTHROUGH;
- *curr = MTRR_TYPE_WRTHROUGH;
- }
+ *prev = *curr = get_effective_type(*curr, *prev);
- if (*prev != *curr) {
- *prev = MTRR_TYPE_UNCACHABLE;
- *curr = MTRR_TYPE_UNCACHABLE;
- return 1;
- }
-
- return 0;
+ return *prev == MTRR_TYPE_UNCACHABLE;
}
/**
--
2.35.3
Today pud_set_huge() and pmd_set_huge() test whether the MTRR type is
WB or INVALID after calling mtrr_type_lookup(). Those tests can be
dropped, as the only reason not to use a large mapping would be uniform
being 0. Any MTRR type can be accepted as long as it applies to the
whole memory range covered by the mapping, as the only alternative
would be to map the same region with smaller pages instead, using the
same PAT type as for the large mapping.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V3:
- adapt comment for pud_set_huge()
---
arch/x86/mm/pgtable.c | 24 ++++++++----------------
1 file changed, 8 insertions(+), 16 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..15a8009a4480 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -702,14 +702,8 @@ void p4d_clear_huge(p4d_t *p4d)
* pud_set_huge - setup kernel PUD mapping
*
* MTRRs can override PAT memory types with 4KiB granularity. Therefore, this
- * function sets up a huge page only if any of the following conditions are met:
- *
- * - MTRRs are disabled, or
- *
- * - MTRRs are enabled and the range is completely covered by a single MTRR, or
- *
- * - MTRRs are enabled and the corresponding MTRR memory type is WB, which
- * has no effect on the requested PAT memory type.
+ * function sets up a huge page only if the complete range has the same MTRR
+ * caching mode.
*
* Callers should try to decrease page size (1GB -> 2MB -> 4K) if the bigger
* page mapping attempt fails.
@@ -718,11 +712,10 @@ void p4d_clear_huge(p4d_t *p4d)
*/
int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
{
- u8 mtrr, uniform;
+ u8 uniform;
- mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
- if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
- (mtrr != MTRR_TYPE_WRBACK))
+ mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
+ if (!uniform)
return 0;
/* Bail out if we are we on a populated non-leaf entry: */
@@ -745,11 +738,10 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
*/
int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
{
- u8 mtrr, uniform;
+ u8 uniform;
- mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
- if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
- (mtrr != MTRR_TYPE_WRBACK)) {
+ mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
+ if (!uniform) {
pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
__func__, addr, addr + PMD_SIZE);
return 0;
--
2.35.3
mtrr_type_lookup() should always return a valid memory type. In case
there is no information available, it should return the default UC.
This will remove the last case where mtrr_type_lookup() can return
MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h.
Note that removing the MTRR_TYPE_INVALID #define from that header
could break user code, so it has to stay.
At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR
case should set uniform to 1, as if the memory range were not covered
by any MTRR at all.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V2:
- always set uniform
- set uniform to 1 in case of disabled MTRRs (Linus Torvalds)
V3:
- adjust include/uapi/asm/mtrr.h comment
---
arch/x86/include/asm/mtrr.h | 7 +++++--
arch/x86/include/uapi/asm/mtrr.h | 6 +++---
arch/x86/kernel/cpu/mtrr/generic.c | 4 ++--
3 files changed, 10 insertions(+), 7 deletions(-)
diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index 6decb18e22ed..b17a66da1237 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -77,9 +77,12 @@ static inline void mtrr_overwrite_state(struct mtrr_var_range *var,
static inline u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform)
{
/*
- * Return no-MTRRs:
+ * Return the default MTRR type, without any known other types in
+ * that range.
*/
- return MTRR_TYPE_INVALID;
+ *uniform = 1;
+
+ return MTRR_TYPE_UNCACHABLE;
}
#define mtrr_save_fixed_ranges(arg) do {} while (0)
#define mtrr_save_state() do {} while (0)
diff --git a/arch/x86/include/uapi/asm/mtrr.h b/arch/x86/include/uapi/asm/mtrr.h
index 376563f2bac1..4aa05c2ffa78 100644
--- a/arch/x86/include/uapi/asm/mtrr.h
+++ b/arch/x86/include/uapi/asm/mtrr.h
@@ -115,9 +115,9 @@ struct mtrr_state_type {
#define MTRR_NUM_TYPES 7
/*
- * Invalid MTRR memory type. mtrr_type_lookup() returns this value when
- * MTRRs are disabled. Note, this value is allocated from the reserved
- * values (0x7-0xff) of the MTRR memory types.
+ * Invalid MTRR memory type. No longer used outside of MTRR code.
+ * Note, this value is allocated from the reserved values (0x7-0xff) of
+ * the MTRR memory types.
*/
#define MTRR_TYPE_INVALID 0xff
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 5d502b926dd8..178253d117c6 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -463,13 +463,13 @@ u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
if (!mtrr_state_set) {
*uniform = 0; /* Uniformity is unknown. */
- return MTRR_TYPE_INVALID;
+ return MTRR_TYPE_UNCACHABLE;
}
*uniform = 1;
if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED))
- return MTRR_TYPE_INVALID;
+ return MTRR_TYPE_UNCACHABLE;
for (i = 0; i < cache_map_n && start < end; i++) {
if (start >= cache_map[i].end)
--
2.35.3
Instead of crawling through the MTRR register state, use the new
cache_map for looking up the cache type(s) of a memory region.
This now allows setting the uniform parameter according to the
uniformity of the region's cache mode, instead of setting it only if
the complete region is mapped by a single MTRR. This even includes the
region covered by the fixed MTRR registers.
Make sure uniform is always set.
Signed-off-by: Juergen Gross <[email protected]>
Tested-by: Michael Kelley <[email protected]>
---
V3:
- new patch
V3.1:
- fix type_merge() (Michael Kelley)
V4:
- fix type_merge() again (Michael Kelley)
---
arch/x86/kernel/cpu/mtrr/generic.c | 228 ++++-------------------------
1 file changed, 32 insertions(+), 196 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index fe8238832095..5d502b926dd8 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -139,155 +139,6 @@ static u8 get_effective_type(u8 type1, u8 type2)
return type1;
}
-/*
- * Check and return the effective type for MTRR-MTRR type overlap.
- * Returns true if the effective type is UNCACHEABLE, else returns false
- */
-static bool check_type_overlap(u8 *prev, u8 *curr)
-{
- *prev = *curr = get_effective_type(*curr, *prev);
-
- return *prev == MTRR_TYPE_UNCACHABLE;
-}
-
-/**
- * mtrr_type_lookup_fixed - look up memory type in MTRR fixed entries
- *
- * Return the MTRR fixed memory type of 'start'.
- *
- * MTRR fixed entries are divided into the following ways:
- * 0x00000 - 0x7FFFF : This range is divided into eight 64KB sub-ranges
- * 0x80000 - 0xBFFFF : This range is divided into sixteen 16KB sub-ranges
- * 0xC0000 - 0xFFFFF : This range is divided into sixty-four 4KB sub-ranges
- *
- * Return Values:
- * MTRR_TYPE_(type) - Matched memory type
- * MTRR_TYPE_INVALID - Unmatched
- */
-static u8 mtrr_type_lookup_fixed(u64 start, u64 end)
-{
- int idx;
-
- if (start >= 0x100000)
- return MTRR_TYPE_INVALID;
-
- /* 0x0 - 0x7FFFF */
- if (start < 0x80000) {
- idx = 0;
- idx += (start >> 16);
- return mtrr_state.fixed_ranges[idx];
- /* 0x80000 - 0xBFFFF */
- } else if (start < 0xC0000) {
- idx = 1 * 8;
- idx += ((start - 0x80000) >> 14);
- return mtrr_state.fixed_ranges[idx];
- }
-
- /* 0xC0000 - 0xFFFFF */
- idx = 3 * 8;
- idx += ((start - 0xC0000) >> 12);
- return mtrr_state.fixed_ranges[idx];
-}
-
-/**
- * mtrr_type_lookup_variable - look up memory type in MTRR variable entries
- *
- * Return Value:
- * MTRR_TYPE_(type) - Matched memory type or default memory type (unmatched)
- *
- * Output Arguments:
- * repeat - Set to 1 when [start:end] spanned across MTRR range and type
- * returned corresponds only to [start:*partial_end]. Caller has
- * to lookup again for [*partial_end:end].
- *
- * uniform - Set to 1 when an MTRR covers the region uniformly, i.e. the
- * region is fully covered by a single MTRR entry or the default
- * type.
- */
-static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
- int *repeat, u8 *uniform)
-{
- int i;
- u64 base, mask;
- u8 prev_match, curr_match;
-
- *repeat = 0;
- *uniform = 1;
-
- prev_match = MTRR_TYPE_INVALID;
- for (i = 0; i < num_var_ranges; ++i) {
- unsigned short start_state, end_state, inclusive;
-
- if (!(mtrr_state.var_ranges[i].mask_lo & MTRR_MASK_VALID))
- continue;
-
- base = (((u64)mtrr_state.var_ranges[i].base_hi) << 32) +
- (mtrr_state.var_ranges[i].base_lo & PAGE_MASK);
- mask = (((u64)mtrr_state.var_ranges[i].mask_hi) << 32) +
- (mtrr_state.var_ranges[i].mask_lo & PAGE_MASK);
-
- start_state = ((start & mask) == (base & mask));
- end_state = ((end & mask) == (base & mask));
- inclusive = ((start < base) && (end > base));
-
- if ((start_state != end_state) || inclusive) {
- /*
- * We have start:end spanning across an MTRR.
- * We split the region into either
- *
- * - start_state:1
- * (start:mtrr_end)(mtrr_end:end)
- * - end_state:1
- * (start:mtrr_start)(mtrr_start:end)
- * - inclusive:1
- * (start:mtrr_start)(mtrr_start:mtrr_end)(mtrr_end:end)
- *
- * depending on kind of overlap.
- *
- * Return the type of the first region and a pointer
- * to the start of next region so that caller will be
- * advised to lookup again after having adjusted start
- * and end.
- *
- * Note: This way we handle overlaps with multiple
- * entries and the default type properly.
- */
- if (start_state)
- *partial_end = base + get_mtrr_size(mask);
- else
- *partial_end = base;
-
- if (unlikely(*partial_end <= start)) {
- WARN_ON(1);
- *partial_end = start + PAGE_SIZE;
- }
-
- end = *partial_end - 1; /* end is inclusive */
- *repeat = 1;
- *uniform = 0;
- }
-
- if ((start & mask) != (base & mask))
- continue;
-
- curr_match = mtrr_state.var_ranges[i].base_lo &
- MTRR_BASE_TYPE_MASK;
- if (prev_match == MTRR_TYPE_INVALID) {
- prev_match = curr_match;
- continue;
- }
-
- *uniform = 0;
- if (check_type_overlap(&prev_match, &curr_match))
- return curr_match;
- }
-
- if (prev_match != MTRR_TYPE_INVALID)
- return prev_match;
-
- return mtrr_state.def_type;
-}
-
static void rm_map_entry_at(int idx)
{
cache_map_n--;
@@ -580,6 +431,20 @@ void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
mtrr_state_set = 1;
}
+static u8 type_merge(u8 type, u8 new_type, u8 *uniform)
+{
+ u8 effective_type;
+
+ if (type == MTRR_TYPE_INVALID)
+ return new_type;
+
+ effective_type = get_effective_type(type, new_type);
+ if (type != effective_type)
+ *uniform = 0;
+
+ return effective_type;
+}
+
/**
* mtrr_type_lookup - look up memory type in MTRR
*
@@ -588,66 +453,37 @@ void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
* MTRR_TYPE_INVALID - MTRR is disabled
*
* Output Argument:
- * uniform - Set to 1 when an MTRR covers the region uniformly, i.e. the
- * region is fully covered by a single MTRR entry or the default
- * type.
+ * uniform - Set to 1 when the returned MTRR type is valid for the whole
+ * region, set to 0 else.
*/
u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
{
- u8 type, prev_type, is_uniform = 1, dummy;
- int repeat;
- u64 partial_end;
-
- /* Make end inclusive instead of exclusive */
- end--;
+ u8 type = MTRR_TYPE_INVALID;
+ unsigned int i;
- if (!mtrr_state_set)
+ if (!mtrr_state_set) {
+ *uniform = 0; /* Uniformity is unknown. */
return MTRR_TYPE_INVALID;
+ }
+
+ *uniform = 1;
if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED))
return MTRR_TYPE_INVALID;
- /*
- * Look up the fixed ranges first, which take priority over
- * the variable ranges.
- */
- if ((start < 0x100000) &&
- (mtrr_state.have_fixed) &&
- (mtrr_state.enabled & MTRR_STATE_MTRR_FIXED_ENABLED)) {
- is_uniform = 0;
- type = mtrr_type_lookup_fixed(start, end);
- goto out;
- }
-
- /*
- * Look up the variable ranges. Look of multiple ranges matching
- * this address and pick type as per MTRR precedence.
- */
- type = mtrr_type_lookup_variable(start, end, &partial_end,
- &repeat, &is_uniform);
+ for (i = 0; i < cache_map_n && start < end; i++) {
+ if (start >= cache_map[i].end)
+ continue;
+ if (start < cache_map[i].start)
+ type = type_merge(type, mtrr_state.def_type, uniform);
+ type = type_merge(type, cache_map[i].type, uniform);
- /*
- * Common path is with repeat = 0.
- * However, we can have cases where [start:end] spans across some
- * MTRR ranges and/or the default type. Do repeated lookups for
- * that case here.
- */
- while (repeat) {
- prev_type = type;
- start = partial_end;
- is_uniform = 0;
- type = mtrr_type_lookup_variable(start, end, &partial_end,
- &repeat, &dummy);
-
- if (check_type_overlap(&prev_type, &type))
- goto out;
+ start = cache_map[i].end;
}
- if (mtrr_tom2 && (start >= (1ULL<<32)) && (end < mtrr_tom2))
- type = MTRR_TYPE_WRBACK;
+ if (start < end)
+ type = type_merge(type, mtrr_state.def_type, uniform);
-out:
- *uniform = is_uniform;
return type;
}
--
2.35.3
mtrr_centaur_report_mcr() isn't used by anyone, so it can be removed.
Signed-off-by: Juergen Gross <[email protected]>
---
V5:
- new patch
---
arch/x86/include/asm/mtrr.h | 4 ----
arch/x86/kernel/cpu/mtrr/centaur.c | 9 ---------
2 files changed, 13 deletions(-)
diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index b17a66da1237..3aced3568e2b 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -60,7 +60,6 @@ extern int mtrr_add_page(unsigned long base, unsigned long size,
unsigned int type, bool increment);
extern int mtrr_del(int reg, unsigned long base, unsigned long size);
extern int mtrr_del_page(int reg, unsigned long base, unsigned long size);
-extern void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi);
extern void mtrr_bp_restore(void);
extern int mtrr_trim_uncached_memory(unsigned long end_pfn);
extern int amd_special_default_mtrr(void);
@@ -108,9 +107,6 @@ static inline int mtrr_trim_uncached_memory(unsigned long end_pfn)
{
return 0;
}
-static inline void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi)
-{
-}
#define mtrr_bp_init() do {} while (0)
#define mtrr_bp_restore() do {} while (0)
#define mtrr_disable() do {} while (0)
diff --git a/arch/x86/kernel/cpu/mtrr/centaur.c b/arch/x86/kernel/cpu/mtrr/centaur.c
index 4466ddeb0125..6f6c3ae92943 100644
--- a/arch/x86/kernel/cpu/mtrr/centaur.c
+++ b/arch/x86/kernel/cpu/mtrr/centaur.c
@@ -45,15 +45,6 @@ centaur_get_free_region(unsigned long base, unsigned long size, int replace_reg)
return -ENOSPC;
}
-/*
- * Report boot time MCR setups
- */
-void mtrr_centaur_report_mcr(int mcr, u32 lo, u32 hi)
-{
- centaur_mcr[mcr].low = lo;
- centaur_mcr[mcr].high = hi;
-}
-
static void
centaur_get_mcr(unsigned int reg, unsigned long *base,
unsigned long *size, mtrr_type * type)
--
2.35.3
From: Juergen Gross <[email protected]> Sent: Friday, March 31, 2023 11:37 PM
>
> In order to avoid mappings using the UC- cache attribute, set the
> MTRR state to use WB caching as the default.
>
> This is needed in order to cope with the fact that PAT is enabled,
> while MTRRs are not supported by the hypervisor.
>
> Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
> Signed-off-by: Juergen Gross <[email protected]>
> Tested-by: Michael Kelley <[email protected]>
I've tested the full v5 series in a Hyper-V SEV-SNP guest with vTOM. All looks
good. The following output appears in dmesg, which I think matches expectations:
[ 0.004000] MTRR map: 0 entries (0 fixed + 0 variable; max 0), built from 0 variable MTRRs
[ 0.004000] MTRRs set to read-only
[ 0.004000] x86/PAT: Configuration [0-7]: WB WC UC- UC WB WP UC- WT
The page attributes listed in /sys/kernel/debug/x86/pat_memtype_list are
"write-back" in the expected cases, and the "mtrr" x86 feature no longer appears
in the "flags" output of "lscpu" or /proc/cpuinfo. /proc/mtrr does not exist, again
as expected.
> ---
> V2:
> - new patch
> ---
> arch/x86/kernel/cpu/mshyperv.c | 4 ++++
> 1 file changed, 4 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index f36dc2f796c5..0a6cc3cf8919 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -34,6 +34,7 @@
> #include <clocksource/hyperv_timer.h>
> #include <asm/numa.h>
> #include <asm/coco.h>
> +#include <asm/mtrr.h>
>
> /* Is Linux running as the root partition? */
> bool hv_root_partition;
> @@ -408,6 +409,9 @@ static void __init ms_hyperv_init_platform(void)
> #ifdef CONFIG_SWIOTLB
> swiotlb_unencrypted_base =
> ms_hyperv.shared_gpa_boundary;
> #endif
> +
> + /* Set WB as the default cache mode. */
> + mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
The placement of this call needs to change due to a patch set I submitted and
that Boris has accepted into the 'tip' tree for the 6.4 merge window (so it's now
in linux-next). The call should be placed in the function hv_vtom_init() in the
source file arch/x86/hyperv/ivm.c. Anywhere in hv_vtom_init() is fine -- I would
suggest at the end.
Minor request: In a v6 of this patch set, please include me on all the patches in
the series so I can easily keep track of how it is progressing.
Michael
> }
> /* Isolation VMs are unenlightened SEV-based VMs, thus this check: */
> if (IS_ENABLED(CONFIG_AMD_MEM_ENCRYPT)) {
> --
> 2.35.3
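Concretely, the placement Michael suggests above would look roughly like this
(only the added call is shown, the rest of hv_vtom_init() is unrelated vTOM
setup; this is an illustration, not the actual v6 patch):

        /* arch/x86/hyperv/ivm.c */
        void __init hv_vtom_init(void)
        {
                /* ... existing vTOM initialization ... */

                /* Set WB as the default cache mode. */
                mtrr_overwrite_state(NULL, 0, MTRR_TYPE_WRBACK);
        }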
On Sat, 2023-04-01 at 08:36 +0200, Juergen Gross wrote:
> +/**
> + * mtrr_overwrite_state - set static MTRR state
> + *
> + * Used to set MTRR state via different means (e.g. with data obtained from
> + * a hypervisor).
> + * Is allowed only for special cases when running virtualized. Must be called
> + * from the x86_init.hyper.init_platform() hook.
> + */
> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> + mtrr_type def_type)
> +{
> + unsigned int i;
> +
> + /* Only allowed to be called once before mtrr_bp_init(). */
> + if (WARN_ON(mtrr_state_set))
> + return;
> +
> + /* Only allowed when running virtualized. */
> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> + return;
> +
> + /*
> + * Only allowed for special virtualization cases:
> + * - when running as SEV-SNP guest
> + * - when running as Hyper-V isolated guest
> + * - when running as Xen PV guest
> + * - when running as TSX guest
^
TDX guest
> + */
> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
> + !hv_is_isolation_supported() &&
> + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> + return;
> +
> + /* Disable MTRR in order to disable MTRR modifications. */
> + setup_clear_cpu_cap(X86_FEATURE_MTRR);
> +
> + if (var) {
> + if (num_var > MTRR_MAX_VAR_RANGES) {
> + pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
> + num_var);
> + num_var = MTRR_MAX_VAR_RANGES;
> + }
> + for (i = 0; i < num_var; i++)
> + mtrr_state.var_ranges[i] = var[i];
> + num_var_ranges = num_var;
> + }
> +
> + mtrr_state.def_type = def_type;
> + mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
> +
> + mtrr_state_set = 1;
> +}
> +
> /**
> * mtrr_type_lookup - look up memory type in MTRR
> *
> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> index 1beb38f7a7a3..1c19d67ddab3 100644
> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
> const char *why = "(not available)";
> unsigned int phys_addr;
>
> + if (!generic_mtrrs && mtrr_state.enabled) {
> + /* Software overwrite of MTRR state, only for generic case. */
^
!generic case?
> + mtrr_calc_physbits(true);
> + init_table();
> + pr_info("MTRRs set to read-only\n");
> +
> + return;
> + }
> +
> phys_addr = mtrr_calc_physbits(generic_mtrrs);
On 03.04.23 04:25, Huang, Kai wrote:
> On Sat, 2023-04-01 at 08:36 +0200, Juergen Gross wrote:
>> +/**
>> + * mtrr_overwrite_state - set static MTRR state
>> + *
>> + * Used to set MTRR state via different means (e.g. with data obtained from
>> + * a hypervisor).
>> + * Is allowed only for special cases when running virtualized. Must be called
>> + * from the x86_init.hyper.init_platform() hook.
>> + */
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type def_type)
>> +{
>> + unsigned int i;
>> +
>> + /* Only allowed to be called once before mtrr_bp_init(). */
>> + if (WARN_ON(mtrr_state_set))
>> + return;
>> +
>> + /* Only allowed when running virtualized. */
>> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
>> + return;
>> +
>> + /*
>> + * Only allowed for special virtualization cases:
>> + * - when running as SEV-SNP guest
>> + * - when running as Hyper-V isolated guest
>> + * - when running as Xen PV guest
>> + * - when running as TSX guest
> ^
> TDX guest
Thanks.
>
>> + */
>> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
>> + !hv_is_isolation_supported() &&
>> + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
>> + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>> + return;
>> +
>> + /* Disable MTRR in order to disable MTRR modifications. */
>> + setup_clear_cpu_cap(X86_FEATURE_MTRR);
>> +
>> + if (var) {
>> + if (num_var > MTRR_MAX_VAR_RANGES) {
>> + pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
>> + num_var);
>> + num_var = MTRR_MAX_VAR_RANGES;
>> + }
>> + for (i = 0; i < num_var; i++)
>> + mtrr_state.var_ranges[i] = var[i];
>> + num_var_ranges = num_var;
>> + }
>> +
>> + mtrr_state.def_type = def_type;
>> + mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
>> +
>> + mtrr_state_set = 1;
>> +}
>> +
>> /**
>> * mtrr_type_lookup - look up memory type in MTRR
>> *
>> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> index 1beb38f7a7a3..1c19d67ddab3 100644
>> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
>> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
>> const char *why = "(not available)";
>> unsigned int phys_addr;
>>
>> + if (!generic_mtrrs && mtrr_state.enabled) {
>> + /* Software overwrite of MTRR state, only for generic case. */
> ^
> !generic case?
No. This test just verifies that the (visible) MTRR feature is switched off,
as there are no ways to modify any MTRR registers in the overwrite case.
I can make this more obvious in a comment.
Juergen
> > > /**
> > > * mtrr_type_lookup - look up memory type in MTRR
> > > *
> > > diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > index 1beb38f7a7a3..1c19d67ddab3 100644
> > > --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
> > > const char *why = "(not available)";
> > > unsigned int phys_addr;
> > >
> > > + if (!generic_mtrrs && mtrr_state.enabled) {
> > > + /* Software overwrite of MTRR state, only for generic case. */
> > ^
> > !generic case?
>
> No. This test just verifies that the (visible) MTRR feature is switched off,
> as there are no ways to modify any MTRR registers in the overwrite case.
>
> I can make this more obvious in a comment.
Should the comment say something like (because it applies to the code inside the
check):
If we have a static (synthetic) MTRR already established for special
VMs, we still need to calculate the physical address bits using generic way,
because the hardware to run those special VMs indeed has MTRR.
That explains why 'true' is passed to mtrr_calc_physbits().
?
>
>
> Juergen
On 03.04.23 11:27, Huang, Kai wrote:
>
>>>> /**
>>>> * mtrr_type_lookup - look up memory type in MTRR
>>>> *
>>>> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
>>>> index 1beb38f7a7a3..1c19d67ddab3 100644
>>>> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
>>>> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
>>>> @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
>>>> const char *why = "(not available)";
>>>> unsigned int phys_addr;
>>>>
>>>> + if (!generic_mtrrs && mtrr_state.enabled) {
>>>> + /* Software overwrite of MTRR state, only for generic case. */
>>> ^
>>> !generic case?
>>
>> No. This test just verifies that the (visible) MTRR feature is switched off,
>> as there are no ways to modify any MTRR registers in the overwrite case.
>>
>> I can make this more obvious in a comment.
>
> Should the comment say something like (because it applies to the code inside the
> check):
>
>
> If we have a static (synthetic) MTRR already established for special
> VMs, we still need to calculate the physical address bits using generic way,
> because the hardware to run those special VMs indeed has MTRR.
>
> That explains why 'true' is passed to mtrr_calc_physbits().
I'd rather say that the interface of mtrr_overwrite_state() is based on the
interface of generic MTRRs.
Juergen
On Mon, 2023-04-03 at 09:27 +0000, Huang, Kai wrote:
> > > > /**
> > > > * mtrr_type_lookup - look up memory type in MTRR
> > > > *
> > > > diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > index 1beb38f7a7a3..1c19d67ddab3 100644
> > > > --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
> > > > const char *why = "(not available)";
> > > > unsigned int phys_addr;
> > > >
> > > > + if (!generic_mtrrs && mtrr_state.enabled) {
> > > > + /* Software overwrite of MTRR state, only for generic case. */
> > > ^
> > > !generic case?
> >
> > No. This test just verifies that the (visible) MTRR feature is switched off,
> > as there are no ways to modify any MTRR registers in the overwrite case.
> >
> > I can make this more obvious in a comment.
>
> Should the comment say something like (because it applies to the code inside the
> check):
>
>
> If we have a static (synthetic) MTRR already established for special
> VMs, we still need to calculate the physical address bits using generic way,
> because the hardware to run those special VMs indeed has MTRR.
>
> That explains why 'true' is passed to mtrr_calc_physbits().
>
> ?
>
>
And yes I guess it would be better to point out we cannot modify any MTRR
registers in the overwrite case, but this is also clear to me. So no opinion on
this point, but I do find the original comment is a little bit confusing.
On Mon, 2023-04-03 at 11:35 +0200, Juergen Gross wrote:
> On 03.04.23 11:27, Huang, Kai wrote:
> >
> > > > > /**
> > > > > * mtrr_type_lookup - look up memory type in MTRR
> > > > > *
> > > > > diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > > index 1beb38f7a7a3..1c19d67ddab3 100644
> > > > > --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > > +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> > > > > @@ -666,6 +666,15 @@ void __init mtrr_bp_init(void)
> > > > > const char *why = "(not available)";
> > > > > unsigned int phys_addr;
> > > > >
> > > > > + if (!generic_mtrrs && mtrr_state.enabled) {
> > > > > + /* Software overwrite of MTRR state, only for generic case. */
> > > > ^
> > > > !generic case?
> > >
> > > No. This test just verifies that the (visible) MTRR feature is switched off,
> > > as there are no ways to modify any MTRR registers in the overwrite case.
> > >
> > > I can make this more obvious in a comment.
> >
> > Should the comment say something like (because it applies to the code inside the
> > check):
> >
> >
> > If we have a static (synthetic) MTRR already established for special
> > VMs, we still need to calculate the physical address bits using generic way,
> > because the hardware to run those special VMs indeed has MTRR.
> >
> > That explains why 'true' is passed to mtrr_calc_physbits().
>
> I'd rather say that the interface of mtrr_overwrite_state() is based on the
> interface of generic MTRRs.
Sure fine to me too. Thanks.
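Taken together, the two clarifications settle on wording along these lines for
the hunk in mtrr_bp_init() (a suggestion distilled from the exchange above, not
necessarily the exact comment that lands in v6):

        if (!generic_mtrrs && mtrr_state.enabled) {
                /*
                 * Software overwrite of the MTRR state: X86_FEATURE_MTRR has
                 * been cleared, so the MTRR registers must not be touched.
                 * The mtrr_overwrite_state() interface is modelled on generic
                 * MTRRs, which is why the physical address bits are still
                 * calculated the generic way.
                 */
                mtrr_calc_physbits(true);
                init_table();
                pr_info("MTRRs set to read-only\n");

                return;
        }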
On Sat, Apr 01, 2023 at 08:36:41AM +0200, Juergen Gross wrote:
> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> guests or when running as a SEV-SNP guest under Hyper-V). Typically
> the hypervisor will reset the MTRR feature in CPUID data, resulting
> in no MTRR memory type information being available for the kernel.
>
> This has turned out to result in problems:
Let's add the links to those problems:
> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
I can't find Michael's original report, do you have a link?
> - Xen PV dom0 mapping memory as WB which should be UC- instead
Link: https://lore.kernel.org/all/[email protected]/
>
> Solve those problems by supporting to set a static MTRR state,
s/by supporting to set a/allowing an MTRR static state override/
> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> + mtrr_type def_type)
> +{
> + unsigned int i;
> +
> + /* Only allowed to be called once before mtrr_bp_init(). */
> + if (WARN_ON(mtrr_state_set))
WARN_ON_ONCE() is probably better.
> + return;
> +
> + /* Only allowed when running virtualized. */
> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
> + return;
> +
> + /*
> + * Only allowed for special virtualization cases:
> + * - when running as SEV-SNP guest
> + * - when running as Hyper-V isolated guest
when running as a SEV-SNP guest on a HyperV with vTOM enabled
that's a single condition.
> + * - when running as Xen PV guest
> + * - when running as TSX guest
> + */
> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
> + !hv_is_isolation_supported() &&
> + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
IOW:
if (!(hv_is_isolation_supported() && cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
!cpu_feature_enabled(X86_FEATURE_XENPV) &&
!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 11.04.23 15:20, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:41AM +0200, Juergen Gross wrote:
>> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
>> guests or when running as a SEV-SNP guest under Hyper-V). Typically
>> the hypervisor will reset the MTRR feature in CPUID data, resulting
>> in no MTRR memory type information being available for the kernel.
>>
>> This has turned out to result in problems:
>
> Let's add the links to those problems:
>
>> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
>
> I can't find Michael's original report, do you have a link?
DYM
https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com/
?
>
>> - Xen PV dom0 mapping memory as WB which should be UC- instead
>
> Link: https://lore.kernel.org/all/[email protected]/
>
>>
>> Solve those problems by supporting to set a static MTRR state,
>
> s/by supporting to set a/allowing an MTRR static state override/
>
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type def_type)
>> +{
>> + unsigned int i;
>> +
>> + /* Only allowed to be called once before mtrr_bp_init(). */
>> + if (WARN_ON(mtrr_state_set))
>
> WARN_ON_ONCE() is probably better.
If you like that better (I don't see the real benefit, as something would
be severely broken if this triggers more than once, but I don't really
mind).
>
>> + return;
>> +
>> + /* Only allowed when running virtualized. */
>> + if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR))
>> + return;
>> +
>> + /*
>> + * Only allowed for special virtualization cases:
>> + * - when running as SEV-SNP guest
>> + * - when running as Hyper-V isolated guest
>
> when running as a SEV-SNP guest on a HyperV with vTOM enabled
>
> that's a single condition.
>
>> + * - when running as Xen PV guest
>> + * - when running as TSX guest
>> + */
>> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
>> + !hv_is_isolation_supported() &&
>> + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
>> + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>
> IOW:
>
> if (!(hv_is_isolation_supported() && cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
> !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
Okay.
Juergen
From: Borislav Petkov <[email protected]> Sent: Tuesday, April 11, 2023 6:21 AM
>
> On Sat, Apr 01, 2023 at 08:36:41AM +0200, Juergen Gross wrote:
[snip]
> >
> > +
> > + /*
> > + * Only allowed for special virtualization cases:
> > + * - when running as SEV-SNP guest
> > + * - when running as Hyper-V isolated guest
>
> when running as a SEV-SNP guest on a HyperV with vTOM enabled
>
> that's a single condition.
>
> > + * - when running as Xen PV guest
> > + * - when running as TSX guest
> > + */
> > + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
> > + !hv_is_isolation_supported() &&
> > + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> > + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>
> IOW:
>
> if (!(hv_is_isolation_supported() && cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
> !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>
That doesn't work. Hyper-V guests with vTOM don't have
CC_ATTR_GUEST_SEV_SNP. As previously discussed, the SEV-SNP
machinery is handled by the paravisor, and the Linux guest doesn't
see it. Enabling CC_ATTR_GUEST_SEV_SNP in a vTOM guest would
trigger Linux to do a bunch of SNP stuff that the paravisor has already
done and would break things. The standalone hv_is_isolation_supported()
test is sufficient to detect this case.
I really wanted to avoid calls to hv_is_isolation_supported() outside
of Hyper-V specific code in the kernel. The alternative is to create
another CC_ATTR_ value that is set in the vTOM case, but that reopens
the naming can-of-worms.
Michael
On 11.04.23 15:59, Michael Kelley (LINUX) wrote:
> From: Borislav Petkov <[email protected]> Sent: Tuesday, April 11, 2023 6:21 AM
>>
>> On Sat, Apr 01, 2023 at 08:36:41AM +0200, Juergen Gross wrote:
>
> [snip]
>
>>>
>>> +
>>> + /*
>>> + * Only allowed for special virtualization cases:
>>> + * - when running as SEV-SNP guest
>>> + * - when running as Hyper-V isolated guest
>>
>> when running as a SEV-SNP guest on a HyperV with vTOM enabled
>>
>> that's a single condition.
>>
>>> + * - when running as Xen PV guest
>>> + * - when running as TSX guest
>>> + */
>>> + if (!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
>>> + !hv_is_isolation_supported() &&
>>> + !cpu_feature_enabled(X86_FEATURE_XENPV) &&
>>> + !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>>
>> IOW:
>>
>> if (!(hv_is_isolation_supported() && cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
>> !cpu_feature_enabled(X86_FEATURE_XENPV) &&
>> !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
>>
>
> That doesn't work. Hyper-V guests with vTOM don't have
> CC_ATTR_GUEST_SEV_SNP.
Yeah, the condition needs to be:
if (!(hv_is_isolation_supported() ||
cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
!cpu_feature_enabled(X86_FEATURE_XENPV) &&
!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
This is equivalent to the condition in my patch.
Juergen
On Tue, Apr 11, 2023 at 01:59:36PM +0000, Michael Kelley (LINUX) wrote:
> That doesn't work. Hyper-V guests with vTOM don't have
> CC_ATTR_GUEST_SEV_SNP. As previously discussed, the SEV-SNP
> machinery is handled by the paravisor,
Do you see now why I want for all kinds of guest types to not deviate
from doing the proper/default checks for their type? This is a good
example, case-in-point.
All "special" guests would get broken in the future, otherwise.
> I really wanted to avoid calls to hv_is_isolation_supported() outside
> of Hyper-V specific code in the kernel. The alternative is to create
> another CC_ATTR_ value that is set in the vTOM case, but that reopens
> the naming can-of-worms.
Now is the time for you to settle on what would be the official way to
query those guest types, before it propagates everywhere.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Apr 11, 2023 at 04:04:00PM +0200, Juergen Gross wrote:
> Yeah, the condition needs to be:
>
> if (!(hv_is_isolation_supported() ||
> cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Why is that needed at all?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 11.04.23 16:26, Borislav Petkov wrote:
> On Tue, Apr 11, 2023 at 04:04:00PM +0200, Juergen Gross wrote:
>> Yeah, the condition needs to be:
>>
>> if (!(hv_is_isolation_supported() ||
>> cc_platform_has(CC_ATTR_GUEST_SEV_SNP)) &&
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Why is that needed at all?
>
Short answer: You asked me to add it in V2 of the series.
Longer answer: SEV-SNP guests and TDX guests would cause #VE when writing
MTRR MSRs.
Juergen
On Tue, Apr 11, 2023 at 03:31:06PM +0200, Juergen Gross wrote:
> DYM https://lore.kernel.org/lkml/BYAPR21MB16883ABC186566BD4D2A1451D7FE9@BYAPR21MB1688.namprd21.prod.outlook.com/
Yap, looks like the one. Thx.
> If you like that better (I don't see the real benefit, as something would
> be severely broken if this triggers more than once, but I don't really
> mind).
No need to spam the console with the same thing, especially if something
else broken is wanting to tell us what it is, or some other assertion
fires.
But yeah, ok, this is on the BSP path so should *not* happen more than
once anyway.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Tue, Apr 11, 2023 at 05:57:07PM +0200, Juergen Gross wrote:
> Short answer: You asked me to add it in V2 of the series.
>
> Longer answer: SEV-SNP guests and TDX guests would cause #VE when writing
> MTRR MSRs.
Ah, sorry. Let's extend the comment then and use your original check:
/*
* Only allowed for special virtualization cases:
* - when running as Hyper-V, SEV-SNP guest using vTOM
* - when running as Xen PV guest
* - when running as SEV-SNP or TDX guest to avoid unnecessary
* VMM communication/Virtualization exceptions (#VC, #VE)
*/
if (!hv_is_isolation_supported() &&
!cpu_feature_enabled(X86_FEATURE_XENPV) &&
!cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
!cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
return;
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 11.04.23 19:15, Borislav Petkov wrote:
> On Tue, Apr 11, 2023 at 05:57:07PM +0200, Juergen Gross wrote:
>> Short answer: You asked me to add it in V2 of the series.
>>
>> Longer answer: SEV-SNP guests and TDX guests would cause #VE when writing
>> MTRR MSRs.
>
> Ah, sorry. Let's extend the comment then and use your original check:
>
> /*
> * Only allowed for special virtualization cases:
> * - when running as Hyper-V, SEV-SNP guest using vTOM
> * - when running as Xen PV guest
> * - when running as SEV-SNP or TDX guest to avoid unnecessary
> * VMM communication/Virtualization exceptions (#VC, #VE)
> */
> if (!hv_is_isolation_supported() &&
> !cpu_feature_enabled(X86_FEATURE_XENPV) &&
> !cc_platform_has(CC_ATTR_GUEST_SEV_SNP) &&
> !cpu_feature_enabled(X86_FEATURE_TDX_GUEST))
> return;
Okay.
Juergen
On Sat, Apr 01, 2023 at 08:36:44AM +0200, Juergen Gross wrote:
> Modern CPUs all share the same MTRR interface implemented via
> generic_mtrr_ops.
>
> At several places in MTRR code this generic interface is deduced via
> is_cpu(INTEL) tests, which is only working due to X86_VENDOR_INTEL
> being 0 (the is_cpu() macro is testing mtrr_if->vendor, which isn't
> explicitly set in generic_mtrr_ops).
>
> Fix that by replacing the is_cpu(INTEL) tests with testing for mtrr_if
> to be &generic_mtrr_ops.
Replace with:
"Test the generic CPU feature X86_FEATURE_MTRR instead."
> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> index 1c19d67ddab3..46aae69d259e 100644
> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> @@ -108,14 +108,12 @@ static int have_wrcomb(void)
> /* This function returns the number of variable MTRRs */
> static void __init set_num_var_ranges(bool use_generic)
> {
> - unsigned long config = 0, dummy;
> + unsigned long config, dummy;
>
> if (use_generic)
> rdmsr(MSR_MTRRcap, config, dummy);
> - else if (is_cpu(AMD) || is_cpu(HYGON))
> - config = 2;
> - else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
> - config = 8;
> + else
> + config = mtrr_if->var_regs;
>
> num_var_ranges = config & MTRR_CONFIG_NUM_VAR_MASK;
> }
From previous review which you've missed to incorporate:
"Since you're touching this function, you might simply expand its body in
its only call site in mtrr_bp_init(), put a comment above the expanded
code and remove that function.
That is, if we're going to do the ->var_regs thing."
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12.04.23 10:45, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:44AM +0200, Juergen Gross wrote:
>> Modern CPUs all share the same MTRR interface implemented via
>> generic_mtrr_ops.
>>
>> At several places in MTRR code this generic interface is deduced via
>> is_cpu(INTEL) tests, which is only working due to X86_VENDOR_INTEL
>> being 0 (the is_cpu() macro is testing mtrr_if->vendor, which isn't
>> explicitly set in generic_mtrr_ops).
>>
>> Fix that by replacing the is_cpu(INTEL) tests with testing for mtrr_if
>> to be &generic_mtrr_ops.
>
> Replace with:
>
> "Test the generic CPU feature X86_FEATURE_MTRR instead."
>
>> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> index 1c19d67ddab3..46aae69d259e 100644
>> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
>> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> @@ -108,14 +108,12 @@ static int have_wrcomb(void)
>> /* This function returns the number of variable MTRRs */
>> static void __init set_num_var_ranges(bool use_generic)
>> {
>> - unsigned long config = 0, dummy;
>> + unsigned long config, dummy;
>>
>> if (use_generic)
>> rdmsr(MSR_MTRRcap, config, dummy);
>> - else if (is_cpu(AMD) || is_cpu(HYGON))
>> - config = 2;
>> - else if (is_cpu(CYRIX) || is_cpu(CENTAUR))
>> - config = 8;
>> + else
>> + config = mtrr_if->var_regs;
>>
>> num_var_ranges = config & MTRR_CONFIG_NUM_VAR_MASK;
>> }
>
> From previous review which you've missed to incorporate:
>
> "Since you're touching this function, you might simply expand its body in
> its only call site in mtrr_bp_init(), put a comment above the expanded
> code and remove that function.
>
> That is, if we're going to do the ->var_regs thing."
>
Oh, sorry. Will do it.
Juergen
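The expansion Boris asks for amounts to folding those few lines into
mtrr_bp_init(); a rough sketch, using the generic_mtrrs local already visible in
that function and the var_regs field from the quoted hunk (the actual v6 code
may look different):

        /*
         * Number of variable MTRR ranges: read from MTRRcap for the generic
         * driver, a fixed per-vendor value otherwise.
         */
        if (generic_mtrrs)
                rdmsr(MSR_MTRRcap, config, dummy);
        else
                config = mtrr_if->var_regs;
        num_var_ranges = config & MTRR_CONFIG_NUM_VAR_MASK;

with config and dummy becoming locals of mtrr_bp_init().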
On Sat, Apr 01, 2023 at 08:36:45AM +0200, Juergen Gross wrote:
> Today there are two variants of set_mtrr(): one calling stop_machine()
"... two variants which set MTRRs: set_mtrr() and set_mtrr_cpuslocked(),
first calling ..."
> and one calling stop_machine_cpuslocked().
>
> The first one (set_mtrr()) has only one caller, and this caller is
> always running with only one CPU online and interrupts being off.
Wait, whaaat?
Its only caller is mtrr_restore() and that is part of syscore ops
which is registered for
"The CPU has no MTRR and seems to not support SMP."
Do you mean that, per chance?
If you do, please explain that properly in the commit message - this is
not a guessing game.
By the looks of that syscore thing, it is needed for the very old MTRR
implementations which weren't SMP (K6, Centaur, Cyrix etc).
Please explain that in the commit message too. It needs to say *why* the
transformation you're doing is ok.
"this caller is always running with only one CPU online" is not nearly
beginning to explain what the situation is.
> Remove the first variant completely and replace the call of it with
> a call of mtrr_if->set().
>
> Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that
> there is only one variant left.
Those are visible from the diff itself - you don't need to explain what
the patch does but why.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12.04.23 14:30, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:45AM +0200, Juergen Gross wrote:
>> Today there are two variants of set_mtrr(): one calling stop_machine()
>
> "... two variants which set MTRRs: set_mtrr() and set_mtrr_cpuslocked(),
> first calling ..."
>
>> and one calling stop_machine_cpuslocked().
>>
>> The first one (set_mtrr()) has only one caller, and this caller is
>> always running with only one CPU online and interrupts being off.
>
> Wait, whaaat?
>
> Its only caller is mtrr_restore() and that is part of syscore ops
> which is registered for
>
> "The CPU has no MTRR and seems to not support SMP."
>
> Do you mean that, per chance?
I'm not sure the comment is accurate ("has no MTRR" is a bad way
to say that X86_FEATURE_MTRR is 0, BTW).
The main point is that mtrr_restore() is called with interrupts off
and before any other potential CPUs are started again.
> If you do, please explain that properly in the commit message - this is
> not a guessing game.
Okay.
> By the looks of that syscore thing, it is needed for the very old MTRR
> implementations which weren't SMP (K6, Centaur, Cyrix etc).
>
> Please explain that in the commit message too. It needs to say *why* the
> transformation you're doing is ok.
I've outlined the "why" above. I'll replace the paragraph you were referring
to.
>
> "this caller is always running with only one CPU online" is not nearly
> beginning to explain what the situation is.
>
>> Remove the first variant completely and replace the call of it with
>> a call of mtrr_if->set().
>>
>> Rename the second variant set_mtrr_cpuslocked() to set_mtrr() now that
>> there is only one variant left.
>
> Those are visible from the diff itself - you don't need to explain what
> the patch does but why.
Okay, I'll drop that paragraph.
Juergen
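For context, with the set_mtrr() wrapper gone the resume path boils down to
something like the following sketch (field names assumed from the existing
mtrr_value bookkeeping; the literal patch may differ):

        static void mtrr_restore(void)
        {
                int i;

                /*
                 * Only the boot CPU is online and interrupts are off here,
                 * so no stop_machine() rendezvous is needed.
                 */
                for (i = 0; i < num_var_ranges; i++) {
                        if (mtrr_value[i].lsize)
                                mtrr_if->set(i, mtrr_value[i].lbase,
                                             mtrr_value[i].lsize,
                                             mtrr_value[i].ltype);
                }
        }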
On Wed, Apr 12, 2023 at 02:56:33PM +0200, Juergen Gross wrote:
> I'm not sure the comment is accurate ("has no MTRR" is a bad way
> to say that X86_FEATURE_MTRR is 0, BTW).
Yeah, this comment is from the old commit quoted below.
The commit message is, as expected, useless. And that comment part
reads, at least to me, like, "maybe we should do this but I don't know,
perhaps, no hw to try this on".
I.e., this is for all those MTRR implementations which are different
from Intel's MTRR spec and are so old that they couldn't possibly have
had SMP. I.e., those are ooold single core machines.
The second part of the comment is not sure whether those machines could
even handle suspend/resume...
And I'm willing to bet good money, this code won't ever run on real hw
anymore because all that real hw is long dead and buried.
> The main point is that mtrr_restore() is called with interrupts off
> and before any other potential CPUs are started again.
Yes. I am only making sure that you mean what I think you mean and
trying to get you to be more verbose in your commit messages.
:-)
Thx.
---
commit 3b520b238e018ef0e9d11c9115d5e7d9419c4ef9
Author: Shaohua Li <[email protected]>
Date: Thu Jul 7 17:56:38 2005 -0700
[PATCH] MTRR suspend/resume cleanup
There has been some discuss about solving the SMP MTRR suspend/resume
breakage, but I didn't find a patch for it. This is an intent for it. The
basic idea is moving mtrr initializing into cpu_identify for all APs (so it
works for cpu hotplug). For BP, restore_processor_state is responsible for
restoring MTRR.
Signed-off-by: Shaohua Li <[email protected]>
Acked-by: Andi Kleen <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
Signed-off-by: Linus Torvalds <[email protected]>
diff --git a/arch/i386/kernel/cpu/common.c b/arch/i386/kernel/cpu/common.c
index 2203a9d20212..4553ffd94b1f 100644
--- a/arch/i386/kernel/cpu/common.c
+++ b/arch/i386/kernel/cpu/common.c
@@ -435,6 +435,11 @@ void __devinit identify_cpu(struct cpuinfo_x86 *c)
if (c == &boot_cpu_data)
sysenter_setup();
enable_sep_cpu();
+
+ if (c == &boot_cpu_data)
+ mtrr_bp_init();
+ else
+ mtrr_ap_init();
}
#ifdef CONFIG_X86_HT
diff --git a/arch/i386/kernel/cpu/mtrr/generic.c b/arch/i386/kernel/cpu/mtrr/generic.c
index 64d91f73a0a4..169ac8e0db68 100644
--- a/arch/i386/kernel/cpu/mtrr/generic.c
+++ b/arch/i386/kernel/cpu/mtrr/generic.c
@@ -67,13 +67,6 @@ void __init get_mtrr_state(void)
mtrr_state.enabled = (lo & 0xc00) >> 10;
}
-/* Free resources associated with a struct mtrr_state */
-void __init finalize_mtrr_state(void)
-{
- kfree(mtrr_state.var_ranges);
- mtrr_state.var_ranges = NULL;
-}
-
/* Some BIOS's are fucked and don't set all MTRRs the same! */
void __init mtrr_state_warn(void)
{
@@ -334,6 +327,9 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
*/
{
unsigned long flags;
+ struct mtrr_var_range *vr;
+
+ vr = &mtrr_state.var_ranges[reg];
local_irq_save(flags);
prepare_set();
@@ -342,11 +338,15 @@ static void generic_set_mtrr(unsigned int reg, unsigned long base,
/* The invalid bit is kept in the mask, so we simply clear the
relevant mask register to disable a range. */
mtrr_wrmsr(MTRRphysMask_MSR(reg), 0, 0);
+ memset(vr, 0, sizeof(struct mtrr_var_range));
} else {
- mtrr_wrmsr(MTRRphysBase_MSR(reg), base << PAGE_SHIFT | type,
- (base & size_and_mask) >> (32 - PAGE_SHIFT));
- mtrr_wrmsr(MTRRphysMask_MSR(reg), -size << PAGE_SHIFT | 0x800,
- (-size & size_and_mask) >> (32 - PAGE_SHIFT));
+ vr->base_lo = base << PAGE_SHIFT | type;
+ vr->base_hi = (base & size_and_mask) >> (32 - PAGE_SHIFT);
+ vr->mask_lo = -size << PAGE_SHIFT | 0x800;
+ vr->mask_hi = (-size & size_and_mask) >> (32 - PAGE_SHIFT);
+
+ mtrr_wrmsr(MTRRphysBase_MSR(reg), vr->base_lo, vr->base_hi);
+ mtrr_wrmsr(MTRRphysMask_MSR(reg), vr->mask_lo, vr->mask_hi);
}
post_set();
diff --git a/arch/i386/kernel/cpu/mtrr/main.c b/arch/i386/kernel/cpu/mtrr/main.c
index d66b09e0c820..764cac64e211 100644
--- a/arch/i386/kernel/cpu/mtrr/main.c
+++ b/arch/i386/kernel/cpu/mtrr/main.c
@@ -332,6 +332,8 @@ int mtrr_add_page(unsigned long base, unsigned long size,
error = -EINVAL;
+ /* No CPU hotplug when we change MTRR entries */
+ lock_cpu_hotplug();
/* Search for existing MTRR */
down(&main_lock);
for (i = 0; i < num_var_ranges; ++i) {
@@ -372,6 +374,7 @@ int mtrr_add_page(unsigned long base, unsigned long size,
error = i;
out:
up(&main_lock);
+ unlock_cpu_hotplug();
return error;
}
@@ -461,6 +464,8 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
return -ENXIO;
max = num_var_ranges;
+ /* No CPU hotplug when we change MTRR entries */
+ lock_cpu_hotplug();
down(&main_lock);
if (reg < 0) {
/* Search for existing MTRR */
@@ -501,6 +506,7 @@ int mtrr_del_page(int reg, unsigned long base, unsigned long size)
error = reg;
out:
up(&main_lock);
+ unlock_cpu_hotplug();
return error;
}
/**
@@ -544,21 +550,9 @@ static void __init init_ifs(void)
centaur_init_mtrr();
}
-static void __init init_other_cpus(void)
-{
- if (use_intel())
- get_mtrr_state();
-
- /* bring up the other processors */
- set_mtrr(~0U,0,0,0);
-
- if (use_intel()) {
- finalize_mtrr_state();
- mtrr_state_warn();
- }
-}
-
-
+/* The suspend/resume methods are only for CPU without MTRR. CPU using generic
+ * MTRR driver doesn't require this
+ */
struct mtrr_value {
mtrr_type ltype;
unsigned long lbase;
@@ -611,13 +605,13 @@ static struct sysdev_driver mtrr_sysdev_driver = {
/**
- * mtrr_init - initialize mtrrs on the boot CPU
+ * mtrr_bp_init - initialize mtrrs on the boot CPU
*
* This needs to be called early; before any of the other CPUs are
* initialized (i.e. before smp_init()).
*
*/
-static int __init mtrr_init(void)
+void __init mtrr_bp_init(void)
{
init_ifs();
@@ -674,12 +668,48 @@ static int __init mtrr_init(void)
if (mtrr_if) {
set_num_var_ranges();
init_table();
- init_other_cpus();
-
- return sysdev_driver_register(&cpu_sysdev_class,
- &mtrr_sysdev_driver);
+ if (use_intel())
+ get_mtrr_state();
}
- return -ENXIO;
}
-subsys_initcall(mtrr_init);
+void mtrr_ap_init(void)
+{
+ unsigned long flags;
+
+ if (!mtrr_if || !use_intel())
+ return;
+ /*
+ * Ideally we should hold main_lock here to avoid mtrr entries changed,
+ * but this routine will be called in cpu boot time, holding the lock
+ * breaks it. This routine is called in two cases: 1.very earily time
+ * of software resume, when there absolutely isn't mtrr entry changes;
+ * 2.cpu hotadd time. We let mtrr_add/del_page hold cpuhotplug lock to
+ * prevent mtrr entry changes
+ */
+ local_irq_save(flags);
+
+ mtrr_if->set_all();
+
+ local_irq_restore(flags);
+}
+
+static int __init mtrr_init_finialize(void)
+{
+ if (!mtrr_if)
+ return 0;
+ if (use_intel())
+ mtrr_state_warn();
+ else {
+ /* The CPUs haven't MTRR and seemes not support SMP. They have
+ * specific drivers, we use a tricky method to support
+ * suspend/resume for them.
+ * TBD: is there any system with such CPU which supports
+ * suspend/resume? if no, we should remove the code.
+ */
+ sysdev_driver_register(&cpu_sysdev_class,
+ &mtrr_sysdev_driver);
+ }
+ return 0;
+}
+subsys_initcall(mtrr_init_finialize);
diff --git a/arch/i386/kernel/cpu/mtrr/mtrr.h b/arch/i386/kernel/cpu/mtrr/mtrr.h
index de1351245599..99c9f2682041 100644
--- a/arch/i386/kernel/cpu/mtrr/mtrr.h
+++ b/arch/i386/kernel/cpu/mtrr/mtrr.h
@@ -91,7 +91,6 @@ extern struct mtrr_ops * mtrr_if;
extern unsigned int num_var_ranges;
-void finalize_mtrr_state(void);
void mtrr_state_warn(void);
char *mtrr_attrib_to_str(int x);
void mtrr_wrmsr(unsigned, unsigned, unsigned);
diff --git a/arch/i386/power/cpu.c b/arch/i386/power/cpu.c
index 0e6b45b61251..c547c1af6fa1 100644
--- a/arch/i386/power/cpu.c
+++ b/arch/i386/power/cpu.c
@@ -137,6 +137,7 @@ void __restore_processor_state(struct saved_context *ctxt)
fix_processor_context();
do_fpu_end();
+ mtrr_ap_init();
}
void restore_processor_state(void)
diff --git a/arch/x86_64/kernel/setup.c b/arch/x86_64/kernel/setup.c
index b02d921da4f7..5fd03225058a 100644
--- a/arch/x86_64/kernel/setup.c
+++ b/arch/x86_64/kernel/setup.c
@@ -1076,6 +1076,10 @@ void __cpuinit identify_cpu(struct cpuinfo_x86 *c)
#ifdef CONFIG_X86_MCE
mcheck_init(c);
#endif
+ if (c == &boot_cpu_data)
+ mtrr_bp_init();
+ else
+ mtrr_ap_init();
#ifdef CONFIG_NUMA
if (c != &boot_cpu_data)
numa_add_cpu(c - cpu_data);
diff --git a/arch/x86_64/kernel/suspend.c b/arch/x86_64/kernel/suspend.c
index 6c0f402e3a88..0612640d91b1 100644
--- a/arch/x86_64/kernel/suspend.c
+++ b/arch/x86_64/kernel/suspend.c
@@ -119,6 +119,7 @@ void __restore_processor_state(struct saved_context *ctxt)
fix_processor_context();
do_fpu_end();
+ mtrr_ap_init();
}
void restore_processor_state(void)
diff --git a/include/asm-i386/processor.h b/include/asm-i386/processor.h
index 6f0f93d0d417..5d06e6bd6ba0 100644
--- a/include/asm-i386/processor.h
+++ b/include/asm-i386/processor.h
@@ -694,4 +694,12 @@ extern unsigned long boot_option_idle_override;
extern void enable_sep_cpu(void);
extern int sysenter_setup(void);
+#ifdef CONFIG_MTRR
+extern void mtrr_ap_init(void);
+extern void mtrr_bp_init(void);
+#else
+#define mtrr_ap_init() do {} while (0)
+#define mtrr_bp_init() do {} while (0)
+#endif
+
#endif /* __ASM_I386_PROCESSOR_H */
diff --git a/include/asm-x86_64/proto.h b/include/asm-x86_64/proto.h
index f2f073642d62..6c813eb521f3 100644
--- a/include/asm-x86_64/proto.h
+++ b/include/asm-x86_64/proto.h
@@ -15,6 +15,13 @@ extern void pda_init(int);
extern void early_idt_handler(void);
extern void mcheck_init(struct cpuinfo_x86 *c);
+#ifdef CONFIG_MTRR
+extern void mtrr_ap_init(void);
+extern void mtrr_bp_init(void);
+#else
+#define mtrr_ap_init() do {} while (0)
+#define mtrr_bp_init() do {} while (0)
+#endif
extern void init_memory_mapping(unsigned long start, unsigned long end);
extern void system_call(void);
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Sat, Apr 01, 2023 at 08:36:46AM +0200, Juergen Gross wrote:
> +#ifdef CONFIG_X86_32
TBH, I'm not really crazy about adding more ifdeffery.
Wondering if adding a 32-bit only build object:
obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o legacy.o
to arch/x86/kernel/cpu/mtrr/Makefile and moving all that gunk over
there, out of the way, would be even cleaner...
> + mtrr_value = kcalloc(num_var_ranges, sizeof(*mtrr_value), GFP_KERNEL);
> +
> /*
> * The CPU has no MTRR and seems to not support SMP. They have
> * specific drivers, we use a tricky method to support
> - * suspend/resume for them.
> + * suspend/resume for them. In case above allocation failed we can't
> + * support suspend/resume (handled in mtrr_save()).
Oh well.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 12.04.23 23:11, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:46AM +0200, Juergen Gross wrote:
>> +#ifdef CONFIG_X86_32
>
> TBH, I'm not really crazy about adding more ifdeffery.
>
> Wondering if adding a 32-bit only build object:
>
> obj-$(CONFIG_X86_32) += amd.o cyrix.o centaur.o legacy.o
>
> to arch/x86/kernel/cpu/mtrr/Makefile and moving all that gunk over
> there, out of the way, would be even cleaner...
Yes, that should be rather simple.
Will add a patch to do that.
Juergen
On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
> +static void rm_map_entry_at(int idx)
> +{
> + cache_map_n--;
Let's not call memmove() when cache_map_n == idx.
Below too.
Especially since cache_map + idx + 1 is not valid and I wouldn't want it
getting prefetched from %rsi in the hw when there's no reason for it and
also the RET even from a function which doesn't do anything, costs.
> + memmove(cache_map + idx, cache_map + idx + 1,
> + sizeof(*cache_map) * (cache_map_n - idx));
> +}
Ok, first weird issue I encountered while playing with my carved out
program to exercise this cache_map handling thing. I can share it if you
want it - it is ugly but it works.
So while rebuilding the map, I have these current regions in the map, at
one point in time:
Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
note entry #3.
Now the next one it inserts is:
add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
add_map_entry_at: ret: 1
Note how it is the same as entry number #3 - just a different type.
What it ends up doing is, it simply overwrites the previous one and
merges it with the next:
Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
Now I think right about now we should've screamed loudly.
I know I know, this should never happen, right? And firmware programming
those MTRRs doesn't make mistakes...
However, we should be loud here and scream when a MTRR range disappears
like that.
Right?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
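For readers following along, the dumps above imply a per-entry layout roughly
like this (purely a guess for illustration; the real struct in the patch may use
different names and field widths):

        struct cache_map {
                u64     start;
                u64     end;
                u64     fixed : 1;      /* entry stems from fixed-range MTRRs */
                u64     type  : 7;      /* MTRR_TYPE_* of the range */
        };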
On 20.04.23 14:15, Borislav Petkov wrote:
> On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
>> +static void rm_map_entry_at(int idx)
>> +{
>> + cache_map_n--;
>
> Let's not call memmove() when cache_map_n == idx.
>
> Below too.
>
> Especially since cache_map + idx + 1 is not valid and I wouldn't want it
> getting prefetched from %rsi in the hw when there's no reason for it and
> also the RET even from a function which doesn't do anything, costs.
OTOH the additional compare to 0 has costs, too, and this cost is spent for
ALL calls, while the zero size call is a rather rare case.
Regarding "cache_map + idx + 1 is not valid": the standard clearly points
out that a call with size 0 is valid and won't copy anything [1].
>
>> + memmove(cache_map + idx, cache_map + idx + 1,
>> + sizeof(*cache_map) * (cache_map_n - idx));
>> +}
>
> Ok, first weird issue I encountered while playing with my carved out
> program to exercise this cache_map handling thing. I can share it if you
> want it - it is ugly but it works.
>
> So while rebuilding the map, I have these current regions in the map, at
> one point in time:
>
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> 3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
> 4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
> 5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>
> note entry #3.
>
> Now the next one it inserts is:
>
> add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
> merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
> merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
> add_map_entry_at: ret: 1
>
> Note how it is the same as entry number #3 - just a different type.
>
> What it ends up doing is, it simply overwrites the previous one and
> merges it with the next:
>
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
The map would reflect hardware behavior. Type 0 wins in case of overlapping
MTRRs.
> Now I think right about now we should've screamed loudly.
Now this is another requirement, right? Today's MTRR code wouldn't scream
either.
> I know I know, this should never happen, right? And firmware programming
> those MTRRs doesn't make mistakes...
At least we don't correct such mistakes today. Do you think we should change
that?
> However, we should be loud here and scream when a MTRR range disappears
> like that.
>
> Right?
TBH, I don't know.
Juergen
On 20.04.23 14:30, Juergen Gross wrote:
> On 20.04.23 14:15, Borislav Petkov wrote:
>> On Sat, Apr 01, 2023 at 08:36:48AM +0200, Juergen Gross wrote:
>>> +static void rm_map_entry_at(int idx)
>>> +{
>>> + cache_map_n--;
>>
>> Let's not call memmove() when cache_map_n == idx.
>>
>> Below too.
>>
>> Especially since cache_map + idx + 1 is not valid and I wouldn't want it
>> getting prefetched from %rsi in the hw when there's no reason for it and
>> also the RET even from a function which doesn't do anything, costs.
>
> OTOH the additional compare to 0 has costs, too, and this cost is spent for
> ALL calls, while the zero size call is a rather rare case.
>
> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
> out that a call with size 0 is valid and won't copy anything [1].
>
>>
>>> + memmove(cache_map + idx, cache_map + idx + 1,
>>> + sizeof(*cache_map) * (cache_map_n - idx));
>>> +}
>>
>> Ok, first weird issue I encountered while playing with my carved out
>> program to exercise this cache_map handling thing. I can share it if you
>> want it - it is ugly but it works.
>>
>> So while rebuilding the map, I have these current regions in the map, at
>> one point in time:
>>
>> Current map:
>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>> 3: start: 0x000003bf0000c000, end: 0x00000d4b0000c000, type: 0x1
>> 4: start: 0x00000d4b0000c000, end: 0x00019fc000001000, type: 0x0
>> 5: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>>
>> note entry #3.
>>
>> Now the next one it inserts is:
>>
>> add_map_entry_at: start: 0x3bf0000c000, end: 0xd4b0000c000, type: 0x0, idx: 3
>> merge_prev 0: prev->fixed: 0, prev->end: 0x3bf0000c000, prev->type: 0x2
>> merge_next: 1, next->fixed: 0, next->start: 0xd4b0000c000, next->type: 0x0
>> add_map_entry_at: ret: 1
>>
>> Note how it is the same as entry number #3 - just a different type.
>>
>> What it ends up doing is, it simply overwrites the previous one and
>> merges it with the next:
>>
>> Current map:
>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
>> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>
> The map would reflect hardware behavior. Type 0 wins in case of overlapping
> MTRRs.
>
>> Now I think right about now we should've screamed loudly.
>
> Now this is another requirement, right? Today's MTRR code wouldn't scream
> either.
>
>> I know I know, this should never happen, right? And firmware programming
>> those MTRRs doesn't make mistakes...
>
> At least we don't correct such mistakes today. Do you think we should change
> that?
>
>> However, we should be loud here and scream when a MTRR range disappears
>> like that.
>>
>> Right?
>
> TBH, I don't know.
Bah, forgot the link:
[1]: https://lists.llvm.org/pipermail/llvm-dev/2007-October/011070.html
Juergen
On Thu, Apr 20, 2023 at 02:30:09PM +0200, Juergen Gross wrote:
> OTOH the additional compare to 0 has costs, too, and this cost is spent for
> ALL calls
I'll take the cost of a single
cmpl %edi, %edx
for a handful of entries any day of the week.
> while the zero size call is a rather rare case.
$ grep "memmove size" /tmp/mtrr.txt
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
memmove size 0
for - I admit a largely contrived map - but 5 entries only:
Current map:
0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
2: start: 0x000050d70000c000, end: 0x000062a60000c000, type: 0x4
3: start: 0x000062a60000c000, end: 0x0001d6d200001000, type: 0x0
4: start: 0x0001d6d200001000, end: 0x0001dd8100001000, type: 0x6
> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
> out that a call with size 0 is valid and won't copy anything
That's not what I meant. Please read again what I said:
"I wouldn't want it getting prefetched from %rsi in the hw when there's
no reason for it".
IOW, I don't want to put invalid values in hw registers if I can. Think
hw mitigations and *very* hungry prefetchers.
> > Current map:
> > 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> > 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> > 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
> > 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
> > 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>
> The map would reflect hardware behavior. Type 0 wins in case of overlapping
> MTRRs.
Type 0 is MTRR_TYPE_UNCACHABLE, 1 is MTRR_TYPE_WRCOMB.
"Uncacheable (UC)—Reads from, and writes to, UC memory are not cacheable. Reads from UC
memory cannot be speculative. Write-combining to UC memory is not
allowed."
That last sentence.
So if you have conflicting regions and one is WC but then something is
expecting it to be UC and that something doesn't want for it to
*especially* to be WC because it should not combine writes to it, then
that clearly is a misconfiguration, I'd say.
> Now this is another requirement, right? Today's MTRR code wouldn't scream
> either.
So are we fixing this right or only a little?
> At least we don't correct such mistakes today. Do you think we should change
> that?
I'm thinking considering how often we've seen all those error messages
from mtrr_state_warn(), we should at least warn when we encounter
inconsistencies.
This won't help with released BIOSes but it will help when they do new
BIOS verification and see those messages. People do use Linux for that
a lot and then they'll definitely look and address them.
And this is a pretty good goal to have, IMO.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 20.04.23 15:01, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 02:30:09PM +0200, Juergen Gross wrote:
>> OTOH the additional compare to 0 has costs, too, and this cost is spent for
>> ALL calls
>
> I'll take the cost of a single
>
> cmpl %edi, %edx
>
> for a handful of entries any day of the week.
>
>> while the zero size call is a rather rare case.
>
> $ grep "memmove size" /tmp/mtrr.txt
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
> memmove size 0
>
> for - I admit a largely contrived map - but 5 entries only:
>
> Current map:
> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
> 2: start: 0x000050d70000c000, end: 0x000062a60000c000, type: 0x4
> 3: start: 0x000062a60000c000, end: 0x0001d6d200001000, type: 0x0
> 4: start: 0x0001d6d200001000, end: 0x0001dd8100001000, type: 0x6
>
>> Regarding "cache_map + idx + 1 is not valid": the standard clearly points
>> out that a call with size 0 is valid and won't copy anything
>
> That's not what I meant. Please read again what I said:
>
> "I wouldn't want it getting prefetched from %rsi in the hw when there's
> no reason for it".
So you are suggesting that prefetching can happen across one wrong speculated
branch, but not across two of them? And you are not worrying about prefetches
past the end of a copy with size > 0?
> IOW, I don't want to put invalid values in hw registers if I can. Think
> hw mitigations and *very* hungry prefetchers.
>
>>> Current map:
>>> 0: start: 0x0000000000000000, end: 0x0000000000100000, type: 0x0
>>> 1: start: 0x0000000100000000, end: 0x0000000820000000, type: 0x6
>>> 2: start: 0x000002f10000c000, end: 0x000003bf0000c000, type: 0x2
>>> 3: start: 0x000003bf0000c000, end: 0x00019fc000001000, type: 0x0
>>> 4: start: 0x00019fc000001000, end: 0x0001df2d00001000, type: 0x2
>>
>> The map would reflect hardware behavior. Type 0 wins in case of overlapping
>> MTRRs.
>
> Type 0 is MTRR_TYPE_UNCACHABLE, 1 is MTRR_TYPE_WRCOMB.
>
> "Uncacheable (UC)—Reads from, and writes to, UC memory are not cacheable. Reads from UC
> memory cannot be speculative. Write-combining to UC memory is not
> allowed."
>
> That last sentence.
"If two or more variable memory ranges match and one of the memory types is UC,
the UC memory type used."
So technically no problem, apart from lower performance.
> So if you have conflicting regions and one is WC but then something is
> expecting it to be UC and that something doesn't want for it to
> *especially* to be WC because it should not combine writes to it, then
> that clearly is a misconfiguration, I'd say.
Would you be fine with adding that as an additional patch?
I believe if we really want that, then we should be able to disable such a
cleanup. So it should be an add-on anyway.
>> Now this is another requirement, right? Today's MTRR code wouldn't scream
>> either.
>
> So are we fixing this right or only a little?
I'm not against adding such additional checks. I wouldn't like to force them
into this series right now, because we need this series for Hyper-V isolated
guest support.
>> At least we don't correct such mistakes today. Do you think we should change
>> that?
>
> I'm thinking considering how often we've seen all those error messages
> from mtrr_state_warn(), we should at least warn when we encounter
> inconsistencies.
Just to say it explicitly: you are concerned for the case that a complete
MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?
> This won't help with released BIOSes but it will help when they do new
> BIOS verification and see those messages. People do use Linux for that
> a lot and then they'll definitely look and address them.
Okay.
> And this is a pretty good goal to have, IMO.
Probably, yes.
Juergen
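The combining rule quoted above can be written down as a small helper; this is
only an illustration of the architectural rule (UC always wins, WT wins over WB,
anything else is a firmware inconsistency), not code from the series:

        static u8 combine_mtrr_types(u8 a, u8 b)
        {
                /* UC always wins when variable ranges overlap. */
                if (a == MTRR_TYPE_UNCACHABLE || b == MTRR_TYPE_UNCACHABLE)
                        return MTRR_TYPE_UNCACHABLE;

                /* WT wins over WB. */
                if ((a == MTRR_TYPE_WRBACK && b == MTRR_TYPE_WRTHROUGH) ||
                    (a == MTRR_TYPE_WRTHROUGH && b == MTRR_TYPE_WRBACK))
                        return MTRR_TYPE_WRTHROUGH;

                /* Any other combination is undefined/inconsistent. */
                return a == b ? a : MTRR_TYPE_INVALID;
        }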
On Thu, Apr 20, 2023 at 03:57:43PM +0200, Juergen Gross wrote:
> So you are suggesting that prefetching can happen across one wrong speculated
> branch, but not across two of them? And you are not worrying about prefetches
> past the end of a copy with size > 0?
Maybe it will, maybe it won't.
I am worried about calling a function unnecessarily. I'm worried about
calling the asm version __memmove() with zero length unnecessarily. I'm
worried about executing the return thunk unnecessarily:
ffffffff8104c749: 48 83 c4 28 add $0x28,%rsp
ffffffff8104c74d: e9 72 2a 9b 00 jmp ffffffff819ff1c4 <__x86_return_thunk>
ffffffff8104c752: 31 c0 xor %eax,%eax
ffffffff8104c754: e9 6b 2a 9b 00 jmp ffffffff819ff1c4 <__x86_return_thunk>
ffffffff8104c759: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
Just say that you don't want to do this simple check and I will do it
myself because I'm tired of debating.
> "If two or more variable memory ranges match and one of the memory types is UC,
> the UC memory type used."
>
> So technically no problem, apart from lower performance.
How do you come from "Write-combining to UC memory is not allowed" to
"lower performance"?
Not allowed is not allowed. Geez.
> Would you be fine with adding that as an additional patch?
>
> I believe if we really want that, then we should be able to disable such a
> cleanup. So it should be an add-on anyway.
Sure, whatever.
> I'm not against adding such additional checks. I wouldn't like to force them
> into this series right now, because we need this series for Hyper-V isolated
> guest support.
We will add this series when they're ready. If Hyper-V needs them
immediately they can take whatever they want and do whatever they want.
Or you can do a simpler fix for Hyper-V that goes before this, if you
want to address Hyper-V.
> Just to say it explicitly: you are concerned for the case that a complete
> MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
> MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?
I am concerned about catching any and all inconsistencies with the MTRRs
and catching them right. If we're going to spend all this time on this,
then let's do it right, once and for all and do it in a manner that can
be improved in the future.
Thanks!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 20.04.23 16:54, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 03:57:43PM +0200, Juergen Gross wrote:
>> So you are suggesting that prefetching can happen across one wrong speculated
>> branch, but not across two of them? And you are not worrying about prefetches
>> past the end of a copy with size > 0?
>
> Maybe it will, maybe it won't.
>
> I am worried about calling a function unnecessarily. I'm worried about
> calling the asm version __memmove() with zero length unnecessarily. I'm
> worried about executing the return thunk unnecessarily:
>
> ffffffff8104c749: 48 83 c4 28 add $0x28,%rsp
> ffffffff8104c74d: e9 72 2a 9b 00 jmp ffffffff819ff1c4 <__x86_return_thunk>
> ffffffff8104c752: 31 c0 xor %eax,%eax
> ffffffff8104c754: e9 6b 2a 9b 00 jmp ffffffff819ff1c4 <__x86_return_thunk>
> ffffffff8104c759: 0f 1f 80 00 00 00 00 nopl 0x0(%rax)
>
> Just say that you don't want to do this simple check and I will do it
> myself because I'm tired of debating.
I just want to make sure I understand your concerns and that the reasoning
is sane.
You seem to feel rather strongly about this, so I'll add the test.
>
>> "If two or more variable memory ranges match and one of the memory types is UC,
>> the UC memory type used."
>>
>> So technically no problem, apart from lower performance.
>
> How do you get from "Write-combining to UC memory is not allowed" to
> "lower performance"?
>
> Not allowed is not allowed. Geez.
Yes. And using UC instead of WC usually means lower write performance.
>> Would you be fine with adding that as an additional patch?
>>
>> I believe if we really want that, then we should be able to disable such a
>> cleanup. So it should be an add-on anyway.
>
> Sure, whatever.
Okay, thanks.
I think this will need another final loop over the MTRRs, checking against the
constructed map whether an MTRR is completely useless.
Another question: in case we detect such a hidden MTRR, should it be disabled
in order to have more MTRRs available for run-time adding?
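The "final loop" mentioned above could roughly take the shape below: once the
map is built, revisit every enabled variable MTRR and warn if no part of its
range ended up with the type that MTRR asks for. All helper names here
(mtrr_var_count(), get_var_mtrr(), map_type_at()) are made up for the
illustration, they are not the series' code:

static void warn_useless_mtrrs(void)
{
        unsigned int i;

        for (i = 0; i < mtrr_var_count(); i++) {
                u64 start, end, addr;
                u8 type;
                bool effective = false;

                /* Skip disabled variable MTRRs. */
                if (!get_var_mtrr(i, &start, &end, &type))
                        continue;

                /*
                 * A real implementation would walk the map entries overlapping
                 * the range; the page-wise walk just keeps the sketch short.
                 */
                for (addr = start; addr < end && !effective; addr += PAGE_SIZE)
                        effective = (map_type_at(addr) == type);

                if (!effective)
                        pr_warn("MTRR %u (%llx-%llx, type %u) has no effect\n",
                                i, start, end, type);
        }
}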
>
>> I'm not against adding such additional checks. I wouldn't like to force them
>> into this series right now, because we need this series for Hyper-V isolated
>> guest support.
>
> We will add this series when it's ready. If Hyper-V needs them
> immediately they can take whatever they want and do whatever they want.
>
> Or you can do a simpler fix for Hyper-V that goes before this, if you
> want to address Hyper-V.
>
>> Just to say it explicitly: you are concerned about the case where a complete
>> MTRR is hidden beneath another one (e.g. a large UC MTRR hiding a smaller
>> MTRR with another type, or a variable MTRR being hidden by fixed MTRRs)?
>
> I am concerned about catching any and all inconsistencies with the MTRRs
> and catching them right. If we're going to spend all this time on this,
> then let's do it right, once and for all and do it in a manner that can
> be improved in the future.
Okay.
Juergen
On Thu, Apr 20, 2023 at 05:10:03PM +0200, Juergen Gross wrote:
> I think this will need another final loop over the MTRRs, checking against the
> constructed map whether an MTRR is completely useless.
Yeah, I slept on it: I think there should be a patch on top which adds
debug output - disabled by default and controllable by adding
"mtrr=debug" on the cmdline or so - which dumps the cache map operations
(add/remove) and the final table.
The reason being, when this cache_map thing hits upstream, we would need
a way to debug any potential issues which people might report, so asking
them to do a "mtrr=debug" boot would be a big help.
And pls make the prints pr_info() and not pr_debug() because people
should not have to recompile in order to enable that.
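For illustration, that can be as simple as a flag set from the command line plus
a pr_info() based print helper, roughly along these lines (option name and setup
hook are assumptions here, not the final implementation):

#include <linux/init.h>
#include <linux/kernel.h>
#include <linux/string.h>

static bool mtrr_debug;

/* "mtrr=debug" on the command line enables the cache map dump. */
static int __init mtrr_param_setup(char *str)
{
        if (str && !strcmp(str, "debug"))
                mtrr_debug = true;

        return 1;
}
__setup("mtrr=", mtrr_param_setup);

/* pr_info() so no recompile is needed to get the output. */
#define map_dbg(fmt, ...)                               \
do {                                                    \
        if (mtrr_debug)                                 \
                pr_info(fmt, ##__VA_ARGS__);            \
} while (0)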
> Another question: in case we detect such a hidden MTRR, should it be disabled
> in order to have more MTRRs available for run-time adding?
Let's not do anything for now and address this if really needed.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 21.04.23 13:23, Borislav Petkov wrote:
> On Thu, Apr 20, 2023 at 05:10:03PM +0200, Juergen Gross wrote:
>> I think this will need another final loop over the MTRRs, checking against the
>> constructed map whether an MTRR is completely useless.
>
> Yeah, I slept on it: I think there should be a patch on top which adds
> debug output - disabled by default and controllable by adding
> "mtrr=debug" on the cmdline or so - which dumps the cache map operations
> (add/remove) and the final table.
Okay.
> The reason being, when this cache_map thing hits upstream, we would need
> a way to debug any potential issues which people might report, so asking
> them to do a "mtrr=debug" boot would be a big help.
>
> And pls make the prints pr_info() and not pr_debug() because people
> should not have to recompile in order to enable that.
Agreed.
>> Another question: in case we detect such a hidden MTRR, should it be disabled
>> in order to have more MTRRs available for run-time adding?
>
> Let's not do anything for now and address this if really needed.
Okay.
Juergen