This series tries to fix the rather special case of PAT being available
without having MTRRs (either due to CONFIG_MTRR being not set, or
because the feature has been disabled e.g. by a hypervisor).
The main use cases are Xen PV guests and SEV-SNP guests running under
Hyper-V.
Instead of trying to work around all the issues by adding if statements
here and there, just use the complete available MTRR infrastructure by
setting up a read-only MTRR state when needed.
In the Xen PV case the current MTRR MSR values can be read from the
hypervisor, while for the SEV-SNP case all that is needed is to set the
default caching mode to "WB".
I have added more cleanup patches which were discussed when looking
into the most recent failures.
Note that I couldn't test the Hyper-V related change (patch 3).
Running on bare metal and with Xen didn't show any problems with the
series applied.
Changes in V2:
- replaced former patches 1+2 with new patches 1-4, avoiding especially
the rather hacky approach of V1, while making all the MTRR type
conflict tests available for the Xen PV case
- updated patch 6 (was patch 4 in V1)
Juergen Gross (8):
x86/mtrr: split off physical address size calculation
x86/mtrr: support setting MTRR state for software defined MTRRs
x86/hyperv: set MTRR state when running as SEV-SNP Hyper-V guest
x86/xen: set MTRR state when running as Xen PV initial domain
x86/mtrr: revert commit 90b926e68f50
x86/mtrr: don't let mtrr_type_lookup() return MTRR_TYPE_INVALID
x86/mm: only check uniform after calling mtrr_type_lookup()
x86/mtrr: drop sanity check in mtrr_type_lookup_fixed()
arch/x86/include/asm/mtrr.h | 9 +++-
arch/x86/include/uapi/asm/mtrr.h | 6 +--
arch/x86/kernel/cpu/mshyperv.c | 8 +++
arch/x86/kernel/cpu/mtrr/generic.c | 56 +++++++++++++++++----
arch/x86/kernel/cpu/mtrr/mtrr.c | 79 ++++++++++++++++--------------
arch/x86/mm/pat/memtype.c | 3 +-
arch/x86/mm/pgtable.c | 6 +--
arch/x86/xen/enlighten_pv.c | 49 ++++++++++++++++++
8 files changed, 159 insertions(+), 57 deletions(-)
--
2.35.3
Move the calculation of the physical address size in mtrr_bp_init()
into a helper function. This will be needed later.
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- new patch
---
arch/x86/kernel/cpu/mtrr/mtrr.c | 70 ++++++++++++++++-----------------
1 file changed, 33 insertions(+), 37 deletions(-)
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 783f3210d582..542ca5639dfd 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -617,27 +617,11 @@ static struct syscore_ops mtrr_syscore_ops = {
.resume = mtrr_restore,
};
-int __initdata changed_by_mtrr_cleanup;
-
-#define SIZE_OR_MASK_BITS(n) (~((1ULL << ((n) - PAGE_SHIFT)) - 1))
-/**
- * mtrr_bp_init - initialize mtrrs on the boot CPU
- *
- * This needs to be called early; before any of the other CPUs are
- * initialized (i.e. before smp_init()).
- *
- */
-void __init mtrr_bp_init(void)
+static unsigned int __init mtrr_calc_physbits(bool generic)
{
- const char *why = "(not available)";
- u32 phys_addr;
-
- phys_addr = 32;
+ unsigned int phys_addr = 32;
- if (boot_cpu_has(X86_FEATURE_MTRR)) {
- mtrr_if = &generic_mtrr_ops;
- size_or_mask = SIZE_OR_MASK_BITS(36);
- size_and_mask = 0x00f00000;
+ if (generic) {
phys_addr = 36;
/*
@@ -654,42 +638,54 @@ void __init mtrr_bp_init(void)
(boot_cpu_data.x86_stepping == 0x3 ||
boot_cpu_data.x86_stepping == 0x4))
phys_addr = 36;
-
- size_or_mask = SIZE_OR_MASK_BITS(phys_addr);
- size_and_mask = ~size_or_mask & 0xfffff00000ULL;
} else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
boot_cpu_data.x86 == 6) {
/*
* VIA C* family have Intel style MTRRs,
* but don't support PAE
*/
- size_or_mask = SIZE_OR_MASK_BITS(32);
- size_and_mask = 0;
phys_addr = 32;
}
+ }
+
+ size_or_mask = ~((1ULL << ((phys_addr) - PAGE_SHIFT)) - 1);
+ size_and_mask = ~size_or_mask & 0xfffff00000ULL;
+
+ return phys_addr;
+}
+
+int __initdata changed_by_mtrr_cleanup;
+
+/**
+ * mtrr_bp_init - initialize mtrrs on the boot CPU
+ *
+ * This needs to be called early; before any of the other CPUs are
+ * initialized (i.e. before smp_init()).
+ *
+ */
+void __init mtrr_bp_init(void)
+{
+ const char *why = "(not available)";
+ unsigned int phys_addr;
+
+ phys_addr = mtrr_calc_physbits(boot_cpu_has(X86_FEATURE_MTRR));
+
+ if (boot_cpu_has(X86_FEATURE_MTRR)) {
+ mtrr_if = &generic_mtrr_ops;
} else {
switch (boot_cpu_data.x86_vendor) {
case X86_VENDOR_AMD:
- if (cpu_feature_enabled(X86_FEATURE_K6_MTRR)) {
- /* Pre-Athlon (K6) AMD CPU MTRRs */
+ /* Pre-Athlon (K6) AMD CPU MTRRs */
+ if (cpu_feature_enabled(X86_FEATURE_K6_MTRR))
mtrr_if = &amd_mtrr_ops;
- size_or_mask = SIZE_OR_MASK_BITS(32);
- size_and_mask = 0;
- }
break;
case X86_VENDOR_CENTAUR:
- if (cpu_feature_enabled(X86_FEATURE_CENTAUR_MCR)) {
+ if (cpu_feature_enabled(X86_FEATURE_CENTAUR_MCR))
			mtrr_if = &centaur_mtrr_ops;
- size_or_mask = SIZE_OR_MASK_BITS(32);
- size_and_mask = 0;
- }
break;
case X86_VENDOR_CYRIX:
- if (cpu_feature_enabled(X86_FEATURE_CYRIX_ARR)) {
+ if (cpu_feature_enabled(X86_FEATURE_CYRIX_ARR))
mtrr_if = &cyrix_mtrr_ops;
- size_or_mask = SIZE_OR_MASK_BITS(32);
- size_and_mask = 0;
- }
break;
default:
break;
--
2.35.3
When running virtualized, MTRR access can be reduced (e.g. in Xen PV
guests or when running as a SEV-SNP guest under Hyper-V). Typically
the hypervisor will reset the MTRR feature in cpuid data, resulting
in no MTRR memory type information being available for the kernel.
This has turned out to result in problems:
- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
- Xen PV dom0 mapping memory as WB which should be UC- instead
Solve those problems by supporting to set a fixed MTRR state,
overwriting the empty state used today. In case such a state has been
set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
will only be used by mtrr_type_lookup(), as in all other cases
mtrr_enabled() is being checked, which will return false. Accept the
overwrite call only in case of MTRRs being disabled in cpuid.
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- new patch
---
arch/x86/include/asm/mtrr.h | 2 ++
arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
3 files changed, 49 insertions(+)
diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index f0eeaf6e5f5f..0b8f51d683dc 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -31,6 +31,8 @@
*/
# ifdef CONFIG_MTRR
void mtrr_bp_init(void);
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+ mtrr_type *fixed, mtrr_type def_type);
extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
extern void mtrr_save_fixed_ranges(void *);
extern void mtrr_save_state(void);
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index ee09d359e08f..788bc16888a5 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
return mtrr_state.def_type;
}
+/**
+ * mtrr_overwrite_state - set fixed MTRR state
+ *
+ * Used to set MTRR state via different means (e.g. with data obtained from
+ * a hypervisor).
+ */
+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
+ mtrr_type *fixed, mtrr_type def_type)
+{
+ unsigned int i;
+
+ if (boot_cpu_has(X86_FEATURE_MTRR))
+ return;
+
+ if (var) {
+ if (num_var > MTRR_MAX_VAR_RANGES) {
+ pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
+ num_var);
+ num_var = MTRR_MAX_VAR_RANGES;
+ }
+ for (i = 0; i < num_var; i++)
+ mtrr_state.var_ranges[i] = var[i];
+ num_var_ranges = num_var;
+ }
+
+ if (fixed) {
+ for (i = 0; i < MTRR_NUM_FIXED_RANGES; i++)
+ mtrr_state.fixed_ranges[i] = fixed[i];
+ mtrr_state.enabled |= MTRR_STATE_MTRR_FIXED_ENABLED;
+ mtrr_state.have_fixed = 1;
+ }
+
+ mtrr_state.def_type = def_type;
+ mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
+
+ mtrr_state_set = 1;
+}
+
/**
* mtrr_type_lookup - look up memory type in MTRR
*
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index 542ca5639dfd..b73fe243c7fd 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -668,6 +668,15 @@ void __init mtrr_bp_init(void)
const char *why = "(not available)";
unsigned int phys_addr;
+ if (mtrr_state.enabled) {
+ /* Software overwrite of MTRR state, only for generic case. */
+ mtrr_calc_physbits(true);
+ init_table();
+ pr_info("MTRRs set to read-only\n");
+
+ return;
+ }
+
phys_addr = mtrr_calc_physbits(boot_cpu_has(X86_FEATURE_MTRR));
if (boot_cpu_has(X86_FEATURE_MTRR)) {
--
2.35.3
In order to avoid mappings using the UC- cache attribute, set the
MTRR state to use WB caching as the default.
This is needed in order to cope with the fact that PAT is enabled,
while MTRRs are disabled by the hypervisor.
Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- new patch
---
arch/x86/kernel/cpu/mshyperv.c | 8 ++++++++
1 file changed, 8 insertions(+)
diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
index 46668e255421..51e47dc0e987 100644
--- a/arch/x86/kernel/cpu/mshyperv.c
+++ b/arch/x86/kernel/cpu/mshyperv.c
@@ -34,6 +34,7 @@
#include <clocksource/hyperv_timer.h>
#include <asm/numa.h>
#include <asm/coco.h>
+#include <asm/mtrr.h>
/* Is Linux running as the root partition? */
bool hv_root_partition;
@@ -335,6 +336,13 @@ static void __init ms_hyperv_init_platform(void)
static_branch_enable(&isolation_type_snp);
#ifdef CONFIG_SWIOTLB
swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
+#endif
+#ifdef CONFIG_MTRR
+ /*
+ * Set WB as the default cache mode in case MTRRs are
+ * disabled by the hypervisor.
+ */
+ mtrr_overwrite_state(NULL, 0, NULL, MTRR_TYPE_WRBACK);
#endif
}
/* Isolation VMs are unenlightened SEV-based VMs, thus this check: */
--
2.35.3
When running as Xen PV initial domain (aka dom0), MTRRs are disabled
by the hypervisor, but the system should nevertheless use correct
cache memory types. This has always kind of worked, as disabled MTRRs
resulted in disabled PAT, too, so that the kernel avoided code paths
resulting in inconsistencies. This bypassed all of the sanity checks
the kernel is doing with enabled MTRRs in order to avoid memory
mappings with conflicting memory types.
This has changed recently: PAT is now accepted as being enabled while
MTRRs stay disabled. As a result memory type requests are no longer
honored in all cases, and mappings which should be UC- end up as WB
instead. This led to driver failures during initialization of some
devices.
In reality MTRRs are still in effect, but they are under complete
control of the Xen hypervisor. It is possible, however, to retrieve
the MTRR settings from the hypervisor.
In order to fix those problems, overwrite the MTRR state via
mtrr_overwrite_state() with the MTRR data from the hypervisor, if the
system is running as a Xen dom0.
Fixes: 72cbc8f04fe2 ("x86/PAT: Have pat_enabled() properly reflect state when running on Xen")
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- new patch
---
arch/x86/xen/enlighten_pv.c | 49 +++++++++++++++++++++++++++++++++++++
1 file changed, 49 insertions(+)
diff --git a/arch/x86/xen/enlighten_pv.c b/arch/x86/xen/enlighten_pv.c
index 5b1379662877..9cf520c0c810 100644
--- a/arch/x86/xen/enlighten_pv.c
+++ b/arch/x86/xen/enlighten_pv.c
@@ -68,6 +68,7 @@
#include <asm/reboot.h>
#include <asm/hypervisor.h>
#include <asm/mach_traps.h>
+#include <asm/mtrr.h>
#include <asm/mwait.h>
#include <asm/pci_x86.h>
#include <asm/cpu.h>
@@ -1200,6 +1201,52 @@ static void __init xen_boot_params_init_edd(void)
#endif
}
+/* Get MTRR settings from Xen and put them into mtrr_state. */
+static void __init xen_set_mtrr_data(void)
+{
+#ifdef CONFIG_MTRR
+ struct xen_platform_op op = {
+ .cmd = XENPF_read_memtype,
+ .interface_version = XENPF_INTERFACE_VERSION,
+ };
+ unsigned int reg;
+ unsigned long mask;
+ uint32_t eax, width;
+ static struct mtrr_var_range var[MTRR_MAX_VAR_RANGES] __initdata;
+
+ /* Get physical address width (only 64-bit cpus supported). */
+ width = 36;
+ eax = cpuid_eax(0x80000000);
+ if ((eax >> 16) == 0x8000 && eax >= 0x80000008) {
+ eax = cpuid_eax(0x80000008);
+ width = eax & 0xff;
+ }
+
+ for (reg = 0; reg < MTRR_MAX_VAR_RANGES; reg++) {
+ op.u.read_memtype.reg = reg;
+ if (HYPERVISOR_platform_op(&op))
+ break;
+
+ /*
+ * Only called in dom0, which has all RAM PFNs mapped at
+ * RAM MFNs, and all PCI space etc. is identity mapped.
+	 * This means we can treat MFN == PFN regarding MTRR settings.
+ */
+ var[reg].base_lo = op.u.read_memtype.type;
+ var[reg].base_lo |= op.u.read_memtype.mfn << PAGE_SHIFT;
+ var[reg].base_hi = op.u.read_memtype.mfn >> (32 - PAGE_SHIFT);
+ mask = ~((op.u.read_memtype.nr_mfns << PAGE_SHIFT) - 1);
+ mask &= (1UL << width) - 1;
+ if (mask)
+ mask |= 1 << 11;
+ var[reg].mask_lo = mask;
+ var[reg].mask_hi = mask >> 32;
+ }
+
+ mtrr_overwrite_state(var, reg, NULL, MTRR_TYPE_UNCACHABLE);
+#endif
+}
+
/*
* Set up the GDT and segment registers for -fstack-protector. Until
* we do this, we have to be careful not to call any stack-protected
@@ -1403,6 +1450,8 @@ asmlinkage __visible void __init xen_start_kernel(struct start_info *si)
xen_boot_params_init_edd();
+ xen_set_mtrr_data();
+
#ifdef CONFIG_ACPI
/*
* Disable selecting "Firmware First mode" for correctable
--
2.35.3
Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
case") has introduced a regression with Xen.
Revert the patch.
Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/mm/pat/memtype.c | 3 +--
1 file changed, 1 insertion(+), 2 deletions(-)
diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
index fb4b1b5e0dea..46de9cf5c91d 100644
--- a/arch/x86/mm/pat/memtype.c
+++ b/arch/x86/mm/pat/memtype.c
@@ -387,8 +387,7 @@ static unsigned long pat_x_mtrr_type(u64 start, u64 end,
u8 mtrr_type, uniform;
mtrr_type = mtrr_type_lookup(start, end, &uniform);
- if (mtrr_type != MTRR_TYPE_WRBACK &&
- mtrr_type != MTRR_TYPE_INVALID)
+ if (mtrr_type != MTRR_TYPE_WRBACK)
return _PAGE_CACHE_MODE_UC_MINUS;
return _PAGE_CACHE_MODE_WB;
--
2.35.3
mtrr_type_lookup() should always return a valid memory type. In case
there is no information available, it should return the default UC.
At the same time the mtrr_type_lookup() stub for the !CONFIG_MTRR
case should set uniform to 1, as if the memory range were not covered
by any MTRR at all.
In the CONFIG_MTRR case make sure uniform is always set to either 0
or 1. As long as mtrr_state_set isn't set, set uniform to 0, as it isn't
known yet whether the covered range is uniform or not (in fact there is
currently no caller interested in the uniform value before mtrr_state_set
is set). With MTRRs disabled uniform can be set to 1.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
---
V2:
- always set uniform
- set uniform to 1 in case of disabled MTRRs (Linus Torvalds)
---
arch/x86/include/asm/mtrr.h | 7 +++++--
arch/x86/kernel/cpu/mtrr/generic.c | 12 ++++++++----
2 files changed, 13 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
index 0b8f51d683dc..f362c33e3d74 100644
--- a/arch/x86/include/asm/mtrr.h
+++ b/arch/x86/include/asm/mtrr.h
@@ -53,9 +53,12 @@ void mtrr_generic_set_state(void);
static inline u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform)
{
/*
- * Return no-MTRRs:
+ * Return the default MTRR type, without any known other types in
+ * that range.
*/
- return MTRR_TYPE_INVALID;
+ *uniform = 1;
+
+ return MTRR_TYPE_UNCACHABLE;
}
#define mtrr_save_fixed_ranges(arg) do {} while (0)
#define mtrr_save_state() do {} while (0)
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index 788bc16888a5..afb59ff2cc00 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -299,11 +299,15 @@ u8 mtrr_type_lookup(u64 start, u64 end, u8 *uniform)
/* Make end inclusive instead of exclusive */
end--;
- if (!mtrr_state_set)
- return MTRR_TYPE_INVALID;
+ if (!mtrr_state_set) {
+ *uniform = 0; /* Uniformity is unknown. */
+ return MTRR_TYPE_UNCACHABLE;
+ }
- if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED))
- return MTRR_TYPE_INVALID;
+ if (!(mtrr_state.enabled & MTRR_STATE_MTRR_ENABLED)) {
+ *uniform = 1;
+ return MTRR_TYPE_UNCACHABLE;
+ }
/*
* Look up the fixed ranges first, which take priority over
--
2.35.3
Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
WB or INVALID after calling mtrr_type_lookup(). Those tests can be
dropped, as the only reason to not use a large mapping would be
uniform being 0. Any MTRR type can be accepted as long as it applies
to the whole memory range covered by the mapping, as the alternative
would only be to map the same region with smaller pages instead, using
the same PAT type as for the large mapping.
Suggested-by: Linus Torvalds <[email protected]>
Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/mm/pgtable.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index e4f499eb0f29..7b9c5443d176 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
u8 mtrr, uniform;
mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
- if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
- (mtrr != MTRR_TYPE_WRBACK))
+ if (!uniform)
return 0;
/* Bail out if we are we on a populated non-leaf entry: */
@@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
u8 mtrr, uniform;
mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
- if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
- (mtrr != MTRR_TYPE_WRBACK)) {
+ if (!uniform) {
pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
__func__, addr, addr + PMD_SIZE);
return 0;
--
2.35.3
mtrr_type_lookup_fixed() contains a sanity check for the case of it
being called with a start address above 1 MB. As it is static and only
called with a start address below 1 MB, this sanity check can be
dropped.
This will remove the last case where mtrr_type_lookup() can return
MTRR_TYPE_INVALID, so adjust the comment in include/uapi/asm/mtrr.h.
Note that removing the MTRR_TYPE_INVALID #define from that header
could break user code, so it has to stay.
Signed-off-by: Juergen Gross <[email protected]>
---
arch/x86/include/uapi/asm/mtrr.h | 6 +++---
arch/x86/kernel/cpu/mtrr/generic.c | 6 +-----
2 files changed, 4 insertions(+), 8 deletions(-)
diff --git a/arch/x86/include/uapi/asm/mtrr.h b/arch/x86/include/uapi/asm/mtrr.h
index 376563f2bac1..4aa05c2ffa78 100644
--- a/arch/x86/include/uapi/asm/mtrr.h
+++ b/arch/x86/include/uapi/asm/mtrr.h
@@ -115,9 +115,9 @@ struct mtrr_state_type {
#define MTRR_NUM_TYPES 7
/*
- * Invalid MTRR memory type. mtrr_type_lookup() returns this value when
- * MTRRs are disabled. Note, this value is allocated from the reserved
- * values (0x7-0xff) of the MTRR memory types.
+ * Invalid MTRR memory type. No longer used outside of MTRR code.
+ * Note, this value is allocated from the reserved values (0x7-0xff) of
+ * the MTRR memory types.
*/
#define MTRR_TYPE_INVALID 0xff
diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
index afb59ff2cc00..575b050a55bf 100644
--- a/arch/x86/kernel/cpu/mtrr/generic.c
+++ b/arch/x86/kernel/cpu/mtrr/generic.c
@@ -114,16 +114,12 @@ static int check_type_overlap(u8 *prev, u8 *curr)
* 0xC0000 - 0xFFFFF : This range is divided into sixty-four 4KB sub-ranges
*
* Return Values:
- * MTRR_TYPE_(type) - Matched memory type
- * MTRR_TYPE_INVALID - Unmatched
+ * MTRR_TYPE_(type) - Memory type
*/
static u8 mtrr_type_lookup_fixed(u64 start, u64 end)
{
int idx;
- if (start >= 0x100000)
- return MTRR_TYPE_INVALID;
-
/* 0x0 - 0x7FFFF */
if (start < 0x80000) {
idx = 0;
--
2.35.3
Hi, this is your Linux kernel regression tracker.
On 09.02.23 08:22, Juergen Gross wrote:
> Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
> case") has introduced a regression with Xen.
>
> Revert the patch.
That regression you refer to is afaics one I'm tracking[1] that was
introduced this cycle. That makes me wonder: could this patch be applied
directly to fix the issue quickly? Or are patches 1 to 4 needed as well
(or the whole series?) to avoid other problems?
Ciao, Thorsten
[1]
https://lore.kernel.org/all/[email protected]/
P.S.: BTW; let me tell regzbot to monitor this thread:
#regzbot ^backmonitor:
https://lore.kernel.org/all/[email protected]/
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> arch/x86/mm/pat/memtype.c | 3 +--
> 1 file changed, 1 insertion(+), 2 deletions(-)
>
> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
> index fb4b1b5e0dea..46de9cf5c91d 100644
> --- a/arch/x86/mm/pat/memtype.c
> +++ b/arch/x86/mm/pat/memtype.c
> @@ -387,8 +387,7 @@ static unsigned long pat_x_mtrr_type(u64 start, u64 end,
> u8 mtrr_type, uniform;
>
> mtrr_type = mtrr_type_lookup(start, end, &uniform);
> - if (mtrr_type != MTRR_TYPE_WRBACK &&
> - mtrr_type != MTRR_TYPE_INVALID)
> + if (mtrr_type != MTRR_TYPE_WRBACK)
> return _PAGE_CACHE_MODE_UC_MINUS;
>
> return _PAGE_CACHE_MODE_WB;
On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index e4f499eb0f29..7b9c5443d176 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr,
> pgprot_t prot)
> u8 mtrr, uniform;
'mtrr' is now unused. Can it be dropped? Same for the pmd_set_huge().
>
> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
> - (mtrr != MTRR_TYPE_WRBACK))
> + if (!uniform)
> return 0;
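For illustration only (untested sketch), dropping it would presumably
look like this, and the same in pmd_set_huge():

	u8 uniform;

	mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
	if (!uniform)
		return 0;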
On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
> This series tries to fix the rather special case of PAT being
> available
> without having MTRRs (either due to CONFIG_MTRR being not set, or
> because the feature has been disabled e.g. by a hypervisor).
debug_vm_pgtable fails in a KVM guest with CONFIG_MTRR=y. CONFIG_MTRR=n
succeeds.
[ 0.830280] debug_vm_pgtable: [debug_vm_pgtable ]:
Validating architecture page table helpers
[ 0.831906] ------------[ cut here ]------------
[ 0.832711] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:461
debug_vm_pgtable+0xb9a/0xe16
[ 0.833998] Modules linked in:
[ 0.834450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc7+
#2366
[ 0.835462] RIP: 0010:debug_vm_pgtable+0xb9a/0xe16
[ 0.836217] Code: e2 3a 73 4a 48 c7 00 00 00 00 00 48 8b b4 24 a0 00
00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c4
[ 0.839068] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
[ 0.839735] RAX: 0000000000000000 RBX: ffff888100048868 RCX:
bffffffffffffff0
[ 0.840646] RDX: 0000000000000000 RSI: 0000000040000000 RDI:
0000000000000000
[ 0.841661] RBP: ffff88810004d140 R08: 0000000000000000 R09:
ffff888100280880
[ 0.842625] R10: 0000000000000001 R11: 0000000000000001 R12:
ffff888103810298
[ 0.843574] R13: ffff888100048780 R14: ffffffff8282e099 R15:
0000000000000000
[ 0.844524] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000)
knlGS:0000000000000000
[ 0.845706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.846499] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4:
0000000000370ef0
[ 0.847464] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.848432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 0.849371] Call Trace:
[ 0.849699] <TASK>
[ 0.849997] ? destroy_args+0x131/0x131
[ 0.850487] do_one_initcall+0x61/0x250
[ 0.850983] ? rdinit_setup+0x2c/0x2c
[ 0.851451] kernel_init_freeable+0x18e/0x1d8
[ 0.852033] ? rest_init+0x130/0x130
[ 0.852533] kernel_init+0x16/0x120
[ 0.853035] ret_from_fork+0x1f/0x30
[ 0.853507] </TASK>
[ 0.853803] ---[ end trace 0000000000000000 ]---
[ 0.854421] ------------[ cut here ]------------
[ 0.855027] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:462
debug_vm_pgtable+0xbaa/0xe16
[ 0.856115] Modules linked in:
[ 0.856517] CPU: 0 PID: 1 Comm: swapper/0 Tainted:
G W 6.2.0-rc7+ #2366
[ 0.857562] RIP: 0010:debug_vm_pgtable+0xbaa/0xe16
[ 0.858186] Code: 00 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c1 e6 0c
e8 79 18 7f fe 85 c0 75 02 0f 0b 48 8b 7b
[ 0.860778] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
[ 0.861519] RAX: 0000000000000000 RBX: ffff888100048868 RCX:
bffffffffffffff0
[ 0.862530] RDX: 0000000000000000 RSI: 0000000040000000 RDI:
ffff88810380e7f8
[ 0.863522] RBP: ffff88810004d140 R08: 0000000000000000 R09:
ffff888100280880
[ 0.864449] R10: 0000000000000001 R11: 0000000000000001 R12:
ffff888103810298
[ 0.865454] R13: ffff888100048780 R14: ffffffff8282e099 R15:
0000000000000000
[ 0.866401] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000)
knlGS:0000000000000000
[ 0.867438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 0.868181] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4:
0000000000370ef0
[ 0.869097] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
0000000000000000
[ 0.870026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
0000000000000400
[ 0.870943] Call Trace:
[ 0.871259] <TASK>
[ 0.871537] ? destroy_args+0x131/0x131
[ 0.872030] do_one_initcall+0x61/0x250
[ 0.872521] ? rdinit_setup+0x2c/0x2c
[ 0.873005] kernel_init_freeable+0x18e/0x1d8
[ 0.873607] ? rest_init+0x130/0x130
[ 0.874116] kernel_init+0x16/0x120
[ 0.874618] ret_from_fork+0x1f/0x30
[ 0.875123] </TASK>
[ 0.875411] ---[ end trace 0000000000000000 ]---
On Thu, Feb 09, 2023 at 08:22:13AM +0100, Juergen Gross wrote:
> @@ -654,42 +638,54 @@ void __init mtrr_bp_init(void)
> (boot_cpu_data.x86_stepping == 0x3 ||
> boot_cpu_data.x86_stepping == 0x4))
> phys_addr = 36;
> -
> - size_or_mask = SIZE_OR_MASK_BITS(phys_addr);
> - size_and_mask = ~size_or_mask & 0xfffff00000ULL;
> } else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
> boot_cpu_data.x86 == 6) {
> /*
> * VIA C* family have Intel style MTRRs,
> * but don't support PAE
> */
> - size_or_mask = SIZE_OR_MASK_BITS(32);
> - size_and_mask = 0;
> phys_addr = 32;
> }
> + }
> +
> + size_or_mask = ~((1ULL << ((phys_addr) - PAGE_SHIFT)) - 1);
Too many brackets because you've taken the macro and put in the argument
directly.
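I.e., presumably just:

	size_or_mask = ~((1ULL << (phys_addr - PAGE_SHIFT)) - 1);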
In any case, reviewing patches which do code movement *and* changes in
the same diff is always unnecessarily nasty. Please do the mechanical
code movement only - cleanups come ontop.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>
> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> guests or when running as a SEV-SNP guest under Hyper-V). Typically
> the hypervisor will reset the MTRR feature in cpuid data, resulting
> in no MTRR memory type information being available for the kernel.
>
> This has turned out to result in problems:
>
> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
> - Xen PV dom0 mapping memory as WB which should be UC- instead
>
> Solve those problems by supporting to set a fixed MTRR state,
> overwriting the empty state used today. In case such a state has been
> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
> will only be used by mtrr_type_lookup(), as in all other cases
> mtrr_enabled() is being checked, which will return false. Accept the
> overwrite call only in case of MTRRs being disabled in cpuid.
>
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> V2:
> - new patch
> ---
> arch/x86/include/asm/mtrr.h | 2 ++
> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
> 3 files changed, 49 insertions(+)
>
> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
> index f0eeaf6e5f5f..0b8f51d683dc 100644
> --- a/arch/x86/include/asm/mtrr.h
> +++ b/arch/x86/include/asm/mtrr.h
> @@ -31,6 +31,8 @@
> */
> # ifdef CONFIG_MTRR
> void mtrr_bp_init(void);
> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> + mtrr_type *fixed, mtrr_type def_type);
Could you add a stub for the !CONFIG_MTRR case? Then the
#ifdef CONFIG_MTRR could be removed in Patch 3 of this series.
Michael
From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>
> In order to avoid mappings using the UC- cache attribute, set the
> MTRR state to use WB caching as the default.
>
> This is needed in order to cope with the fact that PAT is enabled,
> while MTRRs are disabled by the hypervisor.
>
> Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> V2:
> - new patch
> ---
> arch/x86/kernel/cpu/mshyperv.c | 8 ++++++++
> 1 file changed, 8 insertions(+)
>
> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
> index 46668e255421..51e47dc0e987 100644
> --- a/arch/x86/kernel/cpu/mshyperv.c
> +++ b/arch/x86/kernel/cpu/mshyperv.c
> @@ -34,6 +34,7 @@
> #include <clocksource/hyperv_timer.h>
> #include <asm/numa.h>
> #include <asm/coco.h>
> +#include <asm/mtrr.h>
>
> /* Is Linux running as the root partition? */
> bool hv_root_partition;
> @@ -335,6 +336,13 @@ static void __init ms_hyperv_init_platform(void)
> static_branch_enable(&isolation_type_snp);
> #ifdef CONFIG_SWIOTLB
> swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
> +#endif
Unfortunately, Hyper-V does not filter out the MTRR flag from the
CPUID leaf, so this code also needs
setup_clear_cpu_cap(X86_FEATURE_MTRR);
before calling mtrr_overwrite_state(). I've got a bug filed for the
Hyper-V team to fix the flag, but clearing the feature here solves the
problem regardless.
> +#ifdef CONFIG_MTRR
Hopefully this #ifdef can go away, per my comment in Patch 2 of
the series.
Michael
> + /*
> + * Set WB as the default cache mode in case MTRRs are
> + * disabled by the hypervisor.
> + */
> + mtrr_overwrite_state(NULL, 0, NULL, MTRR_TYPE_WRBACK);
> #endif
> }
> /* Isolation VMs are unenlightened SEV-based VMs, thus this check: */
> --
> 2.35.3
From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>
> Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
> WB or INVALID after calling mtrr_type_lookup(). Those tests can be
> dropped, as the only reason to not use a large mapping would be
> uniform being 0. Any MTRR type can be accepted as long as it applies
> to the whole memory range covered by the mapping, as the alternative
> would only be to map the same region with smaller pages instead using
> the same PAT type as for the large mapping.
>
> Suggested-by: Linus Torvalds <[email protected]>
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> arch/x86/mm/pgtable.c | 6 ++----
> 1 file changed, 2 insertions(+), 4 deletions(-)
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index e4f499eb0f29..7b9c5443d176 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
> u8 mtrr, uniform;
>
> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
> - (mtrr != MTRR_TYPE_WRBACK))
> + if (!uniform)
> return 0;
>
> /* Bail out if we are we on a populated non-leaf entry: */
> @@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
> u8 mtrr, uniform;
>
> mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
> - (mtrr != MTRR_TYPE_WRBACK)) {
> + if (!uniform) {
> pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
> __func__, addr, addr + PMD_SIZE);
I'm seeing this warning trigger in a normal Hyper-V guest (i.e., *not* an
SEV-SNP Confidential VM). The original filtering here based on
MTRR_TYPE_WRBACK appears to be hiding a bug in mtrr_type_lookup_variable()
where it incorrectly thinks an address range matches two different variable
MTRRs, and hence clears "uniform".
Here are the variable MTRRs in the normal Hyper-V guest with 32 GiBytes
of memory:
[ 0.043592] MTRR variable ranges enabled:
[ 0.048308] 0 base 000000000000 mask FFFF00000000 write-back
[ 0.057450] 1 base 000100000000 mask FFF000000000 write-back
[ 0.063972] 2 disabled
[ 0.066755] 3 disabled
[ 0.070024] 4 disabled
[ 0.072856] 5 disabled
[ 0.076112] 6 disabled
[ 0.078760] 7 disabled
Variable MTRR #0 covers addresses up to 4 GiByte, while #1 covers
4 GiByte to 64 GiByte. But in mtrr_type_lookup_variable(), address
range 0xF8000000 to 0xF81FFFFF is matching both MTRRs, when it
should be matching just #0.
The problem looks to be this code in mtrr_type_lookup_variable():
if ((start & mask) != (base & mask))
continue;
If the mask bits of start and base are different, then the
MTRR doesn't match, and the continue statement should be
executed. That's correct. But if the mask bits are the same,
that's not sufficient for the MTRR to match. If the end
address is less than base, the MTRR doesn't match, and
the continue statement should still be executed, which
isn't happening.
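A minimal sketch of the additional check implied by the above (untested,
illustration only):

	if ((start & mask) != (base & mask))
		continue;

	/* Hypothetical extra check: MTRR lies entirely above the range. */
	if (end < base)
		continue;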
But somebody please check my thinking. :-)
Michael
On 10.02.23 19:59, Linux regression tracking (Thorsten Leemhuis) wrote:
> Hi, this is your Linux kernel regression tracker.
>
> On 09.02.23 08:22, Juergen Gross wrote:
>> Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
>> case") has introduced a regression with Xen.
>>
>> Revert the patch.
>
> That regression you refer to is afaics one I'm tracking[1] that was
> introduced this cycle. That makes me wonder: could this patch be applied
> directly to fix the issue quickly? Or are patches 1 to 4 needed as well
> (or the whole series?) to avoid other problems?
Patches 1-4 are needed, too, as otherwise the issue claimed to be fixed
with patch 5 would show up again.
I'm working on addressing the comments I've received.
Juergen
>
> Ciao, Thorsten
>
> [1]
> https://lore.kernel.org/all/[email protected]/
>
> P.S.: BTW; let me tell regzbot to monitor this thread:
>
> #regzbot ^backmonitor:
> https://lore.kernel.org/all/[email protected]/
>
>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> arch/x86/mm/pat/memtype.c | 3 +--
>> 1 file changed, 1 insertion(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/mm/pat/memtype.c b/arch/x86/mm/pat/memtype.c
>> index fb4b1b5e0dea..46de9cf5c91d 100644
>> --- a/arch/x86/mm/pat/memtype.c
>> +++ b/arch/x86/mm/pat/memtype.c
>> @@ -387,8 +387,7 @@ static unsigned long pat_x_mtrr_type(u64 start, u64 end,
>> u8 mtrr_type, uniform;
>>
>> mtrr_type = mtrr_type_lookup(start, end, &uniform);
>> - if (mtrr_type != MTRR_TYPE_WRBACK &&
>> - mtrr_type != MTRR_TYPE_INVALID)
>> + if (mtrr_type != MTRR_TYPE_WRBACK)
>> return _PAGE_CACHE_MODE_UC_MINUS;
>>
>> return _PAGE_CACHE_MODE_WB;
On 11.02.23 01:06, Edgecombe, Rick P wrote:
> On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index e4f499eb0f29..7b9c5443d176 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr,
>> pgprot_t prot)
>> u8 mtrr, uniform;
>
> 'mtrr' is now unused. Can it be dropped? Same for the pmd_set_huge().
I guess it will be used again, due to the comment you made for the whole
series.
Juergen
On 11.02.23 01:06, Edgecombe, Rick P wrote:
> On Thu, 2023-02-09 at 08:22 +0100, Juergen Gross wrote:
>> This series tries to fix the rather special case of PAT being
>> available
>> without having MTRRs (either due to CONFIG_MTRR being not set, or
>> because the feature has been disabled e.g. by a hypervisor).
>
> debug_vm_pgtable fails in a KVM guest with CONFIG_MTRR=y. CONFIG_MTRR=n
> succeeds.
>
> [ 0.830280] debug_vm_pgtable: [debug_vm_pgtable ]:
> Validating architecture page table helpers
> [ 0.831906] ------------[ cut here ]------------
> [ 0.832711] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:461
> debug_vm_pgtable+0xb9a/0xe16
> [ 0.833998] Modules linked in:
> [ 0.834450] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.2.0-rc7+
> #2366
> [ 0.835462] RIP: 0010:debug_vm_pgtable+0xb9a/0xe16
> [ 0.836217] Code: e2 3a 73 4a 48 c7 00 00 00 00 00 48 8b b4 24 a0 00
> 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c4
> [ 0.839068] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
> [ 0.839735] RAX: 0000000000000000 RBX: ffff888100048868 RCX:
> bffffffffffffff0
> [ 0.840646] RDX: 0000000000000000 RSI: 0000000040000000 RDI:
> 0000000000000000
> [ 0.841661] RBP: ffff88810004d140 R08: 0000000000000000 R09:
> ffff888100280880
> [ 0.842625] R10: 0000000000000001 R11: 0000000000000001 R12:
> ffff888103810298
> [ 0.843574] R13: ffff888100048780 R14: ffffffff8282e099 R15:
> 0000000000000000
> [ 0.844524] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000)
> knlGS:0000000000000000
> [ 0.845706] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.846499] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4:
> 0000000000370ef0
> [ 0.847464] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 0.848432] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 0.849371] Call Trace:
> [ 0.849699] <TASK>
> [ 0.849997] ? destroy_args+0x131/0x131
> [ 0.850487] do_one_initcall+0x61/0x250
> [ 0.850983] ? rdinit_setup+0x2c/0x2c
> [ 0.851451] kernel_init_freeable+0x18e/0x1d8
> [ 0.852033] ? rest_init+0x130/0x130
> [ 0.852533] kernel_init+0x16/0x120
> [ 0.853035] ret_from_fork+0x1f/0x30
> [ 0.853507] </TASK>
> [ 0.853803] ---[ end trace 0000000000000000 ]---
> [ 0.854421] ------------[ cut here ]------------
> [ 0.855027] WARNING: CPU: 0 PID: 1 at mm/debug_vm_pgtable.c:462
> debug_vm_pgtable+0xbaa/0xe16
> [ 0.856115] Modules linked in:
> [ 0.856517] CPU: 0 PID: 1 Comm: swapper/0 Tainted:
> G W 6.2.0-rc7+ #2366
> [ 0.857562] RIP: 0010:debug_vm_pgtable+0xbaa/0xe16
> [ 0.858186] Code: 00 00 00 48 8b 54 24 60 48 8b 7c 24 20 48 c1 e6 0c
> e8 79 18 7f fe 85 c0 75 02 0f 0b 48 8b 7b
> [ 0.860778] RSP: 0000:ffffc90000013de0 EFLAGS: 00010246
> [ 0.861519] RAX: 0000000000000000 RBX: ffff888100048868 RCX:
> bffffffffffffff0
> [ 0.862530] RDX: 0000000000000000 RSI: 0000000040000000 RDI:
> ffff88810380e7f8
> [ 0.863522] RBP: ffff88810004d140 R08: 0000000000000000 R09:
> ffff888100280880
> [ 0.864449] R10: 0000000000000001 R11: 0000000000000001 R12:
> ffff888103810298
> [ 0.865454] R13: ffff888100048780 R14: ffffffff8282e099 R15:
> 0000000000000000
> [ 0.866401] FS: 0000000000000000(0000) GS:ffff88813bc00000(0000)
> knlGS:0000000000000000
> [ 0.867438] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 0.868181] CR2: ffff88813ffff000 CR3: 000000000222d001 CR4:
> 0000000000370ef0
> [ 0.869097] DR0: 0000000000000000 DR1: 0000000000000000 DR2:
> 0000000000000000
> [ 0.870026] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7:
> 0000000000000400
> [ 0.870943] Call Trace:
> [ 0.871259] <TASK>
> [ 0.871537] ? destroy_args+0x131/0x131
> [ 0.872030] do_one_initcall+0x61/0x250
> [ 0.872521] ? rdinit_setup+0x2c/0x2c
> [ 0.873005] kernel_init_freeable+0x18e/0x1d8
> [ 0.873607] ? rest_init+0x130/0x130
> [ 0.874116] kernel_init+0x16/0x120
> [ 0.874618] ret_from_fork+0x1f/0x30
> [ 0.875123] </TASK>
> [ 0.875411] ---[ end trace 0000000000000000 ]---
Thanks for the report.
I'll have a look. Probably I'll need to re-add the check for WB in patch 7.
Juergen
On 11.02.23 11:08, Borislav Petkov wrote:
> On Thu, Feb 09, 2023 at 08:22:13AM +0100, Juergen Gross wrote:
>> @@ -654,42 +638,54 @@ void __init mtrr_bp_init(void)
>> (boot_cpu_data.x86_stepping == 0x3 ||
>> boot_cpu_data.x86_stepping == 0x4))
>> phys_addr = 36;
>> -
>> - size_or_mask = SIZE_OR_MASK_BITS(phys_addr);
>> - size_and_mask = ~size_or_mask & 0xfffff00000ULL;
>> } else if (boot_cpu_data.x86_vendor == X86_VENDOR_CENTAUR &&
>> boot_cpu_data.x86 == 6) {
>> /*
>> * VIA C* family have Intel style MTRRs,
>> * but don't support PAE
>> */
>> - size_or_mask = SIZE_OR_MASK_BITS(32);
>> - size_and_mask = 0;
>> phys_addr = 32;
>> }
>> + }
>> +
>> + size_or_mask = ~((1ULL << ((phys_addr) - PAGE_SHIFT)) - 1);
>
> Too many brackets because you've taken the macro and put in the argument
> directly.
Oh, yes.
> In any case, reviewing patches which do code movement *and* changes in
> the same diff is always unnecessarily nasty. Please do the mechanical
> code movement only - cleanups come ontop.
Okay.
Juergen
On 13.02.23 02:07, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>>
>> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
>> guests or when running as a SEV-SNP guest under Hyper-V). Typically
>> the hypervisor will reset the MTRR feature in cpuid data, resulting
>> in no MTRR memory type information being available for the kernel.
>>
>> This has turned out to result in problems:
>>
>> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
>> - Xen PV dom0 mapping memory as WB which should be UC- instead
>>
>> Solve those problems by supporting to set a fixed MTRR state,
>> overwriting the empty state used today. In case such a state has been
>> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
>> will only be used by mtrr_type_lookup(), as in all other cases
>> mtrr_enabled() is being checked, which will return false. Accept the
>> overwrite call only in case of MTRRs being disabled in cpuid.
>>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> V2:
>> - new patch
>> ---
>> arch/x86/include/asm/mtrr.h | 2 ++
>> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
>> 3 files changed, 49 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
>> index f0eeaf6e5f5f..0b8f51d683dc 100644
>> --- a/arch/x86/include/asm/mtrr.h
>> +++ b/arch/x86/include/asm/mtrr.h
>> @@ -31,6 +31,8 @@
>> */
>> # ifdef CONFIG_MTRR
>> void mtrr_bp_init(void);
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type *fixed, mtrr_type def_type);
>
> Could you add a stub for the !CONFIG_MTRR case? Then the
> #ifdef CONFIG_MTRR could be removed in Patch 3 of this series.
I was on the edge whether to add a stub. The Xen use case strongly
suggests that the code wants to be inside an #ifdef, while the Hyper-V
case is so simple, that it would benefit from the stub. As there was
another #ifdef just above the added code in mshyperv.c I believed it
would be fine without a stub. As you seem to like it better with the
stub, I can add it.
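Something like this in mtrr.h, I guess (sketch only):

	static inline void mtrr_overwrite_state(struct mtrr_var_range *var,
						unsigned int num_var,
						mtrr_type *fixed,
						mtrr_type def_type)
	{
	}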
Juergen
On 13.02.23 02:07, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>>
>> In order to avoid mappings using the UC- cache attribute, set the
>> MTRR state to use WB caching as the default.
>>
>> This is needed in order to cope with the fact that PAT is enabled,
>> while MTRRs are disabled by the hypervisor.
>>
>> Fixes: 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled case")
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> V2:
>> - new patch
>> ---
>> arch/x86/kernel/cpu/mshyperv.c | 8 ++++++++
>> 1 file changed, 8 insertions(+)
>>
>> diff --git a/arch/x86/kernel/cpu/mshyperv.c b/arch/x86/kernel/cpu/mshyperv.c
>> index 46668e255421..51e47dc0e987 100644
>> --- a/arch/x86/kernel/cpu/mshyperv.c
>> +++ b/arch/x86/kernel/cpu/mshyperv.c
>> @@ -34,6 +34,7 @@
>> #include <clocksource/hyperv_timer.h>
>> #include <asm/numa.h>
>> #include <asm/coco.h>
>> +#include <asm/mtrr.h>
>>
>> /* Is Linux running as the root partition? */
>> bool hv_root_partition;
>> @@ -335,6 +336,13 @@ static void __init ms_hyperv_init_platform(void)
>> static_branch_enable(&isolation_type_snp);
>> #ifdef CONFIG_SWIOTLB
>> swiotlb_unencrypted_base = ms_hyperv.shared_gpa_boundary;
>> +#endif
>
> Unfortunately, Hyper-V does not filter out the MTRR flag from the
> CPUID leaf, so this code also needs
>
> setup_clear_cpu_cap(X86_FEATURE_MTRR);
>
> before calling mtrr_overwrite_state(). I've got a bug filed for the
> Hyper-V team to fix the flag, but clearing the feature here solves the
> problem regardless.
Okay, will add it.
>
>> +#ifdef CONFIG_MTRR
>
> Hopefully this #ifdef can go away, per my comment in Patch 2 of
> the series.
As said already, fine with me.
Juergen
On 13.02.23 02:08, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>>
>> Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
>> WB or INVALID after calling mtrr_type_lookup(). Those tests can be
>> dropped, as the only reason to not use a large mapping would be
>> uniform being 0. Any MTRR type can be accepted as long as it applies
>> to the whole memory range covered by the mapping, as the alternative
>> would only be to map the same region with smaller pages instead using
>> the same PAT type as for the large mapping.
>>
>> Suggested-by: Linus Torvalds <[email protected]>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> arch/x86/mm/pgtable.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index e4f499eb0f29..7b9c5443d176 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
>> u8 mtrr, uniform;
>>
>> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>> - (mtrr != MTRR_TYPE_WRBACK))
>> + if (!uniform)
>> return 0;
>>
>> /* Bail out if we are we on a populated non-leaf entry: */
>> @@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
>> u8 mtrr, uniform;
>>
>> mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>> - (mtrr != MTRR_TYPE_WRBACK)) {
>> + if (!uniform) {
>> pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
>> __func__, addr, addr + PMD_SIZE);
>
> I'm seeing this warning trigger in a normal Hyper-V guest (i.e., *not* an
> SEV-SNP Confidential VM). The original filtering here based on
> MTRR_TYPE_WRBACK appears to be hiding a bug in mtrr_type_lookup_variable()
> where it incorrectly thinks an address range matches two different variable
> MTRRs, and hence clears "uniform".
>
> Here are the variable MTRRs in the normal Hyper-V guest with 32 GiBytes
> of memory:
>
> [ 0.043592] MTRR variable ranges enabled:
> [ 0.048308] 0 base 000000000000 mask FFFF00000000 write-back
> [ 0.057450] 1 base 000100000000 mask FFF000000000 write-back
> [ 0.063972] 2 disabled
> [ 0.066755] 3 disabled
> [ 0.070024] 4 disabled
> [ 0.072856] 5 disabled
> [ 0.076112] 6 disabled
> [ 0.078760] 7 disabled
>
> Variable MTRR #0 covers addresses up to 4 GiByte, while #1 covers
> 4 GiByte to 64 GiByte. But in mtrr_type_lookup_variable(), address
> range 0xF8000000 to 0xF81FFFFF is matching both MTRRs, when it
> should be matching just #0.
>
> The problem looks to be this code in mtrr_type_lookup_variable():
>
> if ((start & mask) != (base & mask))
> continue;
>
> If the mask bits of start and base are different, then the
> MTRR doesn't match, and the continue statement should be
> executed. That's correct. But if the mask bits are the same,
> that's not sufficient for the MTRR to match. If the end
> address is less than base, the MTRR doesn't match, and
> the continue statement should still be executed, which
> isn't happening.
>
> But somebody please check my thinking. :-)
I don't see a flaw in your reasoning.
Rick mentioned a problem with this patch in a KVM guest. I'll try to
reproduce his setup for checking whether fixing mtrr_type_lookup_variable()
is enough, or if we need to keep the tests for WB in this patch.
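(If the WB test has to stay, the check would presumably become something
like the following, keeping the uniform simplification - sketch only:)

	mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
	if (!uniform && mtrr != MTRR_TYPE_WRBACK)
		return 0;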
Juergen
From: Juergen Gross <[email protected]> Sent: Sunday, February 12, 2023 10:27 PM
>
> On 13.02.23 02:07, Michael Kelley (LINUX) wrote:
> > From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
> >>
> >> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> >> guests or when running as a SEV-SNP guest under Hyper-V). Typically
> >> the hypervisor will reset the MTRR feature in cpuid data, resulting
> >> in no MTRR memory type information being available for the kernel.
> >>
> >> This has turned out to result in problems:
> >>
> >> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
> >> - Xen PV dom0 mapping memory as WB which should be UC- instead
> >>
> >> Solve those problems by supporting to set a fixed MTRR state,
> >> overwriting the empty state used today. In case such a state has been
> >> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
> >> will only be used by mtrr_type_lookup(), as in all other cases
> >> mtrr_enabled() is being checked, which will return false. Accept the
> >> overwrite call only in case of MTRRs being disabled in cpuid.
> >>
> >> Signed-off-by: Juergen Gross <[email protected]>
> >> ---
> >> V2:
> >> - new patch
> >> ---
> >> arch/x86/include/asm/mtrr.h | 2 ++
> >> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
> >> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
> >> 3 files changed, 49 insertions(+)
> >>
> >> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
> >> index f0eeaf6e5f5f..0b8f51d683dc 100644
> >> --- a/arch/x86/include/asm/mtrr.h
> >> +++ b/arch/x86/include/asm/mtrr.h
> >> @@ -31,6 +31,8 @@
> >> */
> >> # ifdef CONFIG_MTRR
> >> void mtrr_bp_init(void);
> >> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> >> + mtrr_type *fixed, mtrr_type def_type);
> >
> > Could you add a stub for the !CONFIG_MTRR case? Then the
> > #ifdef CONFIG_MTRR could be removed in Patch 3 of this series.
>
> I was on the edge whether to add a stub. The Xen use case strongly
> suggests that the code wants to be inside an #ifdef, while the Hyper-V
> case is so simple, that it would benefit from the stub. As there was
> another #ifdef just above the added code in mshyperv.c I believed it
> would be fine without a stub. As you seem to like it better with the
> stub, I can add it.
>
Thanks. And that other #ifdef is going away soon ...
Michael
On Thu, Feb 09, 2023 at 08:22:14AM +0100, Juergen Gross wrote:
> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> guests or when running as a SEV-SNP guest under Hyper-V). Typically
> the hypervisor will reset the MTRR feature in cpuid data, resulting
> in no MTRR memory type information being available for the kernel.
>
> This has turned out to result in problems:
>
> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
> - Xen PV dom0 mapping memory as WB which should be UC- instead
>
> Solve those problems by supporting to set a fixed MTRR state,
> overwriting the empty state used today. In case such a state has been
> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
> will only be used by mtrr_type_lookup(), as in all other cases
> mtrr_enabled() is being checked, which will return false. Accept the
> overwrite call only in case of MTRRs being disabled in cpuid.
s/cpuid/CPUID/g
> Signed-off-by: Juergen Gross <[email protected]>
> ---
> V2:
> - new patch
> ---
> arch/x86/include/asm/mtrr.h | 2 ++
> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
> 3 files changed, 49 insertions(+)
>
> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
> index f0eeaf6e5f5f..0b8f51d683dc 100644
> --- a/arch/x86/include/asm/mtrr.h
> +++ b/arch/x86/include/asm/mtrr.h
> @@ -31,6 +31,8 @@
> */
> # ifdef CONFIG_MTRR
> void mtrr_bp_init(void);
> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> + mtrr_type *fixed, mtrr_type def_type);
> extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
> extern void mtrr_save_fixed_ranges(void *);
> extern void mtrr_save_state(void);
> diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
> index ee09d359e08f..788bc16888a5 100644
> --- a/arch/x86/kernel/cpu/mtrr/generic.c
> +++ b/arch/x86/kernel/cpu/mtrr/generic.c
> @@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
> return mtrr_state.def_type;
> }
>
> +/**
> + * mtrr_overwrite_state - set fixed MTRR state
fixed only? You pass in variable too...
> + *
> + * Used to set MTRR state via different means (e.g. with data obtained from
> + * a hypervisor).
> + */
> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> + mtrr_type *fixed, mtrr_type def_type)
> +{
> + unsigned int i;
> +
> + if (boot_cpu_has(X86_FEATURE_MTRR))
check_for_deprecated_apis: WARNING: arch/x86/kernel/cpu/mtrr/generic.c:254: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
> + return;
So this here needs to check:
if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
!(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
cpu_feature_enabled(X86_FEATURE_XENPV))) {
WARN_ON_ONCE(1);
return;
}
as we don't want this to be called somewhere or by something else.
The SEV_SNP flag can be used from:
https://lore.kernel.org/r/[email protected]
I'm assuming here HyperV SEV-SNP guests really do set that feature flag
(they better). We can expedite that patch ofc.
And for dom0 I *think* we use X86_FEATURE_XENPV but I leave that to you.
> +
> + if (var) {
> + if (num_var > MTRR_MAX_VAR_RANGES) {
> + pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
> + num_var);
What's that check for? Sanity of callers?
> + num_var = MTRR_MAX_VAR_RANGES;
> + }
> + for (i = 0; i < num_var; i++)
> + mtrr_state.var_ranges[i] = var[i];
> + num_var_ranges = num_var;
> + }
> +
> + if (fixed) {
> + for (i = 0; i < MTRR_NUM_FIXED_RANGES; i++)
You're not doing this sanity check here, expecting that callers would
know what they're doing...
> + mtrr_state.fixed_ranges[i] = fixed[i];
> + mtrr_state.enabled |= MTRR_STATE_MTRR_FIXED_ENABLED;
> + mtrr_state.have_fixed = 1;
> + }
> +
> + mtrr_state.def_type = def_type;
> + mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
> +
> + mtrr_state_set = 1;
> +}
I can't say that I'm crazy about the call sites:
mtrr_overwrite_state(NULL, 0, NULL, MTRR_TYPE_WRBACK);
This looks like it wants a
mtrr_override_def_type(MTRR_TYPE_WRBACK);
instead of passing in all those nulls as params.
This:
mtrr_overwrite_state(var, reg, NULL, MTRR_TYPE_UNCACHABLE);
I guess is a bit better.
Dunno, if it is only those two callers we can say, meh, whatever, this
interface is not pretty but does the job at least. But if more users
start popping up then I guess we can do
mtrr_override_fixed()
mtrr_override_variable()
mtrr_override_def_type()
...
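I.e., something like (sketch only, using the names from above):

	void mtrr_override_def_type(mtrr_type def_type)
	{
		mtrr_overwrite_state(NULL, 0, NULL, def_type);
	}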
> /**
> * mtrr_type_lookup - look up memory type in MTRR
> *
> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> index 542ca5639dfd..b73fe243c7fd 100644
> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> @@ -668,6 +668,15 @@ void __init mtrr_bp_init(void)
> const char *why = "(not available)";
> unsigned int phys_addr;
>
> + if (mtrr_state.enabled) {
Not crazy about this either: this relies on the fragile boot ordering
where init_hypervisor_platform() runs before this so it has a chance
that mtrr_state.enabled will be already set.
Yeah, yeah, cache_bp_init() and all the MTRR BSP setup stuff happens
after it but there should at least be a comment over
init_hypervisor_platform()'s call site in setup_arch() stating that
cache_bp_init() needs to happen *after* it because <reason>.
I think we should also check
x86_hyper_type
here and not do anything if not set. As this is all HV-related muck.
Xen I guess is a bit better because that call there happens even earlier
but we need the comments to say that the ordering matters because future
reorganization could cause it to blow up and people would search
themselves crazy why in the hell it breaks...
Can Xen use x86_hyper_type() too?
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, 13 Feb 2023, Juergen Gross wrote:
> On 10.02.23 19:59, Linux regression tracking (Thorsten Leemhuis) wrote:
> > Hi, this is your Linux kernel regression tracker.
> >
> > On 09.02.23 08:22, Juergen Gross wrote:
> > > Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
> > > case") has introduced a regression with Xen.
> > >
> > > Revert the patch.
> >
> > That regression you refer to is afaics one I'm tracking[1] that was
> > introduced this cycle. That makes me wonder: could this patch be applied
> > directly to fix the issue quickly? Or are patches 1 to 4 needed as well
> > (or the whole series?) to avoid other problems?
>
> Patches 1-4 are needed, too, as otherwise the issue claimed to be fixed
> with patch 5 would show up again.
The (last?) -rc8 version was released yesterday. Would it be possible to
include at least (only) the revert in mainline so that 6.2 will be
released with a working storage configuration under Xen?
Otherwise one would have to carry around that single revert manually until
this patch series has landed in mainline, or convince all the
distributions to do so :-\
Anyway, thanks for fixing this problem, I did not expect this to be such a
complicated issue when I reported that thing :-)
Christian.
--
BOFH excuse #52:
Smell from unhygienic janitorial staff wrecked the tape heads
On 13.02.23 12:39, Borislav Petkov wrote:
> On Thu, Feb 09, 2023 at 08:22:14AM +0100, Juergen Gross wrote:
>> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
>> guests or when running as a SEV-SNP guest under Hyper-V). Typically
>> the hypervisor will reset the MTRR feature in cpuid data, resulting
>> in no MTRR memory type information being available for the kernel.
>>
>> This has turned out to result in problems:
>>
>> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
>> - Xen PV dom0 mapping memory as WB which should be UC- instead
>>
>> Solve those problems by supporting to set a fixed MTRR state,
>> overwriting the empty state used today. In case such a state has been
>> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
>> will only be used by mtrr_type_lookup(), as in all other cases
>> mtrr_enabled() is being checked, which will return false. Accept the
>> overwrite call only in case of MTRRs being disabled in cpuid.
>
> s/cpuid/CPUID/g
Okay.
>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> V2:
>> - new patch
>> ---
>> arch/x86/include/asm/mtrr.h | 2 ++
>> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
>> 3 files changed, 49 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
>> index f0eeaf6e5f5f..0b8f51d683dc 100644
>> --- a/arch/x86/include/asm/mtrr.h
>> +++ b/arch/x86/include/asm/mtrr.h
>> @@ -31,6 +31,8 @@
>> */
>> # ifdef CONFIG_MTRR
>> void mtrr_bp_init(void);
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type *fixed, mtrr_type def_type);
>> extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
>> extern void mtrr_save_fixed_ranges(void *);
>> extern void mtrr_save_state(void);
>> diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
>> index ee09d359e08f..788bc16888a5 100644
>> --- a/arch/x86/kernel/cpu/mtrr/generic.c
>> +++ b/arch/x86/kernel/cpu/mtrr/generic.c
>> @@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
>> return mtrr_state.def_type;
>> }
>>
>> +/**
>> + * mtrr_overwrite_state - set fixed MTRR state
>
> fixed only? You pass in variable too...
Fixed in the sense of static.
>
>> + *
>> + * Used to set MTRR state via different means (e.g. with data obtained from
>> + * a hypervisor).
>> + */
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type *fixed, mtrr_type def_type)
>> +{
>> + unsigned int i;
>> +
>> + if (boot_cpu_has(X86_FEATURE_MTRR))
>
> check_for_deprecated_apis: WARNING: arch/x86/kernel/cpu/mtrr/generic.c:254: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
Okay.
>
>> + return;
>
> So this here needs to check:
>
> if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
> !(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
> cpu_feature_enabled(X86_FEATURE_XENPV))) {
> WARN_ON_ONCE(1);
> return;
> }
>
> as we don't want this to be called somewhere or by something else.
Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
I'm not sure we won't need that for TDX guests, too.
>
> The SEV_SNP flag can be used from:
>
> https://lore.kernel.org/r/[email protected]
>
> I'm assuming here HyperV SEV-SNP guests really do set that feature flag
> (they better). We can expedite that patch ofc.
>
> And for dom0 I *think* we use X86_FEATURE_XENPV but I leave that to you.
Yes, it is only relevant for PV dom0.
>
>> +
>> + if (var) {
>> + if (num_var > MTRR_MAX_VAR_RANGES) {
>> + pr_warn("Trying to overwrite MTRR state with %u variable entries\n",
>> + num_var);
>
> What's that check for? Sanity of callers?
Yes.
>
>> + num_var = MTRR_MAX_VAR_RANGES;
>> + }
>> + for (i = 0; i < num_var; i++)
>> + mtrr_state.var_ranges[i] = var[i];
>> + num_var_ranges = num_var;
>> + }
>> +
>> + if (fixed) {
>> + for (i = 0; i < MTRR_NUM_FIXED_RANGES; i++)
>
> You're not doing this sanity check here, expecting that callers would
> know what they're doing...
The number of fixed MTRRs is not dynamic AFAIK.
>
>> + mtrr_state.fixed_ranges[i] = fixed[i];
>> + mtrr_state.enabled |= MTRR_STATE_MTRR_FIXED_ENABLED;
>> + mtrr_state.have_fixed = 1;
>> + }
>> +
>> + mtrr_state.def_type = def_type;
>> + mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
>> +
>> + mtrr_state_set = 1;
>> +}
>
> I can't say that I'm crazy about the call sites:
>
> mtrr_overwrite_state(NULL, 0, NULL, MTRR_TYPE_WRBACK);
>
> This looks like it wants a
>
> mtrr_override_def_type(MTRR_TYPE_WRBACK);
>
> instead of passing in all those nulls as params.
>
> This:
>
> mtrr_overwrite_state(var, reg, NULL, MTRR_TYPE_UNCACHABLE);
>
> I guess is a bit better.
>
> Dunno, if it is only those two callers we can say, meh, whatever, this
> interface is not pretty but does the job at least. But if more users
> start popping up then I guess we can do
>
> mtrr_override_fixed()
> mtrr_override_variable()
> mtrr_override_def_type()
A single interface makes it easier to avoid multiple calls.
In the end I'm fine with either way.
>
> ...
>
>
>> /**
>> * mtrr_type_lookup - look up memory type in MTRR
>> *
>> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> index 542ca5639dfd..b73fe243c7fd 100644
>> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
>> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
>> @@ -668,6 +668,15 @@ void __init mtrr_bp_init(void)
>> const char *why = "(not available)";
>> unsigned int phys_addr;
>>
>> + if (mtrr_state.enabled) {
>
> Not crazy about this either: this relies on the fragile boot ordering
> where init_hypervisor_platform() runs before this so it has a chance
> that mtrr_state.enabled will be already set.
>
> Yeah, yeah, cache_bp_init() and all the MTRR BSP setup stuff happens
> after it but there should at least be a comment over
> init_hypervisor_platform()'s call site in setup_arch() stating that
> cache_bp_init() needs to happen *after* it because <reason>.
I'll add that comment.
>
> I think we should also check
>
> x86_hyper_type
>
> here and not do anything if not set. As this is all HV-related muck.
>
> Xen I guess is a bit better because that call there happens even earlier
> but we need the comments to say that the ordering matters because future
> reorganization could cause it to blow up and people would search
> themselves crazy why in the hell it breaks...
>
> Can Xen use x86_hyper_type() too?
It does.
Juergen
On Mon, Feb 13, 2023 at 03:07:07PM +0100, Juergen Gross wrote:
> Fixed in the sense of static.
Well, you can't use "fixed" to say "static" when the former means something
very specific already in MTRR land.
> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>
> I'm not sure we won't need that for TDX guests, too.
See, that's the problem. I wanna have it simple too. Lemme check with
dhansen.
> Yes, it is only relevant for PV dom0.
Right, I was asking whether "PV dom0" == X86_FEATURE_XENPV?
:)
> The number of fixed MTRRs is not dynamic AFAIK.
But nothing guarantees that the caller would pass an array "mtrr_type
*fixed" of size MTRR_NUM_FIXED_RANGES, right?
> A single interface makes it easier to avoid multiple calls.
>
> In the end I'm fine with either way.
Yeah, I know. Question is, how much of this functionality will be
needed/used so that we can go all out on the interface design or we can
do a single one and forget about it...
> > Can Xen use x86_hyper_type() too?
>
> It does.
Then pls add a x86_hyper_type check too to make sure a potential move of
this call is caught in the future.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Feb 13, 2023 at 04:03:07PM +0100, Borislav Petkov wrote:
> > Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
> >
> > I'm not sure we won't need that for TDX guests, too.
>
> See, that's the problem. I wanna have it simple too. Lemme check with
> dhansen.
He says MTRRs are enabled in TDX guests: "X86_FEATURE_MTRR is fixed to
1 in TDX guests."
So we will have to do the finer-grained check I guess.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13.02.23 16:11, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 04:03:07PM +0100, Borislav Petkov wrote:
>>> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>>>
>>> I'm not sure we won't need that for TDX guests, too.
>>
>> See, that's the problem. I wanna have it simple too. Lemme check with
>> dhansen.
>
> He says MTRRs are enabled in TDX guests: "X86_FEATURE_MTRR is fixed to
> 1 in TDX guests."
>
> So we will have to do the finer-grained check I guess.
Isn't the check for !X86_FEATURE_MTRR && X86_FEATURE_HYPERVISOR enough
then?
Yes, you still could construct cases where it would go wrong, but I don't
think we should over-engineer it.
Juergen
On 2/13/23 07:11, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 04:03:07PM +0100, Borislav Petkov wrote:
>>> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>>>
>>> I'm not sure we won't need that for TDX guests, too.
>> See, that's the problem. I wanna have it simple too. Lemme check with
>> dhansen.
> He says MTRRs are enabled in TDX guests: "X86_FEATURE_MTRR is fixed to
> 1 in TDX guests."
>
> So we will have to do the finer-grained check I guess.
Yes, TDX guests see MTRRs as being supported. But, the TDX module also
appears to inject a #VE for all RDMSR or WRMSR's to the MTRRs. That
makes them effectively useless.
I actually don't know what the heck TDX guests are supposed to do if
they feel like mucking with the MSRs. The architecture (CPUID) is
essentially telling them: "Sure, go ahead MTRRs are fiiiiiiine". But
the TDX module is sitting there throwing exceptions (#VE) if the guest
tries to touch MTRRs.
It sounds like there are some guest<->host ABIs on Xen to help the
guests do this. But I don't see anything in the TDX "GHCI" about it.
On 13.02.23 16:03, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 03:07:07PM +0100, Juergen Gross wrote:
>> Fixed in the sense of static.
>
> Well, you can't use "fixed" to say "static" when the former means something
> very specific already in MTRR land.
>
>> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>>
>> I'm not sure we won't need that for TDX guests, too.
>
> See, that's the problem. I wanna have it simple too. Lemme check with
> dhansen.
>
>> Yes, it is only relevant for PV dom0.
>
> Right, I was asking whether "PV dom0" == X86_FEATURE_XENPV?
No, you can have PV guests not being dom0.
>
> :)
>
>> The number of fixed MTRRs is not dynamic AFAIK.
>
> But nothing guarantees that the caller would pass an array "mtrr_type
> *fixed" of size MTRR_NUM_FIXED_RANGES, right?
Right.
In the end I wouldn't mind dropping the fixed MTRRs from the interface, as
they are currently not needed at all.
>
>> A single interface makes it easier to avoid multiple calls.
>>
>> In the end I'm fine with either way.
>
> Yeah, I know. Question is, how much of this functionality will be
> needed/used so that we can go all out on the interface design or we can
> do a single one and forget about it...
I'd say we go with what is needed right now. And having a single interface
makes all the sanity checking you are asking for easier.
>
>>> Can Xen use x86_hyper_type() too?
>>
>> It does.
>
> Then pls add a x86_hyper_type check too to make sure a potential move of
> this call is caught in the future.
What are you especially asking for?
With my current patches Xen PV dom0 will call mtrr_overwrite_state() before
x86_hyper_type is set, while a Hyper-V SEV-SNP guest will make the call after
it has been set. Both calls happen before cache_bp_init().
So I could move the mtrr_overwrite_state() call for Xen PV dom0 into its
init_platform() callback and check in mtrr_overwrite_state() x86_hyper_type
to be set, or I could reject a call of mtrr_overwrite_state() after the call
of cache_bp_init() has happened, or I could do both.
Juergen
On 13.02.23 16:27, Dave Hansen wrote:
> On 2/13/23 07:11, Borislav Petkov wrote:
>> On Mon, Feb 13, 2023 at 04:03:07PM +0100, Borislav Petkov wrote:
>>>> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>>>>
>>>> I'm not sure we won't need that for TDX guests, too.
>>> See, that's the problem. I wanna have it simple too. Lemme check with
>>> dhansen.
>> He says MTRRs are enabled in TDX guests: "X86_FEATURE_MTRR is fixed to
>> 1 in TDX guests."
>>
>> So we will have to do the finer-grained check I guess.
>
> Yes, TDX guests see MTRRs as being supported. But, the TDX module also
> appears to inject a #VE for all RDMSR or WRMSR's to the MTRRs. That
> makes them effectively useless.
>
> I actually don't know what the heck TDX guests are supposed to do if
> they feel like mucking with the MSRs. The architecture (CPUID) is
> essentially telling them: "Sure, go ahead MTRRs are fiiiiiiine". But
> the TDX module is sitting there throwing exceptions (#VE) if the guest
> tries to touch MTRRs.
>
> It sounds like there are some guest<->host ABIs on Xen to help the
> guests do this. But I don't see anything in the TDX "GHCI" about it.
This is in line with the PAT init sequence of TDX guests. PAT is said to
be supported, but a TDX guest can't use the sequence as written in the
SDM for setting the PAT MSR (disable caches, etc.).
Juergen
On Mon, Feb 13, 2023 at 04:18:48PM +0100, Juergen Gross wrote:
> Yes, you still could construct cases where it would go wrong, but I don't
> think we should over-engineer it.
Actually, we should allow only those for which we know they get special
treatment for MTRR settings and warn for all the rest.
And judging by Dave's reply, I think TDX should be in that category too
since it throws #VEs...
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13.02.23 16:40, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 04:18:48PM +0100, Juergen Gross wrote:
>> Yes, you still could construct cases where it would go wrong, but I don't
>> think we should over-engineer it.
>
> Actually, we should allow only those for which we know they get special
> treatment for MTRR settings and warn for all the rest.
>
> And judging by Dave's reply, I think TDX should be in that category too
> since it throws #VEs...
>
Okay, and it has MTRRs enabled (as Hyper-V SEV-SNP guests), so I shouldn't
test that, I guess (or we should disable the feature before calling the
overwrite function).
Juergen
On 13.02.23 12:46, Christian Kujau wrote:
> On Mon, 13 Feb 2023, Juergen Gross wrote:
>> On 10.02.23 19:59, Linux regression tracking (Thorsten Leemhuis) wrote:
>>> Hi, this is your Linux kernel regression tracker.
>>>
>>> On 09.02.23 08:22, Juergen Gross wrote:
>>>> Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
>>>> case") has introduced a regression with Xen.
>>>>
>>>> Revert the patch.
>>>
>>> That regression you refer to is afaics one I'm tracking[1] that was
>>> introduced this cycle. That makes me wonder: could this patch be applied
>>> directly to fix the issue quickly? Or are patches 1 to 4 needed as well
>>> (or the whole series?) to avoid other problems?
>>
>> Patches 1-4 are needed, too, as otherwise the issue claimed to be fixed
>> with patch 5 would show up again.
>
> The (last?) -rc8 version was released yesterday. Would it be possible to
> include at least (only) the revert in mainline so that 6.2 will be
> released with a working storage configuration under Xen?
Hmm, this would make Hyper-V SEV-SNP guests slow again.
I'm not completely against it, but OTOH I'm a little bit biased as the
maintainer of the Xen code. :-)
Michael, would you see major problems with doing the revert before having
the final patches for fixing your issue, too?
> Otherwise one would have to carry around that single revert manually until
> this patch series has landed in mainline, or convince all the
> distributions to do so :-\
>
> Anyway, thanks for fixing this problem, I did not expect this to be such a
> complicated issue when I reported that thing :-)
Yes, I have opened a can of worms with my MTRR/PAT disentangling.
Juergen
From: Juergen Gross <[email protected]>
>
> On 13.02.23 12:46, Christian Kujau wrote:
> > On Mon, 13 Feb 2023, Juergen Gross wrote:
> >> On 10.02.23 19:59, Linux regression tracking (Thorsten Leemhuis) wrote:
> >>> Hi, this is your Linux kernel regression tracker.
> >>>
> >>> On 09.02.23 08:22, Juergen Gross wrote:
> >>>> Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
> >>>> case") has introduced a regression with Xen.
> >>>>
> >>>> Revert the patch.
> >>>
> >>> That regression you refer to is afaics one I'm tracking[1] that was
> >>> introduced this cycle. That makes me wonder: could this patch be applied
> >>> directly to fix the issue quickly? Or are patches 1 to 4 needed as well
> >>> (or the whole series?) to avoid other problems?
> >>
> >> Patches 1-4 are needed, too, as otherwise the issue claimed to be fixed
> >> with patch 5 would show up again.
> >
> > The (last?) -rc8 version was released yesterday. Would it be possible to
> > include at least (only) the revert in mainline so that 6.2 will be
> > released with a working storage configuration under Xen?
>
> Hmm, this would make Hyper-V SEV-SNP guests slow again.
>
> I'm not completely against it, but OTOH I'm a little bit biased as the
> maintainer of the Xen code. :-)
>
> Michael, would you see major problems with doing the revert before having
> the final patches for fixing your issue, too?
>
I'm OK with doing the revert. It's probably the right tradeoff for the
broader community because the Hyper-V use case is more narrow and
requires more curation for other reasons. The use case is the Azure
public cloud, and we can pretty much make sure that one of the solutions
is applied to kernels used with SEV-SNP in that environment.
Michael
> > Otherwise one would have to carry around that single revert manually until
> > this patch series has landed in mainline, or convince all the
> > distributions to do so :-\
> >
> > Anyway, thanks for fixing this problem, I did not expect this to be such a
> > complicated issue when I reported that thing :-)
>
> Yes, I have opened a can of worms with my MTRR/PAT disentangling.
>
>
> Juergen
On 13.02.23 18:01, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]>
>>
>> On 13.02.23 12:46, Christian Kujau wrote:
>>> On Mon, 13 Feb 2023, Juergen Gross wrote:
>>>> On 10.02.23 19:59, Linux regression tracking (Thorsten Leemhuis) wrote:
>>>>> Hi, this is your Linux kernel regression tracker.
>>>>>
>>>>> On 09.02.23 08:22, Juergen Gross wrote:
>>>>>> Commit 90b926e68f50 ("x86/pat: Fix pat_x_mtrr_type() for MTRR disabled
>>>>>> case") has introduced a regression with Xen.
>>>>>>
>>>>>> Revert the patch.
>>>>>
>>>>> That regression you refer to is afaics one I'm tracking[1] that was
>>>>> introduced this cycle. That makes me wonder: could this patch be applied
>>>>> directly to fix the issue quickly? Or are patches 1 to 4 needed as well
>>>>> (or the whole series?) to avoid other problems?
>>>>
>>>> Patches 1-4 are needed, too, as otherwise the issue claimed to be fixed
>>>> with patch 5 would show up again.
>>>
>>> The (last?) -rc8 version was released yesterday. Would it be possible to
>>> include at least (only) the revert in mainline so that 6.2 will be
>>> released with a working storage configuration under Xen?
>>
>> Hmm, this would make Hyper-V SEV-SNP guests slow again.
>>
>> I'm not completely against it, but OTOH I'm a little bit biased as the
>> maintainer of the Xen code. :-)
>>
>> Michael, would you see major problems with doing the revert before having
>> the final patches for fixing your issue, too?
>>
>
> I'm OK with doing the revert. It's probably the right tradeoff for the
> broader community because the Hyper-V use case is more narrow and
> requires more curation for other reasons. The use case is the Azure
> public cloud, and we can pretty much make sure that one of the solutions
> is applied to kernels used with SEV-SNP in that environment.
Thanks.
Boris, would you take the revert (patch 5 of my series) via x86/urgent, please?
Juergen
On Mon, 2023-02-13 at 07:12 +0100, Juergen Gross wrote:
>
> Thanks for the report.
>
> I'll have a look. Probably I'll need to re-add the check for WB in
> patch 7.
Sure, let me know if you need any more details about my setup.
On Mon, Feb 13, 2023 at 04:36:12PM +0100, Juergen Gross wrote:
> In the end I wouldn't mind dropping the fixed MTRRs from the interface, as
> they are currently not needed at all.
Yes, the less the better.
> I'd say we go with what is needed right now. And having a single interface
> makes all the sanity checking you are asking for easier.
I guess I need to remember to finish designing this if more users
appear...
> What are you especially asking for?
>
> With my current patches Xen PV dom0 will call mtrr_overwrite_state() before
> x86_hyper_type is set, while a Hyper-V SEV-SNP guest will make the call after
> it has been set. Both calls happen before cache_bp_init().
>
> So I could move the mtrr_overwrite_state() call for Xen PV dom0 into its
> init_platform() callback and check in mtrr_overwrite_state() x86_hyper_type
> to be set,
I believe that is good enough, see below.
> or I could reject a call of mtrr_overwrite_state() after the call of
> cache_bp_init() has happened, or I could do both.
I think one thing is enough as we'll be loud enough.
---
diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
index b73fe243c7fd..2dbe2c10e959 100644
--- a/arch/x86/kernel/cpu/mtrr/mtrr.c
+++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
@@ -49,6 +49,7 @@
#include <asm/cacheinfo.h>
#include <asm/cpufeature.h>
#include <asm/e820/api.h>
+#include <asm/hypervisor.h>
#include <asm/mtrr.h>
#include <asm/msr.h>
#include <asm/memtype.h>
@@ -668,7 +669,12 @@ void __init mtrr_bp_init(void)
const char *why = "(not available)";
unsigned int phys_addr;
+#ifdef CONFIG_HYPERVISOR_GUEST
if (mtrr_state.enabled) {
+
+ /* This should not happen without a hypervisor present. */
+ WARN_ON_ONCE(!x86_hyper_type);
+
/* Software overwrite of MTRR state, only for generic case. */
mtrr_calc_physbits(true);
init_table();
@@ -676,6 +682,7 @@ void __init mtrr_bp_init(void)
return;
}
+#endif
phys_addr = mtrr_calc_physbits(boot_cpu_has(X86_FEATURE_MTRR));
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Mon, Feb 13, 2023 at 04:44:09PM +0100, Juergen Gross wrote:
> Okay, and it has MTRRs enabled (as Hyper-V SEV-SNP guests), so I shouldn't
> test that, I guess (or we should disable the feature before calling the
> overwrite function).
I think we should handle TDX the same way - as if the MTRRs are
read-only there. So you can check X86_FEATURE_TDX_GUEST in addition.
Thx.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
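For illustration, the tightened guard in mtrr_overwrite_state() being discussed here might end up looking roughly like this (sketch only; which flag to test for SEV-SNP guests is still open at this point in the thread):

	/* Allow only the guest types known to need the MTRR overwrite. */
	if (!(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
	      cpu_feature_enabled(X86_FEATURE_TDX_GUEST) ||
	      cpu_feature_enabled(X86_FEATURE_XENPV))) {
		WARN_ON_ONCE(1);
		return;
	}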
On Mon, 13 Feb 2023, Juergen Gross wrote:
> Hmm, this would make Hyper-V SEV-SNP guests slow again.
>
> I'm not completely against it, but OTOH I'm a little bit biased as the
> maintainer of the Xen code. :-)
Understood. I'm a bit puzzled why nobody else reports this, maybe Xen Dom0
and external USB enclosures are not that common.
> Michael, would you see major problems with doing the revert before having
> the final patches for fixing your issue, too?
If that revert ends up in mainline, feel free to add:
Tested-by: Christian Kujau <[email protected]>
...if that makes any difference.
Thanks,
Christian.
--
BOFH excuse #199:
the curls in your keyboard cord are losing electricity.
On Mon, Feb 13, 2023 at 03:07:07PM +0100, Juergen Gross wrote:
> > So this here needs to check:
> >
> > if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
> > !(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
> > cpu_feature_enabled(X86_FEATURE_XENPV))) {
> > WARN_ON_ONCE(1);
> > return;
> > }
> >
> > as we don't want this to be called somewhere or by something else.
>
> Wouldn't !cpu_feature_enabled(X86_FEATURE_HYPERVISOR) be enough?
>
> I'm not sure we won't need that for TDX guests, too.
TDX guests are covered by X86_FEATURE_HYPERVISOR.
--
Kiryl Shutsemau / Kirill A. Shutemov
On 13.02.23 19:43, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 04:36:12PM +0100, Juergen Gross wrote:
>> In the end I wouldn't mind dropping the fixed MTRRs from the interface, as
>> they are currently not needed at all.
>
> Yes, the less the better.
>
>> I'd say we go with what is needed right now. And having a single interface
>> makes all the sanity checking you are asking for easier.
>
> I guess I need to remember to finish designing this if more users
> appear...
>
>> What are you especially asking for?
>>
>> With my current patches Xen PV dom0 will call mtrr_overwrite_state() before
>> x86_hyper_type is set, while a Hyper-V SEV-SNP guest will make the call after
>> it has been set. Both calls happen before cache_bp_init().
>>
>> So I could move the mtrr_overwrite_state() call for Xen PV dom0 into its
>> init_platform() callback and check in mtrr_overwrite_state() x86_hyper_type
>> to be set,
>
> I believe that is good enough, see below.
>
>> or I could reject a call of mtrr_overwrite_state() after the call of
>> cache_bp_init() has happened, or I could do both.
>
> I think one thing is enough as we'll be loud enough.
>
> ---
> diff --git a/arch/x86/kernel/cpu/mtrr/mtrr.c b/arch/x86/kernel/cpu/mtrr/mtrr.c
> index b73fe243c7fd..2dbe2c10e959 100644
> --- a/arch/x86/kernel/cpu/mtrr/mtrr.c
> +++ b/arch/x86/kernel/cpu/mtrr/mtrr.c
> @@ -49,6 +49,7 @@
> #include <asm/cacheinfo.h>
> #include <asm/cpufeature.h>
> #include <asm/e820/api.h>
> +#include <asm/hypervisor.h>
> #include <asm/mtrr.h>
> #include <asm/msr.h>
> #include <asm/memtype.h>
> @@ -668,7 +669,12 @@ void __init mtrr_bp_init(void)
> const char *why = "(not available)";
> unsigned int phys_addr;
>
> +#ifdef CONFIG_HYPERVISOR_GUEST
> if (mtrr_state.enabled) {
> +
> + /* This should not happen without a hypervisor present. */
> + WARN_ON_ONCE(!x86_hyper_type);
> +
> /* Software overwrite of MTRR state, only for generic case. */
> mtrr_calc_physbits(true);
> init_table();
> @@ -676,6 +682,7 @@ void __init mtrr_bp_init(void)
>
> return;
> }
> +#endif
I will change this a little bit in order to avoid the #ifdef by using
"WARN_ON(hypervisor_is_type() == X86_HYPER_NATIVE);"
Juergen
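For illustration, the early-return block in mtrr_bp_init() might then look roughly like this (assuming the <asm/hypervisor.h> include from the diff above; the rest of the early-return path is elided):

	if (mtrr_state.enabled) {
		/* An overwritten MTRR state only makes sense under a hypervisor. */
		WARN_ON(hypervisor_is_type(X86_HYPER_NATIVE));

		/* Software overwrite of MTRR state, only for generic case. */
		mtrr_calc_physbits(true);
		init_table();
		/* ... */
		return;
	}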
On 13.02.23 19:53, Borislav Petkov wrote:
> On Mon, Feb 13, 2023 at 04:44:09PM +0100, Juergen Gross wrote:
>> Okay, and it has MTRRs enabled (as Hyper-V SEV-SNP guests), so I shouldn't
>> test that, I guess (or we should disable the feature before calling the
>> overwrite function).
>
> I think we should handle TDX the same way - as if the MTRRs are
> read-only there. So you can check X86_FEATURE_TDX_GUEST in addition.
Okay, if you really want to dictate the allowed use cases (this seems to be
a layering violation), but you are the maintainer of that code.
Juergen
On 13.02.23 23:54, Christian Kujau wrote:
> On Mon, 13 Feb 2023, Juergen Gross wrote:
>> Hmm, this would make Hyper-V SEV-SNP guests slow again.
>>
>> I'm not completely against it, but OTOH I'm a little bit biased as the
>> maintainer of the Xen code. :-)
>
> Understood. I'm a bit puzzled why nobody else reports this, maybe Xen Dom0
> and external USB enclosures are not that common.
Not all USB drivers/interfaces have this problem. Your problems are with:
Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 05)
while a system I'm working with the following has no problems:
Intel Corporation 8 Series/C220 Series Chipset Family USB xHCI (rev 04)
The only difference seems to be the revision of your adapter.
>
>> Michael, would you see major problems with doing the revert before having
>> the final patches for fixing your issue, too?
>
> If that revert ends up in mainline, feel free to add:
>
> Tested-by: Christian Kujau <[email protected]>
Thanks,
Juergen
On Tue, Feb 14, 2023 at 08:04:47AM +0100, Juergen Gross wrote:
> Okay, if you really want to dictate the allowed use cases (this seems to be
Read upthread - TDX guests cause #VEs for MTRR accesses. #VEs which are
unneeded and should be avoided if possible.
> a layering violation), but you are the maintainer of that code.
And why are you so against catching misuses of this, which should
absolutely *not* be needed by anything else?
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 14.02.23 09:58, Borislav Petkov wrote:
> On Tue, Feb 14, 2023 at 08:04:47AM +0100, Juergen Gross wrote:
>> Okay, if you really want to dictate the allowed use cases (this seems to be
>
> Read upthread - TDX guests cause #VEs for MTRR accesses. #VEs which are
> unneeded and should be avoided if possible.
Of course, I don't question the need for TDX guests to use the overwrite.
>
>> a layering violation), but you are the maintainer of that code.
>
> And why are you so against catching misuses of this, which should
> absolutely *not* be needed by anything else
I just don't like the idea of trying to catch all possible misuses in
lower levels, at the same time introducing the need to modify those
tests in case a new valid use case is popping up.
But as said, you are the maintainer, so its your final decision.
Juergen
On Tue, Feb 14, 2023 at 10:02:51AM +0100, Juergen Gross wrote:
> I just don't like the idea of trying to catch all possible misuses in
> lower levels, at the same time introducing the need to modify those
> tests in case a new valid use case is popping up.
So what would you do: generally allow this so that potentially other
guest configurations misuse it?
And when we decide to change it, all those users will come running and
complaining that we broke it?
And then we're stuck with a nasty workaround in the tree because we have
to support them too?
See, all we do here is because of such misguided (or maybe didn't know
better) decisions which have happened a long time ago.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 14.02.23 10:10, Borislav Petkov wrote:
> On Tue, Feb 14, 2023 at 10:02:51AM +0100, Juergen Gross wrote:
>> I just don't like the idea of trying to catch all possible misuses in
>> lower levels, at the same time introducing the need to modify those
>> tests in case a new valid use case is popping up.
>
> So what would you do: generally allow this so that potentially other
> guest configurations misuse it?
I guess this largely depends on the functionality. I don't see why anyone
would try to use MTRR overwrite functionality without really needing it.
But maybe I'm wrong here and I'm under-estimating the "creativity" of
kernel hackers.
> And when we decide to change it, all those users will come running and
> complaining that we broke it?
>
> And then we're stuck with a nasty workaround in the tree because we have
> to support them too?
>
> See, all we do here is because of such misguided (or maybe didn't know
> better) decisions which have happened a long time ago.
I can see your point.
Maybe I haven't seen enough crazy hacks yet. :-)
No need to further discuss this topic from my side, as I have voiced my
opinion and you did so, too. I will add the tests you are asking for.
Juergen
On Tue, Feb 14, 2023 at 10:17:12AM +0100, Juergen Gross wrote:
> I guess this largely depends on the functionality. I don't see why anyone
> would try to use MTRR overwrite functionality without really needing it.
>
> But maybe I'm wrong here and I'm under-estimating the "creativity" of
> kernel hackers.
This is exactly it - if it is there, it will get used eventually.
Think of it this way: this is a special, well, kinda hack, if you will,
which *nothing* else would need. We can always relax the condition for
using it if something else appears with a valid use case.
What we can't do nearly as easily is the reverse: remove it or tighten
the check later.
So the general policy is: workarounds like this need to be as
specialized as possible.
> Maybe I haven't seen enough crazy hacks yet. :-)
You're kidding, right? You hack on Xen for a long time... :-P
> No need to further discuss this topic from my side, as I have voiced my
> opinion and you did so, too. I will add the tests you are asking for.
Thanks!
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 13.02.23 19:21, Edgecombe, Rick P wrote:
> On Mon, 2023-02-13 at 07:12 +0100, Juergen Gross wrote:
>>
>> Thanks for the report.
>>
>> I'll have a look. Probably I'll need to re-add the check for WB in
>> patch 7.
>
> Sure, let me know if you need any more details about my setup.
I have reproduced the issue.
Adding back the test for WB will fix it, but I'm not sure this is really
what I should do.
The problem arises in case a large mapping is spanning multiple MTRRs,
even if they define the same caching type (uniform is set to 0 in this
case).
So the basic question for me is: shouldn't the semantics of uniform be
adapted? Today it means "the range is covered by only one MTRR or by
none". Looking at the use cases I'm wondering whether it shouldn't be
"the whole range has the same caching type".
Thoughts?
Juergen
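To make the two readings concrete, here is a small self-contained illustration. This is deliberately not the kernel's mtrr_type_lookup(): fixed-range MTRRs, partial coverage by the default type and the repeat/partial_end handling are ignored; only the meaning of "uniform" differs between the two helpers.

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

struct range { uint64_t start, end; uint8_t type; };	/* [start, end) */

/* Current meaning: uniform == at most one variable MTRR overlaps the range. */
uint8_t lookup_current(const struct range *r, int n, uint64_t start,
		       uint64_t end, uint8_t def_type, uint8_t *uniform)
{
	uint8_t type = def_type;
	int hits = 0;

	for (int i = 0; i < n; i++) {
		if (r[i].end <= start || r[i].start >= end)
			continue;		/* no overlap */
		type = r[i].type;
		hits++;
	}
	*uniform = (hits <= 1);
	return type;
}

/* Proposed meaning: uniform == all overlapping MTRRs report the same type. */
uint8_t lookup_proposed(const struct range *r, int n, uint64_t start,
			uint64_t end, uint8_t def_type, uint8_t *uniform)
{
	uint8_t type = def_type;
	bool seen = false;

	*uniform = 1;
	for (int i = 0; i < n; i++) {
		if (r[i].end <= start || r[i].start >= end)
			continue;		/* no overlap */
		if (seen && r[i].type != type)
			*uniform = 0;		/* conflicting types */
		type = r[i].type;
		seen = true;
	}
	return type;
}

int main(void)
{
	/* Two variable MTRRs with the same type, both overlapping one mapping. */
	struct range mtrrs[] = {
		{ 0x000000000ULL, 0x100000000ULL, 6 /* WB */ },
		{ 0x0C0000000ULL, 0x140000000ULL, 6 /* WB */ },
	};
	uint64_t start = 0x0C0000000ULL, end = 0x100000000ULL;	/* 1 GiB mapping */
	uint8_t u_cur, u_new;

	lookup_current(mtrrs, 2, start, end, 6, &u_cur);
	lookup_proposed(mtrrs, 2, start, end, 6, &u_new);
	printf("uniform today: %u, uniform proposed: %u\n", u_cur, u_new);
	/* Prints "uniform today: 0, uniform proposed: 1". */
	return 0;
}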
On 13.02.23 02:08, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22 PM
>>
>> Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
>> WB or INVALID after calling mtrr_type_lookup(). Those tests can be
>> dropped, as the only reason to not use a large mapping would be
>> uniform being 0. Any MTRR type can be accepted as long as it applies
>> to the whole memory range covered by the mapping, as the alternative
>> would only be to map the same region with smaller pages instead using
>> the same PAT type as for the large mapping.
>>
>> Suggested-by: Linus Torvalds <[email protected]>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> arch/x86/mm/pgtable.c | 6 ++----
>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>
>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>> index e4f499eb0f29..7b9c5443d176 100644
>> --- a/arch/x86/mm/pgtable.c
>> +++ b/arch/x86/mm/pgtable.c
>> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t prot)
>> u8 mtrr, uniform;
>>
>> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>> - (mtrr != MTRR_TYPE_WRBACK))
>> + if (!uniform)
>> return 0;
>>
>> /* Bail out if we are we on a populated non-leaf entry: */
>> @@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr, pgprot_t prot)
>> u8 mtrr, uniform;
>>
>> mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>> - (mtrr != MTRR_TYPE_WRBACK)) {
>> + if (!uniform) {
>> pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a huge-page mapping due to MTRR override.\n",
>> __func__, addr, addr + PMD_SIZE);
>
> I'm seeing this warning trigger in a normal Hyper-V guest (i.e., *not* an
> SEV-SNP Confidential VM). The original filtering here based on
> MTRR_TYPE_WRBACK appears to be hiding a bug in mtrr_type_lookup_variable()
> where it incorrectly thinks an address range matches two different variable
> MTRRs, and hence clears "uniform".
>
> Here are the variable MTRRs in the normal Hyper-V guest with 32 GiBytes
> of memory:
>
> [ 0.043592] MTRR variable ranges enabled:
> [ 0.048308] 0 base 000000000000 mask FFFF00000000 write-back
> [ 0.057450] 1 base 000100000000 mask FFF000000000 write-back
I've read the SDM chapter for MTRRs again. Doesn't #1 violate the requirements
for MTRR settings? The SDM says:
For ranges greater than 4 KBytes, each range must be of length 2^n and its
base address must be aligned on a 2^n boundary, where n is a value equal to
or greater than 12. The base-address alignment value cannot be less than its
length. For example, an 8-KByte range cannot be aligned on a 4-KByte boundary.
It must be aligned on at least an 8-KByte boundary.
This makes the reasoning below wrong.
> [ 0.063972] 2 disabled
> [ 0.066755] 3 disabled
> [ 0.070024] 4 disabled
> [ 0.072856] 5 disabled
> [ 0.076112] 6 disabled
> [ 0.078760] 7 disabled
>
> Variable MTRR #0 covers addresses up to 4 GiByte, while #1 covers
> 4 GiByte to 64 GiByte. But in mtrr_type_lookup_variable(), address
> range 0xF8000000 to 0xF81FFFFF is matching both MTRRs, when it
> should be matching just #0.
>
> The problem looks to be this code in mtrr_type_lookup_variable():
>
> if ((start & mask) != (base & mask))
> continue;
>
> If the mask bits of start and base are different, then the
> MTRR doesn't match, and the continue statement should be
> executed. That's correct. But if the mask bits are the same,
> that's not sufficient for the MTRR to match. If the end
> address is less than base, the MTRR doesn't match, and
> the continue statement should still be executed, which
> isn't happening.
>
> But somebody please check my thinking. :-)
I think you need to correct the hypervisor.
Juergen
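For reference, the arithmetic behind the SDM point made above, as a standalone check of variable MTRR #1 from Michael's report (48 physical address bits assumed):

#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

int main(void)
{
	uint64_t base = 0x000100000000ULL;		/* 4 GiB */
	uint64_t mask = 0xFFF000000000ULL;		/* bits 36..47 set */
	uint64_t phys_mask = (1ULL << 48) - 1;

	/* The range length follows from the lowest bit set in the mask. */
	uint64_t size = (~mask & phys_mask) + 1;	/* 2^36 = 64 GiB */

	/* SDM rule: the base must be aligned to the range length. */
	bool aligned = !(base & (size - 1));

	printf("size = %llu GiB, base aligned to size: %s\n",
	       (unsigned long long)(size >> 30), aligned ? "yes" : "no");

	/*
	 * Prints "size = 64 GiB, base aligned to size: no": the 4 GiB base is
	 * not aligned to the 64 GiB length, so MTRR #1 violates the SDM rule.
	 * With base & mask == 0, the (start & mask) == (base & mask) test in
	 * mtrr_type_lookup_variable() then also matches addresses below
	 * 4 GiB, which is why 0xF8000000 matched both MTRRs in the report.
	 */
	return 0;
}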
From: Juergen Gross <[email protected]> Sent: Wednesday, February 15, 2023 5:40 AM
>
> On 13.02.23 02:08, Michael Kelley (LINUX) wrote:
> > From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22
> PM
> >>
> >> Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
> >> WB or INVALID after calling mtrr_type_lookup(). Those tests can be
> >> dropped, as the only reason to not use a large mapping would be
> >> uniform being 0. Any MTRR type can be accepted as long as it applies
> >> to the whole memory range covered by the mapping, as the alternative
> >> would only be to map the same region with smaller pages instead using
> >> the same PAT type as for the large mapping.
> >>
> >> Suggested-by: Linus Torvalds <[email protected]>
> >> Signed-off-by: Juergen Gross <[email protected]>
> >> ---
> >> arch/x86/mm/pgtable.c | 6 ++----
> >> 1 file changed, 2 insertions(+), 4 deletions(-)
> >>
> >> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> >> index e4f499eb0f29..7b9c5443d176 100644
> >> --- a/arch/x86/mm/pgtable.c
> >> +++ b/arch/x86/mm/pgtable.c
> >> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t
> prot)
> >> u8 mtrr, uniform;
> >>
> >> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
> >> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
> >> - (mtrr != MTRR_TYPE_WRBACK))
> >> + if (!uniform)
> >> return 0;
> >>
> >> /* Bail out if we are we on a populated non-leaf entry: */
> >> @@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr,
> pgprot_t prot)
> >> u8 mtrr, uniform;
> >>
> >> mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
> >> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
> >> - (mtrr != MTRR_TYPE_WRBACK)) {
> >> + if (!uniform) {
> >> pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a
> huge-page mapping due to MTRR override.\n",
> >> __func__, addr, addr + PMD_SIZE);
> >
> > I'm seeing this warning trigger in a normal Hyper-V guest (i.e., *not* an
> > SEV-SNP Confidential VM). The original filtering here based on
> > MTRR_TYPE_WRBACK appears to be hiding a bug in mtrr_type_lookup_variable()
> > where it incorrectly thinks an address range matches two different variable
> > MTRRs, and hence clears "uniform".
> >
> > Here are the variable MTRRs in the normal Hyper-V guest with 32 GiBytes
> > of memory:
> >
> > [ 0.043592] MTRR variable ranges enabled:
> > [ 0.048308] 0 base 000000000000 mask FFFF00000000 write-back
> > [ 0.057450] 1 base 000100000000 mask FFF000000000 write-back
>
> I've read the SDM chapter for MTRRs again. Doesn't #1 violate the requirements
> for MTRR settings? The SDM says:
>
> For ranges greater than 4 KBytes, each range must be of length 2^n and its
> base address must be aligned on a 2^n boundary, where n is a value equal to
> or greater than 12. The base-address alignment value cannot be less than its
> length. For example, an 8-KByte range cannot be aligned on a 4-KByte boundary.
> It must be aligned on at least an 8-KByte boundary.
>
> This makes the reasoning below wrong.
Argh. It sure looks like you are right. I just assumed the MTRRs coming from
Hyper-V were good. Shame on me. :-(
I've ping'ed the Hyper-V team to see what they say. But it's hard to see how
they could argue that these MTRRs are correctly formed. The Intel spec is
unambiguous.
Even if Hyper-V agrees that the MTRRs are wrong, a fix will take time to
propagate. In the meantime, it seems like the Linux mitigations could be
any of the following:
1) Keep the test for WB in pud_set_huge() and pmd_set_huge()
2) Remove the test, but have "uniform" set to 1 when multiple MTRRs are
matched but all have the same caching type, which you proposed to
solve Rick Edgecombe's problem. This is likely to paper over the
problem I saw with the Hyper-V MTRRs because the incorrectly matching
MTRRs would all be WB.
3) In *all* Hyper-V VMs (not just Confidential VMs), disable X86_FEATURE_MTRR
and use the new override to set the default type to WB. Hopefully we don't
have to do this, but I can submit a separate patch if it becomes necessary.
Michael
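A rough sketch of what option 3 could look like is below; the helper name is made up for illustration, and the real change would sit in the Hyper-V platform setup path, guarded appropriately:

static void __init hv_force_wb_mtrr_state(void)
{
	/* Hide the (unusable) MTRRs and provide a plain WB default instead. */
	setup_clear_cpu_cap(X86_FEATURE_MTRR);
	mtrr_overwrite_state(NULL, 0, NULL, MTRR_TYPE_WRBACK);
}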
On Wed, Feb 15, 2023 at 12:25 AM Juergen Gross <[email protected]> wrote:
>
> The problem arises in case a large mapping is spanning multiple MTRRs,
> even if they define the same caching type (uniform is set to 0 in this
> case).
Oh, I think then you should fix uniform to be 1.
IOW, we should not think "multiple MTRRs" means "non-uniform". Only
"different actual memory types" should mean non-uniformity.
If I remember correctly, there were good reasons to have overlapping
MTRR's. In fact, you can generate a single MTRR that described a
memory type that wasn't even contiguous if you had odd memory setups.
Intel definitely defines how overlapping MTRR's work, and "same types
overlaps" is documented as a real thing.
Linus
On 15.02.23 20:38, Michael Kelley (LINUX) wrote:
> From: Juergen Gross <[email protected]> Sent: Wednesday, February 15, 2023 5:40 AM
>>
>> On 13.02.23 02:08, Michael Kelley (LINUX) wrote:
>>> From: Juergen Gross <[email protected]> Sent: Wednesday, February 8, 2023 11:22
>> PM
>>>>
>>>> Today pud_set_huge() and pmd_set_huge() test for the MTRR type to be
>>>> WB or INVALID after calling mtrr_type_lookup(). Those tests can be
>>>> dropped, as the only reason to not use a large mapping would be
>>>> uniform being 0. Any MTRR type can be accepted as long as it applies
>>>> to the whole memory range covered by the mapping, as the alternative
>>>> would only be to map the same region with smaller pages instead using
>>>> the same PAT type as for the large mapping.
>>>>
>>>> Suggested-by: Linus Torvalds <[email protected]>
>>>> Signed-off-by: Juergen Gross <[email protected]>
>>>> ---
>>>> arch/x86/mm/pgtable.c | 6 ++----
>>>> 1 file changed, 2 insertions(+), 4 deletions(-)
>>>>
>>>> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
>>>> index e4f499eb0f29..7b9c5443d176 100644
>>>> --- a/arch/x86/mm/pgtable.c
>>>> +++ b/arch/x86/mm/pgtable.c
>>>> @@ -721,8 +721,7 @@ int pud_set_huge(pud_t *pud, phys_addr_t addr, pgprot_t
>> prot)
>>>> u8 mtrr, uniform;
>>>>
>>>> mtrr = mtrr_type_lookup(addr, addr + PUD_SIZE, &uniform);
>>>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>>>> - (mtrr != MTRR_TYPE_WRBACK))
>>>> + if (!uniform)
>>>> return 0;
>>>>
>>>> /* Bail out if we are we on a populated non-leaf entry: */
>>>> @@ -748,8 +747,7 @@ int pmd_set_huge(pmd_t *pmd, phys_addr_t addr,
>> pgprot_t prot)
>>>> u8 mtrr, uniform;
>>>>
>>>> mtrr = mtrr_type_lookup(addr, addr + PMD_SIZE, &uniform);
>>>> - if ((mtrr != MTRR_TYPE_INVALID) && (!uniform) &&
>>>> - (mtrr != MTRR_TYPE_WRBACK)) {
>>>> + if (!uniform) {
>>>> pr_warn_once("%s: Cannot satisfy [mem %#010llx-%#010llx] with a
>> huge-page mapping due to MTRR override.\n",
>>>> __func__, addr, addr + PMD_SIZE);
>>>
>>> I'm seeing this warning trigger in a normal Hyper-V guest (i.e., *not* an
>>> SEV-SNP Confidential VM). The original filtering here based on
>>> MTRR_TYPE_WRBACK appears to be hiding a bug in mtrr_type_lookup_variable()
>>> where it incorrectly thinks an address range matches two different variable
>>> MTRRs, and hence clears "uniform".
>>>
>>> Here are the variable MTRRs in the normal Hyper-V guest with 32 GiBytes
>>> of memory:
>>>
>>> [ 0.043592] MTRR variable ranges enabled:
>>> [ 0.048308] 0 base 000000000000 mask FFFF00000000 write-back
>>> [ 0.057450] 1 base 000100000000 mask FFF000000000 write-back
>>
>> I've read the SDM chapter for MTRRs again. Doesn't #1 violate the requirements
>> for MTRR settings? The SDM says:
>>
>> For ranges greater than 4 KBytes, each range must be of length 2^n and its
>> base address must be aligned on a 2^n boundary, where n is a value equal to
>> or greater than 12. The base-address alignment value cannot be less than its
>> length. For example, an 8-KByte range cannot be aligned on a 4-KByte boundary.
>> It must be aligned on at least an 8-KByte boundary.
>>
>> This makes the reasoning below wrong.
>
> Argh. It sure looks like you are right. I just assumed the MTRRs coming from
> Hyper-V were good. Shame on me. :-(
I assumed the same, as I didn't see any flaw in your reasoning before. :-)
> I've ping'ed the Hyper-V team to see what they say. But it's hard to see how
> they could argue that these MTRRs are correctly formed. The Intel spec is
> unambiguous.
>
> Even if Hyper-V agrees that the MTRRs are wrong, a fix will take time to
> propagate. In the meantime, it seems like the Linux mitigations could be
> any of the following:
>
> 1) Keep the test for WB in pud_set_huge() and pmd_set_huge()
>
> 2) Remove the test, but have "uniform" set to 1 when multiple MTRRs are
> matched but all have the same caching type, which you proposed to
> solve Rick Edgecombe's problem. This is likely to paper over the
> problem I saw with the Hyper-V MTRRs because the incorrectly matching
> MTRRs would all be WB.
>
> 3) In *all* Hyper-V VMs (not just Confidential VMs), disable X86_FEATURE_MTRR
> and use the new override to set the default type to WB. Hopefully we don't
> have to do this, but I can submit a separate patch if it becomes necessary.
4) Sanitize MTRRs in mtrr_cleanup(), resulting in MTRR #1 in your example
being modified to start at 0 (which would not really help to solve the
multiple match you are seeing, but I'm about to solve that one, too, as
the multiple MTRR match is allowed in the specs, but not really handled
correctly in mtrr_type_lookup()).
Juergen
On 16.02.23 00:22, Linus Torvalds wrote:
> On Wed, Feb 15, 2023 at 12:25 AM Juergen Gross <[email protected]> wrote:
>>
>> The problem arises in case a large mapping is spanning multiple MTRRs,
>> even if they define the same caching type (uniform is set to 0 in this
>> case).
>
> Oh, I think then you should fix uniform to be 1.
>
> IOW, we should not think "multiple MTRRs" means "non-uniform". Only
> "different actual memory types" should mean non-uniformity.
Thanks for confirmation. I completely agree.
> If I remember correctly, there were good reasons to have overlapping
> MTRR's. In fact, you can generate a single MTRR that described a
> memory type that wasn't even contiguous if you had odd memory setups.
>
> Intel definitely defines how overlapping MTRR's work, and "same types
> overlaps" is documented as a real thing.
Yes. And it is handled wrong in current code.
Handling it correctly will require quite some reworking of the code,
which I've already started to work on. I will defer the pud_set_huge()/
pmd_set_huge() modifying patch to after this rework.
Juergen
On 13.02.23 12:39, Borislav Petkov wrote:
> On Thu, Feb 09, 2023 at 08:22:14AM +0100, Juergen Gross wrote:
>> When running virtualized, MTRR access can be reduced (e.g. in Xen PV
>> guests or when running as a SEV-SNP guest under Hyper-V). Typically
>> the hypervisor will reset the MTRR feature in cpuid data, resulting
>> in no MTRR memory type information being available for the kernel.
>>
>> This has turned out to result in problems:
>>
>> - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
>> - Xen PV dom0 mapping memory as WB which should be UC- instead
>>
>> Solve those problems by supporting to set a fixed MTRR state,
>> overwriting the empty state used today. In case such a state has been
>> set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
>> will only be used by mtrr_type_lookup(), as in all other cases
>> mtrr_enabled() is being checked, which will return false. Accept the
>> overwrite call only in case of MTRRs being disabled in cpuid.
>
> s/cpuid/CPUID/g
>
>> Signed-off-by: Juergen Gross <[email protected]>
>> ---
>> V2:
>> - new patch
>> ---
>> arch/x86/include/asm/mtrr.h | 2 ++
>> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
>> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
>> 3 files changed, 49 insertions(+)
>>
>> diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
>> index f0eeaf6e5f5f..0b8f51d683dc 100644
>> --- a/arch/x86/include/asm/mtrr.h
>> +++ b/arch/x86/include/asm/mtrr.h
>> @@ -31,6 +31,8 @@
>> */
>> # ifdef CONFIG_MTRR
>> void mtrr_bp_init(void);
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type *fixed, mtrr_type def_type);
>> extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
>> extern void mtrr_save_fixed_ranges(void *);
>> extern void mtrr_save_state(void);
>> diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
>> index ee09d359e08f..788bc16888a5 100644
>> --- a/arch/x86/kernel/cpu/mtrr/generic.c
>> +++ b/arch/x86/kernel/cpu/mtrr/generic.c
>> @@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
>> return mtrr_state.def_type;
>> }
>>
>> +/**
>> + * mtrr_overwrite_state - set fixed MTRR state
>
> fixed only? You pass in variable too...
>
>> + *
>> + * Used to set MTRR state via different means (e.g. with data obtained from
>> + * a hypervisor).
>> + */
>> +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
>> + mtrr_type *fixed, mtrr_type def_type)
>> +{
>> + unsigned int i;
>> +
>> + if (boot_cpu_has(X86_FEATURE_MTRR))
>
> check_for_deprecated_apis: WARNING: arch/x86/kernel/cpu/mtrr/generic.c:254: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
>
>> + return;
>
> So this here needs to check:
>
> if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
> !(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
> cpu_feature_enabled(X86_FEATURE_XENPV))) {
> WARN_ON_ONCE(1);
> return;
> }
>
> as we don't want this to be called somewhere or by something else.
>
> The SEV_SNP flag can be used from:
>
> https://lore.kernel.org/r/[email protected]
>
> I'm assuming here HyperV SEV-SNP guests really do set that feature flag
> (they better). We can expedite that patch ofc.
Is that flag _really_ meant to indicate we are running as a SEV-SNP guest?
Given that the referenced patch is part of the SEV-SNP host support series,
I'm inclined to suspect it won't be set for sure in HyperV SEV-SNP guests.
And who is setting it for KVM SEV-SNP guests?
Juergen
On Thu, Feb 16, 2023 at 10:32:28AM +0100, Juergen Gross wrote:
> On 13.02.23 12:39, Borislav Petkov wrote:
> >On Thu, Feb 09, 2023 at 08:22:14AM +0100, Juergen Gross wrote:
> >>When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> >>guests or when running as a SEV-SNP guest under Hyper-V). Typically
> >>the hypervisor will reset the MTRR feature in cpuid data, resulting
> >>in no MTRR memory type information being available for the kernel.
> >>
> >>This has turned out to result in problems:
> >>
> >>- Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
> >>- Xen PV dom0 mapping memory as WB which should be UC- instead
> >>
> >>Solve those problems by supporting to set a fixed MTRR state,
> >>overwriting the empty state used today. In case such a state has been
> >>set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
> >>will only be used by mtrr_type_lookup(), as in all other cases
> >>mtrr_enabled() is being checked, which will return false. Accept the
> >>overwrite call only in case of MTRRs being disabled in cpuid.
> >
> >s/cpuid/CPUID/g
> >
> >>Signed-off-by: Juergen Gross <[email protected]>
> >>---
> >>V2:
> >>- new patch
> >>---
> >> arch/x86/include/asm/mtrr.h | 2 ++
> >> arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
> >> arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
> >> 3 files changed, 49 insertions(+)
> >>
> >>diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
> >>index f0eeaf6e5f5f..0b8f51d683dc 100644
> >>--- a/arch/x86/include/asm/mtrr.h
> >>+++ b/arch/x86/include/asm/mtrr.h
> >>@@ -31,6 +31,8 @@
> >> */
> >> # ifdef CONFIG_MTRR
> >> void mtrr_bp_init(void);
> >>+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> >>+ mtrr_type *fixed, mtrr_type def_type);
> >> extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
> >> extern void mtrr_save_fixed_ranges(void *);
> >> extern void mtrr_save_state(void);
> >>diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
> >>index ee09d359e08f..788bc16888a5 100644
> >>--- a/arch/x86/kernel/cpu/mtrr/generic.c
> >>+++ b/arch/x86/kernel/cpu/mtrr/generic.c
> >>@@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
> >> return mtrr_state.def_type;
> >> }
> >>+/**
> >>+ * mtrr_overwrite_state - set fixed MTRR state
> >
> >fixed only? You pass in variable too...
> >
> >>+ *
> >>+ * Used to set MTRR state via different means (e.g. with data obtained from
> >>+ * a hypervisor).
> >>+ */
> >>+void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> >>+ mtrr_type *fixed, mtrr_type def_type)
> >>+{
> >>+ unsigned int i;
> >>+
> >>+ if (boot_cpu_has(X86_FEATURE_MTRR))
> >
> >check_for_deprecated_apis: WARNING: arch/x86/kernel/cpu/mtrr/generic.c:254: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
> >
> >>+ return;
> >
> >So this here needs to check:
> >
> > if (!cpu_feature_enabled(X86_FEATURE_HYPERVISOR) &&
> > !(cpu_feature_enabled(X86_FEATURE_SEV_SNP) ||
> > cpu_feature_enabled(X86_FEATURE_XENPV))) {
> > WARN_ON_ONCE(1);
> > return;
> > }
> >
> >as we don't want this to be called somewhere or by something else.
> >
> >The SEV_SNP flag can be used from:
> >
> >https://lore.kernel.org/r/[email protected]
> >
> >I'm assuming here HyperV SEV-SNP guests really do set that feature flag
> >(they better). We can expedite that patch ofc.
>
> Is that flag _really_ meant to indicate we are running as a SEV-SNP guest?
>
> Given that the referenced patch is part of the SEV-SNP host support series,
> I'm inclined to suspect it won't be set for sure in HyperV SEV-SNP guests.
> And who is setting it for KVM SEV-SNP guests?
>
>
> Juergen
Initially both guest and host side have X86_FEATURE_SEV_SNP, but
early_detect_mem_encrypt() clears it for the guest side.
You would want cc_platform_has(CC_ATTR_GUEST_SEV_SNP), but:
* there are two kinds of Hyper-V SEV-SNP VMs: "unenlightened" (vTOM+paravisor)
and "enlightened"
* the kernel currently supports the "unenlightened" kind which do not set
CC_ATTR_GUEST_SEV_SNP (because it implies codepaths which do not apply to
vTOM mode)
* the patchset that adds support for "enlightened" Hyper-V SEV-SNP VMs sets
CC_ATTR_GUEST_SEV_SNP [1]
The closest you can get is:
cc_platform_has(CC_ATTR_GUEST_MEM_ENCRYPT) && hypervisor_is_type(X86_HYPER_MS_HYPERV)
but that would soon cover TDX guests too so <shrug>...
[1]: https://lore.kernel.org/lkml/[email protected]/
On Mon, Feb 13, 2023 at 12:39:40PM +0100, Borislav Petkov wrote:
> On Thu, Feb 09, 2023 at 08:22:14AM +0100, Juergen Gross wrote:
> > When running virtualized, MTRR access can be reduced (e.g. in Xen PV
> > guests or when running as a SEV-SNP guest under Hyper-V). Typically
> > the hypervisor will reset the MTRR feature in cpuid data, resulting
> > in no MTRR memory type information being available for the kernel.
> >
> > This has turned out to result in problems:
> >
> > - Hyper-V SEV-SNP guests using uncached mappings where they shouldn't
> > - Xen PV dom0 mapping memory as WB which should be UC- instead
> >
> > Solve those problems by adding support for setting a fixed MTRR state,
> > overwriting the empty state used today. In case such a state has been
> > set, don't call get_mtrr_state() in mtrr_bp_init(). The set state
> > will only be used by mtrr_type_lookup(), as in all other cases
> > mtrr_enabled() is being checked, which will return false. Accept the
> > overwrite call only in case of MTRRs being disabled in cpuid.
>
> s/cpuid/CPUID/g
>
> > Signed-off-by: Juergen Gross <[email protected]>
> > ---
> > V2:
> > - new patch
> > ---
> > arch/x86/include/asm/mtrr.h | 2 ++
> > arch/x86/kernel/cpu/mtrr/generic.c | 38 ++++++++++++++++++++++++++++++
> > arch/x86/kernel/cpu/mtrr/mtrr.c | 9 +++++++
> > 3 files changed, 49 insertions(+)
> >
> > diff --git a/arch/x86/include/asm/mtrr.h b/arch/x86/include/asm/mtrr.h
> > index f0eeaf6e5f5f..0b8f51d683dc 100644
> > --- a/arch/x86/include/asm/mtrr.h
> > +++ b/arch/x86/include/asm/mtrr.h
> > @@ -31,6 +31,8 @@
> > */
> > # ifdef CONFIG_MTRR
> > void mtrr_bp_init(void);
> > +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> > + mtrr_type *fixed, mtrr_type def_type);
> > extern u8 mtrr_type_lookup(u64 addr, u64 end, u8 *uniform);
> > extern void mtrr_save_fixed_ranges(void *);
> > extern void mtrr_save_state(void);
> > diff --git a/arch/x86/kernel/cpu/mtrr/generic.c b/arch/x86/kernel/cpu/mtrr/generic.c
> > index ee09d359e08f..788bc16888a5 100644
> > --- a/arch/x86/kernel/cpu/mtrr/generic.c
> > +++ b/arch/x86/kernel/cpu/mtrr/generic.c
> > @@ -240,6 +240,44 @@ static u8 mtrr_type_lookup_variable(u64 start, u64 end, u64 *partial_end,
> > return mtrr_state.def_type;
> > }
> >
> > +/**
> > + * mtrr_overwrite_state - set fixed MTRR state
>
> fixed only? You pass in variable too...
>
> > + *
> > + * Used to set MTRR state via different means (e.g. with data obtained from
> > + * a hypervisor).
> > + */
> > +void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
> > + mtrr_type *fixed, mtrr_type def_type)
> > +{
> > + unsigned int i;
> > +
> > + if (boot_cpu_has(X86_FEATURE_MTRR))
>
> check_for_deprecated_apis: WARNING: arch/x86/kernel/cpu/mtrr/generic.c:254: Do not use boot_cpu_has() - use cpu_feature_enabled() instead.
>
Hi Boris,
Where does this check come from? I can't find a source for it.
Jeremi
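For reference, a rough sketch of what the overwrite path described in the
quoted commit message could look like. The body below is an assumption pieced
together from that description (mtrr_state, MTRR_NUM_FIXED_RANGES and the
MTRR_STATE_* flags are the existing definitions in generic.c and
uapi/asm/mtrr.h); it is not the actual hunk from the patch:

/*
 * Sketch only (not the posted patch body): store the caller-supplied MTRR
 * data in mtrr_state so that mtrr_type_lookup() can use it, while
 * mtrr_enabled() keeps returning false because the MTRR CPUID bit is off.
 */
void mtrr_overwrite_state(struct mtrr_var_range *var, unsigned int num_var,
			  mtrr_type *fixed, mtrr_type def_type)
{
	unsigned int i;

	/* Accept the overwrite only when MTRRs are disabled in cpuid. */
	if (cpu_feature_enabled(X86_FEATURE_MTRR))
		return;

	if (var && num_var)
		memcpy(mtrr_state.var_ranges, var, num_var * sizeof(*var));

	if (fixed) {
		for (i = 0; i < MTRR_NUM_FIXED_RANGES; i++)
			mtrr_state.fixed_ranges[i] = fixed[i];
		mtrr_state.have_fixed = 1;
		mtrr_state.enabled |= MTRR_STATE_MTRR_FIXED_ENABLED;
	}

	mtrr_state.def_type = def_type;
	mtrr_state.enabled |= MTRR_STATE_MTRR_ENABLED;
}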
On Thu, Feb 16, 2023 at 10:32:28AM +0100, Juergen Gross wrote:
> Is that flag _really_ meant to indicate we are running as a SEV-SNP guest?
Yes.
> Given that the referenced patch is part of the SEV-SNP host support series,
> I'm inclined to suspect it won't be set for sure in HyperV SEV-SNP guests.
It better be. If it is a modified guest - no matter how modified - it
should set that flag. The vTOM thing is still being discussed.
> And who is setting it for KVM SEV-SNP guests?
That same patch does.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On Thu, Feb 16, 2023 at 03:07:28AM -0800, Jeremi Piotrowski wrote:
> Where does this check come from? I can't find a source for it.
That's the patch checker I've been writing while reviewing patches:
https://git.kernel.org/pub/scm/linux/kernel/git/bp/bp.git/log/?h=vp
It is, ofc, WIP.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
On 16.02.23 12:25, Borislav Petkov wrote:
> On Thu, Feb 16, 2023 at 10:32:28AM +0100, Juergen Gross wrote:
>> Is that flag _really_ meant to indicate we are running as a SEV-SNP guest?
>
> Yes.
>
>> Given that the referenced patch is part of the SEV-SNP host support series,
>> I'm inclined to suspect it won't be set for sure in HyperV SEV-SNP guests.
>
> It better be. If it is a modified guest - no matter how modified - it
> should set that flag. The vTOM thing is still being discussed.
>
>> And who is setting it for KVM SEV-SNP guests?
>
> That same patch does.
Hmm, I must be blind. I can't spot it.
I'm seeing only the feature bit #define and a call of
setup_clear_cpu_cap(X86_FEATURE_SEV_SNP) in this patch.
Or is it done by hardware or the hypervisor?
Juergen
On Thu, Feb 16, 2023 at 01:19:22PM +0100, Juergen Gross wrote:
> Hmm, I must be blind. I can't spot it.
>
> I'm seeing only the feature bit #define and a call of
> setup_clear_cpu_cap(X86_FEATURE_SEV_SNP) in this patch.
>
> Or is it done by hardware or the hypervisor?
Correction - I meant CC_ATTR_GUEST_SEV_SNP not the CPUID feature flag.
Sorry for the confusion folks.
--
Regards/Gruss,
Boris.
https://people.kernel.org/tglx/notes-about-netiquette
From: Borislav Petkov <[email protected]> Sent: Thursday, February 16, 2023 4:29 AM
>
> On Thu, Feb 16, 2023 at 01:19:22PM +0100, Juergen Gross wrote:
> > Hmm, I must be blind. I can't spot it.
> >
> > I'm seeing only the feature bit #define and a call of
> > setup_clear_cpu_cap(X86_FEATURE_SEV_SNP) in this patch.
> >
> > Or is it done by hardware or the hypervisor?
>
> Correction - I meant CC_ATTR_GUEST_SEV_SNP not the CPUID feature flag.
>
In current upstream code, Hyper-V vTOM VMs aren't participating in
the CC_ATTR_* scheme at all, so CC_ATTR_GUEST_SEV_SNP won't be
set. Getting Hyper-V vTOM VMs integrated into that scheme is a key
part of my big patch set[1], for which we're separately trying to
resolve the last remaining issues.
Michael
[1] https://lore.kernel.org/linux-hyperv/[email protected]/