Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
turns all rdmsr and wrmsr operations into the safe variants without
any checks that the operations actually succeed.
With CONFIG_PARAVIRT=n, unchecked MSR failures OOPS and probably
cause boot to fail if they happen before init starts.
Neither behavior is very good, and it's particularly unfortunate that
the behavior changes depending on CONFIG_PARAVIRT.
In particular, KVM guests might be unwittingly depending on the
PARAVIRT=y behavior because CONFIG_KVM_GUEST currently depends on
CONFIG_PARAVIRT, and, because accesses in that case are completely
unchecked, we wouldn't even see a warning.
This series changes the native behavior, regardless of
CONFIG_PARAVIRT. A non-"safe" MSR failure will give an informative
warning once and will be fixed up -- native_read_msr will return
zero, and both reads and writes will continue where they left off.
If panic_on_oops is set, they will still OOPS and panic.
By using the shiny new custom exception handler infrastructure,
there should be no overhead on the success paths.
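Roughly, the mechanism (condensed from patch 2 below -- see the
patches for the real thing) is that the extable entry names a handler
instead of a bare fixup address, so the success path is just the raw
instruction:

asm volatile("1: rdmsr\n"
             "2:\n"
             _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_rdmsr_unsafe)
             : EAX_EDX_RET(val, low, high) : "c" (msr));

bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
                             struct pt_regs *regs, int trapnr)
{
        /* Runs only if RDMSR faults: warn once, zero the result. */
        WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
                  (unsigned int)regs->cx);

        /* If panic_on_oops is set, don't try to recover. */
        if (panic_on_oops)
                return false;

        regs->ip = ex_fixup_addr(fixup);
        regs->ax = 0;
        regs->dx = 0;
        return true;
}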
I didn't change the behavior on Xen, but, with this series applied,
it would be straightforward for the Xen maintainers to make the
corresponding change -- knowledge of whether the access is "safe" is
now propagated into the pvops.
Doing this is probably a prerequisite to sanely decoupling
CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
Arjan and the rest of the Clear Containers people happy :)
There's also room to reduce the code size of the "safe" variants
using custom exception handlers in the future.
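For example (hypothetical, not part of this series -- the handler name
and the register dance are made up), the _safe read could preload err
with -EIO, clear it on success, and share one quiet handler that zeroes
the output registers, so each call site loses its private .fixup stub:

static inline unsigned long long native_read_msr_safe(unsigned int msr,
                                                      int *err)
{
        DECLARE_ARGS(val, low, high);

        asm volatile("mov %[fault], %[err]\n" /* assume failure */
                     "1: rdmsr\n"
                     "xor %[err], %[err]\n"   /* success: clear err */
                     "2:\n"
                     _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_rdmsr_quiet)
                     : [err] "=&r" (*err), EAX_EDX_RET(val, low, high)
                     : "c" (msr), [fault] "i" (-EIO));
        return EAX_EDX_VAL(val, low, high);
}

/* Shared by all call sites: zero the result and skip the xor. */
bool ex_handler_rdmsr_quiet(const struct exception_table_entry *fixup,
                            struct pt_regs *regs, int trapnr)
{
        regs->ip = ex_fixup_addr(fixup);
        regs->ax = 0;
        regs->dx = 0;
        return true;
}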
Changes from v3:
- WARN_ONCE instead of WARN (Ingo)
- In the warning text, s/unsafe/unchecked/ (Ingo, sort of)
Changes from earlier versions: lots of changes!
Andy Lutomirski (5):
x86/paravirt: Add _safe to the read_msr and write_msr PV hooks
x86/msr: Carry on after a non-"safe" MSR access fails without
!panic_on_oops
x86/paravirt: Add paravirt_{read,write}_msr
x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
x86/msr: Set the return value to zero when native_rdmsr_safe fails
arch/x86/include/asm/msr.h | 20 ++++++++++++----
arch/x86/include/asm/paravirt.h | 45 +++++++++++++++++++++--------------
arch/x86/include/asm/paravirt_types.h | 14 +++++++----
arch/x86/kernel/paravirt.c | 6 +++--
arch/x86/mm/extable.c | 33 +++++++++++++++++++++++++
arch/x86/xen/enlighten.c | 27 +++++++++++++++++++--
6 files changed, 114 insertions(+), 31 deletions(-)
--
2.5.0
Enabling CONFIG_PARAVIRT had an unintended side effect: rdmsr turned
into rdmsr_safe and wrmsr turned into wrmsr_safe, even on bare
metal. Undo that by using the new unsafe paravirt MSR hooks.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/paravirt.h | 9 +++------
1 file changed, 3 insertions(+), 6 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 68297d87e85c..0c99f10874e4 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -151,24 +151,21 @@ static inline int paravirt_write_msr_safe(unsigned msr,
return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
}
-/* These should all do BUG_ON(_err), but our headers are too tangled. */
#define rdmsr(msr, val1, val2) \
do { \
- int _err; \
- u64 _l = paravirt_read_msr_safe(msr, &_err); \
+ u64 _l = paravirt_read_msr(msr); \
val1 = (u32)_l; \
val2 = _l >> 32; \
} while (0)
#define wrmsr(msr, val1, val2) \
do { \
- paravirt_write_msr_safe(msr, val1, val2); \
+ paravirt_write_msr(msr, val1, val2); \
} while (0)
#define rdmsrl(msr, val) \
do { \
- int _err; \
- val = paravirt_read_msr_safe(msr, &_err); \
+ val = paravirt_read_msr(msr); \
} while (0)
static inline void wrmsrl(unsigned msr, u64 val)
--
2.5.0
This adds paravirt hooks for unsafe MSR access. On native, they
call native_{read,write}_msr. On Xen, they use
xen_{read,write}_msr_safe.
Nothing uses them yet for ease of bisection. The next patch will
use them in rdmsrl, wrmsrl, etc.
I intentionally didn't make them OOPS on #GP on Xen. I think that
should be done separately by the Xen maintainers.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/msr.h | 5 +++--
arch/x86/include/asm/paravirt.h | 11 +++++++++++
arch/x86/include/asm/paravirt_types.h | 10 ++++++++--
arch/x86/kernel/paravirt.c | 2 ++
arch/x86/xen/enlighten.c | 23 +++++++++++++++++++++++
5 files changed, 47 insertions(+), 4 deletions(-)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 1487054a1a70..13da359881d7 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -119,8 +119,9 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
return EAX_EDX_VAL(val, low, high);
}
-static inline void native_write_msr(unsigned int msr,
- unsigned low, unsigned high)
+/* Can be uninlined because referenced by paravirt */
+notrace static inline void native_write_msr(unsigned int msr,
+ unsigned low, unsigned high)
{
asm volatile("1: wrmsr\n"
"2:\n"
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index 2e49228ed9a3..68297d87e85c 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -129,6 +129,17 @@ static inline void wbinvd(void)
#define get_kernel_rpl() (pv_info.kernel_rpl)
+static inline u64 paravirt_read_msr(unsigned msr)
+{
+ return PVOP_CALL1(u64, pv_cpu_ops.read_msr, msr);
+}
+
+static inline void paravirt_write_msr(unsigned msr,
+ unsigned low, unsigned high)
+{
+ return PVOP_VCALL3(pv_cpu_ops.write_msr, msr, low, high);
+}
+
static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
{
return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 5a06cccd36f0..b85aec45a1b8 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -155,8 +155,14 @@ struct pv_cpu_ops {
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
- /* MSR operations.
- err = 0/-EIO. wrmsr returns 0/-EIO. */
+ /* Unsafe MSR operations. These will warn or panic on failure. */
+ u64 (*read_msr)(unsigned int msr);
+ void (*write_msr)(unsigned int msr, unsigned low, unsigned high);
+
+ /*
+ * Safe MSR operations.
+ * read sets err to 0 or -EIO. write returns 0 or -EIO.
+ */
u64 (*read_msr_safe)(unsigned int msr, int *err);
int (*write_msr_safe)(unsigned int msr, unsigned low, unsigned high);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index 8aad95478ae5..f9583917c7c4 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -339,6 +339,8 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
.write_cr8 = native_write_cr8,
#endif
.wbinvd = native_wbinvd,
+ .read_msr = native_read_msr,
+ .write_msr = native_write_msr,
.read_msr_safe = native_read_msr_safe,
.write_msr_safe = native_write_msr_safe,
.read_pmc = native_read_pmc,
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index ff2df20d0067..548cd3e02b9e 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1092,6 +1092,26 @@ static int xen_write_msr_safe(unsigned int msr, unsigned low, unsigned high)
return ret;
}
+static u64 xen_read_msr(unsigned int msr)
+{
+ /*
+ * This will silently swallow a #GP from RDMSR. It may be worth
+ * changing that.
+ */
+ int err;
+
+ return xen_read_msr_safe(msr, &err);
+}
+
+static void xen_write_msr(unsigned int msr, unsigned low, unsigned high)
+{
+ /*
+ * This will silently swallow a #GP from WRMSR. It may be worth
+ * changing that.
+ */
+ xen_write_msr_safe(msr, low, high);
+}
+
void xen_setup_shared_info(void)
{
if (!xen_feature(XENFEAT_auto_translated_physmap)) {
@@ -1222,6 +1242,9 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
.wbinvd = native_wbinvd,
+ .read_msr = xen_read_msr,
+ .write_msr = xen_write_msr,
+
.read_msr_safe = xen_read_msr_safe,
.write_msr_safe = xen_write_msr_safe,
--
2.5.0
This demotes an OOPS and likely panic due to a failed non-"safe" MSR
access to a WARN_ONCE and, for RDMSR, a return value of zero. If
panic_on_oops is set, then failed unsafe MSR accesses will still
oops and panic.
To be clear, this type of failure should *not* happen. This patch
exists to minimize the chance of nasty undebuggable failures on
systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/msr.h | 10 ++++++++--
arch/x86/mm/extable.c | 33 +++++++++++++++++++++++++++++++++
2 files changed, 41 insertions(+), 2 deletions(-)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 93fb7c1cffda..1487054a1a70 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -92,7 +92,10 @@ static inline unsigned long long native_read_msr(unsigned int msr)
{
DECLARE_ARGS(val, low, high);
- asm volatile("rdmsr" : EAX_EDX_RET(val, low, high) : "c" (msr));
+ asm volatile("1: rdmsr\n"
+ "2:\n"
+ _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_rdmsr_unsafe)
+ : EAX_EDX_RET(val, low, high) : "c" (msr));
if (msr_tracepoint_active(__tracepoint_read_msr))
do_trace_read_msr(msr, EAX_EDX_VAL(val, low, high), 0);
return EAX_EDX_VAL(val, low, high);
@@ -119,7 +122,10 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
static inline void native_write_msr(unsigned int msr,
unsigned low, unsigned high)
{
- asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
+ asm volatile("1: wrmsr\n"
+ "2:\n"
+ _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
+ : : "c" (msr), "a"(low), "d" (high) : "memory");
if (msr_tracepoint_active(__tracepoint_read_msr))
do_trace_write_msr(msr, ((u64)high << 32 | low), 0);
}
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 9dd7e4b7fcde..9e6749b34524 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -49,6 +49,39 @@ bool ex_handler_ext(const struct exception_table_entry *fixup,
}
EXPORT_SYMBOL(ex_handler_ext);
+bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
+ struct pt_regs *regs, int trapnr)
+{
+ WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
+ (unsigned int)regs->cx);
+
+ /* If panic_on_oops is set, don't try to recover. */
+ if (panic_on_oops)
+ return false;
+
+ regs->ip = ex_fixup_addr(fixup);
+ regs->ax = 0;
+ regs->dx = 0;
+ return true;
+}
+EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
+
+bool ex_handler_wrmsr_unsafe(const struct exception_table_entry *fixup,
+ struct pt_regs *regs, int trapnr)
+{
+ WARN_ONCE(1, "unchecked MSR access error: WRMSR to 0x%x (tried to write 0x%08x%08x)\n",
+ (unsigned int)regs->cx,
+ (unsigned int)regs->dx, (unsigned int)regs->ax);
+
+ /* If panic_on_oops is set, don't try to recover. */
+ if (panic_on_oops)
+ return false;
+
+ regs->ip = ex_fixup_addr(fixup);
+ return true;
+}
+EXPORT_SYMBOL(ex_handler_wrmsr_unsafe);
+
bool ex_has_fault_handler(unsigned long ip)
{
const struct exception_table_entry *e;
--
2.5.0
These hooks match the _safe variants, so name them accordingly.
This will make room for unsafe PV hooks.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/paravirt.h | 33 +++++++++++++++++----------------
arch/x86/include/asm/paravirt_types.h | 8 ++++----
arch/x86/kernel/paravirt.c | 4 ++--
arch/x86/xen/enlighten.c | 4 ++--
4 files changed, 25 insertions(+), 24 deletions(-)
diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
index f6192502149e..2e49228ed9a3 100644
--- a/arch/x86/include/asm/paravirt.h
+++ b/arch/x86/include/asm/paravirt.h
@@ -129,34 +129,35 @@ static inline void wbinvd(void)
#define get_kernel_rpl() (pv_info.kernel_rpl)
-static inline u64 paravirt_read_msr(unsigned msr, int *err)
+static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
{
- return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
+ return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
}
-static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
+static inline int paravirt_write_msr_safe(unsigned msr,
+ unsigned low, unsigned high)
{
- return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
+ return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
}
/* These should all do BUG_ON(_err), but our headers are too tangled. */
#define rdmsr(msr, val1, val2) \
do { \
int _err; \
- u64 _l = paravirt_read_msr(msr, &_err); \
+ u64 _l = paravirt_read_msr_safe(msr, &_err); \
val1 = (u32)_l; \
val2 = _l >> 32; \
} while (0)
#define wrmsr(msr, val1, val2) \
do { \
- paravirt_write_msr(msr, val1, val2); \
+ paravirt_write_msr_safe(msr, val1, val2); \
} while (0)
#define rdmsrl(msr, val) \
do { \
int _err; \
- val = paravirt_read_msr(msr, &_err); \
+ val = paravirt_read_msr_safe(msr, &_err); \
} while (0)
static inline void wrmsrl(unsigned msr, u64 val)
@@ -164,23 +165,23 @@ static inline void wrmsrl(unsigned msr, u64 val)
wrmsr(msr, (u32)val, (u32)(val>>32));
}
-#define wrmsr_safe(msr, a, b) paravirt_write_msr(msr, a, b)
+#define wrmsr_safe(msr, a, b) paravirt_write_msr_safe(msr, a, b)
/* rdmsr with exception handling */
-#define rdmsr_safe(msr, a, b) \
-({ \
- int _err; \
- u64 _l = paravirt_read_msr(msr, &_err); \
- (*a) = (u32)_l; \
- (*b) = _l >> 32; \
- _err; \
+#define rdmsr_safe(msr, a, b) \
+({ \
+ int _err; \
+ u64 _l = paravirt_read_msr_safe(msr, &_err); \
+ (*a) = (u32)_l; \
+ (*b) = _l >> 32; \
+ _err; \
})
static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
{
int err;
- *p = paravirt_read_msr(msr, &err);
+ *p = paravirt_read_msr_safe(msr, &err);
return err;
}
diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
index 77db5616a473..5a06cccd36f0 100644
--- a/arch/x86/include/asm/paravirt_types.h
+++ b/arch/x86/include/asm/paravirt_types.h
@@ -155,10 +155,10 @@ struct pv_cpu_ops {
void (*cpuid)(unsigned int *eax, unsigned int *ebx,
unsigned int *ecx, unsigned int *edx);
- /* MSR, PMC and TSR operations.
- err = 0/-EFAULT. wrmsr returns 0/-EFAULT. */
- u64 (*read_msr)(unsigned int msr, int *err);
- int (*write_msr)(unsigned int msr, unsigned low, unsigned high);
+ /* MSR operations.
+ err = 0/-EIO. wrmsr returns 0/-EIO. */
+ u64 (*read_msr_safe)(unsigned int msr, int *err);
+ int (*write_msr_safe)(unsigned int msr, unsigned low, unsigned high);
u64 (*read_pmc)(int counter);
diff --git a/arch/x86/kernel/paravirt.c b/arch/x86/kernel/paravirt.c
index f08ac28b8136..8aad95478ae5 100644
--- a/arch/x86/kernel/paravirt.c
+++ b/arch/x86/kernel/paravirt.c
@@ -339,8 +339,8 @@ __visible struct pv_cpu_ops pv_cpu_ops = {
.write_cr8 = native_write_cr8,
#endif
.wbinvd = native_wbinvd,
- .read_msr = native_read_msr_safe,
- .write_msr = native_write_msr_safe,
+ .read_msr_safe = native_read_msr_safe,
+ .write_msr_safe = native_write_msr_safe,
.read_pmc = native_read_pmc,
.load_tr_desc = native_load_tr_desc,
.set_ldt = native_set_ldt,
diff --git a/arch/x86/xen/enlighten.c b/arch/x86/xen/enlighten.c
index d09e4c9d7cc5..ff2df20d0067 100644
--- a/arch/x86/xen/enlighten.c
+++ b/arch/x86/xen/enlighten.c
@@ -1222,8 +1222,8 @@ static const struct pv_cpu_ops xen_cpu_ops __initconst = {
.wbinvd = native_wbinvd,
- .read_msr = xen_read_msr_safe,
- .write_msr = xen_write_msr_safe,
+ .read_msr_safe = xen_read_msr_safe,
+ .write_msr_safe = xen_write_msr_safe,
.read_pmc = xen_read_pmc,
--
2.5.0
This will cause unchecked native_rdmsr_safe failures to return
deterministic results.
Signed-off-by: Andy Lutomirski <[email protected]>
---
arch/x86/include/asm/msr.h | 5 ++++-
1 file changed, 4 insertions(+), 1 deletion(-)
diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
index 13da359881d7..e97e79f8a22b 100644
--- a/arch/x86/include/asm/msr.h
+++ b/arch/x86/include/asm/msr.h
@@ -109,7 +109,10 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
asm volatile("2: rdmsr ; xor %[err],%[err]\n"
"1:\n\t"
".section .fixup,\"ax\"\n\t"
- "3: mov %[fault],%[err] ; jmp 1b\n\t"
+ "3: mov %[fault],%[err]\n\t"
+ "xorl %%eax, %%eax\n\t"
+ "xorl %%edx, %%edx\n\t"
+ "jmp 1b\n\t"
".previous\n\t"
_ASM_EXTABLE(2b, 3b)
: [err] "=r" (*err), EAX_EDX_RET(val, low, high)
--
2.5.0
On Sat, Mar 12, 2016 at 10:08:48AM -0800, Andy Lutomirski wrote:
> These hooks match the _safe variants, so name them accordingly.
> This will make room for unsafe PV hooks.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/paravirt.h | 33 +++++++++++++++++----------------
> arch/x86/include/asm/paravirt_types.h | 8 ++++----
> arch/x86/kernel/paravirt.c | 4 ++--
> arch/x86/xen/enlighten.c | 4 ++--
> 4 files changed, 25 insertions(+), 24 deletions(-)
>
> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
> index f6192502149e..2e49228ed9a3 100644
> --- a/arch/x86/include/asm/paravirt.h
> +++ b/arch/x86/include/asm/paravirt.h
> @@ -129,34 +129,35 @@ static inline void wbinvd(void)
>
> #define get_kernel_rpl() (pv_info.kernel_rpl)
>
> -static inline u64 paravirt_read_msr(unsigned msr, int *err)
> +static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
> {
> - return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
> + return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
> }
>
> -static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
> +static inline int paravirt_write_msr_safe(unsigned msr,
> + unsigned low, unsigned high)
> {
> - return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
> + return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
> }
>
> /* These should all do BUG_ON(_err), but our headers are too tangled. */
> #define rdmsr(msr, val1, val2) \
> do { \
> int _err; \
> - u64 _l = paravirt_read_msr(msr, &_err); \
> + u64 _l = paravirt_read_msr_safe(msr, &_err); \
> val1 = (u32)_l; \
> val2 = _l >> 32; \
> } while (0)
>
> #define wrmsr(msr, val1, val2) \
> do { \
> - paravirt_write_msr(msr, val1, val2); \
> + paravirt_write_msr_safe(msr, val1, val2); \
> } while (0)
>
> #define rdmsrl(msr, val) \
> do { \
> int _err; \
> - val = paravirt_read_msr(msr, &_err); \
> + val = paravirt_read_msr_safe(msr, &_err); \
> } while (0)
>
> static inline void wrmsrl(unsigned msr, u64 val)
> @@ -164,23 +165,23 @@ static inline void wrmsrl(unsigned msr, u64 val)
> wrmsr(msr, (u32)val, (u32)(val>>32));
> }
>
> -#define wrmsr_safe(msr, a, b) paravirt_write_msr(msr, a, b)
> +#define wrmsr_safe(msr, a, b) paravirt_write_msr_safe(msr, a, b)
>
> /* rdmsr with exception handling */
> -#define rdmsr_safe(msr, a, b) \
> -({ \
> - int _err; \
> - u64 _l = paravirt_read_msr(msr, &_err); \
> - (*a) = (u32)_l; \
> - (*b) = _l >> 32; \
> - _err; \
> +#define rdmsr_safe(msr, a, b) \
> +({ \
> + int _err; \
> + u64 _l = paravirt_read_msr_safe(msr, &_err); \
> + (*a) = (u32)_l; \
> + (*b) = _l >> 32; \
> + _err; \
> })
>
> static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
> {
> int err;
>
> - *p = paravirt_read_msr(msr, &err);
> + *p = paravirt_read_msr_safe(msr, &err);
> return err;
> }
>
> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
> index 77db5616a473..5a06cccd36f0 100644
> --- a/arch/x86/include/asm/paravirt_types.h
> +++ b/arch/x86/include/asm/paravirt_types.h
> @@ -155,10 +155,10 @@ struct pv_cpu_ops {
> void (*cpuid)(unsigned int *eax, unsigned int *ebx,
> unsigned int *ecx, unsigned int *edx);
>
> - /* MSR, PMC and TSR operations.
> - err = 0/-EFAULT. wrmsr returns 0/-EFAULT. */
> - u64 (*read_msr)(unsigned int msr, int *err);
> - int (*write_msr)(unsigned int msr, unsigned low, unsigned high);
> + /* MSR operations.
> + err = 0/-EIO. wrmsr returns 0/-EIO. */
Please fix the comment format while you're touching this:
/*
* A sentence.
* Another sentence.
*/
Other than that:
Acked-by: Borislav Petkov <[email protected]>
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On Sat, Mar 12, 2016 at 10:08:49AM -0800, Andy Lutomirski wrote:
> This demotes an OOPS and likely panic due to a failed non-"safe" MSR
> access to a WARN_ONCE and, for RDMSR, a return value of zero. If
> panic_on_oops is set, then failed unsafe MSR accesses will still
> oops and panic.
>
> To be clear, this type of failure should *not* happen. This patch
> exists to minimize the chance of nasty undebuggable failures on
> systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug.
>
> Signed-off-by: Andy Lutomirski <[email protected]>
> ---
> arch/x86/include/asm/msr.h | 10 ++++++++--
> arch/x86/mm/extable.c | 33 +++++++++++++++++++++++++++++++++
> 2 files changed, 41 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
> index 93fb7c1cffda..1487054a1a70 100644
> --- a/arch/x86/include/asm/msr.h
> +++ b/arch/x86/include/asm/msr.h
> @@ -92,7 +92,10 @@ static inline unsigned long long native_read_msr(unsigned int msr)
> {
> DECLARE_ARGS(val, low, high);
>
> - asm volatile("rdmsr" : EAX_EDX_RET(val, low, high) : "c" (msr));
> + asm volatile("1: rdmsr\n"
> + "2:\n"
> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_rdmsr_unsafe)
> + : EAX_EDX_RET(val, low, high) : "c" (msr));
> if (msr_tracepoint_active(__tracepoint_read_msr))
> do_trace_read_msr(msr, EAX_EDX_VAL(val, low, high), 0);
> return EAX_EDX_VAL(val, low, high);
> @@ -119,7 +122,10 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
> static inline void native_write_msr(unsigned int msr,
> unsigned low, unsigned high)
> {
> - asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
> + asm volatile("1: wrmsr\n"
> + "2:\n"
> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
This might be a good idea:
[ 0.220066] cpuidle: using governor menu
[ 0.224000] ------------[ cut here ]------------
[ 0.224000] WARNING: CPU: 0 PID: 1 at arch/x86/mm/extable.c:74 ex_handler_wrmsr_unsafe+0x73/0x80()
[ 0.224000] unchecked MSR access error: WRMSR to 0xdeadbeef (tried to write 0x000000000000caca)
[ 0.224000] Modules linked in:
[ 0.224000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc7+ #7
[ 0.224000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 0.224000] 0000000000000000 ffff88007c0d7c08 ffffffff812f13a3 ffff88007c0d7c50
[ 0.224000] ffffffff81a40ffe ffff88007c0d7c40 ffffffff8105c3b1 ffffffff81717710
[ 0.224000] ffff88007c0d7d18 0000000000000000 ffffffff816207d0 0000000000000000
[ 0.224000] Call Trace:
[ 0.224000] [<ffffffff812f13a3>] dump_stack+0x67/0x94
[ 0.224000] [<ffffffff8105c3b1>] warn_slowpath_common+0x91/0xd0
[ 0.224000] [<ffffffff816207d0>] ? amd_cpu_notify+0x40/0x40
[ 0.224000] [<ffffffff8105c43c>] warn_slowpath_fmt+0x4c/0x50
[ 0.224000] [<ffffffff816207d0>] ? amd_cpu_notify+0x40/0x40
[ 0.224000] [<ffffffff8131de53>] ? __this_cpu_preempt_check+0x13/0x20
[ 0.224000] [<ffffffff8104efe3>] ex_handler_wrmsr_unsafe+0x73/0x80
and it looks helpful and all but when you do it pretty early - for
example I added a
wrmsrl(0xdeadbeef, 0xcafe);
at the end of pat_bsp_init() and the machine explodes with an early
panic. I'm wondering what is better - early panic or an early #GP from a
missing MSR.
And more specifically, can we do better to handle the early case
gracefully too?
--
Regards/Gruss,
Boris.
ECO tip #101: Trim your mails when you reply.
On 03/12/2016 01:08 PM, Andy Lutomirski wrote:
> Setting CONFIG_PARAVIRT=y has an unintended side effect: it silently
> turns all rdmsr and wrmsr operations into the safe variants without
> any checks that the operations actually succeed.
>
> With CONFIG_PARAVIRT=n, unchecked MSR failures OOPS and probably
> cause boot to fail if they happen before init starts.
>
> Neither behavior is very good, and it's particularly unfortunate that
> the behavior changes depending on CONFIG_PARAVIRT.
>
> In particular, KVM guests might be unwittingly depending on the
> PARAVIRT=y behavior because CONFIG_KVM_GUEST currently depends on
> CONFIG_PARAVIRT, and, because accesses in that case are completely
> unchecked, we wouldn't even see a warning.
>
> This series changes the native behavior, regardless of
> CONFIG_PARAVIRT. A non-"safe" MSR failure will give an informative
> warning once and will be fixed up -- native_read_msr will return
> zero, and both reads and writes will continue where they left off.
>
> If panic_on_oops is set, they will still OOPS and panic.
>
> By using the shiny new custom exception handler infrastructure,
> there should be no overhead on the success paths.
>
> I didn't change the behavior on Xen, but, with this series applied,
> it would be straightforward for the Xen maintainers to make the
> corresponding change -- knowledge of whether the access is "safe" is
> now propagated into the pvops.
>
> Doing this is probably a prerequisite to sanely decoupling
> CONFIG_KVM_GUEST and CONFIG_PARAVIRT, which would probably make
> Arjan and the rest of the Clear Containers people happy :)
>
> There's also room to reduce the code size of the "safe" variants
> using custom exception handlers in the future.
>
> Changes from v3:
> - WARN_ONCE instead of WARN (Ingo)
> - In the warning text, s/unsafe/unchecked/ (Ingo, sort of)
>
> Changes from earlier versions: lots of changes!
>
> Andy Lutomirski (5):
> x86/paravirt: Add _safe to the read_msr and write_msr PV hooks
> x86/msr: Carry on after a non-"safe" MSR access fails without
> !panic_on_oops
> x86/paravirt: Add paravirt_{read,write}_msr
> x86/paravirt: Make "unsafe" MSR accesses unsafe even if PARAVIRT=y
> x86/msr: Set the return value to zero when native_rdmsr_safe fails
>
> arch/x86/include/asm/msr.h | 20 ++++++++++++----
> arch/x86/include/asm/paravirt.h | 45 +++++++++++++++++++++--------------
> arch/x86/include/asm/paravirt_types.h | 14 +++++++----
> arch/x86/kernel/paravirt.c | 6 +++--
> arch/x86/mm/extable.c | 33 +++++++++++++++++++++++++
> arch/x86/xen/enlighten.c | 27 +++++++++++++++++++--
> 6 files changed, 114 insertions(+), 31 deletions(-)
>
I don't see any issues as far as Xen is concerned but let me run this
through our nightly. I'll wait for the next version (which I think you
might have, based on the comments). Please copy me.
-boris
On Mon, Mar 14, 2016 at 5:02 AM, Borislav Petkov <[email protected]> wrote:
> On Sat, Mar 12, 2016 at 10:08:49AM -0800, Andy Lutomirski wrote:
>> This demotes an OOPS and likely panic due to a failed non-"safe" MSR
>> access to a WARN_ONCE and, for RDMSR, a return value of zero. If
>> panic_on_oops is set, then failed unsafe MSR accesses will still
>> oops and panic.
>>
>> To be clear, this type of failure should *not* happen. This patch
>> exists to minimize the chance of nasty undebuggable failures on
>> systems that used to work due to a now-fixed CONFIG_PARAVIRT=y bug.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/msr.h | 10 ++++++++--
>> arch/x86/mm/extable.c | 33 +++++++++++++++++++++++++++++++++
>> 2 files changed, 41 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/msr.h b/arch/x86/include/asm/msr.h
>> index 93fb7c1cffda..1487054a1a70 100644
>> --- a/arch/x86/include/asm/msr.h
>> +++ b/arch/x86/include/asm/msr.h
>> @@ -92,7 +92,10 @@ static inline unsigned long long native_read_msr(unsigned int msr)
>> {
>> DECLARE_ARGS(val, low, high);
>>
>> - asm volatile("rdmsr" : EAX_EDX_RET(val, low, high) : "c" (msr));
>> + asm volatile("1: rdmsr\n"
>> + "2:\n"
>> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_rdmsr_unsafe)
>> + : EAX_EDX_RET(val, low, high) : "c" (msr));
>> if (msr_tracepoint_active(__tracepoint_read_msr))
>> do_trace_read_msr(msr, EAX_EDX_VAL(val, low, high), 0);
>> return EAX_EDX_VAL(val, low, high);
>> @@ -119,7 +122,10 @@ static inline unsigned long long native_read_msr_safe(unsigned int msr,
>> static inline void native_write_msr(unsigned int msr,
>> unsigned low, unsigned high)
>> {
>> - asm volatile("wrmsr" : : "c" (msr), "a"(low), "d" (high) : "memory");
>> + asm volatile("1: wrmsr\n"
>> + "2:\n"
>> + _ASM_EXTABLE_HANDLE(1b, 2b, ex_handler_wrmsr_unsafe)
>
> This might be a good idea:
>
> [ 0.220066] cpuidle: using governor menu
> [ 0.224000] ------------[ cut here ]------------
> [ 0.224000] WARNING: CPU: 0 PID: 1 at arch/x86/mm/extable.c:74 ex_handler_wrmsr_unsafe+0x73/0x80()
> [ 0.224000] unchecked MSR access error: WRMSR to 0xdeadbeef (tried to write 0x000000000000caca)
> [ 0.224000] Modules linked in:
> [ 0.224000] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 4.5.0-rc7+ #7
> [ 0.224000] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
> [ 0.224000] 0000000000000000 ffff88007c0d7c08 ffffffff812f13a3 ffff88007c0d7c50
> [ 0.224000] ffffffff81a40ffe ffff88007c0d7c40 ffffffff8105c3b1 ffffffff81717710
> [ 0.224000] ffff88007c0d7d18 0000000000000000 ffffffff816207d0 0000000000000000
> [ 0.224000] Call Trace:
> [ 0.224000] [<ffffffff812f13a3>] dump_stack+0x67/0x94
> [ 0.224000] [<ffffffff8105c3b1>] warn_slowpath_common+0x91/0xd0
> [ 0.224000] [<ffffffff816207d0>] ? amd_cpu_notify+0x40/0x40
> [ 0.224000] [<ffffffff8105c43c>] warn_slowpath_fmt+0x4c/0x50
> [ 0.224000] [<ffffffff816207d0>] ? amd_cpu_notify+0x40/0x40
> [ 0.224000] [<ffffffff8131de53>] ? __this_cpu_preempt_check+0x13/0x20
> [ 0.224000] [<ffffffff8104efe3>] ex_handler_wrmsr_unsafe+0x73/0x80
>
> and it looks helpful and all but when you do it pretty early - for
> example I added a
>
> wrmsrl(0xdeadbeef, 0xcafe);
>
> at the end of pat_bsp_init() and the machine explodes with an early
> panic. I'm wondering what is better - early panic or an early #GP from a
> missing MSR.
You're hitting:
/* special handling not supported during early boot */
if (handler != ex_handler_default)
return 0;
which means that the behavior with and without my series applied is
identical, for better or for worse.
>
> And more specifically, can we do better to handle the early case
> gracefully too?
We could probably remove that check and let custom fixups run early.
I don't see any compelling reason to keep them disabled. That should
probably be a separate change, though.
--Andy
On Mon, Mar 14, 2016 at 4:57 AM, Borislav Petkov <[email protected]> wrote:
> On Sat, Mar 12, 2016 at 10:08:48AM -0800, Andy Lutomirski wrote:
>> These hooks match the _safe variants, so name them accordingly.
>> This will make room for unsafe PV hooks.
>>
>> Signed-off-by: Andy Lutomirski <[email protected]>
>> ---
>> arch/x86/include/asm/paravirt.h | 33 +++++++++++++++++----------------
>> arch/x86/include/asm/paravirt_types.h | 8 ++++----
>> arch/x86/kernel/paravirt.c | 4 ++--
>> arch/x86/xen/enlighten.c | 4 ++--
>> 4 files changed, 25 insertions(+), 24 deletions(-)
>>
>> diff --git a/arch/x86/include/asm/paravirt.h b/arch/x86/include/asm/paravirt.h
>> index f6192502149e..2e49228ed9a3 100644
>> --- a/arch/x86/include/asm/paravirt.h
>> +++ b/arch/x86/include/asm/paravirt.h
>> @@ -129,34 +129,35 @@ static inline void wbinvd(void)
>>
>> #define get_kernel_rpl() (pv_info.kernel_rpl)
>>
>> -static inline u64 paravirt_read_msr(unsigned msr, int *err)
>> +static inline u64 paravirt_read_msr_safe(unsigned msr, int *err)
>> {
>> - return PVOP_CALL2(u64, pv_cpu_ops.read_msr, msr, err);
>> + return PVOP_CALL2(u64, pv_cpu_ops.read_msr_safe, msr, err);
>> }
>>
>> -static inline int paravirt_write_msr(unsigned msr, unsigned low, unsigned high)
>> +static inline int paravirt_write_msr_safe(unsigned msr,
>> + unsigned low, unsigned high)
>> {
>> - return PVOP_CALL3(int, pv_cpu_ops.write_msr, msr, low, high);
>> + return PVOP_CALL3(int, pv_cpu_ops.write_msr_safe, msr, low, high);
>> }
>>
>> /* These should all do BUG_ON(_err), but our headers are too tangled. */
>> #define rdmsr(msr, val1, val2) \
>> do { \
>> int _err; \
>> - u64 _l = paravirt_read_msr(msr, &_err); \
>> + u64 _l = paravirt_read_msr_safe(msr, &_err); \
>> val1 = (u32)_l; \
>> val2 = _l >> 32; \
>> } while (0)
>>
>> #define wrmsr(msr, val1, val2) \
>> do { \
>> - paravirt_write_msr(msr, val1, val2); \
>> + paravirt_write_msr_safe(msr, val1, val2); \
>> } while (0)
>>
>> #define rdmsrl(msr, val) \
>> do { \
>> int _err; \
>> - val = paravirt_read_msr(msr, &_err); \
>> + val = paravirt_read_msr_safe(msr, &_err); \
>> } while (0)
>>
>> static inline void wrmsrl(unsigned msr, u64 val)
>> @@ -164,23 +165,23 @@ static inline void wrmsrl(unsigned msr, u64 val)
>> wrmsr(msr, (u32)val, (u32)(val>>32));
>> }
>>
>> -#define wrmsr_safe(msr, a, b) paravirt_write_msr(msr, a, b)
>> +#define wrmsr_safe(msr, a, b) paravirt_write_msr_safe(msr, a, b)
>>
>> /* rdmsr with exception handling */
>> -#define rdmsr_safe(msr, a, b) \
>> -({ \
>> - int _err; \
>> - u64 _l = paravirt_read_msr(msr, &_err); \
>> - (*a) = (u32)_l; \
>> - (*b) = _l >> 32; \
>> - _err; \
>> +#define rdmsr_safe(msr, a, b) \
>> +({ \
>> + int _err; \
>> + u64 _l = paravirt_read_msr_safe(msr, &_err); \
>> + (*a) = (u32)_l; \
>> + (*b) = _l >> 32; \
>> + _err; \
>> })
>>
>> static inline int rdmsrl_safe(unsigned msr, unsigned long long *p)
>> {
>> int err;
>>
>> - *p = paravirt_read_msr(msr, &err);
>> + *p = paravirt_read_msr_safe(msr, &err);
>> return err;
>> }
>>
>> diff --git a/arch/x86/include/asm/paravirt_types.h b/arch/x86/include/asm/paravirt_types.h
>> index 77db5616a473..5a06cccd36f0 100644
>> --- a/arch/x86/include/asm/paravirt_types.h
>> +++ b/arch/x86/include/asm/paravirt_types.h
>> @@ -155,10 +155,10 @@ struct pv_cpu_ops {
>> void (*cpuid)(unsigned int *eax, unsigned int *ebx,
>> unsigned int *ecx, unsigned int *edx);
>>
>> - /* MSR, PMC and TSR operations.
>> - err = 0/-EFAULT. wrmsr returns 0/-EFAULT. */
>> - u64 (*read_msr)(unsigned int msr, int *err);
>> - int (*write_msr)(unsigned int msr, unsigned low, unsigned high);
>> + /* MSR operations.
>> + err = 0/-EIO. wrmsr returns 0/-EIO. */
>
> Please fix the comment format while you're touching this:
>
> /*
> * A sentence.
> * Another sentence.
> */
>
> Other than that:
>
> Acked-by: Borislav Petkov <[email protected]>
I fixed this in "x86/paravirt: Add paravirt_{read,write}_msr" later in
the series. Is that good enough?
--Andy
On Mon, Mar 14, 2016 at 10:11 AM, Linus Torvalds
<[email protected]> wrote:
>
> On Mar 14, 2016 10:05 AM, "Andy Lutomirski" <[email protected]> wrote:
>>
>> We could probably remove that check and let custom fixups run early.
>> I don't see any compelling reason to keep them disabled. That should
>> probably be a separate change, though.
>
> Or we could just use the existing wrmsr_safe() code and not add this new
> special code at all.
>
> Look, why are you doing this? We should get rid of the difference between
> wrmsr and wrmsr_safe(), not make it bigger.
>
> Just make everything safe. There has never in the history of anything been
> an advantage to making things oops and to making things more complicated.
>
> Why is rd/wrmsr() so magically important?
Because none of this is actually safe unless the code that calls
whatever_safe actually does something intelligent with the return
value. I think that most code that does rdmsr or wrmsr actually
thinks "I know that this MSR exists and I want to access it" and the
code may actually malfunction if the access fails. So I think we
really do want the warning so we can fix the bugs if they happen. And
I wouldn't be at all surprised if there's a bug or two in perf where
some perf event registers itself incorrectly in some cases because it
messed up the probe sequence. We don't want to totally ignore the
resulting failure of the perf event -- we want to get a warning so
that we know about the bug.
Or suppose we're writing some important but easy-to-miss MSR, like PAT
or one of the excessive number of system call setup MSRs. If the
write fails, the system could easily still appear to work until
something unfortunate happens.
So yes, let's please warn. I'm okay with removing the panic_on_oops
thing though. (But if anyone suggests that we should stop OOPSing on
bad kernel page faults, I *will* fight back.)
--Andy
On Mon, Mar 14, 2016 at 10:17 AM, Andy Lutomirski <[email protected]> wrote:
>
> So yes, let's please warn. I'm okay with removing the panic_on_oops
> thing though. (But if anyone suggests that we should stop OOPSing on
> bad kernel page faults, I *will* fight back.)
Bad kernel page faults are something completely different. They would
be actual bugs regardless.
The MSR thing has *often* been just silly "this CPU doesn't do that
MSR". Generic bootup setup code etc that just didn't know or care
about the particular badly documented rule for one particular random
CPU version and stepping.
In fact, when I say "often", I suspect I should really just say
"always". I don't think we've ever found a case where oopsing would
have been the right thing. But it has definitely caused lots of
problems, especially in the early paths where your code doesn't even
work right now.
Now, when it comes to the warning, I guess I could live with it, but I
think it's stupid to make this a low-level exception handler thing.
So what I think should be done:
- make sure that wr/rdmsr_safe() actually works during very early
init. At some point it didn't.
- get rid of the current wrmsr/rdmsr *entirely*. It's shit.
- Add this wrapper:
#define wrmsr(msr, low, high) \
WARN_ON_ONCE(wrmsr_safe(msr, low, high))
and be done with it. We could even decide to make that WARN_ON_ONCE()
be something we could configure out, because it's really a debugging
thing and isn't like it should be all that fatal.
None of this insane complicated crap that buys us exactly *nothing*,
and depends on fancy new exception handling support etc etc.
So what's the downside to just doing this simple thing?
Linus
On Mon, Mar 14, 2016 at 11:04 AM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Mar 14, 2016 at 10:17 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> So yes, let's please warn. I'm okay with removing the panic_on_oops
>> thing though. (But if anyone suggests that we should stop OOPSing on
>> bad kernel page faults, I *will* fight back.)
>
> Bad kernel page faults are something completely different. They would
> be actual bugs regardless.
>
> The MSR thing has *often* been just silly "this CPU doesn't do that
> MSR". Generic bootup setup code etc that just didn't know or care
> about the particular badly documented rule for one particular random
> CPU version and stepping.
>
> In fact, when I say "often", I suspect I should really just say
> "always". I don't think we've ever found a case where oopsing would
> have been the right thing. But it has definitely caused lots of
> problems, especially in the early paths where your code doesn't even
> work right now.
I can fix that part by literally deleting a few lines of code. I just
need to audit things to make sure that won't break anything.
>
> Now, when it comes to the warning, I guess I could live with it, but I
> think it's stupid to make this a low-level exception handler thing.
>
> So what I think should be done:
>
> - make sure that wr/rdmsr_safe() actually works during very early
> init. At some point it didn't.
>
> - get rid of the current wrmsr/rdmsr *entirely*. It's shit.
>
> - Add this wrapper:
>
> #define wrmsr(msr, low, high) \
> WARN_ON_ONCE(wrmsr_safe(msr, low, high))
>
> and be done with it. We could even decide to make that WARN_ON_ONCE()
> be something we could configure out, because it's really a debugging
> thing and isn't like it should be all that fatal.
>
> None of this insane complicated crap that buys us exactly *nothing*,
> and depends on fancy new exception handling support etc etc.
>
> So what's the downside to just doing this simple thing?
More code size increase and extra branches on fast paths. Using the
fancy new exception handling means we get to have the check and the
warning completely out of line. In fact, I'm tempted to change the
_safe variants to use the fancy new handling as well (once it works in
early boot) to shorten the code size and remove some asm code.
A couple of the wrmsr users actually care about performance. These
are the ones involved in context switching and, to a lesser extent, in
switching in and out of guest mode.
--Andy
On Mon, Mar 14, 2016 at 11:04 AM, Linus Torvalds
<[email protected]> wrote:
>
> None of this insane complicated crap that buys us exactly *nothing*,
> and depends on fancy new exception handling support etc etc.
Actually, the one _new_ thing we could do is to instead of removing
the old crappy rdmsr()/wrmsr() implementation entirely, we'll just
rename it to "rd/wrmsr_unsafe()", and not have any exception table
stuff for that at all (so now you *will* get an oops and panic if you
use that).
The only reason to do that is for performance-critical MSR's that
really don't want any overhead. Sure, they could just be changed to
use "wrmsr_safe()", but for things like the APIC accesses, or
update_debugctlmsr() (that is supposed to check for processor version)
that can be truly critical, an explicitly _unsafe_ version may be the
right thing to do.
The fact is, the problem with rd/wrmsr is that we just did the
defaults the wrong way around. Making the simple and obvious version
be unsafe is just wrong.
Linus
On Mon, Mar 14, 2016 at 11:10 AM, Andy Lutomirski <[email protected]> wrote:
>
> A couple of the wrmsr users actually care about performance. These
> are the ones involved in context switching and, to a lesser extent, in
> switching in and out of guest mode.
.. ok, see the crossed emails.
I'd *much* rather special case the special cases. Not make the generic
case something complex.
And *particularly* not make the generic case be something where people
think it's sane to oops and kill the machine. Christ.
Linus
On Mon, Mar 14, 2016 at 11:15 AM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Mar 14, 2016 at 11:10 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> A couple of the wrmsr users actually care about performance. These
>> are the ones involved in context switching and, to a lesser extent, in
>> switching in and out of guest mode.
>
> .. ok, see the crossed emails.
>
> I'd *much* rather special case the special cases. Not make the generic
> case something complex.
The code in my queue is, literally:
bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
struct pt_regs *regs, int trapnr)
{
WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
(unsigned int)regs->cx);
/* Pretend that the read succeeded and returned 0. */
regs->ip = ex_fixup_addr(fixup);
regs->ax = 0;
regs->dx = 0;
return true;
}
EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
The only regard in which this is any more complex than the _safe
variant is that there's a warning (one line of code) and that
the _safe variant forgot to zero the result (two lines of code). My
series fixes the latter, so we're talking about almost no source code
size difference. There *is* a difference in binary size, though --
the _safe variant emits a copy of its fixup every time it appears,
whereas the new fixup appears once.
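(Concretely, the per-call-site cost is the .fixup stub visible in
patch 5's context:

".section .fixup,\"ax\"\n\t"
"3: mov %[fault],%[err] ; jmp 1b\n\t"
".previous\n\t"

which gets emitted again every time the inline expands, versus the one
shared ex_handler_rdmsr_unsafe above.)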
So I think we should apply my patches (with the early handling fixed
and the panic_on_oops removed), and then consider reimplementing the
_safe variant using fancy handlers to reduce number of lines of asm
and code size, and then consider changing the overall API on top if we
think there's a better API to be had.
Is that okay?
>
> And *particularly* not make the generic case be something where people
> think it's sane to oops and kill the machine. Christ.
I've already removed the panic_on_oops thing from my tree.
--Andy
On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirski <[email protected]> wrote:
>
> The code in my queue is, literally:
>
> bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
> struct pt_regs *regs, int trapnr)
> {
> WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
> (unsigned int)regs->cx);
>
> /* Pretend that the read succeeded and returned 0. */
> regs->ip = ex_fixup_addr(fixup);
> regs->ax = 0;
> regs->dx = 0;
> return true;
> }
> EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
I guess I can live with this, as long as we also extend the
early-fault handling to work with the special exception handlers.
And as long as people start understanding that killing the machine is
a bad bad bad thing. It's a debugging nightmare.
Linus
On Mon, Mar 14, 2016 at 11:40 AM, Linus Torvalds
<[email protected]> wrote:
> On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirski <[email protected]> wrote:
>>
>> The code in my queue is, literally:
>>
>> bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
>> struct pt_regs *regs, int trapnr)
>> {
>> WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
>> (unsigned int)regs->cx);
>>
>> /* Pretend that the read succeeded and returned 0. */
>> regs->ip = ex_fixup_addr(fixup);
>> regs->ax = 0;
>> regs->dx = 0;
>> return true;
>> }
>> EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
>
> I guess I can live with this, as long as we also extend the
> early-fault handling to work with the special exception handlers.
OK, will do. I need to rework the early IDT code a bit so it
generates a real pt_regs layout, but that's arguably a cleanup anyway.
--Andy
On Mon, Mar 14, 2016 at 11:10:16AM -0700, Andy Lutomirski wrote:
> A couple of the wrmsr users actually care about performance. These
> are the ones involved in context switching and, to a lesser extent, in
> switching in and out of guest mode.
Right, this very much includes a number of perf MSRs. Some of those get
called from the context switch path, others from the NMI handler.
* Linus Torvalds <[email protected]> wrote:
> On Mon, Mar 14, 2016 at 10:17 AM, Andy Lutomirski <[email protected]> wrote:
> >
> > So yes, let's please warn. I'm okay with removing the panic_on_oops
> > thing though. (But if anyone suggests that we should stop OOPSing on
> > bad kernel page faults, I *will* fight back.)
>
> Bad kernel page faults are something completely different. They would
> be actual bugs regardless.
>
> The MSR thing has *often* been just silly "this CPU doesn't do that
> MSR". Generic bootup setup code etc that just didn't know or care
> about the particular badly documented rule for one particular random
> CPU version and stepping.
>
> In fact, when I say "often", I suspect I should really just say
> "always". I don't think we've ever found a case where oopsing would
> have been the right thing. But it has definitely caused lots of
> problems, especially in the early paths where your code doesn't even
> work right now.
>
> Now, when it comes to the warning, I guess I could live with it, but I
> think it's stupid to make this a low-level exception handler thing.
>
> So what I think should be done:
>
> - make sure that wr/rdmsr_safe() actually works during very early
> init. At some point it didn't.
>
> - get rid of the current wrmsr/rdmsr *entirely*. It's shit.
>
> - Add this wrapper:
>
> #define wrmsr(msr, low, high) \
> WARN_ON_ONCE(wrmsr_safe(msr, low, high))
>
> and be done with it. We could even decide to make that WARN_ON_ONCE()
> be something we could configure out, because it's really a debugging
> thing and isn't like it should be all that fatal.
>
> None of this insane complicated crap that buys us exactly *nothing*,
> and depends on fancy new exception handling support etc etc.
>
> So what's the downside to just doing this simple thing?
This was my original suggestion as well.
The objection was that sometimes it's performance critical. To which I replied
that 1) I highly doubt that a single extra branch for the check is measurable,
and 2) even if so, _those_ performance-critical cases should be done in some
special way (rdmsr_nocheck() or whatever) - at which point the argument was that
there are a lot of such cases.
Which final line of argument I still don't really buy, btw., but it became a
me-against-everyone-else argument and I cowardly punted.
Now that the 800 lb gorilla is on my side I of course stand my ground, principles
are principles!
Thanks,
Ingo
* Andy Lutomirski <[email protected]> wrote:
> On Mon, Mar 14, 2016 at 11:40 AM, Linus Torvalds
> <[email protected]> wrote:
> > On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirski <[email protected]> wrote:
> >>
> >> The code in my queue is, literally:
> >>
> >> bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
> >> struct pt_regs *regs, int trapnr)
> >> {
> >> WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
> >> (unsigned int)regs->cx);
> >>
> >> /* Pretend that the read succeeded and returned 0. */
> >> regs->ip = ex_fixup_addr(fixup);
> >> regs->ax = 0;
> >> regs->dx = 0;
> >> return true;
> >> }
> >> EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
> >
> > I guess I can live with this, as long as we also extend the
> > early-fault handling to work with the special exception handlers.
>
> OK, will do. I need to rework the early IDT code a bit so it
> generates a real pt_regs layout, but that's arguably a cleanup anyway.
Ok, with that I'm pretty happy about it as well.
Thanks,
Ingo
* Ingo Molnar <[email protected]> wrote:
> * Andy Lutomirski <[email protected]> wrote:
>
> > On Mon, Mar 14, 2016 at 11:40 AM, Linus Torvalds
> > <[email protected]> wrote:
> > > On Mon, Mar 14, 2016 at 11:24 AM, Andy Lutomirski <[email protected]> wrote:
> > >>
> > >> The code in my queue is, literally:
> > >>
> > >> bool ex_handler_rdmsr_unsafe(const struct exception_table_entry *fixup,
> > >> struct pt_regs *regs, int trapnr)
> > >> {
> > >> WARN_ONCE(1, "unchecked MSR access error: RDMSR from 0x%x",
> > >> (unsigned int)regs->cx);
> > >>
> > >> /* Pretend that the read succeeded and returned 0. */
> > >> regs->ip = ex_fixup_addr(fixup);
> > >> regs->ax = 0;
> > >> regs->dx = 0;
> > >> return true;
> > >> }
> > >> EXPORT_SYMBOL(ex_handler_rdmsr_unsafe);
> > >
> > > I guess I can live with this, as long as we also extend the
> > > early-fault handling to work with the special exception handlers.
> >
> > OK, will do. I need to rework the early IDT code a bit so it
> > generates a real pt_regs layout, but that's arguably a cleanup anyway.
>
> Ok, with that I'm pretty happy about it as well.
Note, I think it's pretty clear at this point that this is v4.7 material.
Thanks,
Ingo