2015-12-11 19:32:30

by Luck, Tony

[permalink] [raw]
Subject: [PATCHV2 0/3] Machine check recovery when kernel accesses poison

This series is initially targeted at the folks doing filesystems
on top of NVDIMMs. They really want to be able to return -EIO
when there is a h/w error (just like spinning rust and SSDs do).

I plan to use the same infrastructure in parts 1&2 to write a
machine check aware "copy_from_user()" that will SIGBUS the
calling application when a syscall touches poison in user space
(just like we do when the application touches the poison itself).
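
To make the intended use concrete, here is a rough sketch (not part of
this series, just an illustration) of how a pmem or filesystem read path
might consume the mcsafe_memcpy() interface added in part 3. The
pmem_safe_read() name and the message are made up, and the __user
annotation that the prototype inherits from copy_user_64.S is glossed
over with a cast:

	static int pmem_safe_read(void *dst, void *pmem_addr, unsigned size)
	{
		u64 ret = mcsafe_memcpy(dst, (const void __user *)pmem_addr, size);

		if (COPY_HAD_MCHECK(ret)) {
			pr_err("poison at physical address %llx\n",
			       COPY_MCHECK_PADDR(ret));
			return -EIO;	/* same answer that spinning rust gives */
		}
		return 0;
	}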

Changes V1->V2:

0-day: Reported build errors and warnings on 32-bit systems. Fixed
0-day: Reported bloat to tinyconfig. Fixed
Boris: Suggestions to use extra macros to reduce code duplication in _ASM_*EXTABLE. Done
Boris: Re-write "tolerant==3" check to reduce indentation level. See below.
Andy: Check IP is valid before searching kernel exception tables. Done.
Andy: Explain use of BIT(63) on return value from mcsafe_memcpy(). Done (added decode macros).
Andy: Untangle mess of code in tail of do_machine_check() to make it
clear what is going on (e.g. that we only enter the ist_begin_non_atomic()
if we were called from user code, not from kernel!). Done

Tony Luck (3):
x86, ras: Add new infrastructure for machine check fixup tables
x86, ras: Extend machine check recovery code to annotated ring0
areas
x86, ras: Add mcsafe_memcpy() function to recover from machine
checks

arch/x86/Kconfig | 4 ++
arch/x86/include/asm/asm.h | 10 +++-
arch/x86/include/asm/uaccess.h | 8 +++
arch/x86/include/asm/uaccess_64.h | 5 ++
arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++-
arch/x86/kernel/cpu/mcheck/mce.c | 69 +++++++++++------------
arch/x86/kernel/x8664_ksyms_64.c | 2 +
arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++
arch/x86/mm/extable.c | 19 +++++++
include/asm-generic/vmlinux.lds.h | 6 ++
include/linux/module.h | 1 +
kernel/extable.c | 20 +++++++
12 files changed, 219 insertions(+), 38 deletions(-)

--
2.1.4


2015-12-11 19:32:18

by Luck, Tony

[permalink] [raw]
Subject: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

Copy the existing page fault fixup mechanisms to create a new table
to be used when fixing machine checks. Note:
1) At this time we only provide a macro to annotate assembly code
2) We assume all fixups will be in code built into the kernel.
3) Only for x86_64
4) New code under CONFIG_MCE_KERNEL_RECOVERY

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/Kconfig | 4 ++++
arch/x86/include/asm/asm.h | 10 ++++++++--
arch/x86/include/asm/uaccess.h | 8 ++++++++
arch/x86/mm/extable.c | 19 +++++++++++++++++++
include/asm-generic/vmlinux.lds.h | 6 ++++++
include/linux/module.h | 1 +
kernel/extable.c | 20 ++++++++++++++++++++
7 files changed, 66 insertions(+), 2 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 96d058a87100..db5c6e1d6e37 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -1001,6 +1001,10 @@ config X86_MCE_INJECT
If you don't know what a machine check is and you don't do kernel
QA it is safe to say n.

+config MCE_KERNEL_RECOVERY
+ depends on X86_MCE && X86_64
+ def_bool y
+
config X86_THERMAL_VECTOR
def_bool y
depends on X86_MCE_INTEL
diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
index 189679aba703..a5d483ac11fa 100644
--- a/arch/x86/include/asm/asm.h
+++ b/arch/x86/include/asm/asm.h
@@ -44,13 +44,19 @@

/* Exception table entry */
#ifdef __ASSEMBLY__
-# define _ASM_EXTABLE(from,to) \
- .pushsection "__ex_table","a" ; \
+# define __ASM_EXTABLE(from, to, table) \
+ .pushsection table, "a" ; \
.balign 8 ; \
.long (from) - . ; \
.long (to) - . ; \
.popsection

+# define _ASM_EXTABLE(from, to) \
+ __ASM_EXTABLE(from, to, "__ex_table")
+
+# define _ASM_MCEXTABLE(from, to) \
+ __ASM_EXTABLE(from, to, "__mcex_table")
+
# define _ASM_EXTABLE_EX(from,to) \
.pushsection "__ex_table","a" ; \
.balign 8 ; \
diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
index a8df874f3e88..7b02ca1991b4 100644
--- a/arch/x86/include/asm/uaccess.h
+++ b/arch/x86/include/asm/uaccess.h
@@ -111,6 +111,14 @@ struct exception_table_entry {
#define ARCH_HAS_SEARCH_EXTABLE

extern int fixup_exception(struct pt_regs *regs);
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+extern int fixup_mcexception(struct pt_regs *regs, u64 addr);
+#else
+static inline int fixup_mcexception(struct pt_regs *regs, u64 addr)
+{
+ return 0;
+}
+#endif
extern int early_fixup_exception(unsigned long *ip);

/*
diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
index 903ec1e9c326..a461c4212758 100644
--- a/arch/x86/mm/extable.c
+++ b/arch/x86/mm/extable.c
@@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs)
return 0;
}

+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+int fixup_mcexception(struct pt_regs *regs, u64 addr)
+{
+ const struct exception_table_entry *fixup;
+ unsigned long new_ip;
+
+ fixup = search_mcexception_tables(regs->ip);
+ if (fixup) {
+ new_ip = ex_fixup_addr(fixup);
+
+ regs->ip = new_ip;
+ regs->ax = BIT(63) | addr;
+ return 1;
+ }
+
+ return 0;
+}
+#endif
+
/* Restricted version used during very early boot */
int __init early_fixup_exception(unsigned long *ip)
{
diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
index 1781e54ea6d3..21bb20d1172a 100644
--- a/include/asm-generic/vmlinux.lds.h
+++ b/include/asm-generic/vmlinux.lds.h
@@ -473,6 +473,12 @@
VMLINUX_SYMBOL(__start___ex_table) = .; \
*(__ex_table) \
VMLINUX_SYMBOL(__stop___ex_table) = .; \
+ } \
+ . = ALIGN(align); \
+ __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
+ VMLINUX_SYMBOL(__start___mcex_table) = .; \
+ *(__mcex_table) \
+ VMLINUX_SYMBOL(__stop___mcex_table) = .; \
}

/*
diff --git a/include/linux/module.h b/include/linux/module.h
index 3a19c79918e0..ffecbfcc462c 100644
--- a/include/linux/module.h
+++ b/include/linux/module.h
@@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \

/* Given an address, look for it in the exception tables */
const struct exception_table_entry *search_exception_tables(unsigned long add);
+const struct exception_table_entry *search_mcexception_tables(unsigned long a);

struct notifier_block;

diff --git a/kernel/extable.c b/kernel/extable.c
index e820ccee9846..7b224fbcb708 100644
--- a/kernel/extable.c
+++ b/kernel/extable.c
@@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex);

extern struct exception_table_entry __start___ex_table[];
extern struct exception_table_entry __stop___ex_table[];
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+extern struct exception_table_entry __start___mcex_table[];
+extern struct exception_table_entry __stop___mcex_table[];
+#endif

/* Cleared by build time tools if the table is already sorted. */
u32 __initdata __visible main_extable_sort_needed = 1;
@@ -45,6 +49,10 @@ void __init sort_main_extable(void)
pr_notice("Sorting __ex_table...\n");
sort_extable(__start___ex_table, __stop___ex_table);
}
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+ if (__stop___mcex_table > __start___mcex_table)
+ sort_extable(__start___mcex_table, __stop___mcex_table);
+#endif
}

/* Given an address, look for it in the exception tables. */
@@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
return e;
}

+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+/* Given an address, look for it in the machine check exception tables. */
+const struct exception_table_entry *search_mcexception_tables(
+ unsigned long addr)
+{
+ const struct exception_table_entry *e;
+
+ e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr);
+ return e;
+}
+#endif
+
static inline int init_kernel_text(unsigned long addr)
{
if (addr >= (unsigned long)_sinittext &&
--
2.1.4

2015-12-11 19:32:51

by Luck, Tony

[permalink] [raw]
Subject: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas

Extend the severity checking code to add a new context IN_KERNEL_RECOV
which is used to indicate that the machine check was triggered by code
in the kernel with a fixup entry.

Add code to check for this situation and respond by altering the return
IP to the fixup address and changing the regs->ax so that the recovery
code knows the physical address of the error. Note that we also set bit
63 because 0x0 is a legal physical address.

Major re-work to the tail code in do_machine_check() to make all this
readable/maintainable. One functional change is that tolerant=3 no longer
stops recovery actions. Revert to only skipping sending SIGBUS to the
current process.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++-
arch/x86/kernel/cpu/mcheck/mce.c | 69 ++++++++++++++++---------------
2 files changed, 55 insertions(+), 36 deletions(-)

diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
index 9c682c222071..ac7fbb0689fb 100644
--- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
+++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
@@ -12,6 +12,7 @@
#include <linux/kernel.h>
#include <linux/seq_file.h>
#include <linux/init.h>
+#include <linux/module.h>
#include <linux/debugfs.h>
#include <asm/mce.h>

@@ -29,7 +30,7 @@
* panic situations)
*/

-enum context { IN_KERNEL = 1, IN_USER = 2 };
+enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
enum ser { SER_REQUIRED = 1, NO_SER = 2 };
enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };

@@ -48,6 +49,7 @@ static struct severity {
#define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
#define KERNEL .context = IN_KERNEL
#define USER .context = IN_USER
+#define KERNEL_RECOV .context = IN_KERNEL_RECOV
#define SER .ser = SER_REQUIRED
#define NOSER .ser = NO_SER
#define EXCP .excp = EXCP_CONTEXT
@@ -87,6 +89,10 @@ static struct severity {
EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
),
MCESEV(
+ PANIC, "In kernel and no restart IP",
+ EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
+ ),
+ MCESEV(
DEFERRED, "Deferred error",
NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED)
),
@@ -123,6 +129,11 @@ static struct severity {
MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
),
MCESEV(
+ AR, "Action required: data load error recoverable area of kernel",
+ SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
+ KERNEL_RECOV
+ ),
+ MCESEV(
AR, "Action required: data load error in a user process",
SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
USER
@@ -170,6 +181,9 @@ static struct severity {
) /* always matches. keep at end */
};

+#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
+ (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
+
/*
* If mcgstatus indicated that ip/cs on the stack were
* no good, then "m->cs" will be zero and we will have
@@ -183,7 +197,11 @@ static struct severity {
*/
static int error_context(struct mce *m)
{
- return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
+ if ((m->cs & 3) == 3)
+ return IN_USER;
+ if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip))
+ return IN_KERNEL_RECOV;
+ return IN_KERNEL;
}

/*
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index 9d014b82a124..f2f568ad6409 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -31,6 +31,7 @@
#include <linux/types.h>
#include <linux/slab.h>
#include <linux/init.h>
+#include <linux/module.h>
#include <linux/kmod.h>
#include <linux/poll.h>
#include <linux/nmi.h>
@@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear)
}
}

+static int do_memory_failure(struct mce *m)
+{
+ int flags = MF_ACTION_REQUIRED;
+ int ret;
+
+ pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
+ if (!(m->mcgstatus & MCG_STATUS_RIPV))
+ flags |= MF_MUST_KILL;
+ ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags);
+ if (ret)
+ pr_err("Memory error not recovered");
+ return ret;
+}
+
/*
* The actual machine check handler. This only handles real
* exceptions when something got corrupted coming in through int 18.
@@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
DECLARE_BITMAP(toclear, MAX_NR_BANKS);
DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
char *msg = "Unknown";
- u64 recover_paddr = ~0ull;
- int flags = MF_ACTION_REQUIRED;
int lmce = 0;

ist_enter(regs);
@@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
}

/*
- * At insane "tolerant" levels we take no action. Otherwise
- * we only die if we have no other choice. For less serious
- * issues we try to recover, or limit damage to the current
- * process.
+ * If tolerant is at an insane level we drop requests to kill
+ * processes and continue even when there is no way out
*/
- if (cfg->tolerant < 3) {
- if (no_way_out)
- mce_panic("Fatal machine check on current CPU", &m, msg);
- if (worst == MCE_AR_SEVERITY) {
- recover_paddr = m.addr;
- if (!(m.mcgstatus & MCG_STATUS_RIPV))
- flags |= MF_MUST_KILL;
- } else if (kill_it) {
- force_sig(SIGBUS, current);
- }
- }
+ if (cfg->tolerant == 3)
+ kill_it = 0;
+ else if (no_way_out)
+ mce_panic("Fatal machine check on current CPU", &m, msg);

if (worst > 0)
mce_report_event(regs);
@@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
out:
sync_core();

- if (recover_paddr == ~0ull)
- goto done;
+ /* Fault was in user mode and we need to take some action */
+ if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) {
+ ist_begin_non_atomic(regs);
+ local_irq_enable();

- pr_err("Uncorrected hardware memory error in user-access at %llx",
- recover_paddr);
- /*
- * We must call memory_failure() here even if the current process is
- * doomed. We still need to mark the page as poisoned and alert any
- * other users of the page.
- */
- ist_begin_non_atomic(regs);
- local_irq_enable();
- if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
- pr_err("Memory error not recovered");
- force_sig(SIGBUS, current);
+ if (kill_it || do_memory_failure(&m))
+ force_sig(SIGBUS, current);
+ local_irq_disable();
+ ist_end_non_atomic();
}
- local_irq_disable();
- ist_end_non_atomic();
-done:
+
+ /* Fault was in recoverable area of the kernel */
+ if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
+ if (!fixup_mcexception(regs, m.addr))
+ mce_panic("Failed kernel mode recovery", &m, NULL);
+
ist_exit(regs);
}
EXPORT_SYMBOL_GPL(do_machine_check);
--
2.1.4

2015-12-11 19:32:41

by Luck, Tony

[permalink] [raw]
Subject: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

Using __copy_user_nocache() as inspiration create a memory copy
routine for use by kernel code with annotations to allow for
recovery from machine checks.

Notes:
1) Unlike the original we make no attempt to copy all the bytes
up to the faulting address. The original achieves that by
re-executing the failing part as a byte-by-byte copy,
which will take another page fault. We don't want to have
a second machine check!
2) Likewise the return value for the original indicates exactly
how many bytes were not copied. Instead we provide the physical
address of the fault (thanks to help from do_machine_check()).
3) Provide helpful macros to decode the return value.

Signed-off-by: Tony Luck <[email protected]>
---
arch/x86/include/asm/uaccess_64.h | 5 +++
arch/x86/kernel/x8664_ksyms_64.c | 2 +
arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++
3 files changed, 98 insertions(+)

diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
index f2f9b39b274a..779cb0e77ecc 100644
--- a/arch/x86/include/asm/uaccess_64.h
+++ b/arch/x86/include/asm/uaccess_64.h
@@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
extern long __copy_user_nocache(void *dst, const void __user *src,
unsigned size, int zerorest);

+extern u64 mcsafe_memcpy(void *dst, const void __user *src,
+ unsigned size);
+#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63))
+#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63))
+
static inline int
__copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
{
diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
index a0695be19864..ec988c92c055 100644
--- a/arch/x86/kernel/x8664_ksyms_64.c
+++ b/arch/x86/kernel/x8664_ksyms_64.c
@@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
EXPORT_SYMBOL(_copy_from_user);
EXPORT_SYMBOL(_copy_to_user);

+EXPORT_SYMBOL(mcsafe_memcpy);
+
EXPORT_SYMBOL(copy_page);
EXPORT_SYMBOL(clear_page);

diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
index 982ce34f4a9b..ffce93cbc9a5 100644
--- a/arch/x86/lib/copy_user_64.S
+++ b/arch/x86/lib/copy_user_64.S
@@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache)
_ASM_EXTABLE(21b,50b)
_ASM_EXTABLE(22b,50b)
ENDPROC(__copy_user_nocache)
+
+/*
+ * mcsafe_memcpy - Uncached memory copy with machine check exception handling
+ * Note that we only catch machine checks when reading the source addresses.
+ * Writes to target are posted and don't generate machine checks.
+ * This will force destination/source out of cache for more performance.
+ */
+ENTRY(mcsafe_memcpy)
+ cmpl $8,%edx
+ jb 20f /* less than 8 bytes, go to byte copy loop */
+
+ /* check for bad alignment of destination */
+ movl %edi,%ecx
+ andl $7,%ecx
+ jz 102f /* already aligned */
+ subl $8,%ecx
+ negl %ecx
+ subl %ecx,%edx
+0: movb (%rsi),%al
+ movb %al,(%rdi)
+ incq %rsi
+ incq %rdi
+ decl %ecx
+ jnz 0b
+102:
+ movl %edx,%ecx
+ andl $63,%edx
+ shrl $6,%ecx
+ jz 17f
+1: movq (%rsi),%r8
+2: movq 1*8(%rsi),%r9
+3: movq 2*8(%rsi),%r10
+4: movq 3*8(%rsi),%r11
+ movnti %r8,(%rdi)
+ movnti %r9,1*8(%rdi)
+ movnti %r10,2*8(%rdi)
+ movnti %r11,3*8(%rdi)
+9: movq 4*8(%rsi),%r8
+10: movq 5*8(%rsi),%r9
+11: movq 6*8(%rsi),%r10
+12: movq 7*8(%rsi),%r11
+ movnti %r8,4*8(%rdi)
+ movnti %r9,5*8(%rdi)
+ movnti %r10,6*8(%rdi)
+ movnti %r11,7*8(%rdi)
+ leaq 64(%rsi),%rsi
+ leaq 64(%rdi),%rdi
+ decl %ecx
+ jnz 1b
+17: movl %edx,%ecx
+ andl $7,%edx
+ shrl $3,%ecx
+ jz 20f
+18: movq (%rsi),%r8
+ movnti %r8,(%rdi)
+ leaq 8(%rsi),%rsi
+ leaq 8(%rdi),%rdi
+ decl %ecx
+ jnz 18b
+20: andl %edx,%edx
+ jz 23f
+ movl %edx,%ecx
+21: movb (%rsi),%al
+ movb %al,(%rdi)
+ incq %rsi
+ incq %rdi
+ decl %ecx
+ jnz 21b
+23: xorl %eax,%eax
+ sfence
+ ret
+
+ .section .fixup,"ax"
+30:
+ sfence
+ /* do_machine_check() sets %eax return value */
+ ret
+ .previous
+
+ _ASM_MCEXTABLE(0b,30b)
+ _ASM_MCEXTABLE(1b,30b)
+ _ASM_MCEXTABLE(2b,30b)
+ _ASM_MCEXTABLE(3b,30b)
+ _ASM_MCEXTABLE(4b,30b)
+ _ASM_MCEXTABLE(9b,30b)
+ _ASM_MCEXTABLE(10b,30b)
+ _ASM_MCEXTABLE(11b,30b)
+ _ASM_MCEXTABLE(12b,30b)
+ _ASM_MCEXTABLE(18b,30b)
+ _ASM_MCEXTABLE(21b,30b)
+ENDPROC(mcsafe_memcpy)
--
2.1.4

2015-12-11 20:07:05

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Thu, Dec 10, 2015 at 1:58 PM, Tony Luck <[email protected]> wrote:
> Copy the existing page fault fixup mechanisms to create a new table
> to be used when fixing machine checks. Note:
> 1) At this time we only provide a macro to annotate assembly code
> 2) We assume all fixups will be in code built into the kernel.
> 3) Only for x86_64
> 4) New code under CONFIG_MCE_KERNEL_RECOVERY
>
> Signed-off-by: Tony Luck <[email protected]>
> ---

> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +int fixup_mcexception(struct pt_regs *regs, u64 addr)
> +{
> + const struct exception_table_entry *fixup;
> + unsigned long new_ip;
> +
> + fixup = search_mcexception_tables(regs->ip);
> + if (fixup) {
> + new_ip = ex_fixup_addr(fixup);
> +
> + regs->ip = new_ip;
> + regs->ax = BIT(63) | addr;

Can this be an actual #define?

--Andy

2015-12-11 20:08:26

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas

On Thu, Dec 10, 2015 at 4:14 PM, Tony Luck <[email protected]> wrote:
> Extend the severity checking code to add a new context IN_KERNEL_RECOV
> which is used to indicate that the machine check was triggered by code
> in the kernel with a fixup entry.
>
> Add code to check for this situation and respond by altering the return
> IP to the fixup address and changing the regs->ax so that the recovery
> code knows the physical address of the error. Note that we also set bit
> 63 because 0x0 is a legal physical address.
>
> Major re-work to the tail code in do_machine_check() to make all this
> readable/maintainable. One functional change is that tolerant=3 no longer
> stops recovery actions. Revert to only skipping sending SIGBUS to the
> current process.

This is IMO much, much nicer than the old code. Thanks!

--Andy

2015-12-11 20:09:31

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Thu, Dec 10, 2015 at 4:21 PM, Tony Luck <[email protected]> wrote:
> Using __copy_user_nocache() as inspiration create a memory copy
> routine for use by kernel code with annotations to allow for
> recovery from machine checks.
>
> Notes:
> 1) Unlike the original we make no attempt to copy all the bytes
> up to the faulting address. The original achieves that by
> re-executing the failing part as a byte-by-byte copy,
> which will take another page fault. We don't want to have
> a second machine check!
> 2) Likewise the return value for the original indicates exactly
> how many bytes were not copied. Instead we provide the physical
> address of the fault (thanks to help from do_machine_check()).
> 3) Provide helpful macros to decode the return value.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/uaccess_64.h | 5 +++
> arch/x86/kernel/x8664_ksyms_64.c | 2 +
> arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++
> 3 files changed, 98 insertions(+)
>
> diff --git a/arch/x86/include/asm/uaccess_64.h b/arch/x86/include/asm/uaccess_64.h
> index f2f9b39b274a..779cb0e77ecc 100644
> --- a/arch/x86/include/asm/uaccess_64.h
> +++ b/arch/x86/include/asm/uaccess_64.h
> @@ -216,6 +216,11 @@ __copy_to_user_inatomic(void __user *dst, const void *src, unsigned size)
> extern long __copy_user_nocache(void *dst, const void __user *src,
> unsigned size, int zerorest);
>
> +extern u64 mcsafe_memcpy(void *dst, const void __user *src,
> + unsigned size);
> +#define COPY_HAD_MCHECK(ret) ((ret) & BIT(63))
> +#define COPY_MCHECK_PADDR(ret) ((ret) & ~BIT(63))
> +
> static inline int
> __copy_from_user_nocache(void *dst, const void __user *src, unsigned size)
> {
> diff --git a/arch/x86/kernel/x8664_ksyms_64.c b/arch/x86/kernel/x8664_ksyms_64.c
> index a0695be19864..ec988c92c055 100644
> --- a/arch/x86/kernel/x8664_ksyms_64.c
> +++ b/arch/x86/kernel/x8664_ksyms_64.c
> @@ -37,6 +37,8 @@ EXPORT_SYMBOL(__copy_user_nocache);
> EXPORT_SYMBOL(_copy_from_user);
> EXPORT_SYMBOL(_copy_to_user);
>
> +EXPORT_SYMBOL(mcsafe_memcpy);
> +
> EXPORT_SYMBOL(copy_page);
> EXPORT_SYMBOL(clear_page);
>
> diff --git a/arch/x86/lib/copy_user_64.S b/arch/x86/lib/copy_user_64.S
> index 982ce34f4a9b..ffce93cbc9a5 100644
> --- a/arch/x86/lib/copy_user_64.S
> +++ b/arch/x86/lib/copy_user_64.S
> @@ -319,3 +319,94 @@ ENTRY(__copy_user_nocache)
> _ASM_EXTABLE(21b,50b)
> _ASM_EXTABLE(22b,50b)
> ENDPROC(__copy_user_nocache)
> +
> +/*
> + * mcsafe_memcpy - Uncached memory copy with machine check exception handling
> + * Note that we only catch machine checks when reading the source addresses.
> + * Writes to target are posted and don't generate machine checks.
> + * This will force destination/source out of cache for more performance.
> + */
> +ENTRY(mcsafe_memcpy)
> + cmpl $8,%edx
> + jb 20f /* less than 8 bytes, go to byte copy loop */
> +
> + /* check for bad alignment of destination */
> + movl %edi,%ecx
> + andl $7,%ecx
> + jz 102f /* already aligned */
> + subl $8,%ecx
> + negl %ecx
> + subl %ecx,%edx
> +0: movb (%rsi),%al
> + movb %al,(%rdi)
> + incq %rsi
> + incq %rdi
> + decl %ecx
> + jnz 0b
> +102:
> + movl %edx,%ecx
> + andl $63,%edx
> + shrl $6,%ecx
> + jz 17f
> +1: movq (%rsi),%r8
> +2: movq 1*8(%rsi),%r9
> +3: movq 2*8(%rsi),%r10
> +4: movq 3*8(%rsi),%r11
> + movnti %r8,(%rdi)
> + movnti %r9,1*8(%rdi)
> + movnti %r10,2*8(%rdi)
> + movnti %r11,3*8(%rdi)
> +9: movq 4*8(%rsi),%r8
> +10: movq 5*8(%rsi),%r9
> +11: movq 6*8(%rsi),%r10
> +12: movq 7*8(%rsi),%r11
> + movnti %r8,4*8(%rdi)
> + movnti %r9,5*8(%rdi)
> + movnti %r10,6*8(%rdi)
> + movnti %r11,7*8(%rdi)
> + leaq 64(%rsi),%rsi
> + leaq 64(%rdi),%rdi
> + decl %ecx
> + jnz 1b
> +17: movl %edx,%ecx
> + andl $7,%edx
> + shrl $3,%ecx
> + jz 20f
> +18: movq (%rsi),%r8
> + movnti %r8,(%rdi)
> + leaq 8(%rsi),%rsi
> + leaq 8(%rdi),%rdi
> + decl %ecx
> + jnz 18b
> +20: andl %edx,%edx
> + jz 23f
> + movl %edx,%ecx
> +21: movb (%rsi),%al
> + movb %al,(%rdi)
> + incq %rsi
> + incq %rdi
> + decl %ecx
> + jnz 21b
> +23: xorl %eax,%eax
> + sfence
> + ret
> +
> + .section .fixup,"ax"
> +30:
> + sfence
> + /* do_machine_check() sets %eax return value */
> + ret
> + .previous
> +
> + _ASM_MCEXTABLE(0b,30b)
> + _ASM_MCEXTABLE(1b,30b)
> + _ASM_MCEXTABLE(2b,30b)
> + _ASM_MCEXTABLE(3b,30b)
> + _ASM_MCEXTABLE(4b,30b)
> + _ASM_MCEXTABLE(9b,30b)
> + _ASM_MCEXTABLE(10b,30b)
> + _ASM_MCEXTABLE(11b,30b)
> + _ASM_MCEXTABLE(12b,30b)
> + _ASM_MCEXTABLE(18b,30b)
> + _ASM_MCEXTABLE(21b,30b)
> +ENDPROC(mcsafe_memcpy)

I still don't get the BIT(63) thing. Can you explain it?

--Andy

2015-12-11 21:01:52

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

>> + regs->ip = new_ip;
>> + regs->ax = BIT(63) | addr;
>
> Can this be an actual #define?

Doh! Yes, of course. That would be much better.

Now I need to think of a good name for it.

-Tony

2015-12-11 21:19:20

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

> I still don't get the BIT(63) thing. Can you explain it?

It will be more obvious when I get around to writing copy_from_user().

Then we will have a function that can take page faults if there are pages
that are not present. If the page faults can't be fixed we have a -EFAULT
condition. We can also take a machine check if we read from a location with an
uncorrected error.

We need to distinguish these two cases because the action we take is
different. For the unresolved page fault we already have the ABI that the
copy_to/from_user() functions return zero for success, and a non-zero
return is the number of not-copied bytes.

So for my new case I'm setting bit 63 ... this is never going to be set for
a failed page fault.

copy_from_user() conceptually will look like this:

int copy_from_user(void *to, void *from, unsigned long n)
{
	u64 ret = mcsafe_memcpy(to, from, n);

	if (COPY_HAD_MCHECK(ret)) {
		if (memory_failure(COPY_MCHECK_PADDR(ret) >> PAGE_SHIFT, ...))
			force_sig(SIGBUS, current);
		return something;
	} else
		return ret;
}

-Tony

2015-12-11 21:51:13

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Fri, Dec 11, 2015 at 1:19 PM, Luck, Tony <[email protected]> wrote:
>> I still don't get the BIT(63) thing. Can you explain it?
>
> It will be more obvious when I get around to writing copy_from_user().
>
> Then we will have a function that can take page faults if there are pages
> that are not present. If the page faults can't be fixed we have a -EFAULT
> condition. We can also take machine checks if we reads from a location with an
> uncorrected error.
>
> We need to distinguish these two cases because the action we take is
> different. For the unresolved page fault we already have the ABI that the
> copy_to/from_user() functions return zero for success, and a non-zero
> return is the number of not-copied bytes.

I'm missing something, though. The normal fixup_exception path
doesn't touch rax at all. The memory_failure path does. But couldn't
you distinguish them by just pointing the exception handlers at
different landing pads?

Also, would it be more straightforward if the mcexception landing pad
looked up the va -> pa mapping by itself? Or is that somehow not
reliable?

--Andy

2015-12-11 22:17:15

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

> I'm missing something, though. The normal fixup_exception path
> doesn't touch rax at all. The memory_failure path does. But couldn't
> you distinguish them by just pointing the exception handlers at
> different landing pads?

Perhaps I'm just trying to take a short cut to avoid writing
some clever fixup code for the target ip that goes into the
exception table.

For __copy_user_nocache() we have four possible targets
for fixup depending on where we were in the function.

.section .fixup,"ax"
30: shll $6,%ecx
addl %ecx,%edx
jmp 60f
40: lea (%rdx,%rcx,8),%rdx
jmp 60f
50: movl %ecx,%edx
60: sfence
jmp copy_user_handle_tail
.previous

Note that this code also takes a shortcut
by jumping to copy_user_handle_tail() to
finish up the copy a byte at a time ... and
running back into the same page fault a 2nd
time to make sure the byte count is exactly
right.

I really, really, don't want to run back into
the poison again. It would probably work, but
because current generation Intel cpus broadcast machine
checks to every logical cpu, it is a lot of overhead,
and potentially risky.

> Also, would it be more straightforward if the mcexception landing pad
> looked up the va -> pa mapping by itself? Or is that somehow not
> reliable?

If we did get all the above right, then we could have
target use virt_to_phys() to convert to physical ...
I don't see that this part would be a problem.

-Tony





2015-12-11 22:20:19

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony <[email protected]> wrote:
>> Also, would it be more straightforward if the mcexception landing pad
>> looked up the va -> pa mapping by itself? Or is that somehow not
>> reliable?
>
> If we did get all the above right, then we could have
> target use virt_to_phys() to convert to physical ...
> I don't see that this part would be a problem.

virt_to_phys() implies a linear address. In the case of the use in
the pmem driver we'll be using an ioremap()'d address off somewhere in
vmalloc space.

2015-12-11 22:27:15

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Fri, Dec 11, 2015 at 2:20 PM, Dan Williams <[email protected]> wrote:
> On Fri, Dec 11, 2015 at 2:17 PM, Luck, Tony <[email protected]> wrote:
>>> Also, would it be more straightforward if the mcexception landing pad
>>> looked up the va -> pa mapping by itself? Or is that somehow not
>>> reliable?
>>
>> If we did get all the above right, then we could have
>> target use virt_to_phys() to convert to physical ...
>> I don't see that this part would be a problem.
>
> virt_to_phys() implies a linear address. In the case of the use in
> the pmem driver we'll be using an ioremap()'d address off somewherein
> vmalloc space.

There's always slow_virt_to_phys.
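
Rough sketch of what I mean (my illustration, not code from the series;
it assumes the landing pad still knows the faulting kernel virtual
address somehow, e.g. because the copy loop keeps it in a known
register):

	static void handle_kernel_poison(void *bad_va)
	{
		/* walks the page tables, so vmalloc/ioremap addresses work too */
		phys_addr_t paddr = slow_virt_to_phys(bad_va);

		memory_failure(paddr >> PAGE_SHIFT, MCE_VECTOR, MF_ACTION_REQUIRED);
	}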

Note that I don't fundamentally object to passing the pa to the fixup
handler. I just think we should try to disentangle that from figuring
out what exactly the failure was.

Also, are there really PCOMMIT-capable CPUs that still forcibly
broadcast MCE? If, so, that's unfortunate.

--Andy

2015-12-11 22:35:20

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

> Also, are there really PCOMMIT-capable CPUs that still forcibly
> broadcast MCE? If, so, that's unfortunate.

PCOMMIT and LMCE arrive together ... though BIOS is in the decision
path to enable LMCE, so it is possible that some systems could still
broadcast if the BIOS writer decides to not allow local.

But a machine check safe copy_from_user() would be useful on
current generation cpus that broadcast all the time.

-Tony

2015-12-11 22:38:34

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Fri, Dec 11, 2015 at 2:35 PM, Luck, Tony <[email protected]> wrote:
>> Also, are there really PCOMMIT-capable CPUs that still forcibly
>> broadcast MCE? If, so, that's unfortunate.
>
> PCOMMIT and LMCE arrive together ... though BIOS is in the decision
> path to enable LMCE, so it is possible that some systems could still
> broadcast if the BIOS writer decides to not allow local.

I really wish Intel would stop doing that.

>
> But a machine check safe copy_from_user() would be useful on
> current generation cpus that broadcast all the time.

Fair enough.

--Andy

2015-12-11 22:45:36

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

>> But a machine check safe copy_from_user() would be useful on
>> current generation cpus that broadcast all the time.
>
> Fair enough.

Thanks for spending the time to look at this. Coaxing me to re-write the
tail of do_machine_check() has made that code much better. Too many
years of one patch on top of another without looking at the whole context.

Cogitate on this series over the weekend and see if you can give me
an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)).

-Tony

2015-12-11 22:56:08

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Fri, Dec 11, 2015 at 2:45 PM, Luck, Tony <[email protected]> wrote:
>>> But a machine check safe copy_from_user() would be useful on
>>> current generation cpus that broadcast all the time.
>>
>> Fair enough.
>
> Thanks for spending the time to look at this. Coaxing me to re-write the
> tail of do_machine_check() has made that code much better. Too many
> years of one patch on top of another without looking at the whole context.
>
> Cogitate on this series over the weekend and see if you can give me
> an Acked-by or Reviewed-by (I'll be adding a #define for BIT(63)).

I can't review the MCE decoding part, because I don't understand it
nearly well enough. The interaction with the core fault handling
looks fine, modulo any need to bikeshed on the macro naming (which
I'll refrain from doing).

I still think it would be better if you get rid of BIT(63) and use a
pair of landing pads, though. They could be as simple as:

.Lpage_fault_goes_here:
xorq %rax, %rax
jmp .Lbad

.Lmce_goes_here:
/* set high bit of rax or whatever */
/* fall through */

.Lbad:
/* deal with it */

That way the magic is isolated to the function that needs the magic.

Also, at least renaming the macro to EXTABLE_MC_PA_IN_AX might be
nice. It'll keep future users honest. Maybe some day there'll be a
PA_IN_AX flag, and, heck, maybe some day there'll be ways to get info
for non-MCE faults delivered through fixup_exception.

--Andy

2015-12-12 10:11:49

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Thu, Dec 10, 2015 at 01:58:04PM -0800, Tony Luck wrote:
> Copy the existing page fault fixup mechanisms to create a new table
> to be used when fixing machine checks. Note:
> 1) At this time we only provide a macro to annotate assembly code
> 2) We assume all fixups will be in code built into the kernel.
> 3) Only for x86_64
> 4) New code under CONFIG_MCE_KERNEL_RECOVERY
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/Kconfig | 4 ++++
> arch/x86/include/asm/asm.h | 10 ++++++++--
> arch/x86/include/asm/uaccess.h | 8 ++++++++
> arch/x86/mm/extable.c | 19 +++++++++++++++++++
> include/asm-generic/vmlinux.lds.h | 6 ++++++
> include/linux/module.h | 1 +
> kernel/extable.c | 20 ++++++++++++++++++++
> 7 files changed, 66 insertions(+), 2 deletions(-)
>
> diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
> index 96d058a87100..db5c6e1d6e37 100644
> --- a/arch/x86/Kconfig
> +++ b/arch/x86/Kconfig
> @@ -1001,6 +1001,10 @@ config X86_MCE_INJECT
> If you don't know what a machine check is and you don't do kernel
> QA it is safe to say n.
>
> +config MCE_KERNEL_RECOVERY
> + depends on X86_MCE && X86_64
> + def_bool y

Shouldn't that depend on NVDIMM or whatnot? Looks too generic now.

> +
> config X86_THERMAL_VECTOR
> def_bool y
> depends on X86_MCE_INTEL
> diff --git a/arch/x86/include/asm/asm.h b/arch/x86/include/asm/asm.h
> index 189679aba703..a5d483ac11fa 100644
> --- a/arch/x86/include/asm/asm.h
> +++ b/arch/x86/include/asm/asm.h
> @@ -44,13 +44,19 @@
>
> /* Exception table entry */
> #ifdef __ASSEMBLY__
> -# define _ASM_EXTABLE(from,to) \
> - .pushsection "__ex_table","a" ; \
> +# define __ASM_EXTABLE(from, to, table) \
> + .pushsection table, "a" ; \
> .balign 8 ; \
> .long (from) - . ; \
> .long (to) - . ; \
> .popsection
>
> +# define _ASM_EXTABLE(from, to) \
> + __ASM_EXTABLE(from, to, "__ex_table")
> +
> +# define _ASM_MCEXTABLE(from, to) \
> + __ASM_EXTABLE(from, to, "__mcex_table")
> +
> # define _ASM_EXTABLE_EX(from,to) \
> .pushsection "__ex_table","a" ; \
> .balign 8 ; \
> diff --git a/arch/x86/include/asm/uaccess.h b/arch/x86/include/asm/uaccess.h
> index a8df874f3e88..7b02ca1991b4 100644
> --- a/arch/x86/include/asm/uaccess.h
> +++ b/arch/x86/include/asm/uaccess.h
> @@ -111,6 +111,14 @@ struct exception_table_entry {
> #define ARCH_HAS_SEARCH_EXTABLE
>
> extern int fixup_exception(struct pt_regs *regs);
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +extern int fixup_mcexception(struct pt_regs *regs, u64 addr);
> +#else
> +static inline int fixup_mcexception(struct pt_regs *regs, u64 addr)
> +{
> + return 0;
> +}
> +#endif
> extern int early_fixup_exception(unsigned long *ip);

No need for "extern"

>
> /*
> diff --git a/arch/x86/mm/extable.c b/arch/x86/mm/extable.c
> index 903ec1e9c326..a461c4212758 100644
> --- a/arch/x86/mm/extable.c
> +++ b/arch/x86/mm/extable.c
> @@ -49,6 +49,25 @@ int fixup_exception(struct pt_regs *regs)
> return 0;
> }
>
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +int fixup_mcexception(struct pt_regs *regs, u64 addr)
> +{

If you move the #ifdef here, you can save yourself the ifdeffery in the
header above.

> + const struct exception_table_entry *fixup;
> + unsigned long new_ip;
> +
> + fixup = search_mcexception_tables(regs->ip);
> + if (fixup) {
> + new_ip = ex_fixup_addr(fixup);
> +
> + regs->ip = new_ip;
> + regs->ax = BIT(63) | addr;
> + return 1;
> + }
> +
> + return 0;
> +}
> +#endif
> +
> /* Restricted version used during very early boot */
> int __init early_fixup_exception(unsigned long *ip)
> {
> diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> index 1781e54ea6d3..21bb20d1172a 100644
> --- a/include/asm-generic/vmlinux.lds.h
> +++ b/include/asm-generic/vmlinux.lds.h
> @@ -473,6 +473,12 @@
> VMLINUX_SYMBOL(__start___ex_table) = .; \
> *(__ex_table) \
> VMLINUX_SYMBOL(__stop___ex_table) = .; \
> + } \
> + . = ALIGN(align); \
> + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
> + VMLINUX_SYMBOL(__start___mcex_table) = .; \
> + *(__mcex_table) \
> + VMLINUX_SYMBOL(__stop___mcex_table) = .; \

Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY.

> }
>
> /*
> diff --git a/include/linux/module.h b/include/linux/module.h
> index 3a19c79918e0..ffecbfcc462c 100644
> --- a/include/linux/module.h
> +++ b/include/linux/module.h
> @@ -270,6 +270,7 @@ extern const typeof(name) __mod_##type##__##name##_device_table \
>
> /* Given an address, look for it in the exception tables */
> const struct exception_table_entry *search_exception_tables(unsigned long add);
> +const struct exception_table_entry *search_mcexception_tables(unsigned long a);
>
> struct notifier_block;
>
> diff --git a/kernel/extable.c b/kernel/extable.c
> index e820ccee9846..7b224fbcb708 100644
> --- a/kernel/extable.c
> +++ b/kernel/extable.c
> @@ -34,6 +34,10 @@ DEFINE_MUTEX(text_mutex);
>
> extern struct exception_table_entry __start___ex_table[];
> extern struct exception_table_entry __stop___ex_table[];
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +extern struct exception_table_entry __start___mcex_table[];
> +extern struct exception_table_entry __stop___mcex_table[];
> +#endif
>
> /* Cleared by build time tools if the table is already sorted. */
> u32 __initdata __visible main_extable_sort_needed = 1;
> @@ -45,6 +49,10 @@ void __init sort_main_extable(void)
> pr_notice("Sorting __ex_table...\n");
> sort_extable(__start___ex_table, __stop___ex_table);
> }
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> + if (__stop___mcex_table > __start___mcex_table)
> + sort_extable(__start___mcex_table, __stop___mcex_table);
> +#endif
> }
>
> /* Given an address, look for it in the exception tables. */
> @@ -58,6 +66,18 @@ const struct exception_table_entry *search_exception_tables(unsigned long addr)
> return e;
> }
>
> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> +/* Given an address, look for it in the machine check exception tables. */
> +const struct exception_table_entry *search_mcexception_tables(
> + unsigned long addr)
> +{
> + const struct exception_table_entry *e;
> +
> + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr);
> + return e;
> +}
> +#endif

You can make this one a bit more readable by doing:

/* Given an address, look for it in the machine check exception tables. */
const struct exception_table_entry *
search_mcexception_tables(unsigned long addr)
{
#ifdef CONFIG_MCE_KERNEL_RECOVERY
return search_extable(__start___mcex_table,
__stop___mcex_table - 1, addr);
#endif
}

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-14 08:36:31

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks


* Andy Lutomirski <[email protected]> wrote:

> I still think it would be better if you get rid of BIT(63) and use a
> pair of landing pads, though. They could be as simple as:
>
> .Lpage_fault_goes_here:
> xorq %rax, %rax
> jmp .Lbad
>
> .Lmce_goes_here:
> /* set high bit of rax or whatever */
> /* fall through */
>
> .Lbad:
> /* deal with it */
>
> That way the magic is isolated to the function that needs the magic.

Seconded - this is the usual pattern we use in all assembly functions.

Thanks,

Ingo

2015-12-14 17:58:47

by Ross Zwisler

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Sat, Dec 12, 2015 at 3:11 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Dec 10, 2015 at 01:58:04PM -0800, Tony Luck wrote:
<>
>> +#ifdef CONFIG_MCE_KERNEL_RECOVERY
>> +/* Given an address, look for it in the machine check exception tables. */
>> +const struct exception_table_entry *search_mcexception_tables(
>> + unsigned long addr)
>> +{
>> + const struct exception_table_entry *e;
>> +
>> + e = search_extable(__start___mcex_table, __stop___mcex_table-1, addr);
>> + return e;
>> +}
>> +#endif
>
> You can make this one a bit more readable by doing:
>
> /* Given an address, look for it in the machine check exception tables. */
> const struct exception_table_entry *
> search_mcexception_tables(unsigned long addr)
> {
> #ifdef CONFIG_MCE_KERNEL_RECOVERY
> return search_extable(__start___mcex_table,
> __stop___mcex_table - 1, addr);
> #endif
> }

With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get
a compiler error that the function doesn't have a return statement,
right? I think we need an #else to return NULL, or to have the #ifdef
encompass the whole function definition as it was in Tony's version.
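
Something like this, just to make the first option concrete (an
untested sketch keeping Boris's layout, not the posted code):

const struct exception_table_entry *
search_mcexception_tables(unsigned long addr)
{
#ifdef CONFIG_MCE_KERNEL_RECOVERY
	return search_extable(__start___mcex_table,
			      __stop___mcex_table - 1, addr);
#else
	return NULL;
#endif
}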

2015-12-14 19:46:50

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote:
> > /* deal with it */
> >
> > That way the magic is isolated to the function that needs the magic.
>
> Seconded - this is the usual pattern we use in all assembly functions.

Ok - you want me to write some x86 assembly code (you may regret that).

Initial question ... here's the fixup for __copy_user_nocache()

.section .fixup,"ax"
30: shll $6,%ecx
addl %ecx,%edx
jmp 60f
40: lea (%rdx,%rcx,8),%rdx
jmp 60f
50: movl %ecx,%edx
60: sfence
jmp copy_user_handle_tail
.previous

Are %ecx and %rcx synonyms for the same register? Is there some
super subtle reason we use the 'r' names in the "40" fixup, but
the 'e' names everywhere else in this code (and the 'e' names in
the body of the original function)?

-Tony

2015-12-14 20:12:22

by Andy Lutomirski

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Mon, Dec 14, 2015 at 11:46 AM, Luck, Tony <[email protected]> wrote:
> On Mon, Dec 14, 2015 at 09:36:25AM +0100, Ingo Molnar wrote:
>> > /* deal with it */
>> >
>> > That way the magic is isolated to the function that needs the magic.
>>
>> Seconded - this is the usual pattern we use in all assembly functions.
>
> Ok - you want me to write some x86 assembly code (you may regret that).
>

All you have to do is erase all of the ia64 asm knowledge from your
brain and repurpose 1% of that space for x86 asm. You'll be a
world-class expert!

> Initial question ... here's the fixup for __copy_user_nocache()
>
> .section .fixup,"ax"
> 30: shll $6,%ecx
> addl %ecx,%edx
> jmp 60f
> 40: lea (%rdx,%rcx,8),%rdx
> jmp 60f
> 50: movl %ecx,%edx
> 60: sfence
> jmp copy_user_handle_tail
> .previous
>
> Are %ecx and %rcx synonyms for the same register? Is there some
> super subtle reason we use the 'r' names in the "40" fixup, but
> the 'e' names everywhere else in this code (and the 'e' names in
> the body of the original function)?

rcx is a 64-bit register. ecx is the low 32 bits of it. If you read
from ecx, you get the low 32 bits, but if you write to ecx, you zero
the high bits as a side-effect.
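
A throwaway illustration (made-up values, nothing from the patch):

	movq	$-1,%rcx	/* %rcx = 0xffffffffffffffff                  */
	movl	%ecx,%eax	/* read of %ecx, 32-bit write of %eax:        */
				/*   %rax = 0x00000000ffffffff                */
	addl	$1,%ecx		/* any write to %ecx clears the top half too: */
				/*   %rcx = 0x0000000000000000                */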

--Andy

2015-12-14 22:28:07

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Mon, Dec 14, 2015 at 10:58:45AM -0700, Ross Zwisler wrote:
> With this code if CONFIG_MCE_KERNEL_RECOVERY isn't defined you'll get
> a compiler error that the function doesn't have a return statement,
> right? I think we need an #else to return NULL, or to have the #ifdef
> encompass the whole function definition as it was in Tony's version.

Right, correct.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 01:01:28

by Luck, Tony

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Sat, Dec 12, 2015 at 11:11:42AM +0100, Borislav Petkov wrote:
> > +config MCE_KERNEL_RECOVERY
> > + depends on X86_MCE && X86_64
> > + def_bool y
>
> Shouldn't that depend on NVDIMM or whatnot? Looks too generic now.

Not sure what the "whatnot" would be though. Making it depend on
X86_MCE should keep it out of the tiny configurations. By the time
you have MCE support, this seems like a pretty small incremental
change.

> > +#ifdef CONFIG_MCE_KERNEL_RECOVERY
> > +int fixup_mcexception(struct pt_regs *regs, u64 addr)
> > +{
>
> If you move the #ifdef here, you can save yourself the ifdeffery in the
> header above.

I realized I didn't need the inline stub function in the header.

> > diff --git a/include/asm-generic/vmlinux.lds.h b/include/asm-generic/vmlinux.lds.h
> > index 1781e54ea6d3..21bb20d1172a 100644
> > --- a/include/asm-generic/vmlinux.lds.h
> > +++ b/include/asm-generic/vmlinux.lds.h
> > @@ -473,6 +473,12 @@
> > VMLINUX_SYMBOL(__start___ex_table) = .; \
> > *(__ex_table) \
> > VMLINUX_SYMBOL(__stop___ex_table) = .; \
> > + } \
> > + . = ALIGN(align); \
> > + __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
> > + VMLINUX_SYMBOL(__start___mcex_table) = .; \
> > + *(__mcex_table) \
> > + VMLINUX_SYMBOL(__stop___mcex_table) = .; \
>
> Of all the places, this one is missing #ifdef CONFIG_MCE_KERNEL_RECOVERY.

Is there some cpp magic to use an #ifdef inside a multi-line macro like this?
Impact of not having the #ifdef is two extra symbols (the start/stop ones)
in the symbol table of the final binary. If that's unacceptable I can fall
back to an earlier unpublished version that had separate EXCEPTION_TABLE and
MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file.

> You can make this one a bit more readable by doing:
>
> /* Given an address, look for it in the machine check exception tables. */
> const struct exception_table_entry *
> search_mcexception_tables(unsigned long addr)
> {
> #ifdef CONFIG_MCE_KERNEL_RECOVERY
> return search_extable(__start___mcex_table,
> __stop___mcex_table - 1, addr);
> #endif
> }

I got rid of the local variable and the return ... but left the
#ifdef/#endif around the whole function.

-Tony

2015-12-15 09:47:04

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Mon, Dec 14, 2015 at 05:00:59PM -0800, Luck, Tony wrote:
> Not sure what the "whatnot" would be though. Making it depend on
> X86_MCE should keep it out of the tiny configurations. By the time
> you have MCE support, this seems like a pretty small incremental
> change.

Ok, so it is called CONFIG_LIBNVDIMM. Do you see a use case for this
stuff except on machines with NVDIMM hw? CONFIG_LIBNVDIMM can select it
but on !NVDIMM systems you don't really need it enabled.

> Is there some cpp magic to use an #ifdef inside a multi-line macro like this?
> Impact of not having the #ifdef is two extra symbols (the start/stop ones)
> in the symbol table of the final binary. If that's unacceptable I can fall
> back to an earlier unpublished version that had separate EXCEPTION_TABLE and
> MCEXCEPTION_TABLE macros with both invoked in the x86 vmlinux.lds.S file.

I think what is more important is that this should be in the
x86-specific linker script, not in the generic one. And yes, we should
strive to be clean and not pollute the kernel image with symbols which
are unused, i.e. when CONFIG_MCE_KERNEL_RECOVERY is not enabled.

This below seems to build ok here, ontop of yours. It could be a
MCEXCEPTION_TABLE macro, as you say:

Index: b/include/asm-generic/vmlinux.lds.h
===================================================================
--- a/include/asm-generic/vmlinux.lds.h 2015-12-15 10:17:25.568046033 +0100
+++ b/include/asm-generic/vmlinux.lds.h 2015-12-15 10:07:06.064034490 +0100
@@ -484,12 +484,6 @@
*(__ex_table) \
VMLINUX_SYMBOL(__stop___ex_table) = .; \
} \
- . = ALIGN(align); \
- __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) { \
- VMLINUX_SYMBOL(__start___mcex_table) = .; \
- *(__mcex_table) \
- VMLINUX_SYMBOL(__stop___mcex_table) = .; \
- }

/*
* Init task
Index: b/arch/x86/kernel/vmlinux.lds.S
===================================================================
--- a/arch/x86/kernel/vmlinux.lds.S 2015-12-14 11:38:58.188150070 +0100
+++ b/arch/x86/kernel/vmlinux.lds.S 2015-12-15 10:09:04.624036699 +0100
@@ -110,7 +110,17 @@ SECTIONS

NOTES :text :note

- EXCEPTION_TABLE(16) :text = 0x9090
+ EXCEPTION_TABLE(16)
+
+#ifdef CONFIG_MCE_KERNEL_RECOVERY
+ . = ALIGN(16);
+ __mcex_table : AT(ADDR(__mcex_table) - LOAD_OFFSET) {
+ VMLINUX_SYMBOL(__start___mcex_table) = .;
+ *(__mcex_table)
+ VMLINUX_SYMBOL(__stop___mcex_table) = .;
+ }
+#endif
+ :text = 0x9090

#if defined(CONFIG_DEBUG_RODATA)
/* .text should occupy whole number of pages */

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 10:44:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 1/3] x86, ras: Add new infrastructure for machine check fixup tables

On Tue, Dec 15, 2015 at 10:46:53AM +0100, Borislav Petkov wrote:
> I think what is more important is that this should be in the
> x86-specific linker script, not in the generic one.

And related to that, I think all those additions to kernel/extable.c
should be somewhere in arch/x86/ and not in generic code.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 11:43:25

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas

On Thu, Dec 10, 2015 at 04:14:44PM -0800, Tony Luck wrote:
> Extend the severity checking code to add a new context IN_KERNEL_RECOV
> which is used to indicate that the machine check was triggered by code
> in the kernel with a fixup entry.
>
> Add code to check for this situation and respond by altering the return
> IP to the fixup address and changing the regs->ax so that the recovery
> code knows the physical address of the error. Note that we also set bit
> 63 because 0x0 is a legal physical address.
>
> Major re-work to the tail code in do_machine_check() to make all this
> readable/maintainable. One functional change is that tolerant=3 no longer
> stops recovery actions. Revert to only skipping sending SIGBUS to the
> current process.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/kernel/cpu/mcheck/mce-severity.c | 22 +++++++++-
> arch/x86/kernel/cpu/mcheck/mce.c | 69 ++++++++++++++++---------------
> 2 files changed, 55 insertions(+), 36 deletions(-)
>
> diff --git a/arch/x86/kernel/cpu/mcheck/mce-severity.c b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> index 9c682c222071..ac7fbb0689fb 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce-severity.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce-severity.c
> @@ -12,6 +12,7 @@
> #include <linux/kernel.h>
> #include <linux/seq_file.h>
> #include <linux/init.h>
> +#include <linux/module.h>
> #include <linux/debugfs.h>
> #include <asm/mce.h>
>
> @@ -29,7 +30,7 @@
> * panic situations)
> */
>
> -enum context { IN_KERNEL = 1, IN_USER = 2 };
> +enum context { IN_KERNEL = 1, IN_USER = 2, IN_KERNEL_RECOV = 3 };
> enum ser { SER_REQUIRED = 1, NO_SER = 2 };
> enum exception { EXCP_CONTEXT = 1, NO_EXCP = 2 };
>
> @@ -48,6 +49,7 @@ static struct severity {
> #define MCESEV(s, m, c...) { .sev = MCE_ ## s ## _SEVERITY, .msg = m, ## c }
> #define KERNEL .context = IN_KERNEL
> #define USER .context = IN_USER
> +#define KERNEL_RECOV .context = IN_KERNEL_RECOV
> #define SER .ser = SER_REQUIRED
> #define NOSER .ser = NO_SER
> #define EXCP .excp = EXCP_CONTEXT
> @@ -87,6 +89,10 @@ static struct severity {
> EXCP, KERNEL, MCGMASK(MCG_STATUS_RIPV, 0)
> ),
> MCESEV(
> + PANIC, "In kernel and no restart IP",
> + EXCP, KERNEL_RECOV, MCGMASK(MCG_STATUS_RIPV, 0)
> + ),
> + MCESEV(
> DEFERRED, "Deferred error",
> NOSER, MASK(MCI_STATUS_UC|MCI_STATUS_DEFERRED|MCI_STATUS_POISON, MCI_STATUS_DEFERRED)
> ),
> @@ -123,6 +129,11 @@ static struct severity {
> MCGMASK(MCG_STATUS_RIPV|MCG_STATUS_EIPV, MCG_STATUS_RIPV)
> ),
> MCESEV(
> + AR, "Action required: data load error recoverable area of kernel",

... in ...

> + SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> + KERNEL_RECOV
> + ),
> + MCESEV(
> AR, "Action required: data load error in a user process",
> SER, MASK(MCI_STATUS_OVER|MCI_UC_SAR|MCI_ADDR|MCACOD, MCI_UC_SAR|MCI_ADDR|MCACOD_DATA),
> USER
> @@ -170,6 +181,9 @@ static struct severity {
> ) /* always matches. keep at end */
> };
>
> +#define mc_recoverable(mcg) (((mcg) & (MCG_STATUS_RIPV|MCG_STATUS_EIPV)) == \
> + (MCG_STATUS_RIPV|MCG_STATUS_EIPV))
> +
> /*
> * If mcgstatus indicated that ip/cs on the stack were
> * no good, then "m->cs" will be zero and we will have
> @@ -183,7 +197,11 @@ static struct severity {
> */
> static int error_context(struct mce *m)
> {
> - return ((m->cs & 3) == 3) ? IN_USER : IN_KERNEL;
> + if ((m->cs & 3) == 3)
> + return IN_USER;
> + if (mc_recoverable(m->mcgstatus) && search_mcexception_tables(m->ip))
> + return IN_KERNEL_RECOV;
> + return IN_KERNEL;
> }
>
> /*
> diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
> index 9d014b82a124..f2f568ad6409 100644
> --- a/arch/x86/kernel/cpu/mcheck/mce.c
> +++ b/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -31,6 +31,7 @@
> #include <linux/types.h>
> #include <linux/slab.h>
> #include <linux/init.h>
> +#include <linux/module.h>
> #include <linux/kmod.h>
> #include <linux/poll.h>
> #include <linux/nmi.h>
> @@ -958,6 +959,20 @@ static void mce_clear_state(unsigned long *toclear)
> }
> }
>
> +static int do_memory_failure(struct mce *m)
> +{
> + int flags = MF_ACTION_REQUIRED;
> + int ret;
> +
> + pr_err("Uncorrected hardware memory error in user-access at %llx", m->addr);
> + if (!(m->mcgstatus & MCG_STATUS_RIPV))
> + flags |= MF_MUST_KILL;
> + ret = memory_failure(m->addr >> PAGE_SHIFT, MCE_VECTOR, flags);
> + if (ret)
> + pr_err("Memory error not recovered");
> + return ret;
> +}
> +
> /*
> * The actual machine check handler. This only handles real
> * exceptions when something got corrupted coming in through int 18.
> @@ -995,8 +1010,6 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> DECLARE_BITMAP(toclear, MAX_NR_BANKS);
> DECLARE_BITMAP(valid_banks, MAX_NR_BANKS);
> char *msg = "Unknown";
> - u64 recover_paddr = ~0ull;
> - int flags = MF_ACTION_REQUIRED;
> int lmce = 0;
>
> ist_enter(regs);
> @@ -1123,22 +1136,13 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> }
>
> /*
> - * At insane "tolerant" levels we take no action. Otherwise
> - * we only die if we have no other choice. For less serious
> - * issues we try to recover, or limit damage to the current
> - * process.
> + * If tolerant is at an insane level we drop requests to kill
> + * processes and continue even when there is no way out
                                                         ^
                                                         |
                                                         . Fullstop here.

> */
> - if (cfg->tolerant < 3) {
> - if (no_way_out)
> - mce_panic("Fatal machine check on current CPU", &m, msg);
> - if (worst == MCE_AR_SEVERITY) {
> - recover_paddr = m.addr;
> - if (!(m.mcgstatus & MCG_STATUS_RIPV))
> - flags |= MF_MUST_KILL;
> - } else if (kill_it) {
> - force_sig(SIGBUS, current);
> - }
> - }
> + if (cfg->tolerant == 3)

Btw, I don't see where we limit the input values for that tolerant
setting, i.e., a user could easily enter something > 3.

I think we should add a check in a separate patch to not allow anything
except [0-3].
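
A minimal sketch of such a clamp in the sysfs store path for "tolerant"
(the attribute plumbing below is the generic device-attribute form, not
necessarily the exact mce.c interface):

---
static ssize_t tolerant_store(struct device *dev,
			      struct device_attribute *attr,
			      const char *buf, size_t count)
{
	unsigned long val;

	if (kstrtoul(buf, 0, &val))
		return -EINVAL;
	if (val > 3)			/* only 0-3 are meaningful */
		return -EINVAL;

	mca_cfg.tolerant = val;
	return count;
}
---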

> + kill_it = 0;
> + else if (no_way_out)
> + mce_panic("Fatal machine check on current CPU", &m, msg);
>
> if (worst > 0)
> mce_report_event(regs);
> @@ -1146,25 +1150,22 @@ void do_machine_check(struct pt_regs *regs, long error_code)
> out:
> sync_core();
>
> - if (recover_paddr == ~0ull)
> - goto done;
> + /* Fault was in user mode and we need to take some action */
> + if ((m.cs & 3) == 3 && (worst == MCE_AR_SEVERITY || kill_it)) {
> + ist_begin_non_atomic(regs);
> + local_irq_enable();
>
> - pr_err("Uncorrected hardware memory error in user-access at %llx",
> - recover_paddr);
> - /*
> - * We must call memory_failure() here even if the current process is
> - * doomed. We still need to mark the page as poisoned and alert any
> - * other users of the page.
> - */
> - ist_begin_non_atomic(regs);
> - local_irq_enable();
> - if (memory_failure(recover_paddr >> PAGE_SHIFT, MCE_VECTOR, flags) < 0) {
> - pr_err("Memory error not recovered");
> - force_sig(SIGBUS, current);
> + if (kill_it || do_memory_failure(&m))
> + force_sig(SIGBUS, current);
> + local_irq_disable();
> + ist_end_non_atomic();
> }
> - local_irq_disable();
> - ist_end_non_atomic();
> -done:
> +
> + /* Fault was in recoverable area of the kernel */
> + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
> + if (!fixup_mcexception(regs, m.addr))
> + mce_panic("Failed kernel mode recovery", &m, NULL);
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Does that always imply a failed kernel mode recovery? I don't see

(m.cs == 0 and MCE_AR_SEVERITY)

MCEs always meaning that a recovery should be attempted there. I think
this should simply say

mce_panic("Fatal machine check on current CPU", &m, msg);

Also, how about taking out that worst and kill_it check? It is a bit
more readable this way IMO:

---
out:
	sync_core();

	if (worst < MCE_AR_SEVERITY && !kill_it)
		goto out_ist;

	/* Fault was in user mode and we need to take some action */
	if ((m.cs & 3) == 3) {
		ist_begin_non_atomic(regs);
		local_irq_enable();

		if (kill_it || do_memory_failure(&m))
			force_sig(SIGBUS, current);

		local_irq_disable();
		ist_end_non_atomic();
	} else {
		if (!fixup_mcexception(regs, m.addr))
			mce_panic("Fatal machine check on current CPU", &m, NULL);
	}

out_ist:
	ist_exit(regs);
}
EXPORT_SYMBOL_GPL(do_machine_check);
---

Hmm...

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 13:11:44

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Thu, Dec 10, 2015 at 04:21:50PM -0800, Tony Luck wrote:
> Using __copy_user_nocache() as inspiration create a memory copy
> routine for use by kernel code with annotations to allow for
> recovery from machine checks.
>
> Notes:
> 1) Unlike the original we make no attempt to copy all the bytes
> up to the faulting address. The original achieves that by
> re-executing the failing part as a byte-by-byte copy,
> which will take another page fault. We don't want to have
> a second machine check!
> 2) Likewise the return value for the original indicates exactly
> how many bytes were not copied. Instead we provide the physical
> address of the fault (thanks to help from do_machine_check()).
> 3) Provide helpful macros to decode the return value.
>
> Signed-off-by: Tony Luck <[email protected]>
> ---
> arch/x86/include/asm/uaccess_64.h | 5 +++
> arch/x86/kernel/x8664_ksyms_64.c | 2 +
> arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++
> 3 files changed, 98 insertions(+)

...

> + * mcsafe_memcpy - Uncached memory copy with machine check exception handling
> + * Note that we only catch machine checks when reading the source addresses.
> + * Writes to target are posted and don't generate machine checks.
> + * This will force destination/source out of cache for more performance.

... and the non-temporal version is the optimal one even though we're
defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
CPUs...?

Btw, it should also be inside an ifdef if we're going to ifdef
CONFIG_MCE_KERNEL_RECOVERY everywhere else.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 17:45:07

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 5:11 AM, Borislav Petkov <[email protected]> wrote:
> On Thu, Dec 10, 2015 at 04:21:50PM -0800, Tony Luck wrote:
>> Using __copy_user_nocache() as inspiration create a memory copy
>> routine for use by kernel code with annotations to allow for
>> recovery from machine checks.
>>
>> Notes:
>> 1) Unlike the original we make no attempt to copy all the bytes
>> up to the faulting address. The original achieves that by
>> re-executing the failing part as a byte-by-byte copy,
>> which will take another page fault. We don't want to have
>> a second machine check!
>> 2) Likewise the return value for the original indicates exactly
>> how many bytes were not copied. Instead we provide the physical
>> address of the fault (thanks to help from do_machine_check()).
>> 3) Provide helpful macros to decode the return value.
>>
>> Signed-off-by: Tony Luck <[email protected]>
>> ---
>> arch/x86/include/asm/uaccess_64.h | 5 +++
>> arch/x86/kernel/x8664_ksyms_64.c | 2 +
>> arch/x86/lib/copy_user_64.S | 91 +++++++++++++++++++++++++++++++++++++++
>> 3 files changed, 98 insertions(+)
>
> ...
>
>> + * mcsafe_memcpy - Uncached memory copy with machine check exception handling
>> + * Note that we only catch machine checks when reading the source addresses.
>> + * Writes to target are posted and don't generate machine checks.
>> + * This will force destination/source out of cache for more performance.
>
> ... and the non-temporal version is the optimal one even though we're
> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
> CPUs...?

At least the pmem driver use case does not want caching of the
source-buffer since that is the raw "disk" media. I.e. in
pmem_do_bvec() we'd use this to implement memcpy_from_pmem().
However, caching the destination-buffer may prove beneficial since
that data is likely to be consumed immediately by the thread that
submitted the i/o.

2015-12-15 17:53:50

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

>> ... and the non-temporal version is the optimal one even though we're
>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
>> CPUs...?

My current generation cpu has a bit of an issue with recovering from a
machine check in a "rep mov" ... so I'm working with a version of memcpy
that unrolls into individual mov instructions for now.

> At least the pmem driver use case does not want caching of the
> source-buffer since that is the raw "disk" media. I.e. in
> pmem_do_bvec() we'd use this to implement memcpy_from_pmem().
> However, caching the destination-buffer may prove beneficial since
> that data is likely to be consumed immediately by the thread that
> submitted the i/o.

I can drop the "nti" from the destination moves. Does "nti" work
on the load from source address side to avoid cache allocation?

On another topic raised by Boris ... is there some CONFIG_PMEM*
that I should use as a dependency to enable all this?

-Tony

2015-12-15 18:21:10

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 05:53:31PM +0000, Luck, Tony wrote:
> My current generation cpu has a bit of an issue with recovering from a
> machine check in a "rep mov" ... so I'm working with a version of memcpy
> that unrolls into individual mov instructions for now.

Ah.

> I can drop the "nti" from the destination moves. Does "nti" work
> on the load from source address side to avoid cache allocation?

I don't think so:

+1: movq (%rsi),%r8
+2: movq 1*8(%rsi),%r9
+3: movq 2*8(%rsi),%r10
+4: movq 3*8(%rsi),%r11
...

You need to load the data into registers first because MOVNTI needs them
there as it does reg -> mem movement. That first load from memory into
registers with a normal MOV will pull the data into the cache.

Perhaps the first thing to try would be to see what slowdown normal MOVs
bring and if not really noticeable, use those instead.

> On another topic raised by Boris ... is there some CONFIG_PMEM*
> that I should use as a dependency to enable all this?

I found CONFIG_LIBNVDIMM only today:

drivers/nvdimm/Kconfig:1:menuconfig LIBNVDIMM
drivers/nvdimm/Kconfig:2: tristate "NVDIMM (Non-Volatile Memory Device) Support"

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

2015-12-15 18:27:42

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony <[email protected]> wrote:
>>> ... and the non-temporal version is the optimal one even though we're
>>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
>>> CPUs...?
>
> My current generation cpu has a bit of an issue with recovering from a
> machine check in a "rep mov" ... so I'm working with a version of memcpy
> that unrolls into individual mov instructions for now.
>
>> At least the pmem driver use case does not want caching of the
>> source-buffer since that is the raw "disk" media. I.e. in
>> pmem_do_bvec() we'd use this to implement memcpy_from_pmem().
>> However, caching the destination-buffer may prove beneficial since
>> that data is likely to be consumed immediately by the thread that
>> submitted the i/o.
>
> I can drop the "nti" from the destination moves. Does "nti" work
> on the load from source address side to avoid cache allocation?

My mistake, I don't think we have an uncached load capability, only store.

> On another topic raised by Boris ... is there some CONFIG_PMEM*
> that I should use as a dependency to enable all this?

I'd rather make this a "select ARCH_MCSAFE_MEMCPY", since it's not a
hard dependency and the details will be hidden behind
memcpy_from_pmem(). Specifically, the details will be handled by a
new arch_memcpy_from_pmem() in arch/x86/include/asm/pmem.h to
supplement the existing arch_memcpy_to_pmem().
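
Roughly, that layering might look like the sketch below (symbol names
follow the suggestion above; the bodies and the exact mcsafe_memcpy()
return-value handling are illustrative, not an actual patch):

---
/* arch/x86/include/asm/pmem.h */
static inline int arch_memcpy_from_pmem(void *dst, const void __pmem *src,
					size_t n)
{
#ifdef CONFIG_ARCH_MCSAFE_MEMCPY
	u64 ret = mcsafe_memcpy(dst, (void __force *)src, n);

	return (ret & BIT_ULL(63)) ? -EIO : 0;	/* poison consumed -> -EIO */
#else
	memcpy(dst, (void __force *)src, n);
	return 0;
#endif
}

/* include/linux/pmem.h: what callers such as pmem_do_bvec() would use */
static inline int memcpy_from_pmem(void *dst, const void __pmem *src, size_t n)
{
	return arch_memcpy_from_pmem(dst, src, n);
}
---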

2015-12-15 18:35:51

by Dan Williams

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 10:27 AM, Dan Williams <[email protected]> wrote:
> On Tue, Dec 15, 2015 at 9:53 AM, Luck, Tony <[email protected]> wrote:
>>>> ... and the non-temporal version is the optimal one even though we're
>>>> defaulting to copy_user_enhanced_fast_string for memcpy on modern Intel
>>>> CPUs...?
>>
>> My current generation cpu has a bit of an issue with recovering from a
>> machine check in a "rep mov" ... so I'm working with a version of memcpy
>> that unrolls into individual mov instructions for now.
>>
>>> At least the pmem driver use case does not want caching of the
>>> source-buffer since that is the raw "disk" media. I.e. in
>>> pmem_do_bvec() we'd use this to implement memcpy_from_pmem().
>>> However, caching the destination-buffer may prove beneficial since
>>> that data is likely to be consumed immediately by the thread that
>>> submitted the i/o.
>>
>> I can drop the "nti" from the destination moves. Does "nti" work
>> on the load from source address side to avoid cache allocation?
>
> My mistake, I don't think we have an uncached load capability, only store.

Correction we have MOVNTDQA, but that requires saving the fpu state
and marking the memory as WC, i.e. probably not worth it.

2015-12-15 18:39:33

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires saving the fpu state
> and marking the memory as WC, i.e. probably not worth it.

Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks



> -----Original Message-----
> From: Linux-nvdimm [mailto:[email protected]] On Behalf
> Of Borislav Petkov
> Sent: Tuesday, December 15, 2015 12:39 PM
> To: Dan Williams <[email protected]>
> Cc: Luck, Tony <[email protected]>; linux-nvdimm <linux-
> [email protected]>; X86 ML <[email protected]>; linux-
> [email protected]; Linux MM <[email protected]>; Andy Lutomirski
> <[email protected]>; Andrew Morton <[email protected]>; Ingo Molnar
> <[email protected]>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
>
> On Tue, Dec 15, 2015 at 10:35:49AM -0800, Dan Williams wrote:
> > Correction we have MOVNTDQA, but that requires saving the fpu state
> > and marking the memory as WC, i.e. probably not worth it.
>
> Not really. Last time I tried an SSE3 memcpy in the kernel like glibc
> does, it wasn't worth it. The enhanced REP; MOVSB is hands down faster.

Reading from NVDIMM, rep movsb is efficient, but it
fills the CPU caches with the NVDIMM addresses. For
large data moves (not uncommon for storage) this
will crowd out more important cacheable data.

For normal block device reads made through the pmem
block device driver, this CPU cache consumption is
wasteful, since it is unlikely the application will
ask pmem to read the same addresses anytime soon.
Due to the historic long latency of storage devices,
applications don't re-read from storage again; they
save the results. So, the streaming-load
instructions are beneficial:
* movntdqa (16-byte xmm registers)
* vmovntdqa (32-byte ymm registers)
* vmovntdqa (64-byte zmm registers)

Dan Williams wrote:
> Correction we have MOVNTDQA, but that requires
> saving the fpu state and marking the memory as WC
> i.e. probably not worth it.

Although the WC memory type is described in the SDM
in the most detail:
"An implementation may also make use of the
non-temporal hint associated with this instruction
if the memory source is WB (write back) memory
type. ... may optimize cache reads generated by
(V)MOVNTDQA on WB memory type to reduce cache
evictions."

For applications doing loads from mmap() DAX memory,
the CPU cache usage could be worthwhile, because
applications expect mmap() regions to consist of
traditional writeback-cached memory and might do
lots of loads/stores.

Writing to the NVDIMM requires either:
* non-temporal stores; or
* normal stores + cache flushes + fences (sketched below)

movnti is OK for small transfers, but these are
better for bulk moves:
* movntdq (16-byte xmm registers)
* vmovntdq (32-byte ymm registers)
* vmovntdq (64-byte zmm registers)
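
A minimal sketch of the second option above (normal stores plus cache
flushes and a fence), assuming the existing x86 clflush_cache_range()
helper; the wrapper name itself is illustrative:

---
/* Ordinary cacheable stores, then push the dirty lines out to the NVDIMM
 * and fence so later writes (e.g. a "data valid" flag) are ordered after. */
static void copy_to_pmem_cached(void *dst, const void *src, size_t n)
{
	memcpy(dst, src, n);
	clflush_cache_range(dst, n);
	wmb();
}
---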

---
Robert Elliott, HPE Persistent Memory

2015-12-15 19:28:47

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent Memory) wrote:

...

> Due to the historic long latency of storage devices,
> applications don't re-read from storage again; they
> save the results.
> So, the streaming-load instructions are beneficial:

That's the theory...

Do you also have some actual performance numbers where non-temporal
operations are better than the REP; MOVSB and *actually* show
improvements? And no microbenchmarks please.

Thanks.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.

Subject: RE: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks



---
Robert Elliott, HPE Persistent Memory


> -----Original Message-----
> From: Borislav Petkov [mailto:[email protected]]
> Sent: Tuesday, December 15, 2015 1:29 PM
> To: Elliott, Robert (Persistent Memory) <[email protected]>
> Cc: Dan Williams <[email protected]>; Luck, Tony
> <[email protected]>; linux-nvdimm <[email protected]>; X86 ML
> <[email protected]>; [email protected]; Linux MM <linux-
> [email protected]>; Andy Lutomirski <[email protected]>; Andrew Morton
> <[email protected]>; Ingo Molnar <[email protected]>
> Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to
> recover from machine checks
>
> On Tue, Dec 15, 2015 at 07:19:58PM +0000, Elliott, Robert (Persistent
> Memory) wrote:
>
> ...
>
> > Due to the historic long latency of storage devices,
> > applications don't re-read from storage again; they
> > save the results.
> > So, the streaming-load instructions are beneficial:
>
> That's the theory...
>
> Do you also have some actual performance numbers where non-temporal
> operations are better than the REP; MOVSB and *actually* show
> improvements? And no microbenchmarks please.
>
> Thanks.
>

This isn't exactly what you're looking for, but here is
an example of fio doing reads from pmem devices (reading
from NVDIMMs, writing to DIMMs) with various transfer
sizes.

At 256 KiB, all the main memory buffers fit in the CPU
caches, so no write traffic appears on DDR (just the reads
from the NVDIMMs). At 1 MiB, the data spills out of the
caches, and writes to the DIMMs end up on DDR.

Although DDR is busier, fio gets a lot less work done:
* 256 KiB: 90 GiB/s by fio
* 1 MiB: 49 GiB/s by fio

We could try modifying pmem to use its own non-temporal
memcpy functions (I've posted experimental patches
before that did this) to see if that transition point
shifts. We can also watch the CPU cache statistics
while running.

Here are statistics from Intel's pcm-memory.x
(pardon the wide formatting):

256 KiB
=======
pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
cpu : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997

Run status group 0 (all jobs):
READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec

|---------------------------------------||---------------------------------------|
|-- Socket 0 --||-- Socket 1 --|
|---------------------------------------||---------------------------------------|
|-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s): 11778.11 --||-- Mem Ch 0: Reads (MB/s): 11743.99 --|
|-- Writes(MB/s): 51.83 --||-- Writes(MB/s): 43.25 --|
|-- Mem Ch 1: Reads (MB/s): 11779.90 --||-- Mem Ch 1: Reads (MB/s): 11736.06 --|
|-- Writes(MB/s): 48.73 --||-- Writes(MB/s): 37.86 --|
|-- Mem Ch 4: Reads (MB/s): 11784.79 --||-- Mem Ch 4: Reads (MB/s): 11746.94 --|
|-- Writes(MB/s): 52.90 --||-- Writes(MB/s): 43.73 --|
|-- Mem Ch 5: Reads (MB/s): 11778.48 --||-- Mem Ch 5: Reads (MB/s): 11741.55 --|
|-- Writes(MB/s): 47.62 --||-- Writes(MB/s): 37.80 --|
|-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
|-- NODE 0 Mem Write(MB/s) : 201.08 --||-- NODE 1 Mem Write(MB/s) : 162.65 --|
|-- NODE 0 P. Write (T/s): 190927 --||-- NODE 1 P. Write (T/s): 182961 --|
|-- NODE 0 Memory (MB/s): 47322.36 --||-- NODE 1 Memory (MB/s): 47131.17 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|-- System Read Throughput(MB/s): 94089.80 --|
|-- System Write Throughput(MB/s): 363.73 --|
|-- System Memory Throughput(MB/s): 94453.52 --|
|---------------------------------------||---------------------------------------|

1 MiB
=====
|---------------------------------------||---------------------------------------|
|-- Socket 0 --||-- Socket 1 --|
|---------------------------------------||---------------------------------------|
|-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
|---------------------------------------||---------------------------------------|
|-- Mem Ch 0: Reads (MB/s): 7227.83 --||-- Mem Ch 0: Reads (MB/s): 7047.45 --|
|-- Writes(MB/s): 5894.47 --||-- Writes(MB/s): 6010.66 --|
|-- Mem Ch 1: Reads (MB/s): 7229.32 --||-- Mem Ch 1: Reads (MB/s): 7041.79 --|
|-- Writes(MB/s): 5891.38 --||-- Writes(MB/s): 6003.19 --|
|-- Mem Ch 4: Reads (MB/s): 7230.70 --||-- Mem Ch 4: Reads (MB/s): 7052.44 --|
|-- Writes(MB/s): 5888.63 --||-- Writes(MB/s): 6012.49 --|
|-- Mem Ch 5: Reads (MB/s): 7229.16 --||-- Mem Ch 5: Reads (MB/s): 7047.19 --|
|-- Writes(MB/s): 5882.45 --||-- Writes(MB/s): 6008.11 --|
|-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --|
|-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --|
|-- NODE 0 P. Write (T/s): 238713 --||-- NODE 1 P. Write (T/s): 228040 --|
|-- NODE 0 Memory (MB/s): 52473.94 --||-- NODE 1 Memory (MB/s): 52223.33 --|
|---------------------------------------||---------------------------------------|
|---------------------------------------||---------------------------------------|
|-- System Read Throughput(MB/s): 57105.87 --|
|-- System Write Throughput(MB/s): 47591.39 --|
|-- System Memory Throughput(MB/s): 104697.27 --|
|---------------------------------------||---------------------------------------|



2015-12-15 23:46:07

by Luck, Tony

[permalink] [raw]
Subject: RE: [PATCHV2 2/3] x86, ras: Extend machine check recovery code to annotated ring0 areas

>> + /* Fault was in recoverable area of the kernel */
>> + if ((m.cs & 3) != 3 && worst == MCE_AR_SEVERITY)
>> + if (!fixup_mcexception(regs, m.addr))
>> + mce_panic("Failed kernel mode recovery", &m, NULL);
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> Does that always imply a failed kernel mode recovery? I don't see
>
> (m.cs == 0 and MCE_AR_SEVERITY)
>
> MCEs always meaning that a recovery should be attempted there. I think
> this should simply say
>
> mce_panic("Fatal machine check on current CPU", &m, msg);

I don't think this can ever happen. If we were in kernel mode and decided
that the severity was AR_SEVERITY ... then search_mcexception_table()
found an entry for the IP where the machine check happened.

The only way for fixup_exception to fail is if search_mcexception_table()
now suddenly doesn't find the entry it found earlier.

But if this "can't happen" thing actually does happen ... I'd like the panic
message to be different from other mce_panic() so you'll know to blame
me.

Applied all the other suggestions.

-Tony


2015-12-21 17:33:30

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCHV2 3/3] x86, ras: Add mcsafe_memcpy() function to recover from machine checks

On Tue, Dec 15, 2015 at 08:25:37PM +0000, Elliott, Robert (Persistent Memory) wrote:
> This isn't exactly what you're looking for, but here is
> an example of fio doing reads from pmem devices (reading
> from NVDIMMs, writing to DIMMs) with various transfer
> sizes.

... and "fio" is?

> At 256 KiB, all the main memory buffers fit in the CPU
> caches, so no write traffic appears on DDR (just the reads
> from the NVDIMMs). At 1 MiB, the data spills out of the
> caches, and writes to the DIMMs end up on DDR.
>
> Although DDR is busier, fio gets a lot less work done:
> * 256 KiB: 90 GiB/s by fio
> * 1 MiB: 49 GiB/s by fio

Yeah, I don't think that answers the question I had: whether REP; MOVSB
is faster/better than using non-temporal stores. But you say that
already above.

Also, if you do non-temporal stores then you're expected to have *more*
memory controller and DIMM traffic as you're pushing everything out
through the WCC.

What would need to be measured instead is, IMO, two things:

* compare NTI vs REP; MOVSB data movement to see the differences in
performance aspects

* run a benchmark (no idea which one) which would measure the positive
impact of the NTI versions which do not pollute the cache and thus do
not hurt other workloads' working set being pushed out of the cache.

Also, we don't really know (at least I don't) what REP; MOVSB
improvements hide behind those enhanced fast string optimizations.
It could be that microcode is doing some aggregation into cachelines
and doing much bigger writes which could compensate for the cache
pollution.

Questions over questions...

> We could try modifying pmem to use its own non-temporal
> memcpy functions (I've posted experimental patches
> before that did this) to see if that transition point
> shifts. We can also watch the CPU cache statistics
> while running.
>
> Here are statistics from Intel's pcm-memory.x
> (pardon the wide formatting):
>
> 256 KiB
> =======
> pmem0: (groupid=0, jobs=40): err= 0: pid=20867: Tue Nov 24 18:20:08 2015
> read : io=5219.1GB, bw=89079MB/s, iops=356314, runt= 60006msec
> cpu : usr=1.74%, sys=96.16%, ctx=49576, majf=0, minf=21997
>
> Run status group 0 (all jobs):
> READ: io=5219.1GB, aggrb=89079MB/s, minb=89079MB/s, maxb=89079MB/s, mint=60006msec, maxt=60006msec
>
> |---------------------------------------||---------------------------------------|
> |-- Socket 0 --||-- Socket 1 --|
> |---------------------------------------||---------------------------------------|
> |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
> |---------------------------------------||---------------------------------------|
> |-- Mem Ch 0: Reads (MB/s): 11778.11 --||-- Mem Ch 0: Reads (MB/s): 11743.99 --|
> |-- Writes(MB/s): 51.83 --||-- Writes(MB/s): 43.25 --|
> |-- Mem Ch 1: Reads (MB/s): 11779.90 --||-- Mem Ch 1: Reads (MB/s): 11736.06 --|
> |-- Writes(MB/s): 48.73 --||-- Writes(MB/s): 37.86 --|
> |-- Mem Ch 4: Reads (MB/s): 11784.79 --||-- Mem Ch 4: Reads (MB/s): 11746.94 --|
> |-- Writes(MB/s): 52.90 --||-- Writes(MB/s): 43.73 --|
> |-- Mem Ch 5: Reads (MB/s): 11778.48 --||-- Mem Ch 5: Reads (MB/s): 11741.55 --|
> |-- Writes(MB/s): 47.62 --||-- Writes(MB/s): 37.80 --|
> |-- NODE 0 Mem Read (MB/s) : 47121.27 --||-- NODE 1 Mem Read (MB/s) : 46968.53 --|
> |-- NODE 0 Mem Write(MB/s) : 201.08 --||-- NODE 1 Mem Write(MB/s) : 162.65 --|
> |-- NODE 0 P. Write (T/s): 190927 --||-- NODE 1 P. Write (T/s): 182961 --|

What does T/s mean?

> |-- NODE 0 Memory (MB/s): 47322.36 --||-- NODE 1 Memory (MB/s): 47131.17 --|
> |---------------------------------------||---------------------------------------|
> |---------------------------------------||---------------------------------------|
> |-- System Read Throughput(MB/s): 94089.80 --|
> |-- System Write Throughput(MB/s): 363.73 --|
> |-- System Memory Throughput(MB/s): 94453.52 --|
> |---------------------------------------||---------------------------------------|
>
> 1 MiB
> =====
> |---------------------------------------||---------------------------------------|
> |-- Socket 0 --||-- Socket 1 --|
> |---------------------------------------||---------------------------------------|
> |-- Memory Channel Monitoring --||-- Memory Channel Monitoring --|
> |---------------------------------------||---------------------------------------|
> |-- Mem Ch 0: Reads (MB/s): 7227.83 --||-- Mem Ch 0: Reads (MB/s): 7047.45 --|
> |-- Writes(MB/s): 5894.47 --||-- Writes(MB/s): 6010.66 --|
> |-- Mem Ch 1: Reads (MB/s): 7229.32 --||-- Mem Ch 1: Reads (MB/s): 7041.79 --|
> |-- Writes(MB/s): 5891.38 --||-- Writes(MB/s): 6003.19 --|
> |-- Mem Ch 4: Reads (MB/s): 7230.70 --||-- Mem Ch 4: Reads (MB/s): 7052.44 --|
> |-- Writes(MB/s): 5888.63 --||-- Writes(MB/s): 6012.49 --|
> |-- Mem Ch 5: Reads (MB/s): 7229.16 --||-- Mem Ch 5: Reads (MB/s): 7047.19 --|
> |-- Writes(MB/s): 5882.45 --||-- Writes(MB/s): 6008.11 --|
> |-- NODE 0 Mem Read (MB/s) : 28917.01 --||-- NODE 1 Mem Read (MB/s) : 28188.87 --|
> |-- NODE 0 Mem Write(MB/s) : 23556.93 --||-- NODE 1 Mem Write(MB/s) : 24034.46 --|
> |-- NODE 0 P. Write (T/s): 238713 --||-- NODE 1 P. Write (T/s): 228040 --|
> |-- NODE 0 Memory (MB/s): 52473.94 --||-- NODE 1 Memory (MB/s): 52223.33 --|
> |---------------------------------------||---------------------------------------|
> |---------------------------------------||---------------------------------------|
> |-- System Read Throughput(MB/s): 57105.87 --|
> |-- System Write Throughput(MB/s): 47591.39 --|
> |-- System Memory Throughput(MB/s): 104697.27 --|
> |---------------------------------------||---------------------------------------|

Looks to me like, because writes have increased, the read bandwidth has
dropped too, which makes sense.

--
Regards/Gruss,
Boris.

ECO tip #101: Trim your mails when you reply.