2011-05-26 17:07:45

by K.Prasad

[permalink] [raw]
Subject: [RFC Patch 0/6] slimdump: Enable slimdump if crashing kernel memory is not required

Hi All,
Please find a patchset that implements 'slimdump' - a new type
of coredump (captured through kdump) which will retain only essential
debugging information but will discard old kernel memory (which is mostly
irrelevant) in case of system crashes triggered through fatal hardware errors.

When a system crashes due to an unrecoverable memory error, kdump would
ordinarily attempt to read/capture the crashing kernel's memory which has
the following undesirable consequences:
- The old kernel memory is irrelevant for debugging crashes initiated due
to hardware errors (such as physical memory corruption causing fatal
machine check exceptions - MCE).
- The kdump capture kernel might experience a second MCE when attempting
to read the corrupt memory area having fatal results (and failure to
capture the coredump).

'slimdump' is particularly useful in avoiding the above hazards. It is a
light-weight dump and increases the reliability of the system by avoiding
unsafe operations.

In essence slimdump enables light-weight coredumps and increases the
reliability of the system by avoiding unsafe operations.

Summary of the patches
----
Patch 1, 2 - Patches from Andi Kleen's tree
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git which
introduce an enhanced panic function - xpanic().

Patch 3 - Introduces a new PANIC_MCE flag to signal the need for
SlimDump for
MCE initiated panic calls.

Patch 4 - Append a new elf-note inside coredump containing MCE related
registers

Patch 5 - Create a SlimDump by discarding old kernel's memory

Patch 6 - Patch for 'crash' tool
(http://people.redhat.com/anderson/crash-5.1.5.tar.gz) to recognise the
new
elf-notes section.

The patches have been tested from inside a VM and have been found to
work fine. The user-space tool used for copying the coredump is 'cp'
(makedumpfile testing will be done soon).

Example
--------
Normal vmcore
-----------
# ls -lh /home/prasadkr/vmcore.usual
-r-------- 1 root root 1.6G May 17 01:07 /home/prasadkr/vmcore.usual

slimdump
-----------
# ls -lh /home/prasadkr/vmcore.SlimDump
-r-------- 1 root root 1.8K May 18 23:28 /home/prasadkr/vmcore.SlimDump
# ./crash vmlinux vmcore.SlimDump

crash 5.1.5
Copyright (C) 2002-2011 Red Hat, Inc.
Copyright (C) 2004, 2005, 2006 IBM Corporation
Copyright (C) 1999-2006 Hewlett-Packard Co
Copyright (C) 2005, 2006 Fujitsu Limited
Copyright (C) 2006, 2007 VA Linux Systems Japan K.K.
Copyright (C) 2005 NEC Corporation
Copyright (C) 1999, 2002, 2007 Silicon Graphics, Inc.
Copyright (C) 1999, 2000, 2001, 2002 Mission Critical Linux, Inc.
This program is free software, covered by the GNU General Public
License,
and you are welcome to change it and/or distribute copies of it under
certain conditions. Enter "help copying" to see the conditions.
This program has absolutely no warranty. Enter "help warranty" for
details.

"System crashed due to a hardware memory error. No coredump available."
#
#

Thanks,
K.Prasad
_______________________________________________________
ltcras mailing list [email protected]
To unsubscribe from the list, change your list options
or if you have forgotten your list password visit:
http://lists.linux.ibm.com/mailman/listinfo/ltcras


2011-05-26 17:12:24

by K.Prasad

[permalink] [raw]
Subject: [Patch 1/6] XPANIC: Add extended panic interface


commit e668fa1aea7844ac4c7ea09030a2f3e647a4adb1
Author: Andi Kleen <[email protected]>
Date: Fri Nov 19 18:36:44 2010 +0100

XPANIC: Add extended panic interface

One panic size doesn't fit all.

Machine check has some special requirements on panic, like:
- It wants to do a reboot timeout by default so that the machine
check event can be logged to disk after warm reboot.
- For memory errors it usually doesn't want to do crash dumps because
that can lead to crash loops (dumping corrupted memory would
lead to another machine check)
- It doesn't want to do backtraces because machine checks
are not a kernel bug.

In a earlier patch this was done with various adhoc hacks,
but it's cleaner to extend panic to a 'xpanic' that directly
gets a flag and timeout argument.

The only user right now will be x86 machine checks, but I consider
it likely that other users will switch to this too.

For example one obvious candidate would be the "no root
found" panic which doesn't really deserve a backtrace.

I exported a vpanic() interface too as a global. That's not
needed by the current user, but the interface has to exist
internally anyways and I could see how other code would
find a v* variant of panic useful.

Originally based on a suggestion by H. Peter Anvin.
Signed-off-by: Andi Kleen <[email protected]>

---
include/linux/kernel.h | 11 +++++++++++
kernel/panic.c | 41 +++++++++++++++++++++++++++++++----------
2 files changed, 42 insertions(+), 10 deletions(-)

Index: linux-2.6.slim_kdump/include/linux/kernel.h
===================================================================
--- linux-2.6.slim_kdump.orig/include/linux/kernel.h
+++ linux-2.6.slim_kdump/include/linux/kernel.h
@@ -175,10 +175,21 @@ static inline void might_fault(void)
}
#endif

+enum panic_flags {
+ PANIC_NO_KEXEC = (1 << 0),
+ PANIC_NO_BACKTRACE = (1 << 1),
+};
+
extern struct atomic_notifier_head panic_notifier_list;
extern long (*panic_blink)(int state);
NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2))) __cold;
+NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
+ const char *fmt, ...)
+ __attribute__ ((NORET_AND format (printf, 3, 4))) __cold;
+NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
+ const char *fmt,
+ va_list ap) __noreturn __cold;
extern void oops_enter(void);
extern void oops_exit(void);
void print_oops_end_marker(void);
Index: linux-2.6.slim_kdump/kernel/panic.c
===================================================================
--- linux-2.6.slim_kdump.orig/kernel/panic.c
+++ linux-2.6.slim_kdump/kernel/panic.c
@@ -57,14 +57,37 @@ EXPORT_SYMBOL(panic_blink);
*
* This function never returns.
*/
-NORET_TYPE void panic(const char * fmt, ...)
+NORET_TYPE void panic(const char *fmt, ...)
+{
+ va_list ap;
+ va_start(ap, fmt);
+ vpanic(0, 0, fmt, ap);
+}
+EXPORT_SYMBOL(panic);
+
+NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
+ const char *fmt, ...)
+{
+ va_list ap;
+ va_start(ap, fmt);
+ xpanic(flags, timeout, fmt, ap);
+}
+EXPORT_SYMBOL(xpanic);
+
+NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
+ const char * fmt, va_list args)
{
static char buf[1024];
- va_list args;
long i, i_next = 0;
int state = 0;

/*
+ * Let user always override panic_timeout.
+ */
+ if (panic_timeout > 0)
+ timeout = panic_timeout;
+
+ /*
* It's possible to come here directly from a panic-assertion and
* not have preempt disabled. Some functions called from here want
* preempt to be disabled. No point enabling it later though...
@@ -73,12 +96,11 @@ NORET_TYPE void panic(const char * fmt,

console_verbose();
bust_spinlocks(1);
- va_start(args, fmt);
vsnprintf(buf, sizeof(buf), fmt, args);
- va_end(args);
printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
#ifdef CONFIG_DEBUG_BUGVERBOSE
- dump_stack();
+ if (!(flags & PANIC_NO_BACKTRACE))
+ dump_stack();
#endif

/*
@@ -86,7 +108,8 @@ NORET_TYPE void panic(const char * fmt,
* everything else.
* Do we want to call this before we try to display a message?
*/
- crash_kexec(NULL);
+ if (!(flags & PANIC_NO_KEXEC))
+ crash_kexec(NULL);

kmsg_dump(KMSG_DUMP_PANIC);

@@ -104,7 +127,7 @@ NORET_TYPE void panic(const char * fmt,
if (!panic_blink)
panic_blink = no_blink;

- if (panic_timeout > 0) {
+ if (timeout > 0) {
/*
* Delay timeout seconds before rebooting the machine.
* We can't use the "normal" timers since we just panicked.
@@ -152,9 +175,7 @@ NORET_TYPE void panic(const char * fmt,
mdelay(PANIC_TIMER_STEP);
}
}
-
-EXPORT_SYMBOL(panic);
-
+EXPORT_SYMBOL(vpanic);

struct tnt {
u8 bit;

2011-05-26 17:13:15

by K.Prasad

[permalink] [raw]
Subject: [Patch 2/6] x86: mce: Convert mce code to xpanic


commit aea29c4a9324f24b5c61e7c3919a8137a53be935
Author: Andi Kleen <[email protected]>
Date: Fri Nov 19 18:42:02 2010 +0100

x86: mce: Convert mce code to xpanic

- Pass in the panic timeout directly instead of
abusing global variable.
- Disable backtraces and kexecs on machine check panics
because they don't do anything useful.

Signed-off-by: Andi Kleen <[email protected]>

---
arch/x86/kernel/cpu/mcheck/mce.c | 10 ++++------
1 file changed, 4 insertions(+), 6 deletions(-)

Index: linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
@@ -258,9 +258,8 @@ static void wait_for_panic(void)
local_irq_enable();
while (timeout-- > 0)
udelay(1);
- if (panic_timeout == 0)
- panic_timeout = mce_panic_timeout;
- panic("Panicing machine check CPU died");
+ xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, 0,
+ "Panicing machine check CPU died");
}

static void mce_panic(char *msg, struct mce *final, char *exp)
@@ -316,9 +315,8 @@ static void mce_panic(char *msg, struct
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
if (!fake_panic) {
- if (panic_timeout == 0)
- panic_timeout = mce_panic_timeout;
- panic(msg);
+ xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, mce_panic_timeout,
+ msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
}

2011-05-26 17:14:34

by K.Prasad

[permalink] [raw]
Subject: [Bugfix][Patch 3/3] Invoke vpanic inside xpanic function


Invoke vpanic inside xpanic function

xpanic must invoke the worker routine vpanic to perform the erstwhile panic
related function. Instead it erroneously invokes xpanic (recursively) which
needs to be fixed.

Signed-off-by: K.Prasad <[email protected]>
---
kernel/panic.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

Index: linux-2.6.slim_kdump/kernel/panic.c
===================================================================
--- linux-2.6.slim_kdump.orig/kernel/panic.c
+++ linux-2.6.slim_kdump/kernel/panic.c
@@ -70,7 +70,7 @@ NORET_TYPE void xpanic(enum panic_flags
{
va_list ap;
va_start(ap, fmt);
- xpanic(flags, timeout, fmt, ap);
+ vpanic(flags, timeout, fmt, ap);
}
EXPORT_SYMBOL(xpanic);

2011-05-26 17:15:39

by K.Prasad

[permalink] [raw]
Subject: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information


PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

Fatal machine check exceptions (caused due to hardware memory errors) will now
result in a 'slim' coredump that captures vital information about the MCE. This
patch introduces a new panic flag, and new parameters to *panic functions
that can capture more information pertaining to the cause of crash.

Enable a new elf-notes section to store additional information about the crash.
For MCE, enable a new notes section that captures relevant register status
(struct mce) to be later read during coredump analysis.

Signed-off-by: K.Prasad <[email protected]>
---
arch/arm/kernel/traps.c | 2 +-
arch/powerpc/kernel/traps.c | 2 +-
arch/sh/kernel/traps_32.c | 2 +-
arch/x86/kernel/cpu/mcheck/mce.c | 7 +++----
arch/x86/kernel/dumpstack.c | 2 +-
include/linux/elf.h | 5 +++++
include/linux/kernel.h | 9 +++++----
include/linux/kexec.h | 9 ++++++---
kernel/kexec.c | 17 +++++++++++------
kernel/panic.c | 16 ++++++++--------
10 files changed, 42 insertions(+), 29 deletions(-)

Index: linux-2.6.slim_kdump/include/linux/kernel.h
===================================================================
--- linux-2.6.slim_kdump.orig/include/linux/kernel.h
+++ linux-2.6.slim_kdump/include/linux/kernel.h
@@ -178,17 +178,18 @@ static inline void might_fault(void)
enum panic_flags {
PANIC_NO_KEXEC = (1 << 0),
PANIC_NO_BACKTRACE = (1 << 1),
+ PANIC_MCE = (1 << 2),
};

extern struct atomic_notifier_head panic_notifier_list;
extern long (*panic_blink)(int state);
NORET_TYPE void panic(const char * fmt, ...)
__attribute__ ((NORET_AND format (printf, 1, 2))) __cold;
-NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
- const char *fmt, ...)
- __attribute__ ((NORET_AND format (printf, 3, 4))) __cold;
+NORET_TYPE void xpanic(enum panic_flags flags, int timeout, void *arch_info,
+ size_t arch_info_size, const char *fmt, ...)
+ __attribute__ ((NORET_AND format (printf, 5, 6))) __cold;
NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
- const char *fmt,
+ void *arch_info, size_t arch_info_size, const char *fmt,
va_list ap) __noreturn __cold;
extern void oops_enter(void);
extern void oops_exit(void);
Index: linux-2.6.slim_kdump/kernel/panic.c
===================================================================
--- linux-2.6.slim_kdump.orig/kernel/panic.c
+++ linux-2.6.slim_kdump/kernel/panic.c
@@ -61,21 +61,21 @@ NORET_TYPE void panic(const char *fmt, .
{
va_list ap;
va_start(ap, fmt);
- vpanic(0, 0, fmt, ap);
+ vpanic(0, 0, NULL, 0, fmt, ap);
}
EXPORT_SYMBOL(panic);

-NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
- const char *fmt, ...)
+NORET_TYPE void xpanic(enum panic_flags flags, int timeout, void *arch_info,
+ size_t arch_info_size, const char *fmt, ...)
{
va_list ap;
va_start(ap, fmt);
- vpanic(flags, timeout, fmt, ap);
+ vpanic(flags, timeout, arch_info, arch_info_size, fmt, ap);
}
EXPORT_SYMBOL(xpanic);

-NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
- const char * fmt, va_list args)
+NORET_TYPE void vpanic(enum panic_flags flags, int timeout, void *arch_info,
+ size_t arch_info_size, const char * fmt, va_list args)
{
static char buf[1024];
long i, i_next = 0;
@@ -99,7 +99,7 @@ NORET_TYPE void vpanic(enum panic_flags
vsnprintf(buf, sizeof(buf), fmt, args);
printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
#ifdef CONFIG_DEBUG_BUGVERBOSE
- if (!(flags & PANIC_NO_BACKTRACE))
+ if (!(flags & (PANIC_NO_BACKTRACE | PANIC_MCE)))
dump_stack();
#endif

@@ -109,7 +109,7 @@ NORET_TYPE void vpanic(enum panic_flags
* Do we want to call this before we try to display a message?
*/
if (!(flags & PANIC_NO_KEXEC))
- crash_kexec(NULL);
+ crash_kexec(NULL, arch_info, arch_info_size, flags);

kmsg_dump(KMSG_DUMP_PANIC);

Index: linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
@@ -258,8 +258,7 @@ static void wait_for_panic(void)
local_irq_enable();
while (timeout-- > 0)
udelay(1);
- xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, 0,
- "Panicing machine check CPU died");
+ xpanic(PANIC_MCE, 0, NULL, 0, "Panicing machine check CPU died");
}

static void mce_panic(char *msg, struct mce *final, char *exp)
@@ -315,8 +314,8 @@ static void mce_panic(char *msg, struct
if (exp)
pr_emerg(HW_ERR "Machine check: %s\n", exp);
if (!fake_panic) {
- xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, mce_panic_timeout,
- msg);
+ xpanic(PANIC_MCE, mce_panic_timeout, final,
+ sizeof(struct mce), msg);
} else
pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
}
Index: linux-2.6.slim_kdump/arch/arm/kernel/traps.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/arm/kernel/traps.c
+++ linux-2.6.slim_kdump/arch/arm/kernel/traps.c
@@ -274,7 +274,7 @@ void die(const char *str, struct pt_regs
ret = __die(str, err, thread, regs);

if (regs && kexec_should_crash(thread->task))
- crash_kexec(regs);
+ crash_kexec(regs, NULL, 0, 0);

bust_spinlocks(0);
add_taint(TAINT_DIE);
Index: linux-2.6.slim_kdump/arch/powerpc/kernel/traps.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/powerpc/kernel/traps.c
+++ linux-2.6.slim_kdump/arch/powerpc/kernel/traps.c
@@ -161,7 +161,7 @@ int die(const char *str, struct pt_regs

if (kexec_should_crash(current) ||
kexec_sr_activated(smp_processor_id()))
- crash_kexec(regs);
+ crash_kexec(regs, NULL, 0, 0);
crash_kexec_secondary(regs);

if (in_interrupt())
Index: linux-2.6.slim_kdump/arch/sh/kernel/traps_32.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/sh/kernel/traps_32.c
+++ linux-2.6.slim_kdump/arch/sh/kernel/traps_32.c
@@ -106,7 +106,7 @@ void die(const char * str, struct pt_reg
oops_exit();

if (kexec_should_crash(current))
- crash_kexec(regs);
+ crash_kexec(regs, NULL, 0, 0);

if (in_interrupt())
panic("Fatal exception in interrupt");
Index: linux-2.6.slim_kdump/arch/x86/kernel/dumpstack.c
===================================================================
--- linux-2.6.slim_kdump.orig/arch/x86/kernel/dumpstack.c
+++ linux-2.6.slim_kdump/arch/x86/kernel/dumpstack.c
@@ -241,7 +241,7 @@ EXPORT_SYMBOL_GPL(oops_begin);
void __kprobes oops_end(unsigned long flags, struct pt_regs *regs, int signr)
{
if (regs && kexec_should_crash(current))
- crash_kexec(regs);
+ crash_kexec(regs, NULL, 0, 0);

bust_spinlocks(0);
die_owner = -1;
Index: linux-2.6.slim_kdump/include/linux/elf.h
===================================================================
--- linux-2.6.slim_kdump.orig/include/linux/elf.h
+++ linux-2.6.slim_kdump/include/linux/elf.h
@@ -381,6 +381,11 @@ typedef struct elf64_shdr {
#define NT_PRPSINFO 3
#define NT_TASKSTRUCT 4
#define NT_AUXV 6
+/*
+ * Although numbers 1 - 6 have been defined here, the user-space include files
+ * have numbers 1 - 20 taken up. Hence defining NT_MCE as 21.
+ */
+#define NT_MCE 21 /* Machine Check Exception related data */
#define NT_PRXFPREG 0x46e62b7f /* copied from gdb5.1/include/elf/common.h */
#define NT_PPC_VMX 0x100 /* PowerPC Altivec/VMX registers */
#define NT_PPC_SPE 0x101 /* PowerPC SPE/EVR registers */
Index: linux-2.6.slim_kdump/include/linux/kexec.h
===================================================================
--- linux-2.6.slim_kdump.orig/include/linux/kexec.h
+++ linux-2.6.slim_kdump/include/linux/kexec.h
@@ -125,10 +125,12 @@ extern asmlinkage long compat_sys_kexec_
#endif
extern struct page *kimage_alloc_control_pages(struct kimage *image,
unsigned int order);
-extern void crash_kexec(struct pt_regs *);
+extern void crash_kexec(struct pt_regs *, void *arch_info,
+ size_t arch_info_size, enum panic_flags);
int kexec_should_crash(struct task_struct *);
void crash_save_cpu(struct pt_regs *regs, int cpu);
-void crash_save_vmcoreinfo(void);
+void crash_save_vmcoreinfo(void *arch_info, size_t arch_info_size,
+ enum panic_flags);
void arch_crash_save_vmcoreinfo(void);
void vmcoreinfo_append_str(const char *fmt, ...)
__attribute__ ((format (printf, 1, 2)));
@@ -213,7 +215,8 @@ void crash_free_reserved_phys_range(unsi
#else /* !CONFIG_KEXEC */
struct pt_regs;
struct task_struct;
-static inline void crash_kexec(struct pt_regs *regs) { }
+static inline void crash_kexec(struct pt_regs *regs, void *arch_info,
+ size_t arch_info_size, enum panic_flags) { }
static inline int kexec_should_crash(struct task_struct *p) { return 0; }
#endif /* CONFIG_KEXEC */
#endif /* LINUX_KEXEC_H */
Index: linux-2.6.slim_kdump/kernel/kexec.c
===================================================================
--- linux-2.6.slim_kdump.orig/kernel/kexec.c
+++ linux-2.6.slim_kdump/kernel/kexec.c
@@ -1065,7 +1065,8 @@ asmlinkage long compat_sys_kexec_load(un
}
#endif

-void crash_kexec(struct pt_regs *regs)
+void crash_kexec(struct pt_regs *regs, void *arch_info, size_t arch_info_size,
+ enum panic_flags flags)
{
/* Take the kexec_mutex here to prevent sys_kexec_load
* running on one cpu from replacing the crash kernel
@@ -1082,7 +1083,7 @@ void crash_kexec(struct pt_regs *regs)
kmsg_dump(KMSG_DUMP_KEXEC);

crash_setup_regs(&fixed_regs, regs);
- crash_save_vmcoreinfo();
+ crash_save_vmcoreinfo(arch_info, arch_info_size, flags);
machine_crash_shutdown(&fixed_regs);
machine_kexec(kexec_crash_image);
}
@@ -1381,7 +1382,8 @@ int __init parse_crashkernel(char *cm



-void crash_save_vmcoreinfo(void)
+void crash_save_vmcoreinfo(void *arch_info, size_t arch_info_size,
+ enum panic_flags flags)
{
u32 *buf;

@@ -1392,9 +1394,12 @@ void crash_save_vmcoreinfo(void)

buf = (u32 *)vmcoreinfo_note;

- buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0, vmcoreinfo_data,
- vmcoreinfo_size);
-
+ if (flags & PANIC_MCE)
+ buf = append_elf_note(buf, "PANIC_MCE", NT_MCE, arch_info,
+ arch_info_size);
+ else
+ buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0,
+ vmcoreinfo_data, vmcoreinfo_size);
final_note(buf);
}

2011-05-26 17:23:29

by K.Prasad

[permalink] [raw]
Subject: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes


slimdump: Capture slimdump for fatal MCE generated crashes

System crashes resulting from fatal hardware errors (such as MCE) don't need
all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
retains only essential information while discarding the old memory.

Signed-off-by: K.Prasad <[email protected]>
---
fs/proc/vmcore.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
1 file changed, 128 insertions(+), 2 deletions(-)

Index: linux-2.6.slim_kdump/fs/proc/vmcore.c
===================================================================
--- linux-2.6.slim_kdump.orig/fs/proc/vmcore.c
+++ linux-2.6.slim_kdump/fs/proc/vmcore.c
@@ -483,9 +483,60 @@ static void __init set_vmcore_list_offse
}
}

+/*
+ * Check if the crash was due to a fatal Memory Check Exception
+ */
+static int is_mce_crash64(void)
+{
+ int i, j, len = 0, rc;
+ Elf64_Ehdr *ehdr_ptr;
+ Elf64_Phdr *phdr_ptr;
+ Elf64_Nhdr *nhdr_ptr;
+
+ ehdr_ptr = (Elf64_Ehdr *)elfcorebuf;
+ phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+
+ for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ void *notes_section;
+ u64 offset, max_sz;
+ if (phdr_ptr->p_type != PT_NOTE)
+ continue;
+ max_sz = phdr_ptr->p_memsz;
+ offset = phdr_ptr->p_offset;
+ notes_section = kmalloc(max_sz, GFP_KERNEL);
+ if (!notes_section)
+ return -ENOMEM;
+ rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+ if (rc < 0) {
+ kfree(notes_section);
+ return rc;
+ }
+
+ for (j = 0; j < phdr_ptr->p_filesz; j += len) {
+ nhdr_ptr = notes_section + j;
+ if (nhdr_ptr->n_type == NT_MCE)
+ {
+ kfree(notes_section);
+ return 1;
+ }
+ /*
+ * The elf-64 standard specifies 8-byte alignment while
+ * append_elf_note function does only 4-byte roundup.
+ * Hence this code also does a 4-byte roundup.
+ */
+ len = sizeof(Elf64_Nhdr);
+ len = roundup(len + nhdr_ptr->n_namesz, 4);
+ len = roundup(len + nhdr_ptr->n_descsz, 4);
+ }
+ kfree(notes_section);
+ }
+ return 0;
+}
+
static int __init parse_crash_elf64_headers(void)
{
- int rc=0;
+ int i, rc = 0;
+ Elf64_Phdr *phdr_ptr;
Elf64_Ehdr ehdr;
u64 addr;

@@ -523,6 +574,18 @@ static int __init parse_crash_elf64_head
return rc;
}

+ phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+ if (is_mce_crash64() > 0) {
+ /*
+ * If crash is due to Machine Check exception, don't populate
+ * sections other than elf-notes. Mark their sizes as zero.
+ */
+ for (i = 0; i < ehdr.e_phnum; i++, phdr_ptr++) {
+ if (phdr_ptr->p_type != PT_NOTE)
+ phdr_ptr->p_memsz = phdr_ptr->p_filesz = 0;
+ }
+ }
+
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
@@ -539,9 +602,60 @@ static int __init parse_crash_elf64_head
return 0;
}

+/*
+ * Check if the crash was due to a fatal Memory Check Exception
+ */
+static int is_mce_crash32(void)
+{
+ int i, j, len = 0, rc;
+ Elf32_Ehdr *ehdr_ptr;
+ Elf32_Phdr *phdr_ptr;
+ Elf32_Nhdr *nhdr_ptr;
+
+ ehdr_ptr = (Elf32_Ehdr *)elfcorebuf;
+ phdr_ptr = (Elf32_Phdr *)(elfcorebuf + sizeof(Elf32_Ehdr));
+
+ for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ void *notes_section;
+ u64 offset, max_sz;
+ if (phdr_ptr->p_type != PT_NOTE)
+ continue;
+ max_sz = phdr_ptr->p_memsz;
+ offset = phdr_ptr->p_offset;
+ notes_section = kmalloc(max_sz, GFP_KERNEL);
+ if (!notes_section)
+ return -ENOMEM;
+ rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+ if (rc < 0) {
+ kfree(notes_section);
+ return rc;
+ }
+
+ for (j = 0; j < phdr_ptr->p_filesz; j += len) {
+ nhdr_ptr = notes_section + j;
+ if (nhdr_ptr->n_type == NT_MCE)
+ {
+ kfree(notes_section);
+ return 1;
+ }
+ /*
+ * The elf-64 standard specifies 8-byte alignment while
+ * append_elf_note function does only 4-byte roundup.
+ * Hence this code also does a 4-byte roundup.
+ */
+ len = sizeof(Elf64_Nhdr);
+ len = roundup(len + nhdr_ptr->n_namesz, 4);
+ len = roundup(len + nhdr_ptr->n_descsz, 4);
+ }
+ kfree(notes_section);
+ }
+ return 0;
+}
+
static int __init parse_crash_elf32_headers(void)
{
- int rc=0;
+ int i, rc = 0;
+ Elf32_Phdr *phdr_ptr;
Elf32_Ehdr ehdr;
u64 addr;

@@ -579,6 +693,18 @@ static int __init parse_crash_elf32_head
return rc;
}

+ phdr_ptr = (Elf32_Phdr *)(elfcorebuf + sizeof(Elf32_Ehdr));
+ if (is_mce_crash32() > 0) {
+ /*
+ * If crash is due to Machine Check exception, don't populate
+ * sections other than elf-notes. Mark their sizes as zero.
+ */
+ for (i = 0; i < ehdr.e_phnum; i++, phdr_ptr++) {
+ if (phdr_ptr->p_type != PT_NOTE)
+ phdr_ptr->p_memsz = phdr_ptr->p_filesz = 0;
+ }
+ }
+
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {

2011-05-26 17:26:39

by K.Prasad

[permalink] [raw]
Subject: [RFC Patch 6/6] Crash: Recognise slim coredumps and process new elf-note sections


Crash: Recognise slim coredumps and process new elf-note sections

The Linux kernel will begin to support SlimDump for certain types of crashes
and the 'crash' tool needs to recognise them. For these types of coredumps, it
need not lookout for usual elf-structures and start gdb. Also process new
elf-note sections that contain additional information about the crash.

Signed-off-by: K.Prasad <[email protected]>
---
diskdump.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
netdump.c | 8 +++++
x86.h | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 183 insertions(+)

Index: crash-5.1.5.slim_kdump/x86.h
===================================================================
--- /dev/null
+++ crash-5.1.5.slim_kdump/x86.h
@@ -0,0 +1,91 @@
+/*
+ * x86.h - x86 Architecture specific definitions
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License as published by
+ * the Free Software Foundation; either version 2 of the License, or
+ * (at your option) any later version.
+ *
+ * This program is distributed in the hope that it will be useful,
+ * but WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
+ * GNU General Public License for more details.
+ *
+ * You should have received a copy of the GNU General Public License
+ * along with this program; if not, write to the Free Software
+ * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
+ *
+ * Copyright (C) IBM Corporation, 2011
+ *
+ * Author: K.Prasad <[email protected]>
+ */
+
+typedef unsigned long long u64;
+typedef unsigned int u32;
+typedef unsigned short u16;
+typedef unsigned char u8;
+
+#define __u64 u64
+#define __u32 u32
+#define __u16 u16
+#define __u8 u8
+
+/* Mask for finding the address mode in IA32_MCi_MISC[8:6] register */
+#define MCI_MISC_ADDR_MODE 0X1C0
+/* Number of bits to shift the IA32_MCi_MISC to read the address-mode bits */
+#define MISC_ADDR_MODE_POS 6
+
+/* Address Modes in IA32_MCi_MISC[8:6] */
+#define MCM_ADDR_SEGOFF 0 /* segment offset */
+#define MCM_ADDR_LINEAR 1 /* linear address */
+#define MCM_ADDR_PHYS 2 /* physical address */
+#define MCM_ADDR_MEM 3 /* memory address */
+#define MCM_ADDR_GENERIC 7 /* generic */
+
+#define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */
+#define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. valid */
+
+#define PAGE_SHIFT 12
+
+static const char *mce_addr_mode[] =
+{
+ "Segment offset", /* MCM_ADDR_SEGOFF */
+ "Linear address", /* MCM_ADDR_LINEAR */
+ "Physical address", /* MCM_ADDR_PHYS */
+ "Memory address", /* MCM_ADDR_MEM */
+ "", /* reserved */
+ "", /* reserved */
+ "", /* reserved */
+ "Generic" /* MCM_ADDR_GENERIC */
+};
+
+/*
+ * kernel structure: Keep this in sync with the definition in
+ * arch/x86/include/asm/mce.h of linux source code.
+ *
+ * Fields are zero when not available
+ *
+ */
+struct mce {
+ __u64 status;
+ __u64 misc;
+ __u64 addr;
+ __u64 mcgstatus;
+ __u64 ip;
+ __u64 tsc; /* cpu time stamp counter */
+ __u64 time; /* wall time_t when error was detected */
+ __u8 cpuvendor; /* cpu vendor as encoded in system.h */
+ __u8 pad1;
+ __u16 pad2;
+ __u32 cpuid; /* CPUID 1 EAX */
+ __u8 cs; /* code segment */
+ __u8 bank; /* machine check bank */
+ __u8 cpu; /* cpu number; obsolete; use extcpu now */
+ __u8 finished; /* entry is valid */
+ __u32 extcpu; /* linux cpu number that detected the error */
+ __u32 socketid; /* CPU socket ID */
+ __u32 apicid; /* CPU initial apic ID */
+ __u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
+ __u64 aux0;
+ __u64 aux1;
+};
Index: crash-5.1.5.slim_kdump/netdump.c
===================================================================
--- crash-5.1.5.slim_kdump.orig/netdump.c
+++ crash-5.1.5.slim_kdump/netdump.c
@@ -331,6 +331,10 @@ is_netdump(char *file, ulong source_quer
}
nd->notes32 = (Elf32_Phdr *)
&nd->elf_header[sizeof(Elf32_Ehdr)];
+ if (machdep->process_elf_notes)
+ machdep->process_elf_notes((char *)nd->elf32 +
+ nd->notes32->p_offset,
+ nd->notes32->p_filesz);
nd->load32 = (Elf32_Phdr *)
&nd->elf_header[sizeof(Elf32_Ehdr)+sizeof(Elf32_Phdr)];
if (DUMPFILE_FORMAT(nd->flags) == NETDUMP_ELF32)
@@ -360,6 +364,10 @@ is_netdump(char *file, ulong source_quer
}
nd->notes64 = (Elf64_Phdr *)
&nd->elf_header[sizeof(Elf64_Ehdr)];
+ if (machdep->process_elf_notes)
+ machdep->process_elf_notes((char *)nd->elf64 +
+ nd->notes64->p_offset,
+ nd->notes64->p_filesz);
nd->load64 = (Elf64_Phdr *)
&nd->elf_header[sizeof(Elf64_Ehdr)+sizeof(Elf64_Phdr)];
if (DUMPFILE_FORMAT(nd->flags) == NETDUMP_ELF64)
Index: crash-5.1.5.slim_kdump/diskdump.c
===================================================================
--- crash-5.1.5.slim_kdump.orig/diskdump.c
+++ crash-5.1.5.slim_kdump/diskdump.c
@@ -231,6 +231,27 @@ open_dump_file(char *file)
dd->dfd = fd;
return TRUE;
}
+#if defined(X86_64) || defined(X86)
+#include "x86.h"
+
+/*
+ * Check if the address reported by the CPU is in a format we can parse.
+ * It would be possible to add code for most other cases, but all would
+ * be somewhat complicated (e.g. segment offset would require an instruction
+ * parser). So only support physical addresses up to page granuality for now.
+ *
+ * Function derived from arch/x86/kernel/cpu/mcheck/mce.c in Linux source
+ *
+ */
+static int mce_usable_address(struct mce *m)
+{
+ if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
+ return 0;
+ if ((m->misc & 0x3f) > PAGE_SHIFT)
+ return 0;
+ return 1;
+}
+#endif /* defined(X86_64) || defined(X86) */

void
x86_process_elf_notes(void *note_ptr, unsigned long size_note)
@@ -239,10 +260,43 @@ x86_process_elf_notes(void *note_ptr, un
Elf64_Nhdr *note64 = NULL;
size_t tot, len = 0;
int num = 0;
+#if defined(X86_64) || defined(X86)
+ struct mce *mce;
+ ushort addr_mode;
+#endif /* defined(X86_64) || defined(X86) */

for (tot = 0; tot < size_note; tot += len) {
if (machine_type("X86_64")) {
note64 = note_ptr + tot;
+#ifdef X86_64
+ /*
+ * If vmcore is generated due to fatal Machine Check
+ * Exception, we only have a 'slim' crashdump. Don't
+ * analyse further, inform the user about it and exit.
+ */
+ if (note64->n_type == NT_MCE) {
+ fprintf(fp, "\"System crashed due to a hardware"
+ " memory error. No coredump"
+ " available.\"\n");
+
+ /* Do we have a copy of 'struct mce'? */
+ if (note64->n_descsz == 0)
+ goto exit;
+
+ mce = (struct mce *)((char *)note64 +
+ sizeof(Elf64_Nhdr) + note64->n_namesz);
+ if (!mce_usable_address(mce))
+ goto exit;
+
+ addr_mode = (mce->misc >> MISC_ADDR_MODE_POS) &
+ MCI_MISC_ADDR_MODE;
+ fprintf(fp, "Memory error occured at %llx "
+ "(address type: %s\n)", mce->addr,
+ mce_addr_mode[addr_mode]);
+exit:
+ clean_exit(0);
+ }
+#endif /* X86_64 */

if (note64->n_type == NT_PRSTATUS) {
dd->nt_prstatus_percpu[num] = note64;
@@ -255,6 +309,36 @@ x86_process_elf_notes(void *note_ptr, un
} else if (machine_type("X86")) {
note32 = note_ptr + tot;

+#ifdef X86
+ /*
+ * If vmcore is generated due to fatal Machine Check
+ * Exception, we only have a 'slim' crashdump. Don't
+ * analyse further, inform the user about it and exit.
+ */
+ if (note32->n_type == NT_MCE) {
+ fprintf(fp, "\"System crashed due to a hardware"
+ " memory error. No coredump"
+ " available.\"\n");
+
+ /* Do we have a copy of 'struct mce'? */
+ if (note32->n_descsz == 0)
+ goto exit;
+
+ mce = (struct mce *)((char *)note32 +
+ sizeof(Elf32_Nhdr) + note32->n_namesz);
+ if (!mce_usable_address(mce))
+ goto exit;
+
+ addr_mode = (mce->misc >> MISC_ADDR_MODE_POS) &
+ MCI_MISC_ADDR_MODE;
+ fprintf(fp, "Memory error occured at %llx "
+ "(address type: %s\n)", mce->addr,
+ mce_addr_mode[addr_mode]);
+exit:
+ clean_exit(0);
+ }
+#endif /* X86 */
+
if (note32->n_type == NT_PRSTATUS) {
dd->nt_prstatus_percpu[num] = note32;
num++;

2011-05-26 17:32:08

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 0/6] slimdump: Enable slimdump if crashing kernel memory is not required

On Thu, May 26, 2011 at 10:37:22PM +0530, K.Prasad wrote:
> Hi All,

A small correction related to patch description...

>
> Summary of the patches
> ----
> Patch 1, 2 - Patches from Andi Kleen's tree
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git which
> introduce an enhanced panic function - xpanic().
>

Patch 3 - A small bugfix for patch 1,2

Patch 4 - Introduces a new PANIC_MCE flag to signal the need for slimdump
for MCE initiated panic calls. Appends a new elf-note inside coredump
containing MCE related registers.

Thanks,
K.Prasad

2011-05-26 17:33:00

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
>
> slimdump: Capture slimdump for fatal MCE generated crashes
>
> System crashes resulting from fatal hardware errors (such as MCE) don't need
> all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> retains only essential information while discarding the old memory.

While this is a good idea, note there may be still poisoned lines
in memory that haven't resulted in a machine check yet, but could
still be fatal when read after a full crash dump for some other
reason.

So you still need

http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=fe61906edce9e70d02481a77a617ba1397573dce
and
http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=cb58f049ae6709ddbab71be199390dc6852018cd

in addition.

-Andi

2011-05-26 17:38:22

by Richard Weinberger

[permalink] [raw]
Subject: Re: [Patch 1/6] XPANIC: Add extended panic interface

On Thu, May 26, 2011 at 7:12 PM, K.Prasad <[email protected]> wrote:
>
> commit e668fa1aea7844ac4c7ea09030a2f3e647a4adb1
> Author: Andi Kleen <[email protected]>
> Date: ? Fri Nov 19 18:36:44 2010 +0100
>
> ? ?XPANIC: Add extended panic interface
>
> ? ?One panic size doesn't fit all.
>
> ? ?Machine check has some special requirements on panic, like:
> ? ?- It wants to do a reboot timeout by default so that the machine
> ? ?check event can be logged to disk after warm reboot.
> ? ?- For memory errors it usually doesn't want to do crash dumps because
> ? ?that can lead to crash loops (dumping corrupted memory would
> ? ?lead to another machine check)
> ? ?- It doesn't want to do backtraces because machine checks
> ? ?are not a kernel bug.
>
> ? ?In a earlier patch this was done with various adhoc hacks,
> ? ?but it's cleaner to extend panic to a 'xpanic' that directly
> ? ?gets a flag and timeout argument.
>
> ? ?The only user right now will be x86 machine checks, but I consider
> ? ?it likely that other users will switch to this too.
>
> ? ?For example one obvious candidate would be the "no root
> ? ?found" panic which doesn't really deserve a backtrace.
>
> ? ?I exported a vpanic() interface too as a global. That's not
> ? ?needed by the current user, but the interface has to exist
> ? ?internally anyways and I could see how other code would
> ? ?find a v* variant of panic useful.
>
> ? ?Originally based on a suggestion by H. Peter Anvin.
> ? ?Signed-off-by: Andi Kleen <[email protected]>
>
> ---
> ?include/linux/kernel.h | ? 11 +++++++++++
> ?kernel/panic.c ? ? ? ? | ? 41 +++++++++++++++++++++++++++++++----------
> ?2 files changed, 42 insertions(+), 10 deletions(-)
>
> Index: linux-2.6.slim_kdump/include/linux/kernel.h
> ===================================================================
> --- linux-2.6.slim_kdump.orig/include/linux/kernel.h
> +++ linux-2.6.slim_kdump/include/linux/kernel.h
> @@ -175,10 +175,21 @@ static inline void might_fault(void)
> ?}
> ?#endif
>
> +enum panic_flags {
> + ? ? ? PANIC_NO_KEXEC ? ? = (1 << 0),
> + ? ? ? PANIC_NO_BACKTRACE = (1 << 1),
> +};
> +
> ?extern struct atomic_notifier_head panic_notifier_list;
> ?extern long (*panic_blink)(int state);
> ?NORET_TYPE void panic(const char * fmt, ...)
> ? ? ? ?__attribute__ ((NORET_AND format (printf, 1, 2))) __cold;
> +NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
> + ? ? ? ? ? ? ? ? ? ? ?const char *fmt, ...)
> + ? ? ? __attribute__ ((NORET_AND format (printf, 3, 4))) __cold;
> +NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
> + ? ? ? ? ? ? ? ? ? ? ?const char *fmt,
> + ? ? ? ? ? ? ? ? ? ? ?va_list ap) __noreturn __cold;
> ?extern void oops_enter(void);
> ?extern void oops_exit(void);
> ?void print_oops_end_marker(void);
> Index: linux-2.6.slim_kdump/kernel/panic.c
> ===================================================================
> --- linux-2.6.slim_kdump.orig/kernel/panic.c
> +++ linux-2.6.slim_kdump/kernel/panic.c
> @@ -57,14 +57,37 @@ EXPORT_SYMBOL(panic_blink);
> ?*
> ?* ? ? This function never returns.
> ?*/
> -NORET_TYPE void panic(const char * fmt, ...)
> +NORET_TYPE void panic(const char *fmt, ...)
> +{
> + ? ? ? va_list ap;
> + ? ? ? va_start(ap, fmt);
> + ? ? ? vpanic(0, 0, fmt, ap);
> +}
> +EXPORT_SYMBOL(panic);
> +
> +NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
> + ? ? ? ? ? ? ? ? ? ? ? const char *fmt, ...)
> +{
> + ? ? ? va_list ap;
> + ? ? ? va_start(ap, fmt);
> + ? ? ? xpanic(flags, timeout, fmt, ap);

Why are you calling xpanic() here again?
I guess you meant vpanic().

> +}
> +EXPORT_SYMBOL(xpanic);
> +
> +NORET_TYPE void vpanic(enum panic_flags flags, int timeout,
> + ? ? ? ? ? ? ? ? ? ? ?const char * fmt, va_list args)
> ?{
> ? ? ? ?static char buf[1024];
> - ? ? ? va_list args;
> ? ? ? ?long i, i_next = 0;
> ? ? ? ?int state = 0;
>
> ? ? ? ?/*
> + ? ? ? ?* Let user always override panic_timeout.
> + ? ? ? ?*/
> + ? ? ? if (panic_timeout > 0)
> + ? ? ? ? ? ? ? timeout = panic_timeout;
> +
> + ? ? ? /*
> ? ? ? ? * It's possible to come here directly from a panic-assertion and
> ? ? ? ? * not have preempt disabled. Some functions called from here want
> ? ? ? ? * preempt to be disabled. No point enabling it later though...
> @@ -73,12 +96,11 @@ NORET_TYPE void panic(const char * fmt,
>
> ? ? ? ?console_verbose();
> ? ? ? ?bust_spinlocks(1);
> - ? ? ? va_start(args, fmt);
> ? ? ? ?vsnprintf(buf, sizeof(buf), fmt, args);
> - ? ? ? va_end(args);
> ? ? ? ?printk(KERN_EMERG "Kernel panic - not syncing: %s\n",buf);
> ?#ifdef CONFIG_DEBUG_BUGVERBOSE
> - ? ? ? dump_stack();
> + ? ? ? if (!(flags & PANIC_NO_BACKTRACE))
> + ? ? ? ? ? ? ? dump_stack();
> ?#endif
>
> ? ? ? ?/*
> @@ -86,7 +108,8 @@ NORET_TYPE void panic(const char * fmt,
> ? ? ? ? * everything else.
> ? ? ? ? * Do we want to call this before we try to display a message?
> ? ? ? ? */
> - ? ? ? crash_kexec(NULL);
> + ? ? ? if (!(flags & PANIC_NO_KEXEC))
> + ? ? ? ? ? ? ? crash_kexec(NULL);
>
> ? ? ? ?kmsg_dump(KMSG_DUMP_PANIC);
>
> @@ -104,7 +127,7 @@ NORET_TYPE void panic(const char * fmt,
> ? ? ? ?if (!panic_blink)
> ? ? ? ? ? ? ? ?panic_blink = no_blink;
>
> - ? ? ? if (panic_timeout > 0) {
> + ? ? ? if (timeout > 0) {
> ? ? ? ? ? ? ? ?/*
> ? ? ? ? ? ? ? ? * Delay timeout seconds before rebooting the machine.
> ? ? ? ? ? ? ? ? * We can't use the "normal" timers since we just panicked.
> @@ -152,9 +175,7 @@ NORET_TYPE void panic(const char * fmt,
> ? ? ? ? ? ? ? ?mdelay(PANIC_TIMER_STEP);
> ? ? ? ?}
> ?}
> -
> -EXPORT_SYMBOL(panic);
> -
> +EXPORT_SYMBOL(vpanic);
>
> ?struct tnt {
> ? ? ? ?u8 ? ? ?bit;
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at ?http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at ?http://www.tux.org/lkml/
>

--
Thanks,
//richard

2011-05-26 17:45:40

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
>
> slimdump: Capture slimdump for fatal MCE generated crashes
>
> System crashes resulting from fatal hardware errors (such as MCE) don't need
> all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> retains only essential information while discarding the old memory.
>

Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
leave it to user space.

As Andi said, you anyway will require disabling MCE temporarily in second
kernel. So that should allay the concern that we might run into second
MCE while trying to read the vmcore.

A user might want to extract just the dmesg buffers also after MCE from
vmcore.

What I am saying is that this sounds more like a policy. Don't enforce
it in kernel. Probably give user MCE registers as part of ELF note of
vmcore. Now leave it up to user what data he wants to extract from vmcore.
Just the MCE registers or something more then that.

Thanks
Vivek

> Signed-off-by: K.Prasad <[email protected]>
> ---
> fs/proc/vmcore.c | 130 ++++++++++++++++++++++++++++++++++++++++++++++++++++++-
> 1 file changed, 128 insertions(+), 2 deletions(-)
>
> Index: linux-2.6.slim_kdump/fs/proc/vmcore.c
> ===================================================================
> --- linux-2.6.slim_kdump.orig/fs/proc/vmcore.c
> +++ linux-2.6.slim_kdump/fs/proc/vmcore.c
> @@ -483,9 +483,60 @@ static void __init set_vmcore_list_offse
> }
> }
>
> +/*
> + * Check if the crash was due to a fatal Memory Check Exception
> + */
> +static int is_mce_crash64(void)
> +{
> + int i, j, len = 0, rc;
> + Elf64_Ehdr *ehdr_ptr;
> + Elf64_Phdr *phdr_ptr;
> + Elf64_Nhdr *nhdr_ptr;
> +
> + ehdr_ptr = (Elf64_Ehdr *)elfcorebuf;
> + phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
> +
> + for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> + void *notes_section;
> + u64 offset, max_sz;
> + if (phdr_ptr->p_type != PT_NOTE)
> + continue;
> + max_sz = phdr_ptr->p_memsz;
> + offset = phdr_ptr->p_offset;
> + notes_section = kmalloc(max_sz, GFP_KERNEL);
> + if (!notes_section)
> + return -ENOMEM;
> + rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
> + if (rc < 0) {
> + kfree(notes_section);
> + return rc;
> + }
> +
> + for (j = 0; j < phdr_ptr->p_filesz; j += len) {
> + nhdr_ptr = notes_section + j;
> + if (nhdr_ptr->n_type == NT_MCE)
> + {
> + kfree(notes_section);
> + return 1;
> + }
> + /*
> + * The elf-64 standard specifies 8-byte alignment while
> + * append_elf_note function does only 4-byte roundup.
> + * Hence this code also does a 4-byte roundup.
> + */
> + len = sizeof(Elf64_Nhdr);
> + len = roundup(len + nhdr_ptr->n_namesz, 4);
> + len = roundup(len + nhdr_ptr->n_descsz, 4);
> + }
> + kfree(notes_section);
> + }
> + return 0;
> +}
> +
> static int __init parse_crash_elf64_headers(void)
> {
> - int rc=0;
> + int i, rc = 0;
> + Elf64_Phdr *phdr_ptr;
> Elf64_Ehdr ehdr;
> u64 addr;
>
> @@ -523,6 +574,18 @@ static int __init parse_crash_elf64_head
> return rc;
> }
>
> + phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
> + if (is_mce_crash64() > 0) {
> + /*
> + * If crash is due to Machine Check exception, don't populate
> + * sections other than elf-notes. Mark their sizes as zero.
> + */
> + for (i = 0; i < ehdr.e_phnum; i++, phdr_ptr++) {
> + if (phdr_ptr->p_type != PT_NOTE)
> + phdr_ptr->p_memsz = phdr_ptr->p_filesz = 0;
> + }
> + }
> +
> /* Merge all PT_NOTE headers into one. */
> rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> if (rc) {
> @@ -539,9 +602,60 @@ static int __init parse_crash_elf64_head
> return 0;
> }
>
> +/*
> + * Check if the crash was due to a fatal Memory Check Exception
> + */
> +static int is_mce_crash32(void)
> +{
> + int i, j, len = 0, rc;
> + Elf32_Ehdr *ehdr_ptr;
> + Elf32_Phdr *phdr_ptr;
> + Elf32_Nhdr *nhdr_ptr;
> +
> + ehdr_ptr = (Elf32_Ehdr *)elfcorebuf;
> + phdr_ptr = (Elf32_Phdr *)(elfcorebuf + sizeof(Elf32_Ehdr));
> +
> + for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> + void *notes_section;
> + u64 offset, max_sz;
> + if (phdr_ptr->p_type != PT_NOTE)
> + continue;
> + max_sz = phdr_ptr->p_memsz;
> + offset = phdr_ptr->p_offset;
> + notes_section = kmalloc(max_sz, GFP_KERNEL);
> + if (!notes_section)
> + return -ENOMEM;
> + rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
> + if (rc < 0) {
> + kfree(notes_section);
> + return rc;
> + }
> +
> + for (j = 0; j < phdr_ptr->p_filesz; j += len) {
> + nhdr_ptr = notes_section + j;
> + if (nhdr_ptr->n_type == NT_MCE)
> + {
> + kfree(notes_section);
> + return 1;
> + }
> + /*
> + * The elf-64 standard specifies 8-byte alignment while
> + * append_elf_note function does only 4-byte roundup.
> + * Hence this code also does a 4-byte roundup.
> + */
> + len = sizeof(Elf64_Nhdr);
> + len = roundup(len + nhdr_ptr->n_namesz, 4);
> + len = roundup(len + nhdr_ptr->n_descsz, 4);
> + }
> + kfree(notes_section);
> + }
> + return 0;
> +}
> +
> static int __init parse_crash_elf32_headers(void)
> {
> - int rc=0;
> + int i, rc = 0;
> + Elf32_Phdr *phdr_ptr;
> Elf32_Ehdr ehdr;
> u64 addr;
>
> @@ -579,6 +693,18 @@ static int __init parse_crash_elf32_head
> return rc;
> }
>
> + phdr_ptr = (Elf32_Phdr *)(elfcorebuf + sizeof(Elf32_Ehdr));
> + if (is_mce_crash32() > 0) {
> + /*
> + * If crash is due to Machine Check exception, don't populate
> + * sections other than elf-notes. Mark their sizes as zero.
> + */
> + for (i = 0; i < ehdr.e_phnum; i++, phdr_ptr++) {
> + if (phdr_ptr->p_type != PT_NOTE)
> + phdr_ptr->p_memsz = phdr_ptr->p_filesz = 0;
> + }
> + }
> +
> /* Merge all PT_NOTE headers into one. */
> rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> if (rc) {

2011-05-26 18:09:33

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 01:44:47PM -0400, Vivek Goyal wrote:
> On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> >
> > slimdump: Capture slimdump for fatal MCE generated crashes
> >
> > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > retains only essential information while discarding the old memory.
> >
>
> Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
> leave it to user space.

I think it's a good default to not do a full dump on MCE.
It's very unlikely to be useful for anything, and will just waste
reboot time (aka nines).

That said including the dmesg too may be a good idea.

On the other hand I think the slim dump should be probably generalized
and used for more situations. I could well imagine enabling it by
default on various systems.

-Andi

2011-05-26 18:27:11

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 08:09:31PM +0200, Andi Kleen wrote:
> On Thu, May 26, 2011 at 01:44:47PM -0400, Vivek Goyal wrote:
> > On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> > >
> > > slimdump: Capture slimdump for fatal MCE generated crashes
> > >
> > > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > > retains only essential information while discarding the old memory.
> > >
> >
> > Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
> > leave it to user space.
>
> I think it's a good default to not do a full dump on MCE.
> It's very unlikely to be useful for anything, and will just waste
> reboot time (aka nines).

If we are just extracting and saving MCE registers from vmcore, then
reboot time does not increase. It increases only if user decides to
extract and save extra data from vmcore.

>
> That said including the dmesg too may be a good idea.

dmesg is already part of vmcore and user space tools can easily find
it.

I can easily imagine a default policy of a distro in user space where
in case of MCE crash, we just extract dmesg and MCE registers (from vmcore
notes section) reboot. This will be fast and will reduce the amount of code
in kernel.

IMHO, we should not introduce any additional notion of slimdump as such in
kernel. A better thing would be to just read MCE registers and export to
user space through ELF notes and then let user space automate the rest of
it.

Thanks
Vivek

2011-05-26 18:44:33

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Thu, May 26, 2011 at 10:45:21PM +0530, K.Prasad wrote:

[..]
> Index: linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> ===================================================================
> --- linux-2.6.slim_kdump.orig/arch/x86/kernel/cpu/mcheck/mce.c
> +++ linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> @@ -258,8 +258,7 @@ static void wait_for_panic(void)
> local_irq_enable();
> while (timeout-- > 0)
> udelay(1);
> - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, 0,
> - "Panicing machine check CPU died");
> + xpanic(PANIC_MCE, 0, NULL, 0, "Panicing machine check CPU died");
> }
>
> static void mce_panic(char *msg, struct mce *final, char *exp)
> @@ -315,8 +314,8 @@ static void mce_panic(char *msg, struct
> if (exp)
> pr_emerg(HW_ERR "Machine check: %s\n", exp);
> if (!fake_panic) {
> - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, mce_panic_timeout,
> - msg);
> + xpanic(PANIC_MCE, mce_panic_timeout, final,
> + sizeof(struct mce), msg);
> } else
> pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
> }

In previous patches you introduce PANIC_NO_KEXEC and PANIC_NO_BACKTRACE.
Now in this patch you got rid of those. Are there any other users left
of PANIC_NO_BACKTRACE and PANIC_NO_EXEC? If not, then why to introduce
these to begin with.

Thanks
Vivek

2011-05-26 18:58:40

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

> If we are just extracting and saving MCE registers from vmcore, then
> reboot time does not increase. It increases only if user decides to
> extract and save extra data from vmcore.

Hmm I was thinking of user space usually saving the dump first
before analyzing it. But yes it could probably do some minimal
analysis first.

I guess most user spaces just won't change and will just continue
to DTWT though.

-Andi

--
[email protected] -- Speaking for myself only.

2011-05-26 19:11:59

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 08:58:38PM +0200, Andi Kleen wrote:
> > If we are just extracting and saving MCE registers from vmcore, then
> > reboot time does not increase. It increases only if user decides to
> > extract and save extra data from vmcore.
>
> Hmm I was thinking of user space usually saving the dump first
> before analyzing it. But yes it could probably do some minimal
> analysis first.

In RHEL, now we filter out the dump by default until and unless user
decides to no filter the dump with the help of config options.

So I think in this case we can just introduce an extra filtering
option in "makedumpfile" and ask it to save only MCE registers
if it notices that there is NT_MCE type of ELF note in vmcore.

I can very well imaging that extracting dmesg along with MCE registers
can be useful. If nothing else, it can give us useful information about
the system configuration.

Zeroing out all the other ELF headers in vmcore takes away the
capability to extract log buffers in case of MCE and I think
it probably is not the best idea.

Thanks
Vivek

2011-05-26 23:44:08

by Simon Horman

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 03:10:08PM -0400, Vivek Goyal wrote:
> On Thu, May 26, 2011 at 08:58:38PM +0200, Andi Kleen wrote:
> > > If we are just extracting and saving MCE registers from vmcore, then
> > > reboot time does not increase. It increases only if user decides to
> > > extract and save extra data from vmcore.
> >
> > Hmm I was thinking of user space usually saving the dump first
> > before analyzing it. But yes it could probably do some minimal
> > analysis first.
>
> In RHEL, now we filter out the dump by default until and unless user
> decides to no filter the dump with the help of config options.
>
> So I think in this case we can just introduce an extra filtering
> option in "makedumpfile" and ask it to save only MCE registers
> if it notices that there is NT_MCE type of ELF note in vmcore.
>
> I can very well imaging that extracting dmesg along with MCE registers
> can be useful. If nothing else, it can give us useful information about
> the system configuration.
>
> Zeroing out all the other ELF headers in vmcore takes away the
> capability to extract log buffers in case of MCE and I think
> it probably is not the best idea.
>

FWIW, my initial thoughts are inline with Vivek's.
I think that the bulk of this, if not all of it, can
be done in user space. And that user space is the preferable
location for such logic.

Subject: Re: [RFC Patch 6/6] Crash: Recognise slim coredumps and process new elf-note sections

On 2011-05-26 22:56:18 Thu, K.Prasad wrote:
>
> Crash: Recognise slim coredumps and process new elf-note sections
>
> The Linux kernel will begin to support SlimDump for certain types of crashes
> and the 'crash' tool needs to recognise them. For these types of coredumps, it
> need not lookout for usual elf-structures and start gdb. Also process new
> elf-note sections that contain additional information about the crash.
>
> Signed-off-by: K.Prasad <[email protected]>
> ---
> diskdump.c | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> netdump.c | 8 +++++
> x86.h | 91 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 3 files changed, 183 insertions(+)
>
> Index: crash-5.1.5.slim_kdump/x86.h
> ===================================================================
> --- /dev/null
> +++ crash-5.1.5.slim_kdump/x86.h
> @@ -0,0 +1,91 @@
> +/*
> + * x86.h - x86 Architecture specific definitions
> + *
> + * This program is free software; you can redistribute it and/or modify
> + * it under the terms of the GNU General Public License as published by
> + * the Free Software Foundation; either version 2 of the License, or
> + * (at your option) any later version.
> + *
> + * This program is distributed in the hope that it will be useful,
> + * but WITHOUT ANY WARRANTY; without even the implied warranty of
> + * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
> + * GNU General Public License for more details.
> + *
> + * You should have received a copy of the GNU General Public License
> + * along with this program; if not, write to the Free Software
> + * Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.
> + *
> + * Copyright (C) IBM Corporation, 2011
> + *
> + * Author: K.Prasad <[email protected]>
> + */
> +
> +typedef unsigned long long u64;
> +typedef unsigned int u32;
> +typedef unsigned short u16;
> +typedef unsigned char u8;
> +
> +#define __u64 u64
> +#define __u32 u32
> +#define __u16 u16
> +#define __u8 u8
> +
> +/* Mask for finding the address mode in IA32_MCi_MISC[8:6] register */
> +#define MCI_MISC_ADDR_MODE 0X1C0
> +/* Number of bits to shift the IA32_MCi_MISC to read the address-mode bits */
> +#define MISC_ADDR_MODE_POS 6
> +
> +/* Address Modes in IA32_MCi_MISC[8:6] */
> +#define MCM_ADDR_SEGOFF 0 /* segment offset */
> +#define MCM_ADDR_LINEAR 1 /* linear address */
> +#define MCM_ADDR_PHYS 2 /* physical address */
> +#define MCM_ADDR_MEM 3 /* memory address */
> +#define MCM_ADDR_GENERIC 7 /* generic */
> +
> +#define MCI_STATUS_MISCV (1ULL<<59) /* misc error reg. valid */
> +#define MCI_STATUS_ADDRV (1ULL<<58) /* addr reg. valid */
> +
> +#define PAGE_SHIFT 12
> +
> +static const char *mce_addr_mode[] =
> +{
> + "Segment offset", /* MCM_ADDR_SEGOFF */
> + "Linear address", /* MCM_ADDR_LINEAR */
> + "Physical address", /* MCM_ADDR_PHYS */
> + "Memory address", /* MCM_ADDR_MEM */
> + "", /* reserved */
> + "", /* reserved */
> + "", /* reserved */
> + "Generic" /* MCM_ADDR_GENERIC */
> +};
> +
> +/*
> + * kernel structure: Keep this in sync with the definition in
> + * arch/x86/include/asm/mce.h of linux source code.
> + *
> + * Fields are zero when not available
> + *
> + */
> +struct mce {
> + __u64 status;
> + __u64 misc;
> + __u64 addr;
> + __u64 mcgstatus;
> + __u64 ip;
> + __u64 tsc; /* cpu time stamp counter */
> + __u64 time; /* wall time_t when error was detected */
> + __u8 cpuvendor; /* cpu vendor as encoded in system.h */
> + __u8 pad1;
> + __u16 pad2;
> + __u32 cpuid; /* CPUID 1 EAX */
> + __u8 cs; /* code segment */
> + __u8 bank; /* machine check bank */
> + __u8 cpu; /* cpu number; obsolete; use extcpu now */
> + __u8 finished; /* entry is valid */
> + __u32 extcpu; /* linux cpu number that detected the error */
> + __u32 socketid; /* CPU socket ID */
> + __u32 apicid; /* CPU initial apic ID */
> + __u64 mcgcap; /* MCGCAP MSR: machine check capabilities of CPU */
> + __u64 aux0;
> + __u64 aux1;
> +};
> Index: crash-5.1.5.slim_kdump/netdump.c
> ===================================================================
> --- crash-5.1.5.slim_kdump.orig/netdump.c
> +++ crash-5.1.5.slim_kdump/netdump.c
> @@ -331,6 +331,10 @@ is_netdump(char *file, ulong source_quer
> }
> nd->notes32 = (Elf32_Phdr *)
> &nd->elf_header[sizeof(Elf32_Ehdr)];
> + if (machdep->process_elf_notes)
> + machdep->process_elf_notes((char *)nd->elf32 +
> + nd->notes32->p_offset,
> + nd->notes32->p_filesz);
> nd->load32 = (Elf32_Phdr *)
> &nd->elf_header[sizeof(Elf32_Ehdr)+sizeof(Elf32_Phdr)];
> if (DUMPFILE_FORMAT(nd->flags) == NETDUMP_ELF32)
> @@ -360,6 +364,10 @@ is_netdump(char *file, ulong source_quer
> }
> nd->notes64 = (Elf64_Phdr *)
> &nd->elf_header[sizeof(Elf64_Ehdr)];
> + if (machdep->process_elf_notes)
> + machdep->process_elf_notes((char *)nd->elf64 +
> + nd->notes64->p_offset,
> + nd->notes64->p_filesz);

Now that machdep->process_elf_notes() is invoked in generic KDUMP
processing code path, please remove the separate invocation of
machdep->dumpfile_init() which was introduced for s390x architecture in
dump_Elf64_Nhdr() function. The reason is, machdep->process_elf_notes()
on s390x internally invokes machdep->dumpfile_init(). Hence we can
safely remove it.

> nd->load64 = (Elf64_Phdr *)
> &nd->elf_header[sizeof(Elf64_Ehdr)+sizeof(Elf64_Phdr)];
> if (DUMPFILE_FORMAT(nd->flags) == NETDUMP_ELF64)
> Index: crash-5.1.5.slim_kdump/diskdump.c
> ===================================================================
> --- crash-5.1.5.slim_kdump.orig/diskdump.c
> +++ crash-5.1.5.slim_kdump/diskdump.c
> @@ -231,6 +231,27 @@ open_dump_file(char *file)
> dd->dfd = fd;
> return TRUE;
> }
> +#if defined(X86_64) || defined(X86)
> +#include "x86.h"
> +
> +/*
> + * Check if the address reported by the CPU is in a format we can parse.
> + * It would be possible to add code for most other cases, but all would
> + * be somewhat complicated (e.g. segment offset would require an instruction
> + * parser). So only support physical addresses up to page granuality for now.
> + *
> + * Function derived from arch/x86/kernel/cpu/mcheck/mce.c in Linux source
> + *
> + */
> +static int mce_usable_address(struct mce *m)
> +{
> + if (!(m->status & MCI_STATUS_MISCV) || !(m->status & MCI_STATUS_ADDRV))
> + return 0;
> + if ((m->misc & 0x3f) > PAGE_SHIFT)
> + return 0;
> + return 1;
> +}
> +#endif /* defined(X86_64) || defined(X86) */
>
> void
> x86_process_elf_notes(void *note_ptr, unsigned long size_note)
> @@ -239,10 +260,43 @@ x86_process_elf_notes(void *note_ptr, un
> Elf64_Nhdr *note64 = NULL;
> size_t tot, len = 0;
> int num = 0;
> +#if defined(X86_64) || defined(X86)
> + struct mce *mce;
> + ushort addr_mode;
> +#endif /* defined(X86_64) || defined(X86) */
>
> for (tot = 0; tot < size_note; tot += len) {
> if (machine_type("X86_64")) {
> note64 = note_ptr + tot;
> +#ifdef X86_64
> + /*
> + * If vmcore is generated due to fatal Machine Check
> + * Exception, we only have a 'slim' crashdump. Don't
> + * analyse further, inform the user about it and exit.
> + */
> + if (note64->n_type == NT_MCE) {
> + fprintf(fp, "\"System crashed due to a hardware"
> + " memory error. No coredump"
> + " available.\"\n");
> +
> + /* Do we have a copy of 'struct mce'? */
> + if (note64->n_descsz == 0)
> + goto exit;
> +
> + mce = (struct mce *)((char *)note64 +
> + sizeof(Elf64_Nhdr) + note64->n_namesz);
> + if (!mce_usable_address(mce))
> + goto exit;
> +
> + addr_mode = (mce->misc >> MISC_ADDR_MODE_POS) &
> + MCI_MISC_ADDR_MODE;
> + fprintf(fp, "Memory error occured at %llx "
> + "(address type: %s\n)", mce->addr,
> + mce_addr_mode[addr_mode]);
> +exit:
> + clean_exit(0);
> + }
> +#endif /* X86_64 */

The function x86_process_elf_notes() is invoked through
machdep->process_elf_notes function pointer, which makes it arch
dependent code. How about moving this function (x86_process_elf_notes)
to an arch dependent file say x86_common.c ? By doing so we can get rid of
all "#ifdefs" here.

Hi Dave, what do you say?

>
> if (note64->n_type == NT_PRSTATUS) {
> dd->nt_prstatus_percpu[num] = note64;
> @@ -255,6 +309,36 @@ x86_process_elf_notes(void *note_ptr, un
> } else if (machine_type("X86")) {
> note32 = note_ptr + tot;
>
> +#ifdef X86
> + /*
> + * If vmcore is generated due to fatal Machine Check
> + * Exception, we only have a 'slim' crashdump. Don't
> + * analyse further, inform the user about it and exit.
> + */
> + if (note32->n_type == NT_MCE) {
> + fprintf(fp, "\"System crashed due to a hardware"
> + " memory error. No coredump"
> + " available.\"\n");
> +
> + /* Do we have a copy of 'struct mce'? */
> + if (note32->n_descsz == 0)
> + goto exit;
> +
> + mce = (struct mce *)((char *)note32 +
> + sizeof(Elf32_Nhdr) + note32->n_namesz);
> + if (!mce_usable_address(mce))
> + goto exit;
> +
> + addr_mode = (mce->misc >> MISC_ADDR_MODE_POS) &
> + MCI_MISC_ADDR_MODE;
> + fprintf(fp, "Memory error occured at %llx "
> + "(address type: %s\n)", mce->addr,
> + mce_addr_mode[addr_mode]);
> +exit:
> + clean_exit(0);
> + }
> +#endif /* X86 */
> +
> if (note32->n_type == NT_PRSTATUS) {
> dd->nt_prstatus_percpu[num] = note32;
> num++;

--
Mahesh J Salgaonkar

2011-05-27 15:54:13

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 07:32:57PM +0200, Andi Kleen wrote:
> On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> >
> > slimdump: Capture slimdump for fatal MCE generated crashes
> >
> > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > retains only essential information while discarding the old memory.
>
> While this is a good idea, note there may be still poisoned lines
> in memory that haven't resulted in a machine check yet, but could
> still be fatal when read after a full crash dump for some other
> reason.
>

True, this patch does not handle the discovery of old poisoned lines/new
memory errors that may occur when inside the kdump kernel.

> So you still need
>
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=fe61906edce9e70d02481a77a617ba1397573dce
> and
> http://git.kernel.org/?p=linux/kernel/git/ak/linux-mce-2.6.git;a=commit;h=cb58f049ae6709ddbab71be199390dc6852018cd
>
> in addition.
>
> -Andi

So, there could be (atleast) two ways to handle fatal MCEs in kdump
kernel:

- To disable MCE exceptions as done by the patches cited above. However
the result of a read operation on corrupted memory is unknown and the
system behaviour is undefined. We're unsure if this is a safe thing to
do.

- To disable capture of kdump (when panic is invoked from) inside kdump
kernel and simply reboot the system. Since the chance of memory error
inside kdump kernel (which runs for a very short duration) is rare, I
think this solution is preferrable.

Let me know your thoughts on this.

Thanks,
K.Prasad

2011-05-27 15:57:17

by K.Prasad

[permalink] [raw]
Subject: Re: [Patch 1/6] XPANIC: Add extended panic interface

On Thu, May 26, 2011 at 07:38:19PM +0200, richard -rw- weinberger wrote:
> On Thu, May 26, 2011 at 7:12 PM, K.Prasad <[email protected]> wrote:
> >
> > commit e668fa1aea7844ac4c7ea09030a2f3e647a4adb1
> > Author: Andi Kleen <[email protected]>
> > Date: ? Fri Nov 19 18:36:44 2010 +0100
> >
> > Index: linux-2.6.slim_kdump/kernel/panic.c
> > ===================================================================
> > --- linux-2.6.slim_kdump.orig/kernel/panic.c
> > +++ linux-2.6.slim_kdump/kernel/panic.c
> > @@ -57,14 +57,37 @@ EXPORT_SYMBOL(panic_blink);
> > ?*
> > ?* ? ? This function never returns.
> > ?*/
> > -NORET_TYPE void panic(const char * fmt, ...)
> > +NORET_TYPE void panic(const char *fmt, ...)
> > +{
> > + ? ? ? va_list ap;
> > + ? ? ? va_start(ap, fmt);
> > + ? ? ? vpanic(0, 0, fmt, ap);
> > +}
> > +EXPORT_SYMBOL(panic);
> > +
> > +NORET_TYPE void xpanic(enum panic_flags flags, int timeout,
> > + ? ? ? ? ? ? ? ? ? ? ? const char *fmt, ...)
> > +{
> > + ? ? ? va_list ap;
> > + ? ? ? va_start(ap, fmt);
> > + ? ? ? xpanic(flags, timeout, fmt, ap);
>
> Why are you calling xpanic() here again?
> I guess you meant vpanic().
>

True, this is a bug in the original patch found in the git tree
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git, for
which I have proposed a patch here: https://lkml.org/lkml/2011/5/26/337.

Thanks,
K.Prasad

2011-05-27 16:57:51

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Thu, May 26, 2011 at 01:44:47PM -0400, Vivek Goyal wrote:
> On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> >
> > slimdump: Capture slimdump for fatal MCE generated crashes
> >
> > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > retains only essential information while discarding the old memory.
> >
>
> Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
> leave it to user space.
>

Our concern comes from the fact that it is unsafe for the OS to read
any part of the corrupt memory region, so the kernel does not have to make
that address space available for read/write.

> As Andi said, you anyway will require disabling MCE temporarily in second
> kernel. So that should allay the concern that we might run into second
> MCE while trying to read the vmcore.
>

The processor manuals don't define the behaviour for a read operation
upon a corrupt memory location. So the consequences for a read after
disabling MCE is unknown (as I've mentioned here:
https://lkml.org/lkml/2011/5/27/258).

> A user might want to extract just the dmesg buffers also after MCE from
> vmcore.
>
> What I am saying is that this sounds more like a policy. Don't enforce
> it in kernel. Probably give user MCE registers as part of ELF note of
> vmcore. Now leave it up to user what data he wants to extract from vmcore.
> Just the MCE registers or something more then that.

I'm unsure, at the moment, as to what it actually entails to extract
dmesg buffers from the old kernel's memory; but given that there exists
a corrupt memory region with unknown consequences for read operations
upon it, I doubt if it is a safe idea to allow user-space application to
navigate through the kernel data-structures to extract the dmesg.

Alternatively, we might leave MCE enabled for the kdump kernel and
modify the user-space application to turn resilient to SIGBUS signal. So
if it reads a corrupt memory region, it will receive a signal from
do_machine_check() upon which it can skip that particular address and/or
fill it with zeroes (or somesuch). We haven't gone through the
details....but just some loud thinking

Thanks,
K.Prasad

2011-05-27 17:03:47

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Thu, May 26, 2011 at 02:43:50PM -0400, Vivek Goyal wrote:
> On Thu, May 26, 2011 at 10:45:21PM +0530, K.Prasad wrote:
>
> [..]
> > Index: linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> > ===================================================================
> > --- linux-2.6.slim_kdump.orig/arch/x86/kernel/cpu/mcheck/mce.c
> > +++ linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> > @@ -258,8 +258,7 @@ static void wait_for_panic(void)
> > local_irq_enable();
> > while (timeout-- > 0)
> > udelay(1);
> > - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, 0,
> > - "Panicing machine check CPU died");
> > + xpanic(PANIC_MCE, 0, NULL, 0, "Panicing machine check CPU died");
> > }
> >
> > static void mce_panic(char *msg, struct mce *final, char *exp)
> > @@ -315,8 +314,8 @@ static void mce_panic(char *msg, struct
> > if (exp)
> > pr_emerg(HW_ERR "Machine check: %s\n", exp);
> > if (!fake_panic) {
> > - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, mce_panic_timeout,
> > - msg);
> > + xpanic(PANIC_MCE, mce_panic_timeout, final,
> > + sizeof(struct mce), msg);
> > } else
> > pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
> > }
>
> In previous patches you introduce PANIC_NO_KEXEC and PANIC_NO_BACKTRACE.
> Now in this patch you got rid of those. Are there any other users left
> of PANIC_NO_BACKTRACE and PANIC_NO_EXEC? If not, then why to introduce
> these to begin with.
>

The previous patch also converts panic to xpanic and is taken from
Andi's
git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git tree.
The changes are kept as two separate patches to identify their origin.

Thanks,
K.Prasad

2011-05-27 17:59:30

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Patch 1/6] XPANIC: Add extended panic interface

"K.Prasad" <[email protected]> writes:

> commit e668fa1aea7844ac4c7ea09030a2f3e647a4adb1
> Author: Andi Kleen <[email protected]>
> Date: Fri Nov 19 18:36:44 2010 +0100
>
> XPANIC: Add extended panic interface
>
> One panic size doesn't fit all.
>
> Machine check has some special requirements on panic, like:
> - It wants to do a reboot timeout by default so that the machine
> check event can be logged to disk after warm reboot.

There is certainly not comment to that effect and I have not see a
mechanism that does that.

> - For memory errors it usually doesn't want to do crash dumps because
> that can lead to crash loops (dumping corrupted memory would
> lead to another machine check)

It usually doesn't, and it would not lead to a crash loop just a failure
to capture the cause of the crash.

> - It doesn't want to do backtraces because machine checks
> are not a kernel bug.

But it sure is nice to know where it happened anyway, so you can guess
the damage. This sounds like more of the Andi thing where he doesn't
want to hear about memory errors, because it is not a software problem.

That is definitely not sufficient reason to turn off backtraces.

> In a earlier patch this was done with various adhoc hacks,
> but it's cleaner to extend panic to a 'xpanic' that directly
> gets a flag and timeout argument.
>
> The only user right now will be x86 machine checks, but I consider
> it likely that other users will switch to this too.
>
> For example one obvious candidate would be the "no root
> found" panic which doesn't really deserve a backtrace.

That does seem like a good fit for that case. However it doesn't seem
like a good fit for your case.

> I exported a vpanic() interface too as a global. That's not
> needed by the current user, but the interface has to exist
> internally anyways and I could see how other code would
> find a v* variant of panic useful.
>
> Originally based on a suggestion by H. Peter Anvin.
> Signed-off-by: Andi Kleen <[email protected]>

Nacked-by: "Eric W. Biederman" <[email protected]>

This entire chunk of two patches appears to be complete non-sense.

mce_panic_timeout is hard coded policy that userspace does not get
to control, that overrides userspace policy.

It looks to me like the right solution is just to delete the
mce_panic_timeout.

As for the extra flags that you add the panic path. They can
only make the code more fragile.

Eric

2011-05-27 18:00:10

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Fri, May 27, 2011 at 10:27:36PM +0530, K.Prasad wrote:
> On Thu, May 26, 2011 at 01:44:47PM -0400, Vivek Goyal wrote:
> > On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> > >
> > > slimdump: Capture slimdump for fatal MCE generated crashes
> > >
> > > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > > retains only essential information while discarding the old memory.
> > >
> >
> > Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
> > leave it to user space.
> >
>
> Our concern comes from the fact that it is unsafe for the OS to read
> any part of the corrupt memory region, so the kernel does not have to make
> that address space available for read/write.

The very fact you are booting into second kernel you are reading lots
of address space already. The assumption is that whole of the reserved
region is fine otherwise kernel will not even boot. Then even in older
kernel you are accessing memory used for saving the elf headers and
mce registers.

>
> > As Andi said, you anyway will require disabling MCE temporarily in second
> > kernel. So that should allay the concern that we might run into second
> > MCE while trying to read the vmcore.
> >
>
> The processor manuals don't define the behaviour for a read operation
> upon a corrupt memory location. So the consequences for a read after
> disabling MCE is unknown (as I've mentioned here:
> https://lkml.org/lkml/2011/5/27/258).
>

First of all a user can simply have a configuration to extract MCE
registers and your concern is addressed. Eric W. Biederman said that
he had got bunch of MCE triggered dumps also and things were fine. I think
we are over engineering this. In the first step lets just keep it simple
and just export MCE registers as a note. If accessing rest of regions
becomes a issue in real life, then we can have a look at this again.


> > A user might want to extract just the dmesg buffers also after MCE from
> > vmcore.
> >
> > What I am saying is that this sounds more like a policy. Don't enforce
> > it in kernel. Probably give user MCE registers as part of ELF note of
> > vmcore. Now leave it up to user what data he wants to extract from vmcore.
> > Just the MCE registers or something more then that.
>
> I'm unsure, at the moment, as to what it actually entails to extract
> dmesg buffers from the old kernel's memory; but given that there exists
> a corrupt memory region with unknown consequences for read operations
> upon it, I doubt if it is a safe idea to allow user-space application to
> navigate through the kernel data-structures to extract the dmesg.
>
> Alternatively, we might leave MCE enabled for the kdump kernel and
> modify the user-space application to turn resilient to SIGBUS signal. So
> if it reads a corrupt memory region, it will receive a signal from
> do_machine_check() upon which it can skip that particular address and/or
> fill it with zeroes (or somesuch). We haven't gone through the
> details....but just some loud thinking

Again, I think simplest would be to disable MCE while we are accessing
previous kernel's memory and configure user space to either just extract
MCE registers. If you are too paranoid, you can do it in two steps. First
save MCE registes and make sure these are on disk and then go for
extracting rest of the info like dmesg and in the process if you go down,
anyway it was best effort thing.

Thanks
Vivek

2011-05-27 18:01:40

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [Patch 2/6] x86: mce: Convert mce code to xpanic

"K.Prasad" <[email protected]> writes:

> commit aea29c4a9324f24b5c61e7c3919a8137a53be935
> Author: Andi Kleen <[email protected]>
> Date: Fri Nov 19 18:42:02 2010 +0100
>
> x86: mce: Convert mce code to xpanic
>
> - Pass in the panic timeout directly instead of
> abusing global variable.
> - Disable backtraces and kexecs on machine check panics
> because they don't do anything useful.

Nacked-by: "Eric W. Biederman" <[email protected]>

Kexec on panic does not fact do something useful, and usable
today.

The mce_panic_timeout looks like a horrible hack that should
be killed. Why do we think we know better than the sysadmin
that sets up the system.

Eric

2011-05-27 18:04:20

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

"K.Prasad" <[email protected]> writes:

> PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
>
> Fatal machine check exceptions (caused due to hardware memory errors) will now
> result in a 'slim' coredump that captures vital information about the MCE. This
> patch introduces a new panic flag, and new parameters to *panic functions
> that can capture more information pertaining to the cause of crash.
>
> Enable a new elf-notes section to store additional information about the crash.
> For MCE, enable a new notes section that captures relevant register status
> (struct mce) to be later read during coredump analysis.

There may be a reason to pass everything struct mce through 5 layers of
code but right now it looks like it just makes everything uglier to no
real purpose.

Eric

2011-05-27 18:09:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

"K.Prasad" <[email protected]> writes:

> PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
>
> Fatal machine check exceptions (caused due to hardware memory errors) will now
> result in a 'slim' coredump that captures vital information about the MCE. This
> patch introduces a new panic flag, and new parameters to *panic functions
> that can capture more information pertaining to the cause of crash.
>
> Enable a new elf-notes section to store additional information about the crash.
> For MCE, enable a new notes section that captures relevant register status
> (struct mce) to be later read during coredump analysis.
>
> Signed-off-by: K.Prasad <[email protected]>

Nacked-by: "Eric W. Biederman" <[email protected]>


> -void crash_save_vmcoreinfo(void)
> +void crash_save_vmcoreinfo(void *arch_info, size_t arch_info_size,
> + enum panic_flags flags)
> {
> u32 *buf;
>
> @@ -1392,9 +1394,12 @@ void crash_save_vmcoreinfo(void)
>
> buf = (u32 *)vmcoreinfo_note;
>
> - buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0, vmcoreinfo_data,
> - vmcoreinfo_size);
> -
> + if (flags & PANIC_MCE)
> + buf = append_elf_note(buf, "PANIC_MCE", NT_MCE, arch_info,
> + arch_info_size);

Ok. This is uglier than I thought. You have this terribly generic
interface than when you go to use it you assume it is an mce. Ugh
no.

Furthermore you short circuit out other pieces of this code.

This patch is a total hack that is massively over engineered.

Saving NT_MCE is fine, in and of itself. The rest of the information
is totally nonsense.

Eric


> + else
> + buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0,
> + vmcoreinfo_data, vmcoreinfo_size);
> final_note(buf);
> }
>

2011-05-27 18:14:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

"K.Prasad" <[email protected]> writes:

> slimdump: Capture slimdump for fatal MCE generated crashes
>
> System crashes resulting from fatal hardware errors (such as MCE) don't need
> all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> retains only essential information while discarding the old memory.

Ugh. You are changing the kernel to change userspace policy?

And because of how you have previous populated this we won't get the
notes to let us extract dmesg easily.

No thanks. Let's change this in userspace instead.

You are going to an awful lot of work to avoid a problem that doesn't
happen to in practice.

Patching makedumpfile to key in on an mce note would be more appropriate.

Nacked-by: "Eric W. Biederman" <[email protected]>

2011-05-27 18:16:41

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 6/6] Crash: Recognise slim coredumps and process new elf-note sections

"K.Prasad" <[email protected]> writes:

> Crash: Recognise slim coredumps and process new elf-note sections
>
> The Linux kernel will begin to support SlimDump for certain types of crashes
> and the 'crash' tool needs to recognise them. For these types of coredumps, it
> need not lookout for usual elf-structures and start gdb. Also process new
> elf-note sections that contain additional information about the crash.

I suppose patching crash make sense.

Unfortunately crash doesn't work on 99% of the kernels I run so, so I
stopped caring a while ago.

Eric

2011-05-27 18:22:26

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 6/6] Crash: Recognise slim coredumps and process new elf-note sections

On Fri, May 27, 2011 at 11:16:34AM -0700, Eric W. Biederman wrote:
> "K.Prasad" <[email protected]> writes:
>
> > Crash: Recognise slim coredumps and process new elf-note sections
> >
> > The Linux kernel will begin to support SlimDump for certain types of crashes
> > and the 'crash' tool needs to recognise them. For these types of coredumps, it
> > need not lookout for usual elf-structures and start gdb. Also process new
> > elf-note sections that contain additional information about the crash.
>
> I suppose patching crash make sense.
>
> Unfortunately crash doesn't work on 99% of the kernels I run so, so I
> stopped caring a while ago.

Are these patched kernels? Otherwise Dave Anderson is pretty good about
keeping crash running across multiple kernels.

For me it works just great. I recently analyzed a crash dump with
2.6.39-rc7 kernel. Dave is generally takes patches for crash for latest
upstream kernel changes.

Thanks
Vivek

2011-05-27 18:29:33

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Fri, May 27, 2011 at 10:33:31PM +0530, K.Prasad wrote:
> On Thu, May 26, 2011 at 02:43:50PM -0400, Vivek Goyal wrote:
> > On Thu, May 26, 2011 at 10:45:21PM +0530, K.Prasad wrote:
> >
> > [..]
> > > Index: linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> > > ===================================================================
> > > --- linux-2.6.slim_kdump.orig/arch/x86/kernel/cpu/mcheck/mce.c
> > > +++ linux-2.6.slim_kdump/arch/x86/kernel/cpu/mcheck/mce.c
> > > @@ -258,8 +258,7 @@ static void wait_for_panic(void)
> > > local_irq_enable();
> > > while (timeout-- > 0)
> > > udelay(1);
> > > - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, 0,
> > > - "Panicing machine check CPU died");
> > > + xpanic(PANIC_MCE, 0, NULL, 0, "Panicing machine check CPU died");
> > > }
> > >
> > > static void mce_panic(char *msg, struct mce *final, char *exp)
> > > @@ -315,8 +314,8 @@ static void mce_panic(char *msg, struct
> > > if (exp)
> > > pr_emerg(HW_ERR "Machine check: %s\n", exp);
> > > if (!fake_panic) {
> > > - xpanic(PANIC_NO_KEXEC|PANIC_NO_BACKTRACE, mce_panic_timeout,
> > > - msg);
> > > + xpanic(PANIC_MCE, mce_panic_timeout, final,
> > > + sizeof(struct mce), msg);
> > > } else
> > > pr_emerg(HW_ERR "Fake kernel panic: %s\n", msg);
> > > }
> >
> > In previous patches you introduce PANIC_NO_KEXEC and PANIC_NO_BACKTRACE.
> > Now in this patch you got rid of those. Are there any other users left
> > of PANIC_NO_BACKTRACE and PANIC_NO_EXEC? If not, then why to introduce
> > these to begin with.
> >
>
> The previous patch also converts panic to xpanic and is taken from
> Andi's
> git://git.kernel.org/pub/scm/linux/kernel/git/ak/linux-mce-2.6.git tree.
> The changes are kept as two separate patches to identify their origin.

If you have to remove dead piece of code from original patch, I think
you can always take original patch, modify it and give the credit to
original author by explicitly mentioning it in your changelog.

Thanks
Vivek

2011-05-27 18:36:01

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 6/6] Crash: Recognise slim coredumps and process new elf-note sections

Vivek Goyal <[email protected]> writes:

> On Fri, May 27, 2011 at 11:16:34AM -0700, Eric W. Biederman wrote:
>> "K.Prasad" <[email protected]> writes:
>>
>> > Crash: Recognise slim coredumps and process new elf-note sections
>> >
>> > The Linux kernel will begin to support SlimDump for certain types of crashes
>> > and the 'crash' tool needs to recognise them. For these types of coredumps, it
>> > need not lookout for usual elf-structures and start gdb. Also process new
>> > elf-note sections that contain additional information about the crash.
>>
>> I suppose patching crash make sense.
>>
>> Unfortunately crash doesn't work on 99% of the kernels I run so, so I
>> stopped caring a while ago.
>
> Are these patched kernels? Otherwise Dave Anderson is pretty good about
> keeping crash running across multiple kernels.

Only slightly patched. But they are definitely kernels I have built
myself. It might just be a logistics problem where if I want to use
crash I need to upgrade it when I upgrade my kernels.

I got frustrated a while ago trying to get something out of crash
(and I'm afraid it shows), and I have just gone to extracting the
dmesg. Which gives me all of the information I generally have time
to process.

Eric

2011-05-31 17:40:52

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> "K.Prasad" <[email protected]> writes:
>
> > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
> >
> > Fatal machine check exceptions (caused due to hardware memory errors) will now
> > result in a 'slim' coredump that captures vital information about the MCE. This
> > patch introduces a new panic flag, and new parameters to *panic functions
> > that can capture more information pertaining to the cause of crash.
> >
> > Enable a new elf-notes section to store additional information about the crash.
> > For MCE, enable a new notes section that captures relevant register status
> > (struct mce) to be later read during coredump analysis.
>
> There may be a reason to pass everything struct mce through 5 layers of
> code but right now it looks like it just makes everything uglier to no
> real purpose.

We could have stopped with just a blank elf-note of type NT_MCE
indicating an MCE triggered panic, but dumping 'struct mce' in it will
help gather more useful information about the error - especially the
memory address that experienced unrecoverable error (stored in
mce->addr).

The patch 6/6 for the 'crash' tool enabled decoding of 'struct
mce' to show this information (although the sample log in patch 0/6)
didn't show these benefits because 'mce-inject' tool used to soft-inject
these errors doesn't populate all registers with valid contents.

The idea was that when mce->addr contains physical address is shown
while decoding coredump, the corresponding memory DIMM could be identified
for replacement/isolation.

Given that 'struct mce' isn't placed in a user-space visible file its
duplicate copies have to be maintained in 'crash' (like it is done in
'mcelog' tool), and that's one disadvantage.

If you think that this complicates the patch, I'll start with a much
'slimmer' version (!) of the slimdump and the improvements may be
contemplated iteratively.

Thanks,
K.Prasad

2011-06-01 17:18:37

by Dave Anderson

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information



----- Original Message -----
> On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> > "K.Prasad" <[email protected]> writes:
> >
> > > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture
> > > related information
> > >
> > > Fatal machine check exceptions (caused due to hardware memory errors) will now
> > > result in a 'slim' coredump that captures vital information about the MCE. This
> > > patch introduces a new panic flag, and new parameters to *panic functions
> > > that can capture more information pertaining to the cause of
> > > crash.
> > >
> > > Enable a new elf-notes section to store additional information about the crash.
> > > For MCE, enable a new notes section that captures relevant register status
> > > (struct mce) to be later read during coredump analysis.
> >
> > There may be a reason to pass everything struct mce through 5 layers of
> > code but right now it looks like it just makes everything uglier to no
> > real purpose.
>
> We could have stopped with just a blank elf-note of type NT_MCE
> indicating an MCE triggered panic, but dumping 'struct mce' in it will
> help gather more useful information about the error - especially the
> memory address that experienced unrecoverable error (stored in mce->addr).
>
> The patch 6/6 for the 'crash' tool enabled decoding of 'struct
> mce' to show this information (although the sample log in patch 0/6)
> didn't show these benefits because 'mce-inject' tool used to soft-inject
> these errors doesn't populate all registers with valid contents.
>
> The idea was that when mce->addr contains physical address is shown
> while decoding coredump, the corresponding memory DIMM could be identified
> for replacement/isolation.
>
> Given that 'struct mce' isn't placed in a user-space visible file its
> duplicate copies have to be maintained in 'crash' (like it is done in
> 'mcelog' tool), and that's one disadvantage.

FWIW, unlike mcelog, it really doesn't have to be maintained in the crash
utility. It's just another kernel data structure whose contents can be
determined dynamically during runtime:

crash> struct mce
struct mce {
__u64 status;
__u64 misc;
__u64 addr;
__u64 mcgstatus;
__u64 ip;
__u64 tsc;
__u64 time;
__u8 cpuvendor;
__u8 inject_flags;
__u16 pad;
__u32 cpuid;
__u8 cs;
__u8 bank;
__u8 cpu;
__u8 finished;
__u32 extcpu;
__u32 socketid;
__u32 apicid;
__u64 mcgcap;
}
SIZE: 88
crash>

Dave

> If you think that this complicates the patch, I'll start with a much
> 'slimmer' version (!) of the slimdump and the improvements may be
> contemplated iteratively.
>
> Thanks,
> K.Prasad

2011-06-01 17:23:46

by Vivek Goyal

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Wed, Jun 01, 2011 at 01:18:16PM -0400, Dave Anderson wrote:
>
>
> ----- Original Message -----
> > On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> > > "K.Prasad" <[email protected]> writes:
> > >
> > > > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture
> > > > related information
> > > >
> > > > Fatal machine check exceptions (caused due to hardware memory errors) will now
> > > > result in a 'slim' coredump that captures vital information about the MCE. This
> > > > patch introduces a new panic flag, and new parameters to *panic functions
> > > > that can capture more information pertaining to the cause of
> > > > crash.
> > > >
> > > > Enable a new elf-notes section to store additional information about the crash.
> > > > For MCE, enable a new notes section that captures relevant register status
> > > > (struct mce) to be later read during coredump analysis.
> > >
> > > There may be a reason to pass everything struct mce through 5 layers of
> > > code but right now it looks like it just makes everything uglier to no
> > > real purpose.
> >
> > We could have stopped with just a blank elf-note of type NT_MCE
> > indicating an MCE triggered panic, but dumping 'struct mce' in it will
> > help gather more useful information about the error - especially the
> > memory address that experienced unrecoverable error (stored in mce->addr).
> >
> > The patch 6/6 for the 'crash' tool enabled decoding of 'struct
> > mce' to show this information (although the sample log in patch 0/6)
> > didn't show these benefits because 'mce-inject' tool used to soft-inject
> > these errors doesn't populate all registers with valid contents.
> >
> > The idea was that when mce->addr contains physical address is shown
> > while decoding coredump, the corresponding memory DIMM could be identified
> > for replacement/isolation.
> >
> > Given that 'struct mce' isn't placed in a user-space visible file its
> > duplicate copies have to be maintained in 'crash' (like it is done in
> > 'mcelog' tool), and that's one disadvantage.
>
> FWIW, unlike mcelog, it really doesn't have to be maintained in the crash
> utility. It's just another kernel data structure whose contents can be
> determined dynamically during runtime:
>

That's what I was wondering. Why can't we simple extract the contents
of this structure from /proc/vmcore and save it, instead of trying to
export it by appending additional elf notes to vmcore.

Thanks
Vivek

> crash> struct mce
> struct mce {
> __u64 status;
> __u64 misc;
> __u64 addr;
> __u64 mcgstatus;
> __u64 ip;
> __u64 tsc;
> __u64 time;
> __u8 cpuvendor;
> __u8 inject_flags;
> __u16 pad;
> __u32 cpuid;
> __u8 cs;
> __u8 bank;
> __u8 cpu;
> __u8 finished;
> __u32 extcpu;
> __u32 socketid;
> __u32 apicid;
> __u64 mcgcap;
> }
> SIZE: 88
> crash>
>
> Dave
>
> > If you think that this complicates the patch, I'll start with a much
> > 'slimmer' version (!) of the slimdump and the improvements may be
> > contemplated iteratively.
> >
> > Thanks,
> > K.Prasad

2011-06-01 17:41:51

by Dave Anderson

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information



----- Original Message -----
> On Wed, Jun 01, 2011 at 01:18:16PM -0400, Dave Anderson wrote:
> > FWIW, unlike mcelog, it really doesn't have to be maintained in the crash
> > utility. It's just another kernel data structure whose contents can be
> > determined dynamically during runtime:
> >
>
> That's what I was wondering. Why can't we simple extract the contents
> of this structure from /proc/vmcore and save it, instead of trying to
> export it by appending additional elf notes to vmcore.

Actually I take that back -- Prasad's patch looks at the ELF header
contents and exits before the crash utility invokes the embedded gdb,
so the structure contents are unavailable at that time. It doesn't even
use the vmlinux file.

Dave

>
> Thanks
> Vivek
>
> > crash> struct mce
> > struct mce {
> > __u64 status;
> > __u64 misc;
> > __u64 addr;
> > __u64 mcgstatus;
> > __u64 ip;
> > __u64 tsc;
> > __u64 time;
> > __u8 cpuvendor;
> > __u8 inject_flags;
> > __u16 pad;
> > __u32 cpuid;
> > __u8 cs;
> > __u8 bank;
> > __u8 cpu;
> > __u8 finished;
> > __u32 extcpu;
> > __u32 socketid;
> > __u32 apicid;
> > __u64 mcgcap;
> > }
> > SIZE: 88
> > crash>
> >
> > Dave
> >
> > > If you think that this complicates the patch, I'll start with a
> > > much
> > > 'slimmer' version (!) of the slimdump and the improvements may be
> > > contemplated iteratively.
> > >
> > > Thanks,
> > > K.Prasad

2011-06-08 17:00:25

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 5/6] slimdump: Capture slimdump for fatal MCE generated crashes

On Fri, May 27, 2011 at 01:59:49PM -0400, Vivek Goyal wrote:
> On Fri, May 27, 2011 at 10:27:36PM +0530, K.Prasad wrote:
> > On Thu, May 26, 2011 at 01:44:47PM -0400, Vivek Goyal wrote:
> > > On Thu, May 26, 2011 at 10:53:05PM +0530, K.Prasad wrote:
> > > >
> > > > slimdump: Capture slimdump for fatal MCE generated crashes
> > > >
> > > > System crashes resulting from fatal hardware errors (such as MCE) don't need
> > > > all the contents from crashing-kernel's memory. Generate a new 'slimdump' that
> > > > retains only essential information while discarding the old memory.
> > > >
> > >
> > > Why to enforce zeroing out of rest of the vmcore data in kernel. Why not
> > > leave it to user space.
> > >
> >
> > Our concern comes from the fact that it is unsafe for the OS to read
> > any part of the corrupt memory region, so the kernel does not have to make
> > that address space available for read/write.
>
> The very fact you are booting into second kernel you are reading lots
> of address space already. The assumption is that whole of the reserved
> region is fine otherwise kernel will not even boot. Then even in older
> kernel you are accessing memory used for saving the elf headers and
> mce registers.

The difference, IMHO, is that we avoid a deliberate access of the memory
location which has previously experienced an unrecoverable memory error
and upon which a read operation can cause fatal MCE.

While older kernel's memory is used to save elf headers, it is not a
location that is known to have a memory error.

There could be a case where either a new memory error surfaces in the
older kernel/inside the kdump kernel's memory region or a previously
hw-poisoned memory is consumed by the new kernel, during which
we suspect that the kdump kernel would reboot...but given the small time
window during which it operates such a situation is going to be very
rare.

> >
> > > As Andi said, you anyway will require disabling MCE temporarily in second
> > > kernel. So that should allay the concern that we might run into second
> > > MCE while trying to read the vmcore.
> > >
> >
> > The processor manuals don't define the behaviour for a read operation
> > upon a corrupt memory location. So the consequences for a read after
> > disabling MCE is unknown (as I've mentioned here:
> > https://lkml.org/lkml/2011/5/27/258).
> >
>
> First of all a user can simply have a configuration to extract MCE
> registers and your concern is addressed. Eric W. Biederman said that
> he had got bunch of MCE triggered dumps also and things were fine. I think
> we are over engineering this. In the first step lets just keep it simple
> and just export MCE registers as a note. If accessing rest of regions
> becomes a issue in real life, then we can have a look at this again.

Again, this suggestion is based on the fact that the coredump from the
first kernel is available...but the as stated before, capturing the
coredump from first kernel involves a compulsory read operation over a
memory location that is known to have an uncorrected error with the
problems described above.

> > > A user might want to extract just the dmesg buffers also after MCE from
> > > vmcore.
> > >
> > > What I am saying is that this sounds more like a policy. Don't enforce
> > > it in kernel. Probably give user MCE registers as part of ELF note of
> > > vmcore. Now leave it up to user what data he wants to extract from vmcore.
> > > Just the MCE registers or something more then that.
> >
> > I'm unsure, at the moment, as to what it actually entails to extract
> > dmesg buffers from the old kernel's memory; but given that there exists
> > a corrupt memory region with unknown consequences for read operations
> > upon it, I doubt if it is a safe idea to allow user-space application to
> > navigate through the kernel data-structures to extract the dmesg.
> >
> > Alternatively, we might leave MCE enabled for the kdump kernel and
> > modify the user-space application to turn resilient to SIGBUS signal. So
> > if it reads a corrupt memory region, it will receive a signal from
> > do_machine_check() upon which it can skip that particular address and/or
> > fill it with zeroes (or somesuch). We haven't gone through the
> > details....but just some loud thinking
>
> Again, I think simplest would be to disable MCE while we are accessing
> previous kernel's memory and configure user space to either just extract
> MCE registers. If you are too paranoid, you can do it in two steps. First
> save MCE registes and make sure these are on disk and then go for
> extracting rest of the info like dmesg and in the process if you go down,
> anyway it was best effort thing.
>

The problem here is that the effect of a read operation over a location
with memory error (with MCEs disabled) is unknown. We are making
attempts to characterise the behaviour in conjunction with the hardware
folks and I will share any information that we learn in this regard.

Thanks,
K.Prasad

2011-06-08 17:16:51

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Tue, May 31, 2011 at 11:10:43PM +0530, K.Prasad wrote:
> On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> > "K.Prasad" <[email protected]> writes:
> >
> > > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
> > >
> > > Fatal machine check exceptions (caused due to hardware memory errors) will now
> > > result in a 'slim' coredump that captures vital information about the MCE. This
> > > patch introduces a new panic flag, and new parameters to *panic functions
> > > that can capture more information pertaining to the cause of crash.
> > >
> > > Enable a new elf-notes section to store additional information about the crash.
> > > For MCE, enable a new notes section that captures relevant register status
> > > (struct mce) to be later read during coredump analysis.
> >
> > There may be a reason to pass everything struct mce through 5 layers of
> > code but right now it looks like it just makes everything uglier to no
> > real purpose.
>
> We could have stopped with just a blank elf-note of type NT_MCE
> indicating an MCE triggered panic, but dumping 'struct mce' in it will
> help gather more useful information about the error - especially the
> memory address that experienced unrecoverable error (stored in
> mce->addr).
>
> The patch 6/6 for the 'crash' tool enabled decoding of 'struct
> mce' to show this information (although the sample log in patch 0/6)
> didn't show these benefits because 'mce-inject' tool used to soft-inject
> these errors doesn't populate all registers with valid contents.
>
> The idea was that when mce->addr contains physical address is shown
> while decoding coredump, the corresponding memory DIMM could be identified
> for replacement/isolation.
>
> Given that 'struct mce' isn't placed in a user-space visible file its
> duplicate copies have to be maintained in 'crash' (like it is done in
> 'mcelog' tool), and that's one disadvantage.
>
> If you think that this complicates the patch, I'll start with a much
> 'slimmer' version (!) of the slimdump and the improvements may be
> contemplated iteratively.

While there are reports that kdump works fine (like in your case) in
capturing the coredump for a crash resulting from fatal MCE,
unfortunately we don't have means to recreate such a behaviour due to
the inability to inject memory errors in hardware to study further.

So our fears arise due to the premise that reading a faulty memory
location leads to undesirable consequences (whether MCE is disabled
or not) and would like to modify the OS to avoid such an operation.

While the ugliness of the patch (which I believe is due to
non-separation of generic and arch-specific code) is something that can
be addressed, I hope that the reasons for the patch are seen to be
valid.

Here's an attempt to make the slimdump patch more generic that can be
used by any hardware generated crash to prevent a coredump from being
captured (compile tested only).

I'll post a more formal version of the patch upon hearing further
comments.

Thanks,
K.Prasad


Index: linux-2.6.orig/include/linux/elf.h
===================================================================
--- linux-2.6.orig.orig/include/linux/elf.h
+++ linux-2.6.orig/include/linux/elf.h
@@ -381,6 +381,8 @@ typedef struct elf64_shdr {
#define NT_PRPSINFO 3
#define NT_TASKSTRUCT 4
#define NT_AUXV 6
+/* Note to indicate absence of coredump for crashes initiated due to hardware errors */
+#define NT_NOCOREDUMP 21
#define NT_PRXFPREG 0x46e62b7f /* copied from gdb5.1/include/elf/common.h */
#define NT_PPC_VMX 0x100 /* PowerPC Altivec/VMX registers */
#define NT_PPC_SPE 0x101 /* PowerPC SPE/EVR registers */
Index: linux-2.6.orig/kernel/kexec.c
===================================================================
--- linux-2.6.orig.orig/kernel/kexec.c
+++ linux-2.6.orig/kernel/kexec.c
@@ -1379,7 +1379,7 @@ int __init parse_crashkernel(char *cm
return 0;
}

-
+__weak void arch_add_nocoredump_note(u32 *buf) {}

void crash_save_vmcoreinfo(void)
{
@@ -1395,6 +1395,8 @@ void crash_save_vmcoreinfo(void)
buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0, vmcoreinfo_data,
vmcoreinfo_size);

+ arch_add_nocoredump_note(buf);
+
final_note(buf);
}

Index: linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce.c
===================================================================
--- linux-2.6.orig.orig/arch/x86/kernel/cpu/mcheck/mce.c
+++ linux-2.6.orig/arch/x86/kernel/cpu/mcheck/mce.c
@@ -241,6 +241,30 @@ static atomic_t mce_paniced;
static int fake_panic;
static atomic_t mce_fake_paniced;

+void arch_add_nocoredump_note(u32 *buf)
+{
+ struct elf_note note;
+ char *note_name = "PANIC_MCE";
+
+ /*
+ * Prevent coredump from being captured if the panic was triggered due
+ * to fatal Machine Check Exceptions (MCE).
+ */
+ if (!mce_paniced)
+ return;
+
+ note.n_namesz = strlen(note_name) + 1;
+ /* We have no additional description */
+ note.n_descsz = 0;
+ note.n_type = NT_NOCOREDUMP;
+
+ memcpy(buf, &note, sizeof(note));
+ buf += (sizeof(note) + 3)/4;
+ memcpy(buf, note_name, note.n_namesz);
+ buf += (note.n_namesz + 3)/4;
+}
+
+
/* Panic in progress. Enable interrupts and wait for final IPI */
static void wait_for_panic(void)
{
Index: linux-2.6.orig/fs/proc/vmcore.c
===================================================================
--- linux-2.6.orig.orig/fs/proc/vmcore.c
+++ linux-2.6.orig/fs/proc/vmcore.c
@@ -529,6 +529,56 @@ static void __init set_vmcore_list_offse
}
}

+/*
+ * Check if the crash was due to a fatal Memory Check Exception
+ */
+static int has_nocoredump_note64(void)
+{
+ int i, j, len = 0, rc;
+ Elf64_Ehdr *ehdr_ptr;
+ Elf64_Phdr *phdr_ptr;
+ Elf64_Nhdr *nhdr_ptr;
+
+ ehdr_ptr = (Elf64_Ehdr *)elfcorebuf;
+ phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+
+ for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ void *notes_section;
+ u64 offset, max_sz;
+ if (phdr_ptr->p_type != PT_NOTE)
+ continue;
+ max_sz = phdr_ptr->p_memsz;
+ offset = phdr_ptr->p_offset;
+ notes_section = kmalloc(max_sz, GFP_KERNEL);
+ if (!notes_section)
+ return -ENOMEM;
+ rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+ if (rc < 0) {
+ kfree(notes_section);
+ return rc;
+ }
+
+ for (j = 0; j < phdr_ptr->p_filesz; j += len) {
+ nhdr_ptr = notes_section + j;
+ if (nhdr_ptr->n_type == NT_NOCOREDUMP)
+ {
+ kfree(notes_section);
+ return 1;
+ }
+ /*
+ * The elf-64 standard specifies 8-byte alignment while
+ * append_elf_note function does only 4-byte roundup.
+ * Hence this code also does a 4-byte roundup.
+ */
+ len = sizeof(Elf64_Nhdr);
+ len = roundup(len + nhdr_ptr->n_namesz, 4);
+ len = roundup(len + nhdr_ptr->n_descsz, 4);
+ }
+ kfree(notes_section);
+ }
+ return 0;
+}
+
static int __init parse_crash_elf64_headers(void)
{
int rc=0;
@@ -569,6 +619,19 @@ static int __init parse_crash_elf64_head
return rc;
}

+ phdr_ptr = (Elf64_Phdr *)(elfcorebuf + sizeof(Elf64_Ehdr));
+ if (has_nocoredump_note64() > 0) {
+ /*
+ * If crash is due to hardware errors, elf-note of
+ * type NT_NOCOREDUMP is present. If so, don't populate
+ * sections other than elf-notes. Mark their sizes as zero.
+ */
+ for (i = 0; i < ehdr.e_phnum; i++, phdr_ptr++) {
+ if (phdr_ptr->p_type != PT_NOTE)
+ phdr_ptr->p_memsz = phdr_ptr->p_filesz = 0;
+ }
+ }
+
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {

2011-06-12 15:44:34

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

"K.Prasad" <[email protected]> writes:

> On Tue, May 31, 2011 at 11:10:43PM +0530, K.Prasad wrote:
>> On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
>> > "K.Prasad" <[email protected]> writes:
>> >
>> > > PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information
>> > >
>> > > Fatal machine check exceptions (caused due to hardware memory errors) will now
>> > > result in a 'slim' coredump that captures vital information about the MCE. This
>> > > patch introduces a new panic flag, and new parameters to *panic functions
>> > > that can capture more information pertaining to the cause of crash.
>> > >
>> > > Enable a new elf-notes section to store additional information about the crash.
>> > > For MCE, enable a new notes section that captures relevant register status
>> > > (struct mce) to be later read during coredump analysis.
>> >
>> > There may be a reason to pass everything struct mce through 5 layers of
>> > code but right now it looks like it just makes everything uglier to no
>> > real purpose.
>>
>> We could have stopped with just a blank elf-note of type NT_MCE
>> indicating an MCE triggered panic, but dumping 'struct mce' in it will
>> help gather more useful information about the error - especially the
>> memory address that experienced unrecoverable error (stored in
>> mce->addr).
>>
>> The patch 6/6 for the 'crash' tool enabled decoding of 'struct
>> mce' to show this information (although the sample log in patch 0/6)
>> didn't show these benefits because 'mce-inject' tool used to soft-inject
>> these errors doesn't populate all registers with valid contents.
>>
>> The idea was that when mce->addr contains physical address is shown
>> while decoding coredump, the corresponding memory DIMM could be identified
>> for replacement/isolation.
>>
>> Given that 'struct mce' isn't placed in a user-space visible file its
>> duplicate copies have to be maintained in 'crash' (like it is done in
>> 'mcelog' tool), and that's one disadvantage.
>>
>> If you think that this complicates the patch, I'll start with a much
>> 'slimmer' version (!) of the slimdump and the improvements may be
>> contemplated iteratively.
>
> While there are reports that kdump works fine (like in your case) in
> capturing the coredump for a crash resulting from fatal MCE,
> unfortunately we don't have means to recreate such a behaviour due to
> the inability to inject memory errors in hardware to study further.

Most modern memory controllers have the functionality of generating
memory writes with deliberately bad ecc data, and playing with a heat
gun or shorting to data wires together on dimms, to generate real ecc
errors isn't hard. So with a day or so of effort you should be able to
test what happens when you get an MCE.

Furthermore there is the lkdtm module which let's you test other types
of crash dumps so you can verify that all sorts of code paths work.

> So our fears arise due to the premise that reading a faulty memory
> location leads to undesirable consequences (whether MCE is disabled
> or not) and would like to modify the OS to avoid such an operation.
>
> While the ugliness of the patch (which I believe is due to
> non-separation of generic and arch-specific code) is something that can
> be addressed, I hope that the reasons for the patch are seen to be
> valid.

Yes. The objection really is to not exporting the information you need
to solve this in userspace and then fixing the one userspace tool that
uses this to work correctly.

> Here's an attempt to make the slimdump patch more generic that can be
> used by any hardware generated crash to prevent a coredump from being
> captured (compile tested only).
>
> I'll post a more formal version of the patch upon hearing further
> comments.

But this is not the way. The kernel does not generate the core dump
it just gives the information needed for userspace to generate the core
dump.

Giving a little more information to userspace and letting the program
that reads vmcore have the policy on what do is the preferred way to do
this.

You are asking for yet another way to filter crashdumps which is
entirely reasonable. Patching out the ability in the kernel for the
rest of us to have our own policies of what to dump is unreasonable.

Eric

2011-06-15 02:07:01

by K.Prasad

[permalink] [raw]
Subject: Re: [RFC Patch 4/6] PANIC_MCE: Introduce a new panic flag for fatal MCE, capture related information

On Sun, Jun 12, 2011 at 08:44:25AM -0700, Eric W. Biederman wrote:
> "K.Prasad" <[email protected]> writes:
>
> > On Tue, May 31, 2011 at 11:10:43PM +0530, K.Prasad wrote:
> >> On Fri, May 27, 2011 at 11:04:06AM -0700, Eric W. Biederman wrote:
> >> > "K.Prasad" <[email protected]> writes:
> >> >
[snipped]
> > So our fears arise due to the premise that reading a faulty memory
> > location leads to undesirable consequences (whether MCE is disabled
> > or not) and would like to modify the OS to avoid such an operation.
> >
> > While the ugliness of the patch (which I believe is due to
> > non-separation of generic and arch-specific code) is something that can
> > be addressed, I hope that the reasons for the patch are seen to be
> > valid.
>
> Yes. The objection really is to not exporting the information you need
> to solve this in userspace and then fixing the one userspace tool that
> uses this to work correctly.
>
> > Here's an attempt to make the slimdump patch more generic that can be
> > used by any hardware generated crash to prevent a coredump from being
> > captured (compile tested only).
> >
> > I'll post a more formal version of the patch upon hearing further
> > comments.
>
> But this is not the way. The kernel does not generate the core dump
> it just gives the information needed for userspace to generate the core
> dump.
>
> Giving a little more information to userspace and letting the program
> that reads vmcore have the policy on what do is the preferred way to do
> this.
>
> You are asking for yet another way to filter crashdumps which is
> entirely reasonable. Patching out the ability in the kernel for the
> rest of us to have our own policies of what to dump is unreasonable.
>

I think I get the drift of your idea. So adding a new elf-note to
indicate the cause of crash is fine...and continue to allow the kernel
to provide a way to read old kernel memory irrespective of the crash.

Instead, make the user-space tool (makedumpfile?) more intelligent to
recognise the cause of crash as fatal MCE or any other hardware-error
induced crash (using the new elf-note as a clue) and abstain from
reading the contents of the old memory to create a full coredump. A
different user with a different need might want to try out something
adventurous (by reading the full coredump) but let not the kernel
prevent it, which sounds fine. I'll get some patches ready to this
effect.

However, there's another part of the problem - wherein pages with
uncorrectable memory errors which are detected during scrubbing, marked
as 'hardware poisoned' (with PG_hwpoison flag) in order to be quarantined,
may be read during creation of coredump initiated through usual kernel
panics (software bug or otherwise).

This will result in a fatal MCE for the kdump kernel and will cause a
reboot of the system without any coredump (and possibly without any
message in /var/log/messages). We're contemplating ways to avoid such a
situation and the panic+crash_kexec+vmcore/makedumpfile code may
require suitable changes to address this issue.

Let us know your views on this.

Thanks,
K.Prasad