When the kdump service is loaded, if a CPU or memory is hot
un/plugged, the crash elfcorehdr, which describes the CPUs and memory
in the system, must also be updated, else the resulting vmcore is
inaccurate (e.g. missing either CPU context or memory regions).
The current solution utilizes udev to initiate an unload-then-reload
of the kdump image (i.e. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility.
In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
problems with this userspace-initiated unload-then-reload approach as
it pertains to supporting CPU and memory hot un/plug for kdump.
(Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
structure. There is a vmcoreinfo structure, but it has a different
purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)
The first problem is the time needed to complete the unload-then-
reload of the kdump image, and the second is the effective race
window that the unload-then-reload effort creates.
In the scenario I measured, a 32GiB guest was resized to 512GiB, and
it took over 4 minutes for udev to "settle down" and complete the
unload-then-reload for the resulting 3840 hot plug events.
Empirical evidence within our fleet substantiates this problem.
Each unload-then-reload creates a race window the size of which is the
time it takes to reload the complete kdump image. Within the race
window, kdump is not loaded and should a panic occur, the kernel halts
rather than dumping core via kdump.
This patchset significantly improves upon the current solution by
enabling the kernel to update only the necessary items of the kdump
image. In the case of x86_64, that is just the elfcorehdr and the
purgatory segments. These updates occur as fast as the hot un/plug
events and significantly reduce the size of the race window.
This patchset introduces a generic crash hot un/plug handler that
registers with the CPU and memory notifiers. Upon CPU or memory
changes, this generic handler is invoked and performs important
housekeeping, for example obtaining the appropriate lock, and then
invokes an architecture specific handler to do the appropriate
updates.
In the case of x86_64, the arch specific handler generates a new
elfcorehdr, which reflects the current CPUs and memory regions, into a
buffer. Since purgatory also does an integrity check via hash digests
of the loaded segments, purgatory must also be updated with the new
digests. The arch handler also generates a new purgatory into a
buffer, performs the hash digests of the new memory segments, and then
patches purgatory with the new digests. If all succeeds, then the
elfcorehdr and purgatory buffers overwrite the existing buffers and
the new kdump image is live and ready to go. No involvement with
userspace at all.
To accommodate a growing number of resources via hotplug, the
elfcorehdr memory must be sufficiently large to accommodate
changes. The CRASH_HOTPLUG_ELFCOREHDR_SZ configuration item does just
this.
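As a quick sanity check on the 128KiB default (assuming the standard
ELF64 structure sizes, sizeof(Elf64_Ehdr) = 64 bytes and
sizeof(Elf64_Phdr) = 56 bytes):

  64 + (2048 * 56) = 114752 bytes

so 2048 program headers, each describing either a CPU or a region of
memory, fit comfortably within the 131072-byte (128KiB) default.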
To realize the benefits/test this patchset, one must make a couple
of minor changes to userspace:
- Disable the udev rule for updating kdump on hot un/plug changes.
  Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules,
  or use another technique to neuter the rule.
- Switch to the kexec_file_load syscall for loading the kdump kernel.
  Eg. on RHEL: in /usr/bin/kdumpctl, change to:
  standard_kexec_args="-p -d -s"
  which adds -s to select the kexec_file_load syscall.
This work has raised the following questions for me:
First and foremost, this patch only works for the kexec_file_load
syscall path (via the "kexec -s -p" utility). The reason is that, for
x86_64 anyway, the purgatory blob provided by userspace cannot be
readily decoded in order to update the hash digest. (The
kexec_file_load purgatory is actually a small ELF object with symbols,
so it can be patched at run time.) With no way to update purgatory,
the integrity check will always fail and cause purgatory to hang at
panic time.
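For reference, this run-time patching is done via
kexec_purgatory_get_set_symbol(). A minimal sketch of the digest
update, assuming the purgatory_sha256_digest symbol that the x86_64
kexec_file_load purgatory exports:

	u8 digest[SHA256_DIGEST_SIZE];

	/* ... recompute the SHA-256 over the loaded segments ... */

	/*
	 * Overwrite the digest embedded in the purgatory ELF object so
	 * that its integrity check passes against the updated segments.
	 */
	ret = kexec_purgatory_get_set_symbol(image, "purgatory_sha256_digest",
					     digest, sizeof(digest), 0);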
That being said, I actually developed this against the kexec_load path
and did have that working by making two one-line changes to the
userspace kexec utility: one change that is effectively
CRASH_HOTPLUG_ELFCOREHDR_SZ and the other to disable the integrity
check. But that does not seem to be a long-term solution. A possible
long term solution would be to allow the use of the kexec_file_load
purgatory ELF object with the kexec_load path. While I believe that
would work, I am unsure if there are any downsides to doing so.
The second problem is the use of CPUHP_AP_ONLINE_DYN. The
cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in for_each_present_cpu() during the
regeneration of the elfcorehdr, thus the need to explicitly check and
exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
Perhaps if a new or different value than CPUHP_AP_ONLINE_DYN were
passed to cpuhp_setup_state(), the offlined CPU would no longer appear
in for_each_present_cpu(), and this check could be eliminated. I do
not understand cpuhp_setup_state() well enough to choose, or create,
the appropriate value(s).
The third problem is the number of memory hot un/plug events. If, for
example, a 1GiB DIMM is hotplugged, that generates 8 memory events,
one for each 128MiB memblock, yet the walk_system_ram_res() that is
used to obtain the list of memory regions reports the single 1GiB
region; thus there are 7 unnecessary trips through this crash hotplug
handler. Perhaps there is another way of handling memory events that
would see the single 1GiB DIMM rather than each memblock?
Regards,
eric
Eric DeVolder (8):
crash: fix minor typo/bug in debug message
crash hp: Introduce CRASH_HOTPLUG configuration options
crash hp: definitions and prototypes for crash hotplug support
crash hp: generic crash hotplug support infrastructure
crash hp: kexec_file changes for use by crash hotplug handler
crash hp: Add x86 crash hotplug state items to kimage
crash hp: Add x86 crash hotplug support for kexec_file_load
crash hp: Add x86 crash hotplug support for bzImage
arch/x86/Kconfig | 26 +++
arch/x86/include/asm/kexec.h | 10 ++
arch/x86/kernel/crash.c | 257 +++++++++++++++++++++++++++++-
arch/x86/kernel/kexec-bzimage64.c | 12 ++
include/linux/kexec.h | 22 ++-
kernel/crash_core.c | 118 ++++++++++++++
kernel/kexec_file.c | 19 ++-
7 files changed, 455 insertions(+), 9 deletions(-)
--
2.27.0
The bulk of the support for CPU and memory hotplug support for
crash is controlled by the CRASH_HOTPLUG configuration option,
introduced by this patch.
The related CRASH_HOTPLUG_ELFCOREHDR_SZ configuration option is
also introduced by this patch.
Signed-off-by: Eric DeVolder <[email protected]>
---
arch/x86/Kconfig | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 95dd1ee01546..5feb829adb60 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2046,6 +2046,32 @@ config CRASH_DUMP
(CONFIG_RELOCATABLE=y).
For more details see Documentation/admin-guide/kdump/kdump.rst
+config CRASH_HOTPLUG
+ bool "kernel updates of crash elfcorehdr"
+ depends on CRASH_DUMP && (HOTPLUG_CPU || MEMORY_HOTPLUG) && KEXEC_FILE
+ help
+ Enable the kernel to update the crash elfcorehdr (which contains
+ the list of CPUs and memory regions) directly when CPUs or memory
+ are hot plugged or unplugged. Otherwise userspace must monitor
+ these hot plug/unplug change notifications via udev in order to
+ unload-then-reload the crash kernel so that the list of CPUs and
+ memory regions is kept up-to-date. Note that the udev CPU and
+ memory change notifications still occur (however, userspace is not
+ required to monitor them for crash dump purposes).
+
+config CRASH_HOTPLUG_ELFCOREHDR_SZ
+ depends on CRASH_HOTPLUG
+ int
+ default 131072
+ help
+ Specify the maximum size of the elfcorehdr buffer/segment.
+ The 128KiB default is sized so that it can accommodate 2048
+ Elf64_Phdr, where each Phdr represents either a CPU or a
+ region of memory.
+ For example, this size can accommodate hotplugging a machine
+ with up to 1024 CPUs and up to 1024 memory regions (e.g. 1TiB
+ with 1024 1GiB memory DIMMs).
+
config KEXEC_JUMP
bool "kexec jump"
depends on KEXEC && HIBERNATION
--
2.27.0
This change adds two members to struct kimage to facilitate crash
hotplug support.
This change also defines crash hotplug events and associated
prototypes for use by the crash hotplug handlers.
Signed-off-by: Eric DeVolder <[email protected]>
---
include/linux/kexec.h | 22 ++++++++++++++++++++--
1 file changed, 20 insertions(+), 2 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 0c994ae37729..8bbf2ae7a040 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -221,8 +221,9 @@ struct crash_mem {
extern int crash_exclude_mem_range(struct crash_mem *mem,
unsigned long long mstart,
unsigned long long mend);
-extern int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
- void **addr, unsigned long *sz);
+extern int crash_prepare_elf64_headers(struct kimage *image,
+ struct crash_mem *mem, int kernel_map,
+ void **addr, unsigned long *sz);
#endif /* CONFIG_KEXEC_FILE */
#ifdef CONFIG_KEXEC_ELF
@@ -299,6 +300,11 @@ struct kimage {
/* Information for loading purgatory */
struct purgatory_info purgatory_info;
+
+#ifdef CONFIG_CRASH_HOTPLUG
+ int hotplug_event;
+ int offlinecpu;
+#endif
#endif
#ifdef CONFIG_IMA_KEXEC
@@ -315,6 +321,18 @@ struct kimage {
unsigned long elf_load_addr;
};
+#ifdef CONFIG_CRASH_HOTPLUG
+void arch_update_crash_elfcorehdr(struct kimage *image,
+ unsigned int hp_action, unsigned long a, unsigned long b);
+#define KEXEC_CRASH_HP_REMOVE_CPU 0
+#define KEXEC_CRASH_HP_ADD_CPU 1
+#define KEXEC_CRASH_HP_REMOVE_MEMORY 2
+#define KEXEC_CRASH_HP_ADD_MEMORY 3
+#endif /* CONFIG_CRASH_HOTPLUG */
+int kexec_calculate_store_digests(struct kimage *image);
+int kimage_load_segment(struct kimage *image, struct kexec_segment *segment);
+void kimage_file_post_load_cleanup(struct kimage *image);
+
/* kexec interface functions */
extern void machine_kexec(struct kimage *image);
extern int machine_kexec_prepare(struct kimage *image);
--
2.27.0
To facilitate hotplug updates of the crash elfcorehdr, a few members
are added to the kimage arch struct.
The indices of the elfcorehdr and purgatory segments are recorded
here so that upon a hotplug event, those segments can be efficiently
located and updated accordingly.
The purgatory image also requires a few register context values in
order to transition from purgatory to the next/capture kernel.
Signed-off-by: Eric DeVolder <[email protected]>
---
arch/x86/include/asm/kexec.h | 10 ++++++++++
1 file changed, 10 insertions(+)
diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 11b7c06e2828..b08fe56239c8 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -150,6 +150,16 @@ struct kimage_arch {
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
+#ifdef CONFIG_CRASH_HOTPLUG
+ struct {
+ int elf_index;
+ int purg_index;
+ unsigned long rbx;
+ unsigned long rsi;
+ unsigned long rip;
+ unsigned long rsp;
+ } hp;
+#endif
};
#endif /* CONFIG_X86_32 */
--
2.27.0
This patch introduces a generic crash hot plug/unplug handler for CPU
and memory changes. This crash_update_elfcorehdr() handler is invoked
upon CPU and memory changes. This generic handler obtains the
appropriate lock, does some important housekeeping and then
dispatches the hot plug/unplug event to the architecture specific
handler arch_update_crash_elfcorehdr(), and when that architecture
specific handler returns, the lock is released.
This patch modifies crash_core.c to implement a subsys_initcall()
function that installs handlers for hot plug/unplug events. If CPU
hotplug is enabled, then cpuhp_setup_state() is invoked to register a
handler for CPU changes. Similarly, if memory hotplug is enabled, then
register_memory_notifier() is invoked to install a handler for memory
changes. These handlers in turn invoke the common handler
crash_update_elfcorehdr().
On the CPU side, cpuhp_setup_state_nocalls() is invoked with parameter
CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
the CPU still shows up in for_each_present_cpu() during the regeneration
of the elfcorehdr, thus the need to explicitly check and exclude the
soon-to-be offlined CPU in crash_prepare_elf64_headers().
On the memory side, each un/plugged memory block passes through the
handler. For example, if a 1GiB DIMM is hotplugged, that generates 8
memory events, one for each 128MiB memblock.
Signed-off-by: Eric DeVolder <[email protected]>
---
kernel/crash_core.c | 118 ++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 118 insertions(+)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index eb53f5ec62c9..d567839b476c 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -8,12 +8,16 @@
#include <linux/crash_core.h>
#include <linux/utsname.h>
#include <linux/vmalloc.h>
+#include <linux/memory.h>
+#include <linux/cpuhotplug.h>
#include <asm/page.h>
#include <asm/sections.h>
#include <crypto/sha1.h>
+#include "kexec_internal.h"
+
/* vmcoreinfo stuff */
unsigned char *vmcoreinfo_data;
size_t vmcoreinfo_size;
@@ -480,3 +484,117 @@ static int __init crash_save_vmcoreinfo_init(void)
}
subsys_initcall(crash_save_vmcoreinfo_init);
+
+#ifdef CONFIG_CRASH_HOTPLUG
+void __weak arch_update_crash_elfcorehdr(struct kimage *image,
+ unsigned int hp_action, unsigned long a, unsigned long b)
+{
+ pr_warn("crash hp: %s not implemented", __func__);
+}
+
+static void crash_update_elfcorehdr(unsigned int hp_action,
+ unsigned long a, unsigned long b)
+{
+ /* Obtain lock while changing crash information */
+ if (!mutex_trylock(&kexec_mutex))
+ return;
+
+ /* Check kdump is loaded */
+ if (kexec_crash_image) {
+ pr_debug("crash hp: hp_action %u, a %lu, b %lu", hp_action,
+ a, b);
+
+ /* Needed in order for the segments to be updated */
+ arch_kexec_unprotect_crashkres();
+
+ /* Flag to differentiate between normal load and hotplug */
+ kexec_crash_image->hotplug_event = 1;
+
+ /*
+ * Due to use of CPUHP_AP_ONLINE_DYN, upon unplug and during
+ * this callback, the CPU is still in the present list. Must
+ * explicitly look to exclude it when building new elfcorehdr.
+ */
+ kexec_crash_image->offlinecpu =
+ (hp_action == KEXEC_CRASH_HP_REMOVE_CPU) ?
+ (unsigned int)a : ~0U;
+
+ /* Now update the elfcorehdr and friends */
+ arch_update_crash_elfcorehdr(kexec_crash_image, hp_action,
+ a, b);
+
+ /* No longer in hotplug */
+ kexec_crash_image->hotplug_event = 0;
+
+ /* Change back to read-only */
+ arch_kexec_protect_crashkres();
+ }
+
+ /* Release lock now that update complete */
+ mutex_unlock(&kexec_mutex);
+}
+
+#if defined(CONFIG_MEMORY_HOTPLUG)
+static int crash_memhp_notifier(struct notifier_block *nb,
+ unsigned long val, void *v)
+{
+ struct memory_notify *mhp = v;
+ unsigned long start, end;
+
+ start = mhp->start_pfn << PAGE_SHIFT;
+ end = ((mhp->start_pfn + mhp->nr_pages) << PAGE_SHIFT) - 1;
+
+ switch (val) {
+ case MEM_GOING_ONLINE:
+ crash_update_elfcorehdr(KEXEC_CRASH_HP_ADD_MEMORY,
+ start, end-start);
+ break;
+
+ case MEM_OFFLINE:
+ case MEM_CANCEL_ONLINE:
+ crash_update_elfcorehdr(KEXEC_CRASH_HP_REMOVE_MEMORY,
+ start, end-start);
+ break;
+ }
+ return NOTIFY_OK;
+}
+
+static struct notifier_block crash_memhp_nb = {
+ .notifier_call = crash_memhp_notifier,
+ .priority = 0
+};
+#endif
+
+#if defined(CONFIG_HOTPLUG_CPU)
+static int crash_cpuhp_online(unsigned int cpu)
+{
+ crash_update_elfcorehdr(KEXEC_CRASH_HP_ADD_CPU, cpu, 0);
+ return 0;
+}
+
+static int crash_cpuhp_offline(unsigned int cpu)
+{
+ crash_update_elfcorehdr(KEXEC_CRASH_HP_REMOVE_CPU, cpu, 0);
+ return 0;
+}
+#endif
+
+static int __init crash_hotplug_init(void)
+{
+ int result = 0;
+
+#if defined(CONFIG_MEMORY_HOTPLUG)
+ register_memory_notifier(&crash_memhp_nb);
+#endif
+
+#if defined(CONFIG_HOTPLUG_CPU)
+ result = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
+ "crash/cpuhp",
+ crash_cpuhp_online, crash_cpuhp_offline);
+#endif
+
+ return result;
+}
+
+subsys_initcall(crash_hotplug_init);
+#endif /* CONFIG_CRASH_HOTPLUG */
--
2.27.0
Three minor changes are made to this file to facilitate re-using
the functions from within a crash hotplug handler.
The prototype for kexec_calculate_store_digests() was hoisted to
linux/kexec.h and the function made non-static.
The crash_prepare_elf64_headers() function had a struct kimage
parameter added so that the check for the offlined CPU could be
performed and the routine re-used by the crash hotplug handler.
Finally, kexec_add_buffer() was changed slightly so that it does not
attempt to add the buffer to the kimage when being called from
the crash hotplug handler.
Signed-off-by: Eric DeVolder <[email protected]>
---
kernel/kexec_file.c | 19 ++++++++++++++-----
1 file changed, 14 insertions(+), 5 deletions(-)
diff --git a/kernel/kexec_file.c b/kernel/kexec_file.c
index 8347fc158d2b..c9e8e5b94382 100644
--- a/kernel/kexec_file.c
+++ b/kernel/kexec_file.c
@@ -29,8 +29,6 @@
#include <linux/vmalloc.h>
#include "kexec_internal.h"
-static int kexec_calculate_store_digests(struct kimage *image);
-
/*
* Currently this is the only default function that is exported as some
* architectures need it to do additional handlings.
@@ -674,6 +672,12 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
struct kexec_segment *ksegment;
int ret;
+#ifdef CONFIG_CRASH_HOTPLUG
+ /* When servicing a hotplug event, do not add buffers to the image */
+ if (kbuf->image->hotplug_event)
+ return 0;
+#endif
+
/* Currently adding segment this way is allowed only in file mode */
if (!kbuf->image->file_mode)
return -EINVAL;
@@ -713,7 +717,7 @@ int kexec_add_buffer(struct kexec_buf *kbuf)
}
/* Calculate and store the digest of segments */
-static int kexec_calculate_store_digests(struct kimage *image)
+int kexec_calculate_store_digests(struct kimage *image)
{
struct crypto_shash *tfm;
struct shash_desc *desc;
@@ -1260,8 +1264,8 @@ int crash_exclude_mem_range(struct crash_mem *mem,
return 0;
}
-int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
- void **addr, unsigned long *sz)
+int crash_prepare_elf64_headers(struct kimage *image, struct crash_mem *mem,
+ int kernel_map, void **addr, unsigned long *sz)
{
Elf64_Ehdr *ehdr;
Elf64_Phdr *phdr;
@@ -1308,6 +1312,11 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int kernel_map,
/* Prepare one phdr of type PT_NOTE for each present CPU */
for_each_present_cpu(cpu) {
+#ifdef CONFIG_CRASH_HOTPLUG
+ /* Skip the soon-to-be offlined cpu */
+ if (image->hotplug_event && (cpu == image->offlinecpu))
+ continue;
+#endif
phdr->p_type = PT_NOTE;
notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
phdr->p_offset = phdr->p_paddr = notes_addr;
--
2.27.0
The few changes to kexec-bzimage64.c are to record information needed
in advance of when a hotplug event occurs. This information is recorded
when the crash/capture kernel is loaded.
In particular, the index of the purgatory segment, and the handful of
register context values needed by purgatory, are recorded.
Signed-off-by: Eric DeVolder <[email protected]>
---
arch/x86/kernel/kexec-bzimage64.c | 12 ++++++++++++
1 file changed, 12 insertions(+)
diff --git a/arch/x86/kernel/kexec-bzimage64.c b/arch/x86/kernel/kexec-bzimage64.c
index 170d0fd68b1f..1886c215215d 100644
--- a/arch/x86/kernel/kexec-bzimage64.c
+++ b/arch/x86/kernel/kexec-bzimage64.c
@@ -378,6 +378,10 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
* Load purgatory. For 64bit entry point, purgatory code can be
* anywhere.
*/
+#ifdef CONFIG_CRASH_HOTPLUG
+ /* Record the index of the purgatory segment */
+ image->arch.hp.purg_index = image->nr_segments;
+#endif
ret = kexec_load_purgatory(image, &pbuf);
if (ret) {
pr_err("Loading purgatory failed\n");
@@ -488,6 +492,14 @@ static void *bzImage64_load(struct kimage *image, char *kernel,
if (ret)
goto out_free_params;
+#ifdef CONFIG_CRASH_HOTPLUG
+ /* Save for use on hotplug to patch up purgatory context */
+ image->arch.hp.rbx = regs64.rbx;
+ image->arch.hp.rsi = regs64.rsi;
+ image->arch.hp.rip = regs64.rip;
+ image->arch.hp.rsp = regs64.rsp;
+#endif
+
ret = setup_boot_parameters(image, params, bootparam_load_addr,
efi_map_offset, efi_map_sz,
efi_setup_data_offset);
--
2.27.0
When CPU or memory is hot un/plugged, the crash elfcorehdr, which
describes the CPUs and memory in the system, must also be updated.
To update the elfcorehdr for x86_64, a new elfcorehdr must be
generated from the available CPUs and memory, and placed into
memory. Since purgatory also does an integrity check via hash
digests of the loaded segments, purgatory must also be updated
with the new digests.
Once the new elfcorehdr and purgatory contents are fully prepared
and no errors occur, they are installed over the top of the
existing segments. As a result, no changes to boot_params are
needed, as the elfcorehdr= kernel command line parameter pointer
remains unchanged and correct.
To accommodate a growing number of resources via hotplug, the
elfcorehdr segment must be sufficiently large to accommodate
changes; see the CRASH_HOTPLUG_ELFCOREHDR_SZ configuration item. The
purgatory segment was already properly sized at load time.
NOTE that this only supports kexec_file_load. Support for
kexec_load is not possible, since the userland-supplied purgatory
segment is a binary blob that cannot readily be decoded so as to
be updated with the new hash digests.
Signed-off-by: Eric DeVolder <[email protected]>
---
arch/x86/kernel/crash.c | 255 +++++++++++++++++++++++++++++++++++++++-
1 file changed, 254 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index 9730c88530fc..d08e112cd345 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -25,6 +25,9 @@
#include <linux/slab.h>
#include <linux/vmalloc.h>
#include <linux/memblock.h>
+#include <linux/mm.h>
+#include <linux/io.h>
+#include <linux/highmem.h>
#include <asm/processor.h>
#include <asm/hardirq.h>
@@ -265,7 +268,8 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
goto out;
/* By default prepare 64bit headers */
- ret = crash_prepare_elf64_headers(cmem, IS_ENABLED(CONFIG_X86_64), addr, sz);
+ ret = crash_prepare_elf64_headers(image, cmem,
+ IS_ENABLED(CONFIG_X86_64), addr, sz);
out:
vfree(cmem);
@@ -397,7 +401,16 @@ int crash_load_segments(struct kimage *image)
image->elf_headers = kbuf.buffer;
image->elf_headers_sz = kbuf.bufsz;
+#ifdef CONFIG_CRASH_HOTPLUG
+ /* Ensure elfcorehdr segment large enough for hotplug changes */
+ kbuf.memsz = CONFIG_CRASH_HOTPLUG_ELFCOREHDR_SZ;
+ /* For marking as usable to crash kernel */
+ image->elf_headers_sz = kbuf.memsz;
+ /* Record the index of the elfcorehdr segment */
+ image->arch.hp.elf_index = image->nr_segments;
+#else
kbuf.memsz = kbuf.bufsz;
+#endif
kbuf.buf_align = ELF_CORE_HEADER_ALIGN;
kbuf.mem = KEXEC_BUF_MEM_UNKNOWN;
ret = kexec_add_buffer(&kbuf);
@@ -412,3 +425,243 @@ int crash_load_segments(struct kimage *image)
return ret;
}
#endif /* CONFIG_KEXEC_FILE */
+
+#ifdef CONFIG_CRASH_HOTPLUG
+void *map_crash_pages(unsigned long paddr, unsigned long size)
+{
+ /*
+ * NOTE: The addresses and sizes passed to this routine have
+ * already been fully aligned on page boundaries. There is no
+ * need for massaging the address or size.
+ */
+ void *ptr = NULL;
+
+ /* NOTE: requires arch_kexec_[un]protect_crashkres() for write access */
+ if (size > 0) {
+ struct page *page = pfn_to_page(paddr >> PAGE_SHIFT);
+
+ ptr = kmap(page);
+ }
+
+ return ptr;
+}
+
+void unmap_crash_pages(void **ptr)
+{
+ if (ptr) {
+ if (*ptr)
+ kunmap(*ptr);
+ *ptr = NULL;
+ }
+}
+
+void arch_update_crash_elfcorehdr(struct kimage *image,
+ unsigned int hp_action, unsigned long a, unsigned long b)
+{
+ /*
+ * To accurately reflect hot un/plug changes, the elfcorehdr (which
+ * is passed to the crash kernel via the elfcorehdr= parameter)
+ * must be updated with the new list of CPUs and memories. Due
+ * to the change to the elfcorehdr, the hash/digests of the loaded
+ * segments contained within purgatory must also be updated; thus
+ * purgatory itself must be regenerated. Both the elfcorehdr and
+ * purgatory are prepared in new kernel buffers, and if all succeeds,
+ * the new elfcorehdr and purgatory are written into the corresponding
+ * crash memory.
+ *
+ * Note this code currently only supports the kexec_file_load syscall.
+ * For kexec_load, all the segments are provided by userspace.
+ * In particular, the inability to locate and then update the
+ * purgatory blob with a proper register context and hash/digests
+ * prevents support for kexec_load. The kexec_file_load path, on the
+ * other hand, is all contained within the kernel and all needed
+ * pieces of information can be located.
+ */
+ struct kexec_segment *ksegment;
+ struct kexec_entry64_regs regs64;
+ struct kexec_buf pbuf;
+ unsigned char *ptr = NULL;
+ unsigned long elfsz = 0;
+ void *elfbuf = NULL;
+ unsigned long mem, memsz;
+ unsigned int n;
+ int ret;
+
+ /*
+ * Invalidate the pointers left over from the initial load or
+ * previous hotplug update operation.
+ */
+ for (n = 0; n < image->nr_segments; ++n)
+ image->segment[n].kbuf = NULL;
+
+ /* Only support kexec_file_load */
+ if (!image->file_mode) {
+ pr_err("crash hp: support kexec_file_load only");
+ goto out;
+ }
+
+ /*
+ * When the struct kimage is alloced, it is wiped to zero, so
+ * the elf_index and purg_index should never be zero or the
+ * same index.
+ */
+ if (image->arch.hp.elf_index == image->arch.hp.purg_index) {
+ pr_err("crash hp: unable to locate elfcorehdr or purgatory segments");
+ goto out;
+ }
+
+ /*
+ * Create the new elfcorehdr reflecting the changes to CPU and/or
+ * memory resources. The elfcorehdr segment memsz must be
+ * sufficiently large to accommodate increases due to hotplug
+ * activity. See CRASH_HOTPLUG_ELFCOREHDR_SZ.
+ */
+ if (prepare_elf_headers(image, &elfbuf, &elfsz)) {
+ pr_err("crash hp: unable to prepare elfcore headers");
+ goto out;
+ }
+ ksegment = &image->segment[image->arch.hp.elf_index];
+ memsz = ksegment->memsz;
+ if (elfsz > memsz) {
+ pr_err("crash hp: not enough room to update elfcorehdr elfsz %lu > memsz %lu",
+ elfsz, memsz);
+ goto out;
+ }
+ /* Setup for kexec_calculate_store_digests() (for hash/digest) */
+ ksegment->kbuf = elfbuf;
+ ksegment->bufsz = elfsz;
+
+ /*
+ * To update purgatory, must initialize the purgatory ELF blob,
+ * then record the crash kernel entry point register context, and
+ * finally must recompute the hash/digests for the loaded segments.
+ */
+ ksegment = &image->segment[image->arch.hp.purg_index];
+ mem = ksegment->mem;
+ memsz = ksegment->memsz;
+
+ /*
+ * Initialize the purgatory ELF blob. Need to initialize the
+ * kexec_buf in order to maneuver through kexec_load_purgatory()
+ */
+ pbuf.image = image;
+ pbuf.buffer = NULL;
+ pbuf.buf_min = mem;
+ pbuf.buf_max = mem+memsz;
+ pbuf.top_down = true;
+ pbuf.mem = mem;
+ pbuf.memsz = memsz;
+ if (kexec_load_purgatory(image, &pbuf)) {
+ pr_err("crash hp: Initializing purgatory failed\n");
+ goto out;
+ }
+ /* Setup for kexec_calculate_store_digests() (to skip this segment) */
+ ksegment->kbuf = pbuf.buffer;
+ ksegment->bufsz = pbuf.bufsz;
+
+ /*
+ * Rebuild and patch the purgatory ELF blob with updates
+ * to the regs64 entry point context.
+ */
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 1);
+ if (ret) {
+ pr_err("crash hp: can not extract entry64_regs");
+ goto out;
+ }
+ regs64.rbx = image->arch.hp.rbx;
+ regs64.rsi = image->arch.hp.rsi;
+ regs64.rip = image->arch.hp.rip;
+ regs64.rsp = image->arch.hp.rsp;
+ ret = kexec_purgatory_get_set_symbol(image, "entry64_regs", &regs64,
+ sizeof(regs64), 0);
+ if (ret) {
+ pr_err("crash hp: Could not set entry64_regs");
+ goto out;
+ }
+
+ /*
+ * To compute the hash/digests, must establish valid kernel
+ * pointers to all the image segments. Both the elfcorehdr and
+ * the purgatory segments already have valid pointers.
+ */
+ for (n = 0; n < image->nr_segments; ++n) {
+ ksegment = &image->segment[n];
+ if (ksegment->kbuf == NULL) {
+ mem = ksegment->mem;
+ memsz = ksegment->memsz;
+ ksegment->kbuf = map_crash_pages(mem, memsz);
+ if (!ksegment->kbuf) {
+ pr_err("crash hp: unable to map segment %u: %lx for %lu bytes",
+ n, mem, memsz);
+ goto out;
+ }
+ }
+ }
+
+ /* Recompute the digests for the segments */
+ if (kexec_calculate_store_digests(image)) {
+ pr_err("crash hp: recompute digest failed");
+ goto out;
+ }
+
+ /*
+ * At this point, we are all but assured of success.
+ * Temporarily invalidate the crash image while its new (and
+ * accurate) segments are written to memory. A panic during
+ * this operation will NOT generate a crash dump.
+ */
+ xchg(&kexec_crash_image, NULL);
+
+ /* Copy new elfcorehdr into destination */
+ ksegment = &image->segment[image->arch.hp.elf_index];
+ mem = ksegment->mem;
+ memsz = ksegment->memsz;
+ ptr = map_crash_pages(mem, memsz);
+ if (ptr) {
+ /* Write the new elfcorehdr into memory */
+ memcpy((void *)ptr, elfbuf, elfsz);
+ /*
+ * Zero the memory between bufsz and memsz to match run-time
+ * purgatory hash calculations.
+ */
+ memset((void *)(ptr+elfsz), 0, memsz-elfsz);
+ }
+ unmap_crash_pages((void **)&ptr);
+ pr_debug("crash hp: re-loaded elfcorehdr at 0x%lx\n", mem);
+
+ /* With purgatory fully updated, store into crash kernel memory */
+ ksegment = &image->segment[image->arch.hp.purg_index];
+ if (kimage_load_segment(image, ksegment)) {
+ pr_err("crash hp: reloading purgatory failed");
+ goto out;
+ }
+ pr_debug("crash hp: re-loaded purgatory at 0x%lx\n", ksegment->mem);
+
+ /* FIXME: some kind of cache flush perhaps? */
+
+ /*
+ * The crash image is now valid once again, panics will cause a
+ * crash dump to occur.
+ */
+ xchg(&kexec_crash_image, image);
+
+out:
+ /* Free/release buffers */
+ kimage_file_post_load_cleanup(image);
+ /* Free elfbuf */
+ ksegment = &image->segment[image->arch.hp.elf_index];
+ vfree(ksegment->kbuf);
+ ksegment->kbuf = NULL; /* for loop below */
+ /*
+ * Free purgatory buffer; this ksegment->kbuf is pi->purgatory_buf
+ * and already freed in kimage_file_post_load_cleanup().
+ */
+ ksegment = &image->segment[image->arch.hp.purg_index];
+ ksegment->kbuf = NULL; /* for loop below */
+ /* Free/release mappings */
+ for (n = 0; n < image->nr_segments; ++n) {
+ ksegment = &image->segment[n];
+ unmap_crash_pages((void **)&ksegment->kbuf);
+ }
+}
+#endif /* CONFIG_CRASH_HOTPLUG */
--
2.27.0
Hi Eric,
On 11/18/21 at 12:49pm, Eric DeVolder wrote:
> When the kdump service is loaded, if a CPU or memory is hot
> un/plugged, the crash elfcorehdr which describes the CPUs and memory
> in the system, must also be updated, else the resulting vmcore is
> inaccurate (eg. missing either CPU context or memory regions).
>
> The current solution utilizes udev to initiate an unload-then-reload
> of the kdump image (e. kernel, initrd, boot_params, puratory and
> elfcorehdr) by the userspace kexec utility.
>
> In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
> problems with this userspace-initiated unload-then-reload approach as
> it pertains to supporting CPU and memory hot un/plug for kdump.
> (Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
> structure. There is a vmcoreinfo structure, but it has a different
> purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)
It's great that you finally made this patchset to address the cpu/mem
hotplug issues raised before; I will review it carefully. And I have
to say sorry, because I once promised you I would do this but didn't
keep that promise due to personal reasons.
Thanks again for doing this.
>
> The first problem being the time needed to complete the unload-then-
> reload of the kdump image, and the second being the effective race
> window that unload-then-reload effort creates.
>
> The scenario I measured was a 32GiB guest being resized to 512GiB and
> observing it took over 4 minutes for udev to "settle down" and
> complete the unload-then-reload of the resulting 3840 hot plug events.
> Empirical evidence within our fleet substantiates this problem.
>
> Each unload-then-reload creates a race window the size of which is the
> time it takes to reload the complete kdump image. Within the race
> window, kdump is not loaded and should a panic occur, the kernel halts
> rather than dumping core via kdump.
>
> This patchset significantly improves upon the current solution by
> enabling the kernel to update only the necessary items of the kdump
> image. In the case of x86_64, that is just the elfcorehdr and the
> purgatory segments. These updates occur as fast as the hot un/plug
> events and significantly reduce the size of the race window.
>
> This patchset introduces a generic crash hot un/plug handler that
> registers with the CPU and memory notifiers. Upon CPU or memory
> changes, this generic handler is invoked and performs important
> housekeeping, for example obtaining the appropriate lock, and then
> invokes an architecture specific handler to do the appropriate
> updates.
>
> In the case of x86_64, the arch specific handler generates a new
> elfcorehdr, which reflects the current CPUs and memory regions, into a
> buffer. Since purgatory also does an integrity check via hash digests
> of the loaded segments, purgatory must also be updated with the new
> digests. The arch handler also generates a new purgatory into a
> buffer, performs the hash digests of the new memory segments, and then
> patches purgatory with the new digests. If all succeeds, then the
> elfcorehdr and purgatory buffers over write the existing buffers and
> the new kdump image is live and ready to go. No involvement with
> userspace at all.
>
> To accommodate a growing number of resources via hotplug, the
> elfcorehdr memory must be sufficiently large enough to accommodate
> changes. The CRASH_HOTPLUG_ELFCOREHDR_SZ configure item does just
> this.
>
> To realize the benefits/test this patchset, one must make a couple
> of minor changes to userspace:
>
> - Disable the udev rule for updating kdump on hot un/plug changes
> Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules
> or other technique to neuter the rule.
>
> - Change to the kexec_file_load for loading the kdump kernel:
> Eg. on RHEL: in /usr/bin/kdumpctl, change to:
> standard_kexec_args="-p -d -s"
> which adds the -s to select kexec_file_load syscall.
>
> This work has raised the following questions for me:
>
> First and foremost, this patch only works for the kexec_file_load
> syscall path (via "kexec -s -p" utility). The reason being that, for
> x86_64 anyway, the purgatory blob provided by userspace can not be
> readily decoded in order to update the hash digest. (The
> kexec_file_load purgatory is actually a small ELF object with symbols,
> so can be patched at run time.) With no way to update purgatory, the
> integrity check will always fail and and cause purgatory to hang at
> panic time.
>
> That being said, I actually developed this against the kexec_load path
> and did have that working by making two one-line changes to userspace
> kexec utility: one change that effectively is
> CRASH_HOTPLUG_ELFCOREHDR_SZ and the other to disable the integrity
> check. But that does not seem to be a long term solution. A possible
> long term solution would be to allow the use of the kexec_file_load
> purgatory ELF object with the kexec_load path. While I believe that
> would work, I am unsure if there are any downsides to doing so.
>
> The second problem is the use of CPUHP_AP_ONLINE_DYN. The
> cpuhp_setup_state_nocalls() is invoked with parameter
> CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
> the CPU still shows up in foreach_present_cpu() during the
> regeneration of the elfcorehdr, thus the need to explicitly check and
> exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
> Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
> cpuhp_setup_state() was utilized, then the offline cpu would no longer
> be in foreach_present_cpu(), and this change could be eliminated. I do
> not understand cpuhp_setup_state() well enough to choose, or create,
> appropriate value(s).
>
> The third problem is the number of memory hot un/plug events. If, for
> example, a 1GiB DIMM is hotplugged, that generate 8 memory events, one
> for each 128MiB memblock, yet the walk_system_ram_res() that is used
> to obtain the list of memory regions reports the single 1GiB; thus
> there are 7 un-necessary trips through this crash hotplug handler.
> Perhaps there is another way of handling memory events that would see
> the single 1GiB DIMM rather than each memblock?
>
> Regards,
> eric
>
> Eric DeVolder (8):
> crash: fix minor typo/bug in debug message
> crash hp: Introduce CRASH_HOTPLUG configuration options
> crash hp: definitions and prototypes for crash hotplug support
> crash hp: generic crash hotplug support infrastructure
> crash hp: kexec_file changes for use by crash hotplug handler
> crash hp: Add x86 crash hotplug state items to kimage
> crash hp: Add x86 crash hotplug support for kexec_file_load
> crash hp: Add x86 crash hotplug support for bzImage
>
> arch/x86/Kconfig | 26 +++
> arch/x86/include/asm/kexec.h | 10 ++
> arch/x86/kernel/crash.c | 257 +++++++++++++++++++++++++++++-
> arch/x86/kernel/kexec-bzimage64.c | 12 ++
> include/linux/kexec.h | 22 ++-
> kernel/crash_core.c | 118 ++++++++++++++
> kernel/kexec_file.c | 19 ++-
> 7 files changed, 455 insertions(+), 9 deletions(-)
>
> --
> 2.27.0
>
Hi,
On 11/18/21 at 12:49pm, Eric DeVolder wrote:
......
> This patchset introduces a generic crash hot un/plug handler that
> registers with the CPU and memory notifiers. Upon CPU or memory
> changes, this generic handler is invoked and performs important
> housekeeping, for example obtaining the appropriate lock, and then
> invokes an architecture specific handler to do the appropriate
> updates.
>
> In the case of x86_64, the arch specific handler generates a new
> elfcorehdr, which reflects the current CPUs and memory regions, into a
> buffer. Since purgatory also does an integrity check via hash digests
> of the loaded segments, purgatory must also be updated with the new
When I tried to address this with a draft patch, I started with a
different way, in which the udev rule triggers reloading and only the
elfcorehdr segment is updated. That update should be less time
consuming. The internal notifier in your way seems better. But I
didn't update purgatory, since I just skipped the elfcorehdr part when
calculating the digest of the segments. My reasoning is that the
kernel text and initrd must contribute most of the digest, the
elfcorehdr much less, and skipping it simplifies the code change more.
Doing so means we have no need to touch purgatory at all. What do you
think?
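A minimal sketch of what I mean, inside kexec_calculate_store_digests()
(the elf_index bookkeeping is borrowed from your patchset, so treat the
names as illustrative):

	for (i = 0; i < image->nr_segments; i++) {
		struct kexec_segment *ksegment = &image->segment[i];

#ifdef CONFIG_CRASH_HOTPLUG
		/*
		 * Exclude the elfcorehdr segment from the digest so that
		 * a later hotplug update does not invalidate the hash.
		 */
		if (i == image->arch.hp.elf_index)
			continue;
#endif
		/* ... existing hashing of ksegment->kbuf ... */
	}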
Still reviewing.
> digests. The arch handler also generates a new purgatory into a
> buffer, performs the hash digests of the new memory segments, and then
> patches purgatory with the new digests. If all succeeds, then the
> elfcorehdr and purgatory buffers over write the existing buffers and
> the new kdump image is live and ready to go. No involvement with
> userspace at all.
Hello Eric,
On 18/11/21 23:19, Eric DeVolder wrote:
> When the kdump service is loaded, if a CPU or memory is hot
> un/plugged, the crash elfcorehdr which describes the CPUs and memory
> in the system, must also be updated, else the resulting vmcore is
> inaccurate (eg. missing either CPU context or memory regions).
>
> The current solution utilizes udev to initiate an unload-then-reload
> of the kdump image (e. kernel, initrd, boot_params, puratory and
> elfcorehdr) by the userspace kexec utility.
>
> In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
> problems with this userspace-initiated unload-then-reload approach as
> it pertains to supporting CPU and memory hot un/plug for kdump.
> (Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
> structure. There is a vmcoreinfo structure, but it has a different
> purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)
>
> The first problem being the time needed to complete the unload-then-
> reload of the kdump image, and the second being the effective race
> window that unload-then-reload effort creates.
>
> The scenario I measured was a 32GiB guest being resized to 512GiB and
> observing it took over 4 minutes for udev to "settle down" and
> complete the unload-then-reload of the resulting 3840 hot plug events.
> Empirical evidence within our fleet substantiates this problem.
>
> Each unload-then-reload creates a race window the size of which is the
> time it takes to reload the complete kdump image. Within the race
> window, kdump is not loaded and should a panic occur, the kernel halts
> rather than dumping core via kdump.
>
> This patchset significantly improves upon the current solution by
> enabling the kernel to update only the necessary items of the kdump
> image. In the case of x86_64, that is just the elfcorehdr and the
> purgatory segments. These updates occur as fast as the hot un/plug
> events and significantly reduce the size of the race window.
>
> This patchset introduces a generic crash hot un/plug handler that
> registers with the CPU and memory notifiers. Upon CPU or memory
> changes, this generic handler is invoked and performs important
> housekeeping, for example obtaining the appropriate lock, and then
> invokes an architecture specific handler to do the appropriate
> updates.
>
> In the case of x86_64, the arch specific handler generates a new
> elfcorehdr, which reflects the current CPUs and memory regions, into a
> buffer. Since purgatory also does an integrity check via hash digests
> of the loaded segments, purgatory must also be updated with the new
> digests. The arch handler also generates a new purgatory into a
> buffer, performs the hash digests of the new memory segments, and then
> patches purgatory with the new digests. If all succeeds, then the
> elfcorehdr and purgatory buffers over write the existing buffers and
> the new kdump image is live and ready to go. No involvement with
> userspace at all.
>
> To accommodate a growing number of resources via hotplug, the
> elfcorehdr memory must be sufficiently large enough to accommodate
> changes. The CRASH_HOTPLUG_ELFCOREHDR_SZ configure item does just
> this.
>
> To realize the benefits/test this patchset, one must make a couple
> of minor changes to userspace:
>
> - Disable the udev rule for updating kdump on hot un/plug changes
> Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules
> or other technique to neuter the rule.
>
> - Change to the kexec_file_load for loading the kdump kernel:
> Eg. on RHEL: in /usr/bin/kdumpctl, change to:
> standard_kexec_args="-p -d -s"
> which adds the -s to select kexec_file_load syscall.
>
> This work has raised the following questions for me:
>
> First and foremost, this patch only works for the kexec_file_load
> syscall path (via "kexec -s -p" utility). The reason being that, for
> x86_64 anyway, the purgatory blob provided by userspace can not be
> readily decoded in order to update the hash digest. (The
> kexec_file_load purgatory is actually a small ELF object with symbols,
> so can be patched at run time.) With no way to update purgatory, the
> integrity check will always fail and and cause purgatory to hang at
> panic time.
We are designing a solution for a similar problem in PowerPC. Agreed
that manipulating a kexec segment in the kernel for the kexec_load
system call is a bit complex compared to the kexec_file_load system
call, due to the SHA verification in purgatory.
What if we had a pre-allocated memory hole for the kexec segment,
and asked kexec to use that and skip the SHA verification for it?
For example, on PowerPC, all the CPU and memory-related info is part
of the FDT. Whenever there is a hotplug event, we have to update the
kdump FDT segment to provide correct details to the kdump kernel.
One way to keep the kdump FDT up-to-date with the latest CPUs and
memory is to load the kdump FDT into the pre-allocated memory hole,
for both the kexec_load and kexec_file_load system calls, and let the
kernel keep updating the FDT on hotplug events.
Adapting the above solution for the kexec_file_load case is easy
because we do everything in the kernel. But what about the kexec_load
system call? How will the kexec tool know about this pre-allocated
memory hole? What will happen to digest verification if the kernel
updates the kdump FDT segment post kdump load?
The kernel will expose the pre-allocated memory to userspace via
sysfs. When the kexec tool loads the kexec segments, it will check
this pre-allocated memory for the kdump FDT, and if available it will
use it and skip the SHA verification for that segment; a rough sketch
of the kernel side is below.
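Roughly, the kernel side could look something like this (a sketch
only; fdt_hole_addr, fdt_hole_size and the attribute name are made up
for illustration):

	static unsigned long fdt_hole_addr, fdt_hole_size;

	/* Report the pre-allocated kdump FDT hole to userspace. */
	static ssize_t kexec_crash_fdt_show(struct kobject *kobj,
					    struct kobj_attribute *attr,
					    char *buf)
	{
		return sysfs_emit(buf, "0x%lx 0x%lx\n",
				  fdt_hole_addr, fdt_hole_size);
	}
	static struct kobj_attribute kexec_crash_fdt_attr =
		__ATTR_RO(kexec_crash_fdt);

	/*
	 * Registered at init time with:
	 * sysfs_create_file(kernel_kobj, &kexec_crash_fdt_attr.attr);
	 * so the kexec tool can read /sys/kernel/kexec_crash_fdt.
	 */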
Please provide your input on the above method of handling things for
the kexec_file system call.
I am still reviewing your patch series.
Thanks,
Sourabh Jain
> That being said, I actually developed this against the kexec_load path
> and did have that working by making two one-line changes to userspace
> kexec utility: one change that effectively is
> CRASH_HOTPLUG_ELFCOREHDR_SZ and the other to disable the integrity
> check. But that does not seem to be a long term solution. A possible
> long term solution would be to allow the use of the kexec_file_load
> purgatory ELF object with the kexec_load path. While I believe that
> would work, I am unsure if there are any downsides to doing so.
>
> The second problem is the use of CPUHP_AP_ONLINE_DYN. The
> cpuhp_setup_state_nocalls() is invoked with parameter
> CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
> the CPU still shows up in foreach_present_cpu() during the
> regeneration of the elfcorehdr, thus the need to explicitly check and
> exclude the soon-to-be offlined CPU in crash_prepare_elf64_headers().
> Perhaps if value(s) new/different than CPUHP_AP_ONLINE_DYN to
> cpuhp_setup_state() was utilized, then the offline cpu would no longer
> be in foreach_present_cpu(), and this change could be eliminated. I do
> not understand cpuhp_setup_state() well enough to choose, or create,
> appropriate value(s).
>
> The third problem is the number of memory hot un/plug events. If, for
> example, a 1GiB DIMM is hotplugged, that generate 8 memory events, one
> for each 128MiB memblock, yet the walk_system_ram_res() that is used
> to obtain the list of memory regions reports the single 1GiB; thus
> there are 7 un-necessary trips through this crash hotplug handler.
> Perhaps there is another way of handling memory events that would see
> the single 1GiB DIMM rather than each memblock?
>
> Regards,
> eric
>
> Eric DeVolder (8):
> crash: fix minor typo/bug in debug message
> crash hp: Introduce CRASH_HOTPLUG configuration options
> crash hp: definitions and prototypes for crash hotplug support
> crash hp: generic crash hotplug support infrastructure
> crash hp: kexec_file changes for use by crash hotplug handler
> crash hp: Add x86 crash hotplug state items to kimage
> crash hp: Add x86 crash hotplug support for kexec_file_load
> crash hp: Add x86 crash hotplug support for bzImage
>
> arch/x86/Kconfig | 26 +++
> arch/x86/include/asm/kexec.h | 10 ++
> arch/x86/kernel/crash.c | 257 +++++++++++++++++++++++++++++-
> arch/x86/kernel/kexec-bzimage64.c | 12 ++
> include/linux/kexec.h | 22 ++-
> kernel/crash_core.c | 118 ++++++++++++++
> kernel/kexec_file.c | 19 ++-
> 7 files changed, 455 insertions(+), 9 deletions(-)
>
Hi, see below.
eric
On 11/24/21 03:02, Baoquan He wrote:
> Hi,
>
> On 11/18/21 at 12:49pm, Eric DeVolder wrote:
> ......
>> This patchset introduces a generic crash hot un/plug handler that
>> registers with the CPU and memory notifiers. Upon CPU or memory
>> changes, this generic handler is invoked and performs important
>> housekeeping, for example obtaining the appropriate lock, and then
>> invokes an architecture specific handler to do the appropriate
>> updates.
>>
>> In the case of x86_64, the arch specific handler generates a new
>> elfcorehdr, which reflects the current CPUs and memory regions, into a
>> buffer. Since purgatory also does an integrity check via hash digests
>> of the loaded segments, purgatory must also be updated with the new
>
> When I tried to address this with a draft patch, I started with a
> different way in which udev rule triggers reloading and only elfcorehdr
> segment is updated. The update should be less time consuming. Seems
> internal notifier is better in your way. But I didn't update purgatory
> since I just skipped the elfcorehdr part when calculate the digest of
> segments. The reason from my mind is kernel text, initrd must contribute
> most part of the digest, elfcorehdr is much less, and it will simplify
> code change more. Doing so let us have no need to touch purgatory at
> all. What do you think?
Well certainly if purgatory did not need to be updated, then that simplifies
matters quite a bit!
I do not have any context on the history of including the elfcorehdr in the purgatory
integrity check. I do agree with you that checking the kernel, initrd, and boot_params
is most important. Perhaps allowing the elfcorehdr data structure to change
without including it in the integrity check is OK, as there is some
sanity checking of it by the capture kernel as it reads it for /proc/vmcore setup.
>
> Still reviewing.
Thank you!
>
>> digests. The arch handler also generates a new purgatory into a
>> buffer, performs the hash digests of the new memory segments, and then
>> patches purgatory with the new digests. If all succeeds, then the
>> elfcorehdr and purgatory buffers over write the existing buffers and
>> the new kdump image is live and ready to go. No involvement with
>> userspace at all.
>
Hi, see below. eric
On 11/29/21 02:45, Sourabh Jain wrote:
> Hello Eric,
>
> On 18/11/21 23:19, Eric DeVolder wrote:
>> When the kdump service is loaded, if a CPU or memory is hot
>> un/plugged, the crash elfcorehdr which describes the CPUs and memory
>> in the system, must also be updated, else the resulting vmcore is
>> inaccurate (eg. missing either CPU context or memory regions).
>>
>> The current solution utilizes udev to initiate an unload-then-reload
>> of the kdump image (e. kernel, initrd, boot_params, puratory and
>> elfcorehdr) by the userspace kexec utility.
>>
>> In the post https://lkml.org/lkml/2020/12/14/532 I outlined two
>> problems with this userspace-initiated unload-then-reload approach as
>> it pertains to supporting CPU and memory hot un/plug for kdump.
>> (Note in that post, I erroneously called the elfcorehdr the vmcoreinfo
>> structure. There is a vmcoreinfo structure, but it has a different
>> purpose. So in that post substitute "elfcorehdr" for "vmcoreinfo".)
>>
>> The first problem being the time needed to complete the unload-then-
>> reload of the kdump image, and the second being the effective race
>> window that unload-then-reload effort creates.
>>
>> The scenario I measured was a 32GiB guest being resized to 512GiB and
>> observing it took over 4 minutes for udev to "settle down" and
>> complete the unload-then-reload of the resulting 3840 hot plug events.
>> Empirical evidence within our fleet substantiates this problem.
>>
>> Each unload-then-reload creates a race window the size of which is the
>> time it takes to reload the complete kdump image. Within the race
>> window, kdump is not loaded and should a panic occur, the kernel halts
>> rather than dumping core via kdump.
>>
>> This patchset significantly improves upon the current solution by
>> enabling the kernel to update only the necessary items of the kdump
>> image. In the case of x86_64, that is just the elfcorehdr and the
>> purgatory segments. These updates occur as fast as the hot un/plug
>> events and significantly reduce the size of the race window.
>>
>> This patchset introduces a generic crash hot un/plug handler that
>> registers with the CPU and memory notifiers. Upon CPU or memory
>> changes, this generic handler is invoked and performs important
>> housekeeping, for example obtaining the appropriate lock, and then
>> invokes an architecture specific handler to do the appropriate
>> updates.
>>
>> In the case of x86_64, the arch specific handler generates a new
>> elfcorehdr, which reflects the current CPUs and memory regions, into a
>> buffer. Since purgatory also does an integrity check via hash digests
>> of the loaded segments, purgatory must also be updated with the new
>> digests. The arch handler also generates a new purgatory into a
>> buffer, performs the hash digests of the new memory segments, and then
>> patches purgatory with the new digests. If all succeeds, then the
>> elfcorehdr and purgatory buffers over write the existing buffers and
>> the new kdump image is live and ready to go. No involvement with
>> userspace at all.
>>
>> To accommodate a growing number of resources via hotplug, the
>> elfcorehdr memory must be sufficiently large enough to accommodate
>> changes. The CRASH_HOTPLUG_ELFCOREHDR_SZ configure item does just
>> this.
>>
>> To realize the benefits/test this patchset, one must make a couple
>> of minor changes to userspace:
>>
>> - Disable the udev rule for updating kdump on hot un/plug changes
>> Eg. on RHEL: rm -f /usr/lib/udev/rules.d/98-kexec.rules
>> or other technique to neuter the rule.
>>
>> - Change to the kexec_file_load for loading the kdump kernel:
>> Eg. on RHEL: in /usr/bin/kdumpctl, change to:
>> standard_kexec_args="-p -d -s"
>> which adds the -s to select kexec_file_load syscall.
>>
>> This work has raised the following questions for me:
>>
>> First and foremost, this patch only works for the kexec_file_load
>> syscall path (via "kexec -s -p" utility). The reason being that, for
>> x86_64 anyway, the purgatory blob provided by userspace can not be
>> readily decoded in order to update the hash digest. (The
>> kexec_file_load purgatory is actually a small ELF object with symbols,
>> so can be patched at run time.) With no way to update purgatory, the
>> integrity check will always fail and and cause purgatory to hang at
>> panic time.
>
>
> We are designing a solution for a similar problem in PowerPC. Agree that
> manipulating kexec segment in the kernel for kexec_load system call is
> bit complex compare to kexec_file_load system call due to SHA verification
> in purgatory.
>
> What if we have a pre-allocated memory hole for the kexec segment
> and ask kexec to use that and skip the SHA verification for the same.
> For example, on PowerPC, all the CPUs and memory-related info is part
> of FDT. Whenever there is hotplug event we have to update the kdump
> FDT segment to provide correct details to the kdump kernel.
>
> One way to keep the kdump FDT up-to-date with the latest CPU and memory
> is to load the kdump FDT to the pre-allocated memory hole for both kexec_load
> and kexec_file_laod system call and let the kernel keep updating the FDT
> on hotplug event.
>
> Adapting the above solution for the kexec_file_load case is easy because
> we do everything in the kernel. But what about the kexec_load system call? How
> kexec tool will know about this pre-allocated memory hole? What will happen
> to digest verification if the kernel updates the kdump FDT segment post kdump
> load?
>
> The kernel will expose the pre-allocated memory to userspace via a sysfs. When kexec
> tool loads the kexec segments it will check for this pre-allocated memory for
> kdump FDT and if available it will use it and skip the SHA verification
> for the same.
>
> Please provide your input on the above method of handling things for the kexec_file system call?
While I am not at all familiar with the PPC FDT, this sounds quite doable; the pre-allocated
memory for the FDT sounds quite similar to the handling of the crashkernel= parameter.
From the description provided above, it sounds to me that excluding the PPC FDT
from the purgatory integrity check would be quite similar to what Baoquan is
proposing for x86 by excluding the elfcorehdr from that check.
If we can achieve a consensus on excluding from the purgatory check the elfcorehdr (for
x86) and the FDT (for PPC), then I believe that support for kexec_load and hotplug is
readily achievable.
eric
>
> I am still reviewing your patch series.
>
> Thanks,
> Sourabh Jain
>
>> That being said, I actually developed this against the kexec_load path
>> and had it working by making two one-line changes to the userspace
>> kexec utility: one change that is effectively the equivalent of
>> CRASH_HOTPLUG_ELFCOREHDR_SZ and another to disable the integrity
>> check. But that does not seem to be a long-term solution. A possible
>> long-term solution would be to allow the use of the kexec_file_load
>> purgatory ELF object with the kexec_load path. While I believe that
>> would work, I am unsure if there are any downsides to doing so.
>>
>> The second problem is the use of CPUHP_AP_ONLINE_DYN.
>> cpuhp_setup_state_nocalls() is invoked with the parameter
>> CPUHP_AP_ONLINE_DYN. While this works, when a CPU is being unplugged,
>> the CPU still shows up in for_each_present_cpu() during the
>> regeneration of the elfcorehdr, hence the need to explicitly check for
>> and exclude the soon-to-be-offlined CPU in crash_prepare_elf64_headers().
>> Perhaps if a value new/different from CPUHP_AP_ONLINE_DYN were passed
>> to cpuhp_setup_state(), then the offlined CPU would no longer appear in
>> for_each_present_cpu(), and this check could be eliminated. I do not
>> understand cpuhp_setup_state() well enough to choose, or create, an
>> appropriate value. A sketch of the current registration follows.
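>>
>> (Illustrative; the crash_cpu_online/crash_cpu_offline callback names
>> are placeholders for the real handlers.)
>>
>>   #include <linux/cpuhotplug.h>
>>
>>   /* With CPUHP_AP_ONLINE_DYN, the teardown callback runs while the
>>    * departing CPU is still in the present mask, hence the explicit
>>    * exclusion described above.
>>    */
>>   static int crash_cpu_online(unsigned int cpu)
>>   {
>>           /* regenerate the elfcorehdr with @cpu included */
>>           return 0;
>>   }
>>
>>   static int crash_cpu_offline(unsigned int cpu)
>>   {
>>           /* regenerate the elfcorehdr, explicitly excluding @cpu */
>>           return 0;
>>   }
>>
>>   static int __init crash_cpuhp_init(void)
>>   {
>>           int ret;
>>
>>           /* returns the allocated dynamic state (> 0) on success */
>>           ret = cpuhp_setup_state_nocalls(CPUHP_AP_ONLINE_DYN,
>>                                           "crash/cpuhp",
>>                                           crash_cpu_online,
>>                                           crash_cpu_offline);
>>           return ret < 0 ? ret : 0;
>>   }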
>>
>> The third problem is the number of memory hot un/plug events. If, for
>> example, a 1GiB DIMM is hotplugged, that generates 8 memory events, one
>> for each 128MiB memblock, yet the walk_system_ram_res() that is used
>> to obtain the list of memory regions reports the single 1GiB region;
>> thus there are 7 unnecessary trips through this crash hotplug handler.
>> Perhaps there is another way of handling memory events that would see
>> the single 1GiB DIMM rather than each memblock? The notifier in use
>> looks roughly like the sketch below.
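>>
>> (Illustrative; the crash_memhp_notifier callback name is a
>> placeholder for the real handler.)
>>
>>   #include <linux/memory.h>
>>   #include <linux/notifier.h>
>>
>>   /* Each 128MiB memblock of a hotplugged DIMM raises its own
>>    * MEM_ONLINE/MEM_OFFLINE event, which is the 8-events-per-1GiB
>>    * behavior described above.
>>    */
>>   static int crash_memhp_notifier(struct notifier_block *nb,
>>                                   unsigned long action, void *data)
>>   {
>>           switch (action) {
>>           case MEM_ONLINE:
>>           case MEM_OFFLINE:
>>                   /* regenerate the elfcorehdr from the regions
>>                    * reported by walk_system_ram_res()
>>                    */
>>                   break;
>>           }
>>           return NOTIFY_OK;
>>   }
>>
>>   static struct notifier_block crash_memhp_nb = {
>>           .notifier_call = crash_memhp_notifier,
>>   };
>>
>>   static int __init crash_memhp_init(void)
>>   {
>>           return register_memory_notifier(&crash_memhp_nb);
>>   }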
>>
>> Regards,
>> eric
>>
>> Eric DeVolder (8):
>> crash: fix minor typo/bug in debug message
>> crash hp: Introduce CRASH_HOTPLUG configuration options
>> crash hp: definitions and prototypes for crash hotplug support
>> crash hp: generic crash hotplug support infrastructure
>> crash hp: kexec_file changes for use by crash hotplug handler
>> crash hp: Add x86 crash hotplug state items to kimage
>> crash hp: Add x86 crash hotplug support for kexec_file_load
>> crash hp: Add x86 crash hotplug support for bzImage
>>
>> arch/x86/Kconfig | 26 +++
>> arch/x86/include/asm/kexec.h | 10 ++
>> arch/x86/kernel/crash.c | 257 +++++++++++++++++++++++++++++-
>> arch/x86/kernel/kexec-bzimage64.c | 12 ++
>> include/linux/kexec.h | 22 ++-
>> kernel/crash_core.c | 118 ++++++++++++++
>> kernel/kexec_file.c | 19 ++-
>> 7 files changed, 455 insertions(+), 9 deletions(-)
>>
+ akpm
On 11/29/21 at 01:42pm, Eric DeVolder wrote:
> Hi, see below.
> eric
>
> On 11/24/21 03:02, Baoquan He wrote:
> > Hi,
> >
> > On 11/18/21 at 12:49pm, Eric DeVolder wrote:
> > ......
> > > This patchset introduces a generic crash hot un/plug handler that
> > > registers with the CPU and memory notifiers. Upon CPU or memory
> > > changes, this generic handler is invoked and performs important
> > > housekeeping, for example obtaining the appropriate lock, and then
> > > invokes an architecture specific handler to do the appropriate
> > > updates.
> > >
> > > In the case of x86_64, the arch specific handler generates a new
> > > elfcorehdr, which reflects the current CPUs and memory regions, into a
> > > buffer. Since purgatory also does an integrity check via hash digests
> > > of the loaded segments, purgatory must also be updated with the new
> >
> > When I tried to address this with a draft patch, I started with a
> > different approach, in which the udev rule triggers a reload and only
> > the elfcorehdr segment is updated. That update should be less time
> > consuming. The internal notifier in your approach seems better. But I
> > didn't update purgatory, since I just skipped the elfcorehdr part when
> > calculating the digests of the segments. My reasoning is that the
> > kernel text and initrd must contribute most of the digest and the
> > elfcorehdr much less, and skipping it simplifies the code change;
> > doing so means there is no need to touch purgatory at all. What do
> > you think?
>
> Well, certainly if purgatory did not need to be updated, that would simplify
> matters quite a bit!
>
> I do not have any context on the history of including the elfcorehdr in the
> purgatory integrity check. I do agree with you that checking the kernel,
> initrd, and boot_params is most important. Perhaps allowing the elfcorehdr
> data structure to change without including it in the integrity check isn't
> too bad, as there is some sanity checking of it by the capture kernel as it
> reads it for /proc/vmcore setup.
Well, I think the check has included the elfcorehdr ever since userspace
kexec-tools added the check. We can do the skipping in kexec_file_load in
the kernel for the time being, and see if anyone has concerns about the
safety or security. Since agreement has been reached, can you split out
the purgatory update and repost a new patchset with the current framework
that only updates the elfcorehdr?
And by the way, I think you have written a very good cover letter which
covers almost all the details of the change. However, it's a pity that
those details are not in the patch logs. In your patch logs, you just
state what change is made in the patch, but the why we need it, which is
the most important part, is missing. Most of the time we can see what
change has been made from the code; it's certainly very helpful if the
patch log states it, and that can save reviewers much time, but it's not
easy to see why it's needed or introduced without having been involved
in the earlier discussion or having other context. And as you know, the
cover letter is stripped away when maintainers merge a patch; only the
patch log is kept.
Thanks
Baoquan
>
> >
> > Still reviewing.
>
> Thank you!
>
> >
> > > digests. The arch handler also generates a new purgatory into a
> > > buffer, performs the hash digests of the new memory segments, and then
> > > patches purgatory with the new digests. If all succeeds, then the
> > > elfcorehdr and purgatory buffers overwrite the existing buffers and
> > > the new kdump image is live and ready to go. No involvement with
> > > userspace at all.
> >
>
See below, thanks, eric
On 12/1/21 06:59, Baoquan He wrote:
> + akpm
>
> On 11/29/21 at 01:42pm, Eric DeVolder wrote:
>> Hi, see below.
>> eric
>>
>> On 11/24/21 03:02, Baoquan He wrote:
>>> Hi,
>>>
>>> On 11/18/21 at 12:49pm, Eric DeVolder wrote:
>>> ......
>>>> This patchset introduces a generic crash hot un/plug handler that
>>>> registers with the CPU and memory notifiers. Upon CPU or memory
>>>> changes, this generic handler is invoked and performs important
>>>> housekeeping, for example obtaining the appropriate lock, and then
>>>> invokes an architecture specific handler to do the appropriate
>>>> updates.
>>>>
>>>> In the case of x86_64, the arch specific handler generates a new
>>>> elfcorehdr, which reflects the current CPUs and memory regions, into a
>>>> buffer. Since purgatory also does an integrity check via hash digests
>>>> of the loaded segments, purgatory must also be updated with the new
>>>
>>> When I tried to address this with a draft patch, I started with a
>>> different approach, in which the udev rule triggers a reload and only
>>> the elfcorehdr segment is updated. That update should be less time
>>> consuming. The internal notifier in your approach seems better. But I
>>> didn't update purgatory, since I just skipped the elfcorehdr part when
>>> calculating the digests of the segments. My reasoning is that the
>>> kernel text and initrd must contribute most of the digest and the
>>> elfcorehdr much less, and skipping it simplifies the code change;
>>> doing so means there is no need to touch purgatory at all. What do
>>> you think?
>>
>> Well, certainly if purgatory did not need to be updated, that would simplify
>> matters quite a bit!
>>
>> I do not have any context on the history of including the elfcorehdr in the
>> purgatory integrity check. I do agree with you that checking the kernel,
>> initrd, and boot_params is most important. Perhaps allowing the elfcorehdr
>> data structure to change without including it in the integrity check isn't
>> too bad, as there is some sanity checking of it by the capture kernel as it
>> reads it for /proc/vmcore setup.
>
> Well, I think the check has included the elfcorehdr ever since userspace
> kexec-tools added the check. We can do the skipping in kexec_file_load in
> the kernel for the time being, and see if anyone has concerns about the
> safety or security. Since agreement has been reached, can you split out
> the purgatory update and repost a new patchset with the current framework
> that only updates the elfcorehdr?
I reworked the patchset as you suggested and removed the reload of purgatory.
It simplified things considerably.
>
> And by the way, I think you have written a very good cover letter which
> covers almost all the details of the change. However, it's a pity that
> those details are not in the patch logs. In your patch logs, you just
> state what change is made in the patch, but the why we need it, which is
> the most important part, is missing. Most of the time we can see what
> change has been made from the code; it's certainly very helpful if the
> patch log states it, and that can save reviewers much time, but it's not
> easy to see why it's needed or introduced without having been involved
> in the earlier discussion or having other context. And as you know, the
> cover letter is stripped away when maintainers merge a patch; only the
> patch log is kept.
I've tried to place more information in the individual patch commit
messages and in the code itself.
I just posted v2!
Thanks for your interest and review!
eric
>
> Thanks
> Baoquan
>
>>
>>>
>>> Still reviewing.
>>
>> Thank you!
>>
>>>
>>>> digests. The arch handler also generates a new purgatory into a
>>>> buffer, performs the hash digests of the new memory segments, and then
>>>> patches purgatory with the new digests. If all succeeds, then the
>>>> elfcorehdr and purgatory buffers overwrite the existing buffers and
>>>> the new kdump image is live and ready to go. No involvement with
>>>> userspace at all.
>>>
>>
>