2023-06-29 19:59:22

by Eric DeVolder

Subject: [PATCH v25 00/10] crash: Kernel handling of CPU and memory hot un/plug

This series is dependent upon "refactor Kconfig to consolidate
KEXEC and CRASH options".
https://lore.kernel.org/lkml/[email protected]/

Once the kdump service is loaded, if changes to CPUs or memory occur,
either by hot un/plug or off/onlining, the crash elfcorehdr must also
be updated.

The elfcorehdr describes to kdump the CPUs and memory in the system,
and any inaccuracies can result in a vmcore with missing CPU context
or memory regions.

The current solution utilizes udev to initiate an unload-then-reload
of the kdump image (e.g. kernel, initrd, boot_params, purgatory and
elfcorehdr) by the userspace kexec utility. In the original post I
outlined the significant performance problems related to offloading
this activity to userspace.

This patchset introduces a generic crash handler that registers with
the CPU and memory notifiers. Upon CPU or memory changes, from either
hot un/plug or off/onlining, this generic handler is invoked and
performs important housekeeping, for example obtaining the appropriate
lock, and then invokes an architecture specific handler to do the
appropriate elfcorehdr update.
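The split between the generic handler and the architecture specific
handler can be modelled in userspace roughly as follows. This is only
a sketch under stated assumptions, not the kernel code from this
series: struct crash_image, the HP_* action codes, demo_arch_handler()
and handle_hotplug_event() are hypothetical names, and an atomic flag
stands in for the kexec lock, taken without blocking.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical action codes, loosely modelled on the KEXEC_CRASH_HP_x items. */
enum hp_action { HP_NONE, HP_ADD_CPU, HP_REMOVE_CPU, HP_ADD_MEMORY, HP_REMOVE_MEMORY };

/* Hypothetical stand-in for the loaded kdump image. */
struct crash_image {
	bool loaded;              /* is a kdump image currently loaded? */
	enum hp_action hp_action; /* action currently being processed */
	unsigned int updates;     /* number of elfcorehdr rewrites so far */
};

/* Architecture specific hook: would regenerate the elfcorehdr in place. */
typedef void (*arch_handler_fn)(struct crash_image *image);

/* Models the lock that protects the loaded image (assumption: a trylock). */
atomic_flag image_lock = ATOMIC_FLAG_INIT;

/* Example arch hook that only counts how often it ran. */
void demo_arch_handler(struct crash_image *image)
{
	image->updates++;
}

/*
 * Generic handler: take the lock without blocking, do the housekeeping,
 * call the arch hook, then release the lock.
 * Returns true if the arch hook ran.
 */
bool handle_hotplug_event(struct crash_image *image, enum hp_action action,
			  arch_handler_fn arch)
{
	bool handled = false;

	if (atomic_flag_test_and_set(&image_lock))
		return false; /* lock is held elsewhere, give up */

	if (image != NULL && image->loaded && arch != NULL) {
		image->hp_action = action; /* tell the arch hook what changed */
		arch(image);
		image->hp_action = HP_NONE;
		handled = true;
	}

	atomic_flag_clear(&image_lock);
	return handled;
}
```

In this model, a call such as handle_hotplug_event(&img, HP_ADD_CPU,
demo_arch_handler) does nothing when no image is loaded, so hot un/plug
events stay cheap until the kdump service has loaded an image.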

Note the description in patch 'crash: change crash_prepare_elf64_headers()
to for_each_possible_cpu()' and 'x86/crash: optimize CPU changes' that
enables further optimizations related to CPU plug/unplug/online/offline
performance of elfcorehdr updates.

In the case of x86_64, the arch specific handler generates a new
elfcorehdr and overwrites the old one in memory; thus no involvement
from userspace is needed.

To realize the benefits of this patchset, or to test it, one must make
a couple of minor changes to userspace:

- Prevent udev from updating the kdump crash kernel on hot un/plug changes.
  Add the following as the first lines to the RHEL udev rule file
  /usr/lib/udev/rules.d/98-kexec.rules:

  # The kernel updates the crash elfcorehdr for CPU and memory changes
  SUBSYSTEM=="cpu", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"
  SUBSYSTEM=="memory", ATTRS{crash_hotplug}=="1", GOTO="kdump_reload_end"

  With this changeset applied, the two rules match CPU and memory
  change events and jump to kdump_reload_end, thus skipping the
  userspace unload-then-reload of kdump.

- Change to the kexec_file_load() syscall for loading the kdump kernel.
  E.g. on RHEL, in /usr/bin/kdumpctl, change to:

  standard_kexec_args="-p -d -s"

  which adds -s to select the kexec_file_load() syscall.
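The crash_hotplug attribute that the udev rules above test can also be
checked from a small program. The following is a minimal sketch, not
part of this series: crash_hotplug_enabled() is a hypothetical helper,
and the caller supplies the path to the attribute.

```c
#include <stdbool.h>
#include <stdio.h>

/*
 * Returns true only if the file at 'path' can be read and its first
 * character is '1'. A missing or unreadable attribute is treated as
 * "the kernel does not update the elfcorehdr itself".
 */
bool crash_hotplug_enabled(const char *path)
{
	FILE *f = fopen(path, "r");
	int c;

	if (f == NULL)
		return false;
	c = fgetc(f);
	fclose(f);
	return c == '1';
}
```

Going by the ABI file names in the diffstat, the attributes are assumed
to live at /sys/devices/system/cpu/crash_hotplug and
/sys/devices/system/memory/crash_hotplug; treat those paths as an
assumption and confirm them against the Documentation/ABI/testing
entries added by this series.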

This kernel patchset also supports kexec_load() with a modified kexec
userspace utility. A working changeset to the kexec userspace utility
is posted to the kexec-tools mailing list here:

http://lists.infradead.org/pipermail/kexec/2023-May/027049.html

To use the kexec-tools patch, apply, build and install kexec-tools,
then change the kdumpctl's standard_kexec_args to replace the -s with
--hotplug. The removal of -s reverts to the kexec_load syscall and
the addition of --hotplug invokes the changes put forth in the
kexec-tools patch.

Regards,
eric
---
v25: 29jun2023
- Properly applied IS_ENABLED() to the function bodies of callbacks
in drivers/base/cpu|memory.c.
- Re-ran compile and run-time testing of the impacted attributes for
both enabled and not enabled config settings.

v24: 28jun2023
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.4.0
- Included Documentation/ABI/testing entries for the new sysfs
crash_hotplug attributes, per Greg Kroah-Hartman.
- Refactored drivers/base/cpu|memory.c to use the .is_visible()
method for attributes, per Greg Kroah-Hartman.
- Retained all existing Acks and RBs as the few changes as a result
of Greg's requests were trivial.

v23: 12jun2023
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.4.0-rc6
- Refactored Kconfig, per Thomas. See series:
https://lore.kernel.org/lkml/[email protected]/
- Reworked commit messages to conform to style, per Thomas.
- Applied Baoquan He Acked-by to kexec_load() patch.
- Applied Hari Bathini Acked-by for the series.
- No code changes.

v22: 3may2023
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.3.0
- Improved support for kexec_load(), per Hari Bathini. See
"crash: hotplug support for kexec_load()" which is the only
change to this series.
- Applied Baoquan He's Acked-by for all other patches.

v21: 4apr2023
https://lkml.org/lkml/2023/4/4/1136
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.3.0-rc5
- Additional simplification of indentation in crash_handle_hotplug_event(),
per Baoquan.

v20: 17mar2023
https://lkml.org/lkml/2023/3/17/1169
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.3.0-rc2
- Defaulting CRASH_HOTPLUG for x86 to Y, per Sourabh.
- Explicitly initializing image->hp_action, per Baoquan.
- Simplified kexec_trylock() in crash_handle_hotplug_event(),
per Baoquan.
- Applied Sourabh's Reviewed-by to the series.

v19: 6mar2023
https://lkml.org/lkml/2023/3/6/1358
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.2.0
- Did away with offlinecpu, per Thomas Gleixner.
- Changed to CPUHP_BP_PREPARE_DYN instead of CPUHP_AP_ONLINE_DYN.
- Did away with elfcorehdr_index_valid, per Sourabh.
- Convert to for_each_possible_cpu() in crash_prepare_elf64_headers()
per Sourabh.
- Small optimization for x86 cpu changes.

v18: 31jan2023
https://lkml.org/lkml/2023/1/31/1356
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.2.0-rc6
- Renamed struct kimage member hotplug_event to hp_action, and
re-enumerated the KEXEC_CRASH_HP_x items, adding _NONE at 0.
- Moved to cpuhp state CPUHP_BP_PREPARE_DYN instead of
CPUHP_AP_ONLINE_DYN in order to minimize window of time CPU
is not reflected in elfcorehdr.
- Reworked some of the comments and commit messages to offer
more of the why, than what, per Thomas Gleixner.

v17: 18jan2023
https://lkml.org/lkml/2023/1/18/1420
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.2.0-rc4
- Moved a bit of code around so that kexec_load()-only builds
work, per Sourabh.
- Corrected computation of number of memory region Phdrs needed
when x86 memory hotplug is not enabled, per Baoquan.

v16: 5jan2023
https://lkml.org/lkml/2023/1/5/673
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.2.0-rc2
- Corrected error identified by Baoquan.

v15: 9dec2022
https://lkml.org/lkml/2022/12/9/520
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.1.0-rc8
- Replaced arch_un/map_crash_pages() with direct use of
kun/map_local_pages(), per Boris.
- Some x86 changes, per Boris.

v14: 16nov2022
https://lkml.org/lkml/2022/11/16/1645
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.1.0-rc5
- Introduced CRASH_HOTPLUG Kconfig item to better fine tune
compilation of feature components, per Boris.
- Removed hp_action parameter to arch_crash_handle_hotplug_event()
as it is unused.

v13: 31oct2022
https://lkml.org/lkml/2022/10/31/854
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.1.0-rc3, which means converting to use the new
kexec_trylock() away from mutex_lock(kexec_mutex).
- Moved arch_un/map_crash_pages() into kexec.h and default
implementation using k/unmap_local_pages().
- Changed more #ifdef's into IS_ENABLED()
- Changed CRASH_MAX_MEMORY_RANGES to 8192 from 32768, and it moved
into x86 crash.c as a #define rather than a Kconfig item, per Boris.
- Check number of Phdrs against PN_XNUM, max possible.

v12: 9sep2022
https://lkml.org/lkml/2022/9/9/1358
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.0-rc4
- Addressed some minor formatting items, per Baoquan

v11: 26aug2022
https://lkml.org/lkml/2022/8/26/963
https://lore.kernel.org/lkml/[email protected]/
- Rebased onto 6.0-rc2
- Redid the rework of __weak to use asm/kexec.h, per Baoquan
- Reworked some comments and minor items, per Baoquan

v10: 21jul2022
https://lkml.org/lkml/2022/7/21/1007
https://lore.kernel.org/lkml/[email protected]/
- Rebased to 5.19.0-rc7
- Per Sourabh, corrected build issue with arch_un/map_crash_pages()
for architectures not supporting this feature.
- Per David Hildenbrand, removed the WARN_ONCE() altogether.
- Per Dave Hansen, converted to use of kmap_local_page().
- Per Baoquan He, replaced use of __weak with the kexec technique.

v9: 13jun2022
https://lkml.org/lkml/2022/6/13/3382
https://lore.kernel.org/lkml/[email protected]/
- Rebased to 5.18.0
- Per Sourabh, moved crash_prepare_elf64_headers() into common
crash_core.c to avoid compile issues with kexec_load only path.
- Per David Hildenbrand, replaced mutex_trylock() with mutex_lock().
- Changed the __weak arch_crash_handle_hotplug_event() to utilize
WARN_ONCE() instead of WARN(). Fix some formatting issues.
- Per Sourabh, introduced sysfs attribute crash_hotplug for memory
and CPUs; for use by userspace (udev) to determine if the kernel
performs crash hot un/plug support.
- Per Sourabh, moved the code detecting the elfcorehdr segment from
arch/x86 into crash_core:handle_hotplug_event() so both kexec_load
and kexec_file_load can benefit.
- Updated userspace kexec-tools kexec utility to reflect change to
using CRASH_MAX_MEMORY_RANGES and get_nr_cpus().
- Updated the new proposed udev rules to reflect using the sysfs
attributes crash_hotplug.

v8: 5may2022
https://lkml.org/lkml/2022/5/5/1133
https://lore.kernel.org/lkml/[email protected]/
- Per Borislav Petkov, eliminated CONFIG_CRASH_HOTPLUG in favor
of CONFIG_HOTPLUG_CPU || CONFIG_MEMORY_HOTPLUG, i.e. a new define
is not needed. Also use of IS_ENABLED() rather than #ifdef's.
Renamed crash_hotplug_handler() to handle_hotplug_event().
And other corrections.
- Per Baoquan, minimized the parameters to the arch_crash_
handle_hotplug_event() to hp_action and cpu.
- Introduce KEXEC_CRASH_HP_INVALID_CPU definition, per Baoquan.
- Per Sourabh Jain, renamed and repurposed CRASH_HOTPLUG_ELFCOREHDR_SZ
to CONFIG_CRASH_MAX_MEMORY_RANGES, mirroring kexec-tools change
by David Hildenbrand. Folded this patch into the x86
kexec_file_load support patch.

v7: 13apr2022
https://lkml.org/lkml/2022/4/13/850
https://lore.kernel.org/lkml/[email protected]/
- Resolved parameter usage to crash_hotplug_handler(), per Baoquan.

v6: 1apr2022
https://lkml.org/lkml/2022/4/1/1203
https://lore.kernel.org/lkml/[email protected]/
- Reword commit messages and some comment cleanup per Baoquan.
- Changed elf_index to elfcorehdr_index for clarity.
- Minor code changes per Baoquan.

v5: 3mar2022
https://lkml.org/lkml/2022/3/3/674
https://lore.kernel.org/lkml/[email protected]/
- Reworded description of CRASH_HOTPLUG_ELFCOREHDR_SZ, per
David Hildenbrand.
- Refactored slightly a few patches per Baoquan recommendation.

v4: 9feb2022
https://lkml.org/lkml/2022/2/9/1406
https://lore.kernel.org/lkml/[email protected]/
- Refactored patches per Baoquan suggestions.
- A few corrections, per Baoquan.

v3: 10jan2022
https://lkml.org/lkml/2022/1/10/1212
https://lore.kernel.org/lkml/[email protected]/
- Rebasing per Baoquan He request.
- Changed memory notifier per David Hildenbrand.
- Providing example kexec userspace change in cover letter.

RFC v2: 7dec2021
https://lkml.org/lkml/2021/12/7/1088
https://lore.kernel.org/lkml/[email protected]/
- Acting upon Baoquan He suggestion of removing elfcorehdr from
the purgatory list of segments, removed purgatory code from
patchset, and it is significantly simpler now.

RFC v1: 18nov2021
https://lkml.org/lkml/2021/11/18/845
https://lore.kernel.org/lkml/[email protected]/
- working patchset demonstrating kernel handling of hotplug
updates to x86 elfcorehdr for kexec_file_load

RFC: 14dec2020
https://lkml.org/lkml/2020/12/14/532
https://lore.kernel.org/lkml/[email protected]/
- proposed concept of allowing kernel to handle hotplug update
of elfcorehdr
---


Eric DeVolder (10):
drivers/base: refactor cpu.c to use .is_visible()
drivers/base: refactor memory.c to use .is_visible()
crash: move a few code bits to setup support of crash hotplug
crash: add generic infrastructure for crash hotplug support
kexec: exclude elfcorehdr from the segment digest
crash: memory and CPU hotplug sysfs attributes
x86/crash: add x86 crash hotplug support
crash: hotplug support for kexec_load()
crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()
x86/crash: optimize CPU changes

.../ABI/testing/sysfs-devices-memory | 8 +
.../ABI/testing/sysfs-devices-system-cpu | 8 +
.../admin-guide/mm/memory-hotplug.rst | 8 +
Documentation/core-api/cpu_hotplug.rst | 18 +
arch/x86/Kconfig | 3 +
arch/x86/include/asm/kexec.h | 18 +
arch/x86/kernel/crash.c | 140 ++++++-
drivers/base/cpu.c | 141 ++++---
drivers/base/memory.c | 242 +++++++-----
include/linux/crash_core.h | 9 +
include/linux/kexec.h | 63 +++-
include/linux/tick.h | 2 +-
include/uapi/linux/kexec.h | 1 +
kernel/Kconfig.kexec | 35 ++
kernel/crash_core.c | 355 ++++++++++++++++++
kernel/kexec.c | 5 +
kernel/kexec_core.c | 6 +
kernel/kexec_file.c | 187 +--------
kernel/ksysfs.c | 15 +
19 files changed, 922 insertions(+), 342 deletions(-)

--
2.31.1



2023-06-29 20:36:14

by Eric DeVolder

Subject: [PATCH v25 09/10] crash: change crash_prepare_elf64_headers() to for_each_possible_cpu()

The function crash_prepare_elf64_headers() generates the elfcorehdr
which describes the CPUs and memory in the system for the crash kernel.
In particular, it writes out ELF PT_LOADs for the memory regions and
ELF PT_NOTEs for the CPUs in the system.

With respect to the CPUs, the current implementation utilizes
for_each_present_cpu() which means that as CPUs are added and removed,
the elfcorehdr must again be updated to reflect the new set of CPUs.

The reasoning behind the move to use for_each_possible_cpu(), is:

- At kernel boot time, all percpu crash_notes are allocated for all
possible CPUs; that is, crash_notes are not allocated dynamically
when CPUs are plugged/unplugged. Thus the crash_notes for each
possible CPU are always available.

- The crash_prepare_elf64_headers() creates an ELF PT_NOTE per CPU.
Changing to for_each_possible_cpu() is valid as the crash_notes
pointed to by each CPU PT_NOTE are present and always valid.
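The effect of iterating over possible rather than present CPUs can be
illustrated with a small userspace model. It is a sketch only and not
the kernel function: build_cpu_notes() is a hypothetical helper, a
64-bit mask stands in for the CPU masks, and notes_base and note_size
are made-up stand-ins for the physical address and size of the per-CPU
crash_notes.

```c
#include <elf.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * Fill 'phdrs' with one PT_NOTE entry for every CPU whose bit is set in
 * 'cpu_mask' (bit n = CPU n, at most 64 CPUs in this model).
 * Returns the number of entries written, never more than 'max'.
 */
size_t build_cpu_notes(Elf64_Phdr *phdrs, size_t max, uint64_t cpu_mask,
		       uint64_t notes_base, uint64_t note_size)
{
	size_t n = 0;

	for (unsigned int cpu = 0; cpu < 64 && n < max; cpu++) {
		if (!(cpu_mask & (UINT64_C(1) << cpu)))
			continue; /* CPU not in the mask, no PT_NOTE */
		memset(&phdrs[n], 0, sizeof(phdrs[n]));
		phdrs[n].p_type = PT_NOTE;
		/* Each CPU's note buffer sits at a fixed slot in this model. */
		phdrs[n].p_offset = phdrs[n].p_paddr = notes_base + cpu * note_size;
		phdrs[n].p_filesz = phdrs[n].p_memsz = note_size;
		n++;
	}
	return n;
}
```

Passing the "present" mask yields a table that has to be rebuilt
whenever a CPU is added or removed, whereas passing the "possible"
mask yields a table that stays valid across such changes.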

Furthermore, examining a common crash processing path of:

  kernel panic -> crash kernel -> makedumpfile -> 'crash' analyzer
           elfcorehdr     /proc/vmcore       vmcore

reveals how the ELF CPU PT_NOTEs are utilized:

- Upon panic, each CPU is sent an IPI and shuts itself down, recording
its state in its crash_notes. When all CPUs are shut down, the
crash kernel is launched with a pointer to the elfcorehdr.

- The crash kernel via linux/fs/proc/vmcore.c does not examine or
use the contents of the PT_NOTEs; it merely exposes them via /proc/vmcore.

- The makedumpfile utility uses /proc/vmcore and reads the CPU
PT_NOTEs to craft a nr_cpus variable, which is reported in a
header but otherwise generally unused. Makedumpfile creates the
vmcore.

- The 'crash' dump analyzer does not appear to reference the CPU
PT_NOTEs. Instead it looks up the cpu_[possible|present|online]_mask
symbols and directly examines those structure contents from vmcore
memory. From that information it is able to determine which CPUs
are present and online, and locate the corresponding crash_notes.
Said differently, it appears that 'crash' analyzer does not rely
on the ELF PT_NOTEs for CPUs; rather it obtains the information
directly via kernel symbols and the memory within the vmcore.

(There may be other vmcore generating and analysis tools that do use
these PT_NOTEs, but 'makedumpfile' and 'crash' seem to be the most
common solution.)
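For readers who want to see what counting PT_NOTEs involves, the
following sketch walks the program headers of an ELF64 image held in
memory. It is not the logic used by makedumpfile or 'crash':
count_pt_notes() is a hypothetical helper, and the count covers every
PT_NOTE in the image, including any that do not describe a CPU.

```c
#include <elf.h>
#include <stddef.h>
#include <string.h>

/*
 * Count PT_NOTE program headers in an ELF64 image held in 'buf'.
 * Returns -1 if the buffer does not start with a usable ELF64 header
 * or the program header table does not fit inside 'len' bytes.
 */
long count_pt_notes(const unsigned char *buf, size_t len)
{
	Elf64_Ehdr ehdr;
	long count = 0;

	if (len < sizeof(ehdr))
		return -1;
	memcpy(&ehdr, buf, sizeof(ehdr));
	if (memcmp(ehdr.e_ident, ELFMAG, SELFMAG) != 0 ||
	    ehdr.e_ident[EI_CLASS] != ELFCLASS64 ||
	    ehdr.e_phentsize != sizeof(Elf64_Phdr))
		return -1;
	/* Bounds check the whole program header table before reading it. */
	if (ehdr.e_phoff > len ||
	    (len - ehdr.e_phoff) / sizeof(Elf64_Phdr) < ehdr.e_phnum)
		return -1;

	for (size_t i = 0; i < ehdr.e_phnum; i++) {
		Elf64_Phdr phdr;

		memcpy(&phdr, buf + ehdr.e_phoff + i * sizeof(phdr), sizeof(phdr));
		if (phdr.p_type == PT_NOTE)
			count++;
	}
	return count;
}
```

The headers are copied out with memcpy() rather than cast in place so
that the sketch does not depend on the alignment of the buffer.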

This results in the benefit of having all CPUs described in the
elfcorehdr, and therefore reducing the need to re-generate the
elfcorehdr on CPU changes, at the small expense of an additional
56 bytes per PT_NOTE for not-present-but-possible CPUs.
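The 56 bytes is the size of one ELF64 program header. As a rough
worked example, assuming a system with 64 possible CPUs of which 16
are present, the extra cost is 48 * 56 = 2688 bytes. The helper below,
extra_elfcorehdr_bytes(), is a hypothetical name used only for this
illustration.

```c
#include <elf.h>
#include <stddef.h>

/* Each CPU PT_NOTE costs one ELF64 program header: 56 bytes. */
_Static_assert(sizeof(Elf64_Phdr) == 56, "Elf64_Phdr is expected to be 56 bytes");

/*
 * Extra elfcorehdr bytes needed when describing every possible CPU
 * instead of only the present ones.
 */
size_t extra_elfcorehdr_bytes(unsigned int possible_cpus, unsigned int present_cpus)
{
	if (possible_cpus <= present_cpus)
		return 0;
	return (size_t)(possible_cpus - present_cpus) * sizeof(Elf64_Phdr);
}
```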

On systems where kexec_file_load() syscall is utilized, all the above
is valid. On systems where kexec_load() syscall is utilized, there
may be the need for the elfcorehdr to be regenerated once. The reason
being that some archs only populate the 'present' CPUs from the
/sys/devices/system/cpu entries, which the userspace 'kexec' utility
uses to generate the userspace-supplied elfcorehdr. In this situation,
one memory or CPU change will rewrite the elfcorehdr via the
crash_prepare_elf64_headers() function and now all possible CPUs will
be described, just as with kexec_file_load() syscall.

Suggested-by: Sourabh Jain <[email protected]>
Signed-off-by: Eric DeVolder <[email protected]>
Reviewed-by: Sourabh Jain <[email protected]>
Acked-by: Hari Bathini <[email protected]>
Acked-by: Baoquan He <[email protected]>
---
kernel/crash_core.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index fa918176d46d..7378b501fada 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -364,8 +364,8 @@ int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
 	ehdr->e_ehsize = sizeof(Elf64_Ehdr);
 	ehdr->e_phentsize = sizeof(Elf64_Phdr);
 
-	/* Prepare one phdr of type PT_NOTE for each present CPU */
-	for_each_present_cpu(cpu) {
+	/* Prepare one phdr of type PT_NOTE for each possible CPU */
+	for_each_possible_cpu(cpu) {
 		phdr->p_type = PT_NOTE;
 		notes_addr = per_cpu_ptr_to_phys(per_cpu_ptr(crash_notes, cpu));
 		phdr->p_offset = phdr->p_paddr = notes_addr;
--
2.31.1