LinuxLists.cc - [PATCH 0/4] kdump: crashkernel reservation from CMA

2023-11-24 19:54:54

Subject: [PATCH 0/4] kdump: crashkernel reservation from CMA

Hi,

this series implements a new way to reserve additional crash kernel
memory using CMA.

Currently, all the memory for the crash kernel is not usable by
the 1st (production) kernel. It is also unmapped so that it can't
be corrupted by the fault that will eventually trigger the crash.
This makes sense for the memory actually used by the kexec-loaded
crash kernel image and initrd and the data prepared during the
load (vmcoreinfo, ...). However, the reserved space needs to be
much larger than that to provide enough run-time memory for the
crash kernel and the kdump userspace. Estimating the amount of
memory to reserve is difficult. Being too careful makes kdump
likely to end in OOM, being too generous takes even more memory
from the production system. Also, the reservation only allows
reserving a single contiguous block (or two with the "low"
suffix). I've seen systems where this fails because the physical
memory is fragmented.

By reserving additional crashkernel memory from CMA, the main
crashkernel reservation can be just small enough to fit the
kernel and initrd image, minimizing the memory taken away from
the production system. Most of the run-time memory for the crash
kernel will be memory previously available to userspace in the
production system. As this memory is no longer wasted, the
reservation can be done with a generous margin, making kdump more
reliable. Kernel memory that we need to preserve for dumping is
never allocated from CMA. User data is typically not dumped by
makedumpfile. When dumping of user data is intended this new CMA
reservation cannot be used.

There are four patches in this series:

The first adds a new ",cma" suffix to the recenly introduced generic
crashkernel parsing code. parse_crashkernel() takes one more
argument to store the cma reservation size.

The second patch implements reserve_crashkernel_cma() which
performs the reservation. If the requested size is not available
in a single range, multiple smaller ranges will be reserved.

The third patch enables the functionality for x86 as a proof of
concept. There are just three things every arch needs to do:
- call reserve_crashkernel_cma()
- include the CMA-reserved ranges in the physical memory map
- exclude the CMA-reserved ranges from the memory available
through /proc/vmcore by excluding them from the vmcoreinfo
PT_LOAD ranges.
Adding other architectures is easy and I can do that as soon as
this series is merged.

The fourth patch just updates Documentation/

Now, specifying
crashkernel=100M craskhernel=1G,cma
on the command line will make a standard crashkernel reservation
of 100M, where kexec will load the kernel and initrd.

An additional 1G will be reserved from CMA, still usable by the
production system. The crash kernel will have 1.1G memory
available. The 100M can be reliably predicted based on the size
of the kernel and initrd.

When no crashkernel=size,cma is specified, everything works as
before.

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-24 19:58:10

by Jiri Bohac

[permalink] [raw]

Subject: [PATCH 1/4] kdump: add crashkernel cma suffix

Add a new optional ",cma" suffix to the crashkernel= command line option.

Add a new cma_size parameter to parse_crashkernel().
When not NULL, call __parse_crashkernel to parse the CMA
reservation size from "crashkernel=size,cma" and store it
in cma_size.

Set cma_size to NULL in all calls to parse_crashkernel().

Signed-off-by: Jiri Bohac <[email protected]>
---
arch/arm/kernel/setup.c | 2 +-
arch/arm64/mm/init.c | 2 +-
arch/loongarch/kernel/setup.c | 2 +-
arch/mips/kernel/setup.c | 2 +-
arch/powerpc/kernel/fadump.c | 2 +-
arch/powerpc/kexec/core.c | 2 +-
arch/powerpc/mm/nohash/kaslr_booke.c | 2 +-
arch/riscv/mm/init.c | 2 +-
arch/s390/kernel/setup.c | 2 +-
arch/sh/kernel/machine_kexec.c | 2 +-
arch/x86/kernel/setup.c | 2 +-
include/linux/crash_core.h | 3 ++-
kernel/crash_core.c | 20 ++++++++++++++++----
13 files changed, 29 insertions(+), 16 deletions(-)

diff --git a/arch/arm/kernel/setup.c b/arch/arm/kernel/setup.c
index ff2299ce1ad7..cb940553c757 100644
--- a/arch/arm/kernel/setup.c
+++ b/arch/arm/kernel/setup.c
@@ -1010,7 +1010,7 @@ static void __init reserve_crashkernel(void)
total_mem = get_total_mem();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
/* invalid value specified or crashkernel=0 */
if (ret || !crash_size)
return;
diff --git a/arch/arm64/mm/init.c b/arch/arm64/mm/init.c
index 74c1db8ce271..819b8979584c 100644
--- a/arch/arm64/mm/init.c
+++ b/arch/arm64/mm/init.c
@@ -105,7 +105,7 @@ static void __init arch_reserve_crashkernel(void)

ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;

diff --git a/arch/loongarch/kernel/setup.c b/arch/loongarch/kernel/setup.c
index d183a745fb85..0489c8188b83 100644
--- a/arch/loongarch/kernel/setup.c
+++ b/arch/loongarch/kernel/setup.c
@@ -266,7 +266,7 @@ static void __init arch_parse_crashkernel(void)
total_mem = memblock_phys_mem_size();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
if (ret < 0 || crash_size <= 0)
return;

diff --git a/arch/mips/kernel/setup.c b/arch/mips/kernel/setup.c
index 2d2ca024bd47..98afa80ec002 100644
--- a/arch/mips/kernel/setup.c
+++ b/arch/mips/kernel/setup.c
@@ -456,7 +456,7 @@ static void __init mips_parse_crashkernel(void)
total_mem = memblock_phys_mem_size();
ret = parse_crashkernel(boot_command_line, total_mem,
&crash_size, &crash_base,
- NULL, NULL);
+ NULL, NULL, NULL);
if (ret != 0 || crash_size <= 0)
return;

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d14eda1e8589..6fa5ab01f4e8 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -313,7 +313,7 @@ static __init u64 fadump_calculate_reserve_size(void)
* memory at a predefined offset.
*/
ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &size, &base, NULL, NULL);
+ &size, &base, NULL, NULL, NULL);
if (ret == 0 && size > 0) {
unsigned long max_size;

diff --git a/arch/powerpc/kexec/core.c b/arch/powerpc/kexec/core.c
index 85846cadb9b5..c1e0afd94c90 100644
--- a/arch/powerpc/kexec/core.c
+++ b/arch/powerpc/kexec/core.c
@@ -112,7 +112,7 @@ void __init reserve_crashkernel(void)
total_mem_sz = memory_limit ? memory_limit : memblock_phys_mem_size();
/* use common parsing */
ret = parse_crashkernel(boot_command_line, total_mem_sz,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/powerpc/mm/nohash/kaslr_booke.c b/arch/powerpc/mm/nohash/kaslr_booke.c
index b4f2786a7d2b..df083fe158b6 100644
--- a/arch/powerpc/mm/nohash/kaslr_booke.c
+++ b/arch/powerpc/mm/nohash/kaslr_booke.c
@@ -178,7 +178,7 @@ static void __init get_crash_kernel(void *fdt, unsigned long size)
int ret;

ret = parse_crashkernel(boot_command_line, size, &crash_size,
- &crash_base, NULL, NULL);
+ &crash_base, NULL, NULL, NULL);
if (ret != 0 || crash_size == 0)
return;
if (crash_base == 0)
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 2e011cbddf3a..d0bae97c9a7a 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -1355,7 +1355,7 @@ static void __init arch_reserve_crashkernel(void)

ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;

diff --git a/arch/s390/kernel/setup.c b/arch/s390/kernel/setup.c
index 5701356f4f33..4d18b6b8f5ca 100644
--- a/arch/s390/kernel/setup.c
+++ b/arch/s390/kernel/setup.c
@@ -619,7 +619,7 @@ static void __init reserve_crashkernel(void)
int rc;

rc = parse_crashkernel(boot_command_line, ident_map_size,
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);

crash_base = ALIGN(crash_base, KEXEC_CRASH_MEM_ALIGN);
crash_size = ALIGN(crash_size, KEXEC_CRASH_MEM_ALIGN);
diff --git a/arch/sh/kernel/machine_kexec.c b/arch/sh/kernel/machine_kexec.c
index fa3a7b36190a..e754860a7236 100644
--- a/arch/sh/kernel/machine_kexec.c
+++ b/arch/sh/kernel/machine_kexec.c
@@ -154,7 +154,7 @@ void __init reserve_crashkernel(void)
int ret;

ret = parse_crashkernel(boot_command_line, memblock_phys_mem_size(),
- &crash_size, &crash_base, NULL, NULL);
+ &crash_size, &crash_base, NULL, NULL, NULL);
if (ret == 0 && crash_size > 0) {
crashk_res.start = crash_base;
crashk_res.end = crash_base + crash_size - 1;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index 1526747bedf2..f271b2cc3054 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -478,7 +478,7 @@ static void __init arch_reserve_crashkernel(void)

ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, &high);
+ &low_size, NULL, &high);
if (ret)
return;

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index 5126a4fecb44..f1edefcf7377 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -95,7 +95,8 @@ void final_note(Elf_Word *buf);

int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base,
- unsigned long long *low_size, bool *high);
+ unsigned long long *low_size, unsigned long long *cma_size,
+ bool *high);

#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index efe87d501c8c..1e952d2e451b 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -184,11 +184,13 @@ static int __init parse_crashkernel_simple(char *cmdline,

#define SUFFIX_HIGH 0
#define SUFFIX_LOW 1
-#define SUFFIX_NULL 2
+#define SUFFIX_CMA 2
+#define SUFFIX_NULL 3
static __initdata char *suffix_tbl[] = {
- [SUFFIX_HIGH] = ",high",
- [SUFFIX_LOW] = ",low",
- [SUFFIX_NULL] = NULL,
+ [SUFFIX_HIGH] = ",high",
+ [SUFFIX_LOW] = ",low",
+ [SUFFIX_CMA] = ",cma",
+ [SUFFIX_NULL] = NULL,
};

/*
@@ -310,9 +312,11 @@ int __init parse_crashkernel(char *cmdline,
unsigned long long *crash_size,
unsigned long long *crash_base,
unsigned long long *low_size,
+ unsigned long long *cma_size,
bool *high)
{
int ret;
+ unsigned long long cma_base;

/* crashkernel=X[@offset] */
ret = __parse_crashkernel(cmdline, system_ram, crash_size,
@@ -343,6 +347,14 @@ int __init parse_crashkernel(char *cmdline,

*high = true;
}
+
+ /*
+ * optional CMA reservation
+ * cma_base is ignored
+ */
+ if (cma_size)
+ __parse_crashkernel(cmdline, 0, cma_size,
+ &cma_base, suffix_tbl[SUFFIX_CMA]);
#endif
if (!*crash_size)
ret = -EINVAL;

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-24 19:58:24

by Jiri Bohac

[permalink] [raw]

Subject: [PATCH 2/4] kdump: implement reserve_crashkernel_cma

reserve_crashkernel_cma() reserves CMA ranges for the
crash kernel. If allocating the requested size fails,
try to reserve in smaller pieces.

Store the reserved ranges in the crashk_cma_ranges array
and the number of ranges in crashk_cma_cnt.

Signed-off-by: Jiri Bohac <[email protected]>

---
include/linux/crash_core.h | 9 +++++++
kernel/crash_core.c | 52 ++++++++++++++++++++++++++++++++++++++
2 files changed, 61 insertions(+)

diff --git a/include/linux/crash_core.h b/include/linux/crash_core.h
index f1edefcf7377..5afe3866d35a 100644
--- a/include/linux/crash_core.h
+++ b/include/linux/crash_core.h
@@ -14,6 +14,13 @@
extern struct resource crashk_res;
extern struct resource crashk_low_res;

+extern struct range crashk_cma_ranges[];
+#ifdef CONFIG_CMA
+extern int crashk_cma_cnt;
+#else
+#define crashk_cma_cnt 0
+#endif
+
#define CRASH_CORE_NOTE_NAME "CORE"
#define CRASH_CORE_NOTE_HEAD_BYTES ALIGN(sizeof(struct elf_note), 4)
#define CRASH_CORE_NOTE_NAME_BYTES ALIGN(sizeof(CRASH_CORE_NOTE_NAME), 4)
@@ -98,6 +105,8 @@ int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *low_size, unsigned long long *cma_size,
bool *high);

+void __init reserve_crashkernel_cma(unsigned long long cma_size);
+
#ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
#ifndef DEFAULT_CRASH_KERNEL_LOW_SIZE
#define DEFAULT_CRASH_KERNEL_LOW_SIZE (128UL << 20)
diff --git a/kernel/crash_core.c b/kernel/crash_core.c
index 1e952d2e451b..d71ecd59f16b 100644
--- a/kernel/crash_core.c
+++ b/kernel/crash_core.c
@@ -15,6 +15,7 @@
#include <linux/memblock.h>
#include <linux/kexec.h>
#include <linux/kmemleak.h>
+#include <linux/cma.h>

#include <asm/page.h>
#include <asm/sections.h>
@@ -475,6 +476,57 @@ void __init reserve_crashkernel_generic(char *cmdline,
}
#endif

+#ifdef CONFIG_CMA
+#define CRASHKERNEL_CMA_RANGES_MAX 4
+
+struct range crashk_cma_ranges[CRASHKERNEL_CMA_RANGES_MAX];
+int crashk_cma_cnt = 0;
+
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ unsigned long long request_size = roundup(cma_size, PAGE_SIZE);
+ unsigned long long reserved_size = 0;
+
+ while (cma_size > reserved_size &&
+ crashk_cma_cnt < CRASHKERNEL_CMA_RANGES_MAX) {
+
+ struct cma *res;
+
+ if (cma_declare_contiguous(0, request_size, 0, 0, 0, false,
+ "crashkernel", &res)) {
+ /* reservation failed, try half-sized blocks */
+ if (request_size <= PAGE_SIZE)
+ break;
+
+ request_size = roundup(request_size / 2, PAGE_SIZE);
+ continue;
+ }
+
+ crashk_cma_ranges[crashk_cma_cnt].start = cma_get_base(res);
+ crashk_cma_ranges[crashk_cma_cnt].end =
+ crashk_cma_ranges[crashk_cma_cnt].start +
+ cma_get_size(res) - 1;
+ ++crashk_cma_cnt;
+ reserved_size += request_size;
+ }
+
+ if (cma_size > reserved_size)
+ pr_warn("crashkernel CMA reservation failed: %lld MB requested, %lld MB reserved in %d ranges\n",
+ cma_size >> 20, reserved_size >> 20, crashk_cma_cnt);
+ else
+ pr_info("crashkernel CMA reserved: %lld MB in %d ranges\n",
+ reserved_size >> 20, crashk_cma_cnt);
+}
+
+#else /* CONFIG_CMA */
+struct range crashk_cma_ranges[0];
+void __init reserve_crashkernel_cma(unsigned long long cma_size)
+{
+ if (cma_size)
+ pr_warn("crashkernel CMA reservation failed: CMA disabled\n");
+}
+#endif
+
int crash_prepare_elf64_headers(struct crash_mem *mem, int need_kernel_map,
void **addr, unsigned long *sz)
{

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-24 19:58:47

by Jiri Bohac

[permalink] [raw]

Subject: [PATCH 3/4] kdump, x86: implement crashkernel CMA reservation

Implement the crashkernel CMA reservation for x86:
- enable parsing of the cma suffix by parse_crashkernel()
- reserve memory with reserve_crashkernel_cma()
- add the CMA-reserved ranges to the e820 map for the crash kernel
- exclude the CMA-reserved ranges from vmcore

Signed-off-by: Jiri Bohac <[email protected]>

---
arch/x86/kernel/crash.c | 26 ++++++++++++++++++++++----
arch/x86/kernel/setup.c | 5 +++--
2 files changed, 25 insertions(+), 6 deletions(-)

diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index c92d88680dbf..f27d09386157 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -147,10 +147,10 @@ static struct crash_mem *fill_up_crash_elf_data(void)
return NULL;

/*
- * Exclusion of crash region and/or crashk_low_res may cause
- * another range split. So add extra two slots here.
+ * Exclusion of crash region, crashk_low_res and/or crashk_cma_ranges
+ * may cause range splits. So add extra slots here.
*/
- nr_ranges += 2;
+ nr_ranges += 2 + crashk_cma_cnt;
cmem = vzalloc(struct_size(cmem, ranges, nr_ranges));
if (!cmem)
return NULL;
@@ -168,6 +168,7 @@ static struct crash_mem *fill_up_crash_elf_data(void)
static int elf_header_exclude_ranges(struct crash_mem *cmem)
{
int ret = 0;
+ int i;

/* Exclude the low 1M because it is always reserved */
ret = crash_exclude_mem_range(cmem, 0, (1<<20)-1);
@@ -182,8 +183,17 @@ static int elf_header_exclude_ranges(struct crash_mem *cmem)
if (crashk_low_res.end)
ret = crash_exclude_mem_range(cmem, crashk_low_res.start,
crashk_low_res.end);
+ if (ret)
+ return ret;

- return ret;
+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ret = crash_exclude_mem_range(cmem, crashk_cma_ranges[i].start,
+ crashk_cma_ranges[i].end);
+ if (ret)
+ return ret;
+ }
+
+ return 0;
}

static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
@@ -336,6 +346,14 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
add_e820_entry(params, &ei);
}

+ for (i = 0; i < crashk_cma_cnt; ++i) {
+ ei.addr = crashk_cma_ranges[i].start;
+ ei.size = crashk_cma_ranges[i].end -
+ crashk_cma_ranges[i].start + 1;
+ ei.type = E820_TYPE_RAM;
+ add_e820_entry(params, &ei);
+ }
+
out:
vfree(cmem);
return ret;
diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index f271b2cc3054..5994d18fd2a0 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -468,7 +468,7 @@ static void __init memblock_x86_reserve_range_setup_data(void)

static void __init arch_reserve_crashkernel(void)
{
- unsigned long long crash_base, crash_size, low_size = 0;
+ unsigned long long crash_base, crash_size, low_size = 0, cma_size = 0;
char *cmdline = boot_command_line;
bool high = false;
int ret;
@@ -478,7 +478,7 @@ static void __init arch_reserve_crashkernel(void)

ret = parse_crashkernel(cmdline, memblock_phys_mem_size(),
&crash_size, &crash_base,
- &low_size, NULL, &high);
+ &low_size, &cma_size, &high);
if (ret)
return;

@@ -489,6 +489,7 @@ static void __init arch_reserve_crashkernel(void)

reserve_crashkernel_generic(cmdline, crash_size, crash_base,
low_size, high);
+ reserve_crashkernel_cma(cma_size);
}

static struct resource standard_io_resources[] = {

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-24 19:58:57

by Jiri Bohac

[permalink] [raw]

Subject: [PATCH 4/4] kdump, documentation: describe craskernel CMA reservation

Describe the new crashkernel ",cma" suffix in Documentation/

---
Documentation/admin-guide/kdump/kdump.rst | 10 ++++++++++
Documentation/admin-guide/kernel-parameters.txt | 7 +++++++
2 files changed, 17 insertions(+)

diff --git a/Documentation/admin-guide/kdump/kdump.rst b/Documentation/admin-guide/kdump/kdump.rst
index 5762e7477a0c..4ec08e5843dc 100644
--- a/Documentation/admin-guide/kdump/kdump.rst
+++ b/Documentation/admin-guide/kdump/kdump.rst
@@ -317,6 +317,16 @@ crashkernel syntax

crashkernel=0,low

+4) crashkernel=size,cma
+
+ Reserves additional memory from CMA. A standard crashkernel reservation, as
+ described above, is still needed, but can be just small enough to hold the
+ kernel and initrd. All the memory the crash kernel needs for its runtime and
+ for running the kdump userspace processes can be provided by this CMA
+ reservation, re-using memory available to the production system's userspace.
+ Because of this re-using, the CMA reservation should not be used if it's
+ intended to dump userspce memory.
+
Boot into System Kernel
-----------------------
1) Update the boot loader (such as grub, yaboot, or lilo) configuration
diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 65731b060e3f..ee9fc40a97fd 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -914,6 +914,13 @@
0: to disable low allocation.
It will be ignored when crashkernel=X,high is not used
or memory reserved is below 4G.
+ crashkernel=size[KMG],cma
+ [KNL, X86] Reserve additional crash kernel memory from CMA.
+ This reservation is usable by the 1st system's userspace,
+ so this should not be used if dumping of userspace
+ memory is intended. A standard crashkernel reservation,
+ as described above, is still needed to hold the crash
+ kernel and initrd.

cryptomgr.notests
[KNL] Disable crypto self-tests

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-25 01:58:35

by Tao Liu

[permalink] [raw]

Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

Hi Jiri,

On Sat, Nov 25, 2023 at 3:55 AM Jiri Bohac <[email protected]> wrote:
>
> Hi,
>
> this series implements a new way to reserve additional crash kernel
> memory using CMA.
>
> Currently, all the memory for the crash kernel is not usable by
> the 1st (production) kernel. It is also unmapped so that it can't
> be corrupted by the fault that will eventually trigger the crash.
> This makes sense for the memory actually used by the kexec-loaded
> crash kernel image and initrd and the data prepared during the
> load (vmcoreinfo, ...). However, the reserved space needs to be
> much larger than that to provide enough run-time memory for the
> crash kernel and the kdump userspace. Estimating the amount of
> memory to reserve is difficult. Being too careful makes kdump
> likely to end in OOM, being too generous takes even more memory
> from the production system. Also, the reservation only allows
> reserving a single contiguous block (or two with the "low"
> suffix). I've seen systems where this fails because the physical
> memory is fragmented.
>
> By reserving additional crashkernel memory from CMA, the main
> crashkernel reservation can be just small enough to fit the
> kernel and initrd image, minimizing the memory taken away from
> the production system. Most of the run-time memory for the crash
> kernel will be memory previously available to userspace in the
> production system. As this memory is no longer wasted, the
> reservation can be done with a generous margin, making kdump more
> reliable. Kernel memory that we need to preserve for dumping is
> never allocated from CMA. User data is typically not dumped by
> makedumpfile. When dumping of user data is intended this new CMA
> reservation cannot be used.
>

Thanks for the idea of using CMA as part of memory for the 2nd kernel.
However I have a question:

What if there is on-going DMA/RDMA access on the CMA range when 1st
kernel crash? There might be data corruption when 2nd kernel and
DMA/RDMA write to the same place, how to address such an issue?

Thanks,
Tao Liu

> There are four patches in this series:
>
> The first adds a new ",cma" suffix to the recenly introduced generic
> crashkernel parsing code. parse_crashkernel() takes one more
> argument to store the cma reservation size.
>
> The second patch implements reserve_crashkernel_cma() which
> performs the reservation. If the requested size is not available
> in a single range, multiple smaller ranges will be reserved.
>
> The third patch enables the functionality for x86 as a proof of
> concept. There are just three things every arch needs to do:
> - call reserve_crashkernel_cma()
> - include the CMA-reserved ranges in the physical memory map
> - exclude the CMA-reserved ranges from the memory available
> through /proc/vmcore by excluding them from the vmcoreinfo
> PT_LOAD ranges.
> Adding other architectures is easy and I can do that as soon as
> this series is merged.
>
> The fourth patch just updates Documentation/
>
> Now, specifying
> crashkernel=100M craskhernel=1G,cma
> on the command line will make a standard crashkernel reservation
> of 100M, where kexec will load the kernel and initrd.
>
> An additional 1G will be reserved from CMA, still usable by the
> production system. The crash kernel will have 1.1G memory
> available. The 100M can be reliably predicted based on the size
> of the kernel and initrd.
>
> When no crashkernel=size,cma is specified, everything works as
> before.
>
> --
> Jiri Bohac <[email protected]>
> SUSE Labs, Prague, Czechia
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>

2023-11-25 07:32:46

by kernel test robot

[permalink] [raw]

Subject: Re: [PATCH 1/4] kdump: add crashkernel cma suffix

Hi Jiri,

kernel test robot noticed the following build warnings:

[auto build test WARNING on powerpc/next]
[also build test WARNING on powerpc/fixes s390/features linus/master v6.7-rc2 next-20231124]
[cannot apply to tip/x86/core arm64/for-next/core]
[If your patch is applied to the wrong git tree, kindly drop us a note.
And when submitting patch, we suggest to use '--base' as documented in
https://git-scm.com/docs/git-format-patch#_base_tree_information]

url: https://github.com/intel-lab-lkp/linux/commits/Jiri-Bohac/kdump-implement-reserve_crashkernel_cma/20231125-082724
base: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next
patch link: https://lore.kernel.org/r/ZWEAPXiCCgAf1WrY%40dwarf.suse.cz
patch subject: [PATCH 1/4] kdump: add crashkernel cma suffix
config: alpha-randconfig-r081-20231125 (https://download.01.org/0day-ci/archive/20231125/[email protected]/config)
compiler: alpha-linux-gcc (GCC) 13.2.0
reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20231125/[email protected]/reproduce)

If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-kbuild-all/[email protected]/

All warnings (new ones prefixed by >>):

kernel/crash_core.c: In function 'parse_crashkernel':
>> kernel/crash_core.c:319:28: warning: unused variable 'cma_base' [-Wunused-variable]
319 | unsigned long long cma_base;
| ^~~~~~~~

vim +/cma_base +319 kernel/crash_core.c

302
303 /*
304 * That function is the entry point for command line parsing and should be
305 * called from the arch-specific code.
306 *
307 * If crashkernel=,high|low is supported on architecture, non-NULL values
308 * should be passed to parameters 'low_size' and 'high'.
309 */
310 int __init parse_crashkernel(char *cmdline,
311 unsigned long long system_ram,
312 unsigned long long *crash_size,
313 unsigned long long *crash_base,
314 unsigned long long *low_size,
315 unsigned long long *cma_size,
316 bool *high)
317 {
318 int ret;
> 319 unsigned long long cma_base;
320
321 /* crashkernel=X[@offset] */
322 ret = __parse_crashkernel(cmdline, system_ram, crash_size,
323 crash_base, NULL);
324 #ifdef CONFIG_ARCH_HAS_GENERIC_CRASHKERNEL_RESERVATION
325 /*
326 * If non-NULL 'high' passed in and no normal crashkernel
327 * setting detected, try parsing crashkernel=,high|low.
328 */
329 if (high && ret == -ENOENT) {
330 ret = __parse_crashkernel(cmdline, 0, crash_size,
331 crash_base, suffix_tbl[SUFFIX_HIGH]);
332 if (ret || !*crash_size)
333 return -EINVAL;
334
335 /*
336 * crashkernel=Y,low can be specified or not, but invalid value
337 * is not allowed.
338 */
339 ret = __parse_crashkernel(cmdline, 0, low_size,
340 crash_base, suffix_tbl[SUFFIX_LOW]);
341 if (ret == -ENOENT) {
342 *low_size = DEFAULT_CRASH_KERNEL_LOW_SIZE;
343 ret = 0;
344 } else if (ret) {
345 return ret;
346 }
347
348 *high = true;
349 }
350
351 /*
352 * optional CMA reservation
353 * cma_base is ignored
354 */
355 if (cma_size)
356 __parse_crashkernel(cmdline, 0, cma_size,
357 &cma_base, suffix_tbl[SUFFIX_CMA]);
358 #endif
359 if (!*crash_size)
360 ret = -EINVAL;
361
362 return ret;
363 }
364

--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki

2023-11-25 21:27:38

by Jiri Bohac

[permalink] [raw]

Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

Hi Tao,

On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> However I have a question:
>
> What if there is on-going DMA/RDMA access on the CMA range when 1st
> kernel crash? There might be data corruption when 2nd kernel and
> DMA/RDMA write to the same place, how to address such an issue?

The crash kernel CMA area(s) registered via
cma_declare_contiguous() are distinct from the
dma_contiguous_default_area or device-specific CMA areas that
dma_alloc_contiguous() would use to reserve memory for DMA.

Kernel pages will not be allocated from the crash kernel CMA
area(s), because they are not GFP_MOVABLE. The CMA area will only
be used for user pages.

User pages for RDMA, should be pinned with FOLL_LONGTERM and that
would migrate them away from the CMA area.

But you're right that DMA to user pages pinned without
FOLL_LONGTERM would still be possible. Would this be a problem in
practice? Do you see any way around it?

Thanks,

--
Jiri Bohac <[email protected]>
SUSE Labs, Prague, Czechia

2023-11-28 01:13:32

by Tao Liu

[permalink] [raw]

Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

Hi Jiri,

On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac <[email protected]> wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

Thanks for the explanation! Sorry I don't have any ideas so far...

@Pingfan Liu @Baoquan He Hi, do you have any suggestions for it?

Thanks,
Tao Liu

> Thanks,
>
> --
> Jiri Bohac <[email protected]>
> SUSE Labs, Prague, Czechia
>

2023-11-28 02:11:58

by Pingfan Liu

[permalink] [raw]

Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

On Sun, Nov 26, 2023 at 5:24 AM Jiri Bohac <[email protected]> wrote:
>
> Hi Tao,
>
> On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > However I have a question:
> >
> > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > kernel crash? There might be data corruption when 2nd kernel and
> > DMA/RDMA write to the same place, how to address such an issue?
>
> The crash kernel CMA area(s) registered via
> cma_declare_contiguous() are distinct from the
> dma_contiguous_default_area or device-specific CMA areas that
> dma_alloc_contiguous() would use to reserve memory for DMA.
>
> Kernel pages will not be allocated from the crash kernel CMA
> area(s), because they are not GFP_MOVABLE. The CMA area will only
> be used for user pages.
>
> User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> would migrate them away from the CMA area.
>
> But you're right that DMA to user pages pinned without
> FOLL_LONGTERM would still be possible. Would this be a problem in
> practice? Do you see any way around it?
>

I have not a real case in mind. But this problem has kept us from
using the CMA area in kdump for years. Most importantly, this method
will introduce an uneasy tracking bug.

For a way around, maybe you can introduce a specific zone, and for any
GUP, migrate the pages away. I have doubts about whether this approach
is worthwhile, considering the trade-off between benefits and
complexity.

Thanks,

Pingfan

2023-11-28 02:12:07

by Baoquan He

[permalink] [raw]

Subject: Re: [PATCH 0/4] kdump: crashkernel reservation from CMA

On 11/28/23 at 09:12am, Tao Liu wrote:
> Hi Jiri,
>
> On Sun, Nov 26, 2023 at 5:22 AM Jiri Bohac <[email protected]> wrote:
> >
> > Hi Tao,
> >
> > On Sat, Nov 25, 2023 at 09:51:54AM +0800, Tao Liu wrote:
> > > Thanks for the idea of using CMA as part of memory for the 2nd kernel.
> > > However I have a question:
> > >
> > > What if there is on-going DMA/RDMA access on the CMA range when 1st
> > > kernel crash? There might be data corruption when 2nd kernel and
> > > DMA/RDMA write to the same place, how to address such an issue?
> >
> > The crash kernel CMA area(s) registered via
> > cma_declare_contiguous() are distinct from the
> > dma_contiguous_default_area or device-specific CMA areas that
> > dma_alloc_contiguous() would use to reserve memory for DMA.
> >
> > Kernel pages will not be allocated from the crash kernel CMA
> > area(s), because they are not GFP_MOVABLE. The CMA area will only
> > be used for user pages.
> >
> > User pages for RDMA, should be pinned with FOLL_LONGTERM and that
> > would migrate them away from the CMA area.
> >
> > But you're right that DMA to user pages pinned without
> > FOLL_LONGTERM would still be possible. Would this be a problem in
> > practice? Do you see any way around it?

Thanks for the effort to bring this up, Jiri.

I am wondering how you will use this crashkernel=,cma parameter. I mean
the scenario of crashkernel=,cma. Asking this because I don't know how
SUSE deploy kdump in SUSE distros. In SUSE distros, kdump kernel's
initramfs is the same as the 1st kernel, or only contain those needed
kernel modules for needed devices. E.g if we dump to local disk, NIC
driver will be filter out? If latter case, It's possibly having the
on-flight DMA issue, e.g NIC has DMA buffer in the CMA area, but not
reset during kdump bootup because the NIC driver is not loaded in to
initialize. Not sure if this is 100%, possible in theory?

Recently we are seeing an issue that on a HPE system, PCI error messages
are always seen in kdump kernel, while it's a local dump, NIC device is
not needed and the igb driver is not loaded in. Then adding igb driver
into kdump initramfs can work around it. It's similar with above
on-flight DMA.

The crashkernel=,cma requires no userspace data dumping, from our
support engineers' feedback, customer never express they don't need to
dump user space data. Assume a server with huge databse deployed, and
the database often collapsed recently and database provider claimed that
it's not database's fault, OS need prove their innocence. What will you
do?

So this looks like a nice to have to me. At least in fedora/rhel's
usage, we may only back port this patch, and add one sentence in our
user guide saying "there's a crashkernel=,cma added, can be used with
crashkernel= to save memory. Please feel free to try if you like".
Unless SUSE or other distros decides to use it as default config or
something like that. Please correct me if I missed anything or took
anything wrong.

Thanks
Baoquan

2023-11-28 08:59:12