2019-10-12 02:23:02

by Lianbo Jiang

Subject: [PATCH 0/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

purgatory() mainly does two things:

[1] Verify the sha256 hashes of the various segments.
Let's keep this code and leave its logic untouched.

[2] Copy the first 640k of memory to a backup region.
Let's remove this step and clean up all the code related to the backup
region.

This patch series will remove the backup region, because the current
handling of copying the first 640k runs into problems when SME is
active.

The low 1MiB region will always be reserved when the crashkernel kernel
command line option is specified. With that reservation in place,
nothing needs to be done with the low 1MiB region, because memory
allocated later won't fall into it.
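
As an illustration only (not part of the series), the effect of the
reservation can be sketched with a toy bottom-up allocator in userspace
C. The names toy_alloc_bottom_up, struct range and NO_MEM are
hypothetical, and this is not how memblock is implemented; it only
shows that once [0, 1MiB) is reserved, later allocations land at or
above 1MiB.

```c
#include <stdint.h>

#define SZ_1M  0x100000ULL
#define NO_MEM UINT64_MAX   /* hypothetical "allocation failed" marker */

/* Half-open physical range [start, end); hypothetical helper type. */
struct range {
    uint64_t start;
    uint64_t end;
};

/*
 * Toy allocator: return the lowest base of a `size`-byte block inside
 * [0, limit) that does not overlap the single `reserved` range, or
 * NO_MEM if nothing fits. `reserved` may be NULL (nothing reserved).
 */
uint64_t toy_alloc_bottom_up(const struct range *reserved,
                             uint64_t limit, uint64_t size)
{
    uint64_t base = 0;

    if (size == 0 || size > limit)
        return NO_MEM;

    /* Skip past the reserved range if the candidate block overlaps it. */
    if (reserved && base < reserved->end && base + size > reserved->start)
        base = reserved->end;

    /* The block must still fit below the limit. */
    if (base > limit - size)
        return NO_MEM;

    return base;
}
```

With the low 1MiB passed in as the reserved range, a 4KiB request is
placed at 0x100000 rather than at 0.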

This series includes three patches:
[1] Fix 'kmem -s' reported an invalid freepointer when SME was active
The low 1MiB region will always be reserved when the crashkernel
kernel command line option is specified, which ensures that the
memory allocated later won't fall into the low 1MiB area.

[2] x86/kdump cleanup: remove the unused crash_copy_backup_region()
The crash_copy_backup_region() has never been used, so clean
up the redundant code.

[3] x86/kdump: clean up all the code related to the backup region
Remove the backup region and clean up.

Changes since v1:
[1] Add extra checking condition: when the crashkernel option is
specified, reserve the low 640k area.

Changes since v2:
[1] Reserve the low 1MiB region only when the crashkernel option is
specified (suggested by Eric).

[2] Remove the unused crash_copy_backup_region()

[3] Remove the backup region and clean up

[4] Split them into three patches

Lianbo Jiang (3):
x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was
active
x86/kdump cleanup: remove the unused crash_copy_backup_region()
x86/kdump: clean up all the code related to the backup region

arch/x86/include/asm/crash.h | 1 -
arch/x86/include/asm/kexec.h | 10 ----
arch/x86/include/asm/purgatory.h | 10 ----
arch/x86/kernel/crash.c | 91 ++++++------------------------
arch/x86/kernel/machine_kexec_64.c | 47 ---------------
arch/x86/purgatory/purgatory.c | 19 -------
arch/x86/realmode/init.c | 11 ++++
7 files changed, 27 insertions(+), 162 deletions(-)

--
2.17.1


2019-10-12 02:23:57

by Lianbo Jiang

Subject: [PATCH 1/3 v3] x86/kdump: Fix 'kmem -s' reported an invalid freepointer when SME was active

Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=204793

The kdump kernel reuses the first 640k region for several reasons; for
example, the trampoline and the conventional PC system BIOS region may
need memory allocated in this area. Because the kdump kernel will
overwrite the first 640k region, the kernel has to copy its contents
to a backup area, which is done in purgatory(), since the vmcore may
need the old memory. When the vmcore is dumped, the kdump kernel reads
the old contents of the first 640k area from the backup area.
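
For readers unfamiliar with the backup step, here is a minimal
userspace sketch of what the copy amounts to. It is modeled on the
copy_backup_region() that patch 3/3 removes, but struct backup_params
and copy_backup_region_sketch() are hypothetical names, and ordinary
buffers stand in for physical memory.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-ins for the purgatory_backup_* symbols. */
struct backup_params {
    void *dest;       /* backup area inside the reserved crash memory */
    const void *src;  /* start of the low region being saved */
    size_t size;      /* number of bytes to save (up to 640k in the patch) */
};

/*
 * Copy the low region aside before the second kernel reuses it.
 * Returns the number of bytes copied; 0 if no backup area was set up.
 */
size_t copy_backup_region_sketch(const struct backup_params *p)
{
    if (!p->dest || !p->src)
        return 0;

    memcpy(p->dest, p->src, p->size);
    return p->size;
}
```

This plain memcpy() is the step that yields wrong contents in the
backup area when SME is active, as described next.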

The root cause is that the kernel does not handle the first 640k
region correctly when SME is active, so the old memory is not copied
properly to the backup area in purgatory(). As a result, the kdump
kernel reads incorrect contents from the backup area when dumping the
vmcore. The symptom is as follows:

[root linux]$ crash vmlinux /var/crash/127.0.0.1-2019-09-19-08\:31\:27/vmcore
WARNING: kernel relocated [240MB]: patching 97110 gdb minimal_symbol values

KERNEL: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmlinux
DUMPFILE: /var/crash/127.0.0.1-2019-09-19-08:31:27/vmcore [PARTIAL DUMP]
CPUS: 128
DATE: Thu Sep 19 08:31:18 2019
UPTIME: 00:01:21
LOAD AVERAGE: 0.16, 0.07, 0.02
TASKS: 1343
NODENAME: amd-ethanol
RELEASE: 5.3.0-rc7+
VERSION: #4 SMP Thu Sep 19 08:14:00 EDT 2019
MACHINE: x86_64 (2195 Mhz)
MEMORY: 127.9 GB
PANIC: "Kernel panic - not syncing: sysrq triggered crash"
PID: 9789
COMMAND: "bash"
TASK: "ffff89711894ae80 [THREAD_INFO: ffff89711894ae80]"
CPU: 83
STATE: TASK_RUNNING (PANIC)

crash> kmem -s|grep -i invalid
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
kmem: dma-kmalloc-512: slab:ffffd77680001c00 invalid freepointer:a6086ac099f0c5a4
crash>

BTW: I also tried to fix the above problem in purgatory(), but there
are too many restrictions in the purgatory() context; for example, I
can't allocate new memory to create the identity mapping page table
for the SME situation.

Currently, there are two places where the first 640k area is needed:
the first is find_trampoline_placement() and the other is
reserve_real_mode(), and their content doesn't matter. To avoid the
above error, when the crashkernel kernel command line option is
specified, let's reserve the remaining low 1MiB memory (after
reserving the real mode memory) so that allocated memory does not fall
into the low 1MiB area. This removes the need to copy the first 640k
content to a backup region in purgatory().
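
The condition that triggers the reservation in the diff below is a
plain substring search on the boot command line. A userspace sketch of
that check follows; crashkernel_requested() is a hypothetical name, and
the sketch only mirrors the strstr() test, not the memblock call.

```c
#include <stdbool.h>
#include <string.h>

/*
 * Return true if the command line contains "crashkernel=" anywhere.
 * Because this is a substring match, it is also true for any other
 * parameter whose name merely ends in "crashkernel=".
 */
bool crashkernel_requested(const char *cmdline)
{
    return strstr(cmdline, "crashkernel=") != NULL;
}
```

Note that the check does not parse the value, so in this sketch the
result is the same whatever size or offset follows the '='.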

In addition, all the code related to the backup region needs to be
cleaned up later.

Signed-off-by: Lianbo Jiang <[email protected]>
---
arch/x86/realmode/init.c | 11 +++++++++++
1 file changed, 11 insertions(+)

diff --git a/arch/x86/realmode/init.c b/arch/x86/realmode/init.c
index 7dce39c8c034..bf4c8ffc5ed9 100644
--- a/arch/x86/realmode/init.c
+++ b/arch/x86/realmode/init.c
@@ -34,6 +34,17 @@ void __init reserve_real_mode(void)

memblock_reserve(mem, size);
set_real_mode_mem(mem);
+
+#ifdef CONFIG_KEXEC_CORE
+ /*
+ * When the crashkernel option is specified, only use the low
+ * 1MiB for the real mode trampoline.
+ */
+ if (strstr(boot_command_line, "crashkernel=")) {
+ memblock_reserve(0, SZ_1M);
+ pr_info("Reserving low 1MiB of memory for crashkernel\n");
+ }
+#endif /* CONFIG_KEXEC_CORE */
}

static void __init setup_real_mode(void)
--
2.17.1

2019-10-12 02:24:47

by Lianbo Jiang

Subject: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

When the crashkernel kernel command line option is specified, the
low 1MiB memory is always reserved, so memory allocated later won't
fall into the low 1MiB area. It is therefore unnecessary to create a
backup region or to copy the first 640k content to it.

The code related to the backup region can now be safely removed, so
let's clean it up.

Signed-off-by: Lianbo Jiang <[email protected]>
---
arch/x86/include/asm/kexec.h | 10 ----
arch/x86/include/asm/purgatory.h | 10 ----
arch/x86/kernel/crash.c | 91 ++++++------------------------
arch/x86/kernel/machine_kexec_64.c | 47 ---------------
arch/x86/purgatory/purgatory.c | 19 -------
5 files changed, 16 insertions(+), 161 deletions(-)

diff --git a/arch/x86/include/asm/kexec.h b/arch/x86/include/asm/kexec.h
index 5e7d6b46de97..6802c59e8252 100644
--- a/arch/x86/include/asm/kexec.h
+++ b/arch/x86/include/asm/kexec.h
@@ -66,10 +66,6 @@ struct kimage;
# define KEXEC_ARCH KEXEC_ARCH_X86_64
#endif

-/* Memory to backup during crash kdump */
-#define KEXEC_BACKUP_SRC_START (0UL)
-#define KEXEC_BACKUP_SRC_END (640 * 1024UL - 1) /* 640K */
-
/*
* This function is responsible for capturing register states if coming
* via panic otherwise just fix up the ss and sp if coming via kernel
@@ -154,12 +150,6 @@ struct kimage_arch {
pud_t *pud;
pmd_t *pmd;
pte_t *pte;
- /* Details of backup region */
- unsigned long backup_src_start;
- unsigned long backup_src_sz;
-
- /* Physical address of backup segment */
- unsigned long backup_load_addr;

/* Core ELF header buffer */
void *elf_headers;
diff --git a/arch/x86/include/asm/purgatory.h b/arch/x86/include/asm/purgatory.h
index 92c34e517da1..5528e9325049 100644
--- a/arch/x86/include/asm/purgatory.h
+++ b/arch/x86/include/asm/purgatory.h
@@ -6,16 +6,6 @@
#include <linux/purgatory.h>

extern void purgatory(void);
-/*
- * These forward declarations serve two purposes:
- *
- * 1) Make sparse happy when checking arch/purgatory
- * 2) Document that these are required to be global so the symbol
- * lookup in kexec works
- */
-extern unsigned long purgatory_backup_dest;
-extern unsigned long purgatory_backup_src;
-extern unsigned long purgatory_backup_sz;
#endif /* __ASSEMBLY__ */

#endif /* _ASM_PURGATORY_H */
diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
index eb651fbde92a..cc5774fc84c0 100644
--- a/arch/x86/kernel/crash.c
+++ b/arch/x86/kernel/crash.c
@@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)

#ifdef CONFIG_KEXEC_FILE

-static unsigned long crash_zero_bytes;
-
static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
{
unsigned int *nr_ranges = arg;
@@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
{
struct crash_mem *cmem = arg;

- cmem->ranges[cmem->nr_ranges].start = res->start;
- cmem->ranges[cmem->nr_ranges].end = res->end;
- cmem->nr_ranges++;
+ if (res->start >= SZ_1M) {
+ cmem->ranges[cmem->nr_ranges].start = res->start;
+ cmem->ranges[cmem->nr_ranges].end = res->end;
+ cmem->nr_ranges++;
+ } else if (res->end > SZ_1M) {
+ cmem->ranges[cmem->nr_ranges].start = SZ_1M;
+ cmem->ranges[cmem->nr_ranges].end = res->end;
+ cmem->nr_ranges++;
+ }

return 0;
}
@@ -246,9 +250,7 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
unsigned long *sz)
{
struct crash_mem *cmem;
- Elf64_Ehdr *ehdr;
- Elf64_Phdr *phdr;
- int ret, i;
+ int ret;

cmem = fill_up_crash_elf_data();
if (!cmem)
@@ -270,19 +272,6 @@ static int prepare_elf_headers(struct kimage *image, void **addr,
if (ret)
goto out;

- /*
- * If a range matches backup region, adjust offset to backup
- * segment.
- */
- ehdr = (Elf64_Ehdr *)*addr;
- phdr = (Elf64_Phdr *)(ehdr + 1);
- for (i = 0; i < ehdr->e_phnum; phdr++, i++)
- if (phdr->p_type == PT_LOAD &&
- phdr->p_paddr == image->arch.backup_src_start &&
- phdr->p_memsz == image->arch.backup_src_sz) {
- phdr->p_offset = image->arch.backup_load_addr;
- break;
- }
out:
vfree(cmem);
return ret;
@@ -321,19 +310,11 @@ static int memmap_exclude_ranges(struct kimage *image, struct crash_mem *cmem,
unsigned long long mend)
{
unsigned long start, end;
- int ret = 0;

cmem->ranges[0].start = mstart;
cmem->ranges[0].end = mend;
cmem->nr_ranges = 1;

- /* Exclude Backup region */
- start = image->arch.backup_load_addr;
- end = start + image->arch.backup_src_sz - 1;
- ret = crash_exclude_mem_range(cmem, start, end);
- if (ret)
- return ret;
-
/* Exclude elf header region */
start = image->arch.elf_load_addr;
end = start + image->arch.elf_headers_sz - 1;
@@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
memset(&cmd, 0, sizeof(struct crash_memmap_data));
cmd.params = params;

- /* Add first 640K segment */
- ei.addr = image->arch.backup_src_start;
- ei.size = image->arch.backup_src_sz;
+ /*
+ * Add the low memory range[0x1000, SZ_1M], skip
+ * the first zero page.
+ */
+ ei.addr = PAGE_SIZE;
+ ei.size = SZ_1M - PAGE_SIZE;
ei.type = E820_TYPE_RAM;
add_e820_entry(params, &ei);

@@ -409,55 +393,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
return ret;
}

-static int determine_backup_region(struct resource *res, void *arg)
-{
- struct kimage *image = arg;
-
- image->arch.backup_src_start = res->start;
- image->arch.backup_src_sz = resource_size(res);
-
- /* Expecting only one range for backup region */
- return 1;
-}
-
int crash_load_segments(struct kimage *image)
{
int ret;
struct kexec_buf kbuf = { .image = image, .buf_min = 0,
.buf_max = ULONG_MAX, .top_down = false };

- /*
- * Determine and load a segment for backup area. First 640K RAM
- * region is backup source
- */
-
- ret = walk_system_ram_res(KEXEC_BACKUP_SRC_START, KEXEC_BACKUP_SRC_END,
- image, determine_backup_region);
-
- /* Zero or postive return values are ok */
- if (ret < 0)
- return ret;
-
- /* Add backup segment. */
- if (image->arch.backup_src_sz) {
- kbuf.buffer = &crash_zero_bytes;
- kbuf.bufsz = sizeof(crash_zero_bytes);
- kbuf.memsz = image->arch.backup_src_sz;
- kbuf.buf_align = PAGE_SIZE;
- /*
- * Ideally there is no source for backup segment. This is
- * copied in purgatory after crash. Just add a zero filled
- * segment for now to make sure checksum logic works fine.
- */
- ret = kexec_add_buffer(&kbuf);
- if (ret)
- return ret;
- image->arch.backup_load_addr = kbuf.mem;
- pr_debug("Loaded backup region at 0x%lx backup_start=0x%lx memsz=0x%lx\n",
- image->arch.backup_load_addr,
- image->arch.backup_src_start, kbuf.memsz);
- }
-
/* Prepare elf headers and add a segment */
ret = prepare_elf_headers(image, &kbuf.buffer, &kbuf.bufsz);
if (ret)
diff --git a/arch/x86/kernel/machine_kexec_64.c b/arch/x86/kernel/machine_kexec_64.c
index 5dcd438ad8f2..16e125a50b33 100644
--- a/arch/x86/kernel/machine_kexec_64.c
+++ b/arch/x86/kernel/machine_kexec_64.c
@@ -298,48 +298,6 @@ static void load_segments(void)
);
}

-#ifdef CONFIG_KEXEC_FILE
-/* Update purgatory as needed after various image segments have been prepared */
-static int arch_update_purgatory(struct kimage *image)
-{
- int ret = 0;
-
- if (!image->file_mode)
- return 0;
-
- /* Setup copying of backup region */
- if (image->type == KEXEC_TYPE_CRASH) {
- ret = kexec_purgatory_get_set_symbol(image,
- "purgatory_backup_dest",
- &image->arch.backup_load_addr,
- sizeof(image->arch.backup_load_addr), 0);
- if (ret)
- return ret;
-
- ret = kexec_purgatory_get_set_symbol(image,
- "purgatory_backup_src",
- &image->arch.backup_src_start,
- sizeof(image->arch.backup_src_start), 0);
- if (ret)
- return ret;
-
- ret = kexec_purgatory_get_set_symbol(image,
- "purgatory_backup_sz",
- &image->arch.backup_src_sz,
- sizeof(image->arch.backup_src_sz), 0);
- if (ret)
- return ret;
- }
-
- return ret;
-}
-#else /* !CONFIG_KEXEC_FILE */
-static inline int arch_update_purgatory(struct kimage *image)
-{
- return 0;
-}
-#endif /* CONFIG_KEXEC_FILE */
-
int machine_kexec_prepare(struct kimage *image)
{
unsigned long start_pgtable;
@@ -353,11 +311,6 @@ int machine_kexec_prepare(struct kimage *image)
if (result)
return result;

- /* update purgatory as needed */
- result = arch_update_purgatory(image);
- if (result)
- return result;
-
return 0;
}

diff --git a/arch/x86/purgatory/purgatory.c b/arch/x86/purgatory/purgatory.c
index 3b95410ff0f8..2961234d0795 100644
--- a/arch/x86/purgatory/purgatory.c
+++ b/arch/x86/purgatory/purgatory.c
@@ -14,28 +14,10 @@

#include "../boot/string.h"

-unsigned long purgatory_backup_dest __section(.kexec-purgatory);
-unsigned long purgatory_backup_src __section(.kexec-purgatory);
-unsigned long purgatory_backup_sz __section(.kexec-purgatory);
-
u8 purgatory_sha256_digest[SHA256_DIGEST_SIZE] __section(.kexec-purgatory);

struct kexec_sha_region purgatory_sha_regions[KEXEC_SEGMENT_MAX] __section(.kexec-purgatory);

-/*
- * On x86, second kernel requries first 640K of memory to boot. Copy
- * first 640K to a backup region in reserved memory range so that second
- * kernel can use first 640K.
- */
-static int copy_backup_region(void)
-{
- if (purgatory_backup_dest) {
- memcpy((void *)purgatory_backup_dest,
- (void *)purgatory_backup_src, purgatory_backup_sz);
- }
- return 0;
-}
-
static int verify_sha256_digest(void)
{
struct kexec_sha_region *ptr, *end;
@@ -66,7 +48,6 @@ void purgatory(void)
for (;;)
;
}
- copy_backup_region();
}

/*
--
2.17.1

2019-10-12 02:26:27

by Lianbo Jiang

Subject: [PATCH 2/3 v3] x86/kdump cleanup: remove the unused crash_copy_backup_region()

The crash_copy_backup_region() has never been used, so clean
up the redundant code.

Signed-off-by: Lianbo Jiang <[email protected]>
---
arch/x86/include/asm/crash.h | 1 -
1 file changed, 1 deletion(-)

diff --git a/arch/x86/include/asm/crash.h b/arch/x86/include/asm/crash.h
index 0acf5ee45a21..089b2850f9d1 100644
--- a/arch/x86/include/asm/crash.h
+++ b/arch/x86/include/asm/crash.h
@@ -3,7 +3,6 @@
#define _ASM_X86_CRASH_H

int crash_load_segments(struct kimage *image);
-int crash_copy_backup_region(struct kimage *image);
int crash_setup_memmap_entries(struct kimage *image,
struct boot_params *params);
void crash_smp_send_stop(void);
--
2.17.1

2019-10-12 11:34:58

by Eric W. Biederman

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

Lianbo Jiang <[email protected]> writes:

> When the crashkernel kernel command line option is specified, the
> low 1MiB memory will always be reserved, which makes that the memory
> allocated later won't fall into the low 1MiB area, thereby, it's not
> necessary to create a backup region and also no need to copy the first
> 640k content to a backup region.
>
> Currently, the code related to the backup region can be safely removed,
> so lets clean up.
>
> Signed-off-by: Lianbo Jiang <[email protected]>
> ---

> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> index eb651fbde92a..cc5774fc84c0 100644
> --- a/arch/x86/kernel/crash.c
> +++ b/arch/x86/kernel/crash.c
> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>
> #ifdef CONFIG_KEXEC_FILE
>
> -static unsigned long crash_zero_bytes;
> -
> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
> {
> unsigned int *nr_ranges = arg;
> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
> {
> struct crash_mem *cmem = arg;
>
> - cmem->ranges[cmem->nr_ranges].start = res->start;
> - cmem->ranges[cmem->nr_ranges].end = res->end;
> - cmem->nr_ranges++;
> + if (res->start >= SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = res->start;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + } else if (res->end > SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + }

What is going on with this chunk? I can guess but this needs a clear
comment.

>
> return 0;
> }

> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> memset(&cmd, 0, sizeof(struct crash_memmap_data));
> cmd.params = params;
>
> - /* Add first 640K segment */
> - ei.addr = image->arch.backup_src_start;
> - ei.size = image->arch.backup_src_sz;
> + /*
> + * Add the low memory range[0x1000, SZ_1M], skip
> + * the first zero page.
> + */
> + ei.addr = PAGE_SIZE;
> + ei.size = SZ_1M - PAGE_SIZE;
> ei.type = E820_TYPE_RAM;
> add_e820_entry(params, &ei);

Likewise here. Why do we need a special case?
Why the magic with PAGE_SIZE?

Is this needed because of your other special case above?

When SME is active and the crashkernel command line is enabled do we
just want to leave the low 1MB unencrypted? So we don't need any
special cases?

Eric

2019-10-12 12:19:47

by Dave Young

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

Hi Eric,

On 10/12/19 at 06:26am, Eric W. Biederman wrote:
> Lianbo Jiang <[email protected]> writes:
>
> > When the crashkernel kernel command line option is specified, the
> > low 1MiB memory will always be reserved, which makes that the memory
> > allocated later won't fall into the low 1MiB area, thereby, it's not
> > necessary to create a backup region and also no need to copy the first
> > 640k content to a backup region.
> >
> > Currently, the code related to the backup region can be safely removed,
> > so lets clean up.
> >
> > Signed-off-by: Lianbo Jiang <[email protected]>
> > ---
>
> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> > index eb651fbde92a..cc5774fc84c0 100644
> > --- a/arch/x86/kernel/crash.c
> > +++ b/arch/x86/kernel/crash.c
> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >
> > #ifdef CONFIG_KEXEC_FILE
> >
> > -static unsigned long crash_zero_bytes;
> > -
> > static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
> > {
> > unsigned int *nr_ranges = arg;
> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
> > {
> > struct crash_mem *cmem = arg;
> >
> > - cmem->ranges[cmem->nr_ranges].start = res->start;
> > - cmem->ranges[cmem->nr_ranges].end = res->end;
> > - cmem->nr_ranges++;
> > + if (res->start >= SZ_1M) {
> > + cmem->ranges[cmem->nr_ranges].start = res->start;
> > + cmem->ranges[cmem->nr_ranges].end = res->end;
> > + cmem->nr_ranges++;
> > + } else if (res->end > SZ_1M) {
> > + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> > + cmem->ranges[cmem->nr_ranges].end = res->end;
> > + cmem->nr_ranges++;
> > + }
>
> What is going on with this chunk? I can guess but this needs a clear
> comment.

Indeed, it needs a code comment; this is based on an offline
discussion. cat /proc/vmcore will give a warning because ioremap is
mapping the System RAM.
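
To make the quoted chunk easier to follow, here is a userspace sketch
of the same filtering logic. The names struct mem_range and
clamp_above_1m() are hypothetical; the end address is inclusive, as it
is in struct resource.

```c
#include <stdbool.h>
#include <stdint.h>

#define SZ_1M 0x100000ULL

/* Inclusive range [start, end], mirroring struct resource. */
struct mem_range {
    uint64_t start;
    uint64_t end;
};

/*
 * Decide what part of a RAM range is kept for the vmcore ELF headers:
 *   - entirely at or above 1MiB: kept unchanged
 *   - straddling 1MiB:           trimmed so that it starts at 1MiB
 *   - otherwise:                 dropped (returns false)
 */
bool clamp_above_1m(struct mem_range res, struct mem_range *out)
{
    if (res.start >= SZ_1M) {
        *out = res;
        return true;
    }
    if (res.end > SZ_1M) {
        out->start = SZ_1M;
        out->end = res.end;
        return true;
    }
    return false;
}
```

One detail the sketch makes visible: because the end is inclusive and
the comparison is '>', a range ending exactly at 0x100000 is dropped
rather than trimmed to a single byte.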

We pass the first 1M to the kdump kernel in e820 as System RAM so that
the 2nd kernel can use the low 1M memory, for example for the
trampoline code.

>
> >
> > return 0;
> > }
>
> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> > memset(&cmd, 0, sizeof(struct crash_memmap_data));
> > cmd.params = params;
> >
> > - /* Add first 640K segment */
> > - ei.addr = image->arch.backup_src_start;
> > - ei.size = image->arch.backup_src_sz;
> > + /*
> > + * Add the low memory range[0x1000, SZ_1M], skip
> > + * the first zero page.
> > + */
> > + ei.addr = PAGE_SIZE;
> > + ei.size = SZ_1M - PAGE_SIZE;
> > ei.type = E820_TYPE_RAM;
> > add_e820_entry(params, &ei);
>
> Likewise here. Why do we need a special case?
> Why the magic with PAGE_SIZE?

Good catch, the zero page part is useless. I think there is no other
special reason; it was just assumed that the zero page is not usable.
It should be OK to remove the special handling; passing 0 - 1M is good
enough.
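
The two variants under discussion differ only in the first page. A
small sketch of the arithmetic, assuming 4KiB pages; struct
e820_entry_sketch and the helper names are hypothetical and are not the
kernel's e820 types.

```c
#include <stdint.h>

#define PAGE_SIZE_4K 0x1000ULL    /* assumption: 4KiB pages */
#define SZ_1M        0x100000ULL

/* Hypothetical, simplified stand-in for an e820 entry. */
struct e820_entry_sketch {
    uint64_t addr;
    uint64_t size;
};

/* Variant in the posted patch: skip the first page. */
struct e820_entry_sketch low_mem_skip_first_page(void)
{
    struct e820_entry_sketch e = { PAGE_SIZE_4K, SZ_1M - PAGE_SIZE_4K };
    return e;
}

/* Variant suggested in the reply above: pass all of 0 - 1M. */
struct e820_entry_sketch low_mem_full(void)
{
    struct e820_entry_sketch e = { 0, SZ_1M };
    return e;
}

/* First address past the entry. */
uint64_t entry_end(struct e820_entry_sketch e)
{
    return e.addr + e.size;
}
```

Both entries end at 0x100000; the first covers 0xff000 bytes and the
second 0x100000 bytes.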

Thanks
Dave

2019-10-13 03:56:15

by Eric W. Biederman

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

Dave Young <[email protected]> writes:

> Hi Eric,
>
> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>> Lianbo Jiang <[email protected]> writes:
>>
>> > When the crashkernel kernel command line option is specified, the
>> > low 1MiB memory will always be reserved, which makes that the memory
>> > allocated later won't fall into the low 1MiB area, thereby, it's not
>> > necessary to create a backup region and also no need to copy the first
>> > 640k content to a backup region.
>> >
>> > Currently, the code related to the backup region can be safely removed,
>> > so lets clean up.
>> >
>> > Signed-off-by: Lianbo Jiang <[email protected]>
>> > ---
>>
>> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>> > index eb651fbde92a..cc5774fc84c0 100644
>> > --- a/arch/x86/kernel/crash.c
>> > +++ b/arch/x86/kernel/crash.c
>> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>> >
>> > #ifdef CONFIG_KEXEC_FILE
>> >
>> > -static unsigned long crash_zero_bytes;
>> > -
>> > static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>> > {
>> > unsigned int *nr_ranges = arg;
>> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>> > {
>> > struct crash_mem *cmem = arg;
>> >
>> > - cmem->ranges[cmem->nr_ranges].start = res->start;
>> > - cmem->ranges[cmem->nr_ranges].end = res->end;
>> > - cmem->nr_ranges++;
>> > + if (res->start >= SZ_1M) {
>> > + cmem->ranges[cmem->nr_ranges].start = res->start;
>> > + cmem->ranges[cmem->nr_ranges].end = res->end;
>> > + cmem->nr_ranges++;
>> > + } else if (res->end > SZ_1M) {
>> > + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>> > + cmem->ranges[cmem->nr_ranges].end = res->end;
>> > + cmem->nr_ranges++;
>> > + }
>>
>> What is going on with this chunk? I can guess but this needs a clear
>> comment.
>
> Indeed it needs some code comment, this is based on some offline
> discussion. cat /proc/vmcore will give a warning because ioremap is
> mapping the system ram.
>
> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
> kernel can use the low 1M memory because for example the trampoline
> code.
>
>>
>> >
>> > return 0;
>> > }
>>
>> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>> > memset(&cmd, 0, sizeof(struct crash_memmap_data));
>> > cmd.params = params;
>> >
>> > - /* Add first 640K segment */
>> > - ei.addr = image->arch.backup_src_start;
>> > - ei.size = image->arch.backup_src_sz;
>> > + /*
>> > + * Add the low memory range[0x1000, SZ_1M], skip
>> > + * the first zero page.
>> > + */
>> > + ei.addr = PAGE_SIZE;
>> > + ei.size = SZ_1M - PAGE_SIZE;
>> > ei.type = E820_TYPE_RAM;
>> > add_e820_entry(params, &ei);
>>
>> Likewise here. Why do we need a special case?
>> Why the magic with PAGE_SIZE?
>
> Good catch, the zero page part is useless, I think no other special
> reason, just assumed zero page is not usable, but it should be ok to
> remove the special handling, just pass 0 - 1M is good enough.

But if we have stopped special casing the low 1M, why do we need a
special case here at all?

If you need the special case it is almost certainly wrong to say you
have ram above 640KiB and below 1MiB. That is the legacy ROM and video
MMIO area.

There is a reason the original code said 640KiB.
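
For reference, the boundaries being discussed can be written down as a
small classifier. This is a sketch of the conventional PC layout only;
the enum and function names are hypothetical, and on a real machine the
firmware memory map is the authority on what is usable.

```c
#include <stdint.h>

#define SZ_640K (640 * 1024ULL)   /* 0xa0000 */
#define SZ_1M   0x100000ULL

enum low_mem_kind {
    LOW_CONVENTIONAL,   /* [0, 640KiB): conventional memory */
    LOW_LEGACY_HOLE,    /* [640KiB, 1MiB): legacy video MMIO and ROM */
    ABOVE_1M            /* 1MiB and up */
};

/* Classify a physical address by the conventional PC boundaries. */
enum low_mem_kind classify_low_addr(uint64_t addr)
{
    if (addr < SZ_640K)
        return LOW_CONVENTIONAL;
    if (addr < SZ_1M)
        return LOW_LEGACY_HOLE;
    return ABOVE_1M;
}
```

An e820 RAM entry covering all of 0 - 1M would therefore also span the
range this sketch labels LOW_LEGACY_HOLE, which is the concern raised
above.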

Eric

2019-10-13 09:37:48

by Lianbo Jiang

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 2019-10-13 11:54, Eric W. Biederman wrote:
> Dave Young <[email protected]> writes:
>
>> Hi Eric,
>>
>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>> Lianbo Jiang <[email protected]> writes:
>>>
>>>> When the crashkernel kernel command line option is specified, the
>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>> necessary to create a backup region and also no need to copy the first
>>>> 640k content to a backup region.
>>>>
>>>> Currently, the code related to the backup region can be safely removed,
>>>> so lets clean up.
>>>>
>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>> ---
>>>
>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>> --- a/arch/x86/kernel/crash.c
>>>> +++ b/arch/x86/kernel/crash.c
>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>
>>>> #ifdef CONFIG_KEXEC_FILE
>>>>
>>>> -static unsigned long crash_zero_bytes;
>>>> -
>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>> {
>>>> unsigned int *nr_ranges = arg;
>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>> {
>>>> struct crash_mem *cmem = arg;
>>>>
>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> - cmem->nr_ranges++;
>>>> + if (res->start >= SZ_1M) {
>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> + cmem->nr_ranges++;
>>>> + } else if (res->end > SZ_1M) {
>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> + cmem->nr_ranges++;
>>>> + }
>>>
>>> What is going on with this chunk? I can guess but this needs a clear
>>> comment.
>>
>> Indeed it needs some code comment, this is based on some offline
>> discussion. cat /proc/vmcore will give a warning because ioremap is
>> mapping the system ram.
>>
>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>> kernel can use the low 1M memory because for example the trampoline
>> code.
>>
>>>
>>>>
>>>> return 0;
>>>> }
>>>
>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>> cmd.params = params;
>>>>
>>>> - /* Add first 640K segment */
>>>> - ei.addr = image->arch.backup_src_start;
>>>> - ei.size = image->arch.backup_src_sz;
>>>> + /*
>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>> + * the first zero page.
>>>> + */
>>>> + ei.addr = PAGE_SIZE;
>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>> ei.type = E820_TYPE_RAM;
>>>> add_e820_entry(params, &ei);
>>>
>>> Likewise here. Why do we need a special case?
>>> Why the magic with PAGE_SIZE?
>>
>> Good catch, the zero page part is useless, I think no other special
>> reason, just assumed zero page is not usable, but it should be ok to
>> remove the special handling, just pass 0 - 1M is good enough.
>
> But if we have stopped special casing the low 1M. Why do we need a
> special case here at all?
>
Here, we need to pass the low memory range to the kdump kernel, which
guarantees the availability of low memory in the kdump kernel;
otherwise, the kdump kernel won't use the low memory region.

> If you need the special case it is almost certainly wrong to say you
> have ram above 640KiB and below 1MiB. That is the legacy ROM and video
> MMIO area.
>
> There is a reason the original code said 640KiB.
>
Do you mean that the 640k region is good enough here instead of 1MiB?

Thanks.
Lianbo

> Eric
>

2019-10-13 10:24:20

by Dave Young

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 10/12/19 at 10:54pm, Eric W. Biederman wrote:
> Dave Young <[email protected]> writes:
>
> > Hi Eric,
> >
> > On 10/12/19 at 06:26am, Eric W. Biederman wrote:
> >> Lianbo Jiang <[email protected]> writes:
> >>
> >> > When the crashkernel kernel command line option is specified, the
> >> > low 1MiB memory will always be reserved, which makes that the memory
> >> > allocated later won't fall into the low 1MiB area, thereby, it's not
> >> > necessary to create a backup region and also no need to copy the first
> >> > 640k content to a backup region.
> >> >
> >> > Currently, the code related to the backup region can be safely removed,
> >> > so lets clean up.
> >> >
> >> > Signed-off-by: Lianbo Jiang <[email protected]>
> >> > ---
> >>
> >> > diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
> >> > index eb651fbde92a..cc5774fc84c0 100644
> >> > --- a/arch/x86/kernel/crash.c
> >> > +++ b/arch/x86/kernel/crash.c
> >> > @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
> >> >
> >> > #ifdef CONFIG_KEXEC_FILE
> >> >
> >> > -static unsigned long crash_zero_bytes;
> >> > -
> >> > static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
> >> > {
> >> > unsigned int *nr_ranges = arg;
> >> > @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
> >> > {
> >> > struct crash_mem *cmem = arg;
> >> >
> >> > - cmem->ranges[cmem->nr_ranges].start = res->start;
> >> > - cmem->ranges[cmem->nr_ranges].end = res->end;
> >> > - cmem->nr_ranges++;
> >> > + if (res->start >= SZ_1M) {
> >> > + cmem->ranges[cmem->nr_ranges].start = res->start;
> >> > + cmem->ranges[cmem->nr_ranges].end = res->end;
> >> > + cmem->nr_ranges++;
> >> > + } else if (res->end > SZ_1M) {
> >> > + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> >> > + cmem->ranges[cmem->nr_ranges].end = res->end;
> >> > + cmem->nr_ranges++;
> >> > + }
> >>
> >> What is going on with this chunk? I can guess but this needs a clear
> >> comment.
> >
> > Indeed it needs some code comment, this is based on some offline
> > discussion. cat /proc/vmcore will give a warning because ioremap is
> > mapping the system ram.
> >
> > We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
> > kernel can use the low 1M memory because for example the trampoline
> > code.
> >
> >>
> >> >
> >> > return 0;
> >> > }
> >>
> >> > @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> >> > memset(&cmd, 0, sizeof(struct crash_memmap_data));
> >> > cmd.params = params;
> >> >
> >> > - /* Add first 640K segment */
> >> > - ei.addr = image->arch.backup_src_start;
> >> > - ei.size = image->arch.backup_src_sz;
> >> > + /*
> >> > + * Add the low memory range[0x1000, SZ_1M], skip
> >> > + * the first zero page.
> >> > + */
> >> > + ei.addr = PAGE_SIZE;
> >> > + ei.size = SZ_1M - PAGE_SIZE;
> >> > ei.type = E820_TYPE_RAM;
> >> > add_e820_entry(params, &ei);
> >>
> >> Likewise here. Why do we need a special case?
> >> Why the magic with PAGE_SIZE?
> >
> > Good catch, the zero page part is useless, I think no other special
> > reason, just assumed zero page is not usable, but it should be ok to
> > remove the special handling, just pass 0 - 1M is good enough.
>
> But if we have stopped special casing the low 1M. Why do we need a
> special case here at all?

It seems neither Lianbo nor I understood the question. Let me try to
explain in more detail.

The 2nd kernel still needs low memory for the trampoline, so we have to
let the 2nd kernel access the low memory as System RAM.

The original special case does far more than what we do above. It does
several things:
1. Back up the low 640K into kdump reserved high memory.
2. Set the low 640K as System RAM in the kdump kernel, as we do in this
patch.
3. In the /proc/vmcore ELF header, map the low 640K to the backup region
so that /proc/vmcore returns the correct old memory for that region.

With the change in this series, we drop 1 and 3, but 2 is still needed
because the kdump kernel still needs to use low memory. We do not care
about the vmcore part because nobody uses that memory in the old kernel:
we already reserve it, and the range is excluded from the vmcore.

I think you also mentioned the reserved memory under 1M. Even if we set
0-1M as System RAM, all the reserved regions in /proc/iomem stay
identical between the 1st and 2nd kernels, so it just works. See the
output of cat /proc/iomem in the kdump kernel below (taken after
dropping into a shell before copying the vmcore out):
kdump:/# cat /proc/iomem|less
00000000-00000fff : Reserved
00001000-0009ffff : System RAM
000a0000-000bffff : PCI Bus 0000:00
000f0000-000fffff : System ROM
30000000-3000006f : System RAM
30000070-39f5cfff : System RAM
38600000-39001070 : Kernel code
39001071-396ce5ff : Kernel data
39acd000-39bfffff : Kernel bss

You can see that 000a0000-000bffff : PCI Bus 0000:00 is the same across
the kdump reboot.
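
To repeat this check on another machine, lines in that layout can be
parsed in user space. What follows is only a minimal sketch, assuming the
"start-end : name" format shown in the output above; struct iomem_entry,
parse_iomem_line() and is_low_system_ram() are hypothetical helpers for
illustration, not kernel or kexec-tools code.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

#define SZ_1M 0x100000ULL

/* Hypothetical holder for one /proc/iomem line; not a kernel type. */
struct iomem_entry {
	unsigned long long start;
	unsigned long long end;		/* inclusive, as printed */
	char name[64];
};

/* Parse "start-end : name"; returns 0 on success, -1 on a malformed line. */
static int parse_iomem_line(const char *line, struct iomem_entry *e)
{
	int n = sscanf(line, "%llx-%llx : %63[^\n]",
		       &e->start, &e->end, e->name);

	return n == 3 ? 0 : -1;
}

/* True when the whole entry lies below 1 MiB and is named "System RAM". */
static int is_low_system_ram(const struct iomem_entry *e)
{
	assert(e->start <= e->end);
	return e->end < SZ_1M && strcmp(e->name, "System RAM") == 0;
}
```

Feeding each line of /proc/iomem from both kernels through
parse_iomem_line() and comparing the entries below SZ_1M is one way to
confirm that the low-memory layout did not change.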

But if simply using 0 - 1M is not elegant enough, maybe use 0 - 640K,
as Lianbo said in another reply?

-------------

BTW, we also discussed compatibility issues. For kexec_file it just
works because our change is in the kernel. For the kexec-tools part, we
can leave the userspace code as is, which means that anyone who wants
the SME case fixed needs a kernel update that reserves the low memory.

We cannot drop the kexec-tools special case for 640K, because if we
drop it and people use old kernels which do not reserve the low 1M, then
kdump cannot work at all.

Thanks
Dave

2019-10-14 10:06:11

by Lianbo Jiang

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 2019-10-12 20:16, Dave Young wrote:
> Hi Eric,
>
> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>> Lianbo Jiang <[email protected]> writes:
>>
>>> When the crashkernel kernel command line option is specified, the
>>> low 1MiB memory will always be reserved, which makes that the memory
>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>> necessary to create a backup region and also no need to copy the first
>>> 640k content to a backup region.
>>>
>>> Currently, the code related to the backup region can be safely removed,
>>> so lets clean up.
>>>
>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>> ---
>>
>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>> index eb651fbde92a..cc5774fc84c0 100644
>>> --- a/arch/x86/kernel/crash.c
>>> +++ b/arch/x86/kernel/crash.c
>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>
>>> #ifdef CONFIG_KEXEC_FILE
>>>
>>> -static unsigned long crash_zero_bytes;
>>> -
>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>> {
>>> unsigned int *nr_ranges = arg;
>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>> {
>>> struct crash_mem *cmem = arg;
>>>
>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>> - cmem->nr_ranges++;
>>> + if (res->start >= SZ_1M) {
>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>> + cmem->nr_ranges++;
>>> + } else if (res->end > SZ_1M) {
>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>> + cmem->nr_ranges++;
>>> + }
>>
>> What is going on with this chunk? I can guess but this needs a clear
>> comment.
>
> Indeed it needs some code comment, this is based on some offline
> discussion. cat /proc/vmcore will give a warning because ioremap is
> mapping the system ram.
>
> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
> kernel can use the low 1M memory because for example the trampoline
> code.
>
Thank you, Eric and Dave. I will add the code comment below, if that is OK.

@@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
{
struct crash_mem *cmem = arg;

- cmem->ranges[cmem->nr_ranges].start = res->start;
- cmem->ranges[cmem->nr_ranges].end = res->end;
- cmem->nr_ranges++;
+ /*
+ * Currently, pass the low 1MiB range to kdump kernel in e820
+ * as system ram so that kdump kernel can also use the low 1MiB
+ * memory due to the real mode trampoline code.
+ * And later, the low 1MiB range will be excluded from elf header,
+ * which will avoid remapping the 1MiB system ram when dumping
+ * vmcore.
+ */
+ if (res->start >= SZ_1M) {
+ cmem->ranges[cmem->nr_ranges].start = res->start;
+ cmem->ranges[cmem->nr_ranges].end = res->end;
+ cmem->nr_ranges++;
+ } else if (res->end > SZ_1M) {
+ cmem->ranges[cmem->nr_ranges].start = SZ_1M;
+ cmem->ranges[cmem->nr_ranges].end = res->end;
+ cmem->nr_ranges++;
+ }

return 0;
}
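
The clamping in the hunk above can also be tried outside the kernel. The
following is a minimal sketch, assuming simplified stand-ins for struct
resource and struct crash_mem (struct ram_range, struct ram_ranges and
add_range_above_1m() are made up for illustration); it mirrors the three
cases of the proposed callback and is not the kernel code itself.

```c
#include <assert.h>

#define SZ_1M 0x100000ULL
#define MAX_RANGES 8

/* Simplified stand-ins for the kernel's resource and crash_mem types. */
struct ram_range {
	unsigned long long start;
	unsigned long long end;		/* inclusive */
};

struct ram_ranges {
	unsigned int nr_ranges;
	struct ram_range ranges[MAX_RANGES];
};

/*
 * Record a RAM range the way the proposed callback does: keep ranges that
 * start at or above 1 MiB, clamp ranges that cross 1 MiB, and skip ranges
 * that end at or below it.
 */
static void add_range_above_1m(struct ram_ranges *r,
			       unsigned long long start,
			       unsigned long long end)
{
	assert(start <= end);

	if (start < SZ_1M) {
		if (end <= SZ_1M)
			return;		/* nothing above the low 1 MiB */
		start = SZ_1M;		/* clamp the crossing range */
	}

	assert(r->nr_ranges < MAX_RANGES);
	r->ranges[r->nr_ranges].start = start;
	r->ranges[r->nr_ranges].end = end;
	r->nr_ranges++;
}
```

With this model, a range such as 0x1000-0x9ffff is dropped, while a range
that starts below 1 MiB and ends above it is recorded starting at SZ_1M.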

>>
>>>
>>> return 0;
>>> }
>>
>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>> cmd.params = params;
>>>
>>> - /* Add first 640K segment */
>>> - ei.addr = image->arch.backup_src_start;
>>> - ei.size = image->arch.backup_src_sz;
>>> + /*
>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>> + * the first zero page.
>>> + */
>>> + ei.addr = PAGE_SIZE;
>>> + ei.size = SZ_1M - PAGE_SIZE;
>>> ei.type = E820_TYPE_RAM;
>>> add_e820_entry(params, &ei);
>>
>> Likewise here. Why do we need a special case?
>> Why the magic with PAGE_SIZE?
>
> Good catch, the zero page part is useless, I think no other special
> reason, just assumed zero page is not usable, but it should be ok to
> remove the special handling, just pass 0 - 1M is good enough.
> Thanks
> Dave
>

2019-10-15 12:08:39

by Eric W. Biederman

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

lijiang <[email protected]> writes:

> On 2019-10-13 11:54, Eric W. Biederman wrote:
>> Dave Young <[email protected]> writes:
>>
>>> Hi Eric,
>>>
>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>> Lianbo Jiang <[email protected]> writes:
>>>>
>>>>> When the crashkernel kernel command line option is specified, the
>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>> necessary to create a backup region and also no need to copy the first
>>>>> 640k content to a backup region.
>>>>>
>>>>> Currently, the code related to the backup region can be safely removed,
>>>>> so lets clean up.
>>>>>
>>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>>> ---
>>>>
>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>> --- a/arch/x86/kernel/crash.c
>>>>> +++ b/arch/x86/kernel/crash.c
>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>
>>>>> #ifdef CONFIG_KEXEC_FILE
>>>>>
>>>>> -static unsigned long crash_zero_bytes;
>>>>> -
>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>> {
>>>>> unsigned int *nr_ranges = arg;
>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>> {
>>>>> struct crash_mem *cmem = arg;
>>>>>
>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> - cmem->nr_ranges++;
>>>>> + if (res->start >= SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + } else if (res->end > SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + }
>>>>
>>>> What is going on with this chunk? I can guess but this needs a clear
>>>> comment.
>>>
>>> Indeed it needs some code comment, this is based on some offline
>>> discussion. cat /proc/vmcore will give a warning because ioremap is
>>> mapping the system ram.
>>>
>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>> kernel can use the low 1M memory because for example the trampoline
>>> code.
>>>
>>>>
>>>>>
>>>>> return 0;
>>>>> }
>>>>
>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>> cmd.params = params;
>>>>>
>>>>> - /* Add first 640K segment */
>>>>> - ei.addr = image->arch.backup_src_start;
>>>>> - ei.size = image->arch.backup_src_sz;
>>>>> + /*
>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>> + * the first zero page.
>>>>> + */
>>>>> + ei.addr = PAGE_SIZE;
>>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>>> ei.type = E820_TYPE_RAM;
>>>>> add_e820_entry(params, &ei);
>>>>
>>>> Likewise here. Why do we need a special case?
>>>> Why the magic with PAGE_SIZE?
>>>
>>> Good catch, the zero page part is useless, I think no other special
>>> reason, just assumed zero page is not usable, but it should be ok to
>>> remove the special handling, just pass 0 - 1M is good enough.
>>
>> But if we have stopped special casing the low 1M. Why do we need a
>> special case here at all?
>>
> Here, need to pass the low memory range to kdump kernel, which will guarantee
> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
> use the low memory region.
>
>> If you need the special case it is almost certainly wrong to say you
>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video
>> MMIO area.
>>
>> There is a reason the original code said 640KiB.
>>
> Do you mean that the 640k region is good enough here instead of 1MiB?

Reading through the code of crash_setup_memmap_entries, I now see what
the code is doing. The code is repeating the e820 memory map with
the memory areas that were not reserved for the crash kernel removed.

In which case the code needs to be doing something like:

cmd.type = E820_TYPE_RAM;
flags = IORESOURCE_MEM;
walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
memmap_entry_callback);

Depending on which bugs exist it might make sense to limit this to
the low 640KiB. But finding something the kernel already recognizes
as RAM should prevent most of those problems already. Barring bugs
I admit it doesn't make sense to repeat the work that someone else
has already done.

This bit:
/* Add e820 reserved ranges */
cmd.type = E820_TYPE_RESERVED;
flags = IORESOURCE_MEM;
walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
memmap_entry_callback);

Should probably start at 1MiB instead of 0. Just so we don't report the
memory below 1MiB as unconditionally reserved. I don't properly
understand the IORES_DESC_RESERVED flag, and how that differs from
flags. So please test my suggestions to verify the code works as
expected.

Eric

2019-10-15 12:10:37

by Eric W. Biederman

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

lijiang <[email protected]> writes:

> On 2019-10-12 20:16, Dave Young wrote:
>> Hi Eric,
>>
>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>> Lianbo Jiang <[email protected]> writes:
>>>
>>>> When the crashkernel kernel command line option is specified, the
>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>> necessary to create a backup region and also no need to copy the first
>>>> 640k content to a backup region.
>>>>
>>>> Currently, the code related to the backup region can be safely removed,
>>>> so lets clean up.
>>>>
>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>> ---
>>>
>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>> --- a/arch/x86/kernel/crash.c
>>>> +++ b/arch/x86/kernel/crash.c
>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>
>>>> #ifdef CONFIG_KEXEC_FILE
>>>>
>>>> -static unsigned long crash_zero_bytes;
>>>> -
>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>> {
>>>> unsigned int *nr_ranges = arg;
>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>> {
>>>> struct crash_mem *cmem = arg;
>>>>
>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> - cmem->nr_ranges++;
>>>> + if (res->start >= SZ_1M) {
>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> + cmem->nr_ranges++;
>>>> + } else if (res->end > SZ_1M) {
>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>> + cmem->nr_ranges++;
>>>> + }
>>>
>>> What is going on with this chunk? I can guess but this needs a clear
>>> comment.
>>
>> Indeed it needs some code comment, this is based on some offline
>> discussion. cat /proc/vmcore will give a warning because ioremap is
>> mapping the system ram.
>>
>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>> kernel can use the low 1M memory because for example the trampoline
>> code.
>>
> Thank you, Eric and Dave. I will add the code comment as below if it would be OK.
>
> @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
> {
> struct crash_mem *cmem = arg;
>
> - cmem->ranges[cmem->nr_ranges].start = res->start;
> - cmem->ranges[cmem->nr_ranges].end = res->end;
> - cmem->nr_ranges++;
> + /*
> + * Currently, pass the low 1MiB range to kdump kernel in e820
> + * as system ram so that kdump kernel can also use the low 1MiB
> + * memory due to the real mode trampoline code.
> + * And later, the low 1MiB range will be excluded from elf header,
> + * which will avoid remapping the 1MiB system ram when dumping
> + * vmcore.
> + */
> + if (res->start >= SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = res->start;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + } else if (res->end > SZ_1M) {
> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
> + cmem->ranges[cmem->nr_ranges].end = res->end;
> + cmem->nr_ranges++;
> + }
>
> return 0;
> }

I just read through the appropriate section of crash.c, and given the
way things are structured, doing this work in
prepare_elf64_ram_headers_callback is wrong.

This can be done in a simpler manner in elf_header_exclude_ranges.
Something like:

/* The low 1MiB is always reserved */
ret = crash_exclude_mem_range(cmem, 0, 1024*1024);
if (ret)
return ret;
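
To illustrate what excluding the low 1MiB from a list of ranges amounts
to, here is a minimal user-space sketch. It is a model only, not the
kernel's crash_exclude_mem_range(): struct span, struct span_list and
exclude_span() are hypothetical names, and the sketch assumes inclusive
end addresses, so the low 1MiB is written as [0, 0xfffff].

```c
#include <assert.h>

#define MAX_SPANS 8

/* Hypothetical user-space model; end addresses are inclusive. */
struct span {
	unsigned long long start;
	unsigned long long end;
};

struct span_list {
	unsigned int nr;
	struct span spans[MAX_SPANS];
};

/*
 * Remove [mstart, mend] from every span in the list. A span that straddles
 * the excluded window is split in two. Returns 0 on success, or -1 if the
 * result would not fit in MAX_SPANS (the list is then left unchanged).
 */
static int exclude_span(struct span_list *l,
			unsigned long long mstart,
			unsigned long long mend)
{
	struct span_list out = { 0 };
	unsigned int i;

	assert(mstart <= mend);

	for (i = 0; i < l->nr; i++) {
		struct span s = l->spans[i];

		/* The part of s below the excluded window survives. */
		if (s.start < mstart) {
			if (out.nr == MAX_SPANS)
				return -1;
			out.spans[out.nr].start = s.start;
			out.spans[out.nr].end =
				s.end < mstart ? s.end : mstart - 1;
			out.nr++;
		}

		/* The part of s above the excluded window survives. */
		if (s.end > mend) {
			if (out.nr == MAX_SPANS)
				return -1;
			out.spans[out.nr].start =
				s.start > mend ? s.start : mend + 1;
			out.spans[out.nr].end = s.end;
			out.nr++;
		}
	}

	*l = out;
	return 0;
}
```

Doing the exclusion once on the collected list keeps the per-resource
callback free of special cases, which is the appeal of handling it at
this level.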

Eric

2019-10-16 11:13:08

by Lianbo Jiang

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 2019-10-15 19:11, Eric W. Biederman wrote:
> lijiang <[email protected]> writes:
>
>> On 2019-10-12 20:16, Dave Young wrote:
>>> Hi Eric,
>>>
>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>> Lianbo Jiang <[email protected]> writes:
>>>>
>>>>> When the crashkernel kernel command line option is specified, the
>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>> necessary to create a backup region and also no need to copy the first
>>>>> 640k content to a backup region.
>>>>>
>>>>> Currently, the code related to the backup region can be safely removed,
>>>>> so lets clean up.
>>>>>
>>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>>> ---
>>>>
>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>> --- a/arch/x86/kernel/crash.c
>>>>> +++ b/arch/x86/kernel/crash.c
>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>
>>>>> #ifdef CONFIG_KEXEC_FILE
>>>>>
>>>>> -static unsigned long crash_zero_bytes;
>>>>> -
>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>> {
>>>>> unsigned int *nr_ranges = arg;
>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>> {
>>>>> struct crash_mem *cmem = arg;
>>>>>
>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> - cmem->nr_ranges++;
>>>>> + if (res->start >= SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + } else if (res->end > SZ_1M) {
>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>> + cmem->nr_ranges++;
>>>>> + }
>>>>
>>>> What is going on with this chunk? I can guess but this needs a clear
>>>> comment.
>>>
>>> Indeed it needs some code comment, this is based on some offline
>>> discussion. cat /proc/vmcore will give a warning because ioremap is
>>> mapping the system ram.
>>>
>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>> kernel can use the low 1M memory because for example the trampoline
>>> code.
>>>
>> Thank you, Eric and Dave. I will add the code comment as below if it would be OK.
>>
>> @@ -234,9 +232,20 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>> {
>> struct crash_mem *cmem = arg;
>>
>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>> - cmem->nr_ranges++;
>> + /*
>> + * Currently, pass the low 1MiB range to kdump kernel in e820
>> + * as system ram so that kdump kernel can also use the low 1MiB
>> + * memory due to the real mode trampoline code.
>> + * And later, the low 1MiB range will be excluded from elf header,
>> + * which will avoid remapping the 1MiB system ram when dumping
>> + * vmcore.
>> + */
>> + if (res->start >= SZ_1M) {
>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>> + cmem->nr_ranges++;
>> + } else if (res->end > SZ_1M) {
>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>> + cmem->nr_ranges++;
>> + }
>>
>> return 0;
>> }
>
> I just read through the appropriate section of crash.c and the way
> things are structured doing this work in
> prepare_elf64_ram_headers_callback is wrong.
>
> This can be done in a simpler manner in elf_header_exclude_ranges.
> Something like:
>
Thank you, Eric. That does seem to be a more reasonable place. I will
test it and improve this in the next post.

Lianbo

> /* The low 1MiB is always reserved */
> ret = crash_exclude_mem_range(cmem, 0, 1024*1024);
> if (ret)
> return ret;
>
> Eric
>

2019-10-16 13:02:34

by Lianbo Jiang

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 2019-10-15 19:04, Eric W. Biederman wrote:
> lijiang <[email protected]> writes:
>
>> On 2019-10-13 11:54, Eric W. Biederman wrote:
>>> Dave Young <[email protected]> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang <[email protected]> writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>
>>>>>> #ifdef CONFIG_KEXEC_FILE
>>>>>>
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>> {
>>>>>> unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>> {
>>>>>> struct crash_mem *cmem = arg;
>>>>>>
>>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> - cmem->nr_ranges++;
>>>>>> + if (res->start >= SZ_1M) {
>>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> + cmem->nr_ranges++;
>>>>>> + } else if (res->end > SZ_1M) {
>>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> + cmem->nr_ranges++;
>>>>>> + }
>>>>>
>>>>> What is going on with this chunk? I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion. cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
>>>>
>>>>>
>>>>>>
>>>>>> return 0;
>>>>>> }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>> cmd.params = params;
>>>>>>
>>>>>> - /* Add first 640K segment */
>>>>>> - ei.addr = image->arch.backup_src_start;
>>>>>> - ei.size = image->arch.backup_src_sz;
>>>>>> + /*
>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> + * the first zero page.
>>>>>> + */
>>>>>> + ei.addr = PAGE_SIZE;
>>>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>>>> ei.type = E820_TYPE_RAM;
>>>>>> add_e820_entry(params, &ei);
>>>>>
>>>>> Likewise here. Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no other special
>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>
>>> But if we have stopped special casing the low 1M. Why do we need a
>>> special case here at all?
>>>
>> Here, need to pass the low memory range to kdump kernel, which will guarantee
>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
>> use the low memory region.
>>
>>> If you need the special case it is almost certainly wrong to say you
>>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video
>>> MMIO area.
>>>
>>> There is a reason the original code said 640KiB.
>>>
>> Do you mean that the 640k region is good enough here instead of 1MiB?
>
> Reading through the code of crash_setup_memap_entries I see that what
> the code is doing now. The code is repeating the e820 memory map with
> the memory areas that were not reserved for the crash kernel removed.
>
> In which case what the code needs to be doing something like:
>
> cmd.type = E820_TYPE_RAM;
> flags = IORESOURCE_MEM;
> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
> memmap_entry_callback);
>
> Depending on which bugs exist it might make sense to limit this to
> the low 640KiB. But finding something the kernel already recognizes
> as RAM should prevent most of those problems already. Barring bugs
> I admit it doesn't make sense to repeat the work that someone else
> has already done.
>
> This bit:
> /* Add e820 reserved ranges */
> cmd.type = E820_TYPE_RESERVED;
> flags = IORESOURCE_MEM;
> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
> memmap_entry_callback);
>
> Should probably start at 1MiB instead of 0. Just so we don't report the
> memory below 1MiB as unconditionally reserved. I don't properly
> understand the IORES_DESC_RESERVED flag, and how that differs from
> flags. So please test my suggestions to verify the code works as
> expected.
>
Thanks for your comment, Eric.

I will test your suggestions, but I need an SME machine, so my reply
may come later.

Thanks.
Lianbo

> Eric
>

2019-10-16 16:24:45

by Lianbo Jiang

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

On 2019-10-15 19:04, Eric W. Biederman wrote:
> lijiang <[email protected]> writes:
>
>> On 2019-10-13 11:54, Eric W. Biederman wrote:
>>> Dave Young <[email protected]> writes:
>>>
>>>> Hi Eric,
>>>>
>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>> Lianbo Jiang <[email protected]> writes:
>>>>>
>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>> low 1MiB memory will always be reserved, which makes that the memory
>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>> 640k content to a backup region.
>>>>>>
>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>> so lets clean up.
>>>>>>
>>>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>>>> ---
>>>>>
>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>
>>>>>> #ifdef CONFIG_KEXEC_FILE
>>>>>>
>>>>>> -static unsigned long crash_zero_bytes;
>>>>>> -
>>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>> {
>>>>>> unsigned int *nr_ranges = arg;
>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>> {
>>>>>> struct crash_mem *cmem = arg;
>>>>>>
>>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> - cmem->nr_ranges++;
>>>>>> + if (res->start >= SZ_1M) {
>>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> + cmem->nr_ranges++;
>>>>>> + } else if (res->end > SZ_1M) {
>>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>> + cmem->nr_ranges++;
>>>>>> + }
>>>>>
>>>>> What is going on with this chunk? I can guess but this needs a clear
>>>>> comment.
>>>>
>>>> Indeed it needs some code comment, this is based on some offline
>>>> discussion. cat /proc/vmcore will give a warning because ioremap is
>>>> mapping the system ram.
>>>>
>>>> We pass the first 1M to kdump kernel in e820 as system ram so that 2nd
>>>> kernel can use the low 1M memory because for example the trampoline
>>>> code.
>>>>
>>>>>
>>>>>>
>>>>>> return 0;
>>>>>> }
>>>>>
>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>> cmd.params = params;
>>>>>>
>>>>>> - /* Add first 640K segment */
>>>>>> - ei.addr = image->arch.backup_src_start;
>>>>>> - ei.size = image->arch.backup_src_sz;
>>>>>> + /*
>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>>> + * the first zero page.
>>>>>> + */
>>>>>> + ei.addr = PAGE_SIZE;
>>>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>>>> ei.type = E820_TYPE_RAM;
>>>>>> add_e820_entry(params, &ei);
>>>>>
>>>>> Likewise here. Why do we need a special case?
>>>>> Why the magic with PAGE_SIZE?
>>>>
>>>> Good catch, the zero page part is useless, I think no other special
>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>
>>> But if we have stopped special casing the low 1M. Why do we need a
>>> special case here at all?
>>>
>> Here, need to pass the low memory range to kdump kernel, which will guarantee
>> the availability of low memory in kdump kernel, otherwise, kdump kernel won't
>> use the low memory region.
>>
>>> If you need the special case it is almost certainly wrong to say you
>>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video
>>> MMIO area.
>>>
>>> There is a reason the original code said 640KiB.
>>>
>> Do you mean that the 640k region is good enough here instead of 1MiB?
>
> Reading through the code of crash_setup_memap_entries I see that what
> the code is doing now. The code is repeating the e820 memory map with
> the memory areas that were not reserved for the crash kernel removed.
>
> In which case what the code needs to be doing something like:
>
> cmd.type = E820_TYPE_RAM;
> flags = IORESOURCE_MEM;
> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
> memmap_entry_callback);
>
The above code does not produce the result we expected: it finds the reserved
memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range.

As a result, the kdump kernel panicked as follows:
......
[ 3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated
[ 3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4
[ 3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
[ 3.556660] Call Trace:
[ 3.556660] dump_stack+0x46/0x60
[ 3.556660] panic+0xfb/0x2d7
[ 3.556660] ? hv_init_spinlocks+0x7f/0x7f
[ 3.556660] init_real_mode+0x27/0x1fa
[ 3.556660] ? hv_init_spinlocks+0x7f/0x7f
[ 3.556660] ? do_one_initcall+0x46/0x1e4
[ 3.556660] ? proc_register+0xd0/0x130
[ 3.556660] ? kernel_init_freeable+0xe2/0x242
[ 3.556660] ? rest_init+0xaa/0xaa
[ 3.556660] ? kernel_init+0xa/0x106
[ 3.556660] ? ret_from_fork+0x22/0x40
[ 3.556660] Rebooting in 10 seconds..
[ 3.556660] ACPI MEMORY or I/O RESET_REG.

I modified the above code and tested it. The modified version finds the system
ram in the low 1MiB range, and it worked well.

@@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
memset(&cmd, 0, sizeof(struct crash_memmap_data));
cmd.params = params;

+ /* Add the low 1MiB */
+ cmd.type = E820_TYPE_RAM;
+ flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
+ walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd,
+ memmap_entry_callback);
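
The difference between the two walks can be reproduced with a small
user-space model. This is a sketch under stated assumptions: the FLAG_*
and DESC_* values are illustrative stand-ins rather than the kernel's
IORESOURCE_* and IORES_DESC_* constants, walk_matching() is a hypothetical
helper, and the flags assigned to the sample entries are assumptions based
on the /proc/iomem output quoted earlier in this thread. The matching rule
assumed here is that an entry must carry all requested flags and, unless
the descriptor is "none", must have exactly that descriptor.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-ins; these are not the kernel's IORESOURCE_* values. */
enum {
	FLAG_MEM    = 1u << 0,
	FLAG_SYSRAM = 1u << 1,
	FLAG_BUSY   = 1u << 2,
};

enum res_desc { DESC_NONE, DESC_RESERVED };

struct res {
	unsigned long long start;
	unsigned long long end;		/* inclusive */
	unsigned int flags;
	enum res_desc desc;
};

/*
 * Copy into out[] every entry of tbl[] that overlaps [start, end], carries
 * all requested flags and, unless desc is DESC_NONE, has that descriptor.
 * Matches are clipped to the window. Returns the number of matches stored.
 */
static size_t walk_matching(const struct res *tbl, size_t n,
			    enum res_desc desc, unsigned int flags,
			    unsigned long long start, unsigned long long end,
			    struct res *out, size_t out_len)
{
	size_t i, found = 0;

	for (i = 0; i < n && found < out_len; i++) {
		const struct res *p = &tbl[i];

		if ((p->flags & flags) != flags)
			continue;
		if (desc != DESC_NONE && p->desc != desc)
			continue;
		if (p->end < start || p->start > end)
			continue;

		out[found] = *p;
		if (out[found].start < start)
			out[found].start = start;
		if (out[found].end > end)
			out[found].end = end;
		found++;
	}
	return found;
}

/* Sample layout; the flags and descriptors are assumed for illustration. */
static const struct res sample_map[] = {
	{ 0x00000000ULL, 0x00000fffULL, FLAG_MEM, DESC_RESERVED },
	{ 0x00001000ULL, 0x0009ffffULL,
	  FLAG_MEM | FLAG_SYSRAM | FLAG_BUSY, DESC_NONE },
	{ 0x30000070ULL, 0x39f5cfffULL,
	  FLAG_MEM | FLAG_SYSRAM | FLAG_BUSY, DESC_NONE },
};
```

In this model, a walk over [0, 0xfffff] with DESC_RESERVED and FLAG_MEM
returns only the reserved page at 0-0xfff, while a walk with DESC_NONE and
the system-ram flags returns 0x1000-0x9ffff, which is the range the
trampoline needs.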

> Depending on which bugs exist it might make sense to limit this to
> the low 640KiB. But finding something the kernel already recognizes
> as RAM should prevent most of those problems already. Barring bugs
> I admit it doesn't make sense to repeat the work that someone else
> has already done.
>
> This bit:
> /* Add e820 reserved ranges */
> cmd.type = E820_TYPE_RESERVED;
> flags = IORESOURCE_MEM;
> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
> memmap_entry_callback);
>
> Should probably start at 1MiB instead of 0. Just so we don't report the
If so, it cannot find the reserved memory marked as 'IORES_DESC_RESERVED' in
the low 1MiB range, so the reserved memory in the low 1MiB is not passed to the
kdump kernel, which could cause problems such as the SME or PCI MMCONFIG issues.

> memory below 1MiB as unconditionally reserved. I don't properly
> understand the IORES_DESC_RESERVED flag, and how that differs from
I found three commits about the 'IORES_DESC_RESERVED' flag; I hope this helps.
1. ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED")
2. 5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination")
3. 980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table")

> flags. So please test my suggestions to verify the code works as
> expected.
>
I have tested the two changes that you mentioned; please refer to the replies above.

Thanks.
Lianbo

> Eric
>

2019-10-17 12:37:35

by Eric W. Biederman

Subject: Re: [PATCH 3/3 v3] x86/kdump: clean up all the code related to the backup region

lijiang <[email protected]> writes:

> On 2019-10-15 19:04, Eric W. Biederman wrote:
>> lijiang <[email protected]> writes:
>>
>>> On 2019-10-13 11:54, Eric W. Biederman wrote:
>>>> Dave Young <[email protected]> writes:
>>>>
>>>>> Hi Eric,
>>>>>
>>>>> On 10/12/19 at 06:26am, Eric W. Biederman wrote:
>>>>>> Lianbo Jiang <[email protected]> writes:
>>>>>>
>>>>>>> When the crashkernel kernel command line option is specified, the
>>>>>>> low 1MiB memory will always be reserved, which ensures that the memory
>>>>>>> allocated later won't fall into the low 1MiB area, thereby, it's not
>>>>>>> necessary to create a backup region and also no need to copy the first
>>>>>>> 640k content to a backup region.
>>>>>>>
>>>>>>> Currently, the code related to the backup region can be safely removed,
>>>>>>> so lets clean up.
>>>>>>>
>>>>>>> Signed-off-by: Lianbo Jiang <[email protected]>
>>>>>>> ---
>>>>>>
>>>>>>> diff --git a/arch/x86/kernel/crash.c b/arch/x86/kernel/crash.c
>>>>>>> index eb651fbde92a..cc5774fc84c0 100644
>>>>>>> --- a/arch/x86/kernel/crash.c
>>>>>>> +++ b/arch/x86/kernel/crash.c
>>>>>>> @@ -173,8 +173,6 @@ void native_machine_crash_shutdown(struct pt_regs *regs)
>>>>>>>
>>>>>>> #ifdef CONFIG_KEXEC_FILE
>>>>>>>
>>>>>>> -static unsigned long crash_zero_bytes;
>>>>>>> -
>>>>>>> static int get_nr_ram_ranges_callback(struct resource *res, void *arg)
>>>>>>> {
>>>>>>> unsigned int *nr_ranges = arg;
>>>>>>> @@ -234,9 +232,15 @@ static int prepare_elf64_ram_headers_callback(struct resource *res, void *arg)
>>>>>>> {
>>>>>>> struct crash_mem *cmem = arg;
>>>>>>>
>>>>>>> - cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> - cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> - cmem->nr_ranges++;
>>>>>>> + if (res->start >= SZ_1M) {
>>>>>>> + cmem->ranges[cmem->nr_ranges].start = res->start;
>>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> + cmem->nr_ranges++;
>>>>>>> + } else if (res->end > SZ_1M) {
>>>>>>> + cmem->ranges[cmem->nr_ranges].start = SZ_1M;
>>>>>>> + cmem->ranges[cmem->nr_ranges].end = res->end;
>>>>>>> + cmem->nr_ranges++;
>>>>>>> + }
>>>>>>
>>>>>> What is going on with this chunk? I can guess but this needs a clear
>>>>>> comment.
>>>>>
>>>>> Indeed it needs some code comment, this is based on some offline
>>>>> discussion. cat /proc/vmcore will give a warning because ioremap is
>>>>> mapping the system ram.
>>>>>
>>>>> We pass the first 1M to the kdump kernel in e820 as system ram so that
>>>>> the 2nd kernel can use the low 1M memory, for example for the
>>>>> trampoline code.
>>>>>
>>>>>>
>>>>>>>
>>>>>>> return 0;
>>>>>>> }
>>>>>>
>>>>>>> @@ -356,9 +337,12 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
>>>>>>> memset(&cmd, 0, sizeof(struct crash_memmap_data));
>>>>>>> cmd.params = params;
>>>>>>>
>>>>>>> - /* Add first 640K segment */
>>>>>>> - ei.addr = image->arch.backup_src_start;
>>>>>>> - ei.size = image->arch.backup_src_sz;
>>>>>>> + /*
>>>>>>> + * Add the low memory range[0x1000, SZ_1M], skip
>>>>>>> + * the first zero page.
>>>>>>> + */
>>>>>>> + ei.addr = PAGE_SIZE;
>>>>>>> + ei.size = SZ_1M - PAGE_SIZE;
>>>>>>> ei.type = E820_TYPE_RAM;
>>>>>>> add_e820_entry(params, &ei);
>>>>>>
>>>>>> Likewise here. Why do we need a special case?
>>>>>> Why the magic with PAGE_SIZE?
>>>>>
>>>>> Good catch, the zero page part is useless, I think no other special
>>>>> reason, just assumed zero page is not usable, but it should be ok to
>>>>> remove the special handling, just pass 0 - 1M is good enough.
>>>>
>>>> But if we have stopped special casing the low 1M, why do we need a
>>>> special case here at all?
>>>>
>>> Here we need to pass the low memory range to the kdump kernel, which
>>> guarantees that low memory is available in the kdump kernel; otherwise, the
>>> kdump kernel won't use the low memory region.
>>>
>>>> If you need the special case it is almost certainly wrong to say you
>>>> have ram above 640KiB and below 1MiB. That is the legacy ROM and video
>>>> MMIO area.
>>>>
>>>> There is a reason the original code said 640KiB.
>>>>
>>> Do you mean that the 640k region is good enough here instead of 1MiB?
>>
>> Reading through the code of crash_setup_memmap_entries I see what
>> the code is doing now. The code is repeating the e820 memory map with
>> the memory areas that were not reserved for the crash kernel removed.
>>
>> In which case the code needs to be doing something like:
>>
>> cmd.type = E820_TYPE_RAM;
>> flags = IORESOURCE_MEM;
>> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, 1024*1024, &cmd,
>> memmap_entry_callback);
>>
> The above code does not produce the result we expected: it returns the reserved
> memory marked as 'IORES_DESC_RESERVED' in the low 1MiB range, not the system RAM.
>
> As a result, the kdump kernel panicked as follows:
> ......
> [ 3.555662] Kernel panic - not syncing: Real mode trampoline was not allocated
> [ 3.556660] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.4.0-rc3+ #4
> [ 3.556660] Hardware name: AMD Corporation Speedway/Speedway, BIOS RSW1009C 07/27/2018
> [ 3.556660] Call Trace:
> [ 3.556660] dump_stack+0x46/0x60
> [ 3.556660] panic+0xfb/0x2d7
> [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f
> [ 3.556660] init_real_mode+0x27/0x1fa
> [ 3.556660] ? hv_init_spinlocks+0x7f/0x7f
> [ 3.556660] ? do_one_initcall+0x46/0x1e4
> [ 3.556660] ? proc_register+0xd0/0x130
> [ 3.556660] ? kernel_init_freeable+0xe2/0x242
> [ 3.556660] ? rest_init+0xaa/0xaa
> [ 3.556660] ? kernel_init+0xa/0x106
> [ 3.556660] ? ret_from_fork+0x22/0x40
> [ 3.556660] Rebooting in 10 seconds..
> [ 3.556660] ACPI MEMORY or I/O RESET_REG.
>
> I modified the above code and tested it. The modified version finds the system
> RAM in the low 1MiB range, and it worked well.
>
> @@ -356,11 +338,11 @@ int crash_setup_memmap_entries(struct kimage *image, struct boot_params *params)
> memset(&cmd, 0, sizeof(struct crash_memmap_data));
> cmd.params = params;
>
> + /* Add the low 1MiB */
> + cmd.type = E820_TYPE_RAM;
> + flags = IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY;
> + walk_iomem_res_desc(IORES_DESC_NONE, flags, 0, 1024*1024 - 1, &cmd,
> + memmap_entry_callback);
>

That looks like a very reasonable fix.

>> Depending on which bugs exist it might make sense to limit this to
>> the low 640KiB. But finding something the kernel already recognizes
>> as RAM should prevent most of those problems already. Barring bugs
>> I admit it doesn't make sense to repeat the work that someone else
>> has already done.
>>
>> This bit:
>> /* Add e820 reserved ranges */
>> cmd.type = E820_TYPE_RESERVED;
>> flags = IORESOURCE_MEM;
>> walk_iomem_res_desc(IORES_DESC_RESERVED, flags, 0, -1, &cmd,
>> memmap_entry_callback);
>>
>> Should probably start at 1MiB instead of 0. Just so we don't report the
> If so, it cannot find the reserved memory marked as 'IORES_DESC_RESERVED' in
> the low 1MiB range, so the reserved memory in the low 1MiB is not passed to the
> kdump kernel, which could cause problems such as the SME or PCI MMCONFIG issues.

Good point. For some reason I was thinking IORESOURCE_MEM and
IORESOURCE_SYSTEM_RAM were the same thing. It has been way too long
since I have been in that part of the code.

So yes let's leave that part alone.
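
For reference, a minimal sketch of how the two flags relate. The bit values
are the ones in include/linux/ioport.h; the helper name model_flags_match is
hypothetical and only models the "all requested bits must be set" test that,
as far as I can tell, the resource walk applies.

```c
#include <assert.h>
#include <stdbool.h>

/* Flag values as defined in include/linux/ioport.h. */
#define IORESOURCE_MEM        0x00000200UL
#define IORESOURCE_SYSRAM     0x01000000UL
#define IORESOURCE_BUSY       0x80000000UL
#define IORESOURCE_SYSTEM_RAM (IORESOURCE_MEM | IORESOURCE_SYSRAM)

/*
 * Hypothetical helper: a resource matches a query when every flag bit
 * the query asks for is also set on the resource.
 */
static bool model_flags_match(unsigned long res_flags, unsigned long wanted)
{
	assert(wanted != 0);
	return (res_flags & wanted) == wanted;
}
```

Under this model, a query for IORESOURCE_MEM matches a System RAM resource
(RAM carries the MEM bit plus the SYSRAM bit), so it is the IORES_DESC_RESERVED
descriptor that keeps RAM out of the reserved walk. A query for
IORESOURCE_SYSTEM_RAM | IORESOURCE_BUSY does not match a plain IORESOURCE_MEM
resource, which is why the reserved ranges below 1MiB are only reported if the
reserved walk keeps starting at 0.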

>> memory below 1MiB as unconditionally reserved. I don't properly
>> understand the IORES_DESC_RESERVED flag, and how that differs from
> I found three commits about the 'IORES_DESC_RESERVED' flag; I hope this helps.
> 1. ae9e13d621d6 ("x86/e820, ioport: Add a new I/O resource descriptor IORES_DESC_RESERVED")
> 2. 5da04cc86d12 ("x86/mm: Rework ioremap resource mapping determination")
> 3. 980621daf368 ("x86/crash: Add e820 reserved ranges to kdump kernel's e820 table")
>
>> flags. So please test my suggestions to verify the code works as
>> expected.
>>
> I have tested the two changes that you mentioned; please refer to the
> replies above.

Thank you. It looks like you have figured out how these things should
work.

Eric