2010-04-22 16:23:41

by Vitaly Mayatskih

Subject: [PATCH 0/5] Add second memory region for crash kernel

Patch applies to 2.6.34-rc5

On the x86 platform, even if the hardware is 64-bit capable, the kernel
starts execution in 32-bit mode. When a system is kdump-enabled, the
crashed kernel switches to 32-bit mode and jumps into the new kernel.
This automatically limits the location of the dump-capture kernel image
and its initrd to the first 4 GB of memory. Switching to 32-bit mode is
performed by the purgatory code, which has relocations of type
R_X86_64_32S (32-bit signed), and this cuts the "good" address space for
the crash kernel down to 2 GB. I/O regions may cut this space down further.
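
To make the 2 GB figure concrete, here is a minimal sketch of the range a 32-bit signed relocation can express. It assumes only the standard definition of R_X86_64_32S (the value must survive sign extension from 32 to 64 bits); fits_in_32s() is a hypothetical helper written for this illustration and is not part of purgatory or kexec-tools.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical helper: a 64-bit value can be stored in a 32-bit signed
 * relocation field only if sign-extending the low 32 bits reproduces it.
 * That leaves two windows: [0, 2 GiB) and the top 2 GiB of the 64-bit
 * space. Physical load addresses can only use the first window.
 */
static bool fits_in_32s(uint64_t addr)
{
	return addr <= 0x7fffffffULL || addr >= 0xffffffff80000000ULL;
}
```

Under this check an image loaded at 1 GB is reachable, while anything at or above 0x80000000 (2 GB) is not, even though it is still below 4 GB.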

When a system has a lot of memory (hundreds of gigabytes), the
dump-capture kernel also needs a relatively large amount of memory to
account for the old kernel's pages. It may be impossible to reserve enough
memory below 2 GB or even 4 GB. The simplest solution is to break the
dump-capture kernel's reserved memory region into two pieces: the first
(small) region, for the kernel and initrd images, can easily be placed in
the "good" address space at the beginning of physical memory, and the
second region may be located anywhere.

This patch series implements this approach. It also requires changes
to the kexec utility to make the feature work, but it is
backward-compatible: old versions of kexec will work with the new
kernel. I will post the kexec-tools patch upstream separately.

Signed-off-by: Vitaly Mayatskikh <[email protected]>

Documentation/kdump/kdump.txt | 40 ++++++++
Documentation/kernel-parameters.txt | 19 +++-
arch/x86/kernel/setup.c | 56 +++++++----
include/linux/kexec.h | 6 +
kernel/kexec.c | 182 ++++++++++++++++++++++++++---------
5 files changed, 232 insertions(+), 71 deletions(-)


2010-04-22 16:23:45

by Vitaly Mayatskih

Subject: [PATCH 1/5] Introduce second memory resource for crash kernel

Currently the crash kernel uses only one memory region (described by
struct resource). When this region gets large enough, it may become
a problem to fit it into a valid address range.

This patch introduces a second memory region, which may also be used by
the crash kernel. The first region can be small enough to hold only the
kernel and initrd images at low addresses, and the second region may be
placed almost anywhere.

The second memory resource has a different name so as not to confuse
existing userspace utilities, like kexec.

Signed-off-by: Vitaly Mayatskikh <[email protected]>
---
include/linux/kexec.h | 1 +
kernel/kexec.c | 11 ++++++++++-
2 files changed, 11 insertions(+), 1 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 03e8e8d..1a3b0a3 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -198,6 +198,7 @@ extern struct kimage *kexec_crash_image;
/* Location of a reserved region to hold the crash kernel.
*/
extern struct resource crashk_res;
+extern struct resource crashk_res_hi;
typedef u32 note_buf_t[KEXEC_NOTE_BYTES/4];
extern note_buf_t __percpu *crash_notes;
extern u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 87ebe8a..1bd0199 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -49,7 +49,7 @@ u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
size_t vmcoreinfo_size;
size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);

-/* Location of the reserved area for the crash kernel */
+/* Location of the reserved area for the crash kernel in low memory */
struct resource crashk_res = {
.name = "Crash kernel",
.start = 0,
@@ -57,6 +57,14 @@ struct resource crashk_res = {
.flags = IORESOURCE_BUSY | IORESOURCE_MEM
};

+/* Location of the reserved area for the crash kernel in high memory */
+struct resource crashk_res_hi = {
+ .name = "Crash high memory",
+ .start = 0,
+ .end = 0,
+ .flags = IORESOURCE_BUSY | IORESOURCE_MEM
+};
+
int kexec_should_crash(struct task_struct *p)
{
if (in_interrupt() || !p->pid || is_global_init(p) || panic_on_oops)
@@ -1092,6 +1100,7 @@ size_t crash_get_memory_size(void)
size_t size;
mutex_lock(&kexec_mutex);
size = crashk_res.end - crashk_res.start + 1;
+ size += crashk_res_hi.end - crashk_res_hi.start + 1;
mutex_unlock(&kexec_mutex);
return size;
}
--
1.7.0.1

2010-04-22 16:23:48

by Vitaly Mayatskih

Subject: [PATCH 2/5] Modify parse_crashkernel* for new syntax

The crashkernel= syntax of the kernel command line is extended to allow
reservation of two memory regions for the dump-capture kernel.

The syntax for the simple case changes from

crashkernel=size[@offset]

to

crashkernel=<low>/<high>

where <low> and <high> are memory regions for the dump-capture kernel in
the usual crashkernel format (size[@offset]).

The crashkernel syntax for conditional reservation based on memory
size changes from

crashkernel=<range1>:<size1>[,<range2>:<size2>,...][@offset]

to

crashkernel=<range1>:<low_size1>[/<high_size1>]
[,<range2>:<low_size2>[/<high_size2>],...]
[@low_offset][/high_offset]

New syntax is backward compatible.
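
As an illustration of the simple form, the following userspace sketch parses size[@offset][/size_hi[@offset_hi]]. It is not the kernel code: parse_mem() is a simplified stand-in for the kernel's memparse(), and struct region, parse_region() and parse_simple() are names made up for this example. Unlike the patch, the sketch also rejects an empty value after '@'.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

struct region {
	unsigned long long size;
	unsigned long long base;	/* 0 means "pick an address automatically" */
};

/* Simplified stand-in for memparse(): a number with an optional K/M/G suffix. */
static unsigned long long parse_mem(const char *s, char **end)
{
	unsigned long long v = strtoull(s, end, 0);

	if (*end == s)		/* no digits consumed */
		return 0;

	switch (**end) {
	case 'G': case 'g':
		v <<= 10;
		/* fall through */
	case 'M': case 'm':
		v <<= 10;
		/* fall through */
	case 'K': case 'k':
		v <<= 10;
		(*end)++;
		break;
	default:
		break;
	}
	return v;
}

/* Parse one "size[@offset]" item; return the first unparsed char or NULL. */
static const char *parse_region(const char *s, struct region *r)
{
	char *cur;

	r->base = 0;
	r->size = parse_mem(s, &cur);
	if (cur == s)
		return NULL;

	if (*cur == '@') {
		const char *off = cur + 1;

		r->base = parse_mem(off, &cur);
		if (cur == off)
			return NULL;
	}
	return cur;
}

/* Parse "low[/high]"; an old-style string simply leaves high empty. */
static int parse_simple(const char *s, struct region *low, struct region *high)
{
	const char *cur = parse_region(s, low);

	high->size = 0;
	high->base = 0;
	if (!cur)
		return -1;
	if (*cur == '/' && !parse_region(cur + 1, high))
		return -1;
	return 0;
}
```

With this sketch, "64M/1G@4G" yields a 64M low region at an automatic address plus a 1G high region at 4G, while an old-style "128M@16M" fills only the low region, which is what backward compatibility means here.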

Signed-off-by: Vitaly Mayatskikh <[email protected]>
---
include/linux/kexec.h | 5 ++
kernel/kexec.c | 116 +++++++++++++++++++++++++++++++++++++------------
2 files changed, 93 insertions(+), 28 deletions(-)

diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 1a3b0a3..d2063f8 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -207,6 +207,11 @@ extern size_t vmcoreinfo_max_size;

int __init parse_crashkernel(char *cmdline, unsigned long long system_ram,
unsigned long long *crash_size, unsigned long long *crash_base);
+int __init parse_crashkernel_ext(char *cmdline, unsigned long long system_ram,
+ unsigned long long *crash_size,
+ unsigned long long *crash_base,
+ unsigned long long *crash_size_hi,
+ unsigned long long *crash_base_hi);
int crash_shrink_memory(unsigned long new_size);
size_t crash_get_memory_size(void);

diff --git a/kernel/kexec.c b/kernel/kexec.c
index 1bd0199..b8fd6eb 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1229,23 +1229,42 @@ module_init(crash_notes_memory_init)
*/


+static char * __init parse_crashkernel_region(char *cmdline,
+ unsigned long long *crash_size,
+ unsigned long long *crash_base)
+{
+ char *cur = cmdline;
+
+ *crash_size = memparse(cmdline, &cur);
+ if (cmdline == cur) {
+ pr_warning("crashkernel: memory value expected\n");
+ return 0;
+ }
+
+ if (*cur == '@')
+ *crash_base = memparse(cur + 1, &cur);
+ return cur;
+}
+
/*
* This function parses command lines in the format
*
- * crashkernel=ramsize-range:size[,...][@offset]
+ * crashkernel=ramsize-range:size[/size2][,...][@offset][/offset2]
*
* The function returns 0 on success and -EINVAL on failure.
*/
-static int __init parse_crashkernel_mem(char *cmdline,
+static int __init parse_crashkernel_mem(char *cmdline,
unsigned long long system_ram,
unsigned long long *crash_size,
- unsigned long long *crash_base)
+ unsigned long long *crash_base,
+ unsigned long long *crash_size_hi,
+ unsigned long long *crash_base_hi)
{
char *cur = cmdline, *tmp;

/* for each entry of the comma-separated list */
do {
- unsigned long long start, end = ULLONG_MAX, size;
+ unsigned long long start, end = ULLONG_MAX, size, size_hi = 0;

/* get the start of the range */
start = memparse(cur, &tmp);
@@ -1287,6 +1306,17 @@ static int __init parse_crashkernel_mem(char *cmdline,
return -EINVAL;
}
cur = tmp;
+
+ if (*cur == '/') {
+ cur++;
+ size_hi = memparse(cur, &tmp);
+ if (cur == tmp) {
+ pr_warning("Memory value expected\n");
+ return -EINVAL;
+ }
+ cur = tmp;
+ }
+
if (size >= system_ram) {
pr_warning("crashkernel: invalid size\n");
return -EINVAL;
@@ -1295,6 +1325,8 @@ static int __init parse_crashkernel_mem(char *cmdline,
/* match ? */
if (system_ram >= start && system_ram < end) {
*crash_size = size;
+ if (crash_size_hi)
+ *crash_size_hi = size_hi;
break;
}
} while (*cur++ == ',');
@@ -1310,6 +1342,17 @@ static int __init parse_crashkernel_mem(char *cmdline,
"after '@'\n");
return -EINVAL;
}
+ cur = tmp;
+ if (*cur == '/') {
+ cur++;
+ if (crash_base_hi)
+ *crash_base_hi = memparse(cur, &tmp);
+ if (cur == tmp) {
+ pr_warning("Memory value expected "
+ "after '/'\n");
+ return -EINVAL;
+ }
+ }
}
}

@@ -1319,43 +1362,46 @@ static int __init parse_crashkernel_mem(char *cmdline,
/*
* That function parses "simple" (old) crashkernel command lines like
*
- * crashkernel=size[@offset]
+ * crashkernel=size[@offset][/size_hi][@offset_hi]
*
* It returns 0 on success and -EINVAL on failure.
*/
-static int __init parse_crashkernel_simple(char *cmdline,
- unsigned long long *crash_size,
- unsigned long long *crash_base)
+static int __init parse_crashkernel_simple(char *cmdline,
+ unsigned long long *crash_size,
+ unsigned long long *crash_base,
+ unsigned long long *crash_size_hi,
+ unsigned long long *crash_base_hi)
{
- char *cur = cmdline;
+ char *cur = parse_crashkernel_region(cmdline, crash_size, crash_base);

- *crash_size = memparse(cmdline, &cur);
- if (cmdline == cur) {
- pr_warning("crashkernel: memory value expected\n");
+ if (!cur) {
return -EINVAL;
+ } else if (*cur == '/' && crash_size_hi && crash_base_hi) {
+ cur = parse_crashkernel_region(cur + 1, crash_size_hi,
+ crash_base_hi);
+ if (!cur)
+ return -EINVAL;
}
-
- if (*cur == '@')
- *crash_base = memparse(cur+1, &cur);
-
return 0;
}

-/*
- * That function is the entry point for command line parsing and should be
- * called from the arch-specific code.
- */
-int __init parse_crashkernel(char *cmdline,
- unsigned long long system_ram,
- unsigned long long *crash_size,
- unsigned long long *crash_base)
+int __init parse_crashkernel_ext(char *cmdline,
+ unsigned long long system_ram,
+ unsigned long long *crash_size,
+ unsigned long long *crash_base,
+ unsigned long long *crash_size_hi,
+ unsigned long long *crash_base_hi)
{
- char *p = cmdline, *ck_cmdline = NULL;
+ char *p = cmdline, *ck_cmdline = NULL;
char *first_colon, *first_space;

BUG_ON(!crash_size || !crash_base);
*crash_size = 0;
*crash_base = 0;
+ if (crash_size_hi)
+ *crash_size_hi = 0;
+ if (crash_base_hi)
+ *crash_base_hi = 0;

/* find crashkernel and use the last one if there are more */
p = strstr(p, "crashkernel=");
@@ -1377,15 +1423,29 @@ int __init parse_crashkernel(char *cmdline,
first_space = strchr(ck_cmdline, ' ');
if (first_colon && (!first_space || first_colon < first_space))
return parse_crashkernel_mem(ck_cmdline, system_ram,
- crash_size, crash_base);
+ crash_size, crash_base,
+ crash_size_hi, crash_base_hi);
else
return parse_crashkernel_simple(ck_cmdline, crash_size,
- crash_base);
+ crash_base, crash_size_hi,
+ crash_base_hi);

return 0;
}

-
+/*
+ * That function is the entry point for command line parsing and should be
+ * called from the arch-specific code.
+ */
+int __init parse_crashkernel(char *cmdline,
+ unsigned long long system_ram,
+ unsigned long long *crash_size,
+ unsigned long long *crash_base)
+{
+ return parse_crashkernel_ext(cmdline, system_ram,
+ crash_size, crash_base,
+ 0, 0);
+}

void crash_save_vmcoreinfo(void)
{
--
1.7.0.1

2010-04-22 16:23:55

by Vitaly Mayatskih

Subject: [PATCH 3/5] Support second memory region in crash_shrink_memory()

This patch changes crash_shrink_memory() to also work with the previously
added memory region. When a shrink occurs, the second region is shrunk first.
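
The ordering can be sketched in isolation. The helper below is a hypothetical userspace model, not the patch itself: struct shrink_plan and plan_shrink() are invented names, and page rounding and the actual freeing are left out. It only shows how a requested total is split between the two regions.

```c
#include <assert.h>
#include <stddef.h>

struct shrink_plan {
	unsigned long long low;		/* bytes kept in the low region */
	unsigned long long high;	/* bytes kept in the high region */
};

/*
 * Hypothetical model of the shrink policy: the high region gives up
 * memory first, and the low region is touched only once the high
 * region has been released completely. Growing is not allowed.
 */
static int plan_shrink(unsigned long long low_size,
		       unsigned long long high_size,
		       unsigned long long new_size,
		       struct shrink_plan *plan)
{
	assert(plan != NULL);

	if (new_size > low_size + high_size)
		return -1;

	if (new_size < low_size) {
		plan->low = new_size;
		plan->high = 0;
	} else {
		plan->low = low_size;
		plan->high = new_size - low_size;
	}
	return 0;
}
```

For a 64M low region and a 1G high region, a request for 512M keeps all of the low region and 448M of the high one, and a request for 32M drops the high region entirely.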

Signed-off-by: Vitaly Mayatskikh <[email protected]>
---
kernel/kexec.c | 55 ++++++++++++++++++++++++++++++++++++++++---------------
1 files changed, 40 insertions(+), 15 deletions(-)

diff --git a/kernel/kexec.c b/kernel/kexec.c
index b8fd6eb..dfaa01e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1117,10 +1117,36 @@ static void free_reserved_phys_range(unsigned long begin, unsigned long end)
}
}

+int crash_shrink_region(struct resource *crashk, unsigned long new_size)
+{
+ unsigned long start, end, size;
+
+ start = crashk->start;
+ end = crashk->end;
+ size = end - start + 1;
+
+ if (!size || new_size == size) /* Nothing to free */
+ return 0;
+
+ if (new_size > size)
+ return -EINVAL;
+
+ start = roundup(start, PAGE_SIZE);
+ end = roundup(start + new_size, PAGE_SIZE);
+
+ free_reserved_phys_range(end, crashk->end);
+
+ if (start == end)
+ release_resource(crashk);
+ crashk->end = end - 1;
+
+ return 0;
+}
+
int crash_shrink_memory(unsigned long new_size)
{
int ret = 0;
- unsigned long start, end;
+ unsigned long crash_size, low_size;

mutex_lock(&kexec_mutex);

@@ -1128,26 +1154,25 @@ int crash_shrink_memory(unsigned long new_size)
ret = -ENOENT;
goto unlock;
}
- start = crashk_res.start;
- end = crashk_res.end;

- if (new_size >= end - start + 1) {
+ crash_size = low_size = crashk_res.end - crashk_res.start + 1;
+ crash_size += crashk_res_hi.end - crashk_res_hi.start + 1;
+
+ if (crash_size == new_size)
+ goto unlock;
+ if (crash_size < new_size) {
ret = -EINVAL;
- if (new_size == end - start + 1)
- ret = 0;
goto unlock;
}

- start = roundup(start, PAGE_SIZE);
- end = roundup(start + new_size, PAGE_SIZE);
-
- free_reserved_phys_range(end, crashk_res.end);
-
- if (start == end) {
- crashk_res.end = end;
- release_resource(&crashk_res);
+ if (new_size < low_size) {
+ /* Reap crashk_res_hi */
+ ret = crash_shrink_region(&crashk_res_hi, 0);
+ if (ret)
+ goto unlock;
+ ret = crash_shrink_region(&crashk_res, new_size);
} else
- crashk_res.end = end - 1;
+ ret = crash_shrink_region(&crashk_res_hi, new_size - low_size);

unlock:
mutex_unlock(&kexec_mutex);
--
1.7.0.1

2010-04-22 16:24:01

by Vitaly Mayatskih

Subject: [PATCH 5/5] kexec: update documentation

Mention new crashkernel= syntax in documentation.

Signed-off-by: Vitaly Mayatskikh <[email protected]>
---
Documentation/kdump/kdump.txt | 40 +++++++++++++++++++++++++++++++++++
Documentation/kernel-parameters.txt | 19 +++++++++++-----
2 files changed, 53 insertions(+), 6 deletions(-)

diff --git a/Documentation/kdump/kdump.txt b/Documentation/kdump/kdump.txt
index cab61d8..9f93d17 100644
--- a/Documentation/kdump/kdump.txt
+++ b/Documentation/kdump/kdump.txt
@@ -266,7 +266,47 @@ This would mean:
2) if the RAM size is between 512M and 2G (exclusive), then reserve 64M
3) if the RAM size is larger than 2G, then reserve 128M

+Avoiding memory reservation problem on large systems
+====================================================

+For large systems with a huge amount of memory, the dump-capture kernel
+requires more memory to properly handle the old kernel's pages. However,
+this raises issues with hardware-dependent limitations on some platforms.
+For example, on an x86-64 system the kernel and initrd still have to be
+placed in the first 2 gigabytes, because the kernel starts executing in
+32-bit mode and the kdump purgatory code can jump only to 32-bit signed
+addresses. This limitation is a real problem when the dump-capture region
+is large and cannot fit in the good area. For such cases it is possible to
+use a special crashkernel syntax:
+
+ crashkernel=<low>/<high>
+
+<low> and <high> are memory regions for dump-capture kernel in usual
+crashkernel format (size@offset). For example:
+
+ crashkernel=64M/1G@4G
+
+This would mean to allocate 64M of memory at the lowest valid address
+and to allocate 1G at physical address 4G.
+
+New syntax for extended format (in case of memory dependent
+reservation):
+
+ crashkernel=<range1>:<low_size1>[/<high_size1>]
+ [,<range2>:<low_size2>[/high_size2],...]
+ [@low_offset][/high_offset]
+ range=start-[end]
+
+For example:
+
+ crashkernel=2G-32G:256M,32G-:256M/1G@0/8G
+
+This would mean:
+
+ 1) if the RAM is smaller than 2G, then don't reserve anything
+ 2) if the RAM size is between 2G and 32G (exclusive), then reserve 256M
+ 3) if the RAM size is larger than 32G, then reserve 256M at first suitable
+ address (offset 0 means automatically) and reserve 1G at address 8G

Boot into System Kernel
=======================
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index e2202e9..5e9f234 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -568,16 +568,23 @@ and is between 256 and 4096 characters. It is defined in the file
Format:
<first_slot>,<last_slot>,<port>,<enum_bit>[,<debug>]

- crashkernel=nn[KMG]@ss[KMG]
- [KNL] Reserve a chunk of physical memory to
- hold a kernel to switch to with kexec on panic.
-
- crashkernel=range1:size1[,range2:size2,...][@offset]
- [KNL] Same as above, but depends on the memory
+ crashkernel= [KNL]
+ nn[KMG]@ss[KMG]
+ Reserve a chunk of physical memory to hold a
+ kernel to switch to with kexec on panic.
+ nn1[KMG]@ss1[KMG]/nn2[KMG]@ss2[KMG]
+ Same as above, but reserve 2 chunks of
+ physical memory.
+
+ crashkernel= [KNL]
+ range1:size1[,range2:size2,...][@offset]
+ Same as above, but depends on the memory
in the running system. The syntax of range is
start-[end] where start and end are both
a memory unit (amount[KMG]). See also
Documentation/kdump/kdump.txt for a example.
+ range1:size1lo/size1hi[,range2:size2lo/size2hi,...][@offset_lo][/offset_hi]
+ Same as above, but reserve 2 chunks of memory.

cs89x0_dma= [HW,NET]
Format: <dma>
--
1.7.0.1

2010-04-22 16:24:26

by Vitaly Mayatskih

Subject: [PATCH 4/5] x86: use second memory region for dump-capture kernel

This patch adds support for the second memory region to kexec on the x86
platform.

Signed-off-by: Vitaly Mayatskikh <[email protected]>
---
arch/x86/kernel/setup.c | 56 +++++++++++++++++++++++++++++-----------------
1 files changed, 35 insertions(+), 21 deletions(-)

diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
index c4851ef..9b395bb 100644
--- a/arch/x86/kernel/setup.c
+++ b/arch/x86/kernel/setup.c
@@ -501,19 +501,11 @@ static inline unsigned long long get_total_mem(void)
return total << PAGE_SHIFT;
}

-static void __init reserve_crashkernel(void)
+static int __init reserve_crashkernel_region(char *region_name,
+ struct resource *crashk,
+ unsigned long long crash_size,
+ unsigned long long crash_base)
{
- unsigned long long total_mem;
- unsigned long long crash_size, crash_base;
- int ret;
-
- total_mem = get_total_mem();
-
- ret = parse_crashkernel(boot_command_line, total_mem,
- &crash_size, &crash_base);
- if (ret != 0 || crash_size <= 0)
- return;
-
/* 0 means: find the address automatically */
if (crash_base <= 0) {
const unsigned long long alignment = 16<<20; /* 16M */
@@ -522,7 +514,7 @@ static void __init reserve_crashkernel(void)
alignment);
if (crash_base == -1ULL) {
pr_info("crashkernel reservation failed - No suitable area found.\n");
- return;
+ return -EINVAL;
}
} else {
unsigned long long start;
@@ -531,20 +523,42 @@ static void __init reserve_crashkernel(void)
1<<20);
if (start != crash_base) {
pr_info("crashkernel reservation failed - memory is in use.\n");
- return;
+ return -EINVAL;
}
}
- reserve_early(crash_base, crash_base + crash_size, "CRASH KERNEL");
+ reserve_early(crash_base, crash_base + crash_size, region_name);

printk(KERN_INFO "Reserving %ldMB of memory at %ldMB "
- "for crashkernel (System RAM: %ldMB)\n",
+ "for crashkernel\n",
(unsigned long)(crash_size >> 20),
- (unsigned long)(crash_base >> 20),
- (unsigned long)(total_mem >> 20));
+ (unsigned long)(crash_base >> 20));
+
+ crashk->start = crash_base;
+ crashk->end = crash_base + crash_size - 1;
+ insert_resource(&iomem_resource, crashk);
+ return 0;
+}
+
+static void __init reserve_crashkernel(void)
+{
+ unsigned long long total_mem;
+ unsigned long long crash_size, crash_base;
+ unsigned long long crash_size_hi, crash_base_hi;
+ int ret;
+
+ total_mem = get_total_mem();
+
+ ret = parse_crashkernel_ext(boot_command_line, total_mem,
+ &crash_size, &crash_base,
+ &crash_size_hi, &crash_base_hi);
+ if (ret != 0 || crash_size <= 0)
+ return;

- crashk_res.start = crash_base;
- crashk_res.end = crash_base + crash_size - 1;
- insert_resource(&iomem_resource, &crashk_res);
+ ret = reserve_crashkernel_region("CRASH KERNEL", &crashk_res,
+ crash_size, crash_base);
+ if (ret == 0 && crash_size_hi > 0)
+ reserve_crashkernel_region("CRASH HIMEM", &crashk_res_hi,
+ crash_size_hi, crash_base_hi);
}
#else
static void __init reserve_crashkernel(void)
--
1.7.0.1

2010-04-22 22:07:24

by Eric W. Biederman

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

Vitaly Mayatskikh <[email protected]> writes:

> Patch applies to 2.6.34-rc5
>
> On x86 platform, even if hardware is 64-bit capable, kernel starts
> execution in 32-bit mode. When system is kdump-enabled, crashed kernel
> switches to 32 bit mode and jumps into new kernel. This automatically
> limits location of dump-capture kernel image and it's initrd by first
> 4Gb of memory. Switching to 32 bit mode is performed by purgatory
> code, which has relocations of type R_X86_64_32S (32-bit signed), and
> this cuts "good" address space for crash kernel down to 2 Gb. I/O
> regions may cut down this space further.
>
> When system has a lot of memory (hundreds of gigabytes), dump-capture
> kernel also needs relatively a lot of memory to account old kernel's
> pages. It may be impossible to reserve enough memory below 2 or even 4
> Gb. Simplest solution is it break dump-capture kernel's reserved
> memory region into two pieces: first (small) region for kernel and
> initrd images may be easily placed in "good" address space in the
> beginning of physical memory, and second region may be located
> anywhere.
>
> This serie of patches realizes this approach. It requires also changes
> in kexec utility to make this feature work, but is
> backward-compatible: old versions of kexec will work with new
> kernel. I will post patch to kexec-tools upstream separately.

Have you tried loading a 64bit vmlinux directly into a higher address
range? There may be a bit or two missing but you should be able to
load a linux kernel above 4GB. I tested the basics of that mechanism
when I made the 64bit relocatable kernel.

I don't buy the argument that there is a direct connection between
the amount of memory you have and how much memory it takes to dump it.
Even an indirect connection seems suspicious.

Eric

2010-04-22 22:42:00

by H. Peter Anvin

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

On 04/22/2010 03:07 PM, Eric W. Biederman wrote:
>
> Have you tried loading a 64bit vmlinux directly into a higher address
> range? There may be a bit or two missing but you should be able to
> load a linux kernel above 4GB. I tested the basics of that mechanism
> when I made the 64bit relocatable kernel.
>
> I don't buy the argument that there is a direct connection between
> the amount of memory you have and how much memory it takes to dump it.
> Even an indirect connections seems suspicious.
>

We actually have a 64-bit entry point even in bzImage; it is at offset
+0x200 from the 32-bit entry point. Right now that offset is not
exported anywhere, but it has been stable for a very long time... at
least for as far back as the decompressor has been 64 bits.

The interface to the 64-bit code is by necessity wider, since there is
no such thing as paging off in 64-bit mode, but it probably isn't *too*
hard to figure out how page tables need to be set up in order to work
properly. At that point, it would be good to document it.

-hpa

2010-04-22 22:45:48

by Vivek Goyal

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

On Thu, Apr 22, 2010 at 03:07:11PM -0700, Eric W. Biederman wrote:
> Vitaly Mayatskikh <[email protected]> writes:
>
> > Patch applies to 2.6.34-rc5
> >
> > On x86 platform, even if hardware is 64-bit capable, kernel starts
> > execution in 32-bit mode. When system is kdump-enabled, crashed kernel
> > switches to 32 bit mode and jumps into new kernel. This automatically
> > limits location of dump-capture kernel image and it's initrd by first
> > 4Gb of memory. Switching to 32 bit mode is performed by purgatory
> > code, which has relocations of type R_X86_64_32S (32-bit signed), and
> > this cuts "good" address space for crash kernel down to 2 Gb. I/O
> > regions may cut down this space further.
> >
> > When system has a lot of memory (hundreds of gigabytes), dump-capture
> > kernel also needs relatively a lot of memory to account old kernel's
> > pages. It may be impossible to reserve enough memory below 2 or even 4
> > Gb. Simplest solution is it break dump-capture kernel's reserved
> > memory region into two pieces: first (small) region for kernel and
> > initrd images may be easily placed in "good" address space in the
> > beginning of physical memory, and second region may be located
> > anywhere.
> >
> > This serie of patches realizes this approach. It requires also changes
> > in kexec utility to make this feature work, but is
> > backward-compatible: old versions of kexec will work with new
> > kernel. I will post patch to kexec-tools upstream separately.
>
> Have you tried loading a 64bit vmlinux directly into a higher address
> range? There may be a bit or two missing but you should be able to
> load a linux kernel above 4GB. I tested the basics of that mechanism
> when I made the 64bit relocatable kernel.

I guess even if it works, carrying vmlinux (instead of a relocatable
bzImage) will become an additional liability for distributions. So we will
have to find a way to make bzImage work.

>
> I don't buy the argument that there is a direct connection between
> the amount of memory you have and how much memory it takes to dump it.
> Even an indirect connections seems suspicious.

The memory requirements of user space, such as dump filtering tools, might
be of interest, though. I vaguely remember that the tool used to first
traverse all the memory pages, create some internal data structures, and
then start dumping.

So the memory required by the filtering tool might be directly proportional
to the amount of memory present in the system.

Vitaly, have you really run into cases where the 2G upper limit is a concern?
What configuration do you have, how much memory does it have, and how much
memory are you planning to reserve for the kdump kernel?

Thanks
Vivek

2010-04-23 00:49:08

by Eric W. Biederman

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

Vivek Goyal <[email protected]> writes:

> On Thu, Apr 22, 2010 at 03:07:11PM -0700, Eric W. Biederman wrote:
>> Vitaly Mayatskikh <[email protected]> writes:
>> >
>> > This serie of patches realizes this approach. It requires also changes
>> > in kexec utility to make this feature work, but is
>> > backward-compatible: old versions of kexec will work with new
>> > kernel. I will post patch to kexec-tools upstream separately.
>>
>> Have you tried loading a 64bit vmlinux directly into a higher address
>> range? There may be a bit or two missing but you should be able to
>> load a linux kernel above 4GB. I tested the basics of that mechanism
>> when I made the 64bit relocatable kernel.
>
> I guess even if it works, for distributions it will become additional
> liability to carry vmlinux (instead of relocatable bzImage). So we shall
> have to find a way to make bzImage work.

As Peter pointed out, we actually have everything we need except
a bit of documentation and the flag that says this is a 64bit kernel.

From a testing perspective a 64bit vmlinux should work today without
changes. Once it is confirmed there is a solution with the 64bit
kernel we just need a small patch to boot.txt and a few tweaks to
/sbin/kexec to handle a 64bit bzImage.

>> I don't buy the argument that there is a direct connection between
>> the amount of memory you have and how much memory it takes to dump it.
>> Even an indirect connections seems suspicious.
>
> Memory requirement by user space might be of interest though like dump
> filtering tools. I vaguely remember that it used to first traverse all
> the memory pages, create some internal data structures and then start
> dumping.
>
> So memory required by filtering tool might be directly proportional to
> amount of memory present in the system.

Assuming your dump filtering tool creates a bitmap of pages to be dumped,
you get a ratio of 32K to 1, or 3MB for 100GB and 32MB for 1TB.
That is noticeable in the worst case but definitely not enough to push
us past 2GB.
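
The arithmetic behind those figures can be checked with a short sketch. It assumes 4 KiB pages and exactly one bit per page, which is what gives the 32K to 1 ratio (4096 bytes * 8 bits); dump_bitmap_bytes() is a hypothetical helper, not a function of any dump filtering tool, and a real tool may keep more than one such bitmap.

```c
#include <assert.h>
#include <stdint.h>

#define SKETCH_PAGE_SIZE 4096ULL	/* assumption: 4 KiB pages */

/* Bytes needed for a bitmap holding one bit per page of RAM. */
static uint64_t dump_bitmap_bytes(uint64_t ram_bytes)
{
	uint64_t pages = ram_bytes / SKETCH_PAGE_SIZE;

	return (pages + 7) / 8;		/* round up to whole bytes */
}
```

For 100 GiB of RAM this gives 3276800 bytes (3.125 MiB), and for 1 TiB it gives 33554432 bytes (32 MiB), matching the estimate above.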

> Vitaly, have you really run into cases where 2G upper limit is a concern.
> What is the configuration you have, how much memory it has and how much
> memory are you planning to reserve for kdump kernel?

A good question.

Eric

2010-04-23 05:17:49

by Cong Wang

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
>> Vitaly, have you really run into cases where 2G upper limit is a concern.
>> What is the configuration you have, how much memory it has and how much
>> memory are you planning to reserve for kdump kernel?
>
> A good question.
>

We have observed that on a machine with 66G of memory, when we use
crashkernel=1G@4G, kexec fails to load the crash kernel, even though the
memory reservation _did_ succeed.

Thanks.

2010-04-23 05:42:37

by Eric W. Biederman

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

Cong Wang <[email protected]> writes:

> Eric W. Biederman wrote:
>> Vivek Goyal <[email protected]> writes:
>>
>>> Vitaly, have you really run into cases where 2G upper limit is a concern.
>>> What is the configuration you have, how much memory it has and how much
>>> memory are you planning to reserve for kdump kernel?
>>
>> A good question.
>>
>
> We have observed that on a machine which has 66G memory, when we do
> crashkernel=1G@4G, kexec failed to load the crash kernel, but the memory
> reservation _did_ succeed.

Did you try loading vmlinux? If not, this sounds like /sbin/kexec
not realizing that it can boot a 64bit bzImage in 64bit
mode.

Eric

2010-04-23 06:43:22

by Vitaly Mayatskih

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

At Thu, 22 Apr 2010 22:42:25 -0700, Eric W. Biederman wrote:

> > We have observed that on a machine which has 66G memory, when we do
> > crashkernel=1G@4G, kexec failed to load the crash kernel, but the memory
> > reservation _did_ succeed.
>
> Did you try loading vmlinux? If not this sounds like the fact that
> /sbin/kexec doesn't realize it can boot a 64bit bzImage in 64bit
> mode.

/sbin/kexec currently has hardcoded limits for the bzImage and
initrd:

include/x86/x86-linux.h:

#define DEFAULT_INITRD_ADDR_MAX 0x37FFFFFF
#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFF

This is easy to override. However, the purgatory code still wants to see
the kernel below 2 GB (32-bit signed relocations).
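
A small sketch of how such a ceiling interacts with a reserved region follows. It is only an illustration: image_fits_below() is a hypothetical helper, not a kexec-tools function, and the only value taken from the header above is 0x37FFFFFF, which is 896 MiB minus one byte.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Limit quoted from the kexec-tools header above: 896 MiB - 1. */
#define DEFAULT_BZIMAGE_ADDR_MAX 0x37FFFFFFULL

/*
 * Hypothetical helper: can an image of image_size bytes that starts at
 * region_start end at or below addr_max? Written to avoid overflow.
 */
static bool image_fits_below(uint64_t region_start, uint64_t image_size,
			     uint64_t addr_max)
{
	if (image_size == 0 || region_start > addr_max)
		return false;
	return image_size - 1 <= addr_max - region_start;
}
```

With the default limit, a region reserved at 4G can never hold the image, however large the reservation is, while a region at 16M can. Raising the limit to 0x7FFFFFFF would admit regions up to 2 GB but no further, for the relocation reason given above.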
--
wbr, Vitaly

2010-04-23 07:08:50

by Vitaly Mayatskih

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

At Thu, 22 Apr 2010 18:45:25 -0400, Vivek Goyal wrote:

> Vitaly, have you really run into cases where 2G upper limit is a concern.
> What is the configuration you have, how much memory it has and how much
> memory are you planning to reserve for kdump kernel?

I tried it on a system with 96G of RAM. When I reserved 512M for the kdump
kernel, the system stopped loading somewhere in user space. With a larger
reserved area, /sbin/kexec can't load the kernel (because of the hardcoded
limitation in /sbin/kexec). After removing this limitation, the kernel was
loaded below 2G, but the system did not even boot.

Unfortunately, I don't remember the exact details now and temporarily have
no access to that machine. I will try to get access and come back with
details.
--
wbr, Vitaly

2010-04-23 14:45:15

by Vivek Goyal

Subject: Re: [PATCH 0/5] Add second memory region for crash kernel

On Thu, Apr 22, 2010 at 05:48:53PM -0700, Eric W. Biederman wrote:
> Vivek Goyal <[email protected]> writes:
>
> > On Thu, Apr 22, 2010 at 03:07:11PM -0700, Eric W. Biederman wrote:
> >> Vitaly Mayatskikh <[email protected]> writes:
> >> >
> >> > This serie of patches realizes this approach. It requires also changes
> >> > in kexec utility to make this feature work, but is
> >> > backward-compatible: old versions of kexec will work with new
> >> > kernel. I will post patch to kexec-tools upstream separately.
> >>
> >> Have you tried loading a 64bit vmlinux directly into a higher address
> >> range? There may be a bit or two missing but you should be able to
> >> load a linux kernel above 4GB. I tested the basics of that mechanism
> >> when I made the 64bit relocatable kernel.
> >
> > I guess even if it works, for distributions it will become additional
> > liability to carry vmlinux (instead of relocatable bzImage). So we shall
> > have to find a way to make bzImage work.
>
> As Peter pointed out we actually have everything thing we need except
> a bit of documentation and the flag that says this is a 64bit kernel.
>
> From a testing perspective a 64bit vmlinux should work today without
> changes. Once it is confirmed there is a solution with the 64bit
> kernel we just need a small patch to boot.txt and a few tweaks to
> /sbin/kexec to handle a 64bit bzImage.
>

Agreed. Doing a little more testing, fixing some issues if need be, and
making the 64-bit bzImage work is a better way than splitting the reserved
memory.

Thanks
Vivek