Currently, reads from /proc/vmcore are done by read_oldmem(), which
performs an ioremap/iounmap cycle for every single page. For example,
if the dump target memory is 1GB, ioremap/iounmap is called
(1GB / 4KB) = 262144 times. This causes severe performance degradation.
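Roughly, the per-page pattern looks like the sketch below. This is a
simplified illustration only, not the actual fs/proc/vmcore.c or
arch-specific copy_oldmem_page() code; the function name and signature
here are made up for the example:

	/* illustrative only: one ioremap/iounmap round trip per 4KB page */
	static ssize_t copy_one_page_from_oldmem(char *buf, unsigned long pfn,
						 size_t csize, unsigned long offset)
	{
		void *vaddr;

		vaddr = ioremap_cache(pfn << PAGE_SHIFT, PAGE_SIZE); /* page table update */
		if (!vaddr)
			return -ENOMEM;
		memcpy(buf, vaddr + offset, csize);	/* copy at most one page */
		iounmap(vaddr);				/* unmap again: TLB flush */
		return csize;
	}

Repeating this 262144 times for a 1GB dump is where the time goes.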
In particular, the expected main user of this mmap() is makedumpfile,
which not only reads memory from /proc/vmcore but also does other
processing such as filtering, compression and I/O work. The page table
updates and the TLB flushes that follow make such processing much
slower, though I have yet to write the makedumpfile patch and confirm
how much it improves.
To address the issue, this patch set implements mmap() on /proc/vmcore
to improve read performance. My simple benchmark shows an improvement
from 200 [MiB/sec] to over 50.0 [GiB/sec].
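For reference, a dump tool could consume the new interface roughly as
in the user-space sketch below. This is a minimal sketch only; the
64MiB window size is arbitrary, and offsets and lengths must stay
page-aligned and within the vmcore size:

	/* rough user-space sketch: map a window of /proc/vmcore read-only */
	#include <fcntl.h>
	#include <stdio.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		int fd = open("/proc/vmcore", O_RDONLY);
		if (fd < 0) {
			perror("open");
			return 1;
		}

		size_t len = 64UL << 20;	/* 64MiB window at file offset 0 */
		void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* filter/compress directly from p, no extra read() copies */

		munmap(p, len);
		close(fd);
		return 0;
	}

Only PROT_READ is meaningful here, since the patch set rejects writable
and executable mappings.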
ChangeLog
=========
v2 => v3)
- Rebase onto 3.9-rc3.
- Copy program headers separately from the position given by e_phoff
  into the buffer. Now there's no risk of allocating huge memory if the
  program header table is positioned after a memory segment.
  => See PATCH 01.
- Add a cleanup patch that removes an unnecessary variable.
  => See PATCH 02.
- Fix wrong use of the variable holding the buffer size configurable at
  runtime. Instead, use the variable that holds the original buffer size.
  => See PATCH 05.
v1 => v2)
- Clean up the existing code: use e_phoff, and remove the assumption
  about PT_NOTE entries.
  => See PATCH 01, 02.
- Fix a potential bug where the ELF header size is not included in the
  exported vmcoreinfo size.
  => See Patch 03.
- Divide the patch modifying read_vmcore() into two: clean-up and the
  primary code change.
  => See Patch 9, 10.
- Put ELF note segments on page-size boundaries in the 1st kernel
  instead of copying them into a buffer in the 2nd kernel.
  => See Patch 11, 12, 13, 14, 16.
Benchmark
=========
No change from the previous patch series. See the previous results
here:
https://lkml.org/lkml/2013/2/14/89
The benchmark using the fixed makedumpfile on a 32GB memory system can
be found at:
http://lists.infradead.org/pipermail/kexec/2013-March/008300.html
TODO
====
- Benchmark on a system with terabytes of memory using the fixed
  makedumpfile.
- Fix the crash utility to support the NT_VMCORE_PAD note type. The
  utility doesn't distinguish identical note types under different note
  names, which does not conform to the ELF specification; as a result,
  the NT_VMCORE_PAD note is currently misinterpreted as
  NT_VMCORE_DEBUGINFO.
Test
====
This patch set is based on v3.9-rc3.
Tested on x86-64 and x86-32, each with 1GB and with over 4GB of memory.
---
HATAYAMA Daisuke (21):
vmcore: introduce mmap_vmcore()
vmcore: count holes generated by round-up operation for vmcore size
vmcore: round-up offset of vmcore object in page-size boundary
vmcore: check if vmcore objects satisfy mmap()'s page-size boundary requirement
vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer
kexec: fill note buffers by NT_VMCORE_PAD notes in page-size boundary
elf: introduce NT_VMCORE_PAD type
kexec, elf: introduce NT_VMCORE_DEBUGINFO note type
kexec: allocate vmcoreinfo note buffer on page-size boundary
vmcore: allocate per-cpu crash_notes objects on page-size boundary
vmcore: read buffers for vmcore objects copied from old memory
vmcore: clean up read_vmcore()
vmcore: modify vmcore clean-up function to free buffer on 2nd kernel
vmcore: copy non page-size aligned head and tail pages in 2nd kernel
vmcore, procfs: introduce a flag to distinguish objects copied in 2nd kernel
vmcore: round up buffer size of ELF headers by PAGE_SIZE
vmcore: allocate buffer for ELF headers on page-size alignment
vmcore, sysfs: export ELF note segment size instead of vmcoreinfo data size
vmcore: rearrange program headers without assuming consecutive PT_NOTE entries
vmcore: clean up by removing unnecessary variable
vmcore: reference e_phoff member explicitly to get position of program header table
arch/s390/include/asm/kexec.h | 8 -
fs/proc/vmcore.c | 595 ++++++++++++++++++++++++++++++++---------
include/linux/kexec.h | 16 +
include/linux/proc_fs.h | 8 -
include/uapi/linux/elf.h | 5
kernel/kexec.c | 47 ++-
kernel/ksysfs.c | 2
7 files changed, 522 insertions(+), 159 deletions(-)
--
Thanks.
HATAYAMA, Daisuke
Currently, the code assumes that the program header table is placed
right after the ELF header. But a future change in either kexec-tools
or the 1st kernel could break that assumption. To avoid the worst case,
reference the e_phoff member explicitly to get the file offset of the
program header table.
Signed-off-by: Zhang Yanfei <[email protected]>
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 56 +++++++++++++++++++++++++++++++++++-------------------
1 files changed, 36 insertions(+), 20 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index b870f74..163281e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -221,8 +221,8 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
Elf64_Phdr *phdr_ptr;
ehdr_ptr = (Elf64_Ehdr *)elfptr;
- phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
- size = sizeof(Elf64_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
+ phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
+ size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -238,8 +238,8 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
Elf32_Phdr *phdr_ptr;
ehdr_ptr = (Elf32_Ehdr *)elfptr;
- phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
- size = sizeof(Elf32_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
+ phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff);
+ size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -259,7 +259,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
u64 phdr_sz = 0, note_off;
ehdr_ptr = (Elf64_Ehdr *)elfptr;
- phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
+ phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
int j;
void *notes_section;
@@ -305,7 +305,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
- note_off = sizeof(Elf64_Ehdr) +
+ note_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
phdr.p_offset = note_off;
phdr.p_vaddr = phdr.p_paddr = 0;
@@ -313,14 +313,14 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
phdr.p_align = 0;
/* Add merged PT_NOTE program header*/
- tmp = elfptr + sizeof(Elf64_Ehdr);
+ tmp = elfptr + ehdr_ptr->e_phoff;
memcpy(tmp, &phdr, sizeof(phdr));
tmp += sizeof(phdr);
/* Remove unwanted PT_NOTE program headers. */
i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
*elfsz = *elfsz - i;
- memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf64_Ehdr)-sizeof(Elf64_Phdr)));
+ memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf64_Phdr)));
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -340,7 +340,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
u64 phdr_sz = 0, note_off;
ehdr_ptr = (Elf32_Ehdr *)elfptr;
- phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
+ phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
int j;
void *notes_section;
@@ -386,7 +386,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
- note_off = sizeof(Elf32_Ehdr) +
+ note_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf32_Phdr);
phdr.p_offset = note_off;
phdr.p_vaddr = phdr.p_paddr = 0;
@@ -394,14 +394,14 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
phdr.p_align = 0;
/* Add merged PT_NOTE program header*/
- tmp = elfptr + sizeof(Elf32_Ehdr);
+ tmp = elfptr + ehdr_ptr->e_phoff;
memcpy(tmp, &phdr, sizeof(phdr));
tmp += sizeof(phdr);
/* Remove unwanted PT_NOTE program headers. */
i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
*elfsz = *elfsz - i;
- memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf32_Ehdr)-sizeof(Elf32_Phdr)));
+ memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf32_Phdr)));
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -422,10 +422,10 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
struct vmcore *new;
ehdr_ptr = (Elf64_Ehdr *)elfptr;
- phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
+ phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf64_Ehdr) +
+ vmcore_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
phdr_ptr->p_memsz; /* Note sections */
@@ -459,10 +459,10 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
struct vmcore *new;
ehdr_ptr = (Elf32_Ehdr *)elfptr;
- phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
+ phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf32_Ehdr) +
+ vmcore_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
phdr_ptr->p_memsz; /* Note sections */
@@ -496,7 +496,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
ehdr_ptr = (Elf64_Ehdr *)elfptr;
/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf64_Ehdr) +
+ vmcore_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr);
list_for_each_entry(m, vc_list, list) {
@@ -516,7 +516,7 @@ static void __init set_vmcore_list_offsets_elf32(char *elfptr,
ehdr_ptr = (Elf32_Ehdr *)elfptr;
/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf32_Ehdr) +
+ vmcore_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr);
list_for_each_entry(m, vc_list, list) {
@@ -558,11 +558,19 @@ static int __init parse_crash_elf64_headers(void)
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, sizeof(Elf64_Ehdr), &addr, 0);
if (rc < 0) {
kfree(elfcorebuf);
return rc;
}
+ addr = elfcorehdr_addr + ehdr.e_phoff;
+ rc = read_from_oldmem(elfcorebuf + sizeof(Elf64_Ehdr),
+ ehdr.e_phnum * sizeof(Elf64_Phdr), &addr, 0);
+ if (rc < 0) {
+ kfree(elfcorebuf);
+ return rc;
+ }
+ ((Elf64_Ehdr *)elfcorebuf)->e_phoff = sizeof(Elf64_Ehdr);
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
@@ -613,11 +621,19 @@ static int __init parse_crash_elf32_headers(void)
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, sizeof(Elf32_Ehdr), &addr, 0);
+ if (rc < 0) {
+ kfree(elfcorebuf);
+ return rc;
+ }
+ addr = elfcorehdr_addr + ehdr.e_phoff;
+ rc = read_from_oldmem(elfcorebuf + sizeof(Elf32_Ehdr),
+ ehdr.e_phnum * sizeof(Elf32_Phdr), &addr, 0);
if (rc < 0) {
kfree(elfcorebuf);
return rc;
}
+ ((Elf32_Ehdr *)elfcorebuf)->e_phoff = sizeof(Elf32_Ehdr);
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
To satisfy mmap()'s page-size boundary requirement, round up the buffer
size of the ELF headers to PAGE_SIZE. The resulting value becomes the
offset of the ELF note segments and is assigned to the unique PT_NOTE
program header entry.
Also, the parts that assumed the previous ELF headers' size are changed
to use this new rounded-up value.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 17e2501..dd9769d 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -339,7 +339,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -352,6 +352,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
+ *elfsz = roundup(*elfsz, PAGE_SIZE);
out:
return 0;
}
@@ -447,7 +448,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = ehdr_ptr->e_phoff +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf32_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -460,6 +461,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
+ *elfsz = roundup(*elfsz, PAGE_SIZE);
out:
return 0;
}
@@ -480,9 +482,8 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
/* First program header is PT_NOTE header. */
- vmcore_off = ehdr_ptr->e_phoff +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = phdr_ptr->p_offset + roundup(phdr_ptr->p_memsz,
+ PAGE_SIZE);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
@@ -517,9 +518,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
/* First program header is PT_NOTE header. */
- vmcore_off = ehdr_ptr->e_phoff +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = phdr_ptr->p_offset + roundup(phdr_ptr->p_memsz,
+ PAGE_SIZE);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
A part of the dump target memory is copied into the 2nd kernel if it
doesn't satisfy mmap()'s page-size boundary requirement. To distinguish
such a copied object from ordinary old memory, a flag
MEM_TYPE_CURRENT_KERNEL is introduced. If this flag is set, the object
is considered to have been copied into a buffer in the 2nd kernel.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
include/linux/proc_fs.h | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 8307f2f..11dd592 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -97,11 +97,17 @@ struct kcore_list {
int type;
};
+#define MEM_TYPE_CURRENT_KERNEL 0x1
+
struct vmcore {
struct list_head list;
- unsigned long long paddr;
+ union {
+ unsigned long long paddr;
+ char *buf;
+ };
unsigned long long size;
loff_t offset;
+ unsigned int flag;
};
#ifdef CONFIG_PROC_FS
Due to mmap()'s requirement, pages that do not start or end on a
page-size aligned address need to be copied into the 2nd kernel and
mapped to user-space from there.
For example, see the map below:
00000000-00010000 : reserved
00010000-0009f800 : System RAM
0009f800-000a0000 : reserved
where the System RAM region ends at 0x9f800, which is not page-size
aligned. This region is divided into two parts:
00010000-0009f000
0009f000-0009f800
The first part is kept in old memory and the second is copied into a
buffer in the 2nd kernel.
This kind of non-page-size-aligned area can always occur because any
part of System RAM can be turned into a reserved area at runtime.
Without this copying, if non-page-size-aligned pages in old memory were
remapped directly, mmap() would have to export memory that is not part
of the dump target to user-space; in the above example, the reserved
area 0x9f800-0xa0000.
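In general, each PT_LOAD range is handled as up to three pieces, as
sketched below (an illustration of the splitting this patch performs,
not literal code):

	start              roundup(start)          rounddown(end)            end
	  |<---- head ---->|<-------- middle -------->|<---- tail ---->|
	   copied into a    left in old memory and     copied into a
	   2nd-kernel page  mapped directly            2nd-kernel page

The head piece exists only when start is not page aligned, and the tail
piece only when end is not page aligned.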
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 192 ++++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 172 insertions(+), 20 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index dd9769d..766e75f 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -472,11 +472,10 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
size_t elfsz,
struct list_head *vc_list)
{
- int i;
+ int i, rc;
Elf64_Ehdr *ehdr_ptr;
Elf64_Phdr *phdr_ptr;
loff_t vmcore_off;
- struct vmcore *new;
ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
@@ -486,20 +485,97 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
PAGE_SIZE);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 start, end, rest;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;
- /* Add this contiguous chunk of memory to vmcore list.*/
- new = get_new_element();
- if (!new)
- return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
- list_add_tail(&new->list, vc_list);
+ start = phdr_ptr->p_offset;
+ end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+ rest = phdr_ptr->p_memsz;
+
+ if (start & ~PAGE_MASK) {
+ u64 paddr, len;
+ char *buf;
+ struct vmcore *new;
+
+ paddr = start;
+ len = min(roundup(start,PAGE_SIZE), end) - start;
+
+ buf = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ rc = read_from_oldmem(buf + (start & ~PAGE_MASK), len,
+ &paddr, 0);
+ if (rc < 0) {
+ free_pages((unsigned long)buf, 0);
+ return rc;
+ }
+
+ new = get_new_element();
+ if (!new) {
+ free_pages((unsigned long)buf, 0);
+ return -ENOMEM;
+ }
+ new->flag |= MEM_TYPE_CURRENT_KERNEL;
+ new->size = PAGE_SIZE;
+ new->buf = buf;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
+
+ if (rest > 0 &&
+ roundup(start, PAGE_SIZE) < rounddown(end, PAGE_SIZE)) {
+ u64 paddr, len;
+ struct vmcore *new;
+
+ paddr = roundup(start, PAGE_SIZE);
+ len =rounddown(end,PAGE_SIZE)-roundup(start,PAGE_SIZE);
+
+ new = get_new_element();
+ if (!new)
+ return -ENOMEM;
+ new->paddr = paddr;
+ new->size = len;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
+
+ if (rest > 0) {
+ u64 paddr, len;
+ char *buf;
+ struct vmcore *new;
+
+ paddr = rounddown(end, PAGE_SIZE);
+ len = end - rounddown(end, PAGE_SIZE);
+
+ buf = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ rc = read_from_oldmem(buf, len, &paddr, 0);
+ if (rc < 0) {
+ free_pages((unsigned long)buf, 0);
+ return rc;
+ }
+
+ new = get_new_element();
+ if (!new) {
+ free_pages((unsigned long)buf, 0);
+ return -ENOMEM;
+ }
+ new->flag |= MEM_TYPE_CURRENT_KERNEL;
+ new->size = PAGE_SIZE;
+ new->buf = buf;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
/* Update the program header offset. */
phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ vmcore_off +=roundup(end,PAGE_SIZE)-rounddown(start,PAGE_SIZE);
}
return 0;
}
@@ -508,11 +584,10 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
size_t elfsz,
struct list_head *vc_list)
{
- int i;
+ int i, rc;
Elf32_Ehdr *ehdr_ptr;
Elf32_Phdr *phdr_ptr;
loff_t vmcore_off;
- struct vmcore *new;
ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff); /* PT_NOTE hdr */
@@ -522,20 +597,97 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
PAGE_SIZE);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 start, end, rest;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;
- /* Add this contiguous chunk of memory to vmcore list.*/
- new = get_new_element();
- if (!new)
- return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
- list_add_tail(&new->list, vc_list);
+ start = phdr_ptr->p_offset;
+ end = phdr_ptr->p_offset + phdr_ptr->p_memsz;
+ rest = phdr_ptr->p_memsz;
+
+ if (start & ~PAGE_MASK) {
+ u64 paddr, len;
+ char *buf;
+ struct vmcore *new;
+
+ paddr = start;
+ len = min(roundup(start,PAGE_SIZE), end) - start;
+
+ buf = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ rc = read_from_oldmem(buf + (start & ~PAGE_MASK), len,
+ &paddr, 0);
+ if (rc < 0) {
+ free_pages((unsigned long)buf, 0);
+ return rc;
+ }
+
+ new = get_new_element();
+ if (!new) {
+ free_pages((unsigned long)buf, 0);
+ return -ENOMEM;
+ }
+ new->flag |= MEM_TYPE_CURRENT_KERNEL;
+ new->size = PAGE_SIZE;
+ new->buf = buf;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
+
+ if (rest > 0 &&
+ roundup(start, PAGE_SIZE) < rounddown(end, PAGE_SIZE)) {
+ u64 paddr, len;
+ struct vmcore *new;
+
+ paddr = roundup(start, PAGE_SIZE);
+ len =rounddown(end,PAGE_SIZE)-roundup(start,PAGE_SIZE);
+
+ new = get_new_element();
+ if (!new)
+ return -ENOMEM;
+ new->paddr = paddr;
+ new->size = len;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
+
+ if (rest > 0) {
+ u64 paddr, len;
+ char *buf;
+ struct vmcore *new;
+
+ paddr = rounddown(end, PAGE_SIZE);
+ len = end - rounddown(end, PAGE_SIZE);
+
+ buf = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!buf)
+ return -ENOMEM;
+ rc = read_from_oldmem(buf, len, &paddr, 0);
+ if (rc < 0) {
+ free_pages((unsigned long)buf, 0);
+ return rc;
+ }
+
+ new = get_new_element();
+ if (!new) {
+ free_pages((unsigned long)buf, 0);
+ return -ENOMEM;
+ }
+ new->flag |= MEM_TYPE_CURRENT_KERNEL;
+ new->size = PAGE_SIZE;
+ new->buf = buf;
+ list_add_tail(&new->list, vc_list);
+
+ rest -= len;
+ }
/* Update the program header offset */
phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ vmcore_off +=roundup(end,PAGE_SIZE)-rounddown(start,PAGE_SIZE);
}
return 0;
}
If the MEM_TYPE_CURRENT_KERNEL flag is set, the object has been copied
into a buffer in the 2nd kernel, so the clean-up function needs to free
it.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 4 ++++
1 files changed, 4 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 766e75f..b85ba32 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -940,6 +940,10 @@ void vmcore_cleanup(void)
struct vmcore *m;
m = list_entry(pos, struct vmcore, list);
+
+ if (m->flag & MEM_TYPE_CURRENT_KERNEL)
+ free_pages((unsigned long)m->buf, get_order(m->size));
+
list_del(&m->list);
kfree(m);
}
Clean up read_vmcore(). The part handling objects in vmcore_list can be
written in the same style as the part handling the ELF headers. This
change removes duplicated and complicated code, making it clearer what
is done there.
Also, map_offset_to_paddr() is no longer used after this change. Remove
it.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 68 ++++++++++++++++--------------------------------------
1 files changed, 20 insertions(+), 48 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index b85ba32..7e21d64 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -118,27 +118,6 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}
-/* Maps vmcore file offset to respective physical address in memroy. */
-static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
- struct vmcore **m_ptr)
-{
- struct vmcore *m;
- u64 paddr;
-
- list_for_each_entry(m, vc_list, list) {
- u64 start, end;
- start = m->offset;
- end = m->offset + m->size - 1;
- if (offset >= start && offset <= end) {
- paddr = m->paddr + offset - start;
- *m_ptr = m;
- return paddr;
- }
- }
- *m_ptr = NULL;
- return 0;
-}
-
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -147,8 +126,8 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
{
ssize_t acc = 0, tmp;
size_t tsz;
- u64 start, nr_bytes;
- struct vmcore *curr_m = NULL;
+ u64 start;
+ struct vmcore *m;
if (buflen == 0 || *fpos >= vmcore_size)
return 0;
@@ -174,33 +153,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}
- start = map_offset_to_paddr(*fpos, &vmcore_list, &curr_m);
- if (!curr_m)
- return -EINVAL;
-
- while (buflen) {
- tsz = min_t(size_t, buflen, PAGE_SIZE - (start & ~PAGE_MASK));
-
- /* Calculate left bytes in current memory segment. */
- nr_bytes = (curr_m->size - (start - curr_m->paddr));
- if (tsz > nr_bytes)
- tsz = nr_bytes;
-
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
- if (tmp < 0)
- return tmp;
- buflen -= tsz;
- *fpos += tsz;
- buffer += tsz;
- acc += tsz;
- if (start >= (curr_m->paddr + curr_m->size)) {
- if (curr_m->list.next == &vmcore_list)
- return acc; /*EOF*/
- curr_m = list_entry(curr_m->list.next,
- struct vmcore, list);
- start = curr_m->paddr;
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (*fpos < m->offset + m->size) {
+ tsz = m->offset + m->size - *fpos;
+ if (buflen < tsz)
+ tsz = buflen;
+ start = m->paddr + *fpos - m->offset;
+ tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ if (tmp < 0)
+ return tmp;
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+
+ /* leave now if filled buffer already */
+ if (buflen == 0)
+ return acc;
}
}
+
return acc;
}
To satisfy mmap()'s page-size boundary requirement, allocate the
per-cpu crash_notes objects on page-size boundaries.
/proc/vmcore in the 2nd kernel checks whether each note object is
allocated on a page-size boundary. If some object does not satisfy the
page-size boundary requirement, /proc/vmcore doesn't provide the
mmap() interface.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
kernel/kexec.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/kernel/kexec.c b/kernel/kexec.c
index bddd3d7..d1f365e 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1234,7 +1234,8 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
static int __init crash_notes_memory_init(void)
{
/* Allocate memory for saving cpu registers. */
- crash_notes = alloc_percpu(note_buf_t);
+ crash_notes = __alloc_percpu(roundup(sizeof(note_buf_t), PAGE_SIZE),
+ PAGE_SIZE);
if (!crash_notes) {
printk("Kexec: Memory allocation for saving cpu register"
" states failed\n");
To satisfy mmap()'s page-size boundary requirement, apply the aligned
attribute to the vmcoreinfo_note object so that it is allocated on a
page-size boundary.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
include/linux/kexec.h | 6 ++++--
kernel/kexec.c | 2 +-
2 files changed, 5 insertions(+), 3 deletions(-)
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index d2e6927..5113570 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -185,8 +185,10 @@ extern struct kimage *kexec_crash_image;
#define VMCOREINFO_BYTES (4096)
#define VMCOREINFO_NOTE_NAME "VMCOREINFO"
#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
-#define VMCOREINFO_NOTE_SIZE (KEXEC_NOTE_HEAD_BYTES*2 + VMCOREINFO_BYTES \
- + VMCOREINFO_NOTE_NAME_BYTES)
+#define VMCOREINFO_NOTE_SIZE ALIGN(KEXEC_NOTE_HEAD_BYTES*2 \
+ +VMCOREINFO_BYTES \
+ +VMCOREINFO_NOTE_NAME_BYTES, \
+ PAGE_SIZE)
/* Location of a reserved region to hold the crash kernel.
*/
diff --git a/kernel/kexec.c b/kernel/kexec.c
index d1f365e..195de6d 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -43,7 +43,7 @@ note_buf_t __percpu *crash_notes;
/* vmcoreinfo stuff */
static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
-u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
+u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4] __aligned(PAGE_SIZE);
size_t vmcoreinfo_size;
size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
This patch introduces NT_VMCORE_DEBUGINFO as a unique note type under
the "VMCOREINFO" name, which has had no named type so far. The name
indicates that it is the note type in vmcoreinfo that contains the
system kernel's debug information.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
include/uapi/linux/elf.h | 4 ++++
kernel/kexec.c | 4 ++--
2 files changed, 6 insertions(+), 2 deletions(-)
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index 8072d35..b869904 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -398,6 +398,10 @@ typedef struct elf64_shdr {
#define NT_METAG_CBUF 0x500 /* Metag catch buffer registers */
#define NT_METAG_RPIPE 0x501 /* Metag read pipeline state */
+/*
+ * Notes exported from /proc/vmcore, belonging to "VMCOREINFO" name.
+ */
+#define NT_VMCORE_DEBUGINFO 0 /* vmcore system kernel's debuginfo */
/* Note header in a PT_NOTE section */
typedef struct elf32_note {
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 195de6d..6597b82 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -1438,8 +1438,8 @@ static void update_vmcoreinfo_note(void)
if (!vmcoreinfo_size)
return;
- buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, 0, vmcoreinfo_data,
- vmcoreinfo_size);
+ buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, NT_VMCORE_DEBUGINFO,
+ vmcoreinfo_data, vmcoreinfo_size);
final_note(buf);
}
The NT_VMCORE_PAD type is introduced to make both the crash_notes
buffer and the vmcoreinfo_note buffer satisfy mmap()'s page-size
boundary requirement by padding them with this note type.
The purpose of this type is only to align the buffer to a page-size
boundary; its contents have no meaning and are entirely zero-filled.
This note type belongs to the "VMCOREINFO" name space, and its value in
this name space is 7. The numbers 1 to 5 were not chosen because the
values 1 to 4 correspond to note types using the same numbers in the
"CORE" name space, and the crash utility and makedumpfile don't
distinguish note types by name space at all; as for 5, it has somehow
never been used since the v2.4.0 kernel even though NT_AUXV is defined
as 6, which suggests something avoids depending on 5. To be
conservative, 5 was simply not chosen here either.
With this change, gdb and binutils keep working without any change, but
makedumpfile and the crash utility need changes to distinguish the two
note types in the "VMCOREINFO" name space.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
include/uapi/linux/elf.h | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/include/uapi/linux/elf.h b/include/uapi/linux/elf.h
index b869904..9753e4c 100644
--- a/include/uapi/linux/elf.h
+++ b/include/uapi/linux/elf.h
@@ -402,6 +402,7 @@ typedef struct elf64_shdr {
* Notes exported from /proc/vmcore, belonging to "VMCOREINFO" name.
*/
#define NT_VMCORE_DEBUGINFO 0 /* vmcore system kernel's debuginfo */
+#define NT_VMCORE_PAD 7 /* vmcore padding of note segments */
/* Note header in a PT_NOTE section */
typedef struct elf32_note {
A modern kernel marks the end of the ELF note buffer with an
NT_VMCORE_PAD type note in order to make the buffer satisfy mmap()'s
page-size boundary requirement. This patch finishes reading a buffer
when the note currently being read has the NT_VMCORE_PAD type.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 24 ++++++++++++++++++++++++
1 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index b252d17..2a0f885 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -258,12 +258,24 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
}
nhdr_ptr = notes_section;
while (real_sz < max_sz) {
+ char *name;
+
+ /* Old kernel marks the end of ELF note buffer
+ * with empty header. */
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf64_Nhdr) +
((nhdr_ptr->n_namesz + 3) & ~3) +
((nhdr_ptr->n_descsz + 3) & ~3);
real_sz += sz;
+
+ /* Modern kernel marks the end of ELF note
+ * buffer with NT_VMCORE_PAD type note. */
+ name = (char *)(nhdr_ptr + 1);
+ if (strncmp(name, VMCOREINFO_NOTE_NAME,
+ sizeof(VMCOREINFO_NOTE_NAME)) == 0
+ && nhdr_ptr->n_type == NT_VMCORE_PAD)
+ break;
nhdr_ptr = (Elf64_Nhdr*)((char*)nhdr_ptr + sz);
}
@@ -367,12 +379,24 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
}
nhdr_ptr = notes_section;
while (real_sz < max_sz) {
+ char *name;
+
+ /* Old kernel marks the end of ELF note buffer
+ * with empty header. */
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf32_Nhdr) +
((nhdr_ptr->n_namesz + 3) & ~3) +
((nhdr_ptr->n_descsz + 3) & ~3);
real_sz += sz;
+
+ /* Modern kernel marks the end of ELF note
+ * buffer with NT_VMCORE_PAD type note. */
+ name = (char *)(nhdr_ptr + 1);
+ if (strncmp(name, VMCOREINFO_NOTE_NAME,
+ sizeof(VMCOREINFO_NOTE_NAME)) == 0
+ && nhdr_ptr->n_type == NT_VMCORE_PAD)
+ break;
nhdr_ptr = (Elf32_Nhdr*)((char*)nhdr_ptr + sz);
}
If some vmcore object doesn't satisfy the page-size boundary
requirement, remap_pfn_range() fails to remap it to user-space. The
only objects that can possibly violate the requirement are the ELF note
segments. The memory chunks corresponding to PT_LOAD entries are
guaranteed to satisfy the page-size boundary requirement by the copy
from old memory into a buffer in the 2nd kernel, done in a later patch.
This patch doesn't copy each note segment into the 2nd kernel since,
with many CPUs, the segments add up to a large total. For example, the
current maximum number of CPUs on x86_64 is 5120, where the note
segments exceed 1MB with NT_PRSTATUS only.
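As a rough sanity check of that figure (assuming an NT_PRSTATUS
descriptor of about 336 bytes on x86_64 plus roughly 20 bytes of note
header and "CORE" name per CPU):

	5120 CPUs * ~356 bytes/note ~= 1.8 MB

which is indeed well over 1MB, and it only grows if more note types are
added per CPU.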
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 22 ++++++++++++++++++++++
1 files changed, 22 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 2a0f885..0077a9a 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -38,6 +38,8 @@ static u64 vmcore_size;
static struct proc_dir_entry *proc_vmcore = NULL;
+static bool support_mmap_vmcore;
+
/*
* Returns > 0 for RAM pages, 0 for non-RAM pages, < 0 on error
* The called function has to take care of module refcounting.
@@ -911,6 +913,7 @@ static int __init parse_crash_elf_headers(void)
static int __init vmcore_init(void)
{
int rc = 0;
+ struct vmcore *m;
/* If elfcorehdr= has been passed in cmdline, then capture the dump.*/
if (!(is_vmcore_usable()))
@@ -921,6 +924,25 @@ static int __init vmcore_init(void)
return rc;
}
+ /* If some object doesn't satisfy PAGE_SIZE boundary
+ * requirement, mmap_vmcore() is not exported to
+ * user-space. */
+ support_mmap_vmcore = true;
+ list_for_each_entry(m, &vmcore_list, list) {
+ u64 paddr;
+
+ if (m->flag & MEM_TYPE_CURRENT_KERNEL)
+ paddr = (u64)__pa(m->buf);
+ else
+ paddr = m->paddr;
+
+ if ((m->offset & ~PAGE_MASK) || (paddr & ~PAGE_MASK)
+ || (m->size & ~PAGE_MASK)) {
+ support_mmap_vmcore = false;
+ break;
+ }
+ }
+
proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
if (proc_vmcore)
proc_vmcore->size = vmcore_size;
To satisfy mmap()'s page-size boundary requirement, round up the offset
of each vmcore object to a page-size boundary; each offset is connected
to a user-space virtual address through the mmap() mapping.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 18 ++++++++----------
1 files changed, 8 insertions(+), 10 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 0077a9a..f15c881 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -698,7 +698,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
}
/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf64(char *elfptr,
+static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -708,17 +708,16 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
ehdr_ptr = (Elf64_Ehdr *)elfptr;
/* Skip Elf header and program headers. */
- vmcore_off = ehdr_ptr->e_phoff +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr);
+ vmcore_off = elfsz;
list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
- vmcore_off += m->size;
+ vmcore_off += roundup(m->size, PAGE_SIZE);
}
}
/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf32(char *elfptr,
+static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -728,12 +727,11 @@ static void __init set_vmcore_list_offsets_elf32(char *elfptr,
ehdr_ptr = (Elf32_Ehdr *)elfptr;
/* Skip Elf header and program headers. */
- vmcore_off = ehdr_ptr->e_phoff +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr);
+ vmcore_off = elfsz;
list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
- vmcore_off += m->size;
+ vmcore_off += roundup(m->size, PAGE_SIZE);
}
}
@@ -801,7 +799,7 @@ static int __init parse_crash_elf64_headers(void)
get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf64(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}
@@ -869,7 +867,7 @@ static int __init parse_crash_elf32_headers(void)
get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf32(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}
The previous patch changed the offset of each vmcore object by a
round-up operation. The vmcore size must account for the resulting
holes.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 16 ++++++++--------
1 files changed, 8 insertions(+), 8 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index f15c881..dd1d601 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -195,7 +195,7 @@ static struct vmcore* __init get_new_element(void)
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
}
-static u64 __init get_vmcore_size_elf64(char *elfptr)
+static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -204,15 +204,15 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
- size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
+ size += roundup(phdr_ptr->p_memsz, PAGE_SIZE);
phdr_ptr++;
}
return size;
}
-static u64 __init get_vmcore_size_elf32(char *elfptr)
+static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -221,9 +221,9 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff);
- size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
+ size += roundup(phdr_ptr->p_memsz, PAGE_SIZE);
phdr_ptr++;
}
return size;
@@ -892,14 +892,14 @@ static int __init parse_crash_elf_headers(void)
return rc;
/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf64(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
} else if (e_ident[EI_CLASS] == ELFCLASS32) {
rc = parse_crash_elf32_headers();
if (rc)
return rc;
/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf32(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
} else {
pr_warn("Warning: Core image elf header is not sane\n");
return -EINVAL;
This patch introduces mmap_vmcore().
If the MEM_TYPE_CURRENT_KERNEL flag is set, the buffer in the 2nd
kernel is remapped; if it is not set, the corresponding area in old
memory is remapped.
Neither writable nor executable mappings are permitted, even via
mprotect(). A non-writable mapping is also required by
remap_pfn_range() when mapping linear pages onto non-consecutive
physical pages; see is_cow_mapping().
On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is a 32-bit unsigned long on x86-32.
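The 16TB figure follows directly from the pfn width: with 4KiB pages, a
32-bit pfn can address at most

	2^32 pfns * 4 KiB/page = 2^44 bytes = 16 TiB

of physical memory.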
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 72 insertions(+), 0 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index dd1d601..bc4848c 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -185,9 +185,81 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}
+static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+{
+ size_t size = vma->vm_end - vma->vm_start;
+ u64 start, end, len, tsz;
+ struct vmcore *m;
+
+ if (!support_mmap_vmcore)
+ return -ENODEV;
+
+ start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+ end = start + size;
+
+ if (size > vmcore_size || end > vmcore_size)
+ return -EINVAL;
+
+ if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+ return -EPERM;
+
+ vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+ len = 0;
+
+ if (start < elfcorebuf_sz) {
+ u64 pfn;
+
+ tsz = elfcorebuf_sz - start;
+ if (size < tsz)
+ tsz = size;
+ pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
+ vma->vm_page_prot))
+ return -EAGAIN;
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (start < m->offset + m->size) {
+ u64 pfn = 0;
+
+ tsz = m->offset + m->size - start;
+ if (size < tsz)
+ tsz = size;
+ if (m->flag & MEM_TYPE_CURRENT_KERNEL) {
+ pfn = __pa(m->buf + start - m->offset)
+ >> PAGE_SHIFT;
+ } else {
+ pfn = (m->paddr + (start - m->offset))
+ >> PAGE_SHIFT;
+ }
+ if (remap_pfn_range(vma, vma->vm_start + len, pfn, tsz,
+ vma->vm_page_prot)) {
+ do_munmap(vma->vm_mm, vma->vm_start, len);
+ return -EAGAIN;
+ }
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+ }
+
+ return 0;
+}
+
static const struct file_operations proc_vmcore_operations = {
.read = read_vmcore,
.llseek = default_llseek,
+ .mmap = mmap_vmcore,
};
static struct vmcore* __init get_new_element(void)
Fill both the crash_notes and vmcoreinfo_note buffers with an
NT_VMCORE_PAD type note to make them satisfy mmap()'s page-size
boundary requirement.
So far, the end of the note segments has been marked by a zero-filled
ELF note header. Instead, this patch writes an NT_VMCORE_PAD note at
the end of each note segment, padding it out to the next page-size
boundary.
Also, an old kernel can handle the ELF segments created without the
null header because it stops reading an ELF segment once the size it
has read reaches p_memsz.
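For illustration, a per-cpu crash_notes page then looks roughly like
this (sizes not to scale):

	+-- PAGE_SIZE-aligned crash_notes buffer -----------------------------+
	| "CORE" NT_PRSTATUS note | "VMCOREINFO" NT_VMCORE_PAD note           |
	| (register state)        | (zero-filled descriptor sized so that the |
	|                         |  note ends exactly on the page boundary)  |
	+----------------------------------------------------------------------+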
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
arch/s390/include/asm/kexec.h | 8 ++++---
include/linux/kexec.h | 12 ++++++-----
kernel/kexec.c | 46 ++++++++++++++++++++++++++---------------
3 files changed, 41 insertions(+), 25 deletions(-)
diff --git a/arch/s390/include/asm/kexec.h b/arch/s390/include/asm/kexec.h
index 694bcd6..f33ec08 100644
--- a/arch/s390/include/asm/kexec.h
+++ b/arch/s390/include/asm/kexec.h
@@ -41,8 +41,8 @@
/*
* Size for s390x ELF notes per CPU
*
- * Seven notes plus zero note at the end: prstatus, fpregset, timer,
- * tod_cmp, tod_reg, control regs, and prefix
+ * Seven notes plus note with NT_VMCORE_PAD type at the end: prstatus,
+ * fpregset, timer, tod_cmp, tod_reg, control regs, and prefix
*/
#define KEXEC_NOTE_BYTES \
(ALIGN(sizeof(struct elf_note), 4) * 8 + \
@@ -53,7 +53,9 @@
ALIGN(sizeof(u64), 4) + \
ALIGN(sizeof(u32), 4) + \
ALIGN(sizeof(u64) * 16, 4) + \
- ALIGN(sizeof(u32), 4) \
+ ALIGN(sizeof(u32), 4) + \
+ KEXEC_CORE_NOTE_DESC_BYTES + \
+ VMCOREINFO_NOTE_NAME_BYTES \
)
/* Provide a dummy definition to avoid build failures. */
diff --git a/include/linux/kexec.h b/include/linux/kexec.h
index 5113570..6592935 100644
--- a/include/linux/kexec.h
+++ b/include/linux/kexec.h
@@ -47,14 +47,16 @@
#define KEXEC_CORE_NOTE_NAME_BYTES ALIGN(sizeof(KEXEC_CORE_NOTE_NAME), 4)
#define KEXEC_CORE_NOTE_DESC_BYTES ALIGN(sizeof(struct elf_prstatus), 4)
/*
- * The per-cpu notes area is a list of notes terminated by a "NULL"
- * note header. For kdump, the code in vmcore.c runs in the context
- * of the second kernel to combine them into one note.
+ * The per-cpu notes area is a list of notes terminated by a note
+ * header with NT_VMCORE_PAD type. For kdump, the code in vmcore.c
+ * runs in the context of the second kernel to combine them into one
+ * note.
*/
#ifndef KEXEC_NOTE_BYTES
#define KEXEC_NOTE_BYTES ( (KEXEC_NOTE_HEAD_BYTES * 2) + \
KEXEC_CORE_NOTE_NAME_BYTES + \
- KEXEC_CORE_NOTE_DESC_BYTES )
+ KEXEC_CORE_NOTE_DESC_BYTES + \
+ VMCOREINFO_NOTE_NAME_BYTES)
#endif
/*
@@ -187,7 +189,7 @@ extern struct kimage *kexec_crash_image;
#define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
#define VMCOREINFO_NOTE_SIZE ALIGN(KEXEC_NOTE_HEAD_BYTES*2 \
+VMCOREINFO_BYTES \
- +VMCOREINFO_NOTE_NAME_BYTES, \
+ +VMCOREINFO_NOTE_NAME_BYTES*2, \
PAGE_SIZE)
/* Location of a reserved region to hold the crash kernel.
diff --git a/kernel/kexec.c b/kernel/kexec.c
index 6597b82..fbdc0f0 100644
--- a/kernel/kexec.c
+++ b/kernel/kexec.c
@@ -40,6 +40,7 @@
/* Per cpu memory for storing cpu states in case of system crash. */
note_buf_t __percpu *crash_notes;
+static size_t crash_notes_size = ALIGN(sizeof(note_buf_t), PAGE_SIZE);
/* vmcoreinfo stuff */
static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
@@ -1177,6 +1178,7 @@ unlock:
return ret;
}
+/* If @data is NULL, fill @buf with 0 in @data_len bytes. */
static u32 *append_elf_note(u32 *buf, char *name, unsigned type, void *data,
size_t data_len)
{
@@ -1189,26 +1191,36 @@ static u32 *append_elf_note(u32 *buf, char *name, unsigned type, void *data,
buf += (sizeof(note) + 3)/4;
memcpy(buf, name, note.n_namesz);
buf += (note.n_namesz + 3)/4;
- memcpy(buf, data, note.n_descsz);
+ if (data)
+ memcpy(buf, data, note.n_descsz);
+ else
+ memset(buf, 0, note.n_descsz);
buf += (note.n_descsz + 3)/4;
return buf;
}
-static void final_note(u32 *buf)
+static void final_note(u32 *buf, size_t buf_len, size_t data_len)
{
- struct elf_note note;
+ size_t used_bytes, pad_hdr_size;
- note.n_namesz = 0;
- note.n_descsz = 0;
- note.n_type = 0;
- memcpy(buf, &note, sizeof(note));
+ pad_hdr_size = KEXEC_NOTE_HEAD_BYTES + VMCOREINFO_NOTE_NAME_BYTES;
+
+ /*
+ * keep space for ELF note header and "VMCOREINFO" name to
+ * terminate ELF segment by NT_VMCORE_PAD note.
+ */
+ BUG_ON(data_len + pad_hdr_size > buf_len);
+
+ used_bytes = data_len + pad_hdr_size;
+ append_elf_note(buf, VMCOREINFO_NOTE_NAME, NT_VMCORE_PAD, NULL,
+ roundup(used_bytes, PAGE_SIZE) - used_bytes);
}
void crash_save_cpu(struct pt_regs *regs, int cpu)
{
struct elf_prstatus prstatus;
- u32 *buf;
+ u32 *buf, *buf_end;
if ((cpu < 0) || (cpu >= nr_cpu_ids))
return;
@@ -1226,16 +1238,15 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
memset(&prstatus, 0, sizeof(prstatus));
prstatus.pr_pid = current->pid;
elf_core_copy_kernel_regs(&prstatus.pr_reg, regs);
- buf = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
- &prstatus, sizeof(prstatus));
- final_note(buf);
+ buf_end = append_elf_note(buf, KEXEC_CORE_NOTE_NAME, NT_PRSTATUS,
+ &prstatus, sizeof(prstatus));
+ final_note(buf_end, crash_notes_size, (buf_end - buf) * sizeof(u32));
}
static int __init crash_notes_memory_init(void)
{
/* Allocate memory for saving cpu registers. */
- crash_notes = __alloc_percpu(roundup(sizeof(note_buf_t), PAGE_SIZE),
- PAGE_SIZE);
+ crash_notes = __alloc_percpu(crash_notes_size, PAGE_SIZE);
if (!crash_notes) {
printk("Kexec: Memory allocation for saving cpu register"
" states failed\n");
@@ -1434,13 +1445,14 @@ int __init parse_crashkernel_low(char *cmdline,
static void update_vmcoreinfo_note(void)
{
- u32 *buf = vmcoreinfo_note;
+ u32 *buf = vmcoreinfo_note, *buf_end;
if (!vmcoreinfo_size)
return;
- buf = append_elf_note(buf, VMCOREINFO_NOTE_NAME, NT_VMCORE_DEBUGINFO,
- vmcoreinfo_data, vmcoreinfo_size);
- final_note(buf);
+ buf_end = append_elf_note(buf, VMCOREINFO_NOTE_NAME, NT_VMCORE_DEBUGINFO,
+ vmcoreinfo_data, vmcoreinfo_size);
+ final_note(buf_end, sizeof(vmcoreinfo_note),
+ (buf_end - buf) * sizeof(u32));
}
void crash_save_vmcoreinfo(void)
If the MEM_TYPE_CURRENT_KERNEL flag is set, the object has been copied
into a buffer in the 2nd kernel, and read_vmcore() reads that buffer.
If the flag is not set, read_vmcore() reads old memory as usual.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 15 +++++++++++----
1 files changed, 11 insertions(+), 4 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7e21d64..b252d17 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -158,10 +158,17 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
tsz = m->offset + m->size - *fpos;
if (buflen < tsz)
tsz = buflen;
- start = m->paddr + *fpos - m->offset;
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
- if (tmp < 0)
- return tmp;
+ if (m->flag & MEM_TYPE_CURRENT_KERNEL) {
+ if (copy_to_user(buffer,
+ m->buf + *fpos - m->offset,
+ tsz))
+ return -EFAULT;
+ } else {
+ start = m->paddr + *fpos - m->offset;
+ tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ if (tmp < 0)
+ return tmp;
+ }
buflen -= tsz;
*fpos += tsz;
buffer += tsz;
Allocate the buffer for the ELF headers on a page-size aligned boundary
to satisfy the mmap() requirement. For this, __get_free_pages() is used
instead of kmalloc().
Also, a later patch will decrease the buffer size actually used for the
ELF headers, so the original buffer size and the actually used size
must be kept separately: elfcorebuf_sz_orig holds the original size and
elfcorebuf_sz the actually used one.
Reviewed-by: Zhang Yanfei <[email protected]>
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 30 +++++++++++++++++++++---------
1 files changed, 21 insertions(+), 9 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7d2dc4c..17e2501 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -31,6 +31,7 @@ static LIST_HEAD(vmcore_list);
/* Stores the pointer to the buffer containing kernel elf core headers. */
static char *elfcorebuf;
static size_t elfcorebuf_sz;
+static size_t elfcorebuf_sz_orig;
/* Total size of vmcore file. */
static u64 vmcore_size;
@@ -608,13 +609,16 @@ static int __init parse_crash_elf64_headers(void)
/* Read in all elf headers. */
elfcorebuf_sz = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = elfcorebuf_sz;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
rc = read_from_oldmem(elfcorebuf, sizeof(Elf64_Ehdr), &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
addr = elfcorehdr_addr + ehdr.e_phoff;
@@ -629,13 +633,15 @@ static int __init parse_crash_elf64_headers(void)
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
set_vmcore_list_offsets_elf64(elfcorebuf, &vmcore_list);
@@ -671,7 +677,9 @@ static int __init parse_crash_elf32_headers(void)
/* Read in all elf headers. */
elfcorebuf_sz = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = elfcorebuf_sz;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
@@ -684,7 +692,8 @@ static int __init parse_crash_elf32_headers(void)
rc = read_from_oldmem(elfcorebuf + sizeof(Elf32_Ehdr),
ehdr.e_phnum * sizeof(Elf32_Phdr), &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
((Elf32_Ehdr *)elfcorebuf)->e_phoff = sizeof(Elf32_Ehdr);
@@ -692,13 +701,15 @@ static int __init parse_crash_elf32_headers(void)
/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
set_vmcore_list_offsets_elf32(elfcorebuf, &vmcore_list);
@@ -780,7 +791,8 @@ void vmcore_cleanup(void)
list_del(&m->list);
kfree(m);
}
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
elfcorebuf = NULL;
}
EXPORT_SYMBOL_GPL(vmcore_cleanup);
The current code assumes that all PT_NOTE headers are placed at the
beginning of the program header table and that they are consecutive.
But that assumption could be broken by future changes in either
kexec-tools or the 1st kernel. This patch removes the assumption and
rearranges the program headers so that the following conditions are
satisfied:
- the PT_NOTE entry is unique and is the first entry,
- the relative order of the program headers is unchanged by the
  rearrangement; their positions only move in the positive direction,
- the unused part that appears at the bottom of the program header
  table is filled with 0.
Also, this patch adds one exceptional case: if the number of PT_NOTE
entries is somehow 0, the function returns immediately.
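For illustration, the rearrangement transforms a header table like the
following made-up example (not taken from a real dump):

	before:  LOAD0  NOTE0  LOAD1  NOTE1  LOAD2
	after:   NOTE*  LOAD0  LOAD1  LOAD2  (zero-filled slot)

where NOTE* is the single merged PT_NOTE entry, the relative order of
the LOAD entries is preserved, and e_phnum is reduced accordingly.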
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 92 +++++++++++++++++++++++++++++++++++++++++++-----------
1 files changed, 74 insertions(+), 18 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 94743d2..7d2dc4c 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -251,8 +251,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
struct list_head *vc_list)
{
- int i, nr_ptnote=0, rc=0;
- char *tmp;
+ int i, j, nr_ptnote=0, i_ptnote, rc=0;
Elf64_Ehdr *ehdr_ptr;
Elf64_Phdr phdr, *phdr_ptr;
Elf64_Nhdr *nhdr_ptr;
@@ -301,6 +300,39 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
kfree(notes_section);
}
+ if (nr_ptnote == 0)
+ goto out;
+
+ phdr_ptr = (Elf64_Phdr *)(elfptr + ehdr_ptr->e_phoff);
+
+ /* Remove unwanted PT_NOTE program headers. */
+
+ /* - 1st pass shifts non-PT_NOTE entries until the first
+ PT_NOTE entry. */
+ i_ptnote = -1;
+ for (i = 0; i < ehdr_ptr->e_phnum; ++i) {
+ if (phdr_ptr[i].p_type == PT_NOTE) {
+ i_ptnote = i;
+ break;
+ }
+ }
+ BUG_ON(i_ptnote == -1); /* impossible case since nr_ptnote > 0. */
+ memmove(phdr_ptr + 1, phdr_ptr, i_ptnote * sizeof(Elf64_Phdr));
+
+ /* - 2nd pass moves the remaining non-PT_NOTE entries under
+ the first PT_NOTE entry. */
+ for (i = j = i_ptnote + 1; i < ehdr_ptr->e_phnum; i++) {
+ if (phdr_ptr[i].p_type != PT_NOTE) {
+ memmove(phdr_ptr + j, phdr_ptr + i,
+ sizeof(Elf64_Phdr));
+ j++;
+ }
+ }
+
+ /* - Finally, fill unused part with 0. */
+ memset(phdr_ptr + ehdr_ptr->e_phnum - (nr_ptnote - 1), 0,
+ (nr_ptnote - 1) * sizeof(Elf64_Phdr));
+
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
@@ -312,18 +344,14 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
phdr.p_align = 0;
/* Add merged PT_NOTE program header*/
- tmp = elfptr + ehdr_ptr->e_phoff;
- memcpy(tmp, &phdr, sizeof(phdr));
- tmp += sizeof(phdr);
+ memcpy(phdr_ptr, &phdr, sizeof(Elf64_Phdr));
- /* Remove unwanted PT_NOTE program headers. */
- i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
- *elfsz = *elfsz - i;
- memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf64_Phdr)));
+ *elfsz = *elfsz - (nr_ptnote - 1) * sizeof(Elf64_Phdr);
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
+out:
return 0;
}
@@ -331,8 +359,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
struct list_head *vc_list)
{
- int i, nr_ptnote=0, rc=0;
- char *tmp;
+ int i, j, nr_ptnote=0, i_ptnote, rc=0;
Elf32_Ehdr *ehdr_ptr;
Elf32_Phdr phdr, *phdr_ptr;
Elf32_Nhdr *nhdr_ptr;
@@ -381,6 +408,39 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
kfree(notes_section);
}
+ if (nr_ptnote == 0)
+ goto out;
+
+ phdr_ptr = (Elf32_Phdr *)(elfptr + ehdr_ptr->e_phoff);
+
+ /* Remove unwanted PT_NOTE program headers. */
+
+ /* - 1st pass shifts non-PT_NOTE entries until the first
+ PT_NOTE entry. */
+ i_ptnote = -1;
+ for (i = 0; i < ehdr_ptr->e_phnum; ++i) {
+ if (phdr_ptr[i].p_type == PT_NOTE) {
+ i_ptnote = i;
+ break;
+ }
+ }
+ BUG_ON(i_ptnote == -1); /* impossible case since nr_ptnote > 0. */
+ memmove(phdr_ptr + 1, phdr_ptr, i_ptnote * sizeof(Elf32_Phdr));
+
+ /* - 2nd pass moves the remaining non-PT_NOTE entries under
+ the first PT_NOTE entry. */
+ for (i = j = i_ptnote + 1; i < ehdr_ptr->e_phnum; i++) {
+ if (phdr_ptr[i].p_type != PT_NOTE) {
+ memmove(phdr_ptr + j, phdr_ptr + i,
+ sizeof(Elf32_Phdr));
+ j++;
+ }
+ }
+
+ /* - Finally, fill unused part with 0. */
+ memset(phdr_ptr + ehdr_ptr->e_phnum - (nr_ptnote - 1), 0,
+ (nr_ptnote - 1) * sizeof(Elf32_Phdr));
+
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
@@ -392,18 +452,14 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
phdr.p_align = 0;
/* Add merged PT_NOTE program header*/
- tmp = elfptr + ehdr_ptr->e_phoff;
- memcpy(tmp, &phdr, sizeof(phdr));
- tmp += sizeof(phdr);
+ memcpy(phdr_ptr, &phdr, sizeof(Elf32_Phdr));
- /* Remove unwanted PT_NOTE program headers. */
- i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
- *elfsz = *elfsz - i;
- memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf32_Phdr)));
+ *elfsz = *elfsz - (nr_ptnote - 1) * sizeof(Elf32_Phdr);
/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
+out:
return 0;
}
Currently, vmcoreinfo exports the size of the data part only, but
kexec-tools sets that value in the p_memsz member as the size of the
whole ELF note segment. With the current ELF note segment size this is
not a problem, but if it grows in the future, a read may no longer
reach the ELF note header located at the larger p_memsz position,
failing to read the whole ELF note segment.
Note: kexec-tools assigns PAGE_SIZE to p_memsz for other ELF note
types. For the same reason, the same issue occurs if the actual ELF
note data exceeds (PAGE_SIZE - 2 * KEXEC_NOTE_HEAD_BYTES).
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
kernel/ksysfs.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/kernel/ksysfs.c b/kernel/ksysfs.c
index 6ada93c..97d2763 100644
--- a/kernel/ksysfs.c
+++ b/kernel/ksysfs.c
@@ -126,7 +126,7 @@ static ssize_t vmcoreinfo_show(struct kobject *kobj,
{
return sprintf(buf, "%lx %x\n",
paddr_vmcoreinfo_note(),
- (unsigned int)vmcoreinfo_max_size);
+ (unsigned int)sizeof(vmcoreinfo_note));
}
KERNEL_ATTR_RO(vmcoreinfo);
The variable j has int type but it is compared with a u64 value.
Also, the variable j serves exactly the same purpose as the variable
real_sz does now. Replace the variable j by real_sz and remove j.
Signed-off-by: HATAYAMA Daisuke <[email protected]>
---
fs/proc/vmcore.c | 6 ++----
1 files changed, 2 insertions(+), 4 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 163281e..94743d2 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -261,7 +261,6 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
- int j;
void *notes_section;
struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
@@ -279,7 +278,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
return rc;
}
nhdr_ptr = notes_section;
- for (j = 0; j < max_sz; j += sz) {
+ while (real_sz < max_sz) {
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf64_Nhdr) +
@@ -342,7 +341,6 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff);
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
- int j;
void *notes_section;
struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
@@ -360,7 +358,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
return rc;
}
nhdr_ptr = notes_section;
- for (j = 0; j < max_sz; j += sz) {
+ while (real_sz < max_sz) {
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf32_Nhdr) +
On Sat, 16 Mar 2013 13:00:47 +0900 HATAYAMA Daisuke <[email protected]> wrote:
> Currently, read to /proc/vmcore is done by read_oldmem() that uses
> ioremap/iounmap per a single page. For example, if memory is 1GB,
> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
> times. This causes big performance degradation.
>
> In particular, the current main user of this mmap() is makedumpfile,
> which not only reads memory from /proc/vmcore but also does other
> processing like filtering, compression and IO work. Update of page
> table and the following TLB flush makes such processing much slow;
> though I have yet to make patch for makedumpfile and yet to confirm
> how it's improved.
>
> To address the issue, this patch implements mmap() on /proc/vmcore to
> improve read performance. My simple benchmark shows the improvement
> from 200 [MiB/sec] to over 50.0 [GiB/sec].
There are quite a lot of userspace-visible vmcore changes here. Is it
all fully back-compatible? Will all known userspace continue to work
OK on newer kernels?
On Sat, 16 Mar 2013 13:01:26 +0900 HATAYAMA Daisuke <[email protected]> wrote:
> The part of dump target memory is copied into the 2nd kernel if it
> doesn't satisfy mmap()'s page-size boundary requirement. To
> distinguish such copied object from usual old memory, a flag
> MEM_TYPE_CURRENT_KERNEL is introduced. If this flag is set, the object
> is considered being copied into buffer on the 2nd kernel.
>
I don't understand this description at all :( Perhaps we can have a
lengthier version which expands on the concepts a bit further?
> --- a/include/linux/proc_fs.h
> +++ b/include/linux/proc_fs.h
> @@ -97,11 +97,17 @@ struct kcore_list {
> int type;
> };
>
> +#define MEM_TYPE_CURRENT_KERNEL 0x1
A comment describing this would be useful.
> struct vmcore {
> struct list_head list;
> - unsigned long long paddr;
> + union {
> + unsigned long long paddr;
> + char *buf;
> + };
This change wasn't described in the changelog?
> unsigned long long size;
> loff_t offset;
> + unsigned int flag;
Presumably this is the place where we put MEM_TYPE_CURRENT_KERNEL?
That's unobvious from reading the code. Add the text "vmcore.flag
fields:" above the MEM_TYPE_CURRENT_KERNEL definition.
> };
>
> #ifdef CONFIG_PROC_FS
On Sat, 16 Mar 2013 13:01:32 +0900 HATAYAMA Daisuke <[email protected]> wrote:
> Due to mmap() requirement, we need to copy pages not starting or
> ending with page-size aligned address in 2nd kernel and to map them to
> user-space.
>
> For example, see the map below:
>
> 00000000-00010000 : reserved
> 00010000-0009f800 : System RAM
> 0009f800-000a0000 : reserved
>
> where the System RAM ends with 0x9f800 that is not page-size
> aligned. This map is divided into two parts:
>
> 00010000-0009f000
> 0009f000-0009f800
>
> and the first one is kept in old memory and the 2nd one is copied into
> buffer on 2nd kernel.
>
> This kind of non-page-size-aligned area can always occur since any
> part of System RAM can be converted into reserved area at runtime.
>
> If not doing copying like this and if remapping non page-size aligned
> pages on old memory directly, mmap() had to export memory which is not
> dump target to user-space. In the above example this is reserved
> 0x9f800-0xa0000.
Ah. This is all the missing info from [patch 07/21]. I suggest we
join 7 and 8 into a single patch, use the above changelog, add the
MEM_TYPE_CURRENT_KERNEL code comments.
On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke <[email protected]> wrote:
> If there's some vmcore object that doesn't satisfy page-size boundary
> requirement, remap_pfn_range() fails to remap it to user-space.
>
> Objects that posisbly don't satisfy the requirement are ELF note
> segments only. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy page-size boundary requirement by the copy from
> old memory to buffer in 2nd kernel done in later patch.
>
> This patch doesn't copy each note segment into the 2nd kernel since
> they amount to so large in total if there are multiple CPUs. For
> example, current maximum number of CPUs in x86_64 is 5120, where note
> segments exceed 1MB with NT_PRSTATUS only.
I don't really understand this. Why does the number of or size of
note segments affect their alignment?
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>
> static struct proc_dir_entry *proc_vmcore = NULL;
>
> +static bool support_mmap_vmcore;
This is quite regrettable. It means that on some kernels/machines,
mmap(vmcore) simply won't work. This means that people might write
code which works for them, but which will fail for others when deployed
on a small number of machines.
Can we avoid this? Why can't we just copy the notes even if there are
a large number of them?
HATAYAMA Daisuke <[email protected]> writes:
> Due to mmap() requirement, we need to copy pages not starting or
> ending with page-size aligned address in 2nd kernel and to map them to
> user-space.
>
> For example, see the map below:
>
> 00000000-00010000 : reserved
> 00010000-0009f800 : System RAM
> 0009f800-000a0000 : reserved
>
> where the System RAM ends with 0x9f800 that is not page-size
> aligned. This map is divided into two parts:
>
> 00010000-0009f000
> 0009f000-0009f800
>
> and the first one is kept in old memory and the 2nd one is copied into
> buffer on 2nd kernel.
>
> This kind of non-page-size-aligned area can always occur since any
> part of System RAM can be converted into reserved area at runtime.
>
> If not doing copying like this and if remapping non page-size aligned
> pages on old memory directly, mmap() had to export memory which is not
> dump target to user-space. In the above example this is reserved
> 0x9f800-0xa0000.
So I have a question. Would it not be easier to only support mmaping on
the things that are easily mmapable? And in the oddball corner cases
require reading the data instead.
My gut feel says that would reduce the net complexity of the system a
bit, and would also reduce the amount of memory that the kernel has to
allocate.
Alternatively, and probably better given the nature of what is happening
here, it seems entirely reasonable to round up the boundaries of the
supported mappings to the nearest page. A little leak of data from a
reserved page should be harmless.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> To satisfy mmap()'s page-size boundary requirement, allocate per-cpu
> crash_notes objects on page-size boundary.
>
> /proc/vmcore on the 2nd kernel checks if each note objects is
> allocated on page-size boundary. If there's some object not satisfying
> the page-size boundary requirement, /proc/vmcore doesn't provide
> mmap() interface.
Does this actually help? My memory is that /proc/vmcore did some magic
behind the scenes to combine these multiple note sections into a single
note section.
Certainly someone has to combine them together to make a valid elf
executable.
At the same time I don't see any harm in rounding up to a page size
here, but I don't see the point either.
Eric
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> kernel/kexec.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index bddd3d7..d1f365e 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1234,7 +1234,8 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
> static int __init crash_notes_memory_init(void)
> {
> /* Allocate memory for saving cpu registers. */
> - crash_notes = alloc_percpu(note_buf_t);
> + crash_notes = __alloc_percpu(roundup(sizeof(note_buf_t), PAGE_SIZE),
> + PAGE_SIZE);
> if (!crash_notes) {
> printk("Kexec: Memory allocation for saving cpu register"
> " states failed\n");
HATAYAMA Daisuke <[email protected]> writes:
> To satisfy mmap()'s page-size boundary requirement, specify aligned
> attribute to vmcoreinfo_note objects to allocate it on page-size
> boundary.
Again I am not seeing the point.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> Modern kernel marks the end of ELF note buffer with NT_VMCORE_PAD type
> note in order to make the buffer satisfy mmap()'s page-size boundary
> requirement. This patch makes finishing reading each buffer if the
> note type now being read is NT_VMCORE_PAD type.
Ick. Even with a pad header you can mark the end with an empty header,
and my memory may be deceiving me but I believe an empty header is
specified by the ELF ABI docs.
Beyond which I don't quite see the point of any of this as all of these
headers need to be combined into a single note section before being
presented to user space.
Eric
Andrew Morton <[email protected]> writes:
> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke <[email protected]> wrote:
>
>> If there's some vmcore object that doesn't satisfy page-size boundary
>> requirement, remap_pfn_range() fails to remap it to user-space.
>>
>> Objects that posisbly don't satisfy the requirement are ELF note
>> segments only. The memory chunks corresponding to PT_LOAD entries are
>> guaranteed to satisfy page-size boundary requirement by the copy from
>> old memory to buffer in 2nd kernel done in later patch.
>>
>> This patch doesn't copy each note segment into the 2nd kernel since
>> they amount to so large in total if there are multiple CPUs. For
>> example, current maximum number of CPUs in x86_64 is 5120, where note
>> segments exceed 1MB with NT_PRSTATUS only.
>
> I don't really understand this. Why does the number of or size of
> note segments affect their alignment?
>
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -38,6 +38,8 @@ static u64 vmcore_size;
>>
>> static struct proc_dir_entry *proc_vmcore = NULL;
>>
>> +static bool support_mmap_vmcore;
>
> This is quite regrettable. It means that on some kernels/machines,
> mmap(vmcore) simply won't work. This means that people might write
> code which works for them, but which will fail for others when deployed
> on a small number of machines.
>
> Can we avoid this? Why can't we just copy the notes even if there are
> a large number of them?
Yes. If it simplifies things I don't see a need to support mmapping
everything. But even there I don't see much of an issue.
Today we allocate a buffer to hold the ELF header program headers and
the note segment, and we could easily allocate that buffer in such a way
to make it mmapable.
Eric
On Tue, Mar 19, 2013 at 01:59:57PM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke <[email protected]> writes:
>
> > Due to mmap() requirement, we need to copy pages not starting or
> > ending with page-size aligned address in 2nd kernel and to map them to
> > user-space.
> >
> > For example, see the map below:
> >
> > 00000000-00010000 : reserved
> > 00010000-0009f800 : System RAM
> > 0009f800-000a0000 : reserved
> >
> > where the System RAM ends with 0x9f800 that is not page-size
> > aligned. This map is divided into two parts:
> >
> > 00010000-0009f000
> > 0009f000-0009f800
> >
> > and the first one is kept in old memory and the 2nd one is copied into
> > buffer on 2nd kernel.
> >
> > This kind of non-page-size-aligned area can always occur since any
> > part of System RAM can be converted into reserved area at runtime.
> >
> > If not doing copying like this and if remapping non page-size aligned
> > pages on old memory directly, mmap() had to export memory which is not
> > dump target to user-space. In the above example this is reserved
> > 0x9f800-0xa0000.
>
> So I have a question. Would it not be easier to only support mmaping on
> the things that are easily mmapable? And in the oddball corner cases
> require reading the data instead.
Hi Eric,
Are you saying that some parts of the vmcore file will support mmap() and
others will not? If yes, how would a user know which parts of the file are
mappable and which are not?
Thanks
Vivek
HATAYAMA Daisuke <[email protected]> writes:
> Currently, the code assumes that position of program header table is
> next to ELF header. But future change can break the assumption on
> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
> member explicitly to get position of program header table in
> file-offset.
In principle this looks good. However when I read this it looks like
you are going a little too far.
You are changing not only the reading of the supplied headers, but
also the generation of the new headers that describe the data provided
by /proc/vmcore.
I get lost following this after you mangle merge_note_headers.
In principle removing silly assumptions seems reasonable, but I think
it is completely orthogonal to the task of making vmcore mmapable.
I think it is fine to claim that the assumptions made here in vmcore are
part of the kexec on panic ABI at this point, which would generally make
this change unnecessary.
> Signed-off-by: Zhang Yanfei <[email protected]>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> fs/proc/vmcore.c | 56 +++++++++++++++++++++++++++++++++++-------------------
> 1 files changed, 36 insertions(+), 20 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index b870f74..163281e 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -221,8 +221,8 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
> Elf64_Phdr *phdr_ptr;
>
> ehdr_ptr = (Elf64_Ehdr *)elfptr;
> - phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
> - size = sizeof(Elf64_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
> + phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
> + size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
> for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> size += phdr_ptr->p_memsz;
> phdr_ptr++;
> @@ -238,8 +238,8 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
> Elf32_Phdr *phdr_ptr;
>
> ehdr_ptr = (Elf32_Ehdr *)elfptr;
> - phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
> - size = sizeof(Elf32_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
> + phdr_ptr = (Elf32_Phdr*)(elfptr + ehdr_ptr->e_phoff);
> + size = ehdr_ptr->e_phoff + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
> for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> size += phdr_ptr->p_memsz;
> phdr_ptr++;
> @@ -259,7 +259,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> u64 phdr_sz = 0, note_off;
>
> ehdr_ptr = (Elf64_Ehdr *)elfptr;
> - phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
> + phdr_ptr = (Elf64_Phdr*)(elfptr + ehdr_ptr->e_phoff);
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> int j;
> void *notes_section;
Up to here things look good.
> @@ -305,7 +305,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> /* Prepare merged PT_NOTE program header. */
> phdr.p_type = PT_NOTE;
> phdr.p_flags = 0;
> - note_off = sizeof(Elf64_Ehdr) +
> + note_off = ehdr_ptr->e_phoff +
> (ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
And this is just silly. There is no point in changing where the
regenerated headers live.
> phdr.p_offset = note_off;
> phdr.p_vaddr = phdr.p_paddr = 0;
> @@ -313,14 +313,14 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> phdr.p_align = 0;
>
> /* Add merged PT_NOTE program header*/
> - tmp = elfptr + sizeof(Elf64_Ehdr);
> + tmp = elfptr + ehdr_ptr->e_phoff;
Again this looks very silly.
> memcpy(tmp, &phdr, sizeof(phdr));
> tmp += sizeof(phdr);
>
> /* Remove unwanted PT_NOTE program headers. */
> i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
> *elfsz = *elfsz - i;
> - memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf64_Ehdr)-sizeof(Elf64_Phdr)));
> + memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf64_Phdr)));
This is a regenerated header so this change is dubious.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> Current code assumes all PT_NOTE headers are placed at the beginning
> of program header table and they are consequtive. But the assumption
> could be broken by future changes on either kexec-tools or the 1st
> kernel. This patch removes the assumption and rearranges program
> headers as the following conditions are satisfied:
>
> - PT_NOTE entry is unique at the first entry,
>
> - the order of program headers are unchanged during this
> rearrangement, only their positions are changed in positive
> direction.
>
> - unused part that occurs in the bottom of program headers are filled
> with 0.
>
> Also, this patch adds one exceptional case where the number of PT_NOTE
> entries is somehow 0. Then, immediately go out of the function.
This patch looks like you have really overthought this part of the code.
You are adding a fair amount of complexity for very little gain.
To clean this up I would recommend two buffers. A temporary buffer
for the program headers read out of oldmem, and a longer-lived buffer
into which you generate the new headers. Then the scary memmove and
the assumptions about location in the PT_LOAD chain can be removed
without having to do fancy, hard-to-follow multi-pass
code.
If the result isn't going to be clean and easy to follow we might as
well deem the requirements of the existing code an ABI and not worry
about relaxing them.
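To make the suggestion concrete, here is a rough sketch of the two-buffer
idea (the function and variable names are mine and purely illustrative,
and the nr_ptnote == 0 case is left to the caller):

/*
 * Rough sketch only.  Regenerate the headers into a fresh buffer
 * instead of shuffling entries around in place.
 */
static char * __init regen_headers_elf64(const Elf64_Ehdr *old_ehdr,
					 const Elf64_Phdr *old_phdr,
					 size_t *new_sz)
{
	int i, nr_ptnote = 0;
	Elf64_Ehdr *ehdr;
	Elf64_Phdr *phdr;
	char *buf;

	for (i = 0; i < old_ehdr->e_phnum; i++)
		if (old_phdr[i].p_type == PT_NOTE)
			nr_ptnote++;

	/* One merged PT_NOTE entry plus every non-PT_NOTE entry. */
	*new_sz = sizeof(Elf64_Ehdr) +
		  (old_ehdr->e_phnum - nr_ptnote + 1) * sizeof(Elf64_Phdr);
	buf = kzalloc(*new_sz, GFP_KERNEL);
	if (!buf)
		return NULL;

	ehdr = (Elf64_Ehdr *)buf;
	*ehdr = *old_ehdr;
	ehdr->e_phoff = sizeof(Elf64_Ehdr);
	ehdr->e_phnum = old_ehdr->e_phnum - nr_ptnote + 1;

	/* Slot 0 is reserved for the merged PT_NOTE entry. */
	phdr = (Elf64_Phdr *)(buf + ehdr->e_phoff) + 1;
	for (i = 0; i < old_ehdr->e_phnum; i++)
		if (old_phdr[i].p_type != PT_NOTE)
			*phdr++ = old_phdr[i];

	return buf;
}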
Eric
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> fs/proc/vmcore.c | 92 +++++++++++++++++++++++++++++++++++++++++++-----------
> 1 files changed, 74 insertions(+), 18 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 94743d2..7d2dc4c 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -251,8 +251,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
> static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> struct list_head *vc_list)
> {
> - int i, nr_ptnote=0, rc=0;
> - char *tmp;
> + int i, j, nr_ptnote=0, i_ptnote, rc=0;
> Elf64_Ehdr *ehdr_ptr;
> Elf64_Phdr phdr, *phdr_ptr;
> Elf64_Nhdr *nhdr_ptr;
> @@ -301,6 +300,39 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> kfree(notes_section);
> }
>
> + if (nr_ptnote == 0)
> + goto out;
> +
> + phdr_ptr = (Elf64_Phdr *)(elfptr + ehdr_ptr->e_phoff);
> +
> + /* Remove unwanted PT_NOTE program headers. */
> +
> + /* - 1st pass shifts non-PT_NOTE entries until the first
> + PT_NOTE entry. */
> + i_ptnote = -1;
> + for (i = 0; i < ehdr_ptr->e_phnum; ++i) {
> + if (phdr_ptr[i].p_type == PT_NOTE) {
> + i_ptnote = i;
> + break;
> + }
> + }
> + BUG_ON(i_ptnote == -1); /* impossible case since nr_ptnote > 0. */
> + memmove(phdr_ptr + 1, phdr_ptr, i_ptnote * sizeof(Elf64_Phdr));
> +
> + /* - 2nd pass moves the remaining non-PT_NOTE entries under
> + the first PT_NOTE entry. */
> + for (i = j = i_ptnote + 1; i < ehdr_ptr->e_phnum; i++) {
> + if (phdr_ptr[i].p_type != PT_NOTE) {
> + memmove(phdr_ptr + j, phdr_ptr + i,
> + sizeof(Elf64_Phdr));
> + j++;
> + }
> + }
> +
> + /* - Finally, fill unused part with 0. */
> + memset(phdr_ptr + ehdr_ptr->e_phnum - (nr_ptnote - 1), 0,
> + (nr_ptnote - 1) * sizeof(Elf64_Phdr));
> +
> /* Prepare merged PT_NOTE program header. */
> phdr.p_type = PT_NOTE;
> phdr.p_flags = 0;
> @@ -312,18 +344,14 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> phdr.p_align = 0;
>
> /* Add merged PT_NOTE program header*/
> - tmp = elfptr + ehdr_ptr->e_phoff;
> - memcpy(tmp, &phdr, sizeof(phdr));
> - tmp += sizeof(phdr);
> + memcpy(phdr_ptr, &phdr, sizeof(Elf64_Phdr));
>
> - /* Remove unwanted PT_NOTE program headers. */
> - i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
> - *elfsz = *elfsz - i;
> - memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf64_Phdr)));
> + *elfsz = *elfsz - (nr_ptnote - 1) * sizeof(Elf64_Phdr);
>
> /* Modify e_phnum to reflect merged headers. */
> ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
>
> +out:
> return 0;
> }
>
> @@ -331,8 +359,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> struct list_head *vc_list)
> {
> - int i, nr_ptnote=0, rc=0;
> - char *tmp;
> + int i, j, nr_ptnote=0, i_ptnote, rc=0;
> Elf32_Ehdr *ehdr_ptr;
> Elf32_Phdr phdr, *phdr_ptr;
> Elf32_Nhdr *nhdr_ptr;
> @@ -381,6 +408,39 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> kfree(notes_section);
> }
>
> + if (nr_ptnote == 0)
> + goto out;
> +
> + phdr_ptr = (Elf32_Phdr *)(elfptr + ehdr_ptr->e_phoff);
> +
> + /* Remove unwanted PT_NOTE program headers. */
> +
> + /* - 1st pass shifts non-PT_NOTE entries until the first
> + PT_NOTE entry. */
> + i_ptnote = -1;
> + for (i = 0; i < ehdr_ptr->e_phnum; ++i) {
> + if (phdr_ptr[i].p_type == PT_NOTE) {
> + i_ptnote = i;
> + break;
> + }
> + }
> + BUG_ON(i_ptnote == -1); /* impossible case since nr_ptnote > 0. */
> + memmove(phdr_ptr + 1, phdr_ptr, i_ptnote * sizeof(Elf32_Phdr));
> +
> + /* - 2nd pass moves the remaining non-PT_NOTE entries under
> + the first PT_NOTE entry. */
> + for (i = j = i_ptnote + 1; i < ehdr_ptr->e_phnum; i++) {
> + if (phdr_ptr[i].p_type != PT_NOTE) {
> + memmove(phdr_ptr + j, phdr_ptr + i,
> + sizeof(Elf32_Phdr));
> + j++;
> + }
> + }
> +
> + /* - Finally, fill unused part with 0. */
> + memset(phdr_ptr + ehdr_ptr->e_phnum - (nr_ptnote - 1), 0,
> + (nr_ptnote - 1) * sizeof(Elf32_Phdr));
> +
> /* Prepare merged PT_NOTE program header. */
> phdr.p_type = PT_NOTE;
> phdr.p_flags = 0;
> @@ -392,18 +452,14 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> phdr.p_align = 0;
>
> /* Add merged PT_NOTE program header*/
> - tmp = elfptr + ehdr_ptr->e_phoff;
> - memcpy(tmp, &phdr, sizeof(phdr));
> - tmp += sizeof(phdr);
> + memcpy(phdr_ptr, &phdr, sizeof(Elf32_Phdr));
>
> - /* Remove unwanted PT_NOTE program headers. */
> - i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
> - *elfsz = *elfsz - i;
> - memmove(tmp, tmp+i, ((*elfsz)-ehdr_ptr->e_phoff-sizeof(Elf32_Phdr)));
> + *elfsz = *elfsz - (nr_ptnote - 1) * sizeof(Elf32_Phdr);
>
> /* Modify e_phnum to reflect merged headers. */
> ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
>
> +out:
> return 0;
> }
>
HATAYAMA Daisuke <[email protected]> writes:
> To satisfy mmap() page-size boundary requirement, round up buffer size
> of ELF headers by PAGE_SIZE. The resulting value becomes offset of ELF
> note segments and it's assigned in unique PT_NOTE program header
> entry.
Ok. That is just silly. You can use a single buffer for the ELF header,
the program header and the notes. It just requires a bit of counting
ahead of time.
The ELF header itself is small, and so are the program headers,
especially if you only have one PT_NOTE segment. The only thing that
possibly gets big is the note segment, and then that only happens if you
have a lot of cpus.
Since these are entirely local constructs it seems extremely silly,
wasteful and complicated to place each logical part in a separately
mmapable buffer instead of placing them in the same mmapable buffer.
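For illustration only, counting ahead of time could look roughly like
this (the function name, nr_ptnote/note_sz parameters, and the final
page round-up are assumptions, not posted code):

/*
 * Rough sketch: size the single buffer up front.  note_sz would come
 * from walking the per-cpu note segments first; nr_ptnote is the
 * number of PT_NOTE entries found there.
 */
static size_t __init merged_buffer_size_elf64(const Elf64_Ehdr *ehdr,
					      int nr_ptnote, u64 note_sz)
{
	size_t sz;

	sz  = sizeof(Elf64_Ehdr);
	/* every non-PT_NOTE entry survives, plus one merged PT_NOTE */
	sz += (ehdr->e_phnum - nr_ptnote + 1) * sizeof(Elf64_Phdr);
	sz += note_sz;			/* the merged note data itself */

	/* one page-size round-up keeps the whole buffer mmapable */
	return roundup(sz, PAGE_SIZE);
}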
Eric
HATAYAMA Daisuke <[email protected]> writes:
> To satisfy mmap()'s page-size boundary requirement, allocate per-cpu
> crash_notes objects on page-size boundary.
>
> /proc/vmcore on the 2nd kernel checks if each note objects is
> allocated on page-size boundary. If there's some object not satisfying
> the page-size boundary requirement, /proc/vmcore doesn't provide
> mmap() interface.
On second look this requirement does not exist. These all get copied
into the unified PT_NOTE segment so this patch is pointless and wrong.
I see no evidence of any need to change anything in the first kernel,
and this noticeably increases the amount of memory used in the first
kernel and in the second kernel for no gain.
Eric
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> kernel/kexec.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index bddd3d7..d1f365e 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -1234,7 +1234,8 @@ void crash_save_cpu(struct pt_regs *regs, int cpu)
> static int __init crash_notes_memory_init(void)
> {
> /* Allocate memory for saving cpu registers. */
> - crash_notes = alloc_percpu(note_buf_t);
> + crash_notes = __alloc_percpu(roundup(sizeof(note_buf_t), PAGE_SIZE),
> + PAGE_SIZE);
> if (!crash_notes) {
> printk("Kexec: Memory allocation for saving cpu register"
> " states failed\n");
HATAYAMA Daisuke <[email protected]> writes:
> To satisfy mmap()'s page-size boundary requirement, specify aligned
> attribute to vmcoreinfo_note objects to allocate it on page-size
> boundary.
Those requirements don't exist. Making this patch wasteful and wrong.
Eric
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> include/linux/kexec.h | 6 ++++--
> kernel/kexec.c | 2 +-
> 2 files changed, 5 insertions(+), 3 deletions(-)
>
> diff --git a/include/linux/kexec.h b/include/linux/kexec.h
> index d2e6927..5113570 100644
> --- a/include/linux/kexec.h
> +++ b/include/linux/kexec.h
> @@ -185,8 +185,10 @@ extern struct kimage *kexec_crash_image;
> #define VMCOREINFO_BYTES (4096)
> #define VMCOREINFO_NOTE_NAME "VMCOREINFO"
> #define VMCOREINFO_NOTE_NAME_BYTES ALIGN(sizeof(VMCOREINFO_NOTE_NAME), 4)
> -#define VMCOREINFO_NOTE_SIZE (KEXEC_NOTE_HEAD_BYTES*2 + VMCOREINFO_BYTES \
> - + VMCOREINFO_NOTE_NAME_BYTES)
> +#define VMCOREINFO_NOTE_SIZE ALIGN(KEXEC_NOTE_HEAD_BYTES*2 \
> + +VMCOREINFO_BYTES \
> + +VMCOREINFO_NOTE_NAME_BYTES, \
> + PAGE_SIZE)
>
> /* Location of a reserved region to hold the crash kernel.
> */
> diff --git a/kernel/kexec.c b/kernel/kexec.c
> index d1f365e..195de6d 100644
> --- a/kernel/kexec.c
> +++ b/kernel/kexec.c
> @@ -43,7 +43,7 @@ note_buf_t __percpu *crash_notes;
>
> /* vmcoreinfo stuff */
> static unsigned char vmcoreinfo_data[VMCOREINFO_BYTES];
> -u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4];
> +u32 vmcoreinfo_note[VMCOREINFO_NOTE_SIZE/4] __aligned(PAGE_SIZE);
> size_t vmcoreinfo_size;
> size_t vmcoreinfo_max_size = sizeof(vmcoreinfo_data);
>
HATAYAMA Daisuke <[email protected]> writes:
> Fill both crash_notes and vmcoreinfo_note buffers by NT_VMCORE_PAD
> note type to make them satisfy mmap()'s page-size boundary
> requirement.
The requirement does not exist, making this change wasteful, wrong, and
potentially an ABI change between the first and second kernels.
Further, I believe that if you check, you will find that final_note
comes out of the ELF spec for note segments.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> Modern kernel marks the end of ELF note buffer with NT_VMCORE_PAD type
> note in order to make the buffer satisfy mmap()'s page-size boundary
> requirement. This patch makes finishing reading each buffer if the
> note type now being read is NT_VMCORE_PAD type.
There is absolutely no need for this ABI change. If you need a wasteful
pad note taking up space (which I see no evidence for) you can still
terminate this with an empty note.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> If there's some vmcore object that doesn't satisfy page-size boundary
> requirement, remap_pfn_range() fails to remap it to user-space.
>
> Objects that posisbly don't satisfy the requirement are ELF note
> segments only. The memory chunks corresponding to PT_LOAD entries are
> guaranteed to satisfy page-size boundary requirement by the copy from
> old memory to buffer in 2nd kernel done in later patch.
>
> This patch doesn't copy each note segment into the 2nd kernel since
> they amount to so large in total if there are multiple CPUs. For
> example, current maximum number of CPUs in x86_64 is 5120, where note
> segments exceed 1MB with NT_PRSTATUS only.
So you require the first kernel to reserve an additional 20MB, instead
of just 1.6MB. 336 bytes versus 4096 bytes.
That seems like completely the wrong tradeoff in memory consumption,
filesize, and backwards compatibility.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> Currently, read to /proc/vmcore is done by read_oldmem() that uses
> ioremap/iounmap per a single page. For example, if memory is 1GB,
> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
> times. This causes big performance degradation.
>
> In particular, the current main user of this mmap() is makedumpfile,
> which not only reads memory from /proc/vmcore but also does other
> processing like filtering, compression and IO work. Update of page
> table and the following TLB flush makes such processing much slow;
> though I have yet to make patch for makedumpfile and yet to confirm
> how it's improved.
>
> To address the issue, this patch implements mmap() on /proc/vmcore to
> improve read performance. My simple benchmark shows the improvement
> from 200 [MiB/sec] to over 50.0 [GiB/sec].
I am in favor of this direction and the performance and other gains look
good.
I am not in favor of the ABI changes nor of the nearly order of
magnitude memory usage increase for elf notes by rounding everything up
to a page size boundary.
As a general note it is possible to support mmaping any partial page
by just rounding inside of your mmap function so you should not need to
copy partial pages.
If you don't want the memory overhead of merging the ELF notes in memory
in the second kernel you can simply require that the ELF header, the ELF
program header, and the PT_NOTE section be read from /proc/vmcore
instead of mmaped.
I did the math and with your changes to note generation in the worst
case you are reserving 20MiB in the first kernel to replace a 1.6MiB
with a 240KiB allocation in the second kernel. That is the wrong
tradeoff, especially when you require an ABI change at the same time,
and the 5120+ entries in vmcore_list will likely measurably slow down
setting up your mappings with mmap.
Eric
Vivek Goyal <[email protected]> writes:
> Are you saying that some parts of the vmcore file will support mmap() and
> others will not. If yes, how would a user know which parts of file are
> mappable and which are not.
I think I answered this in another email in my review but I will
reanswer here.
There is absolutely no need to copy pages. We can round up the mapping
to the nearest full page. Mmap already does that today for files so it
isn't even odd.
Which only leaves the headers and notes as potentially unmmapable. That
is more of a policy decision, and a decision on where we want to spend
memory. Rounding ELF notes to multiples of PAGE_SIZE from a perspective
of memory usage seems pretty terrible taking memory usage up an order of
magnitude.
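As a rough sketch of the rounding idea (names are illustrative, not the
actual mmap_vmcore() implementation):

/*
 * Map a chunk whose physical start/end are not page aligned by rounding
 * outward instead of copying the partial pages into the second kernel.
 */
static int remap_oldmem_chunk(struct vm_area_struct *vma,
			      unsigned long vaddr, u64 paddr, u64 size)
{
	u64 start = rounddown(paddr, PAGE_SIZE);
	u64 end = roundup(paddr + size, PAGE_SIZE);

	/*
	 * The extra head/tail bytes come from a reserved page; exporting
	 * them is treated as harmless here.
	 */
	return remap_pfn_range(vma, vaddr, start >> PAGE_SHIFT,
			       end - start, vma->vm_page_prot);
}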
Eric
On Tue, Mar 19, 2013 at 03:12:10PM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke <[email protected]> writes:
>
> > To satisfy mmap()'s page-size boundary requirement, allocate per-cpu
> > crash_notes objects on page-size boundary.
> >
> > /proc/vmcore on the 2nd kernel checks if each note objects is
> > allocated on page-size boundary. If there's some object not satisfying
> > the page-size boundary requirement, /proc/vmcore doesn't provide
> > mmap() interface.
>
> On second look this requirement does not exist. These all get copyied
> into the unifed PT_NOTE segment so this patch is pointless and wrong.
Hi Eric,
They get copied only temporarily and then we free them. Actual reading
of note data still goes to old kernel's memory.
We copy them temporarily to figure out how much data is there in the
note and update the size of single PT_NOTE header accordingly. We also
add a new element to vmcore_list so that next time a read happens on
an offset which falls into this particular note, we go and read it from
old memory.
merge_note_headers_elf64() {
	notes_section = kmalloc(max_sz, GFP_KERNEL);
	rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
	/* parse the note, update relevant data structures */
	kfree(notes_section);
}
And that's why we have the problem. The note buffers are physically
present in the old kernel's memory, but in /proc/vmcore we have exported
them as a contiguous view. So we don't even have the option of mapping
extra bytes (there is no space to map extra bytes).
So there seem to be a few options.
- Do not merge headers. Keep one separate PT_NOTE header for each note and
then map extra bytes aligned. But that's kind of different and gdb does
not expect that. All elf_prstatus notes are supposed to be in a single
PT_NOTE header.
- Copy all notes to second kernel's memory.
- Align notes in the first kernel to a page boundary and pad them. I had
assumed that we are already allocating close to 4K of memory in the first
kernel, but it looks like that's not the case. So I agree that would be
quite wasteful of memory.
In fact we are not exporting the size of the note to user space, and
kexec-tools seems to be assuming a MAX_NOTE_BYTES of 1024, which seems
very horrible. Anyway, that's a different issue. We should also export
the size of the memory reserved for ELF notes.
Then how about option 2, that is, copying all notes into the new
kernel's memory? Hatayama had initially implemented that approach and I
suggested padding the notes in the first kernel to a 4K page-size
boundary (in an attempt to reduce the second kernel's memory usage).
But it sounds like a per-cpu ELF note is much smaller than 4K, so
rounding these off to 4K sounds much more wasteful of memory.
Would you like option 2 here, where we copy the notes into contiguous
memory in the new kernel and go from there?
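Just to illustrate option 2, a rough sketch of copying the notes into one
contiguous buffer (the function name, dst and dst_sz are made up, not
posted code):

/*
 * Copy every note segment from old memory into one contiguous buffer in
 * the second kernel so the merged PT_NOTE data itself becomes mmapable.
 */
static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr,
				   const Elf64_Phdr *phdr,
				   char *dst, u64 dst_sz)
{
	u64 pos = 0;
	int i, rc;

	for (i = 0; i < ehdr->e_phnum; i++) {
		u64 offset = phdr[i].p_offset;

		if (phdr[i].p_type != PT_NOTE)
			continue;
		if (pos + phdr[i].p_memsz > dst_sz)
			return -E2BIG;
		rc = read_from_oldmem(dst + pos, phdr[i].p_memsz,
				      &offset, 0);
		if (rc < 0)
			return rc;
		pos += phdr[i].p_memsz;
	}
	return 0;
}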
Thanks
Vivek
On Tue, Mar 19, 2013 at 01:02:29PM -0700, Andrew Morton wrote:
> On Sat, 16 Mar 2013 13:02:29 +0900 HATAYAMA Daisuke <[email protected]> wrote:
>
> > If there's some vmcore object that doesn't satisfy page-size boundary
> > requirement, remap_pfn_range() fails to remap it to user-space.
> >
> > Objects that posisbly don't satisfy the requirement are ELF note
> > segments only. The memory chunks corresponding to PT_LOAD entries are
> > guaranteed to satisfy page-size boundary requirement by the copy from
> > old memory to buffer in 2nd kernel done in later patch.
> >
> > This patch doesn't copy each note segment into the 2nd kernel since
> > they amount to so large in total if there are multiple CPUs. For
> > example, current maximum number of CPUs in x86_64 is 5120, where note
> > segments exceed 1MB with NT_PRSTATUS only.
>
> I don't really understand this. Why does the number of or size of
> note segments affect their alignment?
>
> > --- a/fs/proc/vmcore.c
> > +++ b/fs/proc/vmcore.c
> > @@ -38,6 +38,8 @@ static u64 vmcore_size;
> >
> > static struct proc_dir_entry *proc_vmcore = NULL;
> >
> > +static bool support_mmap_vmcore;
>
> This is quite regrettable. It means that on some kernels/machines,
> mmap(vmcore) simply won't work. This means that people might write
> code which works for them, but which will fail for others when deployed
> on a small number of machines.
>
> Can we avoid this? Why can't we just copy the notes even if there are
> a large number of them?
Actually, initially he implemented copying the notes to the second
kernel and I suggested going the other way (I tried too hard to save
memory in the second kernel). I guess it was not a good idea, and
copying the notes keeps it simple.
Thanks
Vivek
On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke <[email protected]> writes:
>
> > If there's some vmcore object that doesn't satisfy page-size boundary
> > requirement, remap_pfn_range() fails to remap it to user-space.
> >
> > Objects that posisbly don't satisfy the requirement are ELF note
> > segments only. The memory chunks corresponding to PT_LOAD entries are
> > guaranteed to satisfy page-size boundary requirement by the copy from
> > old memory to buffer in 2nd kernel done in later patch.
> >
> > This patch doesn't copy each note segment into the 2nd kernel since
> > they amount to so large in total if there are multiple CPUs. For
> > example, current maximum number of CPUs in x86_64 is 5120, where note
> > segments exceed 1MB with NT_PRSTATUS only.
>
> So you require the first kernel to reserve an additional 20MB, instead
> of just 1.6MB. 336 bytes versus 4096 bytes.
>
> That seems like completely the wrong tradeoff in memory consumption,
> filesize, and backwards compatibility.
Agreed.
So we already copy ELF headers in second kernel's memory. If we start
copying notes too, then both headers and notes will support mmap().
For mmap() of memory regions which are not page aligned, we can map
extra bytes (as you suggested in one of the mails). Given the fact
that we have one ELF header for every memory range, we can always modify
the file offset where phdr data is starting to make space for mapping
of extra bytes.
That way whole of vmcore should be mmappable and user does not have
to worry about reading part of the file and mmaping the rest.
Thanks
Vivek
Vivek Goyal <[email protected]> writes:
> On Tue, Mar 19, 2013 at 03:12:10PM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>> > To satisfy mmap()'s page-size boundary requirement, allocate per-cpu
>> > crash_notes objects on page-size boundary.
>> >
>> > /proc/vmcore on the 2nd kernel checks if each note objects is
>> > allocated on page-size boundary. If there's some object not satisfying
>> > the page-size boundary requirement, /proc/vmcore doesn't provide
>> > mmap() interface.
>>
>> On second look this requirement does not exist. These all get copyied
>> into the unifed PT_NOTE segment so this patch is pointless and wrong.
>
> Hi Eric,
>
> They get copied only temporarily and then we free them. Actual reading
> of note data still goes to old kernel's memory.
>
> We copy them temporarily to figure out how much data is there in the
> note and update the size of single PT_NOTE header accordingly. We also
> add a new element to vmcore_list so that next time a read happens on
> an offset which falls into this particular note, we go and read it from
> old memory.
>
> merge_note_headers_elf64() {
> notes_section = kmalloc(max_sz, GFP_KERNEL);
> rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
> /* parse the note, update relevant data structuers */
> kfree(notes_section);
> }
>
> And that's why we have the problem. Actually note buffers are physically
> present in old kernel's memory but in /proc/vmcore we have exported them
> as contiguous view. So we don't even have the option of mapping extra
> bytes (there is no space for mapping extra bytes).
>
> So there seem to be few options.
>
> - Do not merge headers. Keep one separate PT_NOTE header for each note and
> then map extra bytes aligned. But that's kind of different and gdb does
> not expect that. All elf_prstatus are supposed to be in single PT_NOTE
> header.
>
> - Copy all notes to second kernel's memory.
>
> - align notes in first kernel to page boundary and pad them. I had assumed
> that we are already allocating close to 4K of memory in first kernel but
> looks like that's not the case. So agree that will be quite wasteful of
> memory.
>
> In fact we are not exporting size of note to user space and kexec-tools
> seems to be assuming MAX_NOTE_BYTES of 1024 and that seems very horrible.
> Anyway, thats a different issue. We should also export size of reserved
> memory for elf notes.
>
>
> Then how about option 2. That is copy all notes in new kernel's memory.
> Hatayama had initially implemented that appraoch and I suggested to pad
> notes in first kernel to 4K page size boundary. (In an attempt to reduce
> second kernel's memory usage). But sounds like per cpu elf note is much
> smaller and not 4K. So rounding off these to 4K sounds much more wasteful
> of memory.
>
> Will you like option 2 here where we copy notes to new kernel's memory
> in contiguous memory and go from there?
Blink. I had missed the fact that we were not keeping the merged copy
of the elf notes. That does seem silly after we go to all of the work
to create it.
The two practical options I see are:
1) Don't allow mmap of the new headers and the merged notes.
2) Copy the notes into a merged note section in the second kernel.
I have no problem with either option.
In practice I expect copying of the notes will scale better as that
terribly long linked list won't need to hold all of the previous note
sections, but there are other ways to get that speed up.
Eric
Vivek Goyal <[email protected]> writes:
> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>> > If there's some vmcore object that doesn't satisfy page-size boundary
>> > requirement, remap_pfn_range() fails to remap it to user-space.
>> >
>> > Objects that posisbly don't satisfy the requirement are ELF note
>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>> > guaranteed to satisfy page-size boundary requirement by the copy from
>> > old memory to buffer in 2nd kernel done in later patch.
>> >
>> > This patch doesn't copy each note segment into the 2nd kernel since
>> > they amount to so large in total if there are multiple CPUs. For
>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>> > segments exceed 1MB with NT_PRSTATUS only.
>>
>> So you require the first kernel to reserve an additional 20MB, instead
>> of just 1.6MB. 336 bytes versus 4096 bytes.
>>
>> That seems like completely the wrong tradeoff in memory consumption,
>> filesize, and backwards compatibility.
>
> Agreed.
>
> So we already copy ELF headers in second kernel's memory. If we start
> copying notes too, then both headers and notes will support mmap().
The only real issue is that it could be a bit tricky to allocate all of the memory
for the notes section on high cpu count systems in a single allocation.
> For mmap() of memory regions which are not page aligned, we can map
> extra bytes (as you suggested in one of the mails). Given the fact
> that we have one ELF header for every memory range, we can always modify
> the file offset where phdr data is starting to make space for mapping
> of extra bytes.
Agreed, ELF file offset % PAGE_SIZE should equal physical address %
PAGE_SIZE to make mmap work.
> That way whole of vmcore should be mmappable and user does not have
> to worry about reading part of the file and mmaping the rest.
That sounds simplest.
If core counts on the high end do more than double every 2 years we
might have a problem. Otherwise making everything mmapable seems easy
and sound.
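As an illustration only, keeping the file offset congruent with the
physical address could look like this (the function name and 'cur', a
running file offset, are hypothetical):

/*
 * Keep p_offset congruent to p_paddr modulo PAGE_SIZE when laying out
 * PT_LOAD entries so the whole segment can be remapped page by page.
 */
static u64 place_load_segment(Elf64_Phdr *phdr, u64 cur)
{
	u64 pgoff = phdr->p_paddr & ~PAGE_MASK;	/* offset inside the page */

	cur = roundup(cur, PAGE_SIZE) + pgoff;
	phdr->p_offset = cur;
	return cur + phdr->p_filesz;
}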
Eric
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table
Date: Tue, 19 Mar 2013 14:44:16 -0700
> HATAYAMA Daisuke <[email protected]> writes:
>
>> Currently, the code assumes that position of program header table is
>> next to ELF header. But future change can break the assumption on
>> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
>> member explicitly to get position of program header table in
>> file-offset.
>
> In principle this looks good. However when I read this it looks like
> you are going a little too far.
>
> You are changing not only the reading of the supplied headers, but
> you are changing the generation of the new new headers that describe
> the data provided by /proc/vmcore.
>
> I get lost in following this after you mangle merge_note_headers.
>
> In principle removing silly assumptions seems reasonable, but I think
> it is completely orthogonal to the task of maping vmcore mmapable.
>
> I think it is fine to claim that the assumptions made here in vmcore are
> part of the kexec on panic ABI at this point, which would generally make
> this change unnecessary.
This was suggested by Vivek; he prefers the generic approach.
Vivek, do you agree with this? Or is it better to re-post this and the
other clean-up patches as a separate series from this patch set?
Thanks.
HATAYAMA, Daisuke
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer
Date: Tue, 19 Mar 2013 14:11:51 -0700
> HATAYAMA Daisuke <[email protected]> writes:
>
>> Modern kernel marks the end of ELF note buffer with NT_VMCORE_PAD type
>> note in order to make the buffer satisfy mmap()'s page-size boundary
>> requirement. This patch makes finishing reading each buffer if the
>> note type now being read is NT_VMCORE_PAD type.
>
> Ick. Even with a pad header you can mark the end with an empty header,
> and my memory may be deceiving me but I believe an empty header is
> specified by the ELF ABI docs.
>
> Beyond which I don't quite see the point of any of this as all of these
> headers need to be combined into a single note section before being
> presented to user space.
Though this patch might become unnecessary later, I cannot find the
part of the ELF spec that explains the necessity of marking the end of
ELF note segments with an empty header. For example:
gabi
http://www.sco.com/developers/gabi/latest/ch5.pheader.html#note_section
AMD64
http://www.x86-64.org/documentation/abi.pdf
But there is still the possibility that some other particular spec
states the necessity of an empty header.
Also, it's possible to get the size of the whole ELF note segment from
p_memsz or p_filesz, and gdb and binutils read the note segments until
reaching that size.
Thanks.
HATAYAMA, Daisuke
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 13:55:55 -0700
> Vivek Goyal <[email protected]> writes:
>
>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>> HATAYAMA Daisuke <[email protected]> writes:
>>>
>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>> >
>>> > Objects that posisbly don't satisfy the requirement are ELF note
>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>> > old memory to buffer in 2nd kernel done in later patch.
>>> >
>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>> > they amount to so large in total if there are multiple CPUs. For
>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>> > segments exceed 1MB with NT_PRSTATUS only.
>>>
>>> So you require the first kernel to reserve an additional 20MB, instead
>>> of just 1.6MB. 336 bytes versus 4096 bytes.
>>>
>>> That seems like completely the wrong tradeoff in memory consumption,
>>> filesize, and backwards compatibility.
>>
>> Agreed.
>>
>> So we already copy ELF headers in second kernel's memory. If we start
>> copying notes too, then both headers and notes will support mmap().
>
> The only real is it could be a bit tricky to allocate all of the memory
> for the notes section on high cpu count systems in a single allocation.
>
Do you mean that on machines with many CPUs it becomes difficult to get
free pages contiguous enough to cover all the notes?
If so, is it necessary to take care of that in the next patch? Or
should it be left pending for now?
>> For mmap() of memory regions which are not page aligned, we can map
>> extra bytes (as you suggested in one of the mails). Given the fact
>> that we have one ELF header for every memory range, we can always modify
>> the file offset where phdr data is starting to make space for mapping
>> of extra bytes.
>
> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
> make mmap work.
>
OK, your conclusion is that the 1st version is better than the 2nd.
The purpose of this design was to export nothing but the dump target
memory to user-space from /proc/vmcore. I think it is better to do that
if possible. The read interface can fill the corresponding part with 0,
but the mmap interface cannot modify data on old memory.
Do you agree that the vmcores seen from the read and mmap interfaces
would then no longer coincide?
Thanks.
HATAYAMA, Daisuke
From: Andrew Morton <[email protected]>
Subject: Re: [PATCH v3 00/21] kdump, vmcore: support mmap() on /proc/vmcore
Date: Tue, 19 Mar 2013 12:30:05 -0700
> On Sat, 16 Mar 2013 13:00:47 +0900 HATAYAMA Daisuke <[email protected]> wrote:
>
>> Currently, read to /proc/vmcore is done by read_oldmem() that uses
>> ioremap/iounmap per a single page. For example, if memory is 1GB,
>> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
>> times. This causes big performance degradation.
>>
>> In particular, the current main user of this mmap() is makedumpfile,
>> which not only reads memory from /proc/vmcore but also does other
>> processing like filtering, compression and IO work. Update of page
>> table and the following TLB flush makes such processing much slow;
>> though I have yet to make patch for makedumpfile and yet to confirm
>> how it's improved.
>>
>> To address the issue, this patch implements mmap() on /proc/vmcore to
>> improve read performance. My simple benchmark shows the improvement
>> from 200 [MiB/sec] to over 50.0 [GiB/sec].
>
> There are quite a lot of userspace-visible vmcore changes here. Is it
> all fully back-compatible? Will all known userspace continue to work
> OK on newer kernels?
>
I designed it to keep backward compatibility at least for gdb and
binutils, but not necessarily for makedumpfile, since makedumpfile is
expected to follow kernel changes; an old makedumpfile cannot be used
with newer kernels, and that is within the scope of this review.
Thanks.
HATAYAMA, Daisuke
HATAYAMA Daisuke <[email protected]> writes:
> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer
> Date: Tue, 19 Mar 2013 14:11:51 -0700
>
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>>> Modern kernel marks the end of ELF note buffer with NT_VMCORE_PAD type
>>> note in order to make the buffer satisfy mmap()'s page-size boundary
>>> requirement. This patch makes finishing reading each buffer if the
>>> note type now being read is NT_VMCORE_PAD type.
>>
>> Ick. Even with a pad header you can mark the end with an empty header,
>> and my memory may be deceiving me but I believe an empty header is
>> specified by the ELF ABI docs.
>>
>> Beyond which I don't quite see the point of any of this as all of these
>> headers need to be combined into a single note section before being
>> presented to user space.
>
> Though this patch might get unecessary later, I cannot find part
> explaining necessity of marking end of ELF segmetns with an empty
> header in ELF spec. For example:
You are right. It appears to be our own invention and just part of
the ABI of taking a crash dump. Something we should sit down and
document someday.
> Also, it's possible to get size of a whole part of ELF note segments
> from p_memsz or p_filesz, and gdb and binutils are reading the note
> segments until reaching the size.
Agreed. Except in our weird case where we generate the notes on the
fly, and generate the NOTE segment header much earlier.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
> Date: Wed, 20 Mar 2013 13:55:55 -0700
>
>> Vivek Goyal <[email protected]> writes:
>>
>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>> HATAYAMA Daisuke <[email protected]> writes:
>>>>
>>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>>> >
>>>> > Objects that posisbly don't satisfy the requirement are ELF note
>>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>>> > old memory to buffer in 2nd kernel done in later patch.
>>>> >
>>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>>> > they amount to so large in total if there are multiple CPUs. For
>>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>>> > segments exceed 1MB with NT_PRSTATUS only.
>>>>
>>>> So you require the first kernel to reserve an additional 20MB, instead
>>>> of just 1.6MB. 336 bytes versus 4096 bytes.
>>>>
>>>> That seems like completely the wrong tradeoff in memory consumption,
>>>> filesize, and backwards compatibility.
>>>
>>> Agreed.
>>>
>>> So we already copy ELF headers in second kernel's memory. If we start
>>> copying notes too, then both headers and notes will support mmap().
>>
>> The only real is it could be a bit tricky to allocate all of the memory
>> for the notes section on high cpu count systems in a single allocation.
>>
>
> Do you mean it's getting difficult on many-cpus machine to get free
> pages consequtive enough to be able to cover all the notes?
>
> If so, is it necessary to think about any care to it in the next
> patch? Or, should it be pending for now?
I meant that in general allocations > PAGE_SIZE get increasingly
unreliable the larger they are. And on large cpu count machines we are
having larger allocations. Of course large cpu count machines typically
have more memory so the odds go up.
Right now MAX_ORDER seems to be set to 11 which is 8MiB, and my x86_64
machine certainly succeeded in an order 11 allocation during boot so I
don't expect any real problems with a 2MiB allocation but it is
something to keep an eye on with kernel memory.
>>> For mmap() of memory regions which are not page aligned, we can map
>>> extra bytes (as you suggested in one of the mails). Given the fact
>>> that we have one ELF header for every memory range, we can always modify
>>> the file offset where phdr data is starting to make space for mapping
>>> of extra bytes.
>>
>> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
>> make mmap work.
>>
>
> OK, your conclusion is the 1st version is better than the 2nd.
>
> The purpose of this design was not to export anything but dump target
> memory to user-space from /proc/vmcore. I think it better to do it if
> possible. It's possible for the read interface to fill the corresponding
> part with 0. But it's impossible for the mmap interface to modify data on
> old memory.
In practice someone lied. You can't have a chunk of memory that is
smaller than page size. So I don't see it doing any harm to export
the memory that is there but some silly system lied to us about.
> Do you agree two vmcores seen from read and mmap interfaces are no
> longer coincide?
That is an interesting point. I don't think there is any point in
having read and mmap disagree, that just seems to lead to complications,
especially since the data we are talking about adding is actually memory
contents.
I do think it makes sense to have logical chunks of the file that are
not covered by PT_LOAD segments. Logical chunks like the leading edge
of a page inside of which a PT_LOAD segment starts, and the trailing
edge of a page in which a PT_LOAD segment ends.
Implementation-wise this would mean extending the struct vmcore entry to
cover the missing bits, by rounding down the start address and rounding up
the end address to the nearest page-size boundary. The generated
PT_LOAD segment would then have its file offset adjusted to skip
the bytes of the page that are there but that we don't care about.
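Very roughly, as an untested sketch (the struct vmcore fields below are assumed from memory rather than checked against the current code):

#include <linux/kernel.h>	/* rounddown(), roundup() */
#include <linux/list.h>
#include <linux/mm.h>		/* PAGE_SIZE */
#include <linux/elf.h>

/* Assumed shape of the existing vmcore_list entry. */
struct vmcore {
	struct list_head list;
	unsigned long long paddr;
	unsigned long long size;
	loff_t offset;
};

/* Extend one entry to page boundaries and point the matching PT_LOAD's
 * p_offset at the right spot inside the padded chunk; p_paddr and p_memsz
 * themselves stay untouched. *pos is the running file offset. */
static void vmcore_pad_entry(struct vmcore *m, Elf64_Phdr *phdr, loff_t *pos)
{
	u64 start = rounddown(phdr->p_paddr, PAGE_SIZE);
	u64 end = roundup(phdr->p_paddr + phdr->p_memsz, PAGE_SIZE);

	m->paddr = start;
	m->size = end - start;
	m->offset = *pos;
	phdr->p_offset = *pos + (phdr->p_paddr - start);
	*pos += m->size;
}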
Eric
HATAYAMA Daisuke <[email protected]> writes:
> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table
> Date: Tue, 19 Mar 2013 14:44:16 -0700
>
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>>> Currently, the code assumes that position of program header table is
>>> next to ELF header. But future change can break the assumption on
>>> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
>>> member explicitly to get position of program header table in
>>> file-offset.
>>
>> In principle this looks good. However when I read this it looks like
>> you are going a little too far.
>>
>> You are changing not only the reading of the supplied headers, but
>> you are changing the generation of the new new headers that describe
>> the data provided by /proc/vmcore.
>>
>> I get lost in following this after you mangle merge_note_headers.
>>
>> In principle removing silly assumptions seems reasonable, but I think
>> it is completely orthogonal to the task of maping vmcore mmapable.
>>
>> I think it is fine to claim that the assumptions made here in vmcore are
>> part of the kexec on panic ABI at this point, which would generally make
>> this change unnecessary.
>
> This was suggested by Vivek. He prefers generic one.
>
> Vivek, do you agree to this? Or is it better to re-post this and other
> clean-up patches as another one separately to this patch set?
My deep problem with this patch was that you were changing how we think
about the generated headers, and that just didn't seem to make any
sense; and certainly it seems to be orthogonal to the patch itself.
For headers that we generate it is perfectly fine to make all kinds of
assumptions.
Eric
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 21:18:37 -0700
> HATAYAMA Daisuke <[email protected]> writes:
>
>> From: "Eric W. Biederman" <[email protected]>
>> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
>> Date: Wed, 20 Mar 2013 13:55:55 -0700
>>
>>> Vivek Goyal <[email protected]> writes:
>>>
>>>> On Tue, Mar 19, 2013 at 03:38:45PM -0700, Eric W. Biederman wrote:
>>>>> HATAYAMA Daisuke <[email protected]> writes:
>>>>>
>>>>> > If there's some vmcore object that doesn't satisfy page-size boundary
>>>>> > requirement, remap_pfn_range() fails to remap it to user-space.
>>>>> >
>>>>> > Objects that posisbly don't satisfy the requirement are ELF note
>>>>> > segments only. The memory chunks corresponding to PT_LOAD entries are
>>>>> > guaranteed to satisfy page-size boundary requirement by the copy from
>>>>> > old memory to buffer in 2nd kernel done in later patch.
>>>>> >
>>>>> > This patch doesn't copy each note segment into the 2nd kernel since
>>>>> > they amount to so large in total if there are multiple CPUs. For
>>>>> > example, current maximum number of CPUs in x86_64 is 5120, where note
>>>>> > segments exceed 1MB with NT_PRSTATUS only.
>>>>>
>>>>> So you require the first kernel to reserve an additional 20MB, instead
>>>>> of just 1.6MB. 336 bytes versus 4096 bytes.
>>>>>
>>>>> That seems like completely the wrong tradeoff in memory consumption,
>>>>> filesize, and backwards compatibility.
>>>>
>>>> Agreed.
>>>>
>>>> So we already copy ELF headers in second kernel's memory. If we start
>>>> copying notes too, then both headers and notes will support mmap().
>>>
>>> The only real is it could be a bit tricky to allocate all of the memory
>>> for the notes section on high cpu count systems in a single allocation.
>>>
>>
>> Do you mean it's getting difficult on many-cpus machine to get free
>> pages consequtive enough to be able to cover all the notes?
>>
>> If so, is it necessary to think about any care to it in the next
>> patch? Or, should it be pending for now?
>
> I meant that in general allocations > PAGE_SIZE get increasingly
> unreliable the larger they are. And on large cpu count machines we are
> having larger allocations. Of course large cpu count machines typically
> have more memory so the odds go up.
>
> Right now MAX_ORDER seems to be set to 11 which is 8MiB, and my x86_64
> machine certainly succeeded in an order 11 allocation during boot so I
> don't expect any real problems with a 2MiB allocation but it is
> something to keep an eye on with kernel memory.
>
OK; rigorously speaking, success or failure of the requested free-page
allocation depends on the actual memory layout at the 2nd kernel boot. To
increase the possibility of allocating the memory, we have no method now
but to reserve more memory for the 2nd kernel.
>>>> For mmap() of memory regions which are not page aligned, we can map
>>>> extra bytes (as you suggested in one of the mails). Given the fact
>>>> that we have one ELF header for every memory range, we can always modify
>>>> the file offset where phdr data is starting to make space for mapping
>>>> of extra bytes.
>>>
>>> Agreed ELF file offset % PAGE_SIZE should == physical address % PAGE_SIZE to
>>> make mmap work.
>>>
>>
>> OK, your conclusion is the 1st version is better than the 2nd.
>>
>> The purpose of this design was not to export anything but dump target
>> memory to user-space from /proc/vmcore. I think it better to do it if
>> possible. It's possible for the read interface to fill the corresponding
>> part with 0. But it's impossible for the mmap interface to modify data on
>> old memory.
>
> In practice someone lied. You can't have a chunk of memory that is
> smaller than page size. So I don't see it doing any harm to export
> the memory that is there but some silly system lied to us about.
>
>> Do you agree two vmcores seen from read and mmap interfaces are no
>> longer coincide?
>
> That is an interesting point. I don't think there is any point in
> having read and mmap disagree, that just seems to lead to complications,
> especially since the data we are talking about adding is actually memory
> contents.
>
> I do think it makes sense to have logical chunks of the file that are
> not covered by PT_LOAD segments. Logical chunks like the leading edge
> of a page inside of which a PT_LOAD segment starts, and the trailing
> edge of a page in which a PT_LOAD segment ends.
>
> Implementation-wise this would mean extending the struct vmcore entry to
> cover the missing bits, by rounding down the start address and rounding up
> the end address to the nearest page-size boundary. The generated
> PT_LOAD segment would then have its file offset adjusted to skip
> the bytes of the page that are there but that we don't care about.
Do you mean that for each range represented by each PT_LOAD entry, say:
[p_paddr, p_paddr + p_memsz]
we extend it as:
[rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)],
extending not only the objects in vmcore_list but also updating the
p_paddr and p_memsz members of each PT_LOAD entry themselves? In other
words, there are no new holes unreferenced by any PT_LOAD entry, since
the regions referenced by the PT_LOAD entries are themselves extended.
Then the vmcores seen from the read and mmap methods coincide, in the
sense that both ranges
[rounddown(p_paddr, PAGE_SIZE), p_paddr]
and
[p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
are included in both views, although they are originally not dump
target memory, which you do not consider problematic, for ease of
implementation.
Is there any difference here from your understanding?
Thanks.
HATAYAMA, Daisuke
HATAYAMA Daisuke <[email protected]> writes:
> From: Andrew Morton <[email protected]>
> Subject: Re: [PATCH v3 00/21] kdump, vmcore: support mmap() on /proc/vmcore
> Date: Tue, 19 Mar 2013 12:30:05 -0700
>
>> On Sat, 16 Mar 2013 13:00:47 +0900 HATAYAMA Daisuke <[email protected]> wrote:
>>
>>> Currently, read to /proc/vmcore is done by read_oldmem() that uses
>>> ioremap/iounmap per a single page. For example, if memory is 1GB,
>>> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
>>> times. This causes big performance degradation.
>>>
>>> In particular, the current main user of this mmap() is makedumpfile,
>>> which not only reads memory from /proc/vmcore but also does other
>>> processing like filtering, compression and IO work. Update of page
>>> table and the following TLB flush makes such processing much slow;
>>> though I have yet to make patch for makedumpfile and yet to confirm
>>> how it's improved.
>>>
>>> To address the issue, this patch implements mmap() on /proc/vmcore to
>>> improve read performance. My simple benchmark shows the improvement
>>> from 200 [MiB/sec] to over 50.0 [GiB/sec].
>>
>> There are quite a lot of userspace-visible vmcore changes here. Is it
>> all fully back-compatible? Will all known userspace continue to work
>> OK on newer kernels?
>>
>
> I designed it to keep backward-compatibility at least for gdb and
> binutils but not less for makedumpfile since it should follow kernel
> changes; old makedumpfile cannot use newer kernels, and this is within
> the range of this review.
To the extent possible we should allow different versions of tools to be
interchanged. That helps with bug hunting and helps people on
resource-constrained systems who build old versions of the tools that
are tiny and fit.
Given that rounding the per-cpu NOTES turns out to be a waste of memory,
I didn't see anything in the patchset that justifies any breakage.
Eric
HATAYAMA Daisuke <[email protected]> writes:
>
> Do you mean for each range represented by each PT_LOAD entry, say:
>
> [p_paddr, p_paddr + p_memsz]
>
> extend it as:
>
> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>
> not only objects in vmcore_list, but also updating p_paddr and p_memsz
> members themselves of each PT_LOAD entry? In other words, there's no
> new holes not referenced by any PT_LOAD entry since the regions
> referenced by some PT_LOAD entry, themselves are extended.
No. p_paddr and p_memsz as exported should remain the same.
I am suggesting that we change p_offset.
I am suggesting to include the data in the file as if we had changed
p_paddr and p_memsz.
> Then, the vmcores seen from read and mmap methods are coincide in the
> direction of including both ranges
>
> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>
> and
>
> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>
> are included in both vmcores seen from read and mmap methods, although
> they are originally not dump target memory, which you are not
> problematic for ease of implementation.
>
> Is there difference here from you understanding?
Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
important. p_offset we can change as much as we want. Which means there
can be logical holes in the file between PT_LOAD segments, where we put
the extra data needed to keep everything page aligned.
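Put differently, the invariant the adjusted headers have to satisfy is just this (a sketch of the check, not code from the series):

#include <linux/mm.h>	/* PAGE_MASK */
#include <linux/elf.h>
#include <linux/types.h>

/* remap_pfn_range() needs file offset and physical address to be
 * congruent modulo the page size for every PT_LOAD entry. */
static bool phdr_is_mmap_friendly(const Elf64_Phdr *p)
{
	return (p->p_offset & ~PAGE_MASK) == (p->p_paddr & ~PAGE_MASK);
}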
Eric
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 00/21] kdump, vmcore: support mmap() on /proc/vmcore
Date: Wed, 20 Mar 2013 23:16:20 -0700
> HATAYAMA Daisuke <[email protected]> writes:
>
>> From: Andrew Morton <[email protected]>
>> Subject: Re: [PATCH v3 00/21] kdump, vmcore: support mmap() on /proc/vmcore
>> Date: Tue, 19 Mar 2013 12:30:05 -0700
>>
>>> On Sat, 16 Mar 2013 13:00:47 +0900 HATAYAMA Daisuke <[email protected]> wrote:
>>>
>>>> Currently, read to /proc/vmcore is done by read_oldmem() that uses
>>>> ioremap/iounmap per a single page. For example, if memory is 1GB,
>>>> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
>>>> times. This causes big performance degradation.
>>>>
>>>> In particular, the current main user of this mmap() is makedumpfile,
>>>> which not only reads memory from /proc/vmcore but also does other
>>>> processing like filtering, compression and IO work. Update of page
>>>> table and the following TLB flush makes such processing much slow;
>>>> though I have yet to make patch for makedumpfile and yet to confirm
>>>> how it's improved.
>>>>
>>>> To address the issue, this patch implements mmap() on /proc/vmcore to
>>>> improve read performance. My simple benchmark shows the improvement
>>>> from 200 [MiB/sec] to over 50.0 [GiB/sec].
>>>
>>> There are quite a lot of userspace-visible vmcore changes here. Is it
>>> all fully back-compatible? Will all known userspace continue to work
>>> OK on newer kernels?
>>>
>>
>> I designed it to keep backward-compatibility at least for gdb and
>> binutils but not less for makedumpfile since it should follow kernel
>> changes; old makedumpfile cannot use newer kernels, and this is within
>> the range of this review.
>
> To the extent possible we should allow different versions of tools to be
> interchanged. That helps with bug hunting and helps people on
> resource-constrained systems who build old versions of the tools that
> are tiny and fit.
>
> Given that rounding the per-cpu NOTES turns out to be a waste of memory,
> I didn't see anything in the patchset that justifies any breakage.
The breakage was caused by introducing the new NT_VMCORE_PAD type under
the "VMCOREINFO" note name; apart from that it worked fine. That note
will be dropped in the next version, so it will not be a problem for now.
The breakage happened in makedumpfile itself, due to the bug that it had
so far looked only at the note type, not the note name. It would have
been possible to avoid the breakage by choosing another note name, but I
didn't do so. This topic will probably come up again when some kind of
new note types are needed.
Thanks.
HATAYAMA, Daisuke
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Wed, 20 Mar 2013 23:29:05 -0700
> HATAYAMA Daisuke <[email protected]> writes:
>>
>> Do you mean for each range represented by each PT_LOAD entry, say:
>>
>> [p_paddr, p_paddr + p_memsz]
>>
>> extend it as:
>>
>> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>>
>> not only objects in vmcore_list, but also updating p_paddr and p_memsz
>> members themselves of each PT_LOAD entry? In other words, there's no
>> new holes not referenced by any PT_LOAD entry since the regions
>> referenced by some PT_LOAD entry, themselves are extended.
>
> No. p_paddr and p_memsz as exported should remain the same.
> I am suggesting that we change p_offset.
>
> I am suggesting to include the data in the file as if we had changed
> p_paddr and p_memsz.
>
>> Then, the vmcores seen from read and mmap methods are coincide in the
>> direction of including both ranges
>>
>> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>
>> and
>>
>> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>
>> are included in both vmcores seen from read and mmap methods, although
>> they are originally not dump target memory, which you are not
>> problematic for ease of implementation.
>>
>> Is there difference here from you understanding?
>
> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
> important. p_offset we can change as much as we want. Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.
>
So, I have to ask the same question again. Is it OK if the two vmcores
are different? How do you intend the ``extra data'' to be dealt with? I
mean that mmap() has to export part of old memory as the ``extra data''.
If you think that is OK, I'll fill the ``extra data'' with 0 in the case
of the read method. If not, I'll fill it with the corresponding part of
old memory.
Thanks.
HATAYAMA, Daisuke
HATAYAMA Daisuke <[email protected]> writes:
> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
> Date: Wed, 20 Mar 2013 23:29:05 -0700
>
>> HATAYAMA Daisuke <[email protected]> writes:
>>>
>>> Do you mean for each range represented by each PT_LOAD entry, say:
>>>
>>> [p_paddr, p_paddr + p_memsz]
>>>
>>> extend it as:
>>>
>>> [rounddown(p_paddr, PAGE_SIZE), roundup(p_paddr + p_memsz, PAGE_SIZE)].
>>>
>>> not only objects in vmcore_list, but also updating p_paddr and p_memsz
>>> members themselves of each PT_LOAD entry? In other words, there's no
>>> new holes not referenced by any PT_LOAD entry since the regions
>>> referenced by some PT_LOAD entry, themselves are extended.
>>
>> No. p_paddr and p_memsz as exported should remain the same.
>> I am suggesting that we change p_offset.
>>
>> I am suggesting to include the data in the file as if we had changed
>> p_paddr and p_memsz.
>>
>>> Then, the vmcores seen from read and mmap methods are coincide in the
>>> direction of including both ranges
>>>
>>> [rounddown(p_paddr, PAGE_SIZE), p_paddr]
>>>
>>> and
>>>
>>> [p_paddr + p_memsz, roundup(p_paddr + p_memsz, PAGE_SIZE)]
>>>
>>> are included in both vmcores seen from read and mmap methods, although
>>> they are originally not dump target memory, which you are not
>>> problematic for ease of implementation.
>>>
>>> Is there difference here from you understanding?
>>
>> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
>> important. p_offset we can change as much as we want. Which means there
>> can be logical holes in the file between PT_LOAD segments, where we put
>> the extra data needed to keep everything page aligned.
>>
>
> So, I have to ask the same question again. Is it OK if the two vmcores
> are different? How do you intend the ``extra data'' to be dealt with? I
> mean that mmap() has to export part of old memory as the ``extra data''.
>
> If you think that is OK, I'll fill the ``extra data'' with 0 in the case
> of the read method. If not, I'll fill it with the corresponding part of
> old memory.
I think the two having different contents violates the principle of
least surprise.
I think exporting the old memory as the ``extra data'' is the least
surprising and the easiest way to go.
I don't mind filling the extra data with zeros, but I don't see the
point.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> The breakage was caused by introducing the new NT_VMCORE_PAD type under
> the "VMCOREINFO" note name; apart from that it worked fine. That note
> will be dropped in the next version, so it will not be a problem for now.
>
> The breakage happened in makedumpfile itself, due to the bug that it had
> so far looked only at the note type, not the note name. It would have
> been possible to avoid the breakage by choosing another note name, but I
> didn't do so. This topic will probably come up again when some kind of
> new note types are needed.
Yes. Not ignoring unknown note types is a deficiency in makedumpfile.
And definitely not something to keep us from introducing new note types.
It should be noted that a common use of /proc/vmcore is to do:
cp /proc/vmcore /somepath/core
gdb /somepath/core
makedumpfile is just an optimization on that for people who want to
write a smaller file.
Eric
HATAYAMA Daisuke <[email protected]> writes:
> OK; rigorously speaking, success or failure of the requested free-page
> allocation depends on the actual memory layout at the 2nd kernel boot. To
> increase the possibility of allocating the memory, we have no method now
> but to reserve more memory for the 2nd kernel.
Good enough. If there are fragmentation issues that cause allocation
problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
we certainly don't need to start there.
Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
an 8KiB allocation. Aka order 0 or order 1.
Adding more memory is also useful. It is important in general to keep
the amount of memory needed for the kdump kernel low.
Eric
On Wed, Mar 20, 2013 at 01:55:55PM -0700, Eric W. Biederman wrote:
[..]
> If core counts on the high end do more than double every 2 years we
> might have a problem. Otherwise making everything mmapable seems easy
> and sound.
We already have a mechanism to translate a file offset into the actual
physical address where the data is. So if we can't allocate one contiguous
chunk of memory for the notes, we should be able to break it down into
multiple page-aligned areas and map offsets into the respective
discontiguous areas using vmcore_list.
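Roughly like the following untested sketch; it assumes vmcore_list is sorted by file offset, that every chunk is page-aligned, that the requested range does not start in a hole, and it ignores the ELF header blob at the front of the file:

#include <linux/fs.h>
#include <linux/list.h>
#include <linux/mm.h>

/* struct vmcore and vmcore_list are the existing objects in fs/proc/vmcore.c. */
static int mmap_vmcore_sketch(struct file *file, struct vm_area_struct *vma)
{
	u64 start = (u64)vma->vm_pgoff << PAGE_SHIFT;	/* requested file offset */
	size_t size = vma->vm_end - vma->vm_start;
	size_t done = 0;
	struct vmcore *m;

	list_for_each_entry(m, &vmcore_list, list) {
		u64 pos = start + done;
		u64 off, len;

		if (pos >= m->offset + m->size)
			continue;		/* chunk lies before the request */

		off = pos - m->offset;
		len = min_t(u64, m->size - off, size - done);
		if (remap_pfn_range(vma, vma->vm_start + done,
				    (m->paddr + off) >> PAGE_SHIFT,
				    len, vma->vm_page_prot))
			return -EAGAIN;

		done += len;
		if (done == size)
			break;
	}
	return done == size ? 0 : -EINVAL;
}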
Thanks
Vivek
On Thu, Mar 21, 2013 at 11:50:41AM +0900, HATAYAMA Daisuke wrote:
> From: "Eric W. Biederman" <[email protected]>
> Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table
> Date: Tue, 19 Mar 2013 14:44:16 -0700
>
> > HATAYAMA Daisuke <[email protected]> writes:
> >
> >> Currently, the code assumes that position of program header table is
> >> next to ELF header. But future change can break the assumption on
> >> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
> >> member explicitly to get position of program header table in
> >> file-offset.
> >
> > In principle this looks good. However when I read this it looks like
> > you are going a little too far.
> >
> > You are changing not only the reading of the supplied headers, but
> > you are changing the generation of the new new headers that describe
> > the data provided by /proc/vmcore.
> >
> > I get lost in following this after you mangle merge_note_headers.
> >
> > In principle removing silly assumptions seems reasonable, but I think
> > it is completely orthogonal to the task of maping vmcore mmapable.
> >
> > I think it is fine to claim that the assumptions made here in vmcore are
> > part of the kexec on panic ABI at this point, which would generally make
> > this change unnecessary.
>
> This was suggested by Vivek. He prefers generic one.
>
> Vivek, do you agree to this? Or is it better to re-post this and other
> clean-up patches as another one separately to this patch set?
Given the fact that current code has been working, I am fine to just
re-post and take care of mmap() related issues. And we can take care
of cleaning up of some assumptions about PT_NOTE headers later. Trying
to club large cleanup with mmap() patches is making it hard to review.
Thanks
Vivek
On Wed, Mar 20, 2013 at 08:54:25PM -0700, Eric W. Biederman wrote:
[..]
> > Also, it's possible to get size of a whole part of ELF note segments
> > from p_memsz or p_filesz, and gdb and binutils are reading the note
> > segments until reaching the size.
>
> Agreed. Except in our weird case where we generate the notes on the
> fly, and generate the NOTE segment header much earlier.
And in our case we don't know the size of ELF note. Kernel is not
exporting the size. So kexec-tools is putting an upper limit of 1024
and putting that value in p_memsz and p_filesz fields.
Given the fact that we are reserving elf notes at boot. That means
we know the size of ELF notes. It should make sense to export it
to user space and let kexec-tools put right values.
In fact it looks like /sys/kernel/vmcoreinfo is exporting two values, address
and size. (This is kind of a violation of the sysfs policy of one value per
file.) But for per-cpu notes, we are exporting only the address and not the
size.
/sys/devices/system/cpu/cpu<n>/crash_notes
May be we should export another file
/sys/devices/system/cpu/cpu<n>/crash_notes_size
and let kexec-tools parse it.
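From the kexec-tools side that could look something like the sketch below; the crash_notes_size file and its output format are hypothetical here, they do not exist yet:

#include <stdio.h>

static unsigned long crash_notes_size(int cpu)
{
	char path[128];
	unsigned long sz = 1024;	/* fall back to today's hard-coded limit */
	FILE *f;

	snprintf(path, sizeof(path),
		 "/sys/devices/system/cpu/cpu%d/crash_notes_size", cpu);
	f = fopen(path, "r");
	if (f) {
		if (fscanf(f, "%lx", &sz) != 1)	/* assuming hex, like crash_notes */
			sz = 1024;
		fclose(f);
	}
	return sz;
}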
Thanks
Vivek
On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
> HATAYAMA Daisuke <[email protected]> writes:
>
> > OK; rigorously speaking, success or failure of the requested free-page
> > allocation depends on the actual memory layout at the 2nd kernel boot. To
> > increase the possibility of allocating the memory, we have no method now
> > but to reserve more memory for the 2nd kernel.
>
> Good enough. If there are fragmentation issues that cause allocation
> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
> we certainly don't need to start there.
>
> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
> an 8KiB allocation. Aka order 0 or order 1.
>
Actually we are already handling the large SGI machines, so we need
to plan for 4096 cpus now while we write these patches.
vmalloc() and remap_vmalloc_range() sound reasonable. So that's what
we should probably use.
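The mapping side of the vmalloc() route is short; a sketch, assuming the caller mmap()s just the note region so one vmalloc buffer can back the whole VMA:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static char *notes_buf;	/* would hold the vmalloc()'ed copy of all per-cpu notes */

static int mmap_notes_sketch(struct file *file, struct vm_area_struct *vma)
{
	/* remap_vmalloc_range() wires the individual vmalloc pages into the
	 * user VMA, so no physically contiguous allocation is needed. */
	return remap_vmalloc_range(vma, notes_buf, vma->vm_pgoff);
}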
Alternatively, why not allocate everything in 4K pages and use vmcore_list
to map offsets to the right addresses and call remap_pfn_range() on these
addresses.
Thanks
Vivek
On Wed, Mar 20, 2013 at 11:29:05PM -0700, Eric W. Biederman wrote:
[..]
> Preserving the actual PT_LOAD segments p_paddr and p_memsz values is
> important. p_offset we can change as much as we want. Which means there
> can be logical holes in the file between PT_LOAD segments, where we put
> the extra data needed to keep everything page aligned.
Agreed. If one modifies p_paddr then one will have to modify p_vaddr
too. And user-space tools look at p_vaddr to find the corresponding
physical address. Keeping p_vaddr and p_paddr intact makes sense.
Thanks
Vivek
On Thu, Mar 21, 2013 at 12:07:12AM -0700, Eric W. Biederman wrote:
[..]
> I think the two having different contents violates the principle of
> least surprise.
>
> I think exporting the old memory as the ``extra data'' is the least
> surprising and the easiest way to go.
>
> I don't mind filling the extra data with zeros, but I don't see the
> point.
I think the only question would be whether there is a problem in reading
memory areas which the BIOS has kept reserved or possibly not exported. Are
there any surprises to be expected (the machine reboots while trying to read
a particular memory location, etc.)?
So zeroing the extra data can theoretically make it somewhat safer.
So if starting or end address of PT_LOAD header is not aligned, why
not we simply allocate a page. Copy the relevant data from old memory,
fill rest with zero. That way mmap and read view will be same. There
will be no surprises w.r.t reading old kernel memory beyond what's
specified by the headers.
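As a sketch (error handling omitted; read_from_oldmem()'s signature is assumed from the existing vmcore code):

#include <linux/gfp.h>
#include <linux/mm.h>

/* Allocate a zeroed page and copy in only the bytes that really belong to
 * the unaligned head of a PT_LOAD range; the rest of the page stays zero,
 * so the read and mmap views agree. */
static void *copy_unaligned_head(u64 paddr, u64 memsz)
{
	size_t lead = paddr & ~PAGE_MASK;	/* unused bytes before the data */
	size_t want = min_t(size_t, PAGE_SIZE - lead, memsz);
	char *buf = (char *)get_zeroed_page(GFP_KERNEL);

	if (buf)
		read_from_oldmem(buf + lead, want, &paddr, 0);
	return buf;
}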
And in practice I am not expecting many unaligned PT_LOAD ranges, just a
few. And allocating a few 4K pages should not be a big deal.
And vmcore_list will again help us determine whether a pfn lies in old
memory or new memory.
Thanks
Vivek
On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
[..]
> So if starting or end address of PT_LOAD header is not aligned, why
> not we simply allocate a page. Copy the relevant data from old memory,
> fill rest with zero. That way mmap and read view will be same. There
> will be no surprises w.r.t reading old kernel memory beyond what's
> specified by the headers.
Copying from old memory might spring surprises w.r.t hw poisoned
pages. I guess we will have to disable MCE, read page, enable it
back or something like that to take care of these issues.
In the past we have recommended makedumpfile to be careful, look
at struct pages and make sure we are not reading poisoned pages.
But vmcore itself is reading old memory and can run into this
issue too.
Thanks
Vivek
From: Vivek Goyal <[email protected]>
Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table
Date: Thu, 21 Mar 2013 10:12:02 -0400
> On Thu, Mar 21, 2013 at 11:50:41AM +0900, HATAYAMA Daisuke wrote:
>> From: "Eric W. Biederman" <[email protected]>
>> Subject: Re: [PATCH v3 01/21] vmcore: reference e_phoff member explicitly to get position of program header table
>> Date: Tue, 19 Mar 2013 14:44:16 -0700
>>
>> > HATAYAMA Daisuke <[email protected]> writes:
>> >
>> >> Currently, the code assumes that position of program header table is
>> >> next to ELF header. But future change can break the assumption on
>> >> kexec-tools and the 1st kernel. To avoid worst case, reference e_phoff
>> >> member explicitly to get position of program header table in
>> >> file-offset.
>> >
>> > In principle this looks good. However when I read this it looks like
>> > you are going a little too far.
>> >
>> > You are changing not only the reading of the supplied headers, but
>> > you are changing the generation of the new new headers that describe
>> > the data provided by /proc/vmcore.
>> >
>> > I get lost in following this after you mangle merge_note_headers.
>> >
>> > In principle removing silly assumptions seems reasonable, but I think
>> > it is completely orthogonal to the task of maping vmcore mmapable.
>> >
>> > I think it is fine to claim that the assumptions made here in vmcore are
>> > part of the kexec on panic ABI at this point, which would generally make
>> > this change unnecessary.
>>
>> This was suggested by Vivek. He prefers generic one.
>>
>> Vivek, do you agree to this? Or is it better to re-post this and other
>> clean-up patches as another one separately to this patch set?
>
> Given the fact that current code has been working, I am fine to just
> re-post and take care of mmap() related issues. And we can take care
> of cleaning up of some assumptions about PT_NOTE headers later. Trying
> to club large cleanup with mmap() patches is making it hard to review.
>
I see. I'll post the clean-up series separately.
Thanks.
HATAYAMA, Daisuke
From: Vivek Goyal <[email protected]>
Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer
Date: Thu, 21 Mar 2013 10:36:56 -0400
> On Wed, Mar 20, 2013 at 08:54:25PM -0700, Eric W. Biederman wrote:
>
> [..]
>> > Also, it's possible to get size of a whole part of ELF note segments
>> > from p_memsz or p_filesz, and gdb and binutils are reading the note
>> > segments until reaching the size.
>>
>> Agreed. Except in our weird case where we generate the notes on the
>> fly, and generate the NOTE segment header much earlier.
>
> And in our case we don't know the size of ELF note. Kernel is not
> exporting the size. So kexec-tools is putting an upper limit of 1024
> and putting that value in p_memsz and p_filesz fields.
>
> Given the fact that we are reserving elf notes at boot. That means
> we know the size of ELF notes. It should make sense to export it
> to user space and let kexec-tools put right values.
>
> In fact it looks like /sys/kernel/vmcoreinfo is exporting two values, address
> and size. (This is kind of a violation of the sysfs policy of one value per
> file.) But for per-cpu notes, we are exporting only the address and not the
> size.
IIRC, Greg Norman pointed out this violation in the vmcoreinfo file when
he found it some months ago.
>
> /sys/devices/system/cpu/cpu<n>/crash_notes
>
> May be we should export another file
>
> /sys/devices/system/cpu/cpu<n>/crash_notes_size
>
> and let kexec-tools parse it.
Anyway, I think of this issue as beyond the scope of what I'm working
on here...
Thanks.
HATAYAMA, Daisuke
HATAYAMA Daisuke <[email protected]> writes:
> From: Vivek Goyal <[email protected]>
> Subject: Re: [PATCH v3 17/21] vmcore: check NT_VMCORE_PAD as a mark indicating the end of ELF note buffer
> Date: Thu, 21 Mar 2013 10:36:56 -0400
>
>> And in our case we don't know the size of ELF note. Kernel is not
>> exporting the size. So kexec-tools is putting an upper limit of 1024
>> and putting that value in p_memsz and p_filesz fields.
>>
>> Given the fact that we are reserving elf notes at boot. That means
>> we know the size of ELF notes. It should make sense to export it
>> to user space and let kexec-tools put right values.
>>
>
> Anyway, I think of this issue as beyond the scope of what I'm working
> on here...
Agreed. It is independent and can be fixed independently.
Eric
From: Vivek Goyal <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 11:27:51 -0400
> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
>
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.
Yes, that has already been implemented in makedumpfile.
Not only copying but also mmapping poisoned pages might be problematic,
due to hardware cache prefetch triggered by creating page table entries
for the poisoned pages. Or does MCE disable the prefetch? I'm not sure,
but I'll investigate this. makedumpfile might also need to take care
when calling mmap.
Thanks.
HATAYAMA, Daisuke
Vivek Goyal <[email protected]> writes:
> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>
> [..]
>> So if starting or end address of PT_LOAD header is not aligned, why
>> not we simply allocate a page. Copy the relevant data from old memory,
>> fill rest with zero. That way mmap and read view will be same. There
>> will be no surprises w.r.t reading old kernel memory beyond what's
>> specified by the headers.
>
> Copying from old memory might spring surprises w.r.t hw poisoned
> pages. I guess we will have to disable MCE, read page, enable it
> back or something like that to take care of these issues.
>
> In the past we have recommended makedumpfile to be careful, look
> at struct pages and make sure we are not reading poisoned pages.
> But vmcore itself is reading old memory and can run into this
> issue too.
Vivek you are overthinking this.
If there are issues with reading partially exported pages we should
fix them in kexec-tools or in the kernel where the data is exported.
In the examples given in the patch what we were looking at were cases
where the BIOS rightly or wrongly was saying kernel this is my memory
stay off. But it was all perfectly healthy memory.
/proc/vmcore is a simple data dumper and prettifier. Let's keep it that
way so that we can predict how it will act when we feed it information.
/proc/vmcore should not be worrying about or covering up sins elsewhere
in the system.
At the level of /proc/vmcore we may want to do something about ensuring
MCE's don't kill us. But that is an orthogonal problem.
Eric
From: "Eric W. Biederman" <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 17:54:22 -0700
> Vivek Goyal <[email protected]> writes:
>
>> On Thu, Mar 21, 2013 at 11:21:24AM -0400, Vivek Goyal wrote:
>>
>> [..]
>>> So if starting or end address of PT_LOAD header is not aligned, why
>>> not we simply allocate a page. Copy the relevant data from old memory,
>>> fill rest with zero. That way mmap and read view will be same. There
>>> will be no surprises w.r.t reading old kernel memory beyond what's
>>> specified by the headers.
>>
>> Copying from old memory might spring surprises w.r.t hw poisoned
>> pages. I guess we will have to disable MCE, read page, enable it
>> back or something like that to take care of these issues.
>>
>> In the past we have recommended makedumpfile to be careful, look
>> at struct pages and make sure we are not reading poisoned pages.
>> But vmcore itself is reading old memory and can run into this
>> issue too.
>
> Vivek you are overthinking this.
>
> If there are issues with reading partially exported pages we should
> fix them in kexec-tools or in the kernel where the data is exported.
>
> In the examples given in the patch what we were looking at were cases
> where the BIOS rightly or wrongly was saying kernel this is my memory
> stay off. But it was all perfectly healthy memory.
>
> /proc/vmcore is a simple data dumper and prettifier. Let's keep it that
> way so that we can predict how it will act when we feed it information.
> /proc/vmcore should not be worrying about or covering up sins elsewhere
> in the system.
>
> At the level of /proc/vmcore we may want to do something about ensuring
> MCE's don't kill us. But that is an orthogonal problem.
This is the part of old memory that /proc/vmcore must read at its
initialization to generate its metadata, i.e. the ELF header, the program
header table and the ELF note segments. The other memory chunks are the
part that makedumpfile should decide whether to read or to avoid.
Thanks.
HATAYAMA, Daisuke
From: Vivek Goyal <[email protected]>
Subject: Re: [PATCH v3 18/21] vmcore: check if vmcore objects satify mmap()'s page-size boundary requirement
Date: Thu, 21 Mar 2013 10:49:29 -0400
> On Thu, Mar 21, 2013 at 12:22:59AM -0700, Eric W. Biederman wrote:
>> HATAYAMA Daisuke <[email protected]> writes:
>>
>> > OK; rigorously speaking, success or failure of the requested free-page
>> > allocation depends on the actual memory layout at the 2nd kernel boot. To
>> > increase the possibility of allocating the memory, we have no method now
>> > but to reserve more memory for the 2nd kernel.
>>
>> Good enough. If there are fragmentation issues that cause allocation
>> problems on larger boxes we can use vmalloc and remap_vmalloc_range, but
>> we certainly don't need to start there.
>>
>> Especially as for most 8 or 16 core boxes we are talking about a 4KiB or
>> an 8KiB allocation. Aka order 0 or order 1.
>>
>
> Actually we are already handling the large SGI machines so we need
> to plan for 4096 cpus now while we write these patches.
>
> vmalloc() and remap_vmalloc_range() sounds reasonable. So that's what
> we should probaly use.
>
> Alternatively why not allocate everything in 4K pages and use vmcore_list
> to map offset into right addresses and call remap_pfn_range() on these
> addresses.
I have an introductory question about the design of vmalloc. My
understanding is that vmalloc allocates enough *pages* to cover a
requested size and returns the first corresponding virtual address.
So the address returned is inherently always page-size aligned.
It looks like vmalloc does so in the current implementation, but I
don't know about older implementations and I cannot be sure this is
guaranteed by vmalloc's interface. There is the comment explaining the
interface of vmalloc, quoted below, but it seems to me a little vague in
that it doesn't say clearly what is returned as an address.
/**
* vmalloc - allocate virtually contiguous memory
* @size: allocation size
* Allocate enough pages to cover @size from the page level
* allocator and map them into contiguous kernel virtual space.
*
* For tight control over page level allocator and protection flags
* use __vmalloc() instead.
*/
void *vmalloc(unsigned long size)
{
	return __vmalloc_node_flags(size, NUMA_NO_NODE,
				    GFP_KERNEL | __GFP_HIGHMEM);
}
EXPORT_SYMBOL(vmalloc);
BTW, a simple test module also shows that it returns page-size aligned
objects; here, 1-byte objects are allocated 12 times.
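The module was nothing fancier than the following kind of thing (reconstructed as a sketch, not the exact code that was run):

#include <linux/module.h>
#include <linux/vmalloc.h>

static void *objects[12];

static int __init test_init(void)
{
	int i;

	for (i = 0; i < 12; i++) {
		objects[i] = vmalloc(1);
		pr_info("test: objects[%d] = %p\n", i, objects[i]);
	}
	return 0;
}

static void __exit test_exit(void)
{
	int i;

	for (i = 0; i < 12; i++)
		vfree(objects[i]);
}

module_init(test_init);
module_exit(test_exit);
MODULE_LICENSE("GPL");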
$ dmesg | tail -n 12
[3552817.290982] test: objects[0] = ffffc9000060c000
[3552817.291197] test: objects[1] = ffffc9000060e000
[3552817.291379] test: objects[2] = ffffc9000067d000
[3552817.291566] test: objects[3] = ffffc90010f99000
[3552817.291833] test: objects[4] = ffffc90010f9b000
[3552817.292015] test: objects[5] = ffffc90010f9d000
[3552817.292207] test: objects[6] = ffffc90010f9f000
[3552817.292386] test: objects[7] = ffffc90010fa1000
[3552817.292574] test: objects[8] = ffffc90010fa3000
[3552817.292785] test: objects[9] = ffffc90010fa5000
[3552817.292964] test: objects[10] = ffffc90010fa7000
[3552817.293143] test: objects[11] = ffffc90010fa9000
Thanks.
HATAYAMA, Daisuke