2013-04-18 09:34:42

by Hatayama, Daisuke

Subject: [PATCH v4 0/8] kdump, vmcore: support mmap() on /proc/vmcore

Currently, reading /proc/vmcore is done by read_oldem(), which calls
ioremap/iounmap for each single page. For example, if memory is 1GB,
ioremap/iounmap is called (1GB / 4KB) times, that is, 262144
times. This causes a big performance degradation.

In particular, the current main user of this mmap() is makedumpfile,
which not only reads memory from /proc/vmcore but also does other
processing like filtering, compression and IO work.

To address the issue, this patch implements mmap() on /proc/vmcore to
improve read performance.
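
For illustration, a user-space reader could walk the dump through mmap()
roughly as in the minimal sketch below (not taken from makedumpfile; the
64MB chunk size is arbitrary, and a real reader would clamp the last
mapping to the vmcore size obtained from the ELF headers):

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define CHUNK (64UL << 20)	/* 64MB per mapping; page aligned, arbitrary */

int main(void)
{
	int fd = open("/proc/vmcore", O_RDONLY);
	char *buf = malloc(CHUNK);
	off_t off = 0;

	if (fd < 0 || !buf) {
		perror("setup");
		return 1;
	}

	for (;;) {
		/* the offset passed to mmap() must be page aligned */
		void *p = mmap(NULL, CHUNK, PROT_READ, MAP_PRIVATE, fd, off);
		if (p == MAP_FAILED)
			break;	/* e.g. the request runs past the end of the dump */

		/* filtering/compression/IO would happen here instead */
		memcpy(buf, p, CHUNK);

		munmap(p, CHUNK);
		off += CHUNK;
	}

	free(buf);
	close(fd);
	return 0;
}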

Benchmark
=========

You can see two benchmarks on terabyte-class memory systems. Both show about
40 seconds on a 2TB system. This is almost equal to the performance of the
experimental kernel-side memory filtering.

- makedumpfile mmap() benchmark, by Jingbai Ma
https://lkml.org/lkml/2013/3/27/19

- makedumpfile: benchmark on mmap() with /proc/vmcore on 2TB memory system
https://lkml.org/lkml/2013/3/26/914

ChangeLog
=========

v3 => v4)

- Rebase 3.9-rc7.
- Drop clean-up patches orthogonal to the main topic of this patch set.
- Copy ELF note segments in the 1st kernel just as in v1. Allocate
vmcore objects per page. => See [PATCH 5/8]
- Map memory referenced by a PT_LOAD entry directly even if the start or
end of the region doesn't fall on a page boundary, instead of copying
it as in v3. As a result, holes outside OS memory become visible
from /proc/vmcore. => See [PATCH 7/8]

v2 => v3)

- Rebase 3.9-rc3.
- Copy program headers separately from e_phoff in the ELF note segment
buffer. Now there's no risk of allocating huge memory if the program
header table is positioned after the memory segments.
- Add a cleanup patch that removes an unnecessary variable.
- Fix wrong use of the variable holding the buffer size configurable at
runtime. Instead, use the variable that holds the original buffer size.

v1 => v2)

- Clean up the existing code: use e_phoff, and remove the assumption
on PT_NOTE entries.
- Fix a potential bug where the ELF header size was not included in the
exported vmcoreinfo size.
- Divide the patch modifying read_vmcore() into two: clean-up and primary
code change.
- Put ELF note segments on page-size boundaries in the 1st kernel
instead of copying them into a buffer on the 2nd kernel.

Test
====

This patch set is based on v3.9-rc7.

Tested on x86-64 and x86-32, both in 1GB and over-4GB memory environments.

---

HATAYAMA Daisuke (8):
vmcore: support mmap() on /proc/vmcore
vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
vmcore: count holes generated by round-up operation for page boundary for size of /proc/vmcore
vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects
vmcore: Add helper function vmcore_add()
vmcore, procfs: introduce MEM_TYPE_CURRENT_KERNEL flag to distinguish objects copied in 2nd kernel
vmcore: clean up read_vmcore()
vmcore: allocate buffer for ELF headers on page-size alignment


fs/proc/vmcore.c | 349 ++++++++++++++++++++++++++++++++---------------
include/linux/proc_fs.h | 8 +
2 files changed, 245 insertions(+), 112 deletions(-)

--

Thanks.
HATAYAMA, Daisuke


2013-04-18 09:34:52

by Hatayama, Daisuke

Subject: [PATCH v4 1/8] vmcore: allocate buffer for ELF headers on page-size alignment

Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().

A later patch will merge the PT_NOTE entries into a single unique one and
decrease the buffer size actually used. Keep the original buffer size in
the variable elfcorebuf_sz_orig in order to free the buffer later, and keep
the actually used buffer size, rounded up to a page-size boundary, in the
variable elfcorebuf_sz separately.

The size of the part of the ELF buffer exported through /proc/vmcore is
elfcorebuf_sz.

The range previously occupied by the merged and removed PT_NOTE entries,
i.e. [elfcorebuf_sz, elfcorebuf_sz_orig], is filled with 0.

Use the size of the ELF headers as the initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} in order to indicate that
the offset includes the holes towards the page boundary.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 80 ++++++++++++++++++++++++++++++------------------------
1 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index b870f74..cfbd1f7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -31,6 +31,7 @@ static LIST_HEAD(vmcore_list);
/* Stores the pointer to the buffer containing kernel elf core headers. */
static char *elfcorebuf;
static size_t elfcorebuf_sz;
+static size_t elfcorebuf_sz_orig;

/* Total size of vmcore file. */
static u64 vmcore_size;
@@ -213,7 +214,7 @@ static struct vmcore* __init get_new_element(void)
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
}

-static u64 __init get_vmcore_size_elf64(char *elfptr)
+static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -222,7 +223,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)

ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
- size = sizeof(Elf64_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -230,7 +231,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
return size;
}

-static u64 __init get_vmcore_size_elf32(char *elfptr)
+static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -239,7 +240,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)

ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
- size = sizeof(Elf32_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -307,7 +308,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = sizeof(Elf64_Ehdr) +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -321,6 +322,8 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
*elfsz = *elfsz - i;
memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf64_Ehdr)-sizeof(Elf64_Phdr)));
+ memset(elfptr + *elfsz, 0, i);
+ *elfsz = roundup(*elfsz, PAGE_SIZE);

/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -388,7 +391,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = sizeof(Elf32_Ehdr) +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf32_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -402,6 +405,8 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
*elfsz = *elfsz - i;
memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf32_Ehdr)-sizeof(Elf32_Phdr)));
+ memset(elfptr + *elfsz, 0, i);
+ *elfsz = roundup(*elfsz, PAGE_SIZE);

/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -425,9 +430,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */

/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf64_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
@@ -462,9 +465,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */

/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf32_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
@@ -486,7 +487,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf64(char *elfptr,
+static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -496,8 +497,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
ehdr_ptr = (Elf64_Ehdr *)elfptr;

/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf64_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr);
+ vmcore_off = elfsz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -506,7 +506,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf32(char *elfptr,
+static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -516,8 +516,7 @@ static void __init set_vmcore_list_offsets_elf32(char *elfptr,
ehdr_ptr = (Elf32_Ehdr *)elfptr;

/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf32_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr);
+ vmcore_off = elfsz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -553,30 +552,35 @@ static int __init parse_crash_elf64_headers(void)
}

/* Read in all elf headers. */
- elfcorebuf_sz = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
+ elfcorebuf_sz = elfcorebuf_sz_orig;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}

/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf64(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}

@@ -608,30 +612,35 @@ static int __init parse_crash_elf32_headers(void)
}

/* Read in all elf headers. */
- elfcorebuf_sz = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
+ elfcorebuf_sz = elfcorebuf_sz_orig;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}

/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf32(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}

@@ -656,14 +665,14 @@ static int __init parse_crash_elf_headers(void)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf64(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
} else if (e_ident[EI_CLASS] == ELFCLASS32) {
rc = parse_crash_elf32_headers();
if (rc)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf32(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
} else {
pr_warn("Warning: Core image elf header is not sane\n");
return -EINVAL;
@@ -710,7 +719,8 @@ void vmcore_cleanup(void)
list_del(&m->list);
kfree(m);
}
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
elfcorebuf = NULL;
}
EXPORT_SYMBOL_GPL(vmcore_cleanup);

2013-04-18 09:34:57

by Hatayama, Daisuke

Subject: [PATCH v4 3/8] vmcore, procfs: introduce MEM_TYPE_CURRENT_KERNEL flag to distinguish objects copied in 2nd kernel

A later patch will copy ELF note segments into a buffer on the 2nd
kernel. To handle memory on the 1st kernel (old memory) and memory on
the 2nd kernel in vmcore_list uniformly, introduce the
MEM_TYPE_CURRENT_KERNEL flag. If this flag is set, the vmcore object
corresponds to a buffer on the 2nd kernel.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

include/linux/proc_fs.h | 8 +++++++-
1 files changed, 7 insertions(+), 1 deletions(-)

diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 94dfb2a..fefead4 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -97,11 +97,17 @@ struct kcore_list {
int type;
};

+#define MEM_TYPE_CURRENT_KERNEL 0x1
+
struct vmcore {
struct list_head list;
- unsigned long long paddr;
+ union {
+ unsigned long long paddr;
+ char *buf;
+ };
unsigned long long size;
loff_t offset;
+ unsigned int flag;
};

#ifdef CONFIG_PROC_FS

2013-04-18 09:35:04

by Hatayama, Daisuke

Subject: [PATCH v4 4/8] vmcore: Add helper function vmcore_add()

A later patch will introduce a helper function, vmcore_add_per_unit, that
adds memory chunks to vmcore_list in units of a given size. As a
preparation, this patch introduces a helper function that adds a given
memory chunk to vmcore_list in a simple manner.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 44 ++++++++++++++++++++------------------------
1 files changed, 20 insertions(+), 24 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7e7c7ca..131d8fa 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -186,6 +186,20 @@ static struct vmcore* __init get_new_element(void)
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
}

+static int __init vmcore_add(struct list_head *vc_list, u64 paddr, u64 size)
+{
+ struct vmcore *new;
+
+ new = get_new_element();
+ if (!new)
+ return -ENOMEM;
+ new->paddr = paddr;
+ new->size = size;
+ list_add_tail(&new->list, vc_list);
+
+ return 0;
+}
+
static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
{
int i;
@@ -236,7 +250,6 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
int j;
void *notes_section;
- struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
if (phdr_ptr->p_type != PT_NOTE)
continue;
@@ -263,14 +276,11 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
}

/* Add this contiguous chunk of notes section to vmcore list.*/
- new = get_new_element();
- if (!new) {
+ if (vmcore_add(vc_list, phdr_ptr->p_offset, real_sz)) {
kfree(notes_section);
return -ENOMEM;
}
- new->paddr = phdr_ptr->p_offset;
- new->size = real_sz;
- list_add_tail(&new->list, vc_list);
+
phdr_sz += real_sz;
kfree(notes_section);
}
@@ -319,7 +329,6 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
int j;
void *notes_section;
- struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
if (phdr_ptr->p_type != PT_NOTE)
continue;
@@ -346,14 +355,11 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
}

/* Add this contiguous chunk of notes section to vmcore list.*/
- new = get_new_element();
- if (!new) {
+ if (vmcore_add(vc_list, phdr_ptr->p_offset, real_sz)) {
kfree(notes_section);
return -ENOMEM;
}
- new->paddr = phdr_ptr->p_offset;
- new->size = real_sz;
- list_add_tail(&new->list, vc_list);
+
phdr_sz += real_sz;
kfree(notes_section);
}
@@ -396,7 +402,6 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
Elf64_Ehdr *ehdr_ptr;
Elf64_Phdr *phdr_ptr;
loff_t vmcore_off;
- struct vmcore *new;

ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
@@ -409,12 +414,8 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
continue;

/* Add this contiguous chunk of memory to vmcore list.*/
- new = get_new_element();
- if (!new)
+ if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
- list_add_tail(&new->list, vc_list);

/* Update the program header offset. */
phdr_ptr->p_offset = vmcore_off;
@@ -431,7 +432,6 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
Elf32_Ehdr *ehdr_ptr;
Elf32_Phdr *phdr_ptr;
loff_t vmcore_off;
- struct vmcore *new;

ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
@@ -444,12 +444,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
continue;

/* Add this contiguous chunk of memory to vmcore list.*/
- new = get_new_element();
- if (!new)
+ if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
- list_add_tail(&new->list, vc_list);

/* Update the program header offset */
phdr_ptr->p_offset = vmcore_off;

2013-04-18 09:35:23

by Hatayama, Daisuke

Subject: [PATCH v4 7/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

Treat memory chunks referenced by PT_LOAD program header entries on
page-size boundaries in vmcore_list. Formally, for each range [start,
end], we set up the corresponding vmcore object in vmcore_list to cover
[rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

This change affects the layout of /proc/vmcore. The gaps generated by the
rearrangement are newly made visible to applications as
holes. Concretely, they are the two ranges [rounddown(start, PAGE_SIZE),
start] and [end, roundup(end, PAGE_SIZE)].
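
As a concrete example, assuming PAGE_SIZE = 4096 and a hypothetical PT_LOAD
chunk covering the physical range [0x1000080, 0x1003f00], the corresponding
vmcore object covers [0x1000000, 0x1004000], and the sub-ranges
[0x1000000, 0x1000080] and [0x1003f00, 0x1004000] become holes visible
through /proc/vmcore.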

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 26 ++++++++++++++++++++------
1 files changed, 20 insertions(+), 6 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 029bdc0..cd0f9d9 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -477,16 +477,23 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 paddr, start, end, size;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;

+ paddr = phdr_ptr->p_offset;
+ start = rounddown(paddr, PAGE_SIZE);
+ end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
+ size = end - start;
+
/* Add this contiguous chunk of memory to vmcore list.*/
- if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
+ if (vmcore_add(vc_list, start, size))
return -ENOMEM;

/* Update the program header offset. */
- phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ phdr_ptr->p_offset = vmcore_off + (paddr - start);
+ vmcore_off = vmcore_off + size;
}
return 0;
}
@@ -507,16 +514,23 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 paddr, start, end, size;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;

+ paddr = phdr_ptr->p_offset;
+ start = rounddown(paddr, PAGE_SIZE);
+ end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
+ size = end - start;
+
/* Add this contiguous chunk of memory to vmcore list.*/
- if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
+ if (vmcore_add(vc_list, start, size))
return -ENOMEM;

/* Update the program header offset */
- phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ phdr_ptr->p_offset = vmcore_off + (paddr - start);
+ vmcore_off = vmcore_off + size;
}
return 0;
}

2013-04-18 09:35:15

by Hatayama, Daisuke

Subject: [PATCH v4 6/8] vmcore: count holes generated by round-up operation for page boundary for size of /proc/vmcore

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index e27da40..029bdc0 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -266,7 +266,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
+ size += roundup(phdr_ptr->p_memsz, PAGE_SIZE);
phdr_ptr++;
}
return size;
@@ -283,7 +283,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
+ size += roundup(phdr_ptr->p_memsz, PAGE_SIZE);
phdr_ptr++;
}
return size;

2013-04-18 09:35:26

by Hatayama, Daisuke

Subject: [PATCH v4 8/8] vmcore: support mmap() on /proc/vmcore

This patch introduces mmap_vmcore().

If the MEM_TYPE_CURRENT_KERNEL flag is set, the buffer on the 2nd
kernel is mapped. If not set, the corresponding area in old memory is mapped.

Neither writable nor executable mappings are permitted, even via
mprotect(). A non-writable mapping is also a requirement of
remap_pfn_range() when mapping linear pages onto non-consecutive
physical pages; see is_cow_mapping().
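
For reference, the check in is_cow_mapping() boils down to whether the
mapping is private but may become writable (paraphrased here, see
mm/memory.c for the exact definition), which is why mmap_vmcore() clears
VM_MAYWRITE:

	/* roughly: COW mapping = private mapping that may become writable */
	is_cow = (vm_flags & (VM_SHARED | VM_MAYWRITE)) == VM_MAYWRITE;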

On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is a 32-bit unsigned long on x86-32, so at most
2^32 page frames, i.e. 2^32 * 4KB = 16TB, can be mapped.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 68 insertions(+), 0 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index cd0f9d9..aecdc72 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -176,9 +176,77 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}

+static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+{
+ size_t size = vma->vm_end - vma->vm_start;
+ u64 start, end, len, tsz;
+ struct vmcore *m;
+
+ start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+ end = start + size;
+
+ if (size > vmcore_size || end > vmcore_size)
+ return -EINVAL;
+
+ if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+ return -EPERM;
+
+ vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+
+ len = 0;
+
+ if (start < elfcorebuf_sz) {
+ u64 pfn;
+
+ tsz = elfcorebuf_sz - start;
+ if (size < tsz)
+ tsz = size;
+ pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
+ vma->vm_page_prot))
+ return -EAGAIN;
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (start < m->offset + m->size) {
+ u64 paddr = 0;
+
+ tsz = m->offset + m->size - start;
+ if (size < tsz)
+ tsz = size;
+ if (m->flag & MEM_TYPE_CURRENT_KERNEL) {
+ paddr = __pa(m->buf + start - m->offset);
+ } else {
+ paddr = m->paddr + start - m->offset;
+ }
+ if (remap_pfn_range(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot)) {
+ do_munmap(vma->vm_mm, vma->vm_start, len);
+ return -EAGAIN;
+ }
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+ }
+
+ return 0;
+}
+
static const struct file_operations proc_vmcore_operations = {
.read = read_vmcore,
.llseek = default_llseek,
+ .mmap = mmap_vmcore,
};

static struct vmcore* __init get_new_element(void)

2013-04-18 09:36:39

by Hatayama, Daisuke

Subject: [PATCH v4 5/8] vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects

Copy ELF note segments into a buffer on the 2nd kernel; the buffer is
allocated on page boundaries for the purpose of mmap().

The reason why we don't place the ELF note segments on page boundaries
in the 1st kernel (old memory) is to keep backward compatibility with old
kernels; another reason is that doing so would waste a non-trivial amount
of memory due to the round-up to page boundaries, since most of the
buffers are in per-cpu areas.

ELF notes are per-cpu, so the total size of the ELF note segments
increases with the number of CPUs. The current maximum number of CPUs on
x86_64 is 5192, and there is already a system with 4192 CPUs at SGI,
where the total size amounts to 1MB. This can become larger in the near
future, or possibly even now on another architecture. Thus, to avoid
the case where a memory allocation for a large block fails, we allocate
vmcore objects per page.

Note that we cannot use remap_vmalloc_range() here, since the vma passed
to it as the first argument needs to cover the full range of the mapping.
We need to be able to specify where in the vma to map which pages, but
remap_vmalloc_range() does not allow that.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 71 ++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 files changed, 69 insertions(+), 2 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 131d8fa..e27da40 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -200,7 +200,62 @@ static int __init vmcore_add(struct list_head *vc_list, u64 paddr, u64 size)
return 0;
}

+struct vmcore_per_unit_state {
+ struct vmcore *last;
+ u64 pos;
+ u64 unit;
+ struct list_head *vc_list;
+};
+
+/*
+ * Assumptions:
+ * - given physical addresses are all exclusive; don't check exclusiveness.
+ */
+static int __init vmcore_add_per_unit(struct vmcore_per_unit_state *state,
+ u64 paddr, u64 size)
+{
+ u64 offset = paddr, remaining_bytes = size;
+ int rc;
+
+ while (remaining_bytes > 0) {
+ u64 read_bytes;
+
+ BUG_ON(state->pos > state->unit);
+
+ if (!state->last || state->pos == state->unit) {
+ struct vmcore *new;
+
+ new = get_new_element();
+ if (!new)
+ return -ENOMEM;
+ new->flag = MEM_TYPE_CURRENT_KERNEL;
+ new->size = PAGE_SIZE;
+ new->buf = (char *)get_zeroed_page(GFP_KERNEL);
+ if (!new->buf) {
+ kfree(new);
+ return -ENOMEM;
+ }
+ list_add_tail(&new->list, state->vc_list);
+ state->last = new;
+ state->pos = 0;
+ }
+
+ read_bytes = min(remaining_bytes, state->unit - state->pos);
+
+ rc = read_from_oldmem(state->last->buf + state->pos, read_bytes,
+ &offset, 0);
+ if (rc < 0)
+ return rc;
+
+ state->pos += read_bytes;
+ remaining_bytes -= read_bytes;
+ }
+
+ return 0;
+}
+
static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
+
{
int i;
u64 size;
@@ -244,6 +299,12 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
Elf64_Phdr phdr, *phdr_ptr;
Elf64_Nhdr *nhdr_ptr;
u64 phdr_sz = 0, note_off;
+ struct vmcore_per_unit_state state = {
+ .last = NULL,
+ .pos = 0,
+ .unit = PAGE_SIZE,
+ .vc_list = vc_list,
+ };

ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
@@ -276,7 +337,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
}

/* Add this contiguous chunk of notes section to vmcore list.*/
- if (vmcore_add(vc_list, phdr_ptr->p_offset, real_sz)) {
+ if (vmcore_add_per_unit(&state, phdr_ptr->p_offset, real_sz)) {
kfree(notes_section);
return -ENOMEM;
}
@@ -323,6 +384,12 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
Elf32_Phdr phdr, *phdr_ptr;
Elf32_Nhdr *nhdr_ptr;
u64 phdr_sz = 0, note_off;
+ struct vmcore_per_unit_state state = {
+ .last = NULL,
+ .pos = 0,
+ .unit = PAGE_SIZE,
+ .vc_list = vc_list,
+ };

ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
@@ -355,7 +422,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
}

/* Add this contiguous chunk of notes section to vmcore list.*/
- if (vmcore_add(vc_list, phdr_ptr->p_offset, real_sz)) {
+ if (vmcore_add_per_unit(&state, phdr_ptr->p_offset, real_sz)) {
kfree(notes_section);
return -ENOMEM;
}

2013-04-18 09:37:23

by Hatayama, Daisuke

Subject: [PATCH v4 2/8] vmcore: clean up read_vmcore()

Rewrite the part of read_vmcore() that reads objects in vmcore_list in the
same way as the part that reads the ELF headers, which removes some
duplicated and redundant code.

A later patch will introduce a flag for vmcore objects representing
whether the memory associated with a given vmcore object resides in
the 1st kernel (old memory) or in the 2nd kernel. Then, this function
will be modified to decide where to read from according to the flag
of a given vmcore object. This cleanup also makes that change clearer.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 68 ++++++++++++++++--------------------------------------
1 files changed, 20 insertions(+), 48 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index cfbd1f7..7e7c7ca 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -118,27 +118,6 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}

-/* Maps vmcore file offset to respective physical address in memroy. */
-static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
- struct vmcore **m_ptr)
-{
- struct vmcore *m;
- u64 paddr;
-
- list_for_each_entry(m, vc_list, list) {
- u64 start, end;
- start = m->offset;
- end = m->offset + m->size - 1;
- if (offset >= start && offset <= end) {
- paddr = m->paddr + offset - start;
- *m_ptr = m;
- return paddr;
- }
- }
- *m_ptr = NULL;
- return 0;
-}
-
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -147,8 +126,8 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
{
ssize_t acc = 0, tmp;
size_t tsz;
- u64 start, nr_bytes;
- struct vmcore *curr_m = NULL;
+ u64 start;
+ struct vmcore *m = NULL;

if (buflen == 0 || *fpos >= vmcore_size)
return 0;
@@ -174,33 +153,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}

- start = map_offset_to_paddr(*fpos, &vmcore_list, &curr_m);
- if (!curr_m)
- return -EINVAL;
-
- while (buflen) {
- tsz = min_t(size_t, buflen, PAGE_SIZE - (start & ~PAGE_MASK));
-
- /* Calculate left bytes in current memory segment. */
- nr_bytes = (curr_m->size - (start - curr_m->paddr));
- if (tsz > nr_bytes)
- tsz = nr_bytes;
-
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
- if (tmp < 0)
- return tmp;
- buflen -= tsz;
- *fpos += tsz;
- buffer += tsz;
- acc += tsz;
- if (start >= (curr_m->paddr + curr_m->size)) {
- if (curr_m->list.next == &vmcore_list)
- return acc; /*EOF*/
- curr_m = list_entry(curr_m->list.next,
- struct vmcore, list);
- start = curr_m->paddr;
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (*fpos < m->offset + m->size) {
+ tsz = m->offset + m->size - *fpos;
+ if (buflen < tsz)
+ tsz = buflen;
+ start = m->paddr + *fpos - m->offset;
+ tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ if (tmp < 0)
+ return tmp;
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+
+ /* leave now if filled buffer already */
+ if (buflen == 0)
+ return acc;
}
}
+
return acc;
}

2013-04-25 13:38:28

by Cliff Wickman

Subject: Re: [PATCH v4 0/8] kdump, vmcore: support mmap() on /proc/vmcore

On Fri, Apr 05, 2013 at 12:04:02AM +0000, HATAYAMA Daisuke wrote:
> Currently, read to /proc/vmcore is done by read_oldmem() that uses
> ioremap/iounmap per a single page. For example, if memory is 1GB,
> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
> times. This causes big performance degradation.
>
> In particular, the current main user of this mmap() is makedumpfile,
> which not only reads memory from /proc/vmcore but also does other
> processing like filtering, compression and IO work.
>
> To address the issue, this patch implements mmap() on /proc/vmcore to
> improve read performance.
>
> Benchmark
> =========
>
> You can see two benchmarks on terabyte memory system. Both show about
> 40 seconds on 2TB system. This is almost equal to performance by
> experimental kernel-side memory filtering.
>
> - makedumpfile mmap() benchmark, by Jingbai Ma
> https://lkml.org/lkml/2013/3/27/19
>
> - makedumpfile: benchmark on mmap() with /proc/vmcore on 2TB memory system
> https://lkml.org/lkml/2013/3/26/914
>
> ChangeLog
> =========
>
> v3 => v4)
>
> - Rebase 3.9-rc7.
> - Drop clean-up patches orthogonal to the main topic of this patch set.
> - Copy ELF note segments in the 1st kernel just as in v1. Allocate
> vmcore objects per pages. => See [PATCH 5/8]
> - Map memory referenced by PT_LOAD entry directly even if the start or
> end of the region doesn't fit inside page boundary, no longer copy
> them as the previous v3. Then, holes, outside OS memory, are visible
> from /proc/vmcore. => See [PATCH 7/8]
>
> v2 => v3)
>
> - Rebase 3.9-rc3.
> - Copy program headers separately from e_phoff in ELF note segment
> buffer. Now there's no risk to allocate huge memory if program
> header table positions after memory segment.
> - Add cleanup patch that removes unnecessary variable.
> - Fix wrongly using the variable that is buffer size configurable at
> runtime. Instead, use the variable that has original buffer size.
>
> v1 => v2)
>
> - Clean up the existing codes: use e_phoff, and remove the assumption
> on PT_NOTE entries.
> - Fix potential bug that ELF header size is not included in exported
> vmcoreinfo size.
> - Divide patch modifying read_vmcore() into two: clean-up and primary
> code change.
> - Put ELF note segments in page-size boundary on the 1st kernel
> instead of copying them into the buffer on the 2nd kernel.
>
> Test
> ====
>
> This patch set is composed based on v3.9-rc7.
>
> Done on x86-64, x86-32 both with 1GB and over 4GB memory environments.
>
> ---
>
> HATAYAMA Daisuke (8):
> vmcore: support mmap() on /proc/vmcore
> vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
> vmcore: count holes generated by round-up operation for page boundary for size of /proc/vmcore
> vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects
> vmcore: Add helper function vmcore_add()
> vmcore, procfs: introduce MEM_TYPE_CURRENT_KERNEL flag to distinguish objects copied in 2nd kernel
> vmcore: clean up read_vmcore()
> vmcore: allocate buffer for ELF headers on page-size alignment
>
>
> fs/proc/vmcore.c | 349 ++++++++++++++++++++++++++++++++---------------
> include/linux/proc_fs.h | 8 +
> 2 files changed, 245 insertions(+), 112 deletions(-)
>
> --
>
> Thanks.
> HATAYAMA, Daisuke

This is a very important patch set for speeding the kdump process.
(patches 1 - 8)

We have found the mmap interface to /proc/vmcore about 80x faster than the
read interface.
That is, doing mmap's and copying data (in pieces the size of page
structures) transfers all of /proc/vmcore about 80 times faster than
reading it.

This greatly speeds up the capture of a kdump, as the scan of page
structures takes the bulk of the time in dumping the OS on a machine
with terabytes of memory.

We would very much like to see this set make it into the 3.10 release.

Acked-by: Cliff Wickman <[email protected]>

-Cliff
--
Cliff Wickman
SGI
[email protected]
(651) 683-3824

2013-04-29 19:36:50

by Vivek Goyal

Subject: Re: [PATCH v4 5/8] vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects

On Sat, Apr 13, 2013 at 09:21:33AM +0900, HATAYAMA Daisuke wrote:

[..]
> ELF notes are per-cpu, so total size of ELF note segments increases
> according to the number of CPUs. The current maximum number of CPUs on
> x86_64 is 5192, and there's already system with 4192 CPUs in SGI,
> where total size amounts to 1MB. This can be larger in the near
> future or possibly even now on another architecture. Thus, to avoid
> the case where memory allocation for large block fails, we allocate
> vmcore objects per pages.

IIRC, eric had suggested using vmalloc() and remap_vmalloc_range(). What's
wrong with that? That should keep your vc_list relatively smaller.

Thanks
Vivek

2013-04-29 19:52:36

by Vivek Goyal

Subject: Re: [PATCH v4 7/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

On Sat, Apr 13, 2013 at 09:21:46AM +0900, HATAYAMA Daisuke wrote:
> Treat memory chunks referenced by PT_LOAD program header entries in
> page-size boundary in vmcore_list. Formally, for each range [start,
> end], we set up the corresponding vmcore object in vmcore_list to
> [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
>
> This change affects layout of /proc/vmcore. The gaps generated by the
> rearrangement are newly made visible to applications as
> holes. Concretely, they are two ranges [rounddown(start, PAGE_SIZE),
> start] and [end, roundup(end, PAGE_SIZE)].

Sorry, I did not understand this part. So if end is not page aligned, then
we roundup(end, PAGE_SIZE) and increase the PT_LOAD size accordingly?
Similarly for start, we do rounddown().

- Can you really rounddown() start? Then you would have to change the start
virtual address in the program header, and that's not really a good idea.

- So this extra memory for end, we read from old memory and not fill
with zeros?

>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> fs/proc/vmcore.c | 26 ++++++++++++++++++++------
> 1 files changed, 20 insertions(+), 6 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 029bdc0..cd0f9d9 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -477,16 +477,23 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
> vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> + u64 paddr, start, end, size;
> +
> if (phdr_ptr->p_type != PT_LOAD)
> continue;
>
> + paddr = phdr_ptr->p_offset;
> + start = rounddown(paddr, PAGE_SIZE);
> + end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> + size = end - start;
> +
> /* Add this contiguous chunk of memory to vmcore list.*/
> - if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
> + if (vmcore_add(vc_list, start, size))
> return -ENOMEM;
>
> /* Update the program header offset. */
> - phdr_ptr->p_offset = vmcore_off;
> - vmcore_off = vmcore_off + phdr_ptr->p_memsz;
> + phdr_ptr->p_offset = vmcore_off + (paddr - start);

What's paddr-start? Why is the following not sufficient:

phdr_ptr->p_offset = vmcore_off

Thanks
Vivek

2013-05-07 07:38:57

by Hatayama, Daisuke

Subject: Re: [PATCH v4 7/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

Sorry for late reply.

(2013/04/30 4:51), Vivek Goyal wrote:
> On Sat, Apr 13, 2013 at 09:21:46AM +0900, HATAYAMA Daisuke wrote:
>> Treat memory chunks referenced by PT_LOAD program header entries in
>> page-size boundary in vmcore_list. Formally, for each range [start,
>> end], we set up the corresponding vmcore object in vmcore_list to
>> [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
>>
>> This change affects layout of /proc/vmcore. The gaps generated by the
>> rearrangement are newly made visible to applications as
>> holes. Concretely, they are two ranges [rounddown(start, PAGE_SIZE),
>> start] and [end, roundup(end, PAGE_SIZE)].
>
> Sorry did not understand this part. So if end is not page aligned, then
> we roundup(end, PAGE_SIZE) and increase the PT_LOAD size accordingly?
> Similarly for start, we do roundown().
>
> - Can you really rounddown() start? Then you will have to change start
> virtual address in program header and that's not really a good idea.
>

No, there's no need to change paddr in the program header. Please notice
the difference between what the objects in vc_list refer to and what the
PT_LOAD program headers refer to. The former cover not only kernel
memory but also the extra memory, while the latter cover the kernel memory only.

> - So this extra memory for end, we read from old memory and not fill
> with zeros?

Yes. The extra memory is not covered by any program header, i.e. it is a hole.
The extra memory is not modified at all and is exported with no change; if it
is used by the BIOS, users can see BIOS data there. This design comes from
the discussion with Eric.

>
>>
>> Signed-off-by: HATAYAMA Daisuke <[email protected]>
>> ---
>>
>> fs/proc/vmcore.c | 26 ++++++++++++++++++++------
>> 1 files changed, 20 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 029bdc0..cd0f9d9 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -477,16 +477,23 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>> vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>>
>> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
>> + u64 paddr, start, end, size;
>> +
>> if (phdr_ptr->p_type != PT_LOAD)
>> continue;
>>
>> + paddr = phdr_ptr->p_offset;
>> + start = rounddown(paddr, PAGE_SIZE);
>> + end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
>> + size = end - start;
>> +
>> /* Add this contiguous chunk of memory to vmcore list.*/
>> - if (vmcore_add(vc_list, phdr_ptr->p_offset, phdr_ptr->p_memsz))
>> + if (vmcore_add(vc_list, start, size))
>> return -ENOMEM;
>>
>> /* Update the program header offset. */
>> - phdr_ptr->p_offset = vmcore_off;
>> - vmcore_off = vmcore_off + phdr_ptr->p_memsz;
>> + phdr_ptr->p_offset = vmcore_off + (paddr - start);
>
> What's paddr-start. Why following is not sufficient.
>
> phdr_ptr->p_offset = vmcore_off
>

(paddr - start) is the offset of the memory the program header refers to,
i.e. where the kernel memory starts within the chunk. Pictorially:

vmcore_off                   +----------------------+
                             |     extra memory     |
                             |  (non kernel memory) |
phdr->p_offset =             +----------------------+
vmcore_off + (paddr - start) |                      |\
                             |    kernel memory     | phdr->p_memsz
                             |                      |/
                             +----------------------+
                             |     extra memory     |
                             |  (non kernel memory) |
vmcore_off + size            +----------------------+

--
Thanks.
HATAYAMA, Daisuke

2013-05-07 07:57:11

by Hatayama, Daisuke

Subject: Re: [PATCH v4 5/8] vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects

(2013/04/30 4:36), Vivek Goyal wrote:
> On Sat, Apr 13, 2013 at 09:21:33AM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> ELF notes are per-cpu, so total size of ELF note segments increases
>> according to the number of CPUs. The current maximum number of CPUs on
>> x86_64 is 5192, and there's already system with 4192 CPUs in SGI,
>> where total size amounts to 1MB. This can be larger in the neare
>> futrue or possibly even now on another architecture. Thus, to avoid
>> the case where memory allocation for large block fails, we allocate
>> vmcore objects per pages.
>
> IIRC, eric had suggested using vmalloc() and remap_vmalloc_range(). What's
> wrong with that? That should keep your vc_list relatively smaller.
>

Yes, it would be handy if we could remap them in vmalloc space, but the
problem here is that remap_vmalloc_range() requires the first argument vma
to cover the full range of the requested map. This becomes a problem when
the area requested by mmap() overlaps multiple objects, for example, the
ELF headers and the memory referred to by the first PT_LOAD program
header.

To use remap_vmalloc_range(), it would be necessary to prepare a new
variant, similar to remap_pfn_range(), by which we can remap different
objects separately into a single vma.

/**
* remap_vmalloc_range - map vmalloc pages to userspace
* @vma: vma to cover (map full range of vma)
* @addr: vmalloc memory
* @pgoff: number of pages into addr before first page to map
*
* Returns: 0 for success, -Exxx on failure
*
* This function checks that addr is a valid vmalloc'ed area, and
* that it is big enough to cover the vma. Will return failure if
* that criteria isn't met.
*
* Similar to remap_pfn_range() (see mm/memory.c)
*/
int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff)


--
Thanks.
HATAYAMA, Daisuke

2013-05-07 15:09:17

by Vivek Goyal

Subject: Re: [PATCH v4 5/8] vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects

On Tue, May 07, 2013 at 04:56:46PM +0900, HATAYAMA Daisuke wrote:
> (2013/04/30 4:36), Vivek Goyal wrote:
> >On Sat, Apr 13, 2013 at 09:21:33AM +0900, HATAYAMA Daisuke wrote:
> >
> >[..]
> >>ELF notes are per-cpu, so total size of ELF note segments increases
> >>according to the number of CPUs. The current maximum number of CPUs on
> >>x86_64 is 5192, and there's already system with 4192 CPUs in SGI,
> >>where total size amounts to 1MB. This can be larger in the near
> >>future or possibly even now on another architecture. Thus, to avoid
> >>the case where memory allocation for large block fails, we allocate
> >>vmcore objects per pages.
> >
> >IIRC, eric had suggested using vmalloc() and remap_vmalloc_range(). What's
> >wrong with that? That should keep your vc_list relatively smaller.
> >
>
> Yes, it's handy if it's possible to remap them in vmalloc space, but
> the problem here is that remap_vmalloc_range requires the first
> argument vma to cover full range of the requested map. This becomes
> problem when requested area for mmap() overlaps multiple objects,
> for example, ELF headers and memory refered to by the first PT_LOAD
> program header.
>
> To use remap_vmalloc_range, it's necessary to prepare a new variant
> similar to remap_pfn_range by which we can remap different objects
> separately to a single vma.

Ok. Is it hard to prepare one such variant? If we can write one, it will
simplify the vmcore code.

Thanks
Vivek

2013-05-07 15:24:39

by Vivek Goyal

Subject: Re: [PATCH v4 7/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

On Tue, May 07, 2013 at 04:38:04PM +0900, HATAYAMA Daisuke wrote:

[..]
> >> /* Update the program header offset. */
> >>- phdr_ptr->p_offset = vmcore_off;
> >>- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
> >>+ phdr_ptr->p_offset = vmcore_off + (paddr - start);
> >
> >What's paddr-start. Why following is not sufficient.
> >
> >phdr_ptr->p_offset = vmcore_off
> >
>
> (paddr - start) is offset of the memory program header refers to,
> from which kernel memory starts. Pictrically:
>
> vmcore_off                   +----------------------+
>                              |     extra memory     |
>                              |  (non kernel memory) |
> phdr->p_offset =             +----------------------+
> vmcore_off + (paddr - start) |                      |\
>                              |    kernel memory     | phdr->p_memsz
>                              |                      |/
>                              +----------------------+
>                              |     extra memory     |
>                              |  (non kernel memory) |
> vmcore_off + size            +----------------------+

Ok, got it. So the PT_LOAD header refers to only part of the memory, and we
align start and end to PAGE_SIZE and then add that full chunk to the
vmcore list. We update phdr->p_offset to point to the PT_LOAD
mapping. The vc_list area can contain the page-aligned extra memory at the
beginning and end, and that can be read from old memory if the user wishes
to.

So this is not an issue.

Thanks
Vivek

2013-05-08 04:58:45

by Hatayama, Daisuke

Subject: Re: [PATCH v4 5/8] vmcore: copy ELF note segments in the 2nd kernel per page vmcore objects

(2013/05/08 0:08), Vivek Goyal wrote:
> On Tue, May 07, 2013 at 04:56:46PM +0900, HATAYAMA Daisuke wrote:
>> (2013/04/30 4:36), Vivek Goyal wrote:
>>> On Sat, Apr 13, 2013 at 09:21:33AM +0900, HATAYAMA Daisuke wrote:
>>>
>>> [..]
>>>> ELF notes are per-cpu, so total size of ELF note segments increases
>>>> according to the number of CPUs. The current maximum number of CPUs on
>>>> x86_64 is 5192, and there's already system with 4192 CPUs in SGI,
>>>> where total size amounts to 1MB. This can be larger in the near
>>>> future or possibly even now on another architecture. Thus, to avoid
>>>> the case where memory allocation for large block fails, we allocate
>>>> vmcore objects per pages.
>>>
>>> IIRC, eric had suggested using vmalloc() and remap_vmalloc_range(). What's
>>> wrong with that? That should keep your vc_list relatively smaller.
>>>
>>
>> Yes, it's handy if it's possible to remap them in vmalloc space, but
>> the problem here is that remap_vmalloc_range requires the first
>> argument vma to cover full range of the requested map. This becomes
>> problem when requested area for mmap() overlaps multiple objects,
>> for example, ELF headers and memory refered to by the first PT_LOAD
>> program header.
>>
>> To use remap_vmalloc_range, it's necessary to prepare a new variant
>> similar to remap_pfn_range by which we can remap different objects
>> separately to a single vma.
>
> Ok. Is it hard to prepare one such variant. If we can write one, it will
> simplify the vmcore code.

I'll try to write it. Although I avoided implementing it once, it now
looks relatively easy to implement thanks to vm_insert_page(), which does
all the essential work. All I have to do is take care of the sanity checks.
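
A rough sketch of what I have in mind (illustrative only; the name,
arguments and sanity checks are hypothetical, not a final implementation):

#include <linux/mm.h>
#include <linux/vmalloc.h>

/*
 * Hypothetical sketch: map part of a vmalloc'ed buffer into part of a
 * vma, so that different objects can be remapped one after another
 * into a single vma.
 */
static int remap_vmalloc_range_partial_sketch(struct vm_area_struct *vma,
					       unsigned long uaddr,
					       void *kaddr, unsigned long size)
{
	size = PAGE_ALIGN(size);

	if (!PAGE_ALIGNED(uaddr) || !PAGE_ALIGNED((unsigned long)kaddr))
		return -EINVAL;
	if (uaddr < vma->vm_start || uaddr + size > vma->vm_end)
		return -EINVAL;

	while (size) {
		/* look up the page backing this vmalloc address */
		struct page *page = vmalloc_to_page(kaddr);
		int ret;

		if (!page)
			return -EINVAL;

		ret = vm_insert_page(vma, uaddr, page);
		if (ret)
			return ret;

		uaddr += PAGE_SIZE;
		kaddr += PAGE_SIZE;
		size -= PAGE_SIZE;
	}

	return 0;
}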

--
Thanks.
HATAYAMA, Daisuke