2013-05-15 09:05:45

by Hatayama, Daisuke

Subject: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

Currently, reads from /proc/vmcore are done by read_oldmem(), which
uses ioremap/iounmap for every single page. For example, if memory is
1GB, ioremap/iounmap is called (1GB / 4KB) times, that is, 262144
times. This causes significant performance degradation.

In particular, the main intended user of this mmap() is makedumpfile,
which not only reads memory from /proc/vmcore but also performs other
processing such as filtering, compression and I/O work.

To address the issue, this patch set implements mmap() on
/proc/vmcore to improve read performance.
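
For illustration only, not part of this series: a minimal user-space
sketch of how a dump tool could replace many small read() calls with
one mmap() window per PT_LOAD segment once this series is applied. It
assumes a 64-bit dump, abbreviates error handling, and its names are
made up.

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

int main(void)
{
        long pagesz = sysconf(_SC_PAGESIZE);
        Elf64_Ehdr ehdr;
        Elf64_Phdr phdr;
        int fd, i;

        fd = open("/proc/vmcore", O_RDONLY);
        if (fd < 0 || read(fd, &ehdr, sizeof(ehdr)) != sizeof(ehdr))
                return 1;

        for (i = 0; i < ehdr.e_phnum; i++) {
                off_t phoff = ehdr.e_phoff + i * sizeof(phdr);

                if (pread(fd, &phdr, sizeof(phdr), phoff) != sizeof(phdr))
                        return 1;
                if (phdr.p_type != PT_LOAD)
                        continue;

                /* mmap() needs a page-aligned file offset; round down
                 * and compensate in the mapping length. */
                off_t off = phdr.p_offset - phdr.p_offset % pagesz;
                size_t len = phdr.p_filesz + phdr.p_offset % pagesz;
                char *p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, off);

                if (p == MAP_FAILED)
                        return 1;
                /* Filter/compress directly from the mapping here
                 * instead of issuing one small read() per page. */
                printf("PT_LOAD %d: mapped %zu bytes\n", i, len);
                munmap(p, len);
        }
        close(fd);
        return 0;
}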

Benchmark
=========

Below are two benchmarks on terabyte-memory systems. Both show about
40 seconds on a 2TB system, which is almost equal to the performance
of experimental kernel-side memory filtering.

- makedumpfile mmap() benchmark, by Jingbai Ma
https://lkml.org/lkml/2013/3/27/19

- makedumpfile: benchmark on mmap() with /proc/vmcore on 2TB memory system
https://lkml.org/lkml/2013/3/26/914

ChangeLog
=========

v5 => v6)

- Change patch order: cleanup patch => PT_LOAD change patch =>
vmalloc-related patch => mmap patch.
- Some cleanups: simplify symbol names, add helper functions for
processing the ELF note segment, and add comments for the helper
functions.
- Fix patch description of patch 7/8.

v4 => v5)

- Rebase 3.10-rc1.
- Introduce remap_vmalloc_range_partial() in order to remap vmalloc
memory in a part of vma area.
- Allocate buffer for ELF note segment at 2nd kernel by vmalloc(). Use
remap_vmalloc_range_partial() to remap the memory to userspace.

v3 => v4)

- Rebase 3.9-rc7.
- Drop clean-up patches orthogonal to the main topic of this patch set.
- Copy ELF note segments into the 2nd kernel just as in v1. Allocate
vmcore objects per page. => See [PATCH 5/8]
- Map memory referenced by a PT_LOAD entry directly even if the start
or end of the region doesn't fall on a page boundary, instead of
copying it as in v3. As a result, holes outside OS memory become
visible from /proc/vmcore. => See [PATCH 7/8]

v2 => v3)

- Rebase 3.9-rc3.
- Copy program headers separately from e_phoff into the ELF note
segment buffer. Now there's no risk of allocating huge memory if the
program header table is positioned after the memory segments.
- Add cleanup patch that removes unnecessary variable.
- Fix wrong use of the variable holding the buffer size configurable
at runtime; use the variable holding the original buffer size instead.

v1 => v2)

- Clean up the existing code: use e_phoff, and remove the assumption
on PT_NOTE entries.
- Fix a potential bug where the ELF header size was not included in
the exported vmcoreinfo size.
- Divide the patch modifying read_vmcore() into two: a clean-up and
the primary code change.
- Place ELF note segments on page-size boundaries in the 1st kernel
instead of copying them into a buffer in the 2nd kernel.

Test
====

This patch set is based on v3.10-rc1 and has been tested on x86_64
and x86_32, both with 1GB and with 5GB (over 4GB) memory
configurations.

---

HATAYAMA Daisuke (8):
vmcore: support mmap() on /proc/vmcore
vmcore: calculate vmcore file size from buffer size and total size of vmcore objects
vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory
vmalloc: introduce remap_vmalloc_range_partial
vmalloc: make find_vm_area check in range
vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list
vmcore: allocate buffer for ELF headers on page-size alignment
vmcore: clean up read_vmcore()


fs/proc/vmcore.c | 539 +++++++++++++++++++++++++++++++++--------------
include/linux/vmalloc.h | 4
mm/vmalloc.c | 65 ++++--
3 files changed, 430 insertions(+), 178 deletions(-)

--

Thanks.
HATAYAMA, Daisuke


2013-05-15 09:05:51

by Hatayama, Daisuke

Subject: [PATCH v6 1/8] vmcore: clean up read_vmcore()

Rewrite the part of read_vmcore() that reads objects in vmcore_list
in the same way as the part that reads the ELF headers, which removes
some duplicated and redundant code.

Signed-off-by: HATAYAMA Daisuke <[email protected]>
Acked-by: Vivek Goyal <[email protected]>
---

fs/proc/vmcore.c | 68 ++++++++++++++++--------------------------------------
1 files changed, 20 insertions(+), 48 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 17f7e08..ab0c92e 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -118,27 +118,6 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}

-/* Maps vmcore file offset to respective physical address in memroy. */
-static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
- struct vmcore **m_ptr)
-{
- struct vmcore *m;
- u64 paddr;
-
- list_for_each_entry(m, vc_list, list) {
- u64 start, end;
- start = m->offset;
- end = m->offset + m->size - 1;
- if (offset >= start && offset <= end) {
- paddr = m->paddr + offset - start;
- *m_ptr = m;
- return paddr;
- }
- }
- *m_ptr = NULL;
- return 0;
-}
-
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -147,8 +126,8 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
{
ssize_t acc = 0, tmp;
size_t tsz;
- u64 start, nr_bytes;
- struct vmcore *curr_m = NULL;
+ u64 start;
+ struct vmcore *m = NULL;

if (buflen == 0 || *fpos >= vmcore_size)
return 0;
@@ -174,33 +153,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}

- start = map_offset_to_paddr(*fpos, &vmcore_list, &curr_m);
- if (!curr_m)
- return -EINVAL;
-
- while (buflen) {
- tsz = min_t(size_t, buflen, PAGE_SIZE - (start & ~PAGE_MASK));
-
- /* Calculate left bytes in current memory segment. */
- nr_bytes = (curr_m->size - (start - curr_m->paddr));
- if (tsz > nr_bytes)
- tsz = nr_bytes;
-
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
- if (tmp < 0)
- return tmp;
- buflen -= tsz;
- *fpos += tsz;
- buffer += tsz;
- acc += tsz;
- if (start >= (curr_m->paddr + curr_m->size)) {
- if (curr_m->list.next == &vmcore_list)
- return acc; /*EOF*/
- curr_m = list_entry(curr_m->list.next,
- struct vmcore, list);
- start = curr_m->paddr;
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (*fpos < m->offset + m->size) {
+ tsz = m->offset + m->size - *fpos;
+ if (buflen < tsz)
+ tsz = buflen;
+ start = m->paddr + *fpos - m->offset;
+ tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ if (tmp < 0)
+ return tmp;
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+
+ /* leave now if filled buffer already */
+ if (buflen == 0)
+ return acc;
}
}
+
return acc;
}

2013-05-15 09:05:58

by Hatayama, Daisuke

Subject: [PATCH v6 2/8] vmcore: allocate buffer for ELF headers on page-size alignment

Allocate ELF headers on page-size boundary using __get_free_pages()
instead of kmalloc().

A later patch will merge the PT_NOTE entries into a single unique one
and decrease the buffer size actually used. Keep the original buffer
size in the variable elfcorebuf_sz_orig so the buffer can be freed
later, and keep the actually used buffer size, rounded up to a
page-size boundary, in the variable elfcorebuf_sz.

The size of part of the ELF buffer exported from /proc/vmcore is
elfcorebuf_sz.

The range [elfcorebuf_sz, elfcorebuf_sz_orig], which held the merged
and removed PT_NOTE entries, is filled with 0.

Use the size of the ELF headers as the initial offset value in
set_vmcore_list_offsets_elf{64,32} and
process_ptload_program_headers_elf{64,32} so that the offset accounts
for the holes up to the page boundary.
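
A small user-space sketch of the size arithmetic only; the header
counts and the 4KB page size below are assumed for illustration and
are not taken from this patch:

#include <elf.h>
#include <stdio.h>

#define PAGE_SIZE 4096UL                        /* assumed */
#define ROUNDUP(x, a) ((((x) + (a) - 1) / (a)) * (a))

int main(void)
{
        unsigned long e_phnum = 130, nr_ptnote = 128;   /* assumed counts */
        unsigned long orig, merged, exported;

        /* elfcorebuf_sz_orig: ELF header plus all program headers */
        orig = sizeof(Elf64_Ehdr) + e_phnum * sizeof(Elf64_Phdr);
        /* after merging the PT_NOTE entries into a single one */
        merged = orig - (nr_ptnote - 1) * sizeof(Elf64_Phdr);
        /* elfcorebuf_sz: what /proc/vmcore exports; the buffer tail
         * [merged, orig) is zero-filled by the patch */
        exported = ROUNDUP(merged, PAGE_SIZE);

        printf("orig=%lu merged=%lu exported=%lu\n", orig, merged, exported);
        return 0;
}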

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 80 ++++++++++++++++++++++++++++++------------------------
1 files changed, 45 insertions(+), 35 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index ab0c92e..48886e6 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -32,6 +32,7 @@ static LIST_HEAD(vmcore_list);
/* Stores the pointer to the buffer containing kernel elf core headers. */
static char *elfcorebuf;
static size_t elfcorebuf_sz;
+static size_t elfcorebuf_sz_orig;

/* Total size of vmcore file. */
static u64 vmcore_size;
@@ -186,7 +187,7 @@ static struct vmcore* __init get_new_element(void)
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
}

-static u64 __init get_vmcore_size_elf64(char *elfptr)
+static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -195,7 +196,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)

ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
- size = sizeof(Elf64_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -203,7 +204,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
return size;
}

-static u64 __init get_vmcore_size_elf32(char *elfptr)
+static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
{
int i;
u64 size;
@@ -212,7 +213,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)

ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
- size = sizeof(Elf32_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
+ size = elfsz;
for (i = 0; i < ehdr_ptr->e_phnum; i++) {
size += phdr_ptr->p_memsz;
phdr_ptr++;
@@ -280,7 +281,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = sizeof(Elf64_Ehdr) +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -294,6 +295,8 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
*elfsz = *elfsz - i;
memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf64_Ehdr)-sizeof(Elf64_Phdr)));
+ memset(elfptr + *elfsz, 0, i);
+ *elfsz = roundup(*elfsz, PAGE_SIZE);

/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -361,7 +364,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
phdr.p_flags = 0;
note_off = sizeof(Elf32_Ehdr) +
(ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf32_Phdr);
- phdr.p_offset = note_off;
+ phdr.p_offset = roundup(note_off, PAGE_SIZE);
phdr.p_vaddr = phdr.p_paddr = 0;
phdr.p_filesz = phdr.p_memsz = phdr_sz;
phdr.p_align = 0;
@@ -375,6 +378,8 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
*elfsz = *elfsz - i;
memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf32_Ehdr)-sizeof(Elf32_Phdr)));
+ memset(elfptr + *elfsz, 0, i);
+ *elfsz = roundup(*elfsz, PAGE_SIZE);

/* Modify e_phnum to reflect merged headers. */
ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
@@ -398,9 +403,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */

/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf64_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
@@ -435,9 +438,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */

/* First program header is PT_NOTE header. */
- vmcore_off = sizeof(Elf32_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
- phdr_ptr->p_memsz; /* Note sections */
+ vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
if (phdr_ptr->p_type != PT_LOAD)
@@ -459,7 +460,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf64(char *elfptr,
+static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -469,8 +470,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
ehdr_ptr = (Elf64_Ehdr *)elfptr;

/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf64_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr);
+ vmcore_off = elfsz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -479,7 +479,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf32(char *elfptr,
+static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
struct list_head *vc_list)
{
loff_t vmcore_off;
@@ -489,8 +489,7 @@ static void __init set_vmcore_list_offsets_elf32(char *elfptr,
ehdr_ptr = (Elf32_Ehdr *)elfptr;

/* Skip Elf header and program headers. */
- vmcore_off = sizeof(Elf32_Ehdr) +
- (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr);
+ vmcore_off = elfsz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -526,30 +525,35 @@ static int __init parse_crash_elf64_headers(void)
}

/* Read in all elf headers. */
- elfcorebuf_sz = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
+ elfcorebuf_sz = elfcorebuf_sz_orig;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}

/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf64(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}

@@ -581,30 +585,35 @@ static int __init parse_crash_elf32_headers(void)
}

/* Read in all elf headers. */
- elfcorebuf_sz = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
- elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
+ elfcorebuf_sz_orig = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
+ elfcorebuf_sz = elfcorebuf_sz_orig;
+ elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
+ get_order(elfcorebuf_sz_orig));
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
+ rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
if (rc < 0) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}

/* Merge all PT_NOTE headers into one. */
rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
&vmcore_list);
if (rc) {
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf32(elfcorebuf, &vmcore_list);
+ set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
return 0;
}

@@ -629,14 +638,14 @@ static int __init parse_crash_elf_headers(void)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf64(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
} else if (e_ident[EI_CLASS] == ELFCLASS32) {
rc = parse_crash_elf32_headers();
if (rc)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf32(elfcorebuf);
+ vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
} else {
pr_warn("Warning: Core image elf header is not sane\n");
return -EINVAL;
@@ -683,7 +692,8 @@ void vmcore_cleanup(void)
list_del(&m->list);
kfree(m);
}
- kfree(elfcorebuf);
+ free_pages((unsigned long)elfcorebuf,
+ get_order(elfcorebuf_sz_orig));
elfcorebuf = NULL;
}
EXPORT_SYMBOL_GPL(vmcore_cleanup);

2013-05-15 09:06:04

by Hatayama, Daisuke

Subject: [PATCH v6 3/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

Align the memory chunks referenced by PT_LOAD program header entries
to page-size boundaries in vmcore_list. Formally, for each range
[start, end], the corresponding vmcore object in vmcore_list is set up
to cover [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].

This change affects the layout of /proc/vmcore. The gaps generated by
the rearrangement become visible to applications as holes; concretely,
they are the two ranges [rounddown(start, PAGE_SIZE), start] and
[end, roundup(end, PAGE_SIZE)].

Suppose the variable m points at a vmcore object in vmcore_list, and
the variable phdr points at the PT_LOAD program header that m
corresponds to. Then, pictorially:

m->offset                      +---------------+
                               |     hole      |
phdr->p_offset =               +---------------+
m->offset + (paddr - start)    |               |\
                               | kernel memory | phdr->p_memsz
                               |               |/
                               +---------------+
                               |     hole      |
m->offset + m->size            +---------------+

where m->offset and m->offset + m->size are always page-size aligned.
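
For illustration (addresses assumed, PAGE_SIZE 4096): a PT_LOAD entry
with paddr 0x1000100 and p_memsz 0xe00 gives start = 0x1000000 and
end = 0x1001000, so the vmcore object covers exactly one page; the
exported p_offset points 0x100 bytes into that page-aligned chunk, and
[0x1000000, 0x1000100) and [0x1000f00, 0x1001000) are the holes shown
above.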

Signed-off-by: HATAYAMA Daisuke <[email protected]>
Acked-by: Vivek Goyal <[email protected]>
---

fs/proc/vmcore.c | 30 ++++++++++++++++++++++--------
1 files changed, 22 insertions(+), 8 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 48886e6..6cf7fbd 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -406,20 +406,27 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 paddr, start, end, size;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;

+ paddr = phdr_ptr->p_offset;
+ start = rounddown(paddr, PAGE_SIZE);
+ end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
+ size = end - start;
+
/* Add this contiguous chunk of memory to vmcore list.*/
new = get_new_element();
if (!new)
return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
+ new->paddr = start;
+ new->size = size;
list_add_tail(&new->list, vc_list);

/* Update the program header offset. */
- phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ phdr_ptr->p_offset = vmcore_off + (paddr - start);
+ vmcore_off = vmcore_off + size;
}
return 0;
}
@@ -441,20 +448,27 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
+ u64 paddr, start, end, size;
+
if (phdr_ptr->p_type != PT_LOAD)
continue;

+ paddr = phdr_ptr->p_offset;
+ start = rounddown(paddr, PAGE_SIZE);
+ end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
+ size = end - start;
+
/* Add this contiguous chunk of memory to vmcore list.*/
new = get_new_element();
if (!new)
return -ENOMEM;
- new->paddr = phdr_ptr->p_offset;
- new->size = phdr_ptr->p_memsz;
+ new->paddr = start;
+ new->size = size;
list_add_tail(&new->list, vc_list);

/* Update the program header offset */
- phdr_ptr->p_offset = vmcore_off;
- vmcore_off = vmcore_off + phdr_ptr->p_memsz;
+ phdr_ptr->p_offset = vmcore_off + (paddr - start);
+ vmcore_off = vmcore_off + size;
}
return 0;
}

2013-05-15 09:06:08

by Hatayama, Daisuke

Subject: [PATCH v6 4/8] vmalloc: make find_vm_area check in range

Currently, __find_vmap_area searches for the kernel VM area that
starts at a given address. This patch changes this behavior so that it
searches for the kernel VM area to which the address belongs. This
change is needed by remap_vmalloc_range_partial, introduced in a later
patch, which can receive an arbitrary position inside a kernel VM area
as its target address.

This patch changes the condition (addr > va->va_start) to the
equivalent (addr >= va->va_end), taking advantage of the fact that
kernel VM areas are non-overlapping.
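
For illustration (addresses assumed): for a kernel VM area spanning
[0x1000, 0x3000), the old condition returned the area only for
addr == 0x1000; with the new condition any addr in [0x1000, 0x3000)
returns the same area, which is what remap_vmalloc_range_partial needs
when it is handed an address in the middle of a vmalloc'ed buffer.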

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

mm/vmalloc.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index d365724..3875fa2 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -292,7 +292,7 @@ static struct vmap_area *__find_vmap_area(unsigned long addr)
va = rb_entry(n, struct vmap_area, rb_node);
if (addr < va->va_start)
n = n->rb_left;
- else if (addr > va->va_start)
+ else if (addr >= va->va_end)
n = n->rb_right;
else
return va;

2013-05-15 09:06:26

by Hatayama, Daisuke

Subject: [PATCH v6 7/8] vmcore: calculate vmcore file size from buffer size and total size of vmcore objects

The previous patches added new holes before each chunk of memory, and
these holes need to be counted in the vmcore file size. There are two
ways to compute the file size:

1) suppose m is a pointer to the last vmcore object in vmcore_list;
then the file size is (m->offset + m->size), or

2) calculate the sum of the sizes of the buffers for the ELF header,
the program headers, the ELF note segment and the objects in
vmcore_list.

Although 1) is more direct and simpler than 2), 2) seems better in
that it reflects the internal object structure of /proc/vmcore. Thus,
this patch changes get_vmcore_size_elf{64,32} so that they calculate
the size in the way of 2).
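
In other words, with this patch the file size is computed as

  vmcore_size = elfcorebuf_sz + elfnotes_sz + sum of m->size
                                              over m in vmcore_list

which equals (m->offset + m->size) of the last object, so way 1) and
way 2) agree.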

Signed-off-by: HATAYAMA Daisuke <[email protected]>
Acked-by: Vivek Goyal <[email protected]>
---

fs/proc/vmcore.c | 40 ++++++++++++++++++----------------------
1 files changed, 18 insertions(+), 22 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 4e121fda..7f2041c 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -210,36 +210,28 @@ static struct vmcore* __init get_new_element(void)
return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
}

-static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
+static u64 __init get_vmcore_size_elf64(size_t elfsz, size_t elfnotesegsz,
+ struct list_head *vc_list)
{
- int i;
u64 size;
- Elf64_Ehdr *ehdr_ptr;
- Elf64_Phdr *phdr_ptr;
+ struct vmcore *m;

- ehdr_ptr = (Elf64_Ehdr *)elfptr;
- phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
- size = elfsz;
- for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
- phdr_ptr++;
+ size = elfsz + elfnotesegsz;
+ list_for_each_entry(m, vc_list, list) {
+ size += m->size;
}
return size;
}

-static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
+static u64 __init get_vmcore_size_elf32(size_t elfsz, size_t elfnotesegsz,
+ struct list_head *vc_list)
{
- int i;
u64 size;
- Elf32_Ehdr *ehdr_ptr;
- Elf32_Phdr *phdr_ptr;
+ struct vmcore *m;

- ehdr_ptr = (Elf32_Ehdr *)elfptr;
- phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
- size = elfsz;
- for (i = 0; i < ehdr_ptr->e_phnum; i++) {
- size += phdr_ptr->p_memsz;
- phdr_ptr++;
+ size = elfsz + elfnotesegsz;
+ list_for_each_entry(m, vc_list, list) {
+ size += m->size;
}
return size;
}
@@ -795,14 +787,18 @@ static int __init parse_crash_elf_headers(void)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
+ vmcore_size = get_vmcore_size_elf64(elfcorebuf_sz,
+ elfnotes_sz,
+ &vmcore_list);
} else if (e_ident[EI_CLASS] == ELFCLASS32) {
rc = parse_crash_elf32_headers();
if (rc)
return rc;

/* Determine vmcore size. */
- vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
+ vmcore_size = get_vmcore_size_elf32(elfcorebuf_sz,
+ elfnotes_sz,
+ &vmcore_list);
} else {
pr_warn("Warning: Core image elf header is not sane\n");
return -EINVAL;

2013-05-15 09:06:15

by Hatayama, Daisuke

Subject: [PATCH v6 5/8] vmalloc: introduce remap_vmalloc_range_partial

We want to allocate the ELF note segment buffer in the 2nd kernel's
vmalloc space and remap it to user-space, in order to reduce the risk
that a memory allocation fails on systems with a huge number of CPUs,
and hence an ELF note segment so large that it exceeds the order-11
block size.

Although remap_vmalloc_range already exists for the purpose of
remapping vmalloc memory to user-space, it requires the user-space
range to be specified via a vma. mmap() on /proc/vmcore needs to remap
a range across multiple objects, so an interface that requires the vma
to cover the full range is problematic.

This patch introduces remap_vmalloc_range_partial, which receives the
user-space range as a pair of base address and size, and can be used
for the mmap() on /proc/vmcore case.

remap_vmalloc_range is rewritten using remap_vmalloc_range_partial.
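
A usage sketch, not taken from this series; the driver, its buffer
drv_buf and size drv_buf_sz are made-up names, and the buffer is
assumed to be vzalloc()ed with VM_USERMAP set, as done for
elfnotes_buf in patch 6/8:

#include <linux/fs.h>
#include <linux/mm.h>
#include <linux/vmalloc.h>

static char *drv_buf;
static size_t drv_buf_sz;

static int drv_mmap(struct file *file, struct vm_area_struct *vma)
{
        unsigned long off = vma->vm_pgoff << PAGE_SHIFT;
        unsigned long len = vma->vm_end - vma->vm_start;

        if (off > drv_buf_sz || len > drv_buf_sz - off)
                return -EINVAL;

        /* Map only [off, off + len) of the buffer into this vma;
         * remap_vmalloc_range() could not do this for a vma that
         * also has to cover other objects. */
        return remap_vmalloc_range_partial(vma, vma->vm_start,
                                           drv_buf + off, len);
}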

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

include/linux/vmalloc.h | 4 +++
mm/vmalloc.c | 63 +++++++++++++++++++++++++++++++++--------------
2 files changed, 48 insertions(+), 19 deletions(-)

diff --git a/include/linux/vmalloc.h b/include/linux/vmalloc.h
index 7d5773a..dd0a2c8 100644
--- a/include/linux/vmalloc.h
+++ b/include/linux/vmalloc.h
@@ -82,6 +82,10 @@ extern void *vmap(struct page **pages, unsigned int count,
unsigned long flags, pgprot_t prot);
extern void vunmap(const void *addr);

+extern int remap_vmalloc_range_partial(struct vm_area_struct *vma,
+ unsigned long uaddr, void *kaddr,
+ unsigned long size);
+
extern int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
unsigned long pgoff);
void vmalloc_sync_all(void);
diff --git a/mm/vmalloc.c b/mm/vmalloc.c
index 3875fa2..d9a9f4f6 100644
--- a/mm/vmalloc.c
+++ b/mm/vmalloc.c
@@ -2148,42 +2148,44 @@ finished:
}

/**
- * remap_vmalloc_range - map vmalloc pages to userspace
- * @vma: vma to cover (map full range of vma)
- * @addr: vmalloc memory
- * @pgoff: number of pages into addr before first page to map
+ * remap_vmalloc_range_partial - map vmalloc pages to userspace
+ * @vma: vma to cover
+ * @uaddr: target user address to start at
+ * @kaddr: virtual address of vmalloc kernel memory
+ * @size: size of map area
*
* Returns: 0 for success, -Exxx on failure
*
- * This function checks that addr is a valid vmalloc'ed area, and
- * that it is big enough to cover the vma. Will return failure if
- * that criteria isn't met.
+ * This function checks that @kaddr is a valid vmalloc'ed area,
+ * and that it is big enough to cover the range starting at
+ * @uaddr in @vma. Will return failure if that criteria isn't
+ * met.
*
* Similar to remap_pfn_range() (see mm/memory.c)
*/
-int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
- unsigned long pgoff)
+int remap_vmalloc_range_partial(struct vm_area_struct *vma, unsigned long uaddr,
+ void *kaddr, unsigned long size)
{
struct vm_struct *area;
- unsigned long uaddr = vma->vm_start;
- unsigned long usize = vma->vm_end - vma->vm_start;

- if ((PAGE_SIZE-1) & (unsigned long)addr)
+ size = PAGE_ALIGN(size);
+
+ if (((PAGE_SIZE-1) & (unsigned long)uaddr) ||
+ ((PAGE_SIZE-1) & (unsigned long)kaddr))
return -EINVAL;

- area = find_vm_area(addr);
+ area = find_vm_area(kaddr);
if (!area)
return -EINVAL;

if (!(area->flags & VM_USERMAP))
return -EINVAL;

- if (usize + (pgoff << PAGE_SHIFT) > area->size - PAGE_SIZE)
+ if (kaddr + size > area->addr + area->size)
return -EINVAL;

- addr += pgoff << PAGE_SHIFT;
do {
- struct page *page = vmalloc_to_page(addr);
+ struct page *page = vmalloc_to_page(kaddr);
int ret;

ret = vm_insert_page(vma, uaddr, page);
@@ -2191,14 +2193,37 @@ int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
return ret;

uaddr += PAGE_SIZE;
- addr += PAGE_SIZE;
- usize -= PAGE_SIZE;
- } while (usize > 0);
+ kaddr += PAGE_SIZE;
+ size -= PAGE_SIZE;
+ } while (size > 0);

vma->vm_flags |= VM_DONTEXPAND | VM_DONTDUMP;

return 0;
}
+EXPORT_SYMBOL(remap_vmalloc_range_partial);
+
+/**
+ * remap_vmalloc_range - map vmalloc pages to userspace
+ * @vma: vma to cover (map full range of vma)
+ * @addr: vmalloc memory
+ * @pgoff: number of pages into addr before first page to map
+ *
+ * Returns: 0 for success, -Exxx on failure
+ *
+ * This function checks that addr is a valid vmalloc'ed area, and
+ * that it is big enough to cover the vma. Will return failure if
+ * that criteria isn't met.
+ *
+ * Similar to remap_pfn_range() (see mm/memory.c)
+ */
+int remap_vmalloc_range(struct vm_area_struct *vma, void *addr,
+ unsigned long pgoff)
+{
+ return remap_vmalloc_range_partial(vma, vma->vm_start,
+ addr + (pgoff << PAGE_SHIFT),
+ vma->vm_end - vma->vm_start);
+}
EXPORT_SYMBOL(remap_vmalloc_range);

/*

2013-05-15 09:06:21

by Hatayama, Daisuke

Subject: [PATCH v6 6/8] vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory

The reasons why we don't allocate the ELF note segment in the 1st
kernel (old memory) on page boundaries are to keep backward
compatibility with old kernels, and that doing so would waste a
non-trivial amount of memory due to the round-up to page boundaries,
since most of the buffers are in the per-cpu area.

ELF notes are per-cpu, so the total size of the ELF note segments
depends on the number of CPUs. The current maximum number of CPUs on
x86_64 is 5192, and there's already a system with 4192 CPUs from SGI,
where the total size amounts to 1MB. This can become larger in the
near future, or possibly even now on another architecture with a
larger note size per CPU. Thus, to avoid the case where a memory
allocation for such a large block fails, we allocate the ELF note
segment buffer in vmalloc memory.

This patch adds the elfnotes_buf and elfnotes_sz variables to keep a
pointer to the ELF note segment buffer and its size. There is no
longer a vmcore object in vmcore_list that corresponds to the ELF note
segment. Accordingly, read_vmcore() gains a new case for the ELF note
segment, and set_vmcore_list_offsets_elf{64,32}() and other helper
functions start calculating offsets from the sum of the size of the
ELF headers and the size of the ELF note segment.
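
With this patch, the /proc/vmcore layout becomes (a sketch; every
boundary below is page-size aligned):

  +------------------------------+ 0
  | ELF headers (elfcorebuf)     |
  +------------------------------+ elfcorebuf_sz
  | ELF note segment             |
  | (elfnotes_buf, vmalloc'ed)   |
  +------------------------------+ elfcorebuf_sz + elfnotes_sz
  | PT_LOAD chunks (vmcore_list) |
  +------------------------------+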

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 273 +++++++++++++++++++++++++++++++++++++++++-------------
1 files changed, 209 insertions(+), 64 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 6cf7fbd..4e121fda 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -34,6 +34,9 @@ static char *elfcorebuf;
static size_t elfcorebuf_sz;
static size_t elfcorebuf_sz_orig;

+static char *elfnotes_buf;
+static size_t elfnotes_sz;
+
/* Total size of vmcore file. */
static u64 vmcore_size;

@@ -154,6 +157,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}

+ /* Read Elf note segment */
+ if (*fpos < elfcorebuf_sz + elfnotes_sz) {
+ void *kaddr;
+
+ tsz = elfcorebuf_sz + elfnotes_sz - *fpos;
+ if (buflen < tsz)
+ tsz = buflen;
+ kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
+ if (copy_to_user(buffer, kaddr, tsz))
+ return -EFAULT;
+ buflen -= tsz;
+ *fpos += tsz;
+ buffer += tsz;
+ acc += tsz;
+
+ /* leave now if filled buffer already */
+ if (buflen == 0)
+ return acc;
+ }
+
list_for_each_entry(m, &vmcore_list, list) {
if (*fpos < m->offset + m->size) {
tsz = m->offset + m->size - *fpos;
@@ -221,23 +244,33 @@ static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
return size;
}

-/* Merges all the PT_NOTE headers into one. */
-static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
- struct list_head *vc_list)
+/**
+ * process_note_headers_elf64 - Perform a variety of processing on ELF
+ * note segments according to the combination of function arguments.
+ *
+ * @ehdr_ptr - ELF header buffer
+ * @nr_notes - the number of program header entries of PT_NOTE type
+ * @notes_sz - total size of ELF note segment
+ * @notes_buf - buffer into which ELF note segment is copied
+ *
+ * Assume @ehdr_ptr is always not NULL. If @nr_notes is not NULL, then
+ * the number of program header entries of PT_NOTE type is assigned to
+ * @nr_notes. If @notes_sz is not NULL, then total size of ELF note
+ * segment, header part plus data part, is assigned to @notes_sz. If
+ * @notes_buf is not NULL, then ELF note segment is copied into
+ * @notes_buf.
+ */
+static int __init process_note_headers_elf64(const Elf64_Ehdr *ehdr_ptr,
+ int *nr_notes, u64 *notes_sz,
+ char *notes_buf)
{
int i, nr_ptnote=0, rc=0;
- char *tmp;
- Elf64_Ehdr *ehdr_ptr;
- Elf64_Phdr phdr, *phdr_ptr;
+ Elf64_Phdr *phdr_ptr = (Elf64_Phdr*)(ehdr_ptr + 1);
Elf64_Nhdr *nhdr_ptr;
- u64 phdr_sz = 0, note_off;
+ u64 phdr_sz = 0;

- ehdr_ptr = (Elf64_Ehdr *)elfptr;
- phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
- int j;
void *notes_section;
- struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
if (phdr_ptr->p_type != PT_NOTE)
continue;
@@ -253,7 +286,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
return rc;
}
nhdr_ptr = notes_section;
- for (j = 0; j < max_sz; j += sz) {
+ while (real_sz < max_sz) {
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf64_Nhdr) +
@@ -262,20 +295,68 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
real_sz += sz;
nhdr_ptr = (Elf64_Nhdr*)((char*)nhdr_ptr + sz);
}
-
- /* Add this contiguous chunk of notes section to vmcore list.*/
- new = get_new_element();
- if (!new) {
- kfree(notes_section);
- return -ENOMEM;
+ if (notes_buf) {
+ offset = phdr_ptr->p_offset;
+ rc = read_from_oldmem(notes_buf + phdr_sz, real_sz,
+ &offset, 0);
+ if (rc < 0) {
+ kfree(notes_section);
+ return rc;
+ }
}
- new->paddr = phdr_ptr->p_offset;
- new->size = real_sz;
- list_add_tail(&new->list, vc_list);
phdr_sz += real_sz;
kfree(notes_section);
}

+ if (nr_notes)
+ *nr_notes = nr_ptnote;
+ if (notes_sz)
+ *notes_sz = phdr_sz;
+
+ return 0;
+}
+
+static int __init get_note_number_and_size_elf64(const Elf64_Ehdr *ehdr_ptr,
+ int *nr_ptnote, u64 *phdr_sz)
+{
+ return process_note_headers_elf64(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
+}
+
+static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr_ptr, char *notes_buf)
+{
+ return process_note_headers_elf64(ehdr_ptr, NULL, NULL, notes_buf);
+}
+
+/* Merges all the PT_NOTE headers into one. */
+static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
+ char **notes_buf, size_t *notes_sz)
+{
+ int i, nr_ptnote=0, rc=0;
+ char *tmp;
+ Elf64_Ehdr *ehdr_ptr;
+ Elf64_Phdr phdr;
+ u64 phdr_sz = 0, note_off;
+ struct vm_struct *vm;
+
+ ehdr_ptr = (Elf64_Ehdr *)elfptr;
+
+ rc = get_note_number_and_size_elf64(ehdr_ptr, &nr_ptnote, &phdr_sz);
+ if (rc < 0)
+ return rc;
+
+ *notes_sz = roundup(phdr_sz, PAGE_SIZE);
+ *notes_buf = vzalloc(*notes_sz);
+ if (!*notes_buf)
+ return -ENOMEM;
+
+ vm = find_vm_area(*notes_buf);
+ BUG_ON(!vm);
+ vm->flags |= VM_USERMAP;
+
+ rc = copy_notes_elf64(ehdr_ptr, *notes_buf);
+ if (rc < 0)
+ return rc;
+
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
@@ -304,23 +385,33 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
return 0;
}

-/* Merges all the PT_NOTE headers into one. */
-static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
- struct list_head *vc_list)
+/**
+ * process_note_headers_elf32 - Perform a variety of processing on ELF
+ * note segments according to the combination of function arguments.
+ *
+ * @ehdr_ptr - ELF header buffer
+ * @nr_notes - the number of program header entries of PT_NOTE type
+ * @notes_sz - total size of ELF note segment
+ * @notes_buf - buffer into which ELF note segment is copied
+ *
+ * Assume @ehdr_ptr is always not NULL. If @nr_notes is not NULL, then
+ * the number of program header entries of PT_NOTE type is assigned to
+ * @nr_notes. If @notes_sz is not NULL, then total size of ELF note
+ * segment, header part plus data part, is assigned to @notes_sz. If
+ * @notes_buf is not NULL, then ELF note segment is copied into
+ * @notes_buf.
+ */
+static int __init process_note_headers_elf32(const Elf32_Ehdr *ehdr_ptr,
+ int *nr_notes, u64 *notes_sz,
+ char *notes_buf)
{
int i, nr_ptnote=0, rc=0;
- char *tmp;
- Elf32_Ehdr *ehdr_ptr;
- Elf32_Phdr phdr, *phdr_ptr;
+ Elf32_Phdr *phdr_ptr = (Elf32_Phdr*)(ehdr_ptr + 1);
Elf32_Nhdr *nhdr_ptr;
- u64 phdr_sz = 0, note_off;
+ u64 phdr_sz = 0;

- ehdr_ptr = (Elf32_Ehdr *)elfptr;
- phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
- int j;
void *notes_section;
- struct vmcore *new;
u64 offset, max_sz, sz, real_sz = 0;
if (phdr_ptr->p_type != PT_NOTE)
continue;
@@ -336,7 +427,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
return rc;
}
nhdr_ptr = notes_section;
- for (j = 0; j < max_sz; j += sz) {
+ while (real_sz < max_sz) {
if (nhdr_ptr->n_namesz == 0)
break;
sz = sizeof(Elf32_Nhdr) +
@@ -345,20 +436,68 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
real_sz += sz;
nhdr_ptr = (Elf32_Nhdr*)((char*)nhdr_ptr + sz);
}
-
- /* Add this contiguous chunk of notes section to vmcore list.*/
- new = get_new_element();
- if (!new) {
- kfree(notes_section);
- return -ENOMEM;
+ if (notes_buf) {
+ offset = phdr_ptr->p_offset;
+ rc = read_from_oldmem(notes_buf + phdr_sz, real_sz,
+ &offset, 0);
+ if (rc < 0) {
+ kfree(notes_section);
+ return rc;
+ }
}
- new->paddr = phdr_ptr->p_offset;
- new->size = real_sz;
- list_add_tail(&new->list, vc_list);
phdr_sz += real_sz;
kfree(notes_section);
}

+ if (nr_notes)
+ *nr_notes = nr_ptnote;
+ if (notes_sz)
+ *notes_sz = phdr_sz;
+
+ return 0;
+}
+
+static int __init get_note_number_and_size_elf32(const Elf32_Ehdr *ehdr_ptr,
+ int *nr_ptnote, u64 *phdr_sz)
+{
+ return process_note_headers_elf32(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
+}
+
+static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
+{
+ return process_note_headers_elf32(ehdr_ptr, NULL, NULL, notes_buf);
+}
+
+/* Merges all the PT_NOTE headers into one. */
+static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
+ char **notes_buf, size_t *notes_sz)
+{
+ int i, nr_ptnote=0, rc=0;
+ char *tmp;
+ Elf32_Ehdr *ehdr_ptr;
+ Elf32_Phdr phdr;
+ u64 phdr_sz = 0, note_off;
+ struct vm_struct *vm;
+
+ ehdr_ptr = (Elf32_Ehdr *)elfptr;
+
+ rc = get_note_number_and_size_elf32(ehdr_ptr, &nr_ptnote, &phdr_sz);
+ if (rc < 0)
+ return rc;
+
+ *notes_sz = roundup(phdr_sz, PAGE_SIZE);
+ *notes_buf = vzalloc(*notes_sz);
+ if (!*notes_buf)
+ return -ENOMEM;
+
+ vm = find_vm_area(*notes_buf);
+ BUG_ON(!vm);
+ vm->flags |= VM_USERMAP;
+
+ rc = copy_notes_elf32(ehdr_ptr, *notes_buf);
+ if (rc < 0)
+ return rc;
+
/* Prepare merged PT_NOTE program header. */
phdr.p_type = PT_NOTE;
phdr.p_flags = 0;
@@ -391,6 +530,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
* the new offset fields of exported program headers. */
static int __init process_ptload_program_headers_elf64(char *elfptr,
size_t elfsz,
+ size_t elfnotes_sz,
struct list_head *vc_list)
{
int i;
@@ -402,8 +542,8 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
ehdr_ptr = (Elf64_Ehdr *)elfptr;
phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */

- /* First program header is PT_NOTE header. */
- vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
+ /* Skip Elf header, program headers and Elf note segment. */
+ vmcore_off = elfsz + elfnotes_sz;

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
u64 paddr, start, end, size;
@@ -433,6 +573,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,

static int __init process_ptload_program_headers_elf32(char *elfptr,
size_t elfsz,
+ size_t elfnotes_sz,
struct list_head *vc_list)
{
int i;
@@ -444,8 +585,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
ehdr_ptr = (Elf32_Ehdr *)elfptr;
phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */

- /* First program header is PT_NOTE header. */
- vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
+ /* Skip Elf header, program headers and Elf note segment. */
+ vmcore_off = elfsz + elfnotes_sz;

for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
u64 paddr, start, end, size;
@@ -474,17 +615,15 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
+static void __init set_vmcore_list_offsets_elf64(size_t elfsz,
+ size_t elfnotes_sz,
struct list_head *vc_list)
{
loff_t vmcore_off;
- Elf64_Ehdr *ehdr_ptr;
struct vmcore *m;

- ehdr_ptr = (Elf64_Ehdr *)elfptr;
-
- /* Skip Elf header and program headers. */
- vmcore_off = elfsz;
+ /* Skip Elf header, program headers and Elf note segment. */
+ vmcore_off = elfsz + elfnotes_sz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -493,17 +632,15 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
}

/* Sets offset fields of vmcore elements. */
-static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
+static void __init set_vmcore_list_offsets_elf32(size_t elfsz,
+ size_t elfnotes_sz,
struct list_head *vc_list)
{
loff_t vmcore_off;
- Elf32_Ehdr *ehdr_ptr;
struct vmcore *m;

- ehdr_ptr = (Elf32_Ehdr *)elfptr;
-
- /* Skip Elf header and program headers. */
- vmcore_off = elfsz;
+ /* Skip Elf header, program headers and Elf note segment. */
+ vmcore_off = elfsz + elfnotes_sz;

list_for_each_entry(m, vc_list, list) {
m->offset = vmcore_off;
@@ -554,20 +691,23 @@ static int __init parse_crash_elf64_headers(void)
}

/* Merge all PT_NOTE headers into one. */
- rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
+ rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz,
+ &elfnotes_buf, &elfnotes_sz);
if (rc) {
free_pages((unsigned long)elfcorebuf,
get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
- &vmcore_list);
+ elfnotes_sz,
+ &vmcore_list);
if (rc) {
free_pages((unsigned long)elfcorebuf,
get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
+ set_vmcore_list_offsets_elf64(elfcorebuf_sz, elfnotes_sz,
+ &vmcore_list);
return 0;
}

@@ -614,20 +754,23 @@ static int __init parse_crash_elf32_headers(void)
}

/* Merge all PT_NOTE headers into one. */
- rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
+ rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz,
+ &elfnotes_buf, &elfnotes_sz);
if (rc) {
free_pages((unsigned long)elfcorebuf,
get_order(elfcorebuf_sz_orig));
return rc;
}
rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
- &vmcore_list);
+ elfnotes_sz,
+ &vmcore_list);
if (rc) {
free_pages((unsigned long)elfcorebuf,
get_order(elfcorebuf_sz_orig));
return rc;
}
- set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
+ set_vmcore_list_offsets_elf32(elfcorebuf_sz, elfnotes_sz,
+ &vmcore_list);
return 0;
}

@@ -706,6 +849,8 @@ void vmcore_cleanup(void)
list_del(&m->list);
kfree(m);
}
+ vfree(elfnotes_buf);
+ elfnotes_buf = NULL;
free_pages((unsigned long)elfcorebuf,
get_order(elfcorebuf_sz_orig));
elfcorebuf = NULL;

2013-05-15 09:06:35

by Hatayama, Daisuke

Subject: [PATCH v6 8/8] vmcore: support mmap() on /proc/vmcore

This patch introduces mmap_vmcore().

Don't permit a writable or executable mapping even via mprotect(),
because this mmap() is aimed at reading crash dump memory.
A non-writable mapping is also a requirement of remap_pfn_range() when
mapping linear pages onto non-consecutive physical pages; see
is_cow_mapping().

Set the VM_MIXEDMAP flag so that memory can be remapped both by
remap_pfn_range() and by remap_vmalloc_range_partial() within a single
vma; do_munmap() can then correctly clean up a vma that was only
partially remapped by the two functions in the error case. See
zap_pte_range(), vm_normal_page() and their comments for details.

On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
limitation comes from the fact that the third argument of
remap_pfn_range(), pfn, is a 32-bit unsigned long on x86-32.
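
For illustration: with elfcorebuf_sz and elfnotes_sz each one page, a
request to map file offsets [0, 3 pages) is satisfied by
remap_pfn_range() for the ELF header page, remap_vmalloc_range_partial()
for the note page, and remap_pfn_range() again for the first page of
the first object in vmcore_list; if one of the later steps fails,
do_munmap() tears down what has already been established.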

Signed-off-by: HATAYAMA Daisuke <[email protected]>
---

fs/proc/vmcore.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 86 insertions(+), 0 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 7f2041c..2c72487 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -20,6 +20,7 @@
#include <linux/init.h>
#include <linux/crash_dump.h>
#include <linux/list.h>
+#include <linux/vmalloc.h>
#include <asm/uaccess.h>
#include <asm/io.h>
#include "internal.h"
@@ -200,9 +201,94 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}

+static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
+{
+ size_t size = vma->vm_end - vma->vm_start;
+ u64 start, end, len, tsz;
+ struct vmcore *m;
+
+ start = (u64)vma->vm_pgoff << PAGE_SHIFT;
+ end = start + size;
+
+ if (size > vmcore_size || end > vmcore_size)
+ return -EINVAL;
+
+ if (vma->vm_flags & (VM_WRITE | VM_EXEC))
+ return -EPERM;
+
+ vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
+ vma->vm_flags |= VM_MIXEDMAP;
+
+ len = 0;
+
+ if (start < elfcorebuf_sz) {
+ u64 pfn;
+
+ tsz = elfcorebuf_sz - start;
+ if (size < tsz)
+ tsz = size;
+ pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
+ if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
+ vma->vm_page_prot))
+ return -EAGAIN;
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+
+ if (start < elfcorebuf_sz + elfnotes_sz) {
+ void *kaddr;
+
+ tsz = elfcorebuf_sz + elfnotes_sz - start;
+ if (size < tsz)
+ tsz = size;
+ kaddr = elfnotes_buf + start - elfcorebuf_sz;
+ if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
+ kaddr, tsz)) {
+ do_munmap(vma->vm_mm, vma->vm_start, len);
+ return -EAGAIN;
+ }
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+
+ list_for_each_entry(m, &vmcore_list, list) {
+ if (start < m->offset + m->size) {
+ u64 paddr = 0;
+
+ tsz = m->offset + m->size - start;
+ if (size < tsz)
+ tsz = size;
+ paddr = m->paddr + start - m->offset;
+ if (remap_pfn_range(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot)) {
+ do_munmap(vma->vm_mm, vma->vm_start, len);
+ return -EAGAIN;
+ }
+ size -= tsz;
+ start += tsz;
+ len += tsz;
+
+ if (size == 0)
+ return 0;
+ }
+ }
+
+ return 0;
+}
+
static const struct file_operations proc_vmcore_operations = {
.read = read_vmcore,
.llseek = default_llseek,
+ .mmap = mmap_vmcore,
};

static struct vmcore* __init get_new_element(void)

2013-05-15 09:35:34

by Zhang Yanfei

Subject: Re: [PATCH v6 1/8] vmcore: clean up read_vmcore()

On 2013/05/15 17:05, HATAYAMA Daisuke wrote:
> Rewrite part of read_vmcore() that reads objects in vmcore_list in the
> same way as part reading ELF headers, by which some duplicated and
> redundant codes are removed.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> Acked-by: Vivek Goyal <[email protected]>

This cleanup really makes the code clearer.

Just one minor nitpick below.

Acked-by: Zhang Yanfei <[email protected]>

> ---
>
> fs/proc/vmcore.c | 68 ++++++++++++++++--------------------------------------
> 1 files changed, 20 insertions(+), 48 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 17f7e08..ab0c92e 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -118,27 +118,6 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
> return read;
> }
>
> -/* Maps vmcore file offset to respective physical address in memroy. */
> -static u64 map_offset_to_paddr(loff_t offset, struct list_head *vc_list,
> - struct vmcore **m_ptr)
> -{
> - struct vmcore *m;
> - u64 paddr;
> -
> - list_for_each_entry(m, vc_list, list) {
> - u64 start, end;
> - start = m->offset;
> - end = m->offset + m->size - 1;
> - if (offset >= start && offset <= end) {
> - paddr = m->paddr + offset - start;
> - *m_ptr = m;
> - return paddr;
> - }
> - }
> - *m_ptr = NULL;
> - return 0;
> -}
> -
> /* Read from the ELF header and then the crash dump. On error, negative value is
> * returned otherwise number of bytes read are returned.
> */
> @@ -147,8 +126,8 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> {
> ssize_t acc = 0, tmp;
> size_t tsz;
> - u64 start, nr_bytes;
> - struct vmcore *curr_m = NULL;
> + u64 start;
> + struct vmcore *m = NULL;
>
> if (buflen == 0 || *fpos >= vmcore_size)
> return 0;
> @@ -174,33 +153,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> - start = map_offset_to_paddr(*fpos, &vmcore_list, &curr_m);
> - if (!curr_m)
> - return -EINVAL;
> -
> - while (buflen) {
> - tsz = min_t(size_t, buflen, PAGE_SIZE - (start & ~PAGE_MASK));
> -
> - /* Calculate left bytes in current memory segment. */
> - nr_bytes = (curr_m->size - (start - curr_m->paddr));
> - if (tsz > nr_bytes)
> - tsz = nr_bytes;
> -
> - tmp = read_from_oldmem(buffer, tsz, &start, 1);
> - if (tmp < 0)
> - return tmp;
> - buflen -= tsz;
> - *fpos += tsz;
> - buffer += tsz;
> - acc += tsz;
> - if (start >= (curr_m->paddr + curr_m->size)) {
> - if (curr_m->list.next == &vmcore_list)
> - return acc; /*EOF*/
> - curr_m = list_entry(curr_m->list.next,
> - struct vmcore, list);
> - start = curr_m->paddr;
> + list_for_each_entry(m, &vmcore_list, list) {
> + if (*fpos < m->offset + m->size) {
> + tsz = m->offset + m->size - *fpos;
> + if (buflen < tsz)
> + tsz = buflen;

if (tsz > buflen)
tsz = buflen;

seems better.

Or you can use a min_t here:

tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);


> + start = m->paddr + *fpos - m->offset;
> + tmp = read_from_oldmem(buffer, tsz, &start, 1);
> + if (tmp < 0)
> + return tmp;
> + buflen -= tsz;
> + *fpos += tsz;
> + buffer += tsz;
> + acc += tsz;
> +
> + /* leave now if filled buffer already */
> + if (buflen == 0)
> + return acc;
> }
> }
> +
> return acc;
> }
>
>
>

2013-05-15 21:38:14

by KOSAKI Motohiro

Subject: Re: [PATCH v6 4/8] vmalloc: make find_vm_area check in range

On Wed, May 15, 2013 at 5:06 AM, HATAYAMA Daisuke
<[email protected]> wrote:
> Currently, __find_vmap_area searches for the kernel VM area starting
> at a given address. This patch changes this behavior so that it
> searches for the kernel VM area to which the address belongs. This
> change is needed by remap_vmalloc_range_partial to be introduced in
> later patch that receives any position of kernel VM area as target
> address.
>
> This patch changes the condition (addr > va->va_start) to the
> equivalent (addr >= va->va_end) by taking advantage of the fact that
> each kernel VM area is non-overlapping.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---
>
> mm/vmalloc.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
> index d365724..3875fa2 100644
> --- a/mm/vmalloc.c
> +++ b/mm/vmalloc.c
> @@ -292,7 +292,7 @@ static struct vmap_area *__find_vmap_area(unsigned long addr)
> va = rb_entry(n, struct vmap_area, rb_node);
> if (addr < va->va_start)
> n = n->rb_left;
> - else if (addr > va->va_start)
> + else if (addr >= va->va_end)
> n = n->rb_right;

OK. This is natural definition. Looks good.

Acked-by: KOSAKI Motohiro <[email protected]>

2013-05-16 06:08:16

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v6 2/8] vmcore: allocate buffer for ELF headers on page-size alignment

On 2013/05/15 17:05, HATAYAMA Daisuke wrote:
> Allocate ELF headers on page-size boundary using __get_free_pages()
> instead of kmalloc().
>
> Later patch will merge PT_NOTE entries into a single unique one and
> decrease the buffer size actually used. Keep original buffer size in
> variable elfcorebuf_sz_orig to kfree the buffer later and actually
> used buffer size with rounded up to page-size boundary in variable
> elfcorebuf_sz separately.
>
> The size of part of the ELF buffer exported from /proc/vmcore is
> elfcorebuf_sz.
>
> The merged, removed PT_NOTE entries, i.e. the range [elfcorebuf_sz,
> elfcorebuf_sz_orig], is filled with 0.
>
> Use size of the ELF headers as an initial offset value in
> set_vmcore_list_offsets_elf{64,32} and
> process_ptload_program_headers_elf{64,32} in order to indicate that
> the offset includes the holes towards the page boundary.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---

Acked-by: Zhang Yanfei <[email protected]>

>
> fs/proc/vmcore.c | 80 ++++++++++++++++++++++++++++++------------------------
> 1 files changed, 45 insertions(+), 35 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index ab0c92e..48886e6 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -32,6 +32,7 @@ static LIST_HEAD(vmcore_list);
> /* Stores the pointer to the buffer containing kernel elf core headers. */
> static char *elfcorebuf;
> static size_t elfcorebuf_sz;
> +static size_t elfcorebuf_sz_orig;
>
> /* Total size of vmcore file. */
> static u64 vmcore_size;
> @@ -186,7 +187,7 @@ static struct vmcore* __init get_new_element(void)
> return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
> }
>
> -static u64 __init get_vmcore_size_elf64(char *elfptr)
> +static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
> {
> int i;
> u64 size;
> @@ -195,7 +196,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
>
> ehdr_ptr = (Elf64_Ehdr *)elfptr;
> phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
> - size = sizeof(Elf64_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr));
> + size = elfsz;
> for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> size += phdr_ptr->p_memsz;
> phdr_ptr++;
> @@ -203,7 +204,7 @@ static u64 __init get_vmcore_size_elf64(char *elfptr)
> return size;
> }
>
> -static u64 __init get_vmcore_size_elf32(char *elfptr)
> +static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
> {
> int i;
> u64 size;
> @@ -212,7 +213,7 @@ static u64 __init get_vmcore_size_elf32(char *elfptr)
>
> ehdr_ptr = (Elf32_Ehdr *)elfptr;
> phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
> - size = sizeof(Elf32_Ehdr) + ((ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr));
> + size = elfsz;
> for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> size += phdr_ptr->p_memsz;
> phdr_ptr++;
> @@ -280,7 +281,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> phdr.p_flags = 0;
> note_off = sizeof(Elf64_Ehdr) +
> (ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf64_Phdr);
> - phdr.p_offset = note_off;
> + phdr.p_offset = roundup(note_off, PAGE_SIZE);
> phdr.p_vaddr = phdr.p_paddr = 0;
> phdr.p_filesz = phdr.p_memsz = phdr_sz;
> phdr.p_align = 0;
> @@ -294,6 +295,8 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> i = (nr_ptnote - 1) * sizeof(Elf64_Phdr);
> *elfsz = *elfsz - i;
> memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf64_Ehdr)-sizeof(Elf64_Phdr)));
> + memset(elfptr + *elfsz, 0, i);
> + *elfsz = roundup(*elfsz, PAGE_SIZE);
>
> /* Modify e_phnum to reflect merged headers. */
> ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
> @@ -361,7 +364,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> phdr.p_flags = 0;
> note_off = sizeof(Elf32_Ehdr) +
> (ehdr_ptr->e_phnum - nr_ptnote +1) * sizeof(Elf32_Phdr);
> - phdr.p_offset = note_off;
> + phdr.p_offset = roundup(note_off, PAGE_SIZE);
> phdr.p_vaddr = phdr.p_paddr = 0;
> phdr.p_filesz = phdr.p_memsz = phdr_sz;
> phdr.p_align = 0;
> @@ -375,6 +378,8 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> i = (nr_ptnote - 1) * sizeof(Elf32_Phdr);
> *elfsz = *elfsz - i;
> memmove(tmp, tmp+i, ((*elfsz)-sizeof(Elf32_Ehdr)-sizeof(Elf32_Phdr)));
> + memset(elfptr + *elfsz, 0, i);
> + *elfsz = roundup(*elfsz, PAGE_SIZE);
>
> /* Modify e_phnum to reflect merged headers. */
> ehdr_ptr->e_phnum = ehdr_ptr->e_phnum - nr_ptnote + 1;
> @@ -398,9 +403,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
> phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
>
> /* First program header is PT_NOTE header. */
> - vmcore_off = sizeof(Elf64_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
> - phdr_ptr->p_memsz; /* Note sections */
> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> if (phdr_ptr->p_type != PT_LOAD)
> @@ -435,9 +438,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
>
> /* First program header is PT_NOTE header. */
> - vmcore_off = sizeof(Elf32_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
> - phdr_ptr->p_memsz; /* Note sections */
> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> if (phdr_ptr->p_type != PT_LOAD)
> @@ -459,7 +460,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> }
>
> /* Sets offset fields of vmcore elements. */
> -static void __init set_vmcore_list_offsets_elf64(char *elfptr,
> +static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
> struct list_head *vc_list)
> {
> loff_t vmcore_off;
> @@ -469,8 +470,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
> ehdr_ptr = (Elf64_Ehdr *)elfptr;
>
> /* Skip Elf header and program headers. */
> - vmcore_off = sizeof(Elf64_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr);
> + vmcore_off = elfsz;
>
> list_for_each_entry(m, vc_list, list) {
> m->offset = vmcore_off;
> @@ -479,7 +479,7 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr,
> }
>
> /* Sets offset fields of vmcore elements. */
> -static void __init set_vmcore_list_offsets_elf32(char *elfptr,
> +static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
> struct list_head *vc_list)
> {
> loff_t vmcore_off;
> @@ -489,8 +489,7 @@ static void __init set_vmcore_list_offsets_elf32(char *elfptr,
> ehdr_ptr = (Elf32_Ehdr *)elfptr;
>
> /* Skip Elf header and program headers. */
> - vmcore_off = sizeof(Elf32_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr);
> + vmcore_off = elfsz;
>
> list_for_each_entry(m, vc_list, list) {
> m->offset = vmcore_off;
> @@ -526,30 +525,35 @@ static int __init parse_crash_elf64_headers(void)
> }
>
> /* Read in all elf headers. */
> - elfcorebuf_sz = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
> - elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
> + elfcorebuf_sz_orig = sizeof(Elf64_Ehdr) + ehdr.e_phnum * sizeof(Elf64_Phdr);
> + elfcorebuf_sz = elfcorebuf_sz_orig;
> + elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + get_order(elfcorebuf_sz_orig));
> if (!elfcorebuf)
> return -ENOMEM;
> addr = elfcorehdr_addr;
> - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
> + rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
> if (rc < 0) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
>
> /* Merge all PT_NOTE headers into one. */
> rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> if (rc) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
> rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
> &vmcore_list);
> if (rc) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
> - set_vmcore_list_offsets_elf64(elfcorebuf, &vmcore_list);
> + set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
> return 0;
> }
>
> @@ -581,30 +585,35 @@ static int __init parse_crash_elf32_headers(void)
> }
>
> /* Read in all elf headers. */
> - elfcorebuf_sz = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
> - elfcorebuf = kmalloc(elfcorebuf_sz, GFP_KERNEL);
> + elfcorebuf_sz_orig = sizeof(Elf32_Ehdr) + ehdr.e_phnum * sizeof(Elf32_Phdr);
> + elfcorebuf_sz = elfcorebuf_sz_orig;
> + elfcorebuf = (void *) __get_free_pages(GFP_KERNEL | __GFP_ZERO,
> + get_order(elfcorebuf_sz_orig));
> if (!elfcorebuf)
> return -ENOMEM;
> addr = elfcorehdr_addr;
> - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz, &addr, 0);
> + rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
> if (rc < 0) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
>
> /* Merge all PT_NOTE headers into one. */
> rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> if (rc) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
> rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
> &vmcore_list);
> if (rc) {
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> return rc;
> }
> - set_vmcore_list_offsets_elf32(elfcorebuf, &vmcore_list);
> + set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
> return 0;
> }
>
> @@ -629,14 +638,14 @@ static int __init parse_crash_elf_headers(void)
> return rc;
>
> /* Determine vmcore size. */
> - vmcore_size = get_vmcore_size_elf64(elfcorebuf);
> + vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
> } else if (e_ident[EI_CLASS] == ELFCLASS32) {
> rc = parse_crash_elf32_headers();
> if (rc)
> return rc;
>
> /* Determine vmcore size. */
> - vmcore_size = get_vmcore_size_elf32(elfcorebuf);
> + vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
> } else {
> pr_warn("Warning: Core image elf header is not sane\n");
> return -EINVAL;
> @@ -683,7 +692,8 @@ void vmcore_cleanup(void)
> list_del(&m->list);
> kfree(m);
> }
> - kfree(elfcorebuf);
> + free_pages((unsigned long)elfcorebuf,
> + get_order(elfcorebuf_sz_orig));
> elfcorebuf = NULL;
> }
> EXPORT_SYMBOL_GPL(vmcore_cleanup);
>
>

2013-05-16 06:08:14

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v6 3/8] vmcore: treat memory chunks referenced by PT_LOAD program header entries in page-size boundary in vmcore_list

On 2013/05/15 17:05, HATAYAMA Daisuke wrote:
> Treat memory chunks referenced by PT_LOAD program header entries on
> page-size boundaries in vmcore_list. Formally, for each range [start,
> end], we set up the corresponding vmcore object in vmcore_list to
> [rounddown(start, PAGE_SIZE), roundup(end, PAGE_SIZE)].
>
> This change affects the layout of /proc/vmcore. The gaps generated by
> the rearrangement are newly made visible to applications as
> holes. Concretely, they are the two ranges [rounddown(start, PAGE_SIZE),
> start] and [end, roundup(end, PAGE_SIZE)].
>
> Suppose the variable m points at a vmcore object in vmcore_list, and
> the variable phdr points at the PT_LOAD program header that m
> corresponds to. Then, pictorially:
>
>   m->offset                     +---------------+
>                                 |     hole      |
>   phdr->p_offset =              +---------------+
>    m->offset + (paddr - start)  |               |\
>                                 | kernel memory | phdr->p_memsz
>                                 |               |/
>                                 +---------------+
>                                 |     hole      |
>   m->offset + m->size           +---------------+
>
> where m->offset and m->offset + m->size are always page-size aligned.
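(Illustrative example, assuming PAGE_SIZE = 4096: a PT_LOAD chunk with paddr = 0x1000800 and p_memsz = 0x2000 gives start = 0x1000000, end = 0x1003000 and a 0x3000-byte vmcore object; the leading and trailing holes are 0x800 bytes each, and phdr->p_offset ends up 0x800 bytes past the page-aligned m->offset.)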
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> Acked-by: Vivek Goyal <[email protected]>
> ---

Acked-by: Zhang Yanfei <[email protected]>

>
> fs/proc/vmcore.c | 30 ++++++++++++++++++++++--------
> 1 files changed, 22 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 48886e6..6cf7fbd 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -406,20 +406,27 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
> vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> + u64 paddr, start, end, size;
> +
> if (phdr_ptr->p_type != PT_LOAD)
> continue;
>
> + paddr = phdr_ptr->p_offset;
> + start = rounddown(paddr, PAGE_SIZE);
> + end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> + size = end - start;
> +
> /* Add this contiguous chunk of memory to vmcore list.*/
> new = get_new_element();
> if (!new)
> return -ENOMEM;
> - new->paddr = phdr_ptr->p_offset;
> - new->size = phdr_ptr->p_memsz;
> + new->paddr = start;
> + new->size = size;
> list_add_tail(&new->list, vc_list);
>
> /* Update the program header offset. */
> - phdr_ptr->p_offset = vmcore_off;
> - vmcore_off = vmcore_off + phdr_ptr->p_memsz;
> + phdr_ptr->p_offset = vmcore_off + (paddr - start);
> + vmcore_off = vmcore_off + size;
> }
> return 0;
> }
> @@ -441,20 +448,27 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> + u64 paddr, start, end, size;
> +
> if (phdr_ptr->p_type != PT_LOAD)
> continue;
>
> + paddr = phdr_ptr->p_offset;
> + start = rounddown(paddr, PAGE_SIZE);
> + end = roundup(paddr + phdr_ptr->p_memsz, PAGE_SIZE);
> + size = end - start;
> +
> /* Add this contiguous chunk of memory to vmcore list.*/
> new = get_new_element();
> if (!new)
> return -ENOMEM;
> - new->paddr = phdr_ptr->p_offset;
> - new->size = phdr_ptr->p_memsz;
> + new->paddr = start;
> + new->size = size;
> list_add_tail(&new->list, vc_list);
>
> /* Update the program header offset */
> - phdr_ptr->p_offset = vmcore_off;
> - vmcore_off = vmcore_off + phdr_ptr->p_memsz;
> + phdr_ptr->p_offset = vmcore_off + (paddr - start);
> + vmcore_off = vmcore_off + size;
> }
> return 0;
> }
>
>

2013-05-16 08:00:57

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v6 7/8] vmcore: calculate vmcore file size from buffer size and total size of vmcore objects

On 2013/05/15 17:06, HATAYAMA Daisuke wrote:
> The previous patches newly added holes before each chunk of memory and
> the holes need to be counted in the vmcore file size. There are two ways
> to count the file size in such a way:
>
> 1) suppose m is a pointer to the last vmcore object in vmcore_list;
> then the file size is (m->offset + m->size), or
>
> 2) calculate the sum of the sizes of the buffers for the ELF header,
> program headers, ELF note segments and the objects in vmcore_list.
>
> Although 1) is more direct and simpler than 2), 2) seems better in
> that it reflects the internal object structure of /proc/vmcore. Thus,
> this patch changes get_vmcore_size_elf{64, 32} so that they calculate
> the size in the way of 2).
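(Illustrative note: the two methods agree here because the objects in vmcore_list are laid out back to back starting right after the ELF headers and the note segment, so for the last object m, m->offset + m->size equals elfcorebuf_sz + elfnotes_sz plus the sum of all object sizes, which is exactly what 2) computes.)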
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> Acked-by: Vivek Goyal <[email protected]>
> ---

Acked-by: Zhang Yanfei <[email protected]>

>
> fs/proc/vmcore.c | 40 ++++++++++++++++++----------------------
> 1 files changed, 18 insertions(+), 22 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 4e121fda..7f2041c 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -210,36 +210,28 @@ static struct vmcore* __init get_new_element(void)
> return kzalloc(sizeof(struct vmcore), GFP_KERNEL);
> }
>
> -static u64 __init get_vmcore_size_elf64(char *elfptr, size_t elfsz)
> +static u64 __init get_vmcore_size_elf64(size_t elfsz, size_t elfnotesegsz,
> + struct list_head *vc_list)
> {
> - int i;
> u64 size;
> - Elf64_Ehdr *ehdr_ptr;
> - Elf64_Phdr *phdr_ptr;
> + struct vmcore *m;
>
> - ehdr_ptr = (Elf64_Ehdr *)elfptr;
> - phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
> - size = elfsz;
> - for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> - size += phdr_ptr->p_memsz;
> - phdr_ptr++;
> + size = elfsz + elfnotesegsz;
> + list_for_each_entry(m, vc_list, list) {
> + size += m->size;
> }
> return size;
> }
>
> -static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
> +static u64 __init get_vmcore_size_elf32(size_t elfsz, size_t elfnotesegsz,
> + struct list_head *vc_list)
> {
> - int i;
> u64 size;
> - Elf32_Ehdr *ehdr_ptr;
> - Elf32_Phdr *phdr_ptr;
> + struct vmcore *m;
>
> - ehdr_ptr = (Elf32_Ehdr *)elfptr;
> - phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
> - size = elfsz;
> - for (i = 0; i < ehdr_ptr->e_phnum; i++) {
> - size += phdr_ptr->p_memsz;
> - phdr_ptr++;
> + size = elfsz + elfnotesegsz;
> + list_for_each_entry(m, vc_list, list) {
> + size += m->size;
> }
> return size;
> }
> @@ -795,14 +787,18 @@ static int __init parse_crash_elf_headers(void)
> return rc;
>
> /* Determine vmcore size. */
> - vmcore_size = get_vmcore_size_elf64(elfcorebuf, elfcorebuf_sz);
> + vmcore_size = get_vmcore_size_elf64(elfcorebuf_sz,
> + elfnotes_sz,
> + &vmcore_list);
> } else if (e_ident[EI_CLASS] == ELFCLASS32) {
> rc = parse_crash_elf32_headers();
> if (rc)
> return rc;
>
> /* Determine vmcore size. */
> - vmcore_size = get_vmcore_size_elf32(elfcorebuf, elfcorebuf_sz);
> + vmcore_size = get_vmcore_size_elf32(elfcorebuf_sz,
> + elfnotes_sz,
> + &vmcore_list);
> } else {
> pr_warn("Warning: Core image elf header is not sane\n");
> return -EINVAL;
>
>

2013-05-16 08:01:00

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v6 8/8] vmcore: support mmap() on /proc/vmcore

On 2013/05/15 17:06, HATAYAMA Daisuke wrote:
> This patch introduces mmap_vmcore().
>
> Don't permit a writable or executable mapping even with mprotect()
> because this mmap() is aimed at reading crash dump memory.
> A non-writable mapping is also a requirement of remap_pfn_range() when
> mapping linear pages on non-consecutive physical pages; see
> is_cow_mapping().
>
> Set the VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
> remap_vmalloc_range_partial at the same time for a single
> vma. do_munmap() can correctly clean up a partially remapped vma built
> with the two functions in the abnormal case. See zap_pte_range(),
> vm_normal_page() and their comments for details.
>
> On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
> limitation comes from the fact that the third argument of
> remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.
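(Illustrative aside, not part of the patch: a 32-bit pfn covers 2^32 pages of 4KB each, i.e. 2^44 bytes = 16TB, hence the limit above. From userspace, the intended usage is roughly the sketch below; the 4096-byte mapping length assumes PAGE_SIZE = 4096, and error handling is trimmed.)

#include <elf.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        Elf64_Ehdr *ehdr;
        void *p;
        int fd;

        fd = open("/proc/vmcore", O_RDONLY);
        if (fd < 0)
                return 1;
        /* Offset 0 is page-aligned; PROT_READ only, since the patch
         * rejects writable or executable mappings. */
        p = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;
        ehdr = p;
        printf("e_phnum = %u\n", (unsigned)ehdr->e_phnum);
        munmap(p, 4096);
        close(fd);
        return 0;
}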
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---

Assuming that patch 4 & 5 of this series are ok:

Acked-by: Zhang Yanfei <[email protected]>

>
> fs/proc/vmcore.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 86 insertions(+), 0 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 7f2041c..2c72487 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -20,6 +20,7 @@
> #include <linux/init.h>
> #include <linux/crash_dump.h>
> #include <linux/list.h>
> +#include <linux/vmalloc.h>
> #include <asm/uaccess.h>
> #include <asm/io.h>
> #include "internal.h"
> @@ -200,9 +201,94 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> +static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> +{
> + size_t size = vma->vm_end - vma->vm_start;
> + u64 start, end, len, tsz;
> + struct vmcore *m;
> +
> + start = (u64)vma->vm_pgoff << PAGE_SHIFT;
> + end = start + size;
> +
> + if (size > vmcore_size || end > vmcore_size)
> + return -EINVAL;
> +
> + if (vma->vm_flags & (VM_WRITE | VM_EXEC))
> + return -EPERM;
> +
> + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> + vma->vm_flags |= VM_MIXEDMAP;
> +
> + len = 0;
> +
> + if (start < elfcorebuf_sz) {
> + u64 pfn;
> +
> + tsz = elfcorebuf_sz - start;
> + if (size < tsz)
> + tsz = size;
> + pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
> + if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
> + vma->vm_page_prot))
> + return -EAGAIN;
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> +
> + if (start < elfcorebuf_sz + elfnotes_sz) {
> + void *kaddr;
> +
> + tsz = elfcorebuf_sz + elfnotes_sz - start;
> + if (size < tsz)
> + tsz = size;
> + kaddr = elfnotes_buf + start - elfcorebuf_sz;
> + if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
> + kaddr, tsz)) {
> + do_munmap(vma->vm_mm, vma->vm_start, len);
> + return -EAGAIN;
> + }
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> +
> + list_for_each_entry(m, &vmcore_list, list) {
> + if (start < m->offset + m->size) {
> + u64 paddr = 0;
> +
> + tsz = m->offset + m->size - start;
> + if (size < tsz)
> + tsz = size;
> + paddr = m->paddr + start - m->offset;
> + if (remap_pfn_range(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot)) {
> + do_munmap(vma->vm_mm, vma->vm_start, len);
> + return -EAGAIN;
> + }
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> + }
> +
> + return 0;
> +}
> +
> static const struct file_operations proc_vmcore_operations = {
> .read = read_vmcore,
> .llseek = default_llseek,
> + .mmap = mmap_vmcore,
> };
>
> static struct vmcore* __init get_new_element(void)
>
>
> _______________________________________________
> kexec mailing list
> [email protected]
> http://lists.infradead.org/mailman/listinfo/kexec
>

2013-05-16 08:01:04

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH v6 6/8] vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory

On 2013/05/15 17:06, HATAYAMA Daisuke wrote:
> The reasons why we don't allocate the ELF note segment in the 1st kernel
> (old memory) on a page boundary are to keep backward compatibility for
> old kernels, and that doing so would waste a non-negligible amount of
> memory due to the round-up needed to fit the memory to page boundaries,
> since most of the buffers are in per-cpu areas.
>
> ELF notes are per-cpu, so the total size of the ELF note segments
> depends on the number of CPUs. The current maximum number of CPUs on
> x86_64 is 5192, and there is already a system with 4192 CPUs at SGI,
> where the total size amounts to 1MB. This can be larger in the near
> future, or possibly even now on another architecture that has a larger
> note size per CPU. Thus, to avoid the case where memory allocation for
> a large block fails, we allocate the ELF note segment buffer in vmalloc
> memory.
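(Rough, illustrative arithmetic: 1MB spread over roughly four thousand CPUs is a few hundred bytes of note data per CPU, about the size of a single per-cpu prstatus note. A physically contiguous allocation of the total would need an order-8 page block from __get_free_pages(), which can easily fail on a fragmented crash kernel, whereas vmalloc only needs individual pages.)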
>
> This patch adds the elfnotes_buf and elfnotes_sz variables to keep a
> pointer to the ELF note segment buffer and its size. There is no longer
> a vmcore object that corresponds to the ELF note segment in
> vmcore_list. Accordingly, read_vmcore() gets a new case for the ELF note
> segment, and set_vmcore_list_offsets_elf{64,32}() and the other helper
> functions start calculating offsets from the sum of the size of the ELF
> headers and the size of the ELF note segment.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---

Acked-by: Zhang Yanfei <[email protected]>

>
> fs/proc/vmcore.c | 273 +++++++++++++++++++++++++++++++++++++++++-------------
> 1 files changed, 209 insertions(+), 64 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 6cf7fbd..4e121fda 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -34,6 +34,9 @@ static char *elfcorebuf;
> static size_t elfcorebuf_sz;
> static size_t elfcorebuf_sz_orig;
>
> +static char *elfnotes_buf;
> +static size_t elfnotes_sz;
> +
> /* Total size of vmcore file. */
> static u64 vmcore_size;
>
> @@ -154,6 +157,26 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> + /* Read Elf note segment */
> + if (*fpos < elfcorebuf_sz + elfnotes_sz) {
> + void *kaddr;
> +
> + tsz = elfcorebuf_sz + elfnotes_sz - *fpos;
> + if (buflen < tsz)
> + tsz = buflen;
> + kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
> + if (copy_to_user(buffer, kaddr, tsz))
> + return -EFAULT;
> + buflen -= tsz;
> + *fpos += tsz;
> + buffer += tsz;
> + acc += tsz;
> +
> + /* leave now if filled buffer already */
> + if (buflen == 0)
> + return acc;
> + }
> +
> list_for_each_entry(m, &vmcore_list, list) {
> if (*fpos < m->offset + m->size) {
> tsz = m->offset + m->size - *fpos;
> @@ -221,23 +244,33 @@ static u64 __init get_vmcore_size_elf32(char *elfptr, size_t elfsz)
> return size;
> }
>
> -/* Merges all the PT_NOTE headers into one. */
> -static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> - struct list_head *vc_list)
> +/**
> + * process_note_headers_elf64 - Perform a variety of processing on ELF
> + * note segments according to the combination of function arguments.
> + *
> + * @ehdr_ptr - ELF header buffer
> + * @nr_notes - the number of program header entries of PT_NOTE type
> + * @notes_sz - total size of ELF note segment
> + * @notes_buf - buffer into which ELF note segment is copied
> + *
> + * Assume @ehdr_ptr is always not NULL. If @nr_notes is not NULL, then
> + * the number of program header entries of PT_NOTE type is assigned to
> + * @nr_notes. If @notes_sz is not NULL, then total size of ELF note
> + * segment, header part plus data part, is assigned to @notes_sz. If
> + * @notes_buf is not NULL, then ELF note segment is copied into
> + * @notes_buf.
> + */
> +static int __init process_note_headers_elf64(const Elf64_Ehdr *ehdr_ptr,
> + int *nr_notes, u64 *notes_sz,
> + char *notes_buf)
> {
> int i, nr_ptnote=0, rc=0;
> - char *tmp;
> - Elf64_Ehdr *ehdr_ptr;
> - Elf64_Phdr phdr, *phdr_ptr;
> + Elf64_Phdr *phdr_ptr = (Elf64_Phdr*)(ehdr_ptr + 1);
> Elf64_Nhdr *nhdr_ptr;
> - u64 phdr_sz = 0, note_off;
> + u64 phdr_sz = 0;
>
> - ehdr_ptr = (Elf64_Ehdr *)elfptr;
> - phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr));
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> - int j;
> void *notes_section;
> - struct vmcore *new;
> u64 offset, max_sz, sz, real_sz = 0;
> if (phdr_ptr->p_type != PT_NOTE)
> continue;
> @@ -253,7 +286,7 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> return rc;
> }
> nhdr_ptr = notes_section;
> - for (j = 0; j < max_sz; j += sz) {
> + while (real_sz < max_sz) {
> if (nhdr_ptr->n_namesz == 0)
> break;
> sz = sizeof(Elf64_Nhdr) +
> @@ -262,20 +295,68 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> real_sz += sz;
> nhdr_ptr = (Elf64_Nhdr*)((char*)nhdr_ptr + sz);
> }
> -
> - /* Add this contiguous chunk of notes section to vmcore list.*/
> - new = get_new_element();
> - if (!new) {
> - kfree(notes_section);
> - return -ENOMEM;
> + if (notes_buf) {
> + offset = phdr_ptr->p_offset;
> + rc = read_from_oldmem(notes_buf + phdr_sz, real_sz,
> + &offset, 0);
> + if (rc < 0) {
> + kfree(notes_section);
> + return rc;
> + }
> }
> - new->paddr = phdr_ptr->p_offset;
> - new->size = real_sz;
> - list_add_tail(&new->list, vc_list);
> phdr_sz += real_sz;
> kfree(notes_section);
> }
>
> + if (nr_notes)
> + *nr_notes = nr_ptnote;
> + if (notes_sz)
> + *notes_sz = phdr_sz;
> +
> + return 0;
> +}
> +
> +static int __init get_note_number_and_size_elf64(const Elf64_Ehdr *ehdr_ptr,
> + int *nr_ptnote, u64 *phdr_sz)
> +{
> + return process_note_headers_elf64(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
> +}
> +
> +static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr_ptr, char *notes_buf)
> +{
> + return process_note_headers_elf64(ehdr_ptr, NULL, NULL, notes_buf);
> +}
> +
> +/* Merges all the PT_NOTE headers into one. */
> +static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> + char **notes_buf, size_t *notes_sz)
> +{
> + int i, nr_ptnote=0, rc=0;
> + char *tmp;
> + Elf64_Ehdr *ehdr_ptr;
> + Elf64_Phdr phdr;
> + u64 phdr_sz = 0, note_off;
> + struct vm_struct *vm;
> +
> + ehdr_ptr = (Elf64_Ehdr *)elfptr;
> +
> + rc = get_note_number_and_size_elf64(ehdr_ptr, &nr_ptnote, &phdr_sz);
> + if (rc < 0)
> + return rc;
> +
> + *notes_sz = roundup(phdr_sz, PAGE_SIZE);
> + *notes_buf = vzalloc(*notes_sz);
> + if (!*notes_buf)
> + return -ENOMEM;
> +
> + vm = find_vm_area(*notes_buf);
> + BUG_ON(!vm);
> + vm->flags |= VM_USERMAP;
> +
> + rc = copy_notes_elf64(ehdr_ptr, *notes_buf);
> + if (rc < 0)
> + return rc;
> +
> /* Prepare merged PT_NOTE program header. */
> phdr.p_type = PT_NOTE;
> phdr.p_flags = 0;
> @@ -304,23 +385,33 @@ static int __init merge_note_headers_elf64(char *elfptr, size_t *elfsz,
> return 0;
> }
>
> -/* Merges all the PT_NOTE headers into one. */
> -static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> - struct list_head *vc_list)
> +/**
> + * process_note_headers_elf32 - Perform a variety of processing on ELF
> + * note segments according to the combination of function arguments.
> + *
> + * @ehdr_ptr - ELF header buffer
> + * @nr_notes - the number of program header entries of PT_NOTE type
> + * @notes_sz - total size of ELF note segment
> + * @notes_buf - buffer into which ELF note segment is copied
> + *
> + * Assume @ehdr_ptr is always not NULL. If @nr_notes is not NULL, then
> + * the number of program header entries of PT_NOTE type is assigned to
> + * @nr_notes. If @notes_sz is not NULL, then total size of ELF note
> + * segment, header part plus data part, is assigned to @notes_sz. If
> + * @notes_buf is not NULL, then ELF note segment is copied into
> + * @notes_buf.
> + */
> +static int __init process_note_headers_elf32(const Elf32_Ehdr *ehdr_ptr,
> + int *nr_notes, u64 *notes_sz,
> + char *notes_buf)
> {
> int i, nr_ptnote=0, rc=0;
> - char *tmp;
> - Elf32_Ehdr *ehdr_ptr;
> - Elf32_Phdr phdr, *phdr_ptr;
> + Elf32_Phdr *phdr_ptr = (Elf32_Phdr*)(ehdr_ptr + 1);
> Elf32_Nhdr *nhdr_ptr;
> - u64 phdr_sz = 0, note_off;
> + u64 phdr_sz = 0;
>
> - ehdr_ptr = (Elf32_Ehdr *)elfptr;
> - phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr));
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> - int j;
> void *notes_section;
> - struct vmcore *new;
> u64 offset, max_sz, sz, real_sz = 0;
> if (phdr_ptr->p_type != PT_NOTE)
> continue;
> @@ -336,7 +427,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> return rc;
> }
> nhdr_ptr = notes_section;
> - for (j = 0; j < max_sz; j += sz) {
> + while (real_sz < max_sz) {
> if (nhdr_ptr->n_namesz == 0)
> break;
> sz = sizeof(Elf32_Nhdr) +
> @@ -345,20 +436,68 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> real_sz += sz;
> nhdr_ptr = (Elf32_Nhdr*)((char*)nhdr_ptr + sz);
> }
> -
> - /* Add this contiguous chunk of notes section to vmcore list.*/
> - new = get_new_element();
> - if (!new) {
> - kfree(notes_section);
> - return -ENOMEM;
> + if (notes_buf) {
> + offset = phdr_ptr->p_offset;
> + rc = read_from_oldmem(notes_buf + phdr_sz, real_sz,
> + &offset, 0);
> + if (rc < 0) {
> + kfree(notes_section);
> + return rc;
> + }
> }
> - new->paddr = phdr_ptr->p_offset;
> - new->size = real_sz;
> - list_add_tail(&new->list, vc_list);
> phdr_sz += real_sz;
> kfree(notes_section);
> }
>
> + if (nr_notes)
> + *nr_notes = nr_ptnote;
> + if (notes_sz)
> + *notes_sz = phdr_sz;
> +
> + return 0;
> +}
> +
> +static int __init get_note_number_and_size_elf32(const Elf32_Ehdr *ehdr_ptr,
> + int *nr_ptnote, u64 *phdr_sz)
> +{
> + return process_note_headers_elf32(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
> +}
> +
> +static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
> +{
> + return process_note_headers_elf32(ehdr_ptr, NULL, NULL, notes_buf);
> +}
> +
> +/* Merges all the PT_NOTE headers into one. */
> +static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> + char **notes_buf, size_t *notes_sz)
> +{
> + int i, nr_ptnote=0, rc=0;
> + char *tmp;
> + Elf32_Ehdr *ehdr_ptr;
> + Elf32_Phdr phdr;
> + u64 phdr_sz = 0, note_off;
> + struct vm_struct *vm;
> +
> + ehdr_ptr = (Elf32_Ehdr *)elfptr;
> +
> + rc = get_note_number_and_size_elf32(ehdr_ptr, &nr_ptnote, &phdr_sz);
> + if (rc < 0)
> + return rc;
> +
> + *notes_sz = roundup(phdr_sz, PAGE_SIZE);
> + *notes_buf = vzalloc(*notes_sz);
> + if (!*notes_buf)
> + return -ENOMEM;
> +
> + vm = find_vm_area(*notes_buf);
> + BUG_ON(!vm);
> + vm->flags |= VM_USERMAP;
> +
> + rc = copy_notes_elf32(ehdr_ptr, *notes_buf);
> + if (rc < 0)
> + return rc;
> +
> /* Prepare merged PT_NOTE program header. */
> phdr.p_type = PT_NOTE;
> phdr.p_flags = 0;
> @@ -391,6 +530,7 @@ static int __init merge_note_headers_elf32(char *elfptr, size_t *elfsz,
> * the new offset fields of exported program headers. */
> static int __init process_ptload_program_headers_elf64(char *elfptr,
> size_t elfsz,
> + size_t elfnotes_sz,
> struct list_head *vc_list)
> {
> int i;
> @@ -402,8 +542,8 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
> ehdr_ptr = (Elf64_Ehdr *)elfptr;
> phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
>
> - /* First program header is PT_NOTE header. */
> - vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
> + /* Skip Elf header, program headers and Elf note segment. */
> + vmcore_off = elfsz + elfnotes_sz;
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> u64 paddr, start, end, size;
> @@ -433,6 +573,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>
> static int __init process_ptload_program_headers_elf32(char *elfptr,
> size_t elfsz,
> + size_t elfnotes_sz,
> struct list_head *vc_list)
> {
> int i;
> @@ -444,8 +585,8 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> ehdr_ptr = (Elf32_Ehdr *)elfptr;
> phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
>
> - /* First program header is PT_NOTE header. */
> - vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
> + /* Skip Elf header, program headers and Elf note segment. */
> + vmcore_off = elfsz + elfnotes_sz;
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> u64 paddr, start, end, size;
> @@ -474,17 +615,15 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> }
>
> /* Sets offset fields of vmcore elements. */
> -static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
> +static void __init set_vmcore_list_offsets_elf64(size_t elfsz,
> + size_t elfnotes_sz,
> struct list_head *vc_list)
> {
> loff_t vmcore_off;
> - Elf64_Ehdr *ehdr_ptr;
> struct vmcore *m;
>
> - ehdr_ptr = (Elf64_Ehdr *)elfptr;
> -
> - /* Skip Elf header and program headers. */
> - vmcore_off = elfsz;
> + /* Skip Elf header, program headers and Elf note segment. */
> + vmcore_off = elfsz + elfnotes_sz;
>
> list_for_each_entry(m, vc_list, list) {
> m->offset = vmcore_off;
> @@ -493,17 +632,15 @@ static void __init set_vmcore_list_offsets_elf64(char *elfptr, size_t elfsz,
> }
>
> /* Sets offset fields of vmcore elements. */
> -static void __init set_vmcore_list_offsets_elf32(char *elfptr, size_t elfsz,
> +static void __init set_vmcore_list_offsets_elf32(size_t elfsz,
> + size_t elfnotes_sz,
> struct list_head *vc_list)
> {
> loff_t vmcore_off;
> - Elf32_Ehdr *ehdr_ptr;
> struct vmcore *m;
>
> - ehdr_ptr = (Elf32_Ehdr *)elfptr;
> -
> - /* Skip Elf header and program headers. */
> - vmcore_off = elfsz;
> + /* Skip Elf header, program headers and Elf note segment. */
> + vmcore_off = elfsz + elfnotes_sz;
>
> list_for_each_entry(m, vc_list, list) {
> m->offset = vmcore_off;
> @@ -554,20 +691,23 @@ static int __init parse_crash_elf64_headers(void)
> }
>
> /* Merge all PT_NOTE headers into one. */
> - rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> + rc = merge_note_headers_elf64(elfcorebuf, &elfcorebuf_sz,
> + &elfnotes_buf, &elfnotes_sz);
> if (rc) {
> free_pages((unsigned long)elfcorebuf,
> get_order(elfcorebuf_sz_orig));
> return rc;
> }
> rc = process_ptload_program_headers_elf64(elfcorebuf, elfcorebuf_sz,
> - &vmcore_list);
> + elfnotes_sz,
> + &vmcore_list);
> if (rc) {
> free_pages((unsigned long)elfcorebuf,
> get_order(elfcorebuf_sz_orig));
> return rc;
> }
> - set_vmcore_list_offsets_elf64(elfcorebuf, elfcorebuf_sz, &vmcore_list);
> + set_vmcore_list_offsets_elf64(elfcorebuf_sz, elfnotes_sz,
> + &vmcore_list);
> return 0;
> }
>
> @@ -614,20 +754,23 @@ static int __init parse_crash_elf32_headers(void)
> }
>
> /* Merge all PT_NOTE headers into one. */
> - rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz, &vmcore_list);
> + rc = merge_note_headers_elf32(elfcorebuf, &elfcorebuf_sz,
> + &elfnotes_buf, &elfnotes_sz);
> if (rc) {
> free_pages((unsigned long)elfcorebuf,
> get_order(elfcorebuf_sz_orig));
> return rc;
> }
> rc = process_ptload_program_headers_elf32(elfcorebuf, elfcorebuf_sz,
> - &vmcore_list);
> + elfnotes_sz,
> + &vmcore_list);
> if (rc) {
> free_pages((unsigned long)elfcorebuf,
> get_order(elfcorebuf_sz_orig));
> return rc;
> }
> - set_vmcore_list_offsets_elf32(elfcorebuf, elfcorebuf_sz, &vmcore_list);
> + set_vmcore_list_offsets_elf32(elfcorebuf_sz, elfnotes_sz,
> + &vmcore_list);
> return 0;
> }
>
> @@ -706,6 +849,8 @@ void vmcore_cleanup(void)
> list_del(&m->list);
> kfree(m);
> }
> + vfree(elfnotes_buf);
> + elfnotes_buf = NULL;
> free_pages((unsigned long)elfcorebuf,
> get_order(elfcorebuf_sz_orig));
> elfcorebuf = NULL;
>
>

2013-05-16 16:51:44

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 2/8] vmcore: allocate buffer for ELF headers on page-size alignment

On Wed, May 15, 2013 at 06:05:51PM +0900, HATAYAMA Daisuke wrote:

[..]
> @@ -398,9 +403,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
> phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
>
> /* First program header is PT_NOTE header. */
> - vmcore_off = sizeof(Elf64_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
> - phdr_ptr->p_memsz; /* Note sections */
> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
> if (phdr_ptr->p_type != PT_LOAD)
> @@ -435,9 +438,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
> phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
>
> /* First program header is PT_NOTE header. */
> - vmcore_off = sizeof(Elf32_Ehdr) +
> - (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
> - phdr_ptr->p_memsz; /* Note sections */
> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);

Hmm.., so we are rounding up the ELF note data size too here. I think this
belongs in some other patch, as in this patch we are just rounding up the elf
headers.

This might create read problems too, as we have not taken care of this
rounding when adding the note to vc_list, and it might happen that we are
reading wrong data at a particular offset.

So maybe this rounding up should be done in later patches, when we take
care of copying the ELF notes data to the second kernel.

Vivek

2013-05-16 20:33:17

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 6/8] vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory

On Wed, May 15, 2013 at 06:06:14PM +0900, HATAYAMA Daisuke wrote:

[..]

> +static int __init get_note_number_and_size_elf32(const Elf32_Ehdr *ehdr_ptr,
> + int *nr_ptnote, u64 *phdr_sz)
> +{
> + return process_note_headers_elf32(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
> +}
> +
> +static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
> +{
> + return process_note_headers_elf32(ehdr_ptr, NULL, NULL, notes_buf);
> +}
> +

Please don't do this. We need to create two separate functions doing
two different operations, and not just create a wrapper around a function
which does two things.

I know both functions will have similar for loops for going through
the elf notes, but it is better than doing function overloading based
on the parameters passed.

Vivek

2013-05-16 20:45:15

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH v6 8/8] vmcore: support mmap() on /proc/vmcore

On Wed, May 15, 2013 at 06:06:26PM +0900, HATAYAMA Daisuke wrote:
> This patch introduces mmap_vmcore().
>
> Don't permit a writable or executable mapping even with mprotect()
> because this mmap() is aimed at reading crash dump memory.
> A non-writable mapping is also a requirement of remap_pfn_range() when
> mapping linear pages on non-consecutive physical pages; see
> is_cow_mapping().
>
> Set the VM_MIXEDMAP flag to remap memory by remap_pfn_range and by
> remap_vmalloc_range_partial at the same time for a single
> vma. do_munmap() can correctly clean up a partially remapped vma built
> with the two functions in the abnormal case. See zap_pte_range(),
> vm_normal_page() and their comments for details.
>
> On x86-32 PAE kernels, mmap() supports at most 16TB of memory. This
> limitation comes from the fact that the third argument of
> remap_pfn_range(), pfn, is of 32-bit length on x86-32: unsigned long.
>
> Signed-off-by: HATAYAMA Daisuke <[email protected]>
> ---

This one looks fine to me assuming vm folks like
remap_vmalloc_range_partial().

Acked-by: Vivek Goyal <[email protected]>

Thanks
Vivek

>
> fs/proc/vmcore.c | 86 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
> 1 files changed, 86 insertions(+), 0 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 7f2041c..2c72487 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -20,6 +20,7 @@
> #include <linux/init.h>
> #include <linux/crash_dump.h>
> #include <linux/list.h>
> +#include <linux/vmalloc.h>
> #include <asm/uaccess.h>
> #include <asm/io.h>
> #include "internal.h"
> @@ -200,9 +201,94 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> +static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> +{
> + size_t size = vma->vm_end - vma->vm_start;
> + u64 start, end, len, tsz;
> + struct vmcore *m;
> +
> + start = (u64)vma->vm_pgoff << PAGE_SHIFT;
> + end = start + size;
> +
> + if (size > vmcore_size || end > vmcore_size)
> + return -EINVAL;
> +
> + if (vma->vm_flags & (VM_WRITE | VM_EXEC))
> + return -EPERM;
> +
> + vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> + vma->vm_flags |= VM_MIXEDMAP;
> +
> + len = 0;
> +
> + if (start < elfcorebuf_sz) {
> + u64 pfn;
> +
> + tsz = elfcorebuf_sz - start;
> + if (size < tsz)
> + tsz = size;
> + pfn = __pa(elfcorebuf + start) >> PAGE_SHIFT;
> + if (remap_pfn_range(vma, vma->vm_start, pfn, tsz,
> + vma->vm_page_prot))
> + return -EAGAIN;
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> +
> + if (start < elfcorebuf_sz + elfnotes_sz) {
> + void *kaddr;
> +
> + tsz = elfcorebuf_sz + elfnotes_sz - start;
> + if (size < tsz)
> + tsz = size;
> + kaddr = elfnotes_buf + start - elfcorebuf_sz;
> + if (remap_vmalloc_range_partial(vma, vma->vm_start + len,
> + kaddr, tsz)) {
> + do_munmap(vma->vm_mm, vma->vm_start, len);
> + return -EAGAIN;
> + }
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> +
> + list_for_each_entry(m, &vmcore_list, list) {
> + if (start < m->offset + m->size) {
> + u64 paddr = 0;
> +
> + tsz = m->offset + m->size - start;
> + if (size < tsz)
> + tsz = size;
> + paddr = m->paddr + start - m->offset;
> + if (remap_pfn_range(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot)) {
> + do_munmap(vma->vm_mm, vma->vm_start, len);
> + return -EAGAIN;
> + }
> + size -= tsz;
> + start += tsz;
> + len += tsz;
> +
> + if (size == 0)
> + return 0;
> + }
> + }
> +
> + return 0;
> +}
> +
> static const struct file_operations proc_vmcore_operations = {
> .read = read_vmcore,
> .llseek = default_llseek,
> + .mmap = mmap_vmcore,
> };
>
> static struct vmcore* __init get_new_element(void)

2013-05-16 23:46:07

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v6 4/8] vmalloc: make find_vm_area check in range

(2013/05/16 6:37), KOSAKI Motohiro wrote:
> On Wed, May 15, 2013 at 5:06 AM, HATAYAMA Daisuke
> <[email protected]> wrote:
>> Currently, __find_vmap_area searches for the kernel VM area starting
>> at a given address. This patch changes this behavior so that it
>> searches for the kernel VM area to which the address belongs. This
>> change is needed by remap_vmalloc_range_partial, to be introduced in a
>> later patch, which receives any position within a kernel VM area as its
>> target address.
>>
>> This patch changes the condition (addr > va->va_start) to the
>> equivalent (addr >= va->va_end) by taking advantage of the fact that
>> each kernel VM area is non-overlapping.
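(Illustrative example: for a vmap_area covering [0x1000, 0x3000) and a lookup of addr = 0x2000, the old condition sees addr > va_start and walks right, so it never returns the area that contains the address; with the new condition, addr >= va_end is false and the containing area is returned. Lookups of an exact start address behave as before, because non-overlapping areas guarantee that an address equal to some other area's start is also >= this area's va_end.)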
>>
>> Signed-off-by: HATAYAMA Daisuke <[email protected]>
>> ---
>>
>> mm/vmalloc.c | 2 +-
>> 1 files changed, 1 insertions(+), 1 deletions(-)
>>
>> diff --git a/mm/vmalloc.c b/mm/vmalloc.c
>> index d365724..3875fa2 100644
>> --- a/mm/vmalloc.c
>> +++ b/mm/vmalloc.c
>> @@ -292,7 +292,7 @@ static struct vmap_area *__find_vmap_area(unsigned long addr)
>> va = rb_entry(n, struct vmap_area, rb_node);
>> if (addr < va->va_start)
>> n = n->rb_left;
>> - else if (addr > va->va_start)
>> + else if (addr >= va->va_end)
>> n = n->rb_right;
>
> OK. This is natural definition. Looks good.
>
> Acked-by: KOSAKI Motohiro <[email protected]>

Thanks for your review. Could you or someone else review the next patch,
5/8, too? It also changes vmalloc, and review from the cc'd people is needed.

--
Thanks.
HATAYAMA, Daisuke

2013-05-16 23:47:24

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v6 6/8] vmcore: allocate ELF note segment in the 2nd kernel vmalloc memory

(2013/05/17 5:32), Vivek Goyal wrote:
> On Wed, May 15, 2013 at 06:06:14PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>
>> +static int __init get_note_number_and_size_elf32(const Elf32_Ehdr *ehdr_ptr,
>> + int *nr_ptnote, u64 *phdr_sz)
>> +{
>> + return process_note_headers_elf32(ehdr_ptr, nr_ptnote, phdr_sz, NULL);
>> +}
>> +
>> +static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
>> +{
>> + return process_note_headers_elf32(ehdr_ptr, NULL, NULL, notes_buf);
>> +}
>> +
>
> Please don't do this. We need to create two separate functions doing
> two different operations, and not just create a wrapper around a function
> which does two things.
>
> I know both functions will have similar for loops for going through
> the elf notes, but it is better than doing function overloading based
> on the parameters passed.
>

I see. This part must be fixed in the next version.

--
Thanks.
HATAYAMA, Daisuke

2013-05-17 00:07:10

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

On 05/15/2013 02:05 AM, HATAYAMA Daisuke wrote:
> Currently, read to /proc/vmcore is done by read_oldmem() that uses
> ioremap/iounmap per a single page. For example, if memory is 1GB,
> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
> times. This causes big performance degradation.

read_oldmem() is fundamentally broken and unsafe. It needs to be
unified with the plain /dev/mem code and any missing functionality fixed
instead of "let's just do a whole new driver".

-hpa

2013-05-17 00:08:31

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v6 2/8] vmcore: allocate buffer for ELF headers on page-size alignment

(2013/05/17 1:51), Vivek Goyal wrote:
> On Wed, May 15, 2013 at 06:05:51PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> @@ -398,9 +403,7 @@ static int __init process_ptload_program_headers_elf64(char *elfptr,
>> phdr_ptr = (Elf64_Phdr*)(elfptr + sizeof(Elf64_Ehdr)); /* PT_NOTE hdr */
>>
>> /* First program header is PT_NOTE header. */
>> - vmcore_off = sizeof(Elf64_Ehdr) +
>> - (ehdr_ptr->e_phnum) * sizeof(Elf64_Phdr) +
>> - phdr_ptr->p_memsz; /* Note sections */
>> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>>
>> for (i = 0; i < ehdr_ptr->e_phnum; i++, phdr_ptr++) {
>> if (phdr_ptr->p_type != PT_LOAD)
>> @@ -435,9 +438,7 @@ static int __init process_ptload_program_headers_elf32(char *elfptr,
>> phdr_ptr = (Elf32_Phdr*)(elfptr + sizeof(Elf32_Ehdr)); /* PT_NOTE hdr */
>>
>> /* First program header is PT_NOTE header. */
>> - vmcore_off = sizeof(Elf32_Ehdr) +
>> - (ehdr_ptr->e_phnum) * sizeof(Elf32_Phdr) +
>> - phdr_ptr->p_memsz; /* Note sections */
>> + vmcore_off = elfsz + roundup(phdr_ptr->p_memsz, PAGE_SIZE);
>
> Hmm.., so we are rounding up the ELF note data size too here. I think this
> belongs in some other patch, as in this patch we are just rounding up the elf
> headers.
>
> This might create read problems too, as we have not taken care of this
> rounding when adding the note to vc_list, and it might happen that we are
> reading wrong data at a particular offset.
>
> So maybe this rounding up should be done in later patches, when we take
> care of copying the ELF notes data to the second kernel.
>
> Vivek
>

This is a careless mistake on my part. They should have been in 6/7.

--
Thanks.
HATAYAMA, Daisuke

2013-05-17 01:46:42

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

(2013/05/17 9:06), H. Peter Anvin wrote:
> On 05/15/2013 02:05 AM, HATAYAMA Daisuke wrote:
>> Currently, read to /proc/vmcore is done by read_oldmem() that uses
>> ioremap/iounmap per a single page. For example, if memory is 1GB,
>> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
>> times. This causes big performance degradation.
>
> read_oldmem() is fundamentally broken and unsafe. It needs to be
> unified with the plain /dev/mem code and any missing functionality fixed
> instead of "let's just do a whole new driver".
>
> -hpa

Do you mean range_is_allowed should be extended so that it checks
according to the memory map passed from the 1st kernel?

BTW, read requests to read_oldmem via read_vmcore, and mmap on some part
of the 1st kernel's memory, seem safe since they are always restricted to
within the memory map.

Or is there some other point I am missing?

--
Thanks.
HATAYAMA, Daisuke

2013-05-17 02:53:50

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

"H. Peter Anvin" <[email protected]> writes:

> On 05/15/2013 02:05 AM, HATAYAMA Daisuke wrote:
>> Currently, read to /proc/vmcore is done by read_oldmem() that uses
>> ioremap/iounmap per a single page. For example, if memory is 1GB,
>> ioremap/iounmap is called (1GB / 4KB)-times, that is, 262144
>> times. This causes big performance degradation.
>
> read_oldmem() is fundamentally broken and unsafe. It needs to be
> unified with the plain /dev/mem code and any missing functionality fixed
> instead of "let's just do a whole new driver".

That is completely and totally orthogonal to this change.

read_oldmem may have problems, but in practice on large systems those
problems are totally dwarfed by real-life performance issues that come
from playing too much with the page tables.

I really don't find bringing up whatever foundational issues you have
with read_oldmem() appropriate or relevant here.

Eric

2013-05-17 03:22:12

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

On 05/16/2013 07:53 PM, Eric W. Biederman wrote:
>
> That is completely and totally orthogonal to this change.
>
> read_oldmem may have problems, but in practice on large systems those
> problems are totally dwarfed by real-life performance issues that come
> from playing too much with the page tables.
>
> I really don't find bringing up whatever foundational issues you have
> with read_oldmem() appropriate or relevant here.
>

Well, it is in the sense that we have two pieces of code doing the same
thing, each with different bugs.

-hpa

2013-05-17 04:29:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

"H. Peter Anvin" <[email protected]> writes:

> On 05/16/2013 07:53 PM, Eric W. Biederman wrote:
>>
>> That is completely and totally orthogonal to this change.
>>
>> read_oldmem may have problems, but in practice on large systems those
>> problems are totally dwarfed by real-life performance issues that come
>> from playing too much with the page tables.
>>
>> I really don't find bringing up whatever foundational issues you have
>> with read_oldmem() appropriate or relevant here.
>>
>
> Well, it is in the sense that we have two pieces of code doing the same
> thing, each with different bugs.

Not the tiniest little bit.

All this patchset is about is which page table kernel vs user we map the
physical addresses in.

As such this patchset should neither increase nor decrease the number of
bugs, or cause any other hilarity.

Whatever theoretical issues you have with /dev/oldmem and /proc/vmcore
can and should be talked about and addressed independently of these
changes. HATAYAMA Daisuke already has enough to handle coming up with a
clean set of patches that add mmap support.

Eric

2013-05-17 05:44:28

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH v6 0/8] kdump, vmcore: support mmap() on /proc/vmcore

On 05/16/2013 09:29 PM, Eric W. Biederman wrote:
>
> Whatever theoretical issues you have with /dev/oldmem and /proc/vmcore
> can and should be talked about and addressed independently of these
> changes.

And they are... last I knew, Dave Hansen was looking at it.

-hpa