Hello Vivek and Hatayama,
As discussed, here is the sixth version of the patch set.
ChangeLog
=========
v5 => v6)
- Set elfcorehdr_addr to ELFCORE_ADDR_ERR after elfcorehdr_free()
- Fix OLDMEM_SIZE/ZFCPDUMP_HSA_SIZE confusion
- Remove return VM_FAULT_MAJOR/MINOR
- Use find_or_create_page() in mmap_vmcore_fault()
- Use kfree instead of vfree in elfcorehdr_free()
v4 => v5)
- Add weak function elfcorehdr_read_notes() to read ELF notes
- Rename weak functions for ELF header access and use "elfcorehdr_" prefix
- Generic vmcore code calls elfcorehdr_alloc() if elfcorehdr= is not specified
- Add vmcore fault handler for mmap of non-resident memory regions
- Add weak function remap_oldmem_pfn_range() to be used by zfcpdump for mmap
v3 => v4)
- Rebase to 3.10-rc2 + vmcore mmap patches v8
v2 => v3)
- Get rid of ELFCORE_ADDR_NEWMEM
- Make read_from_crash_header() only read from kernel
- Move read_from_crash_header() to weak function arch_read_from_crash_header()
- Implement read_from_crash_header() strong function for s390
- Set elfcorehdr_addr to address of new ELF header
v1 => v2)
- Rebase 3.10-rc2 + vmcore mmap patches
- Introduced arch_get/free_crash_header() and ELFCORE_ADDR_NEWMEM
Feature Description
===================
For s390 we want to use /proc/vmcore for our SCSI stand-alone
dump (zfcpdump). We have support where the first HSA_SIZE bytes are
saved into a hypervisor-owned memory area (HSA) before the kdump
kernel is booted. When the kdump kernel starts, it is restricted
to use only HSA_SIZE bytes.
The advantages of this mechanism are:
* No crashkernel memory has to be defined in the old kernel.
* Early boot problems (before kexec_load has been done) can be dumped.
* Non-Linux systems can be dumped.
We modify the s390 copy_oldmem_page() function to read from the HSA
memory when memory below HSA_SIZE is requested.
Since we cannot use the kexec tool to load the kernel in this scenario,
we have to build the ELF header in the 2nd (kdump/new) kernel.
With the following patch set we would like to introduce a new mechanism
that allows the ELF header for /proc/vmcore to be created in 2nd (new)
kernel memory.
The following steps are done during zfcpdump execution:
1. Production system crashes
2. User boots a SCSI disk that has been prepared with the zfcpdump tool
3. Hypervisor saves CPU state of boot CPU and HSA_SIZE bytes of memory into HSA
4. Boot loader loads kernel into low memory area
5. Kernel boots and uses only HSA_SIZE bytes of memory
6. Kernel saves registers of non-boot CPUs
7. Kernel does memory detection for dump memory map
8. Kernel creates ELF header for /proc/vmcore
9. /proc/vmcore uses this header for initialization
10. The zfcpdump user space reads /proc/vmcore to write dump to SCSI disk
- copy_oldmem_page() copies from HSA for memory below HSA_SIZE
- copy_oldmem_page() copies from real memory for memory above HSA_SIZE
(a condensed sketch of this dispatch follows)
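For illustration, a condensed sketch of this dispatch (simplified from the
"s390/vmcore: Use vmcore for zfcpdump" patch below; error handling and the
user/kernel buffer distinction live in the two helpers):

/*
 * Condensed sketch: if crashkernel memory was reserved (OLDMEM_BASE set)
 * we are in kdump mode, otherwise in zfcpdump mode, where pages below
 * ZFCPDUMP_HSA_SIZE are read from the HSA.
 */
ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
			 unsigned long offset, int userbuf)
{
	unsigned long src;

	if (!csize)
		return 0;
	src = (pfn << PAGE_SHIFT) + offset;
	if (OLDMEM_BASE)	/* kdump */
		return copy_oldmem_page_kdump(buf, csize, src, userbuf);
	else			/* zfcpdump */
		return copy_oldmem_page_zfcpdump(buf, csize, src, userbuf);
}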
Jan Willeke (1):
s390/vmcore: Implement remap_oldmem_pfn_range for s390
Michael Holzheu (4):
vmcore: Introduce ELF header in new memory feature
s390/vmcore: Use ELF header in new memory feature
vmcore: Introduce remap_oldmem_pfn_range()
s390/vmcore: Use vmcore for zfcpdump
arch/s390/Kconfig | 3 +-
arch/s390/include/asm/sclp.h | 1 +
arch/s390/kernel/crash_dump.c | 228 ++++++++++++++++++++++++++++++++++--------
drivers/s390/char/zcore.c | 6 +-
fs/proc/vmcore.c | 145 +++++++++++++++++++++++----
include/linux/crash_dump.h | 9 ++
6 files changed, 326 insertions(+), 66 deletions(-)
--
1.8.2.2
From: Jan Willeke <[email protected]>
This patch introduces the s390-specific way to map pages from oldmem.
The memory area below OLDMEM_SIZE is mapped with offset OLDMEM_BASE.
The other old memory is mapped directly.
Signed-off-by: Jan Willeke <[email protected]>
Signed-off-by: Michael Holzheu <[email protected]>
---
arch/s390/kernel/crash_dump.c | 26 ++++++++++++++++++++++++++
1 file changed, 26 insertions(+)
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index bfb9cda..9769aa3 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -56,6 +56,32 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
}
/*
+ * Remap "oldmem"
+ *
+ * For the kdump reserved memory this function performs a swap operation:
+ * [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
+ */
+int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ unsigned long size_old;
+ int rc;
+
+ if (pfn < OLDMEM_SIZE >> PAGE_SHIFT) {
+ size_old = min(size, OLDMEM_SIZE - (pfn << PAGE_SHIFT));
+ rc = remap_pfn_range(vma, from,
+ pfn + (OLDMEM_BASE >> PAGE_SHIFT),
+ size_old, prot);
+ if (rc || size == size_old)
+ return rc;
+ size -= size_old;
+ from += size_old;
+ pfn += size_old >> PAGE_SHIFT;
+ }
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
* Copy memory from old kernel
*/
int copy_from_oldmem(void *dest, void *src, size_t count)
--
1.8.2.2
This patch modifies the s390 copy_oldmem_page() and remap_oldmem_pfn_range()
functions for zfcpdump to read from the HSA memory when memory below HSA_SIZE
is requested. Otherwise real memory is used.
Signed-off-by: Michael Holzheu <[email protected]>
---
arch/s390/Kconfig | 3 +-
arch/s390/include/asm/sclp.h | 1 +
arch/s390/kernel/crash_dump.c | 129 +++++++++++++++++++++++++++++++++++-------
drivers/s390/char/zcore.c | 6 +-
4 files changed, 114 insertions(+), 25 deletions(-)
diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 2c9789d..a0d78f1 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -516,6 +516,7 @@ config CRASH_DUMP
bool "kernel crash dumps"
depends on 64BIT && SMP
select KEXEC
+ select ZFCPDUMP
help
Generate crash dump after being started by kexec.
Crash dump kernels are loaded in the main kernel with kexec-tools
@@ -526,7 +527,7 @@ config CRASH_DUMP
config ZFCPDUMP
def_bool n
prompt "zfcpdump support"
- select SMP
+ depends on SMP
help
Select this option if you want to build an zfcpdump enabled kernel.
Refer to <file:Documentation/s390/zfcpdump.txt> for more details on this.
diff --git a/arch/s390/include/asm/sclp.h b/arch/s390/include/asm/sclp.h
index 06a1361..7dc7f9c 100644
--- a/arch/s390/include/asm/sclp.h
+++ b/arch/s390/include/asm/sclp.h
@@ -56,5 +56,6 @@ bool sclp_has_linemode(void);
bool sclp_has_vt220(void);
int sclp_pci_configure(u32 fid);
int sclp_pci_deconfigure(u32 fid);
+int memcpy_hsa(void *dest, unsigned long src, size_t count, int mode);
#endif /* _ASM_S390_SCLP_H */
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index 9769aa3..9a56989 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -16,6 +16,7 @@
#include <asm/os_info.h>
#include <asm/elf.h>
#include <asm/ipl.h>
+#include <asm/sclp.h>
#define PTR_ADD(x, y) (((char *) (x)) + ((unsigned long) (y)))
#define PTR_SUB(x, y) (((char *) (x)) - ((unsigned long) (y)))
@@ -27,42 +28,80 @@
static void *elfcorehdr_newmem;
/*
- * Copy one page from "oldmem"
+ * Copy one page from zfcpdump "oldmem"
+ *
+ * For pages below ZFCPDUMP_HSA_SIZE memory from the HSA is copied. Otherwise
+ * real memory copy is used.
+ */
+static ssize_t copy_oldmem_page_zfcpdump(char *buf, size_t csize,
+ unsigned long src, int userbuf)
+{
+ int rc;
+
+ if (src < ZFCPDUMP_HSA_SIZE) {
+ rc = memcpy_hsa(buf, src, csize, userbuf);
+ } else {
+ if (userbuf)
+ rc = copy_to_user_real((void __force __user *) buf,
+ (void *) src, csize);
+ else
+ rc = memcpy_real(buf, (void *) src, csize);
+ }
+ return rc ? rc : csize;
+}
+
+/*
+ * Copy one page from kdump "oldmem"
*
* For the kdump reserved memory this function performs a swap operation:
* - [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE] is mapped to [0 - OLDMEM_SIZE].
* - [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
*/
-ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
- size_t csize, unsigned long offset, int userbuf)
-{
- unsigned long src;
-
- if (!csize)
- return 0;
-
- src = (pfn << PAGE_SHIFT) + offset;
+static ssize_t copy_oldmem_page_kdump(char *buf, size_t csize,
+ unsigned long src, int userbuf)
+{
+ int rc;
+
if (src < OLDMEM_SIZE)
src += OLDMEM_BASE;
else if (src > OLDMEM_BASE &&
src < OLDMEM_BASE + OLDMEM_SIZE)
src -= OLDMEM_BASE;
if (userbuf)
- copy_to_user_real((void __force __user *) buf, (void *) src,
- csize);
+ rc = copy_to_user_real((void __force __user *) buf,
+ (void *) src, csize);
+ else
+ rc = memcpy_real(buf, (void *) src, csize);
+ return rc ? rc : csize;
+}
+
+/*
+ * Copy one page from "oldmem"
+ */
+ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
+ unsigned long offset, int userbuf)
+{
+ unsigned long src;
+
+ if (!csize)
+ return 0;
+ src = (pfn << PAGE_SHIFT) + offset;
+ if (OLDMEM_BASE)
+ return copy_oldmem_page_kdump(buf, csize, src, userbuf);
else
- memcpy_real(buf, (void *) src, csize);
- return csize;
+ return copy_oldmem_page_zfcpdump(buf, csize, src, userbuf);
}
/*
- * Remap "oldmem"
+ * Remap "oldmem" for kdump
*
* For the kdump reserved memory this function performs a swap operation:
* [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
*/
-int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
- unsigned long pfn, unsigned long size, pgprot_t prot)
+static int remap_oldmem_pfn_range_kdump(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
{
unsigned long size_old;
int rc;
@@ -82,6 +121,43 @@ int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
}
/*
+ * Remap "oldmem" for zfcpdump
+ *
+ * We only map available memory above ZFCPDUMP_HSA_SIZE. Memory below
+ * ZFCPDUMP_HSA_SIZE is read on demand using the copy_oldmem_page() function.
+ */
+static int remap_oldmem_pfn_range_zfcpdump(struct vm_area_struct *vma,
+ unsigned long from,
+ unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ unsigned long size_hsa;
+
+ if (pfn < ZFCPDUMP_HSA_SIZE >> PAGE_SHIFT) {
+ size_hsa = min(size, ZFCPDUMP_HSA_SIZE - (pfn << PAGE_SHIFT));
+ if (size == size_hsa)
+ return 0;
+ size -= size_hsa;
+ from += size_hsa;
+ pfn += size_hsa >> PAGE_SHIFT;
+ }
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Remap "oldmem" for kdump or zfcpdump
+ */
+int remap_oldmem_pfn_range(struct vm_area_struct *vma, unsigned long from,
+ unsigned long pfn, unsigned long size, pgprot_t prot)
+{
+ if (OLDMEM_BASE)
+ return remap_oldmem_pfn_range_kdump(vma, from, pfn, size, prot);
+ else
+ return remap_oldmem_pfn_range_zfcpdump(vma, from, pfn, size,
+ prot);
+}
+
+/*
* Copy memory from old kernel
*/
int copy_from_oldmem(void *dest, void *src, size_t count)
@@ -89,11 +165,21 @@ int copy_from_oldmem(void *dest, void *src, size_t count)
unsigned long copied = 0;
int rc;
- if ((unsigned long) src < OLDMEM_SIZE) {
- copied = min(count, OLDMEM_SIZE - (unsigned long) src);
- rc = memcpy_real(dest, src + OLDMEM_BASE, copied);
- if (rc)
- return rc;
+ if (OLDMEM_BASE) {
+ if ((unsigned long) src < OLDMEM_SIZE) {
+ copied = min(count, OLDMEM_SIZE - (unsigned long) src);
+ rc = memcpy_real(dest, src + OLDMEM_BASE, copied);
+ if (rc)
+ return rc;
+ }
+ } else {
+ if ((unsigned long) src < ZFCPDUMP_HSA_SIZE) {
+ copied = min(count,
+ ZFCPDUMP_HSA_SIZE - (unsigned long) src);
+ rc = memcpy_hsa(dest, (unsigned long) src, copied, 0);
+ if (rc)
+ return rc;
+ }
}
return memcpy_real(dest + copied, src + copied, count - copied);
}
@@ -423,7 +509,8 @@ int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
u32 alloc_size;
u64 hdr_off;
- if (!OLDMEM_BASE)
+ /* If we are not in kdump or zfcpdump mode return */
+ if (!OLDMEM_BASE && ipl_info.type != IPL_TYPE_FCP_DUMP)
return 0;
/* If elfcorehdr= has been passed via cmdline, we use that one */
if (elfcorehdr_addr != ELFCORE_ADDR_MAX)
diff --git a/drivers/s390/char/zcore.c b/drivers/s390/char/zcore.c
index 9e5e146..794820a 100644
--- a/drivers/s390/char/zcore.c
+++ b/drivers/s390/char/zcore.c
@@ -30,8 +30,8 @@
#define TRACE(x...) debug_sprintf_event(zcore_dbf, 1, x)
-#define TO_USER 0
-#define TO_KERNEL 1
+#define TO_USER 1
+#define TO_KERNEL 0
#define CHUNK_INFO_SIZE 34 /* 2 16-byte char, each followed by blank */
enum arch_id {
@@ -73,7 +73,7 @@ static struct ipl_parameter_block *ipl_block;
* @count: Size of buffer, which should be copied
* @mode: Either TO_KERNEL or TO_USER
*/
-static int memcpy_hsa(void *dest, unsigned long src, size_t count, int mode)
+int memcpy_hsa(void *dest, unsigned long src, size_t count, int mode)
{
int offs, blk_num;
static char buf[PAGE_SIZE] __attribute__((__aligned__(PAGE_SIZE)));
--
1.8.2.2
This patch now exchanges the old relocate mechanism with the new
arch function call override mechanism that makes it possible to create
the ELF core header in the 2nd kernel.
Signed-off-by: Michael Holzheu <[email protected]>
---
arch/s390/kernel/crash_dump.c | 81 ++++++++++++++++++++++++++++---------------
1 file changed, 54 insertions(+), 27 deletions(-)
diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index f703d91..bfb9cda 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -22,6 +22,11 @@
#define PTR_DIFF(x, y) ((unsigned long)(((char *) (x)) - ((unsigned long) (y))))
/*
+ * Pointer to ELF header in new kernel
+ */
+static void *elfcorehdr_newmem;
+
+/*
* Copy one page from "oldmem"
*
* For the kdump reserved memory this function performs a swap operation:
@@ -325,14 +330,6 @@ static int get_mem_chunk_cnt(void)
}
/*
- * Relocate pointer in order to allow vmcore code access the data
- */
-static inline unsigned long relocate(unsigned long addr)
-{
- return OLDMEM_BASE + addr;
-}
-
-/*
* Initialize ELF loads (new kernel)
*/
static int loads_init(Elf64_Phdr *phdr, u64 loads_offset)
@@ -383,7 +380,7 @@ static void *notes_init(Elf64_Phdr *phdr, void *ptr, u64 notes_offset)
ptr = nt_vmcoreinfo(ptr);
memset(phdr, 0, sizeof(*phdr));
phdr->p_type = PT_NOTE;
- phdr->p_offset = relocate(notes_offset);
+ phdr->p_offset = notes_offset;
phdr->p_filesz = (unsigned long) PTR_SUB(ptr, ptr_start);
phdr->p_memsz = phdr->p_filesz;
return ptr;
@@ -392,7 +389,7 @@ static void *notes_init(Elf64_Phdr *phdr, void *ptr, u64 notes_offset)
/*
* Create ELF core header (new kernel)
*/
-static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
+int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
{
Elf64_Phdr *phdr_notes, *phdr_loads;
int mem_chunk_cnt;
@@ -400,6 +397,11 @@ static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
u32 alloc_size;
u64 hdr_off;
+ if (!OLDMEM_BASE)
+ return 0;
+ /* If elfcorehdr= has been passed via cmdline, we use that one */
+ if (elfcorehdr_addr != ELFCORE_ADDR_MAX)
+ return 0;
mem_chunk_cnt = get_mem_chunk_cnt();
alloc_size = 0x1000 + get_cpu_cnt() * 0x300 +
@@ -417,27 +419,52 @@ static void s390_elf_corehdr_create(char **elfcorebuf, size_t *elfcorebuf_sz)
ptr = notes_init(phdr_notes, ptr, ((unsigned long) hdr) + hdr_off);
/* Init loads */
hdr_off = PTR_DIFF(ptr, hdr);
- loads_init(phdr_loads, ((unsigned long) hdr) + hdr_off);
- *elfcorebuf_sz = hdr_off;
- *elfcorebuf = (void *) relocate((unsigned long) hdr);
- BUG_ON(*elfcorebuf_sz > alloc_size);
+ loads_init(phdr_loads, hdr_off);
+ *addr = (unsigned long long) hdr;
+ elfcorehdr_newmem = hdr;
+ *size = (unsigned long long) hdr_off;
+ BUG_ON(elfcorehdr_size > alloc_size);
+ return 0;
}
/*
- * Create kdump ELF core header in new kernel, if it has not been passed via
- * the "elfcorehdr" kernel parameter
+ * Free ELF core header (new kernel)
*/
-static int setup_kdump_elfcorehdr(void)
+void elfcorehdr_free(unsigned long long addr)
{
- size_t elfcorebuf_sz;
- char *elfcorebuf;
-
- if (!OLDMEM_BASE || is_kdump_kernel())
- return -EINVAL;
- s390_elf_corehdr_create(&elfcorebuf, &elfcorebuf_sz);
- elfcorehdr_addr = (unsigned long long) elfcorebuf;
- elfcorehdr_size = elfcorebuf_sz;
- return 0;
+ if (!elfcorehdr_newmem)
+ return;
+ kfree((void *)(unsigned long)addr);
}
-subsys_initcall(setup_kdump_elfcorehdr);
+/*
+ * Read from ELF header
+ */
+ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ void *src = (void *)(unsigned long)*ppos;
+
+ src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
+ memcpy(buf, src, count);
+ *ppos += count;
+ return count;
+}
+
+/*
+ * Read from ELF notes data
+ */
+ssize_t elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
+{
+ void *src = (void *)(unsigned long)*ppos;
+ int rc;
+
+ if (elfcorehdr_newmem) {
+ memcpy(buf, src, count);
+ } else {
+ rc = copy_from_oldmem(buf, src, count);
+ if (rc)
+ return rc;
+ }
+ *ppos += count;
+ return count;
+}
--
1.8.2.2
Currently for s390 we create the ELF core header in the 2nd kernel
with a small trick. We relocate the addresses in the ELF header in
a way that for the /proc/vmcore code it seems to be in the 1st kernel
(old) memory and read_from_oldmem() returns the correct data.
This allows the /proc/vmcore code to use the ELF header in the
2nd kernel.
This patch now exchanges the old mechanism with the new and much
cleaner function call override feature that now officially allows
creating the ELF core header in the 2nd kernel.
To use the new feature the following functions have to be defined
by the architecture backend code to read from new memory (a minimal
illustrative sketch follows the list):
* elfcorehdr_alloc: Allocate ELF header
* elfcorehdr_free: Free the memory of the ELF header
* elfcorehdr_read: Read from ELF header
* elfcorehdr_read_notes: Read from ELF notes
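A minimal sketch of an architecture-side override could look like the
following. This is illustrative only and not part of this patch set;
build_arch_elf_header() is a hypothetical helper, and the real
implementation is in the s390 patch of this series. elfcorehdr_read_notes()
would read the notes from oldmem in the same way:

/* Illustrative sketch only: serve the ELF header from a buffer that
 * was allocated in the 2nd (new) kernel. */
static void *build_arch_elf_header(size_t *size);	/* hypothetical */

int elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
{
	size_t hdr_size;
	void *hdr = build_arch_elf_header(&hdr_size);

	if (!hdr)
		return -ENOMEM;
	*addr = (unsigned long long) hdr;	/* stored in elfcorehdr_addr */
	*size = hdr_size;
	return 0;
}

void elfcorehdr_free(unsigned long long addr)
{
	kfree((void *)(unsigned long) addr);	/* assumes kmalloc'ed buffer */
}

ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
{
	/* *ppos is an address within the header buffer allocated above */
	memcpy(buf, (void *)(unsigned long) *ppos, count);
	*ppos += count;
	return count;
}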
Signed-off-by: Michael Holzheu <[email protected]>
---
fs/proc/vmcore.c | 61 ++++++++++++++++++++++++++++++++++++++--------
include/linux/crash_dump.h | 6 +++++
2 files changed, 57 insertions(+), 10 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 0f6db52..c28189c 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -123,6 +123,36 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}
+/*
+ * Architectures may override this function to allocate ELF header in 2nd kernel
+ */
+int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
+{
+ return 0;
+}
+
+/*
+ * Architectures may override this function to free header
+ */
+void __weak elfcorehdr_free(unsigned long long addr)
+{}
+
+/*
+ * Architectures may override this function to read from ELF header
+ */
+ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
+{
+ return read_from_oldmem(buf, count, ppos, 0);
+}
+
+/*
+ * Architectures may override this function to read from notes sections
+ */
+ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
+{
+ return read_from_oldmem(buf, count, ppos, 0);
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -322,7 +352,7 @@ static int __init update_note_header_size_elf64(const Elf64_Ehdr *ehdr_ptr)
notes_section = kmalloc(max_sz, GFP_KERNEL);
if (!notes_section)
return -ENOMEM;
- rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+ rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
if (rc < 0) {
kfree(notes_section);
return rc;
@@ -409,7 +439,8 @@ static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr_ptr, char *notes_buf)
if (phdr_ptr->p_type != PT_NOTE)
continue;
offset = phdr_ptr->p_offset;
- rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, &offset, 0);
+ rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
+ &offset);
if (rc < 0)
return rc;
notes_buf += phdr_ptr->p_memsz;
@@ -510,7 +541,7 @@ static int __init update_note_header_size_elf32(const Elf32_Ehdr *ehdr_ptr)
notes_section = kmalloc(max_sz, GFP_KERNEL);
if (!notes_section)
return -ENOMEM;
- rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
+ rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
if (rc < 0) {
kfree(notes_section);
return rc;
@@ -597,7 +628,8 @@ static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
if (phdr_ptr->p_type != PT_NOTE)
continue;
offset = phdr_ptr->p_offset;
- rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, &offset, 0);
+ rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
+ &offset);
if (rc < 0)
return rc;
notes_buf += phdr_ptr->p_memsz;
@@ -793,7 +825,7 @@ static int __init parse_crash_elf64_headers(void)
addr = elfcorehdr_addr;
/* Read Elf header */
- rc = read_from_oldmem((char*)&ehdr, sizeof(Elf64_Ehdr), &addr, 0);
+ rc = elfcorehdr_read((char *)&ehdr, sizeof(Elf64_Ehdr), &addr);
if (rc < 0)
return rc;
@@ -820,7 +852,7 @@ static int __init parse_crash_elf64_headers(void)
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
+ rc = elfcorehdr_read(elfcorebuf, elfcorebuf_sz_orig, &addr);
if (rc < 0)
goto fail;
@@ -849,7 +881,7 @@ static int __init parse_crash_elf32_headers(void)
addr = elfcorehdr_addr;
/* Read Elf header */
- rc = read_from_oldmem((char*)&ehdr, sizeof(Elf32_Ehdr), &addr, 0);
+ rc = elfcorehdr_read((char *)&ehdr, sizeof(Elf32_Ehdr), &addr);
if (rc < 0)
return rc;
@@ -875,7 +907,7 @@ static int __init parse_crash_elf32_headers(void)
if (!elfcorebuf)
return -ENOMEM;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
+ rc = elfcorehdr_read(elfcorebuf, elfcorebuf_sz_orig, &addr);
if (rc < 0)
goto fail;
@@ -902,7 +934,7 @@ static int __init parse_crash_elf_headers(void)
int rc=0;
addr = elfcorehdr_addr;
- rc = read_from_oldmem(e_ident, EI_NIDENT, &addr, 0);
+ rc = elfcorehdr_read(e_ident, EI_NIDENT, &addr);
if (rc < 0)
return rc;
if (memcmp(e_ident, ELFMAG, SELFMAG) != 0) {
@@ -935,7 +967,14 @@ static int __init vmcore_init(void)
{
int rc = 0;
- /* If elfcorehdr= has been passed in cmdline, then capture the dump.*/
+ /* Allow architectures to allocate ELF header in 2nd kernel */
+ rc = elfcorehdr_alloc(&elfcorehdr_addr, &elfcorehdr_size);
+ if (rc)
+ return rc;
+ /*
+ * If elfcorehdr= has been passed in cmdline or created in 2nd kernel,
+ * then capture the dump.
+ */
if (!(is_vmcore_usable()))
return rc;
rc = parse_crash_elf_headers();
@@ -943,6 +982,8 @@ static int __init vmcore_init(void)
pr_warn("Kdump: vmcore not initialized\n");
return rc;
}
+ elfcorehdr_free(elfcorehdr_addr);
+ elfcorehdr_addr = ELFCORE_ADDR_ERR;
proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
if (proc_vmcore)
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index 37e4f8d..6571f82 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -12,6 +12,12 @@
extern unsigned long long elfcorehdr_addr;
extern unsigned long long elfcorehdr_size;
+extern int __weak elfcorehdr_alloc(unsigned long long *addr,
+ unsigned long long *size);
+extern void __weak elfcorehdr_free(unsigned long long addr);
+extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
+extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+
extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
unsigned long, int);
--
1.8.2.2
For zfcpdump we can't map the HSA storage because it is only available
via a read interface. Therefore, for the new vmcore mmap feature we have
introduced a new mechanism to create mappings on demand.
This patch introduces a new architecture function remap_oldmem_pfn_range()
that should be used to create mappings with remap_pfn_range() for oldmem
areas that can be directly mapped. For zfcpdump this is everything besides
the HSA memory. For the areas that are not mapped by
remap_oldmem_pfn_range(), a new generic vmcore fault handler
mmap_vmcore_fault() is called.
This handler works as follows (a hypothetical user-space usage sketch
follows the list):
* Get already available or new page from page cache (find_or_create_page)
* Check if /proc/vmcore page is filled with data (PageUptodate)
* If yes:
Return that page
* If no:
Fill page using __vmcore_read(), set PageUptodate, and return page
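To make the trigger path concrete, here is a hypothetical user-space
sketch (not part of this patch set): an access to a page of a
/proc/vmcore mapping that remap_oldmem_pfn_range() left unmapped (on
zfcpdump, the HSA-backed range) faults into mmap_vmcore_fault():

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	int fd = open("/proc/vmcore", O_RDONLY);
	char *map;

	if (fd < 0)
		return 1;
	map = mmap(NULL, 4096, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED)
		return 1;
	/* This read may fault; the kernel then fills the page via
	 * __read_vmcore() in mmap_vmcore_fault(). */
	printf("first byte: 0x%02x\n", (unsigned char) map[0]);
	munmap(map, 4096);
	close(fd);
	return 0;
}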
Signed-off-by: Michael Holzheu <[email protected]>
---
fs/proc/vmcore.c | 84 +++++++++++++++++++++++++++++++++++++++++-----
include/linux/crash_dump.h | 3 ++
2 files changed, 79 insertions(+), 8 deletions(-)
diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index c28189c..77312c7 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
#include <linux/crash_dump.h>
#include <linux/list.h>
#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
#include <asm/uaccess.h>
#include <asm/io.h>
#include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
return read_from_oldmem(buf, count, ppos, 0);
}
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+ if (userbuf) {
+ if (copy_to_user(target, src, size))
+ return -EFAULT;
+ } else {
+ memcpy(target, src, size);
+ }
+ return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
- size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+ int userbuf)
{
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
- if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+ if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
- if (copy_to_user(buffer, kaddr, tsz))
+ if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
if (*fpos < m->offset + m->size) {
tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
start = m->paddr + *fpos - m->offset;
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
@@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
return acc;
}
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+ size_t buflen, loff_t *fpos)
+{
+ return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t index = vmf->pgoff;
+ struct page *page;
+ loff_t src;
+ char *buf;
+ int rc;
+
+ page = find_or_create_page(mapping, index, GFP_KERNEL);
+ if (!page)
+ return VM_FAULT_OOM;
+ if (!PageUptodate(page)) {
+ src = index << PAGE_CACHE_SHIFT;
+ buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
+ rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
+ if (rc < 0) {
+ unlock_page(page);
+ page_cache_release(page);
+ return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+ }
+ SetPageUptodate(page);
+ }
+ unlock_page(page);
+ vmf->page = page;
+ return 0;
+}
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+ .fault = mmap_vmcore_fault,
+};
+
static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +309,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_ops = &vmcore_mmap_ops;
len = 0;
@@ -283,9 +351,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
- if (remap_pfn_range(vma, vma->vm_start + len,
- paddr >> PAGE_SHIFT, tsz,
- vma->vm_page_prot))
+ if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot))
goto fail;
size -= tsz;
start += tsz;
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index 6571f82..fe68a5a 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -17,6 +17,9 @@ extern int __weak elfcorehdr_alloc(unsigned long long *addr,
extern void __weak elfcorehdr_free(unsigned long long addr);
extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+extern int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot);
extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
unsigned long, int);
--
1.8.2.2
On Mon, Jul 01, 2013 at 09:32:35PM +0200, Michael Holzheu wrote:
> Currently for s390 we create the ELF core header in the 2nd kernel
> with a small trick. We relocate the addresses in the ELF header in
> a way that for the /proc/vmcore code it seems to be in the 1st kernel
> (old) memory and read_from_oldmem() returns the correct data.
> This allows the /proc/vmcore code to use the ELF header in the
> 2nd kernel.
>
> This patch now exchanges the old mechanism with the new and much
> cleaner function call override feature that now officially allows
> creating the ELF core header in the 2nd kernel.
>
> To use the new feature the following functions have to be defined
> by the architecture backend code to read from new memory:
>
> * elfcorehdr_alloc: Allocate ELF header
> * elfcorehdr_free: Free the memory of the ELF header
> * elfcorehdr_read: Read from ELF header
> * elfcorehdr_read_notes: Read from ELF notes
>
> Signed-off-by: Michael Holzheu <[email protected]>
Looks good to me.
Acked-by: Vivek Goyal <[email protected]>
Vivek
> ---
> fs/proc/vmcore.c | 61 ++++++++++++++++++++++++++++++++++++++--------
> include/linux/crash_dump.h | 6 +++++
> 2 files changed, 57 insertions(+), 10 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 0f6db52..c28189c 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -123,6 +123,36 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
> return read;
> }
>
> +/*
> + * Architectures may override this function to allocate ELF header in 2nd kernel
> + */
> +int __weak elfcorehdr_alloc(unsigned long long *addr, unsigned long long *size)
> +{
> + return 0;
> +}
> +
> +/*
> + * Architectures may override this function to free header
> + */
> +void __weak elfcorehdr_free(unsigned long long addr)
> +{}
> +
> +/*
> + * Architectures may override this function to read from ELF header
> + */
> +ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> +{
> + return read_from_oldmem(buf, count, ppos, 0);
> +}
> +
> +/*
> + * Architectures may override this function to read from notes sections
> + */
> +ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
> +{
> + return read_from_oldmem(buf, count, ppos, 0);
> +}
> +
> /* Read from the ELF header and then the crash dump. On error, negative value is
> * returned otherwise number of bytes read are returned.
> */
> @@ -322,7 +352,7 @@ static int __init update_note_header_size_elf64(const Elf64_Ehdr *ehdr_ptr)
> notes_section = kmalloc(max_sz, GFP_KERNEL);
> if (!notes_section)
> return -ENOMEM;
> - rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
> + rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
> if (rc < 0) {
> kfree(notes_section);
> return rc;
> @@ -409,7 +439,8 @@ static int __init copy_notes_elf64(const Elf64_Ehdr *ehdr_ptr, char *notes_buf)
> if (phdr_ptr->p_type != PT_NOTE)
> continue;
> offset = phdr_ptr->p_offset;
> - rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, &offset, 0);
> + rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
> + &offset);
> if (rc < 0)
> return rc;
> notes_buf += phdr_ptr->p_memsz;
> @@ -510,7 +541,7 @@ static int __init update_note_header_size_elf32(const Elf32_Ehdr *ehdr_ptr)
> notes_section = kmalloc(max_sz, GFP_KERNEL);
> if (!notes_section)
> return -ENOMEM;
> - rc = read_from_oldmem(notes_section, max_sz, &offset, 0);
> + rc = elfcorehdr_read_notes(notes_section, max_sz, &offset);
> if (rc < 0) {
> kfree(notes_section);
> return rc;
> @@ -597,7 +628,8 @@ static int __init copy_notes_elf32(const Elf32_Ehdr *ehdr_ptr, char *notes_buf)
> if (phdr_ptr->p_type != PT_NOTE)
> continue;
> offset = phdr_ptr->p_offset;
> - rc = read_from_oldmem(notes_buf, phdr_ptr->p_memsz, &offset, 0);
> + rc = elfcorehdr_read_notes(notes_buf, phdr_ptr->p_memsz,
> + &offset);
> if (rc < 0)
> return rc;
> notes_buf += phdr_ptr->p_memsz;
> @@ -793,7 +825,7 @@ static int __init parse_crash_elf64_headers(void)
> addr = elfcorehdr_addr;
>
> /* Read Elf header */
> - rc = read_from_oldmem((char*)&ehdr, sizeof(Elf64_Ehdr), &addr, 0);
> + rc = elfcorehdr_read((char *)&ehdr, sizeof(Elf64_Ehdr), &addr);
> if (rc < 0)
> return rc;
>
> @@ -820,7 +852,7 @@ static int __init parse_crash_elf64_headers(void)
> if (!elfcorebuf)
> return -ENOMEM;
> addr = elfcorehdr_addr;
> - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
> + rc = elfcorehdr_read(elfcorebuf, elfcorebuf_sz_orig, &addr);
> if (rc < 0)
> goto fail;
>
> @@ -849,7 +881,7 @@ static int __init parse_crash_elf32_headers(void)
> addr = elfcorehdr_addr;
>
> /* Read Elf header */
> - rc = read_from_oldmem((char*)&ehdr, sizeof(Elf32_Ehdr), &addr, 0);
> + rc = elfcorehdr_read((char *)&ehdr, sizeof(Elf32_Ehdr), &addr);
> if (rc < 0)
> return rc;
>
> @@ -875,7 +907,7 @@ static int __init parse_crash_elf32_headers(void)
> if (!elfcorebuf)
> return -ENOMEM;
> addr = elfcorehdr_addr;
> - rc = read_from_oldmem(elfcorebuf, elfcorebuf_sz_orig, &addr, 0);
> + rc = elfcorehdr_read(elfcorebuf, elfcorebuf_sz_orig, &addr);
> if (rc < 0)
> goto fail;
>
> @@ -902,7 +934,7 @@ static int __init parse_crash_elf_headers(void)
> int rc=0;
>
> addr = elfcorehdr_addr;
> - rc = read_from_oldmem(e_ident, EI_NIDENT, &addr, 0);
> + rc = elfcorehdr_read(e_ident, EI_NIDENT, &addr);
> if (rc < 0)
> return rc;
> if (memcmp(e_ident, ELFMAG, SELFMAG) != 0) {
> @@ -935,7 +967,14 @@ static int __init vmcore_init(void)
> {
> int rc = 0;
>
> - /* If elfcorehdr= has been passed in cmdline, then capture the dump.*/
> + /* Allow architectures to allocate ELF header in 2nd kernel */
> + rc = elfcorehdr_alloc(&elfcorehdr_addr, &elfcorehdr_size);
> + if (rc)
> + return rc;
> + /*
> + * If elfcorehdr= has been passed in cmdline or created in 2nd kernel,
> + * then capture the dump.
> + */
> if (!(is_vmcore_usable()))
> return rc;
> rc = parse_crash_elf_headers();
> @@ -943,6 +982,8 @@ static int __init vmcore_init(void)
> pr_warn("Kdump: vmcore not initialized\n");
> return rc;
> }
> + elfcorehdr_free(elfcorehdr_addr);
> + elfcorehdr_addr = ELFCORE_ADDR_ERR;
>
> proc_vmcore = proc_create("vmcore", S_IRUSR, NULL, &proc_vmcore_operations);
> if (proc_vmcore)
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 37e4f8d..6571f82 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -12,6 +12,12 @@
> extern unsigned long long elfcorehdr_addr;
> extern unsigned long long elfcorehdr_size;
>
> +extern int __weak elfcorehdr_alloc(unsigned long long *addr,
> + unsigned long long *size);
> +extern void __weak elfcorehdr_free(unsigned long long addr);
> +extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
> +extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
> +
> extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
> unsigned long, int);
>
> --
> 1.8.2.2
On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> For zfcpdump we can't map the HSA storage because it is only available
> via a read interface. Therefore, for the new vmcore mmap feature we have
> introduced a new mechanism to create mappings on demand.
>
> This patch introduces a new architecture function remap_oldmem_pfn_range()
> that should be used to create mappings with remap_pfn_range() for oldmem
> areas that can be directly mapped. For zfcpdump this is everything besides
> the HSA memory. For the areas that are not mapped by
> remap_oldmem_pfn_range(), a new generic vmcore fault handler
> mmap_vmcore_fault() is called.
>
> This handler works as follows:
>
> * Get already available or new page from page cache (find_or_create_page)
> * Check if /proc/vmcore page is filled with data (PageUptodate)
> * If yes:
> Return that page
> * If no:
> Fill page using __vmcore_read(), set PageUptodate, and return page
>
> Signed-off-by: Michael Holzheu <[email protected]>
In general vmcore related changes look fine to me. I am not very familiar
with the logic of finding pages in page cache and using page uptodate
flag.
Hatayama, can you please review it.
Acked-by: Vivek Goyal <[email protected]>
Thanks
Vivek
> ---
> fs/proc/vmcore.c | 84 +++++++++++++++++++++++++++++++++++++++++-----
> include/linux/crash_dump.h | 3 ++
> 2 files changed, 79 insertions(+), 8 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index c28189c..77312c7 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -21,6 +21,7 @@
> #include <linux/crash_dump.h>
> #include <linux/list.h>
> #include <linux/vmalloc.h>
> +#include <linux/pagemap.h>
> #include <asm/uaccess.h>
> #include <asm/io.h>
> #include "internal.h"
> @@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos)
> return read_from_oldmem(buf, count, ppos, 0);
> }
>
> +/*
> + * Architectures may override this function to map oldmem
> + */
> +int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> + unsigned long from, unsigned long pfn,
> + unsigned long size, pgprot_t prot)
> +{
> + return remap_pfn_range(vma, from, pfn, size, prot);
> +}
> +
> +/*
> + * Copy to either kernel or user space
> + */
> +static int copy_to(void *target, void *src, size_t size, int userbuf)
> +{
> + if (userbuf) {
> + if (copy_to_user(target, src, size))
> + return -EFAULT;
> + } else {
> + memcpy(target, src, size);
> + }
> + return 0;
> +}
> +
> /* Read from the ELF header and then the crash dump. On error, negative value is
> * returned otherwise number of bytes read are returned.
> */
> -static ssize_t read_vmcore(struct file *file, char __user *buffer,
> - size_t buflen, loff_t *fpos)
> +static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
> + int userbuf)
> {
> ssize_t acc = 0, tmp;
> size_t tsz;
> @@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> /* Read ELF core header */
> if (*fpos < elfcorebuf_sz) {
> tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
> - if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
> + if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
> return -EFAULT;
> buflen -= tsz;
> *fpos += tsz;
> @@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
>
> tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
> kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
> - if (copy_to_user(buffer, kaddr, tsz))
> + if (copy_to(buffer, kaddr, tsz, userbuf))
> return -EFAULT;
> buflen -= tsz;
> *fpos += tsz;
> @@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> if (*fpos < m->offset + m->size) {
> tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
> start = m->paddr + *fpos - m->offset;
> - tmp = read_from_oldmem(buffer, tsz, &start, 1);
> + tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
> if (tmp < 0)
> return tmp;
> buflen -= tsz;
> @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> + size_t buflen, loff_t *fpos)
> +{
> + return __read_vmcore(buffer, buflen, fpos, 1);
> +}
> +
> +/*
> + * The vmcore fault handler uses the page cache and fills data using the
> + * standard __vmcore_read() function.
> + */
> +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + pgoff_t index = vmf->pgoff;
> + struct page *page;
> + loff_t src;
> + char *buf;
> + int rc;
> +
> + page = find_or_create_page(mapping, index, GFP_KERNEL);
> + if (!page)
> + return VM_FAULT_OOM;
> + if (!PageUptodate(page)) {
> + src = index << PAGE_CACHE_SHIFT;
> + buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
> + rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
> + if (rc < 0) {
> + unlock_page(page);
> + page_cache_release(page);
> + return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + }
> + SetPageUptodate(page);
> + }
> + unlock_page(page);
> + vmf->page = page;
> + return 0;
> +}
> +
> +static const struct vm_operations_struct vmcore_mmap_ops = {
> + .fault = mmap_vmcore_fault,
> +};
> +
> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -242,6 +309,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> vma->vm_flags |= VM_MIXEDMAP;
> + vma->vm_ops = &vmcore_mmap_ops;
>
> len = 0;
>
> @@ -283,9 +351,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> tsz = min_t(size_t, m->offset + m->size - start, size);
> paddr = m->paddr + start - m->offset;
> - if (remap_pfn_range(vma, vma->vm_start + len,
> - paddr >> PAGE_SHIFT, tsz,
> - vma->vm_page_prot))
> + if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot))
> goto fail;
> size -= tsz;
> start += tsz;
> diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
> index 6571f82..fe68a5a 100644
> --- a/include/linux/crash_dump.h
> +++ b/include/linux/crash_dump.h
> @@ -17,6 +17,9 @@ extern int __weak elfcorehdr_alloc(unsigned long long *addr,
> extern void __weak elfcorehdr_free(unsigned long long addr);
> extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
> extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
> +extern int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> + unsigned long from, unsigned long pfn,
> + unsigned long size, pgprot_t prot);
>
> extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
> unsigned long, int);
> --
> 1.8.2.2
On Mon, Jul 01, 2013 at 09:32:36PM +0200, Michael Holzheu wrote:
[..]
> +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> +{
> + void *src = (void *)(unsigned long)*ppos;
> +
> + src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
Seriously, we need to get rid of all this OLDMEM_BASE logic in s390
specific code. For regular kdump, it is no different than x86. Only
special handling required for zfcpdump for HSA region.
Why do we need above. Is it to cover the case where elfcorehdr have
been prepared by user space? Are elf headers initially stored in
reserved region and then swapped. Why do we need to swap these or
why kexec-tools could not take care of swapping it.
Anyway, I think in a separate patch series it is good to cleanup
s390 code for removing all the swap logic related stuff. I can't
wrap my head around it anymore.
Thanks
Vivek
On Tue, 2 Jul 2013 12:23:23 -0400
Vivek Goyal <[email protected]> wrote:
> On Mon, Jul 01, 2013 at 09:32:36PM +0200, Michael Holzheu wrote:
>
> [..]
> > +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> > +{
> > + void *src = (void *)(unsigned long)*ppos;
> > +
> > + src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
>
> Seriously, we need to get rid of all this OLDMEM_BASE logic in s390
> specific code. For regular kdump, it is no different than x86. Only
> special handling required for zfcpdump for HSA region.
>
> Why do we need above. Is it to cover the case where elfcorehdr have
> been prepared by user space? Are elf headers initially stored in
> reserved region and then swapped. Why do we need to swap these or
> why kexec-tools could not take care of swapping it.
I know it is confusing. The "src - OLDMEM_BASE" term is currently
needed because of the swap issue that we have discussed already. We
load the ELF header into reserved memory
[OLDMEM_BASE, OLDMEM_BASE + OLDMEM_SIZE] that is swapped with
[0, OLDMEM_SIZE]. So the ELF header address has to be adjusted.
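A worked example with made-up numbers (the addresses are hypothetical):

	/* Assume OLDMEM_BASE = 0x10000000 and the ELF header was loaded
	 * at OLDMEM_BASE + 0x1000, so elfcorehdr=0x10001000 is passed.
	 * After the [0, OLDMEM_SIZE] <-> [OLDMEM_BASE, OLDMEM_BASE +
	 * OLDMEM_SIZE] swap the header data actually lives at 0x1000,
	 * hence the read has to subtract OLDMEM_BASE: */
	src = elfcorehdr_newmem ? src : src - OLDMEM_BASE; /* 0x10001000 -> 0x1000 */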
Thanks!
Michael
On Tue, 2 Jul 2013 11:42:14 -0400
Vivek Goyal <[email protected]> wrote:
> On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
[snip]
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> > Return that page
> > * If no:
> > Fill page using __vmcore_read(), set PageUptodate, and return page
> >
> > Signed-off-by: Michael Holzheu <[email protected]>
>
> In general vmcore related changes look fine to me. I am not very familiar
> with the logic of finding pages in page cache and using page uptodate
> flag.
>
> Hatayama, can you please review it.
>
> Acked-by: Vivek Goyal <[email protected]>
Hello Vivek,
If Hatayama accepts the patch, should we then send the patch series to
Andrew Morton?
Michael
On Wed, Jul 03, 2013 at 09:59:13AM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 12:23:23 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Mon, Jul 01, 2013 at 09:32:36PM +0200, Michael Holzheu wrote:
> >
> > [..]
> > > +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> > > +{
> > > + void *src = (void *)(unsigned long)*ppos;
> > > +
> > > + src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
> >
> > Seriously, we need to get rid of all this OLDMEM_BASE logic in s390
> > specific code. For regular kdump, it is no different than x86. Only
> > special handling required for zfcpdump for HSA region.
> >
> > Why do we need above. Is it to cover the case where elfcorehdr have
> > been prepared by user space? Are elf headers initially stored in
> > reserved region and then swapped. Why do we need to swap these or
> > why kexec-tools could not take care of swapping it.
>
> I know it is confusing. The "src - OLDMEM_BASE" term is currently
> needed because of the swap issue that we have discussed already. We
> load the ELF header into reserved memory
> [OLDMEM_BASE, OLDMEM_BASE + OLDMEM_SIZE] that is swapped with
> [0, OLDMEM_SIZE]. So the ELF header address has to be adjusted.
Couldn't kexec-tools easily do this swapping and modify the elfcorehdr=
command line accordingly, so that the second kernel does not have to do
the swapping for the ELF headers?
And for PT_LOAD segment swapping, we could use ELF header swapping trick
(again in kexec-tools).
After the above two changes I think all the OLDMEM_BASE magic will go away
from s390 code and only HSA region special handling will remain.
This brings it in line with the x86 code and makes it easier to understand
the s390 code. Otherwise there are so many special corner cases that it is
easy to get lost.
Thanks
Vivek
On Wed, Jul 03, 2013 at 03:59:33PM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 11:42:14 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
>
> [snip]
>
> > > This handler works as follows:
> > >
> > > * Get already available or new page from page cache (find_or_create_page)
> > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > * If yes:
> > > Return that page
> > > * If no:
> > > Fill page using __vmcore_read(), set PageUptodate, and return page
> > >
> > > Signed-off-by: Michael Holzheu <[email protected]>
> >
> > In general vmcore related changes look fine to me. I am not very familiar
> > with the logic of finding pages in page cache and using page uptodate
> > flag.
> >
> > Hatayama, can you please review it.
> >
> > Acked-by: Vivek Goyal <[email protected]>
>
> Hello Vivek,
>
> If Hatayama accepts the patch, should we then send the patch series to
> Andrew Morton?
Yes, once Hatayama reviews the patch, please send it to Andrew Morton
for inclusion of the patches.
Thanks
Vivek
On Wed, 3 Jul 2013 10:15:29 -0400
Vivek Goyal <[email protected]> wrote:
> On Wed, Jul 03, 2013 at 09:59:13AM +0200, Michael Holzheu wrote:
> > On Tue, 2 Jul 2013 12:23:23 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> > > On Mon, Jul 01, 2013 at 09:32:36PM +0200, Michael Holzheu wrote:
> > >
> > > [..]
> > > > +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> > > > +{
> > > > + void *src = (void *)(unsigned long)*ppos;
> > > > +
> > > > + src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
> > >
> > > Seriously, we need to get rid of all this OLDMEM_BASE logic in s390
> > > specific code. For regular kdump, it is no different than x86. Only
> > > special handling required for zfcpdump for HSA region.
> > >
> > > Why do we need above. Is it to cover the case where elfcorehdr have
> > > been prepared by user space? Are elf headers initially stored in
> > > reserved region and then swapped. Why do we need to swap these or
> > > why kexec-tools could not take care of swapping it.
> >
> > I know it is confusing. The "src - OLDMEM_BASE" term is currently
> > needed because of the swap issue that we have discussed already. We
> > load the ELF header into reserved memory
> > [OLDMEM_BASE, OLDMEM_BASE + OLDMEM_SIZE] that is swapped with
> > [0, OLDMEM_SIZE]. So the ELF header address has to be adjusted.
>
> Couldn't kexec-tools easily do this swapping and modify the elfcorehdr=
> command line accordingly, so that the second kernel does not have to do
> the swapping for the ELF headers?
>
> And for PT_LOAD segment swapping, we could use ELF header swapping trick
> (again in kexec-tools).
>
> After the above two changes I think all the OLDMEM_BASE magic will go away
> from s390 code and only HSA region special handling will remain.
>
> This brings it in line with the x86 code and makes it easier to understand
> the s390 code. Otherwise there are so many special corner cases that it is
> easy to get lost.
Right, I agree that it is possible to do the swap in the kexec tool.
Then the kexec tool would load the ELF header at address
"OLDMEM_BASE + addr" (or "crashkernel base + addr") and would specify
the kernel parameter as "elfcorehdr=addr".
Currently we specify "elfcorehdr=OLDMEM_BASE + addr" and the kernel
reverses the swap.
Michael
On Wed, Jul 03, 2013 at 04:39:44PM +0200, Michael Holzheu wrote:
> On Wed, 3 Jul 2013 10:15:29 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Wed, Jul 03, 2013 at 09:59:13AM +0200, Michael Holzheu wrote:
> > > On Tue, 2 Jul 2013 12:23:23 -0400
> > > Vivek Goyal <[email protected]> wrote:
> > >
> > > > On Mon, Jul 01, 2013 at 09:32:36PM +0200, Michael Holzheu wrote:
> > > >
> > > > [..]
> > > > > +ssize_t elfcorehdr_read(char *buf, size_t count, u64 *ppos)
> > > > > +{
> > > > > + void *src = (void *)(unsigned long)*ppos;
> > > > > +
> > > > > + src = elfcorehdr_newmem ? src : src - OLDMEM_BASE;
> > > >
> > > > Seriously, we need to get rid of all this OLDMEM_BASE logic in s390
> > > > specific code. For regular kdump, it is no different than x86. Only
> > > > special handling required for zfcpdump for HSA region.
> > > >
> > > > Why do we need above. Is it to cover the case where elfcorehdr have
> > > > been prepared by user space? Are elf headers initially stored in
> > > > reserved region and then swapped. Why do we need to swap these or
> > > > why kexec-tools could not take care of swapping it.
> > >
> > > I know it is confusing. The "src - OLDMEM_BASE" term is currently
> > > needed because of the swap issue that we have discussed already. We
> > > load the ELF header into reserved memory
> > > [OLDMEM_BASE, OLDMEM_BASE + OLDMEM_SIZE] that is swapped with
> > > [0, OLDMEM_SIZE]. So the ELF header address has to be adjusted.
> >
> > Couldn't kexec-tools easily do this swapping and modify the elfcorehdr=
> > command line accordingly, so that the second kernel does not have to do
> > the swapping for the ELF headers?
> >
> > And for PT_LOAD segment swapping, we could use ELF header swapping trick
> > (again in kexec-tools).
> >
> > After the above two changes I think all the OLDMEM_BASE magic will go away
> > from s390 code and only HSA region special handling will remain.
> >
> > This brings it in line with the x86 code and makes it easier to understand
> > the s390 code. Otherwise there are so many special corner cases that it is
> > easy to get lost.
>
> Right, I agree that it is possible to do the swap in the kexec tool.
> Then the kexec tool would load the ELF header at address
> "OLDMEM_BASE + addr" (or "crashkernel base + addr") and would specify
> the kernel parameter as "elfcorehdr=addr".
Exactly. Purgatory does the swap, and kexec-tools prepares the purgatory,
so it knows that the swap will take place and can pass elfcorehdr=addr
instead of elfcorehdr=OLDMEM_BASE + addr.
Hardcoding this knowledge in the second kernel is not a very good idea.
Thanks
Vivek
(2013/07/02 4:32), Michael Holzheu wrote:
> For zfcpdump we can't map the HSA storage because it is only available
> via a read interface. Therefore, for the new vmcore mmap feature we have
> introduced a new mechanism to create mappings on demand.
>
> This patch introduces a new architecture function remap_oldmem_pfn_range()
> that should be used to create mappings with remap_pfn_range() for oldmem
> areas that can be directly mapped. For zfcpdump this is everything besides
> the HSA memory. For the areas that are not mapped by
> remap_oldmem_pfn_range(), a new generic vmcore fault handler
> mmap_vmcore_fault() is called.
>
This fault handler is only for an s390-specific issue. Other architectures
don't need it for the time being.
Also, for the same reason, I'm doing this review based on the source code
only. I cannot run the fault handler on a meaningful system, since this is
currently s390 only.
I'm also concerned that the fault handler covers the full range of vmcore,
which could hide some kind of mmap() bug that results in a page fault.
So, the fault handler should be enclosed by ifdef CONFIG_S390 for the time
being.
> This handler works as follows:
>
> * Get already available or new page from page cache (find_or_create_page)
> * Check if /proc/vmcore page is filled with data (PageUptodate)
> * If yes:
> Return that page
> * If no:
> Fill page using __vmcore_read(), set PageUptodate, and return page
>
It seems good to me on this page-in logic.
<cut>
> @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> return acc;
> }
>
> +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> + size_t buflen, loff_t *fpos)
> +{
> + return __read_vmcore(buffer, buflen, fpos, 1);
> +}
> +
> +/*
> + * The vmcore fault handler uses the page cache and fills data using the
> + * standard __vmcore_read() function.
> + */
Could you describe the use case of this fault handler on s390?
> +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct address_space *mapping = vma->vm_file->f_mapping;
> + pgoff_t index = vmf->pgoff;
> + struct page *page;
> + loff_t src;
> + char *buf;
> + int rc;
> +
You should check whether the faulting address points to a valid range.
If the fault happens on an invalid range, return VM_FAULT_SIGBUS.
In the s390 case, I think any range except the HSA should be treated as
invalid.
> + page = find_or_create_page(mapping, index, GFP_KERNEL);
> + if (!page)
> + return VM_FAULT_OOM;
> + if (!PageUptodate(page)) {
> + src = index << PAGE_CACHE_SHIFT;
src = (loff_t)index << PAGE_CACHE_SHIFT;
loff_t is long long while index is unsigned long.
On s390 both might have the same byte length.
Also, I prefer "offset" to "src", but this is a minor suggestion.
> + buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
I found the page_to_virt() macro.
> + rc = __read_vmcore(buf, PAGE_SIZE, &src, 0);
> + if (rc < 0) {
> + unlock_page(page);
> + page_cache_release(page);
> + return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
> + }
> + SetPageUptodate(page);
> + }
> + unlock_page(page);
> + vmf->page = page;
> + return 0;
> +}
> +
> +static const struct vm_operations_struct vmcore_mmap_ops = {
> + .fault = mmap_vmcore_fault,
> +};
> +
> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -242,6 +309,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
> vma->vm_flags |= VM_MIXEDMAP;
> + vma->vm_ops = &vmcore_mmap_ops;
>
> len = 0;
>
--
Thanks.
HATAYAMA, Daisuke
On Mon, 08 Jul 2013 14:32:09 +0900
HATAYAMA Daisuke <[email protected]> wrote:
> (2013/07/02 4:32), Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we have
> > to introduce a new mechanism to create mappings on demand.
> >
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> > a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
> >
> >
>
> This fault handler addresses an s390-specific issue. Other architectures don't need
> it for the time being.
>
> Also, for the same reason, I'm doing this review based on the source code only:
> I cannot run the fault handler on a meaningful system, which is currently s390 only.
You can test the code on other architectures if you do not map anything in advance.
For example you could just "return 0" in remap_oldmem_pfn_range():
/*
* Architectures may override this function to map oldmem
*/
int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
unsigned long from, unsigned long pfn,
unsigned long size, pgprot_t prot)
{
return 0;
}
In that case the new mechanism would be used for all pages.
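With that stub in place, a small user space test like the following sketch could
exercise the fault handler for the touched pages. The offset and length are
arbitrary placeholder values; the offset should fall inside a dumped memory
region (the ELF header area is mapped separately and would not fault):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 16 * 4096;			/* arbitrary test size */
	off_t offset = 16 * 1024 * 1024;	/* placeholder, page aligned */
	unsigned long sum = 0;
	unsigned char *p;
	size_t i;
	int fd;

	fd = open("/proc/vmcore", O_RDONLY);
	if (fd < 0)
		return 1;
	p = mmap(NULL, len, PROT_READ, MAP_PRIVATE, fd, offset);
	if (p == MAP_FAILED)
		return 1;
	for (i = 0; i < len; i++)	/* every page goes through mmap_vmcore_fault() */
		sum += p[i];
	printf("sum: %lu\n", sum);
	munmap(p, len);
	close(fd);
	return 0;
}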
>
> I'm also concerned that the fault handler covers the full range of vmcore, which
> could hide some kind of mmap() bug that results in a page fault.
>
> So, the fault handler should be enclosed in #ifdef CONFIG_S390 for the time being.
I personally do not like that, but if Vivek and you prefer this, of course we
can do that.
What about something like:
#ifdef CONFIG_S390
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
}
#else
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
BUG();
}
#endif
>
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> > Return that page
> > * If no:
> > Fill page using __vmcore_read(), set PageUptodate, and return page
> >
>
> This page-in logic looks good to me.
>
> <cut>
> > @@ -225,6 +250,48 @@ static ssize_t read_vmcore(struct file *file, char __user *buffer,
> > return acc;
> > }
> >
> > +static ssize_t read_vmcore(struct file *file, char __user *buffer,
> > + size_t buflen, loff_t *fpos)
> > +{
> > + return __read_vmcore(buffer, buflen, fpos, 1);
> > +}
> > +
> > +/*
> > + * The vmcore fault handler uses the page cache and fills data using the
> > + * standard __vmcore_read() function.
> > + */
>
> Could you describe the use case of this fault handler on s390?
ok.
>
> > +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > +{
> > + struct address_space *mapping = vma->vm_file->f_mapping;
> > + pgoff_t index = vmf->pgoff;
> > + struct page *page;
> > + loff_t src;
> > + char *buf;
> > + int rc;
> > +
>
> You should check whether the faulting address points to a valid range.
> If the fault happens on an invalid range, return VM_FAULT_SIGBUS.
>
> In the s390 case, I think the range except for the HSA should be treated as invalid.
>
Hmmm, this would require another architecture backend interface. If we get a fault
for a page outside of the HSA, this would be a programming error in our code. I am not
sure we should introduce new architecture interfaces just for sanity checks.
> > + page = find_or_create_page(mapping, index, GFP_KERNEL);
> > + if (!page)
> > + return VM_FAULT_OOM;
> > + if (!PageUptodate(page)) {
> > + src = index << PAGE_CACHE_SHIFT;
>
> src = (loff_t)index << PAGE_CACHE_SHIFT;
>
> loff_t is long long while index is unsigned long.
Ok.
> On s390 both might have the same byte length.
On s390x both are 64 bit. On our 32-bit s390 architecture long long is 64 bit
and unsigned long is 32 bit.
> Also, I prefer offset to src, but this is a minor suggestion.
Yes, I agree.
>
> > + buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>
> I found a page_to_virt() macro.
Looks like page_to_virt() is not usable on most architectures, and probably
pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately it is also not
defined on s390.
But when I discussed your comment with Martin, we found out that the current
buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
is not correct on x86. It should be:
buf = __va((page_to_pfn(page) << PAGE_SHIFT));
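(As an aside, for non-highmem pages the generic page_address() helper should
yield the same address:

	buf = page_address(page);

but the patch below sticks with the explicit __va() variant.)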
So would the following patch be ok for you?
---
fs/proc/vmcore.c | 94 +++++++++++++++++++++++++++++++++++++++++----
include/linux/crash_dump.h | 3 +
2 files changed, 89 insertions(+), 8 deletions(-)
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
#include <linux/crash_dump.h>
#include <linux/list.h>
#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
#include <asm/uaccess.h>
#include <asm/io.h>
#include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
return read_from_oldmem(buf, count, ppos, 0);
}
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+ if (userbuf) {
+ if (copy_to_user(target, src, size))
+ return -EFAULT;
+ } else {
+ memcpy(target, src, size);
+ }
+ return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
- size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+ int userbuf)
{
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
- if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+ if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
- if (copy_to_user(buffer, kaddr, tsz))
+ if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *
if (*fpos < m->offset + m->size) {
tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
start = m->paddr + *fpos - m->offset;
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
@@ -225,6 +250,58 @@ static ssize_t read_vmcore(struct file *
return acc;
}
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+ size_t buflen, loff_t *fpos)
+{
+ return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+#ifdef CONFIG_S390
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ *
+ * On s390 the fault handler is used for memory regions that can't be mapped
+ * directly with remap_pfn_range().
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t index = vmf->pgoff;
+ struct page *page;
+ loff_t offset;
+ char *buf;
+ int rc;
+
+ page = find_or_create_page(mapping, index, GFP_KERNEL);
+ if (!page)
+ return VM_FAULT_OOM;
+ if (!PageUptodate(page)) {
+ offset = (loff_t) index << PAGE_CACHE_SHIFT;
+ buf = __va((page_to_pfn(page) << PAGE_SHIFT));
+ rc = __read_vmcore(buf, PAGE_SIZE, &offset, 0);
+ if (rc < 0) {
+ unlock_page(page);
+ page_cache_release(page);
+ return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+ }
+ SetPageUptodate(page);
+ }
+ unlock_page(page);
+ vmf->page = page;
+ return 0;
+}
+#else
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ BUG();
+}
+#endif
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+ .fault = mmap_vmcore_fault,
+};
+
static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +319,7 @@ static int mmap_vmcore(struct file *file
vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_ops = &vmcore_mmap_ops;
len = 0;
@@ -283,9 +361,9 @@ static int mmap_vmcore(struct file *file
tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
- if (remap_pfn_range(vma, vma->vm_start + len,
- paddr >> PAGE_SHIFT, tsz,
- vma->vm_page_prot))
+ if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot))
goto fail;
size -= tsz;
start += tsz;
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -17,6 +17,9 @@ extern int __weak elfcorehdr_alloc(unsig
extern void __weak elfcorehdr_free(unsigned long long addr);
extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+extern int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot);
extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
unsigned long, int);
On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> On Mon, 08 Jul 2013 14:32:09 +0900
> HATAYAMA Daisuke <[email protected]> wrote:
>
> > (2013/07/02 4:32), Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > to introduce a new mechanism to create mappings on demand.
> > >
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> > > a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
> > >
> > >
> >
> > This fault handler addresses an s390-specific issue. Other architectures don't need
> > it for the time being.
> >
> > Also, for the same reason, I'm doing this review based on the source code only:
> > I cannot run the fault handler on a meaningful system, which is currently s390 only.
>
> You can test the code on other architectures if you do not map anything in advance.
> For example you could just "return 0" in remap_oldmem_pfn_range():
>
> /*
> * Architectures may override this function to map oldmem
> */
> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> unsigned long from, unsigned long pfn,
> unsigned long size, pgprot_t prot)
> {
> return 0;
> }
>
> In that case the new mechanism would be used for all pages.
>
> >
> > I'm also concerned that the fault handler covers the full range of vmcore, which
> > could hide some kind of mmap() bug that results in a page fault.
> >
> > So, the fault handler should be enclosed in #ifdef CONFIG_S390 for the time being.
>
> I personally do not like that, but if Vivek and you prefer this, of course we
> can do that.
>
> What about something like:
>
> #ifdef CONFIG_S390
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> }
> #else
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> BUG();
> }
> #endif
I personally prefer not to special-case it for s390 only and to let the
handler be generic.
If there is a bug in remap_oldmem_pfn_range(), the only side effect is that
we will fault in the page when it is accessed, and that will be slow. BUG()
sounds excessive. At most it could be WARN_ONCE().
In regular cases for x86, this path should not even be hit. So special-casing
it to detect issues with remap_oldmem_pfn_range() does not sound very good
to me. I would rather leave it as it is, and if there are bugs and mmap()
slows down, then somebody needs to debug it.
Thanks
Vivek
(2013/07/08 18:28), Michael Holzheu wrote:
> On Mon, 08 Jul 2013 14:32:09 +0900
> HATAYAMA Daisuke <[email protected]> wrote:
>
>> (2013/07/02 4:32), Michael Holzheu wrote:
>>> For zfcpdump we can't map the HSA storage because it is only available
>>> via a read interface. Therefore, for the new vmcore mmap feature we have
>>> to introduce a new mechanism to create mappings on demand.
>>>
>>> This patch introduces a new architecture function remap_oldmem_pfn_range()
>>> that should be used to create mappings with remap_pfn_range() for oldmem
>>> areas that can be directly mapped. For zfcpdump this is everything besides
>>> the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
>>> a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
>>>
>>
>> This fault handler addresses an s390-specific issue. Other architectures don't need
>> it for the time being.
>>
>> Also, for the same reason, I'm doing this review based on the source code only:
>> I cannot run the fault handler on a meaningful system, which is currently s390 only.
>
> You can test the code on other architectures if you do not map anything in advance.
> For example you could just "return 0" in remap_oldmem_pfn_range():
>
> /*
> * Architectures may override this function to map oldmem
> */
> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
> unsigned long from, unsigned long pfn,
> unsigned long size, pgprot_t prot)
> {
> return 0;
> }
>
> In that case the new mechanism would be used for all pages.
>
I meant without modifying the source code at all. You are saying I need to define some function.
<cut>
>>
>>> +static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> +{
>>> + struct address_space *mapping = vma->vm_file->f_mapping;
>>> + pgoff_t index = vmf->pgoff;
>>> + struct page *page;
>>> + loff_t src;
>>> + char *buf;
>>> + int rc;
>>> +
>>
>> You should check whether the faulting address points to a valid range.
>> If the fault happens on an invalid range, return VM_FAULT_SIGBUS.
>>
>> In the s390 case, I think the range except for the HSA should be treated as invalid.
>>
>
> Hmmm, this would require another architecture backend interface. If we get a fault
> for a page outside of the HSA, this would be a programming error in our code. I am not
> sure we should introduce new architecture interfaces just for sanity checks.
>
I think you need to introduce the backend interface since it's a bug if it happens.
The current generic implementation hides such an erroneous path.
I also don't think it's a big change from this version since you are already
about to introduce several backend interfaces.
>>> + page = find_or_create_page(mapping, index, GFP_KERNEL);
>>> + if (!page)
>>> + return VM_FAULT_OOM;
>>> + if (!PageUptodate(page)) {
>>> + src = index << PAGE_CACHE_SHIFT;
>>
>> src = (loff_t)index << PAGE_CACHE_SHIFT;
>>
>> loff_t is long long while index is unsigned long.
>
> Ok.
>
>> On s390 both might have the same byte length.
>
> On s390x both are 64 bit. On our 32-bit s390 architecture long long is 64 bit
> and unsigned long is 32 bit.
>
>> Also, I prefer offset to src, but this is a minor suggestion.
>
> Yes, I agree.
>
>>
>>> + buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>>
>> I found a page_to_virt() macro.
>
> Looks like page_to_virt() is not usable on most architectures, and probably
> pfn_to_kaddr(pfn) would be the correct thing here. Unfortunately it is also not
> defined on s390.
>
> But when I discussed your comment with Martin, we found out that the current
>
> buf = (void *) (page_to_pfn(page) << PAGE_SHIFT);
>
> is not correct on x86. It should be:
>
> buf = __va((page_to_pfn(page) << PAGE_SHIFT));
>
This seems OK to me.
--
Thanks.
HATAYAMA, Daisuke
(2013/07/08 23:28), Vivek Goyal wrote:
> On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
>> On Mon, 08 Jul 2013 14:32:09 +0900
>> HATAYAMA Daisuke <[email protected]> wrote:
>>
>>> (2013/07/02 4:32), Michael Holzheu wrote:
>>>> For zfcpdump we can't map the HSA storage because it is only available
>>>> via a read interface. Therefore, for the new vmcore mmap feature we have
>>>> to introduce a new mechanism to create mappings on demand.
>>>>
>>>> This patch introduces a new architecture function remap_oldmem_pfn_range()
>>>> that should be used to create mappings with remap_pfn_range() for oldmem
>>>> areas that can be directly mapped. For zfcpdump this is everything besides
>>>> the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
>>>> a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
>>>>
>>>>
>>>
>>> This fault handler addresses an s390-specific issue. Other architectures don't need
>>> it for the time being.
>>>
>>> Also, for the same reason, I'm doing this review based on the source code only:
>>> I cannot run the fault handler on a meaningful system, which is currently s390 only.
>>
>> You can test the code on other architectures if you do not map anything in advance.
>> For example you could just "return 0" in remap_oldmem_pfn_range():
>>
>> /*
>> * Architectures may override this function to map oldmem
>> */
>> int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
>> unsigned long from, unsigned long pfn,
>> unsigned long size, pgprot_t prot)
>> {
>> return 0;
>> }
>>
>> In that case the new mechanism would be used for all pages.
>>
>>>
>>> I'm also concerned that the fault handler covers the full range of vmcore, which
>>> could hide some kind of mmap() bug that results in a page fault.
>>>
>>> So, the fault handler should be enclosed in #ifdef CONFIG_S390 for the time being.
>>
>> I personally do not like that, but if Vivek and you prefer this, of course we
>> can do that.
>>
>> What about something like:
>>
>> #ifdef CONFIG_S390
>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> {
>> ...
>> }
>> #else
>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> {
>> BUG();
>> }
>> #endif
>
> I personally prefer not to special-case it for s390 only and to let the
> handler be generic.
>
> If there is a bug in remap_oldmem_pfn_range(), the only side effect is that
> we will fault in the page when it is accessed, and that will be slow. BUG()
> sounds excessive. At most it could be WARN_ONCE().
>
> In regular cases for x86, this path should not even be hit. So special-casing
> it to detect issues with remap_oldmem_pfn_range() does not sound very good
> to me. I would rather leave it as it is, and if there are bugs and mmap()
> slows down, then somebody needs to debug it.
>
I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.
Would the interface be like this?
[generic]
bool __weak in_valid_fault_range(pgoff_t pgoff)
{
return false;
}
[s390]
bool in_valid_fault_range(pgoff_t pgoff)
{
loff_t offset = (loff_t)pgoff << PAGE_CACHE_SHIFT;
u64 paddr = vmcore_offset_to_paddr(offset);
return paddr < ZFCPDUMP_HSA_SIZE;
}
assuming vmcore_offset_to_paddr() looks up vmcore_list and returns the physical
address corresponding to a given offset in vmcore. I guess this could return an error
value if there's no entry corresponding to the given offset in vmcore_list.
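For reference, a minimal sketch of that helper could look like this, walking
vmcore_list in vmcore.c, whose entries already carry paddr, size, and offset
(the function name and the error convention here are made up):

static u64 vmcore_offset_to_paddr(loff_t offset)
{
	struct vmcore *m;

	/* Find the memory chunk that backs this file offset */
	list_for_each_entry(m, &vmcore_list, list) {
		if (offset >= m->offset && offset < m->offset + m->size)
			return m->paddr + (offset - m->offset);
	}
	/* No chunk found: offset points into the header/notes area or a hole */
	return (u64)-1;
}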
--
Thanks.
HATAYAMA, Daisuke
Hello Hatayama,
On Tue, 09 Jul 2013 14:49:48 +0900
HATAYAMA Daisuke <[email protected]> wrote:
> (2013/07/08 23:28), Vivek Goyal wrote:
> > On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
> >> On Mon, 08 Jul 2013 14:32:09 +0900
> >> HATAYAMA Daisuke <[email protected]> wrote:
[snip]
> > I personally prefer not to special-case it for s390 only and to let the
> > handler be generic.
> >
> > If there is a bug in remap_oldmem_pfn_range(), the only side effect is that
> > we will fault in the page when it is accessed, and that will be slow. BUG()
> > sounds excessive. At most it could be WARN_ONCE().
> >
> > In regular cases for x86, this path should not even be hit. So special-casing
> > it to detect issues with remap_oldmem_pfn_range() does not sound very good
> > to me. I would rather leave it as it is, and if there are bugs and mmap()
> > slows down, then somebody needs to debug it.
> >
>
> I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.
>
> Would the interface be like this?
>
> [generic]
>
> bool __weak in_valid_fault_range(pgoff_t pgoff)
> {
> return false;
> }
>
> [s390]
>
> bool in_valid_fault_range(pgoff_t pgoff)
> {
> loff_t offset = (loff_t)pgoff << PAGE_CACHE_SHIFT;
> u64 paddr = vmcore_offset_to_paddr(offset);
>
> return paddr < ZFCPDUMP_HSA_SIZE;
> }
>
> assuming vmcore_offset_to_paddr() looks up vmcore_list and returns the physical
> address corresponding to a given offset in vmcore. I guess this could return an error
> value if there's no entry corresponding to the given offset in vmcore_list.
I think this is too much code (and overhead) just for checking the correctness of the
kdump mmap implementation.
My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
effect as your suggestion for all architectures besides s390. And for s390 we
take the risk that a programming error would result in poor /proc/vmcore
performance.
So, at least for this patch series I would implement the fault handler as follows:
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
char *buf;
int rc;
#ifndef CONFIG_S390
WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
page = find_or_create_page(mapping, index, GFP_KERNEL);
At this point I have to tell you that we plan another vmcore patch series where
the fault handler might also be called for other architectures. But I think we
should *then* discuss your issue again.
So in order to make progress with this patch series (which is also needed to
make your mmap patches work for s390) I would suggest using the following patch:
---
fs/proc/vmcore.c | 90 +++++++++++++++++++++++++++++++++++++++++----
include/linux/crash_dump.h | 3 +
2 files changed, 85 insertions(+), 8 deletions(-)
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -21,6 +21,7 @@
#include <linux/crash_dump.h>
#include <linux/list.h>
#include <linux/vmalloc.h>
+#include <linux/pagemap.h>
#include <asm/uaccess.h>
#include <asm/io.h>
#include "internal.h"
@@ -153,11 +154,35 @@ ssize_t __weak elfcorehdr_read_notes(cha
return read_from_oldmem(buf, count, ppos, 0);
}
+/*
+ * Architectures may override this function to map oldmem
+ */
+int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
+ * Copy to either kernel or user space
+ */
+static int copy_to(void *target, void *src, size_t size, int userbuf)
+{
+ if (userbuf) {
+ if (copy_to_user(target, src, size))
+ return -EFAULT;
+ } else {
+ memcpy(target, src, size);
+ }
+ return 0;
+}
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
-static ssize_t read_vmcore(struct file *file, char __user *buffer,
- size_t buflen, loff_t *fpos)
+static ssize_t __read_vmcore(char *buffer, size_t buflen, loff_t *fpos,
+ int userbuf)
{
ssize_t acc = 0, tmp;
size_t tsz;
@@ -174,7 +199,7 @@ static ssize_t read_vmcore(struct file *
/* Read ELF core header */
if (*fpos < elfcorebuf_sz) {
tsz = min(elfcorebuf_sz - (size_t)*fpos, buflen);
- if (copy_to_user(buffer, elfcorebuf + *fpos, tsz))
+ if (copy_to(buffer, elfcorebuf + *fpos, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -192,7 +217,7 @@ static ssize_t read_vmcore(struct file *
tsz = min(elfcorebuf_sz + elfnotes_sz - (size_t)*fpos, buflen);
kaddr = elfnotes_buf + *fpos - elfcorebuf_sz;
- if (copy_to_user(buffer, kaddr, tsz))
+ if (copy_to(buffer, kaddr, tsz, userbuf))
return -EFAULT;
buflen -= tsz;
*fpos += tsz;
@@ -208,7 +233,7 @@ static ssize_t read_vmcore(struct file *
if (*fpos < m->offset + m->size) {
tsz = min_t(size_t, m->offset + m->size - *fpos, buflen);
start = m->paddr + *fpos - m->offset;
- tmp = read_from_oldmem(buffer, tsz, &start, 1);
+ tmp = read_from_oldmem(buffer, tsz, &start, userbuf);
if (tmp < 0)
return tmp;
buflen -= tsz;
@@ -225,6 +250,54 @@ static ssize_t read_vmcore(struct file *
return acc;
}
+static ssize_t read_vmcore(struct file *file, char __user *buffer,
+ size_t buflen, loff_t *fpos)
+{
+ return __read_vmcore(buffer, buflen, fpos, 1);
+}
+
+/*
+ * The vmcore fault handler uses the page cache and fills data using the
+ * standard __vmcore_read() function.
+ *
+ * On s390 the fault handler is used for memory regions that can't be mapped
+ * directly with remap_pfn_range().
+ */
+static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct address_space *mapping = vma->vm_file->f_mapping;
+ pgoff_t index = vmf->pgoff;
+ struct page *page;
+ loff_t offset;
+ char *buf;
+ int rc;
+
+#ifndef CONFIG_S390
+ WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
+#endif
+ page = find_or_create_page(mapping, index, GFP_KERNEL);
+ if (!page)
+ return VM_FAULT_OOM;
+ if (!PageUptodate(page)) {
+ offset = (loff_t) index << PAGE_CACHE_SHIFT;
+ buf = __va((page_to_pfn(page) << PAGE_SHIFT));
+ rc = __read_vmcore(buf, PAGE_SIZE, &offset, 0);
+ if (rc < 0) {
+ unlock_page(page);
+ page_cache_release(page);
+ return (rc == -ENOMEM) ? VM_FAULT_OOM : VM_FAULT_SIGBUS;
+ }
+ SetPageUptodate(page);
+ }
+ unlock_page(page);
+ vmf->page = page;
+ return 0;
+}
+
+static const struct vm_operations_struct vmcore_mmap_ops = {
+ .fault = mmap_vmcore_fault,
+};
+
static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
@@ -242,6 +315,7 @@ static int mmap_vmcore(struct file *file
vma->vm_flags &= ~(VM_MAYWRITE | VM_MAYEXEC);
vma->vm_flags |= VM_MIXEDMAP;
+ vma->vm_ops = &vmcore_mmap_ops;
len = 0;
@@ -283,9 +357,9 @@ static int mmap_vmcore(struct file *file
tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
- if (remap_pfn_range(vma, vma->vm_start + len,
- paddr >> PAGE_SHIFT, tsz,
- vma->vm_page_prot))
+ if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot))
goto fail;
size -= tsz;
start += tsz;
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -17,6 +17,9 @@ extern int __weak elfcorehdr_alloc(unsig
extern void __weak elfcorehdr_free(unsigned long long addr);
extern ssize_t __weak elfcorehdr_read(char *buf, size_t count, u64 *ppos);
extern ssize_t __weak elfcorehdr_read_notes(char *buf, size_t count, u64 *ppos);
+extern int __weak remap_oldmem_pfn_range(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot);
extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
(2013/07/10 17:42), Michael Holzheu wrote:
> Hello Hatayama,
>
> On Tue, 09 Jul 2013 14:49:48 +0900
> HATAYAMA Daisuke <[email protected]> wrote:
>
>> (2013/07/08 23:28), Vivek Goyal wrote:
>>> On Mon, Jul 08, 2013 at 11:28:39AM +0200, Michael Holzheu wrote:
>>>> On Mon, 08 Jul 2013 14:32:09 +0900
>>>> HATAYAMA Daisuke <[email protected]> wrote:
>
> [snip]
>
>>> I personally prefer not to special-case it for s390 only and to let the
>>> handler be generic.
>>>
>>> If there is a bug in remap_oldmem_pfn_range(), the only side effect is that
>>> we will fault in the page when it is accessed, and that will be slow. BUG()
>>> sounds excessive. At most it could be WARN_ONCE().
>>>
>>> In regular cases for x86, this path should not even be hit. So special-casing
>>> it to detect issues with remap_oldmem_pfn_range() does not sound very good
>>> to me. I would rather leave it as it is, and if there are bugs and mmap()
>>> slows down, then somebody needs to debug it.
>>>
>>
>> I agree with WARN_ONCE(). Then we can at least notice the bug if it occurs.
>>
>> Would the interface be like this?
>>
>> [generic]
>>
>> bool __weak in_valid_fault_range(pgoff_t pgoff)
>> {
>> return false;
>> }
>>
>> [s390]
>>
>> bool in_valid_fault_range(pgoff_t pgoff)
>> {
>> loff_t offset = (loff_t)pgoff << PAGE_CACHE_SHIFT;
>> u64 paddr = vmcore_offset_to_paddr(offset);
>>
>> return paddr < ZFCPDUMP_HSA_SIZE;
>> }
>>
>> assuming vmcore_offset_to_paddr() looks up vmcore_list and returns the physical
>> address corresponding to a given offset in vmcore. I guess this could return an error
>> value if there's no entry corresponding to the given offset in vmcore_list.
>
> I think this is too much code (and overhead) just for checking the correctness of the
> kdump mmap implementation.
>
> My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
> effect as your suggestion for all architectures besides s390. And for s390 we
> take the risk that a programming error would result in poor /proc/vmcore
> performance.
>
If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
of elements, you can still calculate the range of offsets in /proc/vmcore
corresponding to the HSA during /proc/vmcore initialization.
Also, could you tell me how often and how much of the HSA region is read during crash
dumping? I guess the read to the HSA is done mainly during the early part of the crash
dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
performance is not a big issue.
Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
I don't think it is too much overhead...
> So, at least for this patch series I would implement the fault handler as follows:
>
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> char *buf;
> int rc;
>
> #ifndef CONFIG_S390
> WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> #endif
> page = find_or_create_page(mapping, index, GFP_KERNEL);
>
> At this point I have to tell you that we plan another vmcore patch series where
> the fault handler might also be called for other architectures. But I think we
> should *then* discuss your issue again.
>
Could you explain the plan in more detail? Otherwise I cannot review correctly, since I
don't know whether there's really a use case for this generic fault handler on other
architectures. This is an issue for architectures other than s390, not mine; right now
we don't need it at all.
--
Thanks.
HATAYAMA, Daisuke
On Wed, 10 Jul 2013 18:50:18 +0900
HATAYAMA Daisuke <[email protected]> wrote:
[snip]
> (2013/07/10 17:42), Michael Holzheu wrote:
> > My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
> > effect as your suggestion for all architectures besides s390. And for s390 we
> > take the risk that a programming error would result in poor /proc/vmcore
> > performance.
> >
>
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
> of elements, you can still calculate the range of offsets in /proc/vmcore
> corresponding to the HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much of the HSA region is read during crash
> dumping? I guess the read to the HSA is done mainly during the early part of the crash
> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
> performance is not a big issue.
Currently it is 32 MiB and normally it is read only once.
>
> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
> I don't think it is too much overhead...
I was more concerned about in_valid_fault_range(). But I was most concerned about the
additional interface that introduces more complexity to the code, and that just to
implement a sanity check that in our opinion we don't really need.
And what makes it even worse:
With the current patch series this check is only relevant for s390 :-)
>
> > So, at least for this patch series I would implement the fault handler as follows:
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > ...
> > char *buf;
> > int rc;
> >
> > #ifndef CONFIG_S390
> > WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
> > #endif
> > page = find_or_create_page(mapping, index, GFP_KERNEL);
> >
> > At this point I have to tell you that we plan another vmcore patch series where
> > the fault handler might also be called for other architectures. But I think we
> > should *then* discuss your issue again.
> >
>
> Could you explain the plan in more detail? Otherwise I cannot review correctly, since I
> don't know whether there's really a use case for this generic fault handler on other
> architectures.
> This is an issue for architectures other than s390, not mine; right now we
> don't need it at all.
I would have preferred to do the things one after the other. Otherwise I fear
that this discussion will never come to an end. This patch series is needed
to get zfcpdump running with /proc/vmcore, and for that reason the patch series
makes sense on its own.
But FYI:
The other patch series deals with the problem that we have additional
information for s390 that we want to include in /proc/vmcore. We have a one
byte storage key per memory page. These storage keys are stored in the
s390 firmware and can be read using an s390-specific machine instruction.
We plan to put that information into the ELF notes section. For a 1 TiB
dump we will have 256 MiB of storage keys.
Currently the notes section is preallocated in vmcore.c. Because we do
not want to preallocate so much memory, we would like to read the notes
section on demand, similar to the HSA memory for zfcpdump. To make this
work with your mmap code, we would then also use the fault handler to
get the notes sections on demand. Our current patch does this for all
notes, and therefore the other architectures would then use the fault
handler as well.
One advantage for the common code is that no additional memory has
to be allocated for the notes buffer.
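To illustrate the idea, filling such a notes buffer could look roughly like
this (page_get_storage_key() is the s390 helper around the ISKE instruction;
the function name and the flat buffer layout here are made up):

static void fill_storage_key_notes(unsigned char *buf,
				   unsigned long start_pfn,
				   unsigned long count)
{
	unsigned long i;

	/* One storage key byte per page frame */
	for (i = 0; i < count; i++)
		buf[i] = page_get_storage_key((start_pfn + i) << PAGE_SHIFT);
}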
The storage key patch series is currently not final because it depends
on the zfcpdump patch series.
Michael
On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:
[..]
> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
> of elements, you can still calculate the range of offsets in /proc/vmcore
> corresponding to the HSA during /proc/vmcore initialization.
>
> Also, could you tell me how often and how much of the HSA region is read during crash
> dumping? I guess the read to the HSA is done mainly during the early part of the crash
> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
> performance is not a big issue.
>
> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
> I don't think it is too much overhead...
Hi Hatayama,
I think Michael's proposal of just putting in WARN_ONCE() for non-s390 architectures
sounds reasonable. It is simple and meets your need of being able to detect that
non-s390 architectures don't make use of the mmap() path yet. Introducing
in_valid_fault_range() sounds like overkill to me for this issue.
Thanks
Vivek
(2013/07/10 23:33), Vivek Goyal wrote:
> On Wed, Jul 10, 2013 at 06:50:18PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
>> of elements, you can still calculate the range of offsets in /proc/vmcore
>> corresponding to the HSA during /proc/vmcore initialization.
>>
>> Also, could you tell me how often and how much of the HSA region is read during crash
>> dumping? I guess the read to the HSA is done mainly during the early part of the crash
>> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
>> performance is not a big issue.
>>
>> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
>> I don't think it is too much overhead...
>
> Hi Hatayama,
>
> I think Michael's proposal of just putting in WARN_ONCE() for non-s390 architectures
> sounds reasonable. It is simple and meets your need of being able to detect that
> non-s390 architectures don't make use of the mmap() path yet. Introducing
> in_valid_fault_range() sounds like overkill to me for this issue.
>
> Thanks
> Vivek
>
How about
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
char *buf;
int rc;
#ifndef CONFIG_S390
return VM_FAULT_SIGBUS;
#endif
page = find_or_create_page(mapping, index, GFP_KERNEL);
Considering it again, I don't think WARN_ONCE() is good now. The fact that a fault
occurs on the mmap() region indicates that some kind of buggy situation has occurred
in the process. The process should be killed as soon as possible. If the user still
wants to get the crash dump, he should try again in another process.
--
Thanks.
HATAYAMA, Daisuke
(2013/07/10 20:00), Michael Holzheu wrote:
> On Wed, 10 Jul 2013 18:50:18 +0900
> HATAYAMA Daisuke <[email protected]> wrote:
>
> [snip]
>
>> (2013/07/10 17:42), Michael Holzheu wrote:
>>> My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
>>> effect as your suggestion for all architectures besides s390. And for s390 we
>>> take the risk that a programming error would result in poor /proc/vmcore
>>> performance.
>>>
>>
>> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
>> of elements, you can still calculate the range of offsets in /proc/vmcore
>> corresponding to the HSA during /proc/vmcore initialization.
>>
>> Also, could you tell me how often and how much of the HSA region is read during crash
>> dumping? I guess the read to the HSA is done mainly during the early part of the crash
>> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
>> performance is not a big issue.
>
> Currently it is 32 MiB and normally it is read only once.
>
>>
>> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
>> I don't think it is too much overhead...
>
> I was more concerned about in_valid_fault_range(). But I was most concerned about the
> additional interface that introduces more complexity to the code, and that just to
> implement a sanity check that in our opinion we don't really need.
>
> And what makes it even worse:
>
Your view that the sanity check is unnecessary is wrong. Your design assumes page faults
always happen on the HSA region. If a page fault happens on any other part, i.e. some
point of the mmap()ed region, it means that somehow the page table for that address has
not been created. This is a bug, possibly caused by mmap() itself, page table creation,
other components in the kernel, a bit-flip due to broken hardware, etc. Anyway, the
program cannot detect what kind of bug has occurred. There's no guarantee that the
program runs safely, and of course the same holds for page cache creation. We cannot and
must not expect such a buggy process to behave in an invalid state just as our design
intends. It results in undefined behaviour. The only thing we can do is kill the buggy
process as soon as possible.
> With the current patch series this check is only relevant for s390 :-)
>
>>
>>> So, at least for this patch series I would implement the fault handler as follows:
>>>
>>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> {
>>> ...
>>> char *buf;
>>> int rc;
>>>
>>> #ifndef CONFIG_S390
>>> WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
>>> #endif
>>> page = find_or_create_page(mapping, index, GFP_KERNEL);
>>>
>>> At this point I have to tell you that we plan another vmcore patch series where
>>> the fault handler might also be called for other architectures. But I think we
>>> should *then* discuss your issue again.
>>>
>>
>> Could you explain the plan in more detail? Otherwise I cannot review correctly, since I
>> don't know whether there's really a use case for this generic fault handler on other
>> architectures.
>> This is an issue for architectures other than s390, not mine; right now we
>> don't need it at all.
>
> I would have preferred to do the things one after the other. Otherwise I fear
> that this discussion will never come to an end. This patch series is needed
> to get zfcpdump running with /proc/vmcore, and for that reason the patch series
> makes sense on its own.
>
> But FYI:
>
> The other patch series deals with the problem that we have additional
> information for s390 that we want to include in /proc/vmcore. We have a one
> byte storage key per memory page. These storage keys are stored in the
> s390 firmware and can be read using an s390-specific machine instruction.
> We plan to put that information into the ELF notes section. For a 1 TiB
> dump we will have 256 MiB of storage keys.
>
> Currently the notes section is preallocated in vmcore.c. Because we do
> not want to preallocate so much memory, we would like to read the notes
> section on demand, similar to the HSA memory for zfcpdump. To make this
> work with your mmap code, we would then also use the fault handler to
> get the notes sections on demand. Our current patch does this for all
> notes, and therefore the other architectures would then use the fault
> handler as well.
>
> One advantage for the common code is that no additional memory has
> to be allocated for the notes buffer.
>
> The storage key patch series is currently not final because it depends
> on the zfcpdump patch series.
>
Yes, on-demand memory allocation in the fault handler is definitely suitable
for note sections across all architectures. But the need for the sanity check
is unchanged here. The issue is orthogonal; the check is still needed, just as
explained above.
BTW, the sanity check for note sections is easy: anything outside the note
range gets a SIGBUS.
if (!(elfnote->p_offset <= offset &&
      offset < elfnote->p_offset + elfnote->p_filesz))
	return VM_FAULT_SIGBUS;
I assume elfnote is a pointer to the unique PT_NOTE program header in
the ELF header buffer. The code would be the same on every
architecture.
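For illustration, setting up that pointer once at initialization time could
look like this sketch (assuming the merged ELF header kept in elfcorebuf and
a 64-bit layout; the helper name is made up):

static Elf64_Phdr *elfnote;

static void vmcore_find_note_phdr(void)
{
	Elf64_Ehdr *ehdr = (Elf64_Ehdr *)elfcorebuf;
	Elf64_Phdr *phdr = (Elf64_Phdr *)(elfcorebuf + ehdr->e_phoff);
	int i;

	/* After the headers are merged there is a single PT_NOTE entry */
	for (i = 0; i < ehdr->e_phnum; i++)
		if (phdr[i].p_type == PT_NOTE)
			elfnote = &phdr[i];
}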
--
Thanks.
HATAYAMA, Daisuke
On Sat, 13 Jul 2013 01:02:50 +0900
HATAYAMA Daisuke <[email protected]> wrote:
> (2013/07/10 20:00), Michael Holzheu wrote:
> > On Wed, 10 Jul 2013 18:50:18 +0900
> > HATAYAMA Daisuke <[email protected]> wrote:
> >
> > [snip]
> >
> >> (2013/07/10 17:42), Michael Holzheu wrote:
> >>> My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
> >>> effect as your suggestion for all architectures besides s390. And for s390 we
> >>> take the risk that a programming error would result in poor /proc/vmcore
> >>> performance.
> >>>
> >>
> >> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
> >> of elements, you can still calculate the range of offsets in /proc/vmcore
> >> corresponding to the HSA during /proc/vmcore initialization.
> >>
> >> Also, could you tell me how often and how much of the HSA region is read during crash
> >> dumping? I guess the read to the HSA is done mainly during the early part of the crash
> >> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
> >> performance is not a big issue.
> >
> > Currently it is 32 MiB and normally it is read only once.
> >
> >>
> >> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
> >> I don't think it is too much overhead...
> >
> > I was more concerned about in_valid_fault_range(). But I was most concerned about the
> > additional interface that introduces more complexity to the code, and that just to
> > implement a sanity check that in our opinion we don't really need.
> >
> > And what makes it even worse:
> >
>
> Your view that the sanity check is unnecessary is wrong. Your design assumes page faults
> always happen on the HSA region. If a page fault happens on any other part, i.e. some
> point of the mmap()ed region, it means that somehow the page table for that address has
> not been created. This is a bug, possibly caused by mmap() itself, page table creation,
> other components in the kernel, a bit-flip due to broken hardware, etc. Anyway, the
> program cannot detect what kind of bug has occurred. There's no guarantee that the
> program runs safely, and of course the same holds for page cache creation. We cannot and
> must not expect such a buggy process to behave in an invalid state just as our design
> intends. It results in undefined behaviour. The only thing we can do is kill the buggy
> process as soon as possible.
I don't quite get this point, please bear with me. If you compare the situation before and
after the introduction of the fault handler, the possible error scenarios are almost
identical:
1) If an access is made outside of the mapped memory region, the first level fault handler
(do_exception for s390, __do_page_fault for x86) won't find a vma and will force a SIGSEGV
right away, independent of the existence of a hole and the vmcore fault handler.
2) If there is a hardware bug that corrupts a page table the behaviour depends on how the
entries are corrupted. If the outcome is a valid pte an incorrect memory area will be
accessed, the same with or without the vmcore fault handler. If the corrupted pte is
an invalid pte it can come out as swap pte, file pte, or as empty pte. The behaviour
does not change for swap and file ptes, you will get funny results in both cases.
For empty ptes the new behaviour will call the vmcore fault handler for the address
in question. If the read() function can satisfy the request we will get a page cache
copy of the missing page, if the read function can not satisfy the request it returns
an error which is translated to a SIGBUS.
This new behaviour is IMHO better than the old one, it successfully recovers from at
least one type of corruption. For x86 that would be the case if the page table is
overwritten with zeroes, for s390 a specific bit pattern in the pte is required.
3) In the case of a programming error in regard to remap_pfn_range the new behaviour will
provide page cache copies and the dump will technically be correct. The performance
might suffer a little bit as the CPU will have to create the page cache copies but
compared to the I/O that is involved with writing a dump this is negligible, no?
It seems to me that the warning message you want to see in the fault handler would be
a debugging aid for the developer to see if the mmap() and the remap_pfn_range() calls
match up. Something similar to a single VM_WARN_ON() message would be appropriate, no?
--
blue skies,
Martin.
"Reality continues to ruin my life." - Calvin.
On Tue, 2 Jul 2013 11:42:14 -0400
Vivek Goyal <[email protected]> wrote:
> On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > For zfcpdump we can't map the HSA storage because it is only available
> > via a read interface. Therefore, for the new vmcore mmap feature we have
> > to introduce a new mechanism to create mappings on demand.
> >
> > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > that should be used to create mappings with remap_pfn_range() for oldmem
> > areas that can be directly mapped. For zfcpdump this is everything besides
> > the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> > a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
> >
> >
> > This handler works as follows:
> >
> > * Get already available or new page from page cache (find_or_create_page)
> > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > * If yes:
> > Return that page
> > * If no:
> > Fill page using __vmcore_read(), set PageUptodate, and return page
> >
> > Signed-off-by: Michael Holzheu <[email protected]>
>
> In general vmcore related changes look fine to me. I am not very familiar
> with the logic of finding pages in page cache and using page uptodate
> flag.
>
> Hatayama, can you please review it.
>
> Acked-by: Vivek Goyal <[email protected]>
Hello Vivek and Andrew,
We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently
breaks s390 kdump because of the following two issues:
1) copy_oldmem_page() is now used for copying to vmalloc memory
2) The mmap() implementation is not compatible with the current
s390 crashkernel swap:
See: http://marc.info/?l=kexec&m=136940802511603&w=2
The "kdump: Introduce ELF header in new memory feature" patch series will
fix both issues for s390.
There is one small open discussion left:
http://www.mail-archive.com/[email protected]/msg464856.html
But once we have finished that, would it be possible to get the
patches into 3.11?
Michael
On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:
[..]
> How about
>
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
> char *buf;
> int rc;
>
> #ifndef CONFIG_S390
> return VM_FAULT_SIGBUS;
> #endif
> page = find_or_create_page(mapping, index, GFP_KERNEL);
>
> Considering it again, I don't think WARN_ONCE() is good now. The fact that a fault
> occurs on the mmap() region indicates that some kind of buggy situation has occurred
> in the process. The process should be killed as soon as possible. If the user still
> wants to get the crash dump, he should try again in another process.
I don't understand that. A process should be killed only if there was no
mapping created for the region the process is trying to access.
If there is a mapping but we are trying to fault in the actual contents,
then it is not a problem of the process. The process is accessing a region of
memory which it is supposed to access.
The potential problem here is that remap_pfn_range() did not map everything
it was expected to, so we have to rely on the page fault handler to read
that in. So it is more of a kernel issue and not a process issue, and for
that WARN_ONCE() sounds better?
Vivek
On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote:
> On Tue, 2 Jul 2013 11:42:14 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > > For zfcpdump we can't map the HSA storage because it is only available
> > > via a read interface. Therefore, for the new vmcore mmap feature we have
> > > to introduce a new mechanism to create mappings on demand.
> > >
> > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > areas that can be directly mapped. For zfcpdump this is everything besides
> > > the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> > > a new generic vmcore fault handler, mmap_vmcore_fault(), is called.
> > >
> > >
> > > This handler works as follows:
> > >
> > > * Get already available or new page from page cache (find_or_create_page)
> > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > * If yes:
> > > Return that page
> > > * If no:
> > > Fill page using __vmcore_read(), set PageUptodate, and return page
> > >
> > > Signed-off-by: Michael Holzheu <[email protected]>
> >
> > In general vmcore related changes look fine to me. I am not very familiar
> > with the logic of finding pages in page cache and using page uptodate
> > flag.
> >
> > Hatayama, can you please review it.
> >
> > Acked-by: Vivek Goyal <[email protected]>
>
> Hello Vivek and Andrew,
>
> We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently
> breaks s390 kdump because of the following two issues:
>
> 1) copy_oldmem_page() is now used for copying to vmalloc memory
> 2) The mmap() implementation is not compatible with the current
> s390 crashkernel swap:
> See: http://marc.info/?l=kexec&m=136940802511603&w=2
>
> The "kdump: Introduce ELF header in new memory feature" patch series will
> fix both issues for s390.
>
> There is one small open discussion left:
>
> http://www.mail-archive.com/[email protected]/msg464856.html
>
> But once we have finished that, would it be possible to get the
> patches into 3.11?
How about taking the mmap() fault handler patches in 3.12? And in 3.11, deny
mmap() on s390, forcing makedumpfile to fall back on the read() interface. That
way there will be no regression, and the mmap() related speedup will show up
in the next release on s390.
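A minimal way to deny it (a sketch only; the exact mechanism is open) would be
to bail out of mmap_vmcore() early on s390, so that user space sees mmap() fail
and falls back to read():

static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
	/* mmap of /proc/vmcore not supported on s390 yet, use read() */
	if (IS_ENABLED(CONFIG_S390))
		return -ENODEV;
	...
}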
Thanks
Vivek
(2013/07/15 23:20), Vivek Goyal wrote:
> On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:
>
> [..]
>> How about
>>
>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>> {
>> ...
>> char *buf;
>> int rc;
>>
>> #ifndef CONFIG_S390
>> return VM_FAULT_SIGBUS;
>> #endif
>> page = find_or_create_page(mapping, index, GFP_KERNEL);
>>
>> Considering it again, I don't think WARN_ONCE() is good now. The fact that a fault
>> occurs on the mmap() region indicates that some kind of buggy situation has occurred
>> in the process. The process should be killed as soon as possible. If the user still
>> wants to get the crash dump, he should try again in another process.
>
> I don't understand that. A process should be killed only if there was no
> mapping created for the region the process is trying to access.
>
> If there is a mapping but we are trying to fault in the actual contents,
> then it is not a problem of the process. The process is accessing a region of
> memory which it is supposed to access.
>
> The potential problem here is that remap_pfn_range() did not map everything
> it was expected to, so we have to rely on the page fault handler to read
> that in. So it is more of a kernel issue and not a process issue, and for
> that WARN_ONCE() sounds better?
>
In the current design, there are no page faults on memory mapped by remap_pfn_range();
it maps the whole range up front. If page faults do occur there, the page table of the
process is broken in some of its entries. This indicates that the process's behaviour
is affected by some software or hardware bug. In theory, the process could then show
arbitrary behaviour; we cannot detect the reason and recover the original sane state.
The only thing we can do is kill the process and remove the possibility of the process
affecting other system components and of the system ending up in a worse situation.
--
Thanks.
HATAYAMA, Daisuke
(2013/07/15 18:21), Martin Schwidefsky wrote:
> On Sat, 13 Jul 2013 01:02:50 +0900
> HATAYAMA Daisuke <[email protected]> wrote:
>
>> (2013/07/10 20:00), Michael Holzheu wrote:
>>> On Wed, 10 Jul 2013 18:50:18 +0900
>>> HATAYAMA Daisuke <[email protected]> wrote:
>>>
>>> [snip]
>>>
>>>> (2013/07/10 17:42), Michael Holzheu wrote:
>>>>> My suggestion is to add the WARN_ONCE() under #ifndef CONFIG_S390. This has the same
>>>>> effect as your suggestion for all architectures besides s390. And for s390 we
>>>>> take the risk that a programming error would result in poor /proc/vmcore
>>>>> performance.
>>>>>
>>>>
>>>> If you want to avoid looking up vmcore_list, which takes linear time w.r.t. the number
>>>> of elements, you can still calculate the range of offsets in /proc/vmcore
>>>> corresponding to the HSA during /proc/vmcore initialization.
>>>>
>>>> Also, could you tell me how often and how much of the HSA region is read during crash
>>>> dumping? I guess the read to the HSA is done mainly during the early part of the crash
>>>> dumping process only. According to the code, it appears to be at most 64 MiB. So I feel
>>>> performance is not a big issue.
>>>
>>> Currently it is 32 MiB and normally it is read only once.
>>>
>>>>
>>>> Also, the cost of WARN_ONCE() is only one memory access in the 2nd and later calls.
>>>> I don't think it is too much overhead...
>>>
>>> I was more concerned about in_valid_fault_range(). But I was most concerned about the
>>> additional interface that introduces more complexity to the code, and that just to
>>> implement a sanity check that in our opinion we don't really need.
>>>
>>> And what makes it even worse:
>>>
>>
>> Your view that the sanity check is unnecessary is wrong. Your design assumes page faults
>> always happen on the HSA region. If a page fault happens on any other part, i.e. some
>> point of the mmap()ed region, it means that somehow the page table for that address has
>> not been created. This is a bug, possibly caused by mmap() itself, page table creation,
>> other components in the kernel, a bit-flip due to broken hardware, etc. Anyway, the
>> program cannot detect what kind of bug has occurred. There's no guarantee that the
>> program runs safely, and of course the same holds for page cache creation. We cannot and
>> must not expect such a buggy process to behave in an invalid state just as our design
>> intends. It results in undefined behaviour. The only thing we can do is kill the buggy
>> process as soon as possible.
>
> I don't quite get this point, please bear with me. If you compare the situation before and
> after the introduction of the fault handler, the possible error scenarios are almost
> identical:
> 1) If an access is made outside of the mapped memory region the first level fault handler
> (do_exception for s390, __do_page_fault for x86) won't find a vma and force a SIGSEGV
>     right away, independent of the existence of a hole and the vmcore fault handler.
> 2) If there is a hardware bug that corrupts a page table the behaviour depends on how the
> entries are corrupted. If the outcome is a valid pte an incorrect memory area will be
> accessed, the same with or without the vmcore fault handler. If the corrupted pte is
> an invalid pte it can come out as swap pte, file pte, or as empty pte. The behaviour
> does not change for swap and file ptes, you will get funny results in both cases.
> For empty ptes the new behaviour will call the vmcore fault handler for the address
> in question. If the read() function can satisfy the request we will get a page cache
> copy of the missing page, if the read function can not satisfy the request it returns
> an error which is translated to a SIGBUS.
> This new behaviour is IMHO better than the old one, it successfully recovers from at
> least one type of corruption. For x86 that would be the case if the page table is
> overwritten with zeroes, for s390 a specific bit pattern in the pte is required.
As you already noticed here, this scenario works well only if the corruption is limited to
the page table. The process could be corrupted in other ways, too.
> 3) In the case of a programming error in regard to remap_pfn_range the new behaviour will
> provide page cache copies and the dump will technically be correct. The performance
> might suffer a little bit as the CPU will have to create the page cache copies but
> compared to the I/O that is involved with writing a dump this is negligible, no?
>
> It seems to me that the warning message you want to see in the fault handler would be
> a debugging aid for the developer to see if the mmap() and the remap_pfn_range() calls
> match up. Something similar to a single VM_WARN_ON() message would be appropriate, no?
>
A warning message is meaningless because it doesn't stop the process. I no longer consider
debugging here; it's valid to kill the process as soon as possible. If there's a page fault
at an address we don't intend, the process is buggy, and we cannot guess how a buggy process
behaves. Please consider the case where the system goes into a more catastrophic situation
because we left the process running just to get warning messages for debugging purposes.
--
Thanks.
HATAYAMA, Daisuke
On Mon, 15 Jul 2013 10:27:08 -0400
Vivek Goyal <[email protected]> wrote:
> On Mon, Jul 15, 2013 at 03:44:51PM +0200, Michael Holzheu wrote:
> > On Tue, 2 Jul 2013 11:42:14 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> > > On Mon, Jul 01, 2013 at 09:32:37PM +0200, Michael Holzheu wrote:
> > > > For zfcpdump we can't map the HSA storage because it is only available
> > > > via a read interface. Therefore, for the new vmcore mmap feature we have to
> > > > introduce a new mechanism to create mappings on demand.
> > > >
> > > > This patch introduces a new architecture function remap_oldmem_pfn_range()
> > > > that should be used to create mappings with remap_pfn_range() for oldmem
> > > > areas that can be directly mapped. For zfcpdump this is everything except
> > > > the HSA memory. For the areas that are not mapped by remap_oldmem_pfn_range(),
> > > > a new generic vmcore fault handler, mmap_vmcore_fault(),
> > > > is called.
> > > >
> > > > This handler works as follows:
> > > >
> > > > * Get already available or new page from page cache (find_or_create_page)
> > > > * Check if /proc/vmcore page is filled with data (PageUptodate)
> > > > * If yes:
> > > > Return that page
> > > > * If no:
> > > > Fill page using __vmcore_read(), set PageUptodate, and return page
> > > >
> > > > Signed-off-by: Michael Holzheu <[email protected]>
> > >
> > > In general vmcore related changes look fine to me. I am not very familiar
> > > with the logic of finding pages in page cache and using page uptodate
> > > flag.
> > >
> > > Hatayama, can you please review it.
> > >
> > > Acked-by: Vivek Goyal <[email protected]>
> >
> > Hello Vivek and Andrew,
> >
> > We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently
> > breaks s390 kdump because of the following two issues:
> >
> > 1) The copy_oldmem_page() is now used for copying to vmalloc memory
> > 2) The mmap() implementation is not compatible with the current
> > s390 crashkernel swap:
> > See: http://marc.info/?l=kexec&m=136940802511603&w=2
> >
> > The "kdump: Introduce ELF header in new memory feature" patch series will
> > fix both issues for s390.
> >
> > There is one small open discussion left:
> >
> > http://www.mail-archive.com/[email protected]/msg464856.html
> >
> > But once we have finished that, would it be possible to get the
> > patches in 3.11?
>
> How about taking mmap() fault handler patches in 3.12. And in 3.11, deny
> mmap() on s390 forcing makedumpfile to fall back on read() interface. That
> way there will be no regression and mmap() related speedup will show up
> in next release on s390.
Hello Vivek and Hatayama,
But then we still would have to somehow fix the copy_oldmem_page() issue (1).
We would prefer to add the current patch series with "#ifndef CONFIG_S390" in
the fault handler.
@Vivek:
Since you are the kdump maintainer, could you tell us which of the two
variants you would like to have?
static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
...
#ifndef CONFIG_S390
return VM_FAULT_SIGBUS;
#endif
or
#ifndef CONFIG_S390
WARN_ONCE(1, "vmcore: Unexpected call of mmap_vmcore_fault()");
#endif
For all architectures other than s390 this would implement the requested behavior.
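For reference, a minimal sketch of how the complete handler could look with the first variant (illustration only: __vmcore_read() is the helper named in the patch description, and the error handling is condensed):

static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct address_space *mapping = vma->vm_file->f_mapping;
	pgoff_t index = vmf->pgoff;
	struct page *page;
	loff_t offset;
	char *buf;
	ssize_t rc;

#ifndef CONFIG_S390
	/* remap_oldmem_pfn_range() is expected to map everything up
	 * front, so an unexpected fault means the process can no
	 * longer be trusted: kill it.
	 */
	return VM_FAULT_SIGBUS;
#endif
	/* Get an already available or new page from the page cache. */
	page = find_or_create_page(mapping, index, GFP_KERNEL);
	if (!page)
		return VM_FAULT_OOM;
	if (!PageUptodate(page)) {
		/* Fill the page through the regular read path. */
		offset = (loff_t)index << PAGE_CACHE_SHIFT;
		buf = __va(page_to_pfn(page) << PAGE_SHIFT);
		rc = __vmcore_read(buf, PAGE_SIZE, &offset, 0);
		if (rc < 0) {
			unlock_page(page);
			page_cache_release(page);
			return VM_FAULT_SIGBUS;
		}
		SetPageUptodate(page);
	}
	unlock_page(page);
	vmf->page = page;
	return 0;
}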
Regards,
Michael
(2013/07/16 9:27), HATAYAMA Daisuke wrote:
> (2013/07/15 23:20), Vivek Goyal wrote:
>> On Fri, Jul 12, 2013 at 08:05:31PM +0900, HATAYAMA Daisuke wrote:
>>
>> [..]
>>> How about
>>>
>>> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
>>> {
>>> ...
>>> char *buf;
>>> int rc;
>>>
>>> #ifndef CONFIG_S390
>>> return VM_FAULT_SIGBUS;
>>> #endif
>>> page = find_or_create_page(mapping, index, GFP_KERNEL);
>>>
>>> Thinking about it again, I don't think WARN_ONCE() is good now. The fact that a fault occurs
>>> on the mmap()ed region indicates that some kind of buggy situation has occurred in the process.
>>> The process should be killed as soon as possible. If the user still wants to get a crash dump,
>>> he should try again in another process.
>>
>> I don't understand that. The process should be killed only if there was no
>> mapping created for the region the process is trying to access.
>>
>> If there is a mapping but we are trying to fault in the actual contents,
>> then it is not a problem of the process. The process is accessing a region of
>> memory which it is supposed to access.
>>
>> The potential problem here is that remap_pfn_range() did not map everything
>> it was expected to, so we have to resort to the page fault handler to read
>> that in. So it is more of a kernel issue and not a process issue, and for
>> that WARN_ONCE() sounds better?
>>
>
> In the current design, there are no page faults on memory mapped by remap_pfn_range():
> it maps the whole range up front. If a page fault happens anyway, some page table entries of
> the process are broken. This indicates that the process's behaviour is affected by a
> software or hardware bug. In theory, the process could then behave arbitrarily. We cannot
> detect the reason or recover the original sane state. The only thing we can do is kill
> the process, to remove the possibility of it affecting other system components and driving
> the system into a worse situation.
>
In summary, it seems that you two and I have different implementation
policies on how to deal with a process that is no longer in a healthy state.
Your idea is to try to continue the dump in the non-healthy state as much as possible,
as long as there is a possibility of continuing it, while my idea is to kill the process
promptly and retry the crash dump in a new process, since the process is no longer
healthy and could behave arbitrarily.
The logic in non-healthy states depends on implementation policy, since there
is no obviously correct logic. I guess this discussion will not end soon.
I believe the maintainer's idea should basically have priority
over others'. So I don't object anymore, though I don't think it best
at all.
--
Thanks.
HATAYAMA, Daisuke
On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:
[..]
> > > Hello Vivek and Andrew,
> > >
> > > We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently
> > > breaks s390 kdump because of the following two issues:
> > >
> > > 1) The copy_oldmem_page() is now used for copying to vmalloc memory
> > > 2) The mmap() implementation is not compatible with the current
> > > s390 crashkernel swap:
> > > See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > >
> > > The "kdump: Introduce ELF header in new memory feature" patch series will
> > > fix both issues for s390.
> > >
> > > There is one small open discussion left:
> > >
> > > http://www.mail-archive.com/[email protected]/msg464856.html
> > >
> > > But once we have finished that, would it be possible to get the
> > > patches in 3.11?
> >
> > How about taking mmap() fault handler patches in 3.12. And in 3.11, deny
> > mmap() on s390 forcing makedumpfile to fall back on read() interface. That
> > way there will be no regression and mmap() related speedup will show up
> > in next release on s390.
>
> Hello Vivek and Hatayama,
>
> But then we still would have to somehow fix the copy_oldmem_page() issue (1).
>
> We would prefer to add the current patch series with "#ifndef CONFIG_S390" in
> the fault handler.
>
> @Vivek:
>
> > Since you are the kdump maintainer, could you tell us which of the two
> variants you would like to have?
>
> static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> {
> ...
>
> #ifndef CONFIG_S390
> return VM_FAULT_SIGBUS;
> #endif
I think let us use this one: kill the process on non-s390 arches if it goes
into the mmap() fault handler.
copy_oldmem_page() using real mode on s390 is an issue for the vmalloc region.
I think you should post the series again and let us see if Andrew is comfortable
with it. I suspect it is late.
IIUC, we do the copying in real mode only to cope with the HSA region in the zfcpdump
case. Also, I think in the case of zfcpdump /proc/vmcore is not usable yet,
and your patches achieve that. So as a stop-gap measure, can we stop
dropping to real mode so that regular kdump on s390 works? (I am not sure
how you currently retrieve the vmcore in the zfcpdump case, as /dev/oldmem
is gone now.)
Thanks
Vivek
On Tue, 16 Jul 2013 10:04:18 -0400
Vivek Goyal <[email protected]> wrote:
> On Tue, Jul 16, 2013 at 11:25:27AM +0200, Michael Holzheu wrote:
>
> [..]
> > > > Hello Vivek and Andrew,
> > > >
> > > > We just realized that Hatayama's mmap patches went into v3.11-rc1. This currently
> > > > breaks s390 kdump because of the following two issues:
> > > >
> > > > 1) The copy_oldmem_page() is now used for copying to vmalloc memory
> > > > 2) The mmap() implementation is not compatible with the current
> > > > s390 crashkernel swap:
> > > > See: http://marc.info/?l=kexec&m=136940802511603&w=2
> > > >
> > > > The "kdump: Introduce ELF header in new memory feature" patch series will
> > > > fix both issues for s390.
> > > >
> > > > There is one small open discussion left:
> > > >
> > > > http://www.mail-archive.com/[email protected]/msg464856.html
> > > >
> > > > But once we have finished that, would it be possible to get the
> > > > patches in 3.11?
> > >
> > > How about taking mmap() fault handler patches in 3.12. And in 3.11, deny
> > > mmap() on s390 forcing makedumpfile to fall back on read() interface. That
> > > way there will be no regression and mmap() related speedup will show up
> > > in next release on s390.
> >
> > Hello Vivek and Hatayama,
> >
> > But then we still would have to somehow fix the copy_oldmem_page() issue (1).
> >
> > We would prefer to add the current patch series with "#ifndef CONFIG_S390" in
> > the fault handler.
> >
> > @Vivek:
> >
> > Since you are the kdump maintainer, could you tell us which of the two
> > variants you would like to have?
> >
> > static int mmap_vmcore_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
> > {
> > ...
> >
> > #ifndef CONFIG_S390
> > return VM_FAULT_SIGBUS;
> > #endif
>
> I think let us use this one: kill the process on non-s390 arches if it goes
> into the mmap() fault handler.
>
> copy_oldmem_page() using real mode on s390 is an issue for the vmalloc region.
> I think you should post the series again and let us see if Andrew is comfortable
> with it.
Ok great, I will send Andrew the current patch series.
Let's see what happens...
> I suspect it is late.
> IIUC, we do the copying in real mode only to cope with the HSA region in the zfcpdump
> case.
The problem is that with the mmap patches we now use copy_oldmem_page()
to copy the notes from oldmem into notes_buf, which has been allocated
with vmalloc. The s390 version of copy_oldmem_page() bypasses the page
tables and copies in real mode. And this does not work for vmalloc memory.
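To make the issue concrete, here is a sketch of a destination-aware copy (illustration only, not the actual series: copy_oldmem_page_vmalloc() is a hypothetical helper, while memcpy_real() and copy_to_user_real() are existing s390 primitives; the oldmem/crashkernel address swapping is omitted):

ssize_t copy_oldmem_page(unsigned long pfn, char *buf, size_t csize,
			 unsigned long offset, int userbuf)
{
	void *src = (void *)((pfn << PAGE_SHIFT) + offset);

	if (!csize)
		return 0;
	if (userbuf) {
		/* Copy to a user buffer with the s390 helper. */
		if (copy_to_user_real((void __user *)buf, src, csize))
			return -EFAULT;
	} else if (is_vmalloc_or_module_addr(buf)) {
		/* Hypothetical helper: copy through the kernel page
		 * tables, because a real-mode copy cannot resolve the
		 * vmalloc mapping of the destination (e.g. notes_buf).
		 */
		if (copy_oldmem_page_vmalloc(buf, src, csize))
			return -EFAULT;
	} else {
		/* Identity-mapped destination: a real-mode copy works. */
		if (memcpy_real(buf, src, csize))
			return -EFAULT;
	}
	return csize;
}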
Best Regards,
Michael
On Tue, Jul 16, 2013 at 05:37:09PM +0200, Michael Holzheu wrote:
[..]
> The problem is that with the mmap patches we now use copy_oldmem_page()
> to copy the notes from oldmem into notes_buf, which has been allocated
> with vmalloc. The s390 version of copy_oldmem_page() bypasses the page
> tables and copies in real mode. And this does not work for vmalloc memory.
I got that. I am trying to remember why, in 3.10, we are dropping
to real mode on s390 while copying pages.
I thought dropping to real mode was needed only to access the HSA region.
The HSA region comes into the picture only for zfcpdump, and /proc/vmcore support
for zfcpdump is not there yet.
IOW, what if we start copying with page tables enabled on s390? Will
that not work for the regular kdump case, with zfcpdump taken care of in 3.12?
Thanks
Vivek