2013-05-24 13:08:17

by Michael Holzheu

[permalink] [raw]
Subject: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Hello Vivek and Hatayama,

Currently the /proc/vmcore mmap patches are not working on s390. The
problem is that on s390 the kernel in not relocatable and therefore
always runs in the lower memory area. Therefore for kdump on s390 we
swap the lower memory area with the crashkernel area before starting
the kdump kernel:

[0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]

To fix /proc/vmcore mmap memory below OLDMEMSIZE needs to be mapped
with OLDMEM_BASE as offset. To achieve that, a new weak function
arch_oldmem_remap_pfn_range() is introduced.

If you agree with our approach, could you integrate the two patches
into the mmap patch series?

Best Regards,
Michael

---
Jan Willeke (2):
kdump/mmap: Introduce arch_oldmem_remap_pfn_range()
s390/kdump/mmap: Implement arch_oldmem_remap_pfn_range() for s390

arch/s390/kernel/crash_dump.c | 27 +++++++++++++++++++++++++++
fs/proc/vmcore.c | 15 ++++++++++++++-
include/linux/crash_dump.h | 5 +++++
3 files changed, 46 insertions(+), 1 deletion(-)

--
1.8.1.6


2013-05-24 13:08:19

by Michael Holzheu

[permalink] [raw]
Subject: [PATCH 1/2] kdump/mmap: Introduce arch_oldmem_remap_pfn_range()

From: Jan Willeke <[email protected]>

Currently the /proc/vmcore mmap patches are not working on s390. The
problem is that on s390 the kernel in not relocatable and therefore
always runs in the lower memory area. Therefore for kdump on s390 we
swap the lower memory area with the crashkernel area before starting
the kdump kernel:

[0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]

To fix /proc/vmcore mmap memory below OLDMEMSIZE needs to be mapped
with OLDMEM_BASE as offset. To achieve that, a new weak function
arch_oldmem_remap_pfn_range() is introduced.

Signed-off-by: Jan Willeke <[email protected]>
Signed-off-by: Michael Holzheu <[email protected]>
---
fs/proc/vmcore.c | 15 ++++++++++++++-
include/linux/crash_dump.h | 5 +++++
2 files changed, 19 insertions(+), 1 deletion(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 80221d7..3eda0ac 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -123,6 +123,19 @@ static ssize_t read_from_oldmem(char *buf, size_t count,
return read;
}

+/*
+ * Architetures may override this function to map oldmen
+ */
+int __weak arch_oldmem_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from,
+ unsigned long pfn,
+ unsigned long size,
+ pgprot_t prot)
+{
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+
/* Read from the ELF header and then the crash dump. On error, negative value is
* returned otherwise number of bytes read are returned.
*/
@@ -267,7 +280,7 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
if (size < tsz)
tsz = size;
paddr = m->paddr + start - m->offset;
- if (remap_pfn_range(vma, vma->vm_start + len,
+ if (arch_oldmem_remap_pfn_range(vma, vma->vm_start + len,
paddr >> PAGE_SHIFT, tsz,
vma->vm_page_prot)) {
do_munmap(vma->vm_mm, vma->vm_start, len);
diff --git a/include/linux/crash_dump.h b/include/linux/crash_dump.h
index 37e4f8d..da300a7 100644
--- a/include/linux/crash_dump.h
+++ b/include/linux/crash_dump.h
@@ -14,6 +14,11 @@ extern unsigned long long elfcorehdr_size;

extern ssize_t copy_oldmem_page(unsigned long, char *, size_t,
unsigned long, int);
+extern int __weak arch_oldmem_remap_pfn_range(struct vm_area_struct *vma,
+ unsigned long from,
+ unsigned long pfn,
+ unsigned long size,
+ pgprot_t prot);

/* Architecture code defines this if there are other possible ELF
* machine types, e.g. on bi-arch capable hardware. */
--
1.8.1.6

2013-05-24 13:08:16

by Michael Holzheu

[permalink] [raw]
Subject: [PATCH 2/2] s390/kdump/mmap: Implement arch_oldmem_remap_pfn_range() for s390

From: Jan Willeke <[email protected]>

This patch introduced the S390 specific way to map pages from oldmem.
The memory area below OLDMEM_SIZE is mapped with offset OLDMEM_BASE.
The other old memory is mapped directly.

Signed-off-by: Jan Willeke <[email protected]>
Signed-off-by: Michael Holzheu <[email protected]>
---
arch/s390/kernel/crash_dump.c | 27 +++++++++++++++++++++++++++
1 file changed, 27 insertions(+)

diff --git a/arch/s390/kernel/crash_dump.c b/arch/s390/kernel/crash_dump.c
index f703d91..f524eb2 100644
--- a/arch/s390/kernel/crash_dump.c
+++ b/arch/s390/kernel/crash_dump.c
@@ -51,6 +51,33 @@ ssize_t copy_oldmem_page(unsigned long pfn, char *buf,
}

/*
+ * Remap "oldmem"
+ *
+ * For the kdump reserved memory this functions performs a swap operation:
+ * [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
+ */
+int arch_oldmem_remap_pfn_range(struct vm_area_struct *vma, unsigned long from,
+ unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ unsigned long size_old;
+ int rc;
+
+ if (pfn < OLDMEM_SIZE >> PAGE_SHIFT) {
+ size_old = min(size, OLDMEM_SIZE - (pfn << PAGE_SHIFT));
+ rc = remap_pfn_range(vma, from,
+ pfn + (OLDMEM_BASE >> PAGE_SHIFT),
+ size_old, prot);
+ if (rc || size == size_old)
+ return rc;
+ size -= size_old;
+ from += size_old;
+ pfn += size_old >> PAGE_SHIFT;
+ }
+ return remap_pfn_range(vma, from, pfn, size, prot);
+}
+
+/*
* Copy memory from old kernel
*/
int copy_from_oldmem(void *dest, void *src, size_t count)
--
1.8.1.6

2013-05-24 14:38:08

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, May 24, 2013 at 03:08:07PM +0200, Michael Holzheu wrote:
> Hello Vivek and Hatayama,
>
> Currently the /proc/vmcore mmap patches are not working on s390. The
> problem is that on s390 the kernel in not relocatable and therefore
> always runs in the lower memory area. Therefore for kdump on s390 we
> swap the lower memory area with the crashkernel area before starting
> the kdump kernel:
>
> [0 - OLDMEM_SIZE] is mapped to [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]
>
> To fix /proc/vmcore mmap memory below OLDMEMSIZE needs to be mapped
> with OLDMEM_BASE as offset. To achieve that, a new weak function
> arch_oldmem_remap_pfn_range() is introduced.
>
> If you agree with our approach, could you integrate the two patches
> into the mmap patch series?

Hi Michael,

Sorry, I don't understand the problem. If we swapped low memory and crash
reserved memory, that should have been taken care by prepared ELF headers
so that we map the right pfns. In x86 we swap 640K of low memory with 640K
of memory in reserved and we take care of this by preparing elf headers
accordingly.

So why s390 can't do the same thing?

Thanks
Vivek

>
> Best Regards,
> Michael
>
> ---
> Jan Willeke (2):
> kdump/mmap: Introduce arch_oldmem_remap_pfn_range()
> s390/kdump/mmap: Implement arch_oldmem_remap_pfn_range() for s390
>
> arch/s390/kernel/crash_dump.c | 27 +++++++++++++++++++++++++++
> fs/proc/vmcore.c | 15 ++++++++++++++-
> include/linux/crash_dump.h | 5 +++++
> 3 files changed, 46 insertions(+), 1 deletion(-)
>
> --
> 1.8.1.6

2013-05-24 15:06:36

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Hello Vivek,

On Fri, 24 May 2013 10:36:44 -0400
Vivek Goyal <[email protected]> wrote:

[snip]

> Sorry, I don't understand the problem. If we swapped low memory and
> crash reserved memory, that should have been taken care by prepared
> ELF headers so that we map the right pfns. In x86 we swap 640K of low
> memory with 640K of memory in reserved and we take care of this by
> preparing elf headers accordingly.
>
> So why s390 can't do the same thing?

I am not sure if I understand this. Currently we create the ELF
header in a way that we have virtual=real. In the copy_oldmem_page() we
do the swap so that for the /proc/vmcore code it looks like contiguous
non-swapped memory.

One reason why I thought this was necessary was that /dev/oldmem
also uses the function and it should provide linear memory access like
it is on the live system with /dev/mem.

Is that implementation incorrect?

Best Regards,
Michael

2013-05-24 15:29:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:
> Hello Vivek,
>
> On Fri, 24 May 2013 10:36:44 -0400
> Vivek Goyal <[email protected]> wrote:
>
> [snip]
>
> > Sorry, I don't understand the problem. If we swapped low memory and
> > crash reserved memory, that should have been taken care by prepared
> > ELF headers so that we map the right pfns. In x86 we swap 640K of low
> > memory with 640K of memory in reserved and we take care of this by
> > preparing elf headers accordingly.
> >
> > So why s390 can't do the same thing?
>
> I am not sure if I understand this. Currently we create the ELF
> header in a way that we have virtual=real. In the copy_oldmem_page() we
> do the swap so that for the /proc/vmcore code it looks like contiguous
> non-swapped memory.
>
> One reason why I thought this was necessary was that /dev/oldmem
> also uses the function and it should provide linear memory access like
> it is on the live system with /dev/mem.
>
> Is that implementation incorrect?

[ CC Andrew. Keep him in loop for all kernel kdump patches as all kdump
patches are routed through him ].

[ CC Eric Biederman ]

Looking at the code, looks like /dev/oldmem is broken. It does not know
anything about swap of any of the memory areas and it will simply
return the contents of page frame asked. And this has been like this
since the beginning.

I have always questioned the utility of /dev/oldmem. Atleast I am not
aware of any tool making use of it.

If we want to fix it, then somebow all the swapped memory region info
needs to be communicated to second kernel so that read_oldmem() can
do the mapping correctly and we really don't have any mechanism for
that. (I am assuming that in s390 you must have hardcoded the regions
of memory which are always swapped).

As /proc/vmcore is the most used and useful interface, I prefer that
we swap memory and put that info in elf headers. For /dev/oldme, I
don't mind if we leave it as it is. If somebody really cares, then
I guess we need to write a new command line option which /dev/mem
can parse and which tells it about swaps so that /dev/oldmem can
map things correctly. (This is better than hardcoding things).

Eric, do you have any thoughts on this.

Thanks
Vivek

2013-05-24 16:47:00

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, 24 May 2013 11:28:49 -0400
Vivek Goyal <[email protected]> wrote:

> On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:

[snip]

> As /proc/vmcore is the most used and useful interface, I prefer that
> we swap memory and put that info in elf headers. For /dev/oldme, I
> don't mind if we leave it as it is. If somebody really cares, then
> I guess we need to write a new command line option which /dev/mem
> can parse and which tells it about swaps so that /dev/oldmem can
> map things correctly. (This is better than hardcoding things).

Besides of the potential /dev/oldmem issue, I still do not understand
the option of doing the swap in the elf header. Looks like I missed
here a fundamental design point of kdump :(

Is that done by specifying different virtual and physical addresses in
the ELF header?

Michael

2013-05-24 17:06:02

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, May 24, 2013 at 06:46:53PM +0200, Michael Holzheu wrote:
> On Fri, 24 May 2013 11:28:49 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:
>
> [snip]
>
> > As /proc/vmcore is the most used and useful interface, I prefer that
> > we swap memory and put that info in elf headers. For /dev/oldme, I
> > don't mind if we leave it as it is. If somebody really cares, then
> > I guess we need to write a new command line option which /dev/mem
> > can parse and which tells it about swaps so that /dev/oldmem can
> > map things correctly. (This is better than hardcoding things).
>
> Besides of the potential /dev/oldmem issue, I still do not understand
> the option of doing the swap in the elf header. Looks like I missed
> here a fundamental design point of kdump :(
>
> Is that done by specifying different virtual and physical addresses in
> the ELF header?

Nope. We keep the virtual to physical address mapping same. We just modify
the p_offset in PT_LOAD elf header to represent where actually the
memory is present physically. And when /proc/vmcore reads the data, it
reads it from p_offset.

IOW, p_offset and p_paddr will be different for swapped memory but
should be same for memory which has not been swapped.

Thanks
Vivek

2013-05-24 22:45:09

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Vivek Goyal <[email protected]> writes:

> On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:
>> Hello Vivek,
>>
>> On Fri, 24 May 2013 10:36:44 -0400
>> Vivek Goyal <[email protected]> wrote:
>>
>> [snip]
>>
>> > Sorry, I don't understand the problem. If we swapped low memory and
>> > crash reserved memory, that should have been taken care by prepared
>> > ELF headers so that we map the right pfns. In x86 we swap 640K of low
>> > memory with 640K of memory in reserved and we take care of this by
>> > preparing elf headers accordingly.
>> >
>> > So why s390 can't do the same thing?
>>
>> I am not sure if I understand this. Currently we create the ELF
>> header in a way that we have virtual=real. In the copy_oldmem_page() we
>> do the swap so that for the /proc/vmcore code it looks like contiguous
>> non-swapped memory.
>>
>> One reason why I thought this was necessary was that /dev/oldmem
>> also uses the function and it should provide linear memory access like
>> it is on the live system with /dev/mem.
>>
>> Is that implementation incorrect?
>
> [ CC Andrew. Keep him in loop for all kernel kdump patches as all kdump
> patches are routed through him ].
>
> [ CC Eric Biederman ]
>
> Looking at the code, looks like /dev/oldmem is broken. It does not know
> anything about swap of any of the memory areas and it will simply
> return the contents of page frame asked. And this has been like this
> since the beginning.
>
> I have always questioned the utility of /dev/oldmem. Atleast I am not
> aware of any tool making use of it.
>
> If we want to fix it, then somebow all the swapped memory region info
> needs to be communicated to second kernel so that read_oldmem() can
> do the mapping correctly and we really don't have any mechanism for
> that. (I am assuming that in s390 you must have hardcoded the regions
> of memory which are always swapped).
>
> As /proc/vmcore is the most used and useful interface, I prefer that
> we swap memory and put that info in elf headers. For /dev/oldme, I
> don't mind if we leave it as it is. If somebody really cares, then
> I guess we need to write a new command line option which /dev/mem
> can parse and which tells it about swaps so that /dev/oldmem can
> map things correctly. (This is better than hardcoding things).
>
> Eric, do you have any thoughts on this.

I don't think anyone actually uses /dev/oldmem. I would like to cite
the s390 confusion as proof but I don't think that quite works.

I think the solution is for someone to send a patch removing /dev/oldmem
as an unused piece of code. That will also move us in the direction of
resolving HPAs concerns.

The function copy_oldmem_page also concerns me. I don't have a clue why
we duplicate that function on every architecutre in a slightly different
form. There should be enough abstractions in the kernel to make that
unnecessary. I would be glad to see that function go, and remove the
possibility of confusion that happened on s390.

Eric

2013-05-25 00:33:30

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Hello Eric,

于 2013年05月25日 06:44, Eric W. Biederman 写道:
> Vivek Goyal <[email protected]> writes:
>
>> On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:
>>> Hello Vivek,
>>>
>>> On Fri, 24 May 2013 10:36:44 -0400
>>> Vivek Goyal <[email protected]> wrote:
>>>
>>> [snip]
>>>
>>>> Sorry, I don't understand the problem. If we swapped low memory and
>>>> crash reserved memory, that should have been taken care by prepared
>>>> ELF headers so that we map the right pfns. In x86 we swap 640K of low
>>>> memory with 640K of memory in reserved and we take care of this by
>>>> preparing elf headers accordingly.
>>>>
>>>> So why s390 can't do the same thing?
>>>
>>> I am not sure if I understand this. Currently we create the ELF
>>> header in a way that we have virtual=real. In the copy_oldmem_page() we
>>> do the swap so that for the /proc/vmcore code it looks like contiguous
>>> non-swapped memory.
>>>
>>> One reason why I thought this was necessary was that /dev/oldmem
>>> also uses the function and it should provide linear memory access like
>>> it is on the live system with /dev/mem.
>>>
>>> Is that implementation incorrect?
>>
>> [ CC Andrew. Keep him in loop for all kernel kdump patches as all kdump
>> patches are routed through him ].
>>
>> [ CC Eric Biederman ]
>>
>> Looking at the code, looks like /dev/oldmem is broken. It does not know
>> anything about swap of any of the memory areas and it will simply
>> return the contents of page frame asked. And this has been like this
>> since the beginning.
>>
>> I have always questioned the utility of /dev/oldmem. Atleast I am not
>> aware of any tool making use of it.
>>
>> If we want to fix it, then somebow all the swapped memory region info
>> needs to be communicated to second kernel so that read_oldmem() can
>> do the mapping correctly and we really don't have any mechanism for
>> that. (I am assuming that in s390 you must have hardcoded the regions
>> of memory which are always swapped).
>>
>> As /proc/vmcore is the most used and useful interface, I prefer that
>> we swap memory and put that info in elf headers. For /dev/oldme, I
>> don't mind if we leave it as it is. If somebody really cares, then
>> I guess we need to write a new command line option which /dev/mem
>> can parse and which tells it about swaps so that /dev/oldmem can
>> map things correctly. (This is better than hardcoding things).
>>
>> Eric, do you have any thoughts on this.
>
> I don't think anyone actually uses /dev/oldmem. I would like to cite
> the s390 confusion as proof but I don't think that quite works.
>
> I think the solution is for someone to send a patch removing /dev/oldmem
> as an unused piece of code. That will also move us in the direction of
> resolving HPAs concerns.

I think I can try this.

>
> The function copy_oldmem_page also concerns me. I don't have a clue why
> we duplicate that function on every architecutre in a slightly different
> form. There should be enough abstractions in the kernel to make that
> unnecessary. I would be glad to see that function go, and remove the
> possibility of confusion that happened on s390.

You mean we should have a common copy_oldmem_page for all architectures? And
just like vivek said above, for s390, we should put the swap info in the elf
headers instead of doing that in copy_oldmem_page.

Zhang

2013-05-25 03:02:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Zhang Yanfei <[email protected]> writes:

> Hello Eric,
>
>> The function copy_oldmem_page also concerns me. I don't have a clue why
>> we duplicate that function on every architecutre in a slightly different
>> form. There should be enough abstractions in the kernel to make that
>> unnecessary. I would be glad to see that function go, and remove the
>> possibility of confusion that happened on s390.
>
> You mean we should have a common copy_oldmem_page for all architectures? And
> just like vivek said above, for s390, we should put the swap info in the elf
> headers instead of doing that in copy_oldmem_page.

Exactly.

The user space change in /sbin/kexec should even be backwards compatible
for s390. So fixing /sbin/kexec should probably come first.

Eric

2013-05-25 08:32:10

by Zhang Yanfei

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

于 2013年05月25日 11:01, Eric W. Biederman 写道:
> Zhang Yanfei <[email protected]> writes:
>
>> Hello Eric,
>>
>>> The function copy_oldmem_page also concerns me. I don't have a clue why
>>> we duplicate that function on every architecutre in a slightly different
>>> form. There should be enough abstractions in the kernel to make that
>>> unnecessary. I would be glad to see that function go, and remove the
>>> possibility of confusion that happened on s390.
>>
>> You mean we should have a common copy_oldmem_page for all architectures? And
>> just like vivek said above, for s390, we should put the swap info in the elf
>> headers instead of doing that in copy_oldmem_page.
>
> Exactly.
>
> The user space change in /sbin/kexec should even be backwards compatible
> for s390. So fixing /sbin/kexec should probably come first.
>

I am kind of not sure about the "backwards compatible for s390" you meant.

For s390, if we put swap info into the elf header, This will change /sbin/kexec.
But at this point, copy_oldmem_page is still doing the swap when we try to read
the pages among [0 - OLDMEM_SIZE] and [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE].
So removing the swap in copy_oldmem_page should be done at the same time.
New kexec with old kernels would fail and old kexec with new kernels would fail too.

So could you please explain more about the ""backwards compatible". And please
correct me if I am wrong.

Thanks
Zhang

2013-05-25 12:52:27

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Sat, 25 May 2013 16:31:58 +0800
Zhang Yanfei <[email protected]> wrote:

[snip]

> For s390, if we put swap info into the elf header, This will
> change /sbin/kexec. But at this point, copy_oldmem_page is still
> doing the swap when we try to read the pages among [0 - OLDMEM_SIZE]
> and [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]. So removing the swap
> in copy_oldmem_page should be done at the same time. New kexec with
> old kernels would fail and old kexec with new kernels would fail too.
>
> So could you please explain more about the ""backwards compatible".
> And please correct me if I am wrong.

Hello all,

I think Zhang is right and in theory we are not backwards compatible
because with the ELF header fix old kexec tools would no longer work
with new kernels.

But:

For s390 we normally do *not* create the ELF header in the old
kernel with kexec. Instead the new kernel does all the memory and CPU
detection and creates the ELF header in the new memory.

See also our current discussion with Vivek:
https://lkml.org/lkml/2013/5/24/164

The main reason why we did this for s390 was that we can have many
CPU hotplug events because of a s390 specific daemon called
"cpuplugd" (s390-tools). We wanted to avoid the kdump reloading with
kexec triggered by CPU and memory hotplug events. For s390 distributions
the kexec udev rules are disabled.

Besides of the newmem mechanism, for completeness, we also implemented
the oldmem ELF header mechansim in kexec. But this is disabled
by default.

See: "#ifdef WITH_ELF_HEADER" in kexec/arch/s390/crashdump-s390.c

Currently no distribution uses the oldmem mechanism.

Therefore, if necessary, IMHO we can switch to the ELF header memory
swap mechanism for s390 in the kernel. Of course we would then also
have to adjust the (disabled) kexec code.

Best Regards,
Michael

2013-05-25 13:13:36

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, 24 May 2013 13:05:07 -0400
Vivek Goyal <[email protected]> wrote:

> On Fri, May 24, 2013 at 06:46:53PM +0200, Michael Holzheu wrote:
> > On Fri, 24 May 2013 11:28:49 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> > > On Fri, May 24, 2013 at 05:06:26PM +0200, Michael Holzheu wrote:
> >
> > [snip]
> >
> > > As /proc/vmcore is the most used and useful interface, I prefer
> > > that we swap memory and put that info in elf headers.
> > > For /dev/oldme, I don't mind if we leave it as it is. If somebody
> > > really cares, then I guess we need to write a new command line
> > > option which /dev/mem can parse and which tells it about swaps so
> > > that /dev/oldmem can map things correctly. (This is better than
> > > hardcoding things).
> >
> > Besides of the potential /dev/oldmem issue, I still do not
> > understand the option of doing the swap in the elf header. Looks
> > like I missed here a fundamental design point of kdump :(
> >
> > Is that done by specifying different virtual and physical addresses
> > in the ELF header?
>
> Nope. We keep the virtual to physical address mapping same. We just
> modify the p_offset in PT_LOAD elf header to represent where actually
> the memory is present physically. And when /proc/vmcore reads the
> data, it reads it from p_offset.
>
> IOW, p_offset and p_paddr will be different for swapped memory but
> should be same for memory which has not been swapped.

Hello Vivek,

Ok, now I got it :)

It worked for me by specifying a PT_LOAD with:

phdr->p_offset = OLDMEM_BASE;
phdr->p_vaddr = phdr->p_paddr = 0;
phdr->p_filesz = phdr->p_memsz = OLDMEM_SIZE;

Best Regards,
Michael

2013-05-25 20:36:33

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

Zhang Yanfei <[email protected]> writes:

> 于 2013年05月25日 11:01, Eric W. Biederman 写道:
>> Zhang Yanfei <[email protected]> writes:
>>
>>> Hello Eric,
>>>
>>>> The function copy_oldmem_page also concerns me. I don't have a clue why
>>>> we duplicate that function on every architecutre in a slightly different
>>>> form. There should be enough abstractions in the kernel to make that
>>>> unnecessary. I would be glad to see that function go, and remove the
>>>> possibility of confusion that happened on s390.
>>>
>>> You mean we should have a common copy_oldmem_page for all architectures? And
>>> just like vivek said above, for s390, we should put the swap info in the elf
>>> headers instead of doing that in copy_oldmem_page.
>>
>> Exactly.
>>
>> The user space change in /sbin/kexec should even be backwards compatible
>> for s390. So fixing /sbin/kexec should probably come first.
>>
>
> I am kind of not sure about the "backwards compatible for s390" you meant.
>
> For s390, if we put swap info into the elf header, This will change /sbin/kexec.
> But at this point, copy_oldmem_page is still doing the swap when we try to read
> the pages among [0 - OLDMEM_SIZE] and [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE].
> So removing the swap in copy_oldmem_page should be done at the same time.
> New kexec with old kernels would fail and old kexec with new kernels would fail too.
>
> So could you please explain more about the ""backwards compatible". And please
> correct me if I am wrong.

It looks like my misreading of things. I was not expecting the existing
copy_oldmem_page to do a complete swap of addresses. I was expecting
something like the result obtained by doing a header swap with the ELF
headers where part of the address was translated and the rest was simply
not mentioned.

But from other replies it appears there isn't a problem.

Eric

2013-05-28 13:55:41

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Sat, May 25, 2013 at 02:52:17PM +0200, Michael Holzheu wrote:
> On Sat, 25 May 2013 16:31:58 +0800
> Zhang Yanfei <[email protected]> wrote:
>
> [snip]
>
> > For s390, if we put swap info into the elf header, This will
> > change /sbin/kexec. But at this point, copy_oldmem_page is still
> > doing the swap when we try to read the pages among [0 - OLDMEM_SIZE]
> > and [OLDMEM_BASE - OLDMEM_BASE + OLDMEM_SIZE]. So removing the swap
> > in copy_oldmem_page should be done at the same time. New kexec with
> > old kernels would fail and old kexec with new kernels would fail too.
> >
> > So could you please explain more about the ""backwards compatible".
> > And please correct me if I am wrong.
>
> Hello all,
>
> I think Zhang is right and in theory we are not backwards compatible
> because with the ELF header fix old kexec tools would no longer work
> with new kernels.
>
> But:
>
> For s390 we normally do *not* create the ELF header in the old
> kernel with kexec. Instead the new kernel does all the memory and CPU
> detection and creates the ELF header in the new memory.
>
> See also our current discussion with Vivek:
> https://lkml.org/lkml/2013/5/24/164
>
> The main reason why we did this for s390 was that we can have many
> CPU hotplug events because of a s390 specific daemon called
> "cpuplugd" (s390-tools). We wanted to avoid the kdump reloading with
> kexec triggered by CPU and memory hotplug events. For s390 distributions
> the kexec udev rules are disabled.
>
> Besides of the newmem mechanism, for completeness, we also implemented
> the oldmem ELF header mechansim in kexec. But this is disabled
> by default.
>
> See: "#ifdef WITH_ELF_HEADER" in kexec/arch/s390/crashdump-s390.c
>
> Currently no distribution uses the oldmem mechanism.

Hi Michael,

Mechanism to read from newmem is not there yet (your patches are still
being reviewed) and you mentioned that no distribution is building
kexec-tools with WITH_ELF_HEADER on s390. So how things are currently
working for kdump on s390?

>
> Therefore, if necessary, IMHO we can switch to the ELF header memory
> swap mechanism for s390 in the kernel. Of course we would then also
> have to adjust the (disabled) kexec code.

So are you saying that s390 is ready to switch to mechanism of creating
ELF headers in first kernel by kexec-tools and new kernel does not
have to preare ELF headers?

Thanks
Vivek

2013-05-28 14:45:22

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Sat, May 25, 2013 at 02:52:17PM +0200, Michael Holzheu wrote:

[..]
> Therefore, if necessary, IMHO we can switch to the ELF header memory
> swap mechanism for s390 in the kernel. Of course we would then also
> have to adjust the (disabled) kexec code.

I think it is a good idea to fix it in s390 kernel so that
copy_oldmem_page() does not do any swapping and fix the ELF header
generation logic and any swapping is done in ELF headers.

Agreed that we need to fixed s390 kexec-tools too. I guess our best
bet would be to parse the kernel version and fix headers only for
newer kernel versions. This assumes that first kernel and second kernel
are same but that's the case for majority of the people anyway. So for
majority of people, change will be backward compatible.

Thanks
Vivek

2013-05-29 11:51:53

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Tue, 28 May 2013 09:55:01 -0400
Vivek Goyal <[email protected]> wrote:

> On Sat, May 25, 2013 at 02:52:17PM +0200, Michael Holzheu wrote:

[snip]

> > Besides of the newmem mechanism, for completeness, we also
> > implemented the oldmem ELF header mechansim in kexec. But this is
> > disabled by default.
> >
> > See: "#ifdef WITH_ELF_HEADER" in kexec/arch/s390/crashdump-s390.c
> >
> > Currently no distribution uses the oldmem mechanism.
>
> Hi Michael,
>
> Mechanism to read from newmem is not there yet (your patches are still
> being reviewed) and you mentioned that no distribution is building
> kexec-tools with WITH_ELF_HEADER on s390. So how things are currently
> working for kdump on s390?

Hello Vivek,

On s390, if we do not get the ELF header from the 1st kernel with
"elfcorehdr=", we build the ELF header in the 2nd kernel. This is
currently the default because WITH_ELF_HEADER is not defined for
kexec tools.

>>> START QUOTE

[PATCH v3 1/3] kdump: Introduce ELF header in new memory feature

Currently for s390 we create the ELF core header in the 2nd kernel with
a small trick. We relocate the addresses in the ELF header in a way
that for the /proc/vmcore code it seems to be in the 1st kernel (old)
memory and the read_from_oldmem() returns the correct data. This allows
the /proc/vmcore code to use the ELF header in the 2nd kernel.

>>> END QUOTE

For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
vmcore for zfcpdump") we could no longer use this trick. Therefore we
sent you the patches to get a clean interface for ELF header creation
in the 2nd kernel.

> >
> > Therefore, if necessary, IMHO we can switch to the ELF header memory
> > swap mechanism for s390 in the kernel. Of course we would then also
> > have to adjust the (disabled) kexec code.
>
> So are you saying that s390 is ready to switch to mechanism of
> creating ELF headers in first kernel by kexec-tools and new kernel
> does not have to preare ELF headers?

No, I meant that currently nobody is using the kexec tools ELF header
creation in the 1st kernel on s390. We create the ELF header in the
2nd kernel (mainly because of our cpuplugd issue).

Therefore, I think, we can safely change the ELF header creation in 2nd
kernel to use your p_offset swap trick *and* we remove the swap code in
the copy_oldmem_page() implementation (same kernel).

Best Regards,
Michael

2013-05-29 16:24:48

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:
> On Tue, 28 May 2013 09:55:01 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Sat, May 25, 2013 at 02:52:17PM +0200, Michael Holzheu wrote:
>
> [snip]
>
> > > Besides of the newmem mechanism, for completeness, we also
> > > implemented the oldmem ELF header mechansim in kexec. But this is
> > > disabled by default.
> > >
> > > See: "#ifdef WITH_ELF_HEADER" in kexec/arch/s390/crashdump-s390.c
> > >
> > > Currently no distribution uses the oldmem mechanism.
> >
> > Hi Michael,
> >
> > Mechanism to read from newmem is not there yet (your patches are still
> > being reviewed) and you mentioned that no distribution is building
> > kexec-tools with WITH_ELF_HEADER on s390. So how things are currently
> > working for kdump on s390?
>
> Hello Vivek,
>
> On s390, if we do not get the ELF header from the 1st kernel with
> "elfcorehdr=", we build the ELF header in the 2nd kernel. This is
> currently the default because WITH_ELF_HEADER is not defined for
> kexec tools.
>
> >>> START QUOTE
>
> [PATCH v3 1/3] kdump: Introduce ELF header in new memory feature
>
> Currently for s390 we create the ELF core header in the 2nd kernel with
> a small trick. We relocate the addresses in the ELF header in a way
> that for the /proc/vmcore code it seems to be in the 1st kernel (old)
> memory and the read_from_oldmem() returns the correct data. This allows
> the /proc/vmcore code to use the ELF header in the 2nd kernel.
>
> >>> END QUOTE
>
> For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
> vmcore for zfcpdump") we could no longer use this trick. Therefore we
> sent you the patches to get a clean interface for ELF header creation
> in the 2nd kernel.
>
> > >
> > > Therefore, if necessary, IMHO we can switch to the ELF header memory
> > > swap mechanism for s390 in the kernel. Of course we would then also
> > > have to adjust the (disabled) kexec code.
> >
> > So are you saying that s390 is ready to switch to mechanism of
> > creating ELF headers in first kernel by kexec-tools and new kernel
> > does not have to preare ELF headers?
>
> No, I meant that currently nobody is using the kexec tools ELF header
> creation in the 1st kernel on s390. We create the ELF header in the
> 2nd kernel (mainly because of our cpuplugd issue).
>
> Therefore, I think, we can safely change the ELF header creation in 2nd
> kernel to use your p_offset swap trick *and* we remove the swap code in
> the copy_oldmem_page() implementation (same kernel).

Ok. Got it. So s390 can fix it in kernel without creating any backward
compatibility issues (given the fact that nobody sees to be using
kexec-tools to build headers).

So please go ahead and fix it and that should solve your mmap() issue
too. Also please fix kexec-tools and that change will not be backward
compatible.

Thanks
Vivek

2013-05-29 17:13:09

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Wed, 29 May 2013 12:23:26 -0400
Vivek Goyal <[email protected]> wrote:

> On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:
> > On Tue, 28 May 2013 09:55:01 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> > > On Sat, May 25, 2013 at 02:52:17PM +0200, Michael Holzheu wrote:
> >
> > [snip]
> >
> > > > Besides of the newmem mechanism, for completeness, we also
> > > > implemented the oldmem ELF header mechansim in kexec. But this
> > > > is disabled by default.
> > > >
> > > > See: "#ifdef WITH_ELF_HEADER" in
> > > > kexec/arch/s390/crashdump-s390.c
> > > >
> > > > Currently no distribution uses the oldmem mechanism.
> > >
> > > Hi Michael,
> > >
> > > Mechanism to read from newmem is not there yet (your patches are
> > > still being reviewed) and you mentioned that no distribution is
> > > building kexec-tools with WITH_ELF_HEADER on s390. So how things
> > > are currently working for kdump on s390?
> >
> > Hello Vivek,
> >
> > On s390, if we do not get the ELF header from the 1st kernel with
> > "elfcorehdr=", we build the ELF header in the 2nd kernel. This is
> > currently the default because WITH_ELF_HEADER is not defined for
> > kexec tools.
> >
> > >>> START QUOTE
> >
> > [PATCH v3 1/3] kdump: Introduce ELF header in new memory feature
> >
> > Currently for s390 we create the ELF core header in the 2nd kernel
> > with a small trick. We relocate the addresses in the ELF header in
> > a way that for the /proc/vmcore code it seems to be in the 1st
> > kernel (old) memory and the read_from_oldmem() returns the correct
> > data. This allows the /proc/vmcore code to use the ELF header in
> > the 2nd kernel.
> >
> > >>> END QUOTE
> >
> > For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
> > vmcore for zfcpdump") we could no longer use this trick. Therefore
> > we sent you the patches to get a clean interface for ELF header
> > creation in the 2nd kernel.
> >
> > > >
> > > > Therefore, if necessary, IMHO we can switch to the ELF header
> > > > memory swap mechanism for s390 in the kernel. Of course we
> > > > would then also have to adjust the (disabled) kexec code.
> > >
> > > So are you saying that s390 is ready to switch to mechanism of
> > > creating ELF headers in first kernel by kexec-tools and new kernel
> > > does not have to preare ELF headers?
> >
> > No, I meant that currently nobody is using the kexec tools ELF
> > header creation in the 1st kernel on s390. We create the ELF header
> > in the 2nd kernel (mainly because of our cpuplugd issue).
> >
> > Therefore, I think, we can safely change the ELF header creation in
> > 2nd kernel to use your p_offset swap trick *and* we remove the swap
> > code in the copy_oldmem_page() implementation (same kernel).
>
> Ok. Got it. So s390 can fix it in kernel without creating any backward
> compatibility issues (given the fact that nobody sees to be using
> kexec-tools to build headers).
>
> So please go ahead and fix it and that should solve your mmap() issue
> too. Also please fix kexec-tools and that change will not be backward
> compatible.

Ok, I will do this.

I think we should add this "swap in ELF header" patch to the "kdump:
Allow ELF header creation in new kernel" patch series (on top of the
mmap patch series). Because when I remove the swap code from
copy_oldmem_page(), the old trick to access the ELF header in the first
kernel memory will no longer work.

Is that ok for you?

Michael

2013-05-30 15:25:04

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Wed, May 29, 2013 at 07:12:49PM +0200, Michael Holzheu wrote:

[..]
> > > > So are you saying that s390 is ready to switch to mechanism of
> > > > creating ELF headers in first kernel by kexec-tools and new kernel
> > > > does not have to preare ELF headers?
> > >
> > > No, I meant that currently nobody is using the kexec tools ELF
> > > header creation in the 1st kernel on s390. We create the ELF header
> > > in the 2nd kernel (mainly because of our cpuplugd issue).
> > >
> > > Therefore, I think, we can safely change the ELF header creation in
> > > 2nd kernel to use your p_offset swap trick *and* we remove the swap
> > > code in the copy_oldmem_page() implementation (same kernel).
> >
> > Ok. Got it. So s390 can fix it in kernel without creating any backward
> > compatibility issues (given the fact that nobody sees to be using
> > kexec-tools to build headers).
> >
> > So please go ahead and fix it and that should solve your mmap() issue
> > too. Also please fix kexec-tools and that change will not be backward
> > compatible.
>
> Ok, I will do this.
>
> I think we should add this "swap in ELF header" patch to the "kdump:
> Allow ELF header creation in new kernel" patch series (on top of the
> mmap patch series). Because when I remove the swap code from
> copy_oldmem_page(), the old trick to access the ELF header in the first
> kernel memory will no longer work.
>
> Is that ok for you?

I am fine with both the patches in same series.

Thanks
Vivek

2013-05-30 20:39:29

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:

[..]
> >>> START QUOTE
>
> [PATCH v3 1/3] kdump: Introduce ELF header in new memory feature
>
> Currently for s390 we create the ELF core header in the 2nd kernel with
> a small trick. We relocate the addresses in the ELF header in a way
> that for the /proc/vmcore code it seems to be in the 1st kernel (old)
> memory and the read_from_oldmem() returns the correct data. This allows
> the /proc/vmcore code to use the ELF header in the 2nd kernel.
>
> >>> END QUOTE
>
> For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
> vmcore for zfcpdump") we could no longer use this trick. Therefore we
> sent you the patches to get a clean interface for ELF header creation
> in the 2nd kernel.

Hi Michael,

Few more questions.

- What's the difference between zfcpdump and kdump. I thought zfcpdump
just boots specific kernel from fixed drive? If yes, why can't that
kernel prepare headers in similar way as regular kdump kernel does
and gain from kdump kernel swap trick?

Also, we are accessing the contents of elf headers using physical
address. If that's the case, does it make a difference if data is
in old kernel's memory or new kernel's memory. We will use the physical
address and create a temporary mapping and it should not make a difference
whether same physical page is already mapped in current kernel or not.

Only restriction this places is that all ELF header needs to be
contiguous. I see that s390 code already creates elf headers using
kzalloc_panic(). So memory allocated should by physically contiguous.

So can't we just put __pa(elfcorebuf) in elfcorehdr_addr. And same
is true for p_offset fields in PT_NOTE headers and everything should
work fine?

Only problem we can face is that at some point of time kzalloc() might
not be able to contiguous memory request. We can handle that once s390
runs into those issues. You are anyway allocating memory using kzalloc().

And if this works for s390 kdump, it should work for zfcpdump too?

Thanks
Vivek

2013-05-31 14:21:43

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Thu, 30 May 2013 16:38:47 -0400
Vivek Goyal <[email protected]> wrote:

> On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:
>
> [..]
> > >>> START QUOTE
> >
> > [PATCH v3 1/3] kdump: Introduce ELF header in new memory feature
> >
> > Currently for s390 we create the ELF core header in the 2nd kernel
> > with a small trick. We relocate the addresses in the ELF header in
> > a way that for the /proc/vmcore code it seems to be in the 1st
> > kernel (old) memory and the read_from_oldmem() returns the correct
> > data. This allows the /proc/vmcore code to use the ELF header in
> > the 2nd kernel.
> >
> > >>> END QUOTE
> >
> > For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
> > vmcore for zfcpdump") we could no longer use this trick. Therefore
> > we sent you the patches to get a clean interface for ELF header
> > creation in the 2nd kernel.
>
> Hi Michael,
>
> Few more questions.
>
> - What's the difference between zfcpdump and kdump. I thought zfcpdump
> just boots specific kernel from fixed drive? If yes, why can't that
> kernel prepare headers in similar way as regular kdump kernel does
> and gain from kdump kernel swap trick?

Correct, the zfcpdump kernel is booted from a fixed disk drive. The
difference is that the zfcpdump HSA memory is not mapped into real
memory. It is accessed using a read memory interface "memcpy_hsa()"
that copies memory from the hypervisor owned HSA memory into the Linux
memory.

So it looks like the following:

+----------+ +------------+
| | memcpy_hsa() | |
| zfcpdump | <-------------- | HSA memory |
| | | |
+----------+ +------------+
| |
| old mem |
| |
+----------+

In the copy_oldmem_page() function for zfcpdump we do the following:

copy_oldmem_page_zfcpdump(...)
{
if (src < ZFCPDUMP_HSA_SIZE) {
if (memcpy_hsa(buf, src, csize, userbuf) < 0)
return -EINVAL;
} else {
if (userbuf)
copy_to_user_real((buf, src, csize);
else
memcpy_real(buf, src, csize);
}
}

So I think for zfcpdump we only can use the read() interface
of /proc/vmcore. But this is sufficient for us since we also provide
the s390 specific zfcpdump user space that copies /proc/vmcore.

> Also, we are accessing the contents of elf headers using physical
> address. If that's the case, does it make a difference if data is
> in old kernel's memory or new kernel's memory. We will use the
> physical address and create a temporary mapping and it should not
> make a difference whether same physical page is already mapped in
> current kernel or not.
>
> Only restriction this places is that all ELF header needs to be
> contiguous. I see that s390 code already creates elf headers using
> kzalloc_panic(). So memory allocated should by physically contiguous.
>
> So can't we just put __pa(elfcorebuf) in elfcorehdr_addr. And same
> is true for p_offset fields in PT_NOTE headers and everything should
> work fine?
>
> Only problem we can face is that at some point of time kzalloc() might
> not be able to contiguous memory request. We can handle that once s390
> runs into those issues. You are anyway allocating memory using
> kzalloc().
>
> And if this works for s390 kdump, it should work for zfcpdump too?

So your suggestion is that copy_oldmem_page() should also be used for
copying memory from the new kernel, correct?

For kdump on s390 I think this will work with the new "ELF header swap"
patch. With that patch access to [0, OLDMEM_SIZE] will uniquely identify
an address in the new kernel and access to [OLDMEM_BASE, OLDMEM_BASE +
OLDMEM_SIZE] will identify an address in the old kernel.

For zfcpdump currently we add a load from [0, HSA_SIZE] where p_offset
equals p_paddr. Therefore we can't distinguish in copy_oldmem_page() if
we read from oldmem (HSA) or newmem. The range [0, HSA_SIZE] is used
twice. As a workaroun we could use an artificial p_offset for the HSA
memory chunk that is not used by the 1st kernel physical memory. This
is not really beautiful, but probably doable.

When I tried to implement this for kdump, I noticed another problem
with the vmcore mmap patches. Our copy_oldmem_page() function uses
memcpy_real() to access the old 1st kernel memory. This function
switches to real mode and therefore does not require any page tables.
But as a side effect of that we can't copy to vmalloc memory. The mmap
patches use vmalloc memory for "notes_buf". So currently using our
copy_oldmem_page() fails here.

If copy_oldmem_page() now also must be able to copy to vmalloc memory,
we would have to add new code for that:

* oldmem -> newmem (real): Use direct memcpy_real()
* oldmem -> newmem (vmalloc): Use intermediate buffer with memcpy_real()
* newmem -> newmem: Use memcpy()

What do you think?

Best Regards,
Michael

2013-05-31 16:02:38

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, May 31, 2013 at 04:21:27PM +0200, Michael Holzheu wrote:
> On Thu, 30 May 2013 16:38:47 -0400
> Vivek Goyal <[email protected]> wrote:
>
> > On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:
> >
> > [..]
> > > >>> START QUOTE
> > >
> > > [PATCH v3 1/3] kdump: Introduce ELF header in new memory feature
> > >
> > > Currently for s390 we create the ELF core header in the 2nd kernel
> > > with a small trick. We relocate the addresses in the ELF header in
> > > a way that for the /proc/vmcore code it seems to be in the 1st
> > > kernel (old) memory and the read_from_oldmem() returns the correct
> > > data. This allows the /proc/vmcore code to use the ELF header in
> > > the 2nd kernel.
> > >
> > > >>> END QUOTE
> > >
> > > For our current zfcpdump project (see "[PATCH 3/3]s390/kdump: Use
> > > vmcore for zfcpdump") we could no longer use this trick. Therefore
> > > we sent you the patches to get a clean interface for ELF header
> > > creation in the 2nd kernel.
> >
> > Hi Michael,
> >
> > Few more questions.
> >
> > - What's the difference between zfcpdump and kdump. I thought zfcpdump
> > just boots specific kernel from fixed drive? If yes, why can't that
> > kernel prepare headers in similar way as regular kdump kernel does
> > and gain from kdump kernel swap trick?
>
> Correct, the zfcpdump kernel is booted from a fixed disk drive. The
> difference is that the zfcpdump HSA memory is not mapped into real
> memory. It is accessed using a read memory interface "memcpy_hsa()"
> that copies memory from the hypervisor owned HSA memory into the Linux
> memory.
>
> So it looks like the following:
>
> +----------+ +------------+
> | | memcpy_hsa() | |
> | zfcpdump | <-------------- | HSA memory |
> | | | |
> +----------+ +------------+
> | |
> | old mem |
> | |
> +----------+
>
> In the copy_oldmem_page() function for zfcpdump we do the following:
>
> copy_oldmem_page_zfcpdump(...)
> {
> if (src < ZFCPDUMP_HSA_SIZE) {
> if (memcpy_hsa(buf, src, csize, userbuf) < 0)
> return -EINVAL;
> } else {
> if (userbuf)
> copy_to_user_real((buf, src, csize);
> else
> memcpy_real(buf, src, csize);
> }
> }
>
> So I think for zfcpdump we only can use the read() interface
> of /proc/vmcore. But this is sufficient for us since we also provide
> the s390 specific zfcpdump user space that copies /proc/vmcore.
>
> > Also, we are accessing the contents of elf headers using physical
> > address. If that's the case, does it make a difference if data is
> > in old kernel's memory or new kernel's memory. We will use the
> > physical address and create a temporary mapping and it should not
> > make a difference whether same physical page is already mapped in
> > current kernel or not.
> >
> > Only restriction this places is that all ELF header needs to be
> > contiguous. I see that s390 code already creates elf headers using
> > kzalloc_panic(). So memory allocated should by physically contiguous.
> >
> > So can't we just put __pa(elfcorebuf) in elfcorehdr_addr. And same
> > is true for p_offset fields in PT_NOTE headers and everything should
> > work fine?
> >
> > Only problem we can face is that at some point of time kzalloc() might
> > not be able to contiguous memory request. We can handle that once s390
> > runs into those issues. You are anyway allocating memory using
> > kzalloc().
> >
> > And if this works for s390 kdump, it should work for zfcpdump too?
>
> So your suggestion is that copy_oldmem_page() should also be used for
> copying memory from the new kernel, correct?

Yes.

>
> For kdump on s390 I think this will work with the new "ELF header swap"
> patch. With that patch access to [0, OLDMEM_SIZE] will uniquely identify
> an address in the new kernel and access to [OLDMEM_BASE, OLDMEM_BASE +
> OLDMEM_SIZE] will identify an address in the old kernel.
>
> For zfcpdump currently we add a load from [0, HSA_SIZE] where p_offset
> equals p_paddr. Therefore we can't distinguish in copy_oldmem_page() if
> we read from oldmem (HSA) or newmem. The range [0, HSA_SIZE] is used
> twice. As a workaroun we could use an artificial p_offset for the HSA
> memory chunk that is not used by the 1st kernel physical memory. This
> is not really beautiful, but probably doable.

Ok, zfcpdump is a problem because HSA memory region is in addition to
regular memory address space.

Yep trying to figure out unused memory region in first kernel and mapping
HSA to that is little ugly. But you know s390 better, so you decide
whether you want to take that path or not. Generic code does not care
what p_offset is pointing to.

If you decide not to do that, agreed that copy_oldmem_page() need to
differentiate between reference to HSA memory and reference to new
memory. I guess in that case we will have to go with original proposal
of using arch functions to access and read headers. I guess we can
export physical address of elf headers to generic code and when
generic code reads from that address using arch function we can read
from new memory. copy_oldmem_page() will read from HSA memory for
anything less than HSA_SIZE and from real memory for anything above
it.

Other way could the to use lower bits of p_offset field to store
additional info. But that would make generic code ugly.

>
> When I tried to implement this for kdump, I noticed another problem
> with the vmcore mmap patches. Our copy_oldmem_page() function uses
> memcpy_real() to access the old 1st kernel memory. This function
> switches to real mode and therefore does not require any page tables.
> But as a side effect of that we can't copy to vmalloc memory. The mmap
> patches use vmalloc memory for "notes_buf". So currently using our
> copy_oldmem_page() fails here.
>
> If copy_oldmem_page() now also must be able to copy to vmalloc memory,
> we would have to add new code for that:
>
> * oldmem -> newmem (real): Use direct memcpy_real()
> * oldmem -> newmem (vmalloc): Use intermediate buffer with memcpy_real()
> * newmem -> newmem: Use memcpy()
>
> What do you think?

Yep, looks like you will have to do something like that.

Can't we map HSA frames temporarily, copy data and tear down the mapping?

If not, how would remap_pfn_range() work with HSA region when
/proc/vmcore is mmaped()?

Thanks
Vivek

2013-06-03 13:27:30

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Fri, 31 May 2013 12:01:58 -0400
Vivek Goyal <[email protected]> wrote:

> On Fri, May 31, 2013 at 04:21:27PM +0200, Michael Holzheu wrote:
> > On Thu, 30 May 2013 16:38:47 -0400
> > Vivek Goyal <[email protected]> wrote:
> >
> > > On Wed, May 29, 2013 at 01:51:44PM +0200, Michael Holzheu wrote:
> > >

[...]

> > For zfcpdump currently we add a load from [0, HSA_SIZE] where
> > p_offset equals p_paddr. Therefore we can't distinguish in
> > copy_oldmem_page() if we read from oldmem (HSA) or newmem. The
> > range [0, HSA_SIZE] is used twice. As a workaroun we could use an
> > artificial p_offset for the HSA memory chunk that is not used by
> > the 1st kernel physical memory. This is not really beautiful, but
> > probably doable.
>
> Ok, zfcpdump is a problem because HSA memory region is in addition to
> regular memory address space.

Right and the HSA memory is accessed with a read() interface and can't
be directly mapped.

[...]

> If you decide not to do that, agreed that copy_oldmem_page() need to
> differentiate between reference to HSA memory and reference to new
> memory. I guess in that case we will have to go with original proposal
> of using arch functions to access and read headers.

Let me think about that a bit more ...

[...]

> > If copy_oldmem_page() now also must be able to copy to vmalloc
> > memory, we would have to add new code for that:
> >
> > * oldmem -> newmem (real): Use direct memcpy_real()
> > * oldmem -> newmem (vmalloc): Use intermediate buffer with
> > memcpy_real()
> > * newmem -> newmem: Use memcpy()
> >
> > What do you think?
>
> Yep, looks like you will have to do something like that.
>
> Can't we map HSA frames temporarily, copy data and tear down the
> mapping?

Yes, we would have to create a *temporarily* mapping (see suggestion
below). We do not have enough memory to copy the complete HSA.

> If not, how would remap_pfn_range() work with HSA region when
> /proc/vmcore is mmaped()?

I am no memory management expert, so I discussed that with Martin
Schwidefsky (s390 architecture maintainer). Perhaps something like
the following could work:

After vmcore_mmap() is called the HSA pages are not initially mapped in
the page tables. So when user space accesses those parts
of /proc/vmcore, a fault will be generated. We implement a mechanism
that in this case the HSA is copied to a new page in the page cache and
a mapping is created for it. Since the page is allocated in the page
cache, it can be released afterwards by the kernel when we get memory
pressure.

Our current idea for such an implementation:

* Create new address space (struct address_space) for /proc/vmcore.
* Implement new vm_operations_struct "vmcore_mmap_ops" with
new vmcore_fault() ".fault" callback for /proc/vmcore.
* Set vma->vm_ops to vmcore_mmap_ops in mmap_vmcore().
* The vmcore_fault() function will get a new page cache page,
copy HSA page to page cache page add it to vmcore address space.
To see how this could work, we looked into the functions
filemap_fault() in "mm/filemap.c" and relay_buf_fault() in
"kernel/relay.c".

What do you think?

Michael

2013-06-03 16:40:54

by Vivek Goyal

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Mon, Jun 03, 2013 at 03:27:18PM +0200, Michael Holzheu wrote:

[..]
> > If not, how would remap_pfn_range() work with HSA region when
> > /proc/vmcore is mmaped()?
>
> I am no memory management expert, so I discussed that with Martin
> Schwidefsky (s390 architecture maintainer). Perhaps something like
> the following could work:
>
> After vmcore_mmap() is called the HSA pages are not initially mapped in
> the page tables. So when user space accesses those parts
> of /proc/vmcore, a fault will be generated. We implement a mechanism
> that in this case the HSA is copied to a new page in the page cache and
> a mapping is created for it. Since the page is allocated in the page
> cache, it can be released afterwards by the kernel when we get memory
> pressure.
>
> Our current idea for such an implementation:
>
> * Create new address space (struct address_space) for /proc/vmcore.
> * Implement new vm_operations_struct "vmcore_mmap_ops" with
> new vmcore_fault() ".fault" callback for /proc/vmcore.
> * Set vma->vm_ops to vmcore_mmap_ops in mmap_vmcore().
> * The vmcore_fault() function will get a new page cache page,
> copy HSA page to page cache page add it to vmcore address space.
> To see how this could work, we looked into the functions
> filemap_fault() in "mm/filemap.c" and relay_buf_fault() in
> "kernel/relay.c".
>
> What do you think?

I am not mm expert either but above proposal sounds reasonable to me.

So remap_pfn_range() call will go in arch dependent code so that arch
can decide which range can be mapped right away and which ranges will
be filed in when fault happens? I am assuming that s390 will map
everything except for pfn between 0 and HSA_SIZE.

And regular s390 kdump will map everyting right away and will not
have to rely on fault mechanism?

Thanks
Vivek

2013-06-03 16:48:56

by Michael Holzheu

[permalink] [raw]
Subject: Re: [PATCH 0/2] kdump/mmap: Fix mmap of /proc/vmcore for s390

On Mon, 3 Jun 2013 11:59:40 -0400
Vivek Goyal <[email protected]> wrote:

> On Mon, Jun 03, 2013 at 03:27:18PM +0200, Michael Holzheu wrote:
>
> [..]
> > > If not, how would remap_pfn_range() work with HSA region when
> > > /proc/vmcore is mmaped()?
> >
> > I am no memory management expert, so I discussed that with Martin
> > Schwidefsky (s390 architecture maintainer). Perhaps something like
> > the following could work:
> >
> > After vmcore_mmap() is called the HSA pages are not initially
> > mapped in the page tables. So when user space accesses those parts
> > of /proc/vmcore, a fault will be generated. We implement a mechanism
> > that in this case the HSA is copied to a new page in the page cache
> > and a mapping is created for it. Since the page is allocated in the
> > page cache, it can be released afterwards by the kernel when we get
> > memory pressure.
> >
> > Our current idea for such an implementation:
> >
> > * Create new address space (struct address_space) for /proc/vmcore.
> > * Implement new vm_operations_struct "vmcore_mmap_ops" with
> > new vmcore_fault() ".fault" callback for /proc/vmcore.
> > * Set vma->vm_ops to vmcore_mmap_ops in mmap_vmcore().
> > * The vmcore_fault() function will get a new page cache page,
> > copy HSA page to page cache page add it to vmcore address space.
> > To see how this could work, we looked into the functions
> > filemap_fault() in "mm/filemap.c" and relay_buf_fault() in
> > "kernel/relay.c".
> >
> > What do you think?
>
> I am not mm expert either but above proposal sounds reasonable to me.
>
> So remap_pfn_range() call will go in arch dependent code so that arch
> can decide which range can be mapped right away and which ranges will
> be filed in when fault happens? I am assuming that s390 will map
> everything except for pfn between 0 and HSA_SIZE.

Yes, for [0 - HSA_SIZE] the fault handler will be called and for
the rest we establish a mapping with remap_pfn_range() as it is
currently done. Therefore no fault handler will be called for that part
of /proc/vmcore.

I will try to find out if it is doable that way.

> And regular s390 kdump will map everyting right away and will not
> have to rely on fault mechanism?

Yes, as kdump on the other archs.

Thanks
Michael