2014-07-07 15:06:04

by Vitaly Kuznetsov

Subject: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

We have a special check in the read_vmcore() handler to see whether the page
was reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results: e.g. when running a Xen PVHVM guest, memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages, creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep the direct
path through remap_oldmem_pfn_range() to avoid looping over all pages on
bare metal.

The issue could also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, which is what remap_oldmem_pfn_range() was designed for.
That, however, would introduce a non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent any other hypervisor-specific
code on the x86 arch from doing the same override.

Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
fs/proc/vmcore.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 62 insertions(+), 6 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..2716e19 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
* virtually contiguous user-space in ELF layout.
*/
#ifdef CONFIG_MMU
+static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
+ unsigned long pfn, unsigned long page_count)
+{
+ unsigned long pos;
+ size_t size;
+ unsigned long vma_addr;
+ unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
+
+ for (pos = pfn; (pos - pfn) <= page_count; pos++) {
+ if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
+ /* we hit a page which is not ram or reached the end */
+ if (pos - pfn > 0) {
+ /* remapping continuous region */
+ size = (pos - pfn) << PAGE_SHIFT;
+ vma_addr = vma->vm_start + len;
+ if (remap_oldmem_pfn_range(vma, vma_addr,
+ pfn, size,
+ vma->vm_page_prot))
+ return len;
+ len += size;
+ page_count -= (pos - pfn);
+ }
+ if (page_count > 0) {
+ /* we hit a page which is not ram, replacing
+ with an empty one */
+ vma_addr = vma->vm_start + len;
+ if (remap_oldmem_pfn_range(vma, vma_addr,
+ emptypage_pfn,
+ PAGE_SIZE,
+ vma->vm_page_prot))
+ return len;
+ len += PAGE_SIZE;
+ pfn = pos + 1;
+ page_count--;
+ }
+ }
+ }
+ return len;
+}
+
static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
@@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)

list_for_each_entry(m, &vmcore_list, list) {
if (start < m->offset + m->size) {
- u64 paddr = 0;
+ u64 paddr = 0, original_len;
+ unsigned long pfn, page_count;

tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
- if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
- paddr >> PAGE_SHIFT, tsz,
- vma->vm_page_prot))
- goto fail;
+
+ /* check if oldmem_pfn_is_ram was registered to avoid
+ looping over all pages without a reason */
+ if (oldmem_pfn_is_ram) {
+ pfn = paddr >> PAGE_SHIFT;
+ page_count = tsz >> PAGE_SHIFT;
+ original_len = len;
+ len = remap_oldmem_pfn_checked(vma, len, pfn,
+ page_count);
+ if (len != original_len + tsz)
+ goto fail;
+ } else {
+ if (remap_oldmem_pfn_range(vma,
+ vma->vm_start + len,
+ paddr >> PAGE_SHIFT,
+ tsz,
+ vma->vm_page_prot))
+ goto fail;
+ len += tsz;
+ }
size -= tsz;
start += tsz;
- len += tsz;

if (size == 0)
return 0;
--
1.9.3


2014-07-07 20:33:06

by Andrew Morton

Subject: Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <[email protected]> wrote:

> we have a special check in read_vmcore() handler to check if the page was
> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
> vmcore is read with mmap() no such check is performed. That can lead to
> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
> enormous load in both DomU and Dom0.
>
> Fix the issue by mapping each non-ram page to the zero page. Keep direct
> path with remap_oldmem_pfn_range() to avoid looping through all pages on
> bare metal.
>
> The issue can also be solved by overriding remap_oldmem_pfn_range() in
> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
> That, however, would involve non-obvious xen code path for all x86 builds
> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
> code on x86 arch from doing the same override.

I'd like to get some reviewed-by's and tested-by's on this one please.

> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
> * virtually contiguous user-space in ELF layout.
> */
> #ifdef CONFIG_MMU
> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
> + unsigned long pfn, unsigned long page_count)
> +{
> + unsigned long pos;
> + size_t size;
> + unsigned long vma_addr;
> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;

That's old-school. Can we use my_zero_pfn() here?

Also, "zeropage_pfn" is a better name - let's not introduce the
hitherto unknown concept of an "empty page".

> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
> + /* we hit a page which is not ram or reached the end */
> + if (pos - pfn > 0) {
> + /* remapping continuous region */
> + size = (pos - pfn) << PAGE_SHIFT;
> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + pfn, size,
> + vma->vm_page_prot))
> + return len;
> + len += size;
> + page_count -= (pos - pfn);
> + }
> + if (page_count > 0) {
> + /* we hit a page which is not ram, replacing
> + with an empty one */

I suggest

/*
* We hit a page which is not ram. Replace it
* with the zero page.
*/

> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + emptypage_pfn,
> + PAGE_SIZE,
> + vma->vm_page_prot))
> + return len;
> + len += PAGE_SIZE;
> + pfn = pos + 1;
> + page_count--;
> + }
> + }
> + }
> + return len;
> +}

Also, this loop seems unnecessarily hard to follow. It *looks* like the
`for' statement has an off-by-one error because of the "<=", but page_count
is modified inside the loop! Despite it being an incoming formal
argument.

None of this is made any easier by the function's lack of
documentation. Some description of the incoming args would help, along
with an overall description of the function's responsibilities.

That being said, can't we just do something nice and simple like

pos = pfn;
while (pos < pfn + page_count) {
stuff which advances `pos'
}

?

> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> list_for_each_entry(m, &vmcore_list, list) {
> if (start < m->offset + m->size) {
> - u64 paddr = 0;
> + u64 paddr = 0, original_len;
> + unsigned long pfn, page_count;
>
> tsz = min_t(size_t, m->offset + m->size - start, size);
> paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> - paddr >> PAGE_SHIFT, tsz,
> - vma->vm_page_prot))
> - goto fail;
> +
> + /* check if oldmem_pfn_is_ram was registered to avoid
> + looping over all pages without a reason */

Please lay out the comments in the usual fashion:

/*
* ...
* ..
*/

And sentences start with upper-case letters!

> + if (oldmem_pfn_is_ram) {
> + pfn = paddr >> PAGE_SHIFT;
> + page_count = tsz >> PAGE_SHIFT;
> + original_len = len;
> + len = remap_oldmem_pfn_checked(vma, len, pfn,
> + page_count);
> + if (len != original_len + tsz)
> + goto fail;
> + } else {
> + if (remap_oldmem_pfn_range(vma,
> + vma->vm_start + len,
> + paddr >> PAGE_SHIFT,
> + tsz,
> + vma->vm_page_prot))
> + goto fail;
> + len += tsz;
> + }
> size -= tsz;
> start += tsz;
> - len += tsz;
>
> if (size == 0)
> return 0;

2014-07-08 08:09:13

by Vitaly Kuznetsov

Subject: Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

Andrew Morton <[email protected]> writes:

> On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <[email protected]> wrote:
>
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.
>>
>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.
>
> I'd like to get some reviewed-by's and tested-by's on this one please.
>

This patch can only be tested on a Xen PVHVM guest, as that is the only
platform which registers oldmem_pfn_is_ram at the moment.

>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>> * virtually contiguous user-space in ELF layout.
>> */
>> #ifdef CONFIG_MMU
>> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
>> + unsigned long pfn, unsigned long page_count)
>> +{
>> + unsigned long pos;
>> + size_t size;
>> + unsigned long vma_addr;
>> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
>
> That's old-school. Can we use my_zero_pfn() here?
>
> Also, "zeropage_pfn" is a better name - let's not introduce the
> hitherto unknown concept of an "empty page".
>

Sure!

>> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
>> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
>> + /* we hit a page which is not ram or reached the end */
>> + if (pos - pfn > 0) {
>> + /* remapping continuous region */
>> + size = (pos - pfn) << PAGE_SHIFT;
>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + pfn, size,
>> + vma->vm_page_prot))
>> + return len;
>> + len += size;
>> + page_count -= (pos - pfn);
>> + }
>> + if (page_count > 0) {
>> + /* we hit a page which is not ram, replacing
>> + with an empty one */
>
> I suggest
>
> /*
> * We hit a page which is not ram. Replace it
> * with the zero page.
> */
>

:-)

>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + emptypage_pfn,
>> + PAGE_SIZE,
>> + vma->vm_page_prot))
>> + return len;
>> + len += PAGE_SIZE;
>> + pfn = pos + 1;
>> + page_count--;
>> + }
>> + }
>> + }
>> + return len;
>> +}
>
> Also, this loop seems unnecessarily hard to follow. It *look* like the
> `for' statement has an off-by-one because of the "<=", but page_count
> is mofidied inside the loop! Despite it being an incoming formal
> argument.

There is no off-by-one error here (I believe), as I'm checking two
possible conditions for remapping the continuous region:
1) We hit a non-ram page
2) We reached the end
so we exclude the page pos is pointing to in both cases.

I tried to avoid code duplication, e.g. having one 'remap continuous
region' block inside the loop to do the remapping when we hit a non-ram
page and another one outside the loop to remap the tail.

>
> None of this is made any easier by the function's lack of
> documentation. Some description of the incoming args would help, along
> with an overall description of the function's responsibilities.
>
> That being said, can't we just do something nice and simple like
>
> pos = pfn;
> while (pos < pfn + page_count) {
> stuff which advances `pos'
> }
>
> ?
>

I completely agree it's possible to make this code easier to
understand, will do.

>> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>> {
>> size_t size = vma->vm_end - vma->vm_start;
>> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>
>> list_for_each_entry(m, &vmcore_list, list) {
>> if (start < m->offset + m->size) {
>> - u64 paddr = 0;
>> + u64 paddr = 0, original_len;
>> + unsigned long pfn, page_count;
>>
>> tsz = min_t(size_t, m->offset + m->size - start, size);
>> paddr = m->paddr + start - m->offset;
>> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> - paddr >> PAGE_SHIFT, tsz,
>> - vma->vm_page_prot))
>> - goto fail;
>> +
>> + /* check if oldmem_pfn_is_ram was registered to avoid
>> + looping over all pages without a reason */
>
> Please lay out the comments in the usual fashion:
>
> /*
> * ...
> * ..
> */
>
> And sentences start whith upper-case letters!
>
>> + if (oldmem_pfn_is_ram) {
>> + pfn = paddr >> PAGE_SHIFT;
>> + page_count = tsz >> PAGE_SHIFT;
>> + original_len = len;
>> + len = remap_oldmem_pfn_checked(vma, len, pfn,
>> + page_count);
>> + if (len != original_len + tsz)
>> + goto fail;
>> + } else {
>> + if (remap_oldmem_pfn_range(vma,
>> + vma->vm_start + len,
>> + paddr >> PAGE_SHIFT,
>> + tsz,
>> + vma->vm_page_prot))
>> + goto fail;
>> + len += tsz;
>> + }
>> size -= tsz;
>> start += tsz;
>> - len += tsz;
>>
>> if (size == 0)
>> return 0;

Thank you very much for your review! I'm working on v2.

--
Vitaly

2014-07-08 16:27:26

by David Vrabel

Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On 07/07/14 21:33, Andrew Morton wrote:
> On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <[email protected]> wrote:
>
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.

Does it make forward progress though? Or does it end up repeatedly
retrying the same instruction?

Is it failing on a ballooned page in a RAM region? Or is it mapping
non-RAM regions as well?

>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.

oldmem_pfn_is_ram() is Xen-specific, but this problem (ballooned
pages) must be common to KVM. How does KVM handle this?

David

2014-07-08 17:27:56

by Konrad Rzeszutek Wilk

Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
> we have a special check in read_vmcore() handler to check if the page was
> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
> vmcore is read with mmap() no such check is performed. That can lead to
> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
> enormous load in both DomU and Dom0.
>
> Fix the issue by mapping each non-ram page to the zero page. Keep direct
> path with remap_oldmem_pfn_range() to avoid looping through all pages on
> bare metal.
>
> The issue can also be solved by overriding remap_oldmem_pfn_range() in
> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
> That, however, would involve non-obvious xen code path for all x86 builds
> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
> code on x86 arch from doing the same override.

Could 'remap_oldmem_pfn_range' become a function op? I see there
is a 'register_oldmem_pfn_is_ram' - so could there be a similar one for
'pfn_range'?

>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> fs/proc/vmcore.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 382aa89..2716e19 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
> * virtually contiguous user-space in ELF layout.
> */
> #ifdef CONFIG_MMU
> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
> + unsigned long pfn, unsigned long page_count)
> +{
> + unsigned long pos;
> + size_t size;
> + unsigned long vma_addr;
> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
> +
> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
> + /* we hit a page which is not ram or reached the end */
> + if (pos - pfn > 0) {
> + /* remapping continuous region */
> + size = (pos - pfn) << PAGE_SHIFT;
> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + pfn, size,
> + vma->vm_page_prot))
> + return len;
> + len += size;
> + page_count -= (pos - pfn);
> + }
> + if (page_count > 0) {
> + /* we hit a page which is not ram, replacing
> + with an empty one */
> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + emptypage_pfn,
> + PAGE_SIZE,
> + vma->vm_page_prot))
> + return len;
> + len += PAGE_SIZE;
> + pfn = pos + 1;
> + page_count--;
> + }
> + }
> + }
> + return len;
> +}
> +
> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> list_for_each_entry(m, &vmcore_list, list) {
> if (start < m->offset + m->size) {
> - u64 paddr = 0;
> + u64 paddr = 0, original_len;
> + unsigned long pfn, page_count;
>
> tsz = min_t(size_t, m->offset + m->size - start, size);
> paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> - paddr >> PAGE_SHIFT, tsz,
> - vma->vm_page_prot))
> - goto fail;
> +
> + /* check if oldmem_pfn_is_ram was registered to avoid
> + looping over all pages without a reason */
> + if (oldmem_pfn_is_ram) {
> + pfn = paddr >> PAGE_SHIFT;
> + page_count = tsz >> PAGE_SHIFT;
> + original_len = len;
> + len = remap_oldmem_pfn_checked(vma, len, pfn,
> + page_count);
> + if (len != original_len + tsz)
> + goto fail;
> + } else {
> + if (remap_oldmem_pfn_range(vma,
> + vma->vm_start + len,
> + paddr >> PAGE_SHIFT,
> + tsz,
> + vma->vm_page_prot))
> + goto fail;
> + len += tsz;
> + }
> size -= tsz;
> start += tsz;
> - len += tsz;
>
> if (size == 0)
> return 0;
> --
> 1.9.3
>
>
> _______________________________________________
> Xen-devel mailing list
> [email protected]
> http://lists.xen.org/xen-devel

2014-07-08 19:12:15

by Vivek Goyal

Subject: Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
> we have a special check in read_vmcore() handler to check if the page was
> reported as ram or not by the hypervisor (pfn_is_ram()).

I am wondering if the name pfn_is_ram() is appropriate for what we are
doing. So IIUC, ballooned memory is also RAM, just memory that has not
been allocated yet. That means we can safely assume that there is no
data and can safely fill it with zeros?

If yes, then page_is_zero_filled() might be a more appropriate name.

Also I am wondering why it was not done as part of copy_oldmem_page(),
so that the respective arch could hide all the details.

> However, when
> vmcore is read with mmap() no such check is performed. That can lead to
> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
> enormous load in both DomU and Dom0.
>
> Fix the issue by mapping each non-ram page to the zero page. Keep direct
> path with remap_oldmem_pfn_range() to avoid looping through all pages on
> bare metal.
>
> The issue can also be solved by overriding remap_oldmem_pfn_range() in
> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
> That, however, would involve non-obvious xen code path for all x86 builds
> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
> code on x86 arch from doing the same override.

I am not sure I understand this part. What is the "all other hypervisor
specific" code which would like to do this? And will that code be compiled
at the same time as CONFIG_XEN_PVHVM?

>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> fs/proc/vmcore.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 62 insertions(+), 6 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 382aa89..2716e19 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
> * virtually contiguous user-space in ELF layout.
> */
> #ifdef CONFIG_MMU
> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
> + unsigned long pfn, unsigned long page_count)
> +{
> + unsigned long pos;
> + size_t size;
> + unsigned long vma_addr;
> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
> +
> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
> + /* we hit a page which is not ram or reached the end */
> + if (pos - pfn > 0) {
> + /* remapping continuous region */
> + size = (pos - pfn) << PAGE_SHIFT;
> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + pfn, size,
> + vma->vm_page_prot))
> + return len;
> + len += size;
> + page_count -= (pos - pfn);
> + }
> + if (page_count > 0) {
> + /* we hit a page which is not ram, replacing
> + with an empty one */
> + vma_addr = vma->vm_start + len;
> + if (remap_oldmem_pfn_range(vma, vma_addr,
> + emptypage_pfn,
> + PAGE_SIZE,
> + vma->vm_page_prot))
> + return len;
> + len += PAGE_SIZE;
> + pfn = pos + 1;
> + page_count--;
> + }
> + }
> + }
> + return len;
> +}
> +
> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> list_for_each_entry(m, &vmcore_list, list) {
> if (start < m->offset + m->size) {
> - u64 paddr = 0;
> + u64 paddr = 0, original_len;
> + unsigned long pfn, page_count;
>
> tsz = min_t(size_t, m->offset + m->size - start, size);
> paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> - paddr >> PAGE_SHIFT, tsz,
> - vma->vm_page_prot))
> - goto fail;
> +
> + /* check if oldmem_pfn_is_ram was registered to avoid
> + looping over all pages without a reason */
> + if (oldmem_pfn_is_ram) {
> + pfn = paddr >> PAGE_SHIFT;
> + page_count = tsz >> PAGE_SHIFT;
> + original_len = len;
> + len = remap_oldmem_pfn_checked(vma, len, pfn,
> + page_count);
> + if (len != original_len + tsz)
> + goto fail;
> + } else {
> + if (remap_oldmem_pfn_range(vma,
> + vma->vm_start + len,
> + paddr >> PAGE_SHIFT,
> + tsz,
> + vma->vm_page_prot))
> + goto fail;

Why are we defining a separate remap_oldmem_pfn_checked()? Can't we just
modify remap_oldmem_pfn_range() to *always* check
pfn_is_zero_filled() and map accordingly?

Thanks
Vivek

2014-07-09 09:17:31

by Vitaly Kuznetsov

Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

David Vrabel <[email protected]> writes:

> On 07/07/14 21:33, Andrew Morton wrote:
>> On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <[email protected]> wrote:
>>
>>> we have a special check in read_vmcore() handler to check if the page was
>>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>>> vmcore is read with mmap() no such check is performed. That can lead to
>>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>>> enormous load in both DomU and Dom0.
>
> Does make forward progress though? Or is it ending up in a repeatedly
> retrying the same instruction?

If memcpy is using the SSE2 optimization, the 16-byte 'movdqu' instruction
never finishes (repeatedly retrying to issue two 8-byte requests to
qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO
space' and returns 8 0xff bytes for each of these requests (I was testing
with qemu-traditional).

>
> Is it failing on a ballooned page in a RAM region? Or is mapping non-RAM
> regions as well?

I wasn't using ballooning; it happens that oldmem has several (two in my
test) pages which are HVMMEM_mmio_dm, but qemu-dm considers them
neither ram nor mmio.

>
>>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>>> bare metal.
>>>
>>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>>> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
>>> That, however, would involve non-obvious xen code path for all x86 builds
>>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>>> code on x86 arch from doing the same override.
>
> The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
> pages) must be common to KVM. How does KVM handle this?

As far as I'm concerned the issue was never hit with KVM. I *think* the
issue has something to do with the conjunction of 16-byte 'movdqu'
emulation for io pages in the xen hypervisor, 8-byte event channel requests,
and qemu-traditional. But even if it gets fixed on the hypervisor side, I
believe fixing the issue kernel-side is still worth it, as there are
unfixed hypervisors out there (e.g. AWS EC2).

>
> David

--
Vitaly

2014-07-09 09:23:39

by Vitaly Kuznetsov

Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

Konrad Rzeszutek Wilk <[email protected]> writes:

> On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.
>>
>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.
>
> Could the 'remap_oldmem_pfn_range' become an function ops? I see there
> is an 'register_oldmem_pfn_is_ram' - so could there be similar one for
> 'pfn_range'?

Yes, it is possible to replace the '__weak remap_oldmem_pfn_range' with a
registration mechanism like 'register_oldmem_pfn_is_ram'. However, the s390
arch overrides this function in arch/s390/kernel/crash_dump.c, so we'll have
to make some changes there as well.

>
>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> fs/proc/vmcore.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 62 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 382aa89..2716e19 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>> * virtually contiguous user-space in ELF layout.
>> */
>> #ifdef CONFIG_MMU
>> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
>> + unsigned long pfn, unsigned long page_count)
>> +{
>> + unsigned long pos;
>> + size_t size;
>> + unsigned long vma_addr;
>> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
>> +
>> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
>> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
>> + /* we hit a page which is not ram or reached the end */
>> + if (pos - pfn > 0) {
>> + /* remapping continuous region */
>> + size = (pos - pfn) << PAGE_SHIFT;
>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + pfn, size,
>> + vma->vm_page_prot))
>> + return len;
>> + len += size;
>> + page_count -= (pos - pfn);
>> + }
>> + if (page_count > 0) {
>> + /* we hit a page which is not ram, replacing
>> + with an empty one */
>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + emptypage_pfn,
>> + PAGE_SIZE,
>> + vma->vm_page_prot))
>> + return len;
>> + len += PAGE_SIZE;
>> + pfn = pos + 1;
>> + page_count--;
>> + }
>> + }
>> + }
>> + return len;
>> +}
>> +
>> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>> {
>> size_t size = vma->vm_end - vma->vm_start;
>> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>
>> list_for_each_entry(m, &vmcore_list, list) {
>> if (start < m->offset + m->size) {
>> - u64 paddr = 0;
>> + u64 paddr = 0, original_len;
>> + unsigned long pfn, page_count;
>>
>> tsz = min_t(size_t, m->offset + m->size - start, size);
>> paddr = m->paddr + start - m->offset;
>> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> - paddr >> PAGE_SHIFT, tsz,
>> - vma->vm_page_prot))
>> - goto fail;
>> +
>> + /* check if oldmem_pfn_is_ram was registered to avoid
>> + looping over all pages without a reason */
>> + if (oldmem_pfn_is_ram) {
>> + pfn = paddr >> PAGE_SHIFT;
>> + page_count = tsz >> PAGE_SHIFT;
>> + original_len = len;
>> + len = remap_oldmem_pfn_checked(vma, len, pfn,
>> + page_count);
>> + if (len != original_len + tsz)
>> + goto fail;
>> + } else {
>> + if (remap_oldmem_pfn_range(vma,
>> + vma->vm_start + len,
>> + paddr >> PAGE_SHIFT,
>> + tsz,
>> + vma->vm_page_prot))
>> + goto fail;
>> + len += tsz;
>> + }
>> size -= tsz;
>> start += tsz;
>> - len += tsz;
>>
>> if (size == 0)
>> return 0;
>> --
>> 1.9.3
>>
>>

--
Vitaly

2014-07-09 09:46:40

by David Vrabel

Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On 09/07/14 10:17, Vitaly Kuznetsov wrote:
> David Vrabel <[email protected]> writes:
>
>> On 07/07/14 21:33, Andrew Morton wrote:
>>> On Mon, 7 Jul 2014 17:05:49 +0200 Vitaly Kuznetsov <[email protected]> wrote:
>>>
>>>> we have a special check in read_vmcore() handler to check if the page was
>>>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>>>> vmcore is read with mmap() no such check is performed. That can lead to
>>>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>>>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>>>> enormous load in both DomU and Dom0.
>>
>> Does it make forward progress though? Or is it ending up repeatedly
>> retrying the same instruction?
>
> If memcpy() is using the SSE2 optimization, the 16-byte 'movdqu' instruction
> never finishes (it repeatedly retries issuing two 8-byte requests to
> qemu-dm). qemu-dm decides that it's hitting 'Neither RAM nor known MMIO
> space' and returns 8 0xff bytes for both of these requests (I was testing
> with qemu-traditional).

Yes, the emulation of instructions with 16-byte operands is a bit
broken. It should be fixed.

>> Is it failing on a ballooned page in a RAM region? Or is it mapping
>> non-RAM regions as well?
>
> I wasn't using ballooning, it happens that oldmem has several (two in my
> test) pages which are HVMMEM_mmio_dm but qemu-dm considers them being
> neither ram nor mmio.

I think this would also happen with ballooned pages, which are also
not present in the p2m and thus would show up as the HVMMEM_mmio_dm type;
accesses to them would also be forwarded to qemu (qemu gets everything by
default).

>>>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>>>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>>>> bare metal.
>>>>
>>>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>>>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>>>> That, however, would involve non-obvious xen code path for all x86 builds
>>>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>>>> code on x86 arch from doing the same override.
>>
>> The oldmem_pfn_is_ram() is Xen-specific but this problem (ballooned
>> pages) must be common to KVM. How does KVM handle this?
>
> As far as I'm concerned the issue was never hit with KVM. I *think* the
> issue has something to do with the conjunction of 16-byte 'movdqu'
> emulation for io pages in the xen hypervisor, 8-byte event channel requests
> and qemu-traditional. But even if it gets fixed on the hypervisor side I
> believe fixing the issue kernel-side is still worth it, as there are
> non-fixed hypervisors out there (e.g. AWS EC2).

I think it would be preferable to fix this on the hypervisor side so
Xen guests behave in the same way as KVM guests.

But if this needs to work on non-fixed hypervisors then this patch looks
sensible. FWIW,

Acked-by: David Vrabel <[email protected]>

David

2014-07-09 09:46:58

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

Vivek Goyal <[email protected]> writes:

> On Mon, Jul 07, 2014 at 05:05:49PM +0200, Vitaly Kuznetsov wrote:
>> we have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()).
>
> I am wondering if the name pfn_is_ram() is appropriate for what we are
> doing. So IIUC, ballooned memory is also RAM, just that it has not
> been allocated yet. That means we can safely assume that there is no
> data and can safely fill it with zeros?

For Xen, pfn_is_ram() returns 0 in case the page is mmio. Ballooned
pages are also considered mmio (HVMOP_get_mem_type returns
HVMMEM_mmio_dm).
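The decision described here can be modelled outside the kernel. The
following is a user-space sketch, not the actual Xen code: the enum
mirrors the type names from the Xen ABI (the numeric values are
illustrative), and fake_p2m stands in for the HVMOP_get_mem_type
hypercall.

```c
/* Memory types as reported by Xen's HVMOP_get_mem_type. The numeric
 * values here are illustrative, not taken from the Xen ABI headers. */
enum hvmmem_type { HVMMEM_ram_rw, HVMMEM_ram_ro, HVMMEM_mmio_dm };

/* Stub for the hypercall: a fake p2m table indexed by pfn. Ballooned
 * pages are simply absent from the p2m, so Xen reports them as
 * HVMMEM_mmio_dm too -- exactly the case discussed above. */
static enum hvmmem_type fake_p2m[] = {
    HVMMEM_ram_rw, HVMMEM_ram_rw, HVMMEM_mmio_dm, /* ballooned or mmio */
    HVMMEM_ram_ro, HVMMEM_mmio_dm,
};

/* Sketch of the pfn_is_ram() decision: only device-model pages are
 * skipped; both writable and read-only RAM are dumped normally. */
static int pfn_is_ram(unsigned long pfn)
{
    if (pfn >= sizeof(fake_p2m) / sizeof(fake_p2m[0]))
        return 0;
    return fake_p2m[pfn] != HVMMEM_mmio_dm;
}
```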

>
> If yes, then page_is_zero_filled() might be a more approprate name.
>

It's not, as an mmio page is not always zero-filled. We just don't need
these pages in the vmcore.

> Also I am wondering why it was not done as part of copy_oldmem_page()
> so that respective arch could hide all the details.
>

As far as I'm concerned, that wouldn't solve the mmap issue I'm trying to
address, but we can ask Olaf why he preferred the pfn_is_ram() path.

>> However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.
>>
>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.
>
> I am not sure I understand this part. So what is the "all other
> hypervisor-specific" code which would like to do this? And will that
> code be compiled at the same time as CONFIG_XEN_PVHVM?
>

I meant to say that many hypervisors are supported on x86. If I
override the __weak remap_oldmem_pfn_range() in xen-specific code, it will
*always* get executed whenever that code is compiled in. If we have to
do a similar override in e.g. Hyper-V or KVM code in the future, we'll
have a mess (in which order do we need to execute these overrides?).

In a few words, Xen-PVHVM is not an architecture, so I'm not following
the "Architectures may override this function to map oldmem" path.
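A runtime registration scheme like the one pfn_is_ram() already uses
avoids that ordering problem entirely: whichever hypervisor guest support
code is actually running registers the one hook. A minimal user-space
sketch of the pattern (the function names are modelled on
fs/proc/vmcore.c, but the error value, locking and xen_hook() here are
simplified stand-ins):

```c
#include <stddef.h>

/* Single runtime hook instead of a link-time __weak override. */
static int (*oldmem_pfn_is_ram)(unsigned long pfn);

/* Only one hypervisor may register; the kernel returns -EBUSY in this
 * case, simplified to -1 here. */
static int register_oldmem_pfn_is_ram(int (*fn)(unsigned long pfn))
{
    if (oldmem_pfn_is_ram)
        return -1;
    oldmem_pfn_is_ram = fn;
    return 0;
}

static void unregister_oldmem_pfn_is_ram(void)
{
    oldmem_pfn_is_ram = NULL;
}

/* With no hook registered (bare metal), every page is assumed to be
 * ram, so mmap_vmcore() can keep the direct remap path. */
static int pfn_is_ram(unsigned long pfn)
{
    int (*fn)(unsigned long) = oldmem_pfn_is_ram;

    return fn ? fn(pfn) : 1;
}

/* Hypothetical hypervisor hook: treats pfn 2 as a device-model page. */
static int xen_hook(unsigned long pfn)
{
    return pfn != 2;
}
```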

>>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> fs/proc/vmcore.c | 68 +++++++++++++++++++++++++++++++++++++++++++++++++++-----
>> 1 file changed, 62 insertions(+), 6 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 382aa89..2716e19 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,46 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>> * virtually contiguous user-space in ELF layout.
>> */
>> #ifdef CONFIG_MMU
>> +static u64 remap_oldmem_pfn_checked(struct vm_area_struct *vma, u64 len,
>> + unsigned long pfn, unsigned long page_count)
>> +{
>> + unsigned long pos;
>> + size_t size;
>> + unsigned long vma_addr;
>> + unsigned long emptypage_pfn = __pa(empty_zero_page) >> PAGE_SHIFT;
>> +
>> + for (pos = pfn; (pos - pfn) <= page_count; pos++) {
>> + if (!pfn_is_ram(pos) || (pos - pfn) == page_count) {
>> + /* we hit a page which is not ram or reached the end */
>> + if (pos - pfn > 0) {
>> + /* remapping continuous region */
>> + size = (pos - pfn) << PAGE_SHIFT;
>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + pfn, size,
>> + vma->vm_page_prot))
>> + return len;
>> + len += size;
>> + page_count -= (pos - pfn);
>> + }
>> + if (page_count > 0) {
>> + /* we hit a page which is not ram, replacing
>> + with an empty one */
>> + vma_addr = vma->vm_start + len;
>> + if (remap_oldmem_pfn_range(vma, vma_addr,
>> + emptypage_pfn,
>> + PAGE_SIZE,
>> + vma->vm_page_prot))
>> + return len;
>> + len += PAGE_SIZE;
>> + pfn = pos + 1;
>> + page_count--;
>> + }
>> + }
>> + }
>> + return len;
>> +}
>> +
>> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>> {
>> size_t size = vma->vm_end - vma->vm_start;
>> @@ -383,17 +423,33 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>
>> list_for_each_entry(m, &vmcore_list, list) {
>> if (start < m->offset + m->size) {
>> - u64 paddr = 0;
>> + u64 paddr = 0, original_len;
>> + unsigned long pfn, page_count;
>>
>> tsz = min_t(size_t, m->offset + m->size - start, size);
>> paddr = m->paddr + start - m->offset;
>> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> - paddr >> PAGE_SHIFT, tsz,
>> - vma->vm_page_prot))
>> - goto fail;
>> +
>> + /* check if oldmem_pfn_is_ram was registered to avoid
>> + looping over all pages without a reason */
>> + if (oldmem_pfn_is_ram) {
>> + pfn = paddr >> PAGE_SHIFT;
>> + page_count = tsz >> PAGE_SHIFT;
>> + original_len = len;
>> + len = remap_oldmem_pfn_checked(vma, len, pfn,
>> + page_count);
>> + if (len != original_len + tsz)
>> + goto fail;
>> + } else {
>> + if (remap_oldmem_pfn_range(vma,
>> + vma->vm_start + len,
>> + paddr >> PAGE_SHIFT,
>> + tsz,
>> + vma->vm_page_prot))
>> + goto fail;
>
> Why are we defining a separate remap_oldmem_pfn_checked()? Can't we just
> modify remap_oldmem_pfn_range() to *always* check if
> pfn_is_zero_filled() and map accordingly?

I wanted to preserve the direct path without the check to make things
faster when no pfn_is_ram() handler is registered. oldmem is sometimes
huge, and issuing a call per pfn can cost us something.
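That cost model is easy to see if the walk is separated from the actual
remapping. In the user-space sketch below, plan_remap(), struct segment
and ram_map are invented for illustration; the kernel walk calls
remap_oldmem_pfn_range() directly instead of logging segments. Each
contiguous run of ram pages collapses into one remap call, while every
non-ram page costs an extra call mapping the zero page.

```c
#include <stddef.h>

/* One planned remap call: [pfn, pfn + pages), either a ram run (ram = 1)
 * or a single page redirected to the zero page (ram = 0). */
struct segment {
    unsigned long pfn;
    unsigned long pages;
    int ram;
};

/* Model of remap_oldmem_pfn_checked()'s walk over [pfn, pfn + page_count).
 * ram_map[pos] plays the role of pfn_is_ram(pos); out[] records the remap
 * calls that would be issued. Returns the number of calls. */
static size_t plan_remap(const int *ram_map, unsigned long pfn,
                         unsigned long page_count,
                         struct segment *out, size_t out_max)
{
    size_t n = 0;
    unsigned long run_start = pfn;
    unsigned long pos;

    for (pos = pfn; pos <= pfn + page_count; pos++) {
        int end = (pos == pfn + page_count);

        if (end || !ram_map[pos]) {
            if (pos > run_start) {  /* flush the pending ram run */
                if (n < out_max)
                    out[n] = (struct segment){ run_start,
                                               pos - run_start, 1 };
                n++;
            }
            if (!end) {             /* map this page to the zero page */
                if (n < out_max)
                    out[n] = (struct segment){ pos, 1, 0 };
                n++;
                run_start = pos + 1;
            }
        }
    }
    return n;
}
```

With an all-ram map the plan degenerates to a single remap call, which is
exactly why the direct path can skip the walk when no handler is
registered.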

>
> Thanks
> Vivek

--
Vitaly

2014-07-09 10:12:22

by Olaf Hering

[permalink] [raw]
Subject: Re: [Xen-devel] [PATCH] mmap_vmcore: skip non-ram pages reported by hypervisors

On Wed, Jul 09, Vitaly Kuznetsov wrote:

> > Also I am wondering why it was not done as part of copy_oldmem_page()
> > so that respective arch could hide all the details.
> Afaiac that wouldn't solve the mmap issue I'm trying to address but we
> can ask Olaf why he preferred pfn_is_ram() path.

Every copy_oldmem_page() would need to know about the pfn_is_ram()
function, so I think it's better to keep that part of the code private to
fs/proc/vmcore.c.

Perhaps pfn_is_ram could be named pfn_is_backed_by_ram, but the comments
make it clear what the function does.


Olaf