2014-07-14 09:17:19

by Vitaly Kuznetsov

[permalink] [raw]
Subject: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

We have a special check in read_vmcore() handler to check if the page was
reported as ram or not by the hypervisor (pfn_is_ram()). However, when
vmcore is read with mmap() no such check is performed. That can lead to
unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
enormous load in both DomU and Dom0.

Fix the issue by mapping each non-ram page to the zero page. Keep direct
path with remap_oldmem_pfn_range() to avoid looping through all pages on
bare metal.

The issue can also be solved by overriding remap_oldmem_pfn_range() in
xen-specific code, as remap_oldmem_pfn_range() was been designed for.
That, however, would involve non-obvious xen code path for all x86 builds
with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
code on x86 arch from doing the same override.

Changes from v4:
- change map_size type size_t -> unsigned long
- use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()

Changes from v3:
- multi line comment style changes
- minor code style changes

Changes from v2:
- make remap_oldmem_pfn_checked() interface exactly match
remap_oldmem_pfn_range()
- unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
we don't need to take care of it in mmap_vmcore()
- create vmcore_remap_oldmem_pfn() wrapper

Changes from v1:
- comment style changes
- change remap_oldmem_pfn_checked() interface to closer match the
remap_oldmem_pfn() interface
- preserve formal parameters within the loop, make the loop conditions
easier to understand
- use my_zero_pfn() for the zero page
- return remapped length instead of new offset

Reviewed-by: Andrew Jones <[email protected]>
Signed-off-by: Vitaly Kuznetsov <[email protected]>
---
fs/proc/vmcore.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
1 file changed, 80 insertions(+), 3 deletions(-)

diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
index 382aa89..1f77f35 100644
--- a/fs/proc/vmcore.c
+++ b/fs/proc/vmcore.c
@@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
* virtually contiguous user-space in ELF layout.
*/
#ifdef CONFIG_MMU
+/*
+ * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
+ * reported as not being ram with the zero page.
+ *
+ * @vma: vm_area_struct describing requested mapping
+ * @from: start remapping from
+ * @pfn: page frame number to start remapping to
+ * @size: remapping size
+ * @prot: protection bits
+ *
+ * Returns zero on success, -EAGAIN on failure.
+ */
+int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
+ unsigned long pfn, unsigned long size,
+ pgprot_t prot)
+{
+ unsigned long map_size;
+ unsigned long pos_start, pos_end, pos;
+ unsigned long zeropage_pfn = my_zero_pfn(0);
+ u64 len = 0;
+
+ pos_start = pfn;
+ pos_end = pfn + (size >> PAGE_SHIFT);
+
+ for (pos = pos_start; pos < pos_end; ++pos) {
+ if (!pfn_is_ram(pos)) {
+ /*
+ * We hit a page which is not ram. Remap the continuous
+ * region between pos_start and pos-1 and replace
+ * the non-ram page at pos with the zero page.
+ */
+ if (pos > pos_start) {
+ /* Remap continuous region */
+ map_size = (pos - pos_start) << PAGE_SHIFT;
+ if (remap_oldmem_pfn_range(vma, from + len,
+ pos_start, map_size,
+ prot))
+ goto fail;
+ len += map_size;
+ }
+ /* Remap the zero page */
+ if (remap_oldmem_pfn_range(vma, from + len,
+ zeropage_pfn,
+ PAGE_SIZE, prot))
+ goto fail;
+ len += PAGE_SIZE;
+ pos_start = pos + 1;
+ }
+ }
+ if (pos > pos_start) {
+ /* Remap the rest */
+ map_size = (pos - pos_start) << PAGE_SHIFT;
+ if (remap_oldmem_pfn_range(vma, from + len, pos_start,
+ map_size, prot))
+ goto fail;
+ len += map_size;
+ }
+ return 0;
+fail:
+ do_munmap(vma->vm_mm, from, len);
+ return -EAGAIN;
+}
+
+int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
+ unsigned long from, unsigned long pfn,
+ unsigned long size, pgprot_t prot)
+{
+ /*
+ * Check if oldmem_pfn_is_ram was registered to avoid
+ * looping over all pages without a reason.
+ */
+ if (oldmem_pfn_is_ram)
+ return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
+ else
+ return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
+}
+
static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
{
size_t size = vma->vm_end - vma->vm_start;
@@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)

tsz = min_t(size_t, m->offset + m->size - start, size);
paddr = m->paddr + start - m->offset;
- if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
- paddr >> PAGE_SHIFT, tsz,
- vma->vm_page_prot))
+ if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
+ paddr >> PAGE_SHIFT, tsz,
+ vma->vm_page_prot))
goto fail;
size -= tsz;
start += tsz;
--
1.9.3


2014-07-14 09:49:23

by Hatayama, Daisuke

[permalink] [raw]
Subject: Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors



(2014/07/14 18:16), Vitaly Kuznetsov wrote:
> We have a special check in read_vmcore() handler to check if the page was
> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
> vmcore is read with mmap() no such check is performed. That can lead to
> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
> enormous load in both DomU and Dom0.
>
> Fix the issue by mapping each non-ram page to the zero page. Keep direct
> path with remap_oldmem_pfn_range() to avoid looping through all pages on
> bare metal.
>
> The issue can also be solved by overriding remap_oldmem_pfn_range() in
> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
> That, however, would involve non-obvious xen code path for all x86 builds
> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
> code on x86 arch from doing the same override.
>
> Changes from v4:
> - change map_size type size_t -> unsigned long
> - use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()
>
> Changes from v3:
> - multi line comment style changes
> - minor code style changes
>
> Changes from v2:
> - make remap_oldmem_pfn_checked() interface exactly match
> remap_oldmem_pfn_range()
> - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
> we don't need to take care of it in mmap_vmcore()
> - create vmcore_remap_oldmem_pfn() wrapper
>
> Changes from v1:
> - comment style changes
> - change remap_oldmem_pfn_checked() interface to closer match the
> remap_oldmem_pfn() interface
> - preserve formal parameters within the loop, make the loop conditions
> easier to understand
> - use my_zero_pfn() for the zero page
> - return remapped length instead of new offset
>
> Reviewed-by: Andrew Jones <[email protected]>
> Signed-off-by: Vitaly Kuznetsov <[email protected]>
> ---
> fs/proc/vmcore.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
> 1 file changed, 80 insertions(+), 3 deletions(-)
>
> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
> index 382aa89..1f77f35 100644
> --- a/fs/proc/vmcore.c
> +++ b/fs/proc/vmcore.c
> @@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
> * virtually contiguous user-space in ELF layout.
> */
> #ifdef CONFIG_MMU
> +/*
> + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
> + * reported as not being ram with the zero page.
> + *
> + * @vma: vm_area_struct describing requested mapping
> + * @from: start remapping from
> + * @pfn: page frame number to start remapping to
> + * @size: remapping size
> + * @prot: protection bits
> + *
> + * Returns zero on success, -EAGAIN on failure.
> + */
> +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
> + unsigned long pfn, unsigned long size,
> + pgprot_t prot)
> +{
> + unsigned long map_size;
> + unsigned long pos_start, pos_end, pos;
> + unsigned long zeropage_pfn = my_zero_pfn(0);
> + u64 len = 0;

Sorry, I missed this yesterday. This should also be fixed as size_t or unsigned long.

Does 32-bit compiler warn about this at the call of do_munmap() below due to difference of bit length of the two types?

> +
> + pos_start = pfn;
> + pos_end = pfn + (size >> PAGE_SHIFT);
> +
> + for (pos = pos_start; pos < pos_end; ++pos) {
> + if (!pfn_is_ram(pos)) {
> + /*
> + * We hit a page which is not ram. Remap the continuous
> + * region between pos_start and pos-1 and replace
> + * the non-ram page at pos with the zero page.
> + */
> + if (pos > pos_start) {
> + /* Remap continuous region */
> + map_size = (pos - pos_start) << PAGE_SHIFT;
> + if (remap_oldmem_pfn_range(vma, from + len,
> + pos_start, map_size,
> + prot))
> + goto fail;
> + len += map_size;
> + }
> + /* Remap the zero page */
> + if (remap_oldmem_pfn_range(vma, from + len,
> + zeropage_pfn,
> + PAGE_SIZE, prot))
> + goto fail;
> + len += PAGE_SIZE;
> + pos_start = pos + 1;
> + }
> + }
> + if (pos > pos_start) {
> + /* Remap the rest */
> + map_size = (pos - pos_start) << PAGE_SHIFT;
> + if (remap_oldmem_pfn_range(vma, from + len, pos_start,
> + map_size, prot))
> + goto fail;
> + len += map_size;
> + }
> + return 0;
> +fail:
> + do_munmap(vma->vm_mm, from, len);
> + return -EAGAIN;
> +}
> +
> +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
> + unsigned long from, unsigned long pfn,
> + unsigned long size, pgprot_t prot)
> +{
> + /*
> + * Check if oldmem_pfn_is_ram was registered to avoid
> + * looping over all pages without a reason.
> + */
> + if (oldmem_pfn_is_ram)
> + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
> + else
> + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
> +}
> +
> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
> {
> size_t size = vma->vm_end - vma->vm_start;
> @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>
> tsz = min_t(size_t, m->offset + m->size - start, size);
> paddr = m->paddr + start - m->offset;
> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
> - paddr >> PAGE_SHIFT, tsz,
> - vma->vm_page_prot))
> + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
> + paddr >> PAGE_SHIFT, tsz,
> + vma->vm_page_prot))
> goto fail;
> size -= tsz;
> start += tsz;
>

--
Thanks.
HATAYAMA, Daisuke

2014-07-14 12:48:08

by Vitaly Kuznetsov

[permalink] [raw]
Subject: Re: [PATCH v5] mmap_vmcore: skip non-ram pages reported by hypervisors

"HATAYAMA, Daisuke" <[email protected]> writes:

> (2014/07/14 18:16), Vitaly Kuznetsov wrote:
>> We have a special check in read_vmcore() handler to check if the page was
>> reported as ram or not by the hypervisor (pfn_is_ram()). However, when
>> vmcore is read with mmap() no such check is performed. That can lead to
>> unpredictable results, e.g. when running Xen PVHVM guest memcpy() after
>> mmap() on /proc/vmcore will hang processing HVMMEM_mmio_dm pages creating
>> enormous load in both DomU and Dom0.
>>
>> Fix the issue by mapping each non-ram page to the zero page. Keep direct
>> path with remap_oldmem_pfn_range() to avoid looping through all pages on
>> bare metal.
>>
>> The issue can also be solved by overriding remap_oldmem_pfn_range() in
>> xen-specific code, as remap_oldmem_pfn_range() was been designed for.
>> That, however, would involve non-obvious xen code path for all x86 builds
>> with CONFIG_XEN_PVHVM=y and would prevent all other hypervisor-specific
>> code on x86 arch from doing the same override.
>>
>> Changes from v4:
>> - change map_size type size_t -> unsigned long
>> - use prot instead of vma->vm_page_prot inside remap_oldmem_pfn_checked()
>>
>> Changes from v3:
>> - multi line comment style changes
>> - minor code style changes
>>
>> Changes from v2:
>> - make remap_oldmem_pfn_checked() interface exactly match
>> remap_oldmem_pfn_range()
>> - unmap mapped part inside remap_oldmem_pfn_checked() in case of failure so
>> we don't need to take care of it in mmap_vmcore()
>> - create vmcore_remap_oldmem_pfn() wrapper
>>
>> Changes from v1:
>> - comment style changes
>> - change remap_oldmem_pfn_checked() interface to closer match the
>> remap_oldmem_pfn() interface
>> - preserve formal parameters within the loop, make the loop conditions
>> easier to understand
>> - use my_zero_pfn() for the zero page
>> - return remapped length instead of new offset
>>
>> Reviewed-by: Andrew Jones <[email protected]>
>> Signed-off-by: Vitaly Kuznetsov <[email protected]>
>> ---
>> fs/proc/vmcore.c | 83 ++++++++++++++++++++++++++++++++++++++++++++++++++++++--
>> 1 file changed, 80 insertions(+), 3 deletions(-)
>>
>> diff --git a/fs/proc/vmcore.c b/fs/proc/vmcore.c
>> index 382aa89..1f77f35 100644
>> --- a/fs/proc/vmcore.c
>> +++ b/fs/proc/vmcore.c
>> @@ -328,6 +328,83 @@ static inline char *alloc_elfnotes_buf(size_t notes_sz)
>> * virtually contiguous user-space in ELF layout.
>> */
>> #ifdef CONFIG_MMU
>> +/*
>> + * remap_oldmem_pfn_checked - do remap_oldmem_pfn_range replacing all pages
>> + * reported as not being ram with the zero page.
>> + *
>> + * @vma: vm_area_struct describing requested mapping
>> + * @from: start remapping from
>> + * @pfn: page frame number to start remapping to
>> + * @size: remapping size
>> + * @prot: protection bits
>> + *
>> + * Returns zero on success, -EAGAIN on failure.
>> + */
>> +int remap_oldmem_pfn_checked(struct vm_area_struct *vma, unsigned long from,
>> + unsigned long pfn, unsigned long size,
>> + pgprot_t prot)
>> +{
>> + unsigned long map_size;
>> + unsigned long pos_start, pos_end, pos;
>> + unsigned long zeropage_pfn = my_zero_pfn(0);
>> + u64 len = 0;
>
> Sorry, I missed this yesterday.

Thanks for your review!

> This should also be fixed as size_t or unsigned long.
> Does 32-bit compiler warn about this at the call of do_munmap() below
> due to difference of bit length of the two types?

Mine doesn't. But you're right, it makes sense to make it match
do_munmap()'s interface and len there is size_t. I'll send v6 with this change.

>
>> +
>> + pos_start = pfn;
>> + pos_end = pfn + (size >> PAGE_SHIFT);
>> +
>> + for (pos = pos_start; pos < pos_end; ++pos) {
>> + if (!pfn_is_ram(pos)) {
>> + /*
>> + * We hit a page which is not ram. Remap the continuous
>> + * region between pos_start and pos-1 and replace
>> + * the non-ram page at pos with the zero page.
>> + */
>> + if (pos > pos_start) {
>> + /* Remap continuous region */
>> + map_size = (pos - pos_start) << PAGE_SHIFT;
>> + if (remap_oldmem_pfn_range(vma, from + len,
>> + pos_start, map_size,
>> + prot))
>> + goto fail;
>> + len += map_size;
>> + }
>> + /* Remap the zero page */
>> + if (remap_oldmem_pfn_range(vma, from + len,
>> + zeropage_pfn,
>> + PAGE_SIZE, prot))
>> + goto fail;
>> + len += PAGE_SIZE;
>> + pos_start = pos + 1;
>> + }
>> + }
>> + if (pos > pos_start) {
>> + /* Remap the rest */
>> + map_size = (pos - pos_start) << PAGE_SHIFT;
>> + if (remap_oldmem_pfn_range(vma, from + len, pos_start,
>> + map_size, prot))
>> + goto fail;
>> + len += map_size;
>> + }
>> + return 0;
>> +fail:
>> + do_munmap(vma->vm_mm, from, len);
>> + return -EAGAIN;
>> +}
>> +
>> +int vmcore_remap_oldmem_pfn(struct vm_area_struct *vma,
>> + unsigned long from, unsigned long pfn,
>> + unsigned long size, pgprot_t prot)
>> +{
>> + /*
>> + * Check if oldmem_pfn_is_ram was registered to avoid
>> + * looping over all pages without a reason.
>> + */
>> + if (oldmem_pfn_is_ram)
>> + return remap_oldmem_pfn_checked(vma, from, pfn, size, prot);
>> + else
>> + return remap_oldmem_pfn_range(vma, from, pfn, size, prot);
>> +}
>> +
>> static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>> {
>> size_t size = vma->vm_end - vma->vm_start;
>> @@ -387,9 +464,9 @@ static int mmap_vmcore(struct file *file, struct vm_area_struct *vma)
>>
>> tsz = min_t(size_t, m->offset + m->size - start, size);
>> paddr = m->paddr + start - m->offset;
>> - if (remap_oldmem_pfn_range(vma, vma->vm_start + len,
>> - paddr >> PAGE_SHIFT, tsz,
>> - vma->vm_page_prot))
>> + if (vmcore_remap_oldmem_pfn(vma, vma->vm_start + len,
>> + paddr >> PAGE_SHIFT, tsz,
>> + vma->vm_page_prot))
>> goto fail;
>> size -= tsz;
>> start += tsz;
>>

--
Vitaly