2013-05-15 12:46:40

by Cliff Wickman

Subject: [PATCH v3] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas


/proc/<pid>/smaps and similar walks through a user page table should not
be looking at VM_PFNMAP areas.

v2:
- moves the VM_BUG_ON out of the loop
- adds the needed test for vma->vm_start <= addr

v3 adds comments to make this clearer, as N. Horiguchi recommends:
> I recommend that you check VM_PFNMAP in the possible callers' side.
> But this patch seems to solve your problem, so with properly commenting
> this somewhere, I do not oppose it.

Certain tests in walk_page_range() (specifically split_huge_page_pmd())
assume that all the mapped PFN's are backed with page structures. And this is
not usually true for VM_PFNMAP areas. This can result in panics on kernel
page faults when attempting to address those page structures.

There are a half dozen callers of walk_page_range() that walk through
a task's entire page table (as N. Horiguchi pointed out). So rather than
change all of them, this patch changes just walk_page_range() to ignore
VM_PFNMAP areas.

The logic of hugetlb_vma() is moved back into walk_page_range(), as we
want to test any vma in the range.

VM_PFNMAP areas are used by:
- graphics memory manager gpu/drm/drm_gem.c
- global reference unit sgi-gru/grufile.c
- sgi special memory char/mspec.c
- and probably several out-of-tree modules

I'm copying everyone who has changed this file recently, in case
there is some reason that I am not aware of to provide
/proc/<pid>/smaps|clear_refs|maps|numa_maps for these VM_PFNMAP areas.

Signed-off-by: Cliff Wickman <[email protected]>
---
mm/pagewalk.c | 65 ++++++++++++++++++++++++++++++++--------------------------
1 file changed, 36 insertions(+), 29 deletions(-)

Index: linux/mm/pagewalk.c
===================================================================
--- linux.orig/mm/pagewalk.c
+++ linux/mm/pagewalk.c
@@ -127,22 +127,6 @@ static int walk_hugetlb_range(struct vm_
return 0;
}

-static struct vm_area_struct* hugetlb_vma(unsigned long addr, struct mm_walk *walk)
-{
- struct vm_area_struct *vma;
-
- /* We don't need vma lookup at all. */
- if (!walk->hugetlb_entry)
- return NULL;
-
- VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
- vma = find_vma(walk->mm, addr);
- if (vma && vma->vm_start <= addr && is_vm_hugetlb_page(vma))
- return vma;
-
- return NULL;
-}
-
#else /* CONFIG_HUGETLB_PAGE */
static struct vm_area_struct* hugetlb_vma(unsigned long addr, struct mm_walk *walk)
{
@@ -198,30 +182,53 @@ int walk_page_range(unsigned long addr,
if (!walk->mm)
return -EINVAL;

+ VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
+
pgd = pgd_offset(walk->mm, addr);
do {
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = NULL;

next = pgd_addr_end(addr, end);

/*
- * handle hugetlb vma individually because pagetable walk for
- * the hugetlb page is dependent on the architecture and
- * we can't handled it in the same manner as non-huge pages.
+ * This function was not intended to be vma based.
+ * But there are vma special cases to be handled:
+ * - hugetlb vma's
+ * - VM_PFNMAP vma's
*/
- vma = hugetlb_vma(addr, walk);
+ vma = find_vma(walk->mm, addr);
if (vma) {
- if (vma->vm_end < next)
+ /*
+ * There are no page structures backing a VM_PFNMAP
+ * range, so do not allow split_huge_page_pmd().
+ */
+ if ((vma->vm_start <= addr) &&
+ (vma->vm_flags & VM_PFNMAP)) {
next = vma->vm_end;
+ pgd = pgd_offset(walk->mm, next);
+ continue;
+ }
/*
- * Hugepage is very tightly coupled with vma, so
- * walk through hugetlb entries within a given vma.
+ * Handle hugetlb vma individually because pagetable
+ * walk for the hugetlb page is dependent on the
+ * architecture and we can't handled it in the same
+ * manner as non-huge pages.
*/
- err = walk_hugetlb_range(vma, addr, next, walk);
- if (err)
- break;
- pgd = pgd_offset(walk->mm, next);
- continue;
+ if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
+ is_vm_hugetlb_page(vma)) {
+ if (vma->vm_end < next)
+ next = vma->vm_end;
+ /*
+ * Hugepage is very tightly coupled with vma,
+ * so walk through hugetlb entries within a
+ * given vma.
+ */
+ err = walk_hugetlb_range(vma, addr, next, walk);
+ if (err)
+ break;
+ pgd = pgd_offset(walk->mm, next);
+ continue;
+ }
}

if (pgd_none_or_clear_bad(pgd)) {
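
The skip logic in the hunk above can be modelled in user space. What follows is a minimal sketch under stated assumptions, not kernel code: struct vma, model_find_vma(), model_addr_end(), model_walk(), model_demo(), the STEP size and the VM_PFNMAP value are all hypothetical stand-ins chosen for illustration. Unlike the patch, the model also clamps each step to the nearest vma boundary, so that a VM_PFNMAP vma beginning in the middle of a step is never entered; that clamping is an assumption of the sketch and is not something the posted diff does.

```c
#include <assert.h>
#include <stddef.h>

#define VM_PFNMAP 0x1UL      /* illustrative flag value, not the kernel's */
#define STEP      0x4000UL   /* hypothetical stand-in for one pgd entry's span */

struct vma {
	unsigned long vm_start;  /* inclusive */
	unsigned long vm_end;    /* exclusive */
	unsigned long vm_flags;
};

/* Same contract as the kernel's find_vma(): first vma with addr < vm_end. */
static const struct vma *model_find_vma(const struct vma *v, size_t n,
					unsigned long addr)
{
	for (size_t i = 0; i < n; i++)
		if (addr < v[i].vm_end)
			return &v[i];
	return NULL;
}

/* End of the current STEP-aligned chunk, capped at end. */
static unsigned long model_addr_end(unsigned long addr, unsigned long end)
{
	unsigned long boundary = (addr + STEP) & ~(STEP - 1);

	return boundary < end ? boundary : end;
}

/* Returns how many bytes of [addr, end) the modelled walk visits. */
static unsigned long model_walk(const struct vma *v, size_t n,
				unsigned long addr, unsigned long end)
{
	unsigned long visited = 0;

	while (addr < end) {
		unsigned long next = model_addr_end(addr, end);
		const struct vma *vma = model_find_vma(v, n, addr);

		if (vma) {
			unsigned long limit;

			/* addr lies inside a VM_PFNMAP vma: jump past it. */
			if (vma->vm_start <= addr &&
			    (vma->vm_flags & VM_PFNMAP)) {
				addr = vma->vm_end;
				continue;
			}
			/* Sketch-only clamp: stop at the next vma boundary. */
			limit = vma->vm_start <= addr ? vma->vm_end
						      : vma->vm_start;
			if (limit < next)
				next = limit;
		}
		assert(next > addr);  /* every visited step makes progress */
		visited += next - addr;
		addr = next;
	}
	return visited;
}

/* Sample layout: normal, VM_PFNMAP, a hole, then normal again. */
static unsigned long model_demo(void)
{
	static const struct vma layout[] = {
		{ 0x1000UL, 0x3000UL, 0 },
		{ 0x3000UL, 0x9000UL, VM_PFNMAP },
		{ 0xa000UL, 0xc000UL, 0 },
	};

	return model_walk(layout, sizeof(layout) / sizeof(layout[0]),
			  0x1000UL, 0xc000UL);
}
```

Called from a main(), model_demo() returns 0x5000: the walk covers the 0xb000 bytes of [0x1000, 0xc000) minus the 0x6000 bytes of the VM_PFNMAP vma. The hole at [0x9000, 0xa000) is still counted, on the assumption that a page table walker reports unmapped ranges rather than skipping them.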


2013-05-15 13:43:33

by Naoya Horiguchi

Subject: Re: [PATCH v3] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas

On Wed, May 15, 2013 at 07:46:36AM -0500, Cliff Wickman wrote:
>
> /proc/<pid>/smaps and similar walks through a user page table should not
> be looking at VM_PFNMAP areas.
>
> v2:
> - moves the VM_BUG_ON out of the loop
> - adds the needed test for vma->vm_start <= addr
>
> v3 adds comments to make this clearer, as N. Horiguchi recommends:
> > I recommend that you check VM_PFNMAP in the possible callers' side.
> > But this patch seems to solve your problem, so with properly commenting
> > this somewhere, I do not oppose it.
>
> Certain tests in walk_page_range() (specifically split_huge_page_pmd())
> assume that all the mapped PFN's are backed with page structures. And this is
> not usually true for VM_PFNMAP areas. This can result in panics on kernel
> page faults when attempting to address those page structures.
>
> There are a half dozen callers of walk_page_range() that walk through
> a task's entire page table (as N. Horiguchi pointed out). So rather than
> change all of them, this patch changes just walk_page_range() to ignore
> VM_PFNMAP areas.
>
> The logic of hugetlb_vma() is moved back into walk_page_range(), as we
> want to test any vma in the range.
>
> VM_PFNMAP areas are used by:
> - graphics memory manager gpu/drm/drm_gem.c
> - global reference unit sgi-gru/grufile.c
> - sgi special memory char/mspec.c
> - and probably several out-of-tree modules
>
> I'm copying everyone who has changed this file recently, in case
> there is some reason that I am not aware of to provide
> /proc/<pid>/smaps|clear_refs|maps|numa_maps for these VM_PFNMAP areas.
>
> Signed-off-by: Cliff Wickman <[email protected]>

Thank you.

Reviewed-by: Naoya Horiguchi <[email protected]>

2013-05-23 23:40:21

by Andrew Morton

Subject: Re: [PATCH v3] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas

On Wed, 15 May 2013 07:46:36 -0500 Cliff Wickman <[email protected]> wrote:

> Certain tests in walk_page_range() (specifically split_huge_page_pmd())
> assume that all the mapped PFN's are backed with page structures. And this is
> not usually true for VM_PFNMAP areas. This can result in panics on kernel
> page faults when attempting to address those page structures.
>
> There are a half dozen callers of walk_page_range() that walk through
> a task's entire page table (as N. Horiguchi pointed out). So rather than
> change all of them, this patch changes just walk_page_range() to ignore
> VM_PFNMAP areas.
>
> The logic of hugetlb_vma() is moved back into walk_page_range(), as we
> want to test any vma in the range.
>
> VM_PFNMAP areas are used by:
> - graphics memory manager gpu/drm/drm_gem.c
> - global reference unit sgi-gru/grufile.c
> - sgi special memory char/mspec.c
> - and probably several out-of-tree modules

What are your thoughts on the urgency/scheduling of this fix?


(Just to be irritating: "When writing a changelog, please describe the
end-user-visible effects of the bug, so that others can more easily
decide which kernel version(s) should be fixed, and so that downstream
kernel maintainers can more easily work out whether this patch will fix
a problem which they or their customers are observing.")

2013-05-24 21:38:48

by Cliff Wickman

Subject: Re: [PATCH v3] mm/pagewalk.c: walk_page_range should avoid VM_PFNMAP areas




> On Thur, 23 May 2013 Andrew Morton <[email protected]> wrote:

> > On Wed, 15 May 2013 07:46:36 -0500 Cliff Wickman <[email protected]> wrote:
> > Certain tests in walk_page_range() (specifically split_huge_page_pmd())
> > assume that all the mapped PFN's are backed with page structures. And this is
> > not usually true for VM_PFNMAP areas. This can result in panics on kernel
> > page faults when attempting to address those page structures.
> >
> > There are a half dozen callers of walk_page_range() that walk through
> > a task's entire page table (as N. Horiguchi pointed out). So rather than
> > change all of them, this patch changes just walk_page_range() to ignore
> > VM_PFNMAP areas.
> >
> > The logic of hugetlb_vma() is moved back into walk_page_range(), as we
> > want to test any vma in the range.
> >
> > VM_PFNMAP areas are used by:
> > - graphics memory manager gpu/drm/drm_gem.c
> > - global reference unit sgi-gru/grufile.c
> > - sgi special memory char/mspec.c
> > - and probably several out-of-tree modules
>
> What are your thoughts on the urgency/scheduling of this fix?

The panic can be caused by simply cat'ing /proc/<pid>/smaps while an
application has a VM_PFNMAP range. It happened in-house when a benchmarker
was trying to decipher the memory layout of his program.
So that makes it rather urgent from our point of view.
We would like to see the fix included in upcoming distro releases, and having
it upstream makes that much easier to accomplish.

> (Just to be irritating: "When writing a changelog, please describe the
> end-user-visible effects of the bug, so that others can more easily
> decide which kernel version(s) should be fixed, and so that downstream
> kernel maintainers can more easily work out whether this patch will fix
> a problem which they or their customers are observing.")