2014-02-10 21:47:23

by Naoya Horiguchi

Subject: [PATCH 00/11 v5] update page table walker

Hi,

This is version 5 of the page table walker patchset, rebased onto v3.14-rc2.

- v1: http://article.gmane.org/gmane.linux.kernel.mm/108362
- v2: http://article.gmane.org/gmane.linux.kernel.mm/108827
- v3: http://article.gmane.org/gmane.linux.kernel.mm/110561
- v4: http://article.gmane.org/gmane.linux.kernel.mm/111832

Thanks,
Naoya Horiguchi
---
Test code:
git://github.com/Naoya-Horiguchi/test_rewrite_page_table_walker.git
---
Summary:

Naoya Horiguchi (11):
pagewalk: update page table walker core
pagewalk: add walk_page_vma()
smaps: redefine callback functions for page table walker
clear_refs: redefine callback functions for page table walker
pagemap: redefine callback functions for page table walker
numa_maps: redefine callback functions for page table walker
memcg: redefine callback functions for page table walker
madvise: redefine callback functions for page table walker
arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()
pagewalk: remove argument hmask from hugetlb_entry()
mempolicy: apply page table walker on queue_pages_range()

arch/powerpc/mm/subpage-prot.c | 6 +-
fs/proc/task_mmu.c | 267 ++++++++++++-----------------
include/linux/mm.h | 24 ++-
mm/madvise.c | 43 ++---
mm/memcontrol.c | 71 +++-----
mm/mempolicy.c | 255 +++++++++++-----------------
mm/pagewalk.c | 372 ++++++++++++++++++++++++++---------------
7 files changed, 506 insertions(+), 532 deletions(-)


2014-02-10 21:45:39

by Naoya Horiguchi

Subject: [PATCH 08/11] madvise: redefine callback functions for page table walker

swapin_walk_pmd_entry() is defined as a pmd_entry() callback, but it contains
no pmd handling code (except pmd_none_or_trans_huge_or_clear_bad(), and the
same check is now done in the core page table walk code).
So let's move this function to pte_entry() as swapin_walk_pte_entry().

Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/madvise.c | 43 +++++++++++++------------------------------
1 file changed, 13 insertions(+), 30 deletions(-)
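[Editor's sketch: the shape of this refactoring can be modeled in plain
userspace C. All names here (walk_old, new_pte_cb, PAGES_PER_PMD, etc.) are
made up for illustration and are not kernel APIs; the point is only that the
per-page loop moves out of the caller and into the core walker.]

```c
#include <assert.h>

/* Toy contrast: with only a pmd-level callback, every user open-codes
 * the per-page loop itself; with a pte-level callback, the core walker
 * owns the loop and the user supplies just the per-page work. */

#define PAGES_PER_PMD 4

static int pages_seen;

/* Old shape: the pmd callback iterates over its own pages. */
static int old_pmd_cb(int pmd)
{
	for (int i = 0; i < PAGES_PER_PMD; i++)
		pages_seen++;		/* per-page work, inlined here */
	return 0;
}

/* New shape: per-page work only; the walker drives the loop. */
static int new_pte_cb(int pte)
{
	(void)pte;
	pages_seen++;
	return 0;
}

static int walk_old(int npmd)
{
	for (int pmd = 0; pmd < npmd; pmd++)
		if (old_pmd_cb(pmd))
			return -1;
	return 0;
}

static int walk_new(int npmd)
{
	for (int pmd = 0; pmd < npmd; pmd++)
		for (int pte = 0; pte < PAGES_PER_PMD; pte++)
			if (new_pte_cb(pte))
				return -1;
	return 0;
}
```

Both shapes visit the same pages; the second one just factors the loop into
shared code, which is what the diff below does for swapin_walk_pte_entry().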

diff --git v3.14-rc2.orig/mm/madvise.c v3.14-rc2/mm/madvise.c
index 539eeb96b323..5e957b984c14 100644
--- v3.14-rc2.orig/mm/madvise.c
+++ v3.14-rc2/mm/madvise.c
@@ -135,38 +135,22 @@ static long madvise_behavior(struct vm_area_struct *vma,
}

#ifdef CONFIG_SWAP
-static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
+static int swapin_walk_pte_entry(pte_t *pte, unsigned long start,
unsigned long end, struct mm_walk *walk)
{
- pte_t *orig_pte;
- struct vm_area_struct *vma = walk->private;
- unsigned long index;
+ swp_entry_t entry;
+ struct page *page;
+ struct vm_area_struct *vma = walk->vma;

- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
+ if (pte_present(*pte) || pte_none(*pte) || pte_file(*pte))
return 0;
-
- for (index = start; index != end; index += PAGE_SIZE) {
- pte_t pte;
- swp_entry_t entry;
- struct page *page;
- spinlock_t *ptl;
-
- orig_pte = pte_offset_map_lock(vma->vm_mm, pmd, start, &ptl);
- pte = *(orig_pte + ((index - start) / PAGE_SIZE));
- pte_unmap_unlock(orig_pte, ptl);
-
- if (pte_present(pte) || pte_none(pte) || pte_file(pte))
- continue;
- entry = pte_to_swp_entry(pte);
- if (unlikely(non_swap_entry(entry)))
- continue;
-
- page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
- vma, index);
- if (page)
- page_cache_release(page);
- }
-
+ entry = pte_to_swp_entry(*pte);
+ if (unlikely(non_swap_entry(entry)))
+ return 0;
+ page = read_swap_cache_async(entry, GFP_HIGHUSER_MOVABLE,
+ vma, start);
+ if (page)
+ page_cache_release(page);
return 0;
}

@@ -175,8 +159,7 @@ static void force_swapin_readahead(struct vm_area_struct *vma,
{
struct mm_walk walk = {
.mm = vma->vm_mm,
- .pmd_entry = swapin_walk_pmd_entry,
- .private = vma,
+ .pte_entry = swapin_walk_pte_entry,
};

walk_page_range(start, end, &walk);
--
1.8.5.3

2014-02-10 21:45:49

by Naoya Horiguchi

Subject: [PATCH 01/11] pagewalk: update page table walker core

This patch updates mm/pagewalk.c to make the code less complex and more
maintainable. The basic idea is unchanged and there is no userspace-visible
effect.

Most of the existing callback functions need access to the vma to handle each
entry. So we had better add a new member vma to struct mm_walk instead of
passing it via mm_walk->private, which makes the code simpler.

One problem with the current page table walker is that it checks the vma
inside the pgd loop. Historically this was introduced to support hugetlbfs in
a rather awkward manner, and it is better and cleaner to do the vma check
outside the pgd loop.

Another problem is that many users of the page table walker now use only
pmd_entry(), even though that callback does both the pmd walk and the pte
walk. This leads to code duplication and variation among callers, which
worsens maintainability.

One difficulty in sharing code is that each caller wants to decide, in its
own way, whether to walk over a specific vma or not.
To solve this, this patch introduces the test_walk() callback.

When we use multiple callbacks at different levels, skip control is also
important. For example, with thp enabled in a normal configuration, we may
want to do some work on a thp as a whole; at other times we want to split it
and handle it as normal pages, or handle it at both the pmd level and the
pte level.
What we need is to decide, after pmd_entry() has run, whether to go down to
pte-level handling based on its result. So this patch introduces a skip
control flag in mm_walk.
We can't reuse the return value for this purpose, because the whole range of
return values already has a defined meaning (>0 terminates the page table
walk in a caller-specific manner, 0 continues the walk, and <0 aborts the
walk in the general manner.)
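[Editor's sketch: the resulting control contract can be modeled in plain
userspace C. Everything below (toy_walk, toy_skip_lower, huge_pmd, etc.) is
a made-up stand-in for mm_walk, skip_lower_level_walking(), and the
pmd/pte callbacks — a toy model of the flow, not kernel code.]

```c
#include <assert.h>

/* Toy model of the mm_walk control contract: the return value either
 * continues (0) or stops (non-zero) the whole walk, while the separate
 * skip flag suppresses only the descent into the lower level for the
 * current entry, and is reset after each test. */
struct toy_walk {
	int skip;	/* like mm_walk.skip */
	int (*pmd_entry)(int pmd, struct toy_walk *w);
	int (*pte_entry)(int pte, struct toy_walk *w);
	int ptes_visited;
};

static int toy_skip_lower(struct toy_walk *w)
{
	if (w->skip) {		/* reset after testing, like   */
		w->skip = 0;	/* skip_lower_level_walking()  */
		return 1;
	}
	return 0;
}

/* Walk n fake pmds, each covering 4 fake ptes. */
static int toy_walk_range(int n, struct toy_walk *w)
{
	for (int pmd = 0; pmd < n; pmd++) {
		if (w->pmd_entry) {
			int err = w->pmd_entry(pmd, w);
			if (toy_skip_lower(w))
				continue;	/* handled at pmd level */
			if (err)
				return err;
		}
		if (w->pte_entry)
			for (int pte = 0; pte < 4; pte++) {
				int err = w->pte_entry(pte, w);
				if (err)
					return err;
			}
	}
	return 0;
}

/* Treat even pmds as "huge": handle at pmd level, skip their ptes. */
static int huge_pmd(int pmd, struct toy_walk *w)
{
	if (pmd % 2 == 0)
		w->skip = 1;
	return 0;
}

static int count_pte(int pte, struct toy_walk *w)
{
	(void)pte;
	w->ptes_visited++;
	return 0;
}
```

Walking 4 fake pmds with these callbacks descends into ptes only under the
two odd pmds, so 8 ptes are visited — the skip flag decides descent per
entry, independently of the return value.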

ChangeLog v5:
- fix build error ("mm/pagewalk.c:201: error: 'hmask' undeclared")

ChangeLog v4:
- add more comment
- remove verbose variable in walk_page_test()
- rename skip_check to skip_lower_level_walking
- rebased onto mmotm-2014-01-09-16-23

ChangeLog v3:
- rebased onto v3.13-rc3-mmots-2013-12-10-16-38

ChangeLog v2:
- rebase onto mmots
- add pte_none() check in walk_pte_range()
- add cond_resched() in walk_hugetlb_range()
- add skip_check()
- do VM_PFNMAP check only when ->test_walk() is not defined (because some
callers can handle VM_PFNMAP vmas; copy_page_range() is an example.)
- use do-while condition (addr < end) instead of (addr != end)

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/mm.h | 18 ++-
mm/pagewalk.c | 352 +++++++++++++++++++++++++++++++++--------------------
2 files changed, 235 insertions(+), 135 deletions(-)

diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index f28f46eade6a..4d0bc01de43c 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1067,10 +1067,18 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
* @pte_entry: if set, called for each non-empty PTE (4th-level) entry
* @pte_hole: if set, called for each hole at all levels
* @hugetlb_entry: if set, called for each hugetlb entry
- * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry
- * is used.
+ * @test_walk: caller specific callback function to determine whether
+ * we walk over the current vma or not. A positive returned
+ * value means "do page table walk over the current vma,"
+ * and a negative one means "abort current page table walk
+ * right now." 0 means "skip the current vma."
+ * @mm: mm_struct representing the target process of page table walk
+ * @vma: vma currently walked
+ * @skip: internal control flag which is set when we skip the lower
+ * level entries.
+ * @private: private data for callbacks' use
*
- * (see walk_page_range for more details)
+ * (see the comment on walk_page_range() for more details)
*/
struct mm_walk {
int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
@@ -1086,7 +1094,11 @@ struct mm_walk {
int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long next,
struct mm_walk *walk);
+ int (*test_walk)(unsigned long addr, unsigned long next,
+ struct mm_walk *walk);
struct mm_struct *mm;
+ struct vm_area_struct *vma;
+ int skip;
void *private;
};

diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 2beeabf502c5..4770558feea8 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -3,29 +3,58 @@
#include <linux/sched.h>
#include <linux/hugetlb.h>

-static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+/*
+ * Check the current skip status of page table walker.
+ *
+ * Here what I mean by skip is to skip lower level walking, and that was
+ * determined for each entry independently. For example, when walk_pmd_range
+ * handles a pmd_trans_huge we don't have to walk over ptes under that pmd,
+ * and the skipping does not affect the walking over ptes under other pmds.
+ * That's why we reset @walk->skip after tested.
+ */
+static bool skip_lower_level_walking(struct mm_walk *walk)
+{
+ if (walk->skip) {
+ walk->skip = 0;
+ return true;
+ }
+ return false;
+}
+
+static int walk_pte_range(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
+ struct mm_struct *mm = walk->mm;
pte_t *pte;
+ pte_t *orig_pte;
+ spinlock_t *ptl;
int err = 0;

- pte = pte_offset_map(pmd, addr);
- for (;;) {
+ orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
+ do {
+ if (pte_none(*pte)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, addr + PAGE_SIZE,
+ walk);
+ if (err)
+ break;
+ continue;
+ }
+ /*
+ * Callers should have their own way to handle swap entries
+ * in walk->pte_entry().
+ */
err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
if (err)
break;
- addr += PAGE_SIZE;
- if (addr == end)
- break;
- pte++;
- }
-
- pte_unmap(pte);
- return err;
+ } while (pte++, addr += PAGE_SIZE, addr < end);
+ pte_unmap_unlock(orig_pte, ptl);
+ cond_resched();
+ return addr == end ? 0 : err;
}

-static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int walk_pmd_range(pud_t *pud, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
pmd_t *pmd;
unsigned long next;
@@ -35,6 +64,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
do {
again:
next = pmd_addr_end(addr, end);
+
if (pmd_none(*pmd)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
@@ -42,35 +72,32 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
break;
continue;
}
- /*
- * This implies that each ->pmd_entry() handler
- * needs to know about pmd_trans_huge() pmds
- */
- if (walk->pmd_entry)
- err = walk->pmd_entry(pmd, addr, next, walk);
- if (err)
- break;

- /*
- * Check this here so we only break down trans_huge
- * pages when we _need_ to
- */
- if (!walk->pte_entry)
- continue;
+ if (walk->pmd_entry) {
+ err = walk->pmd_entry(pmd, addr, next, walk);
+ if (skip_lower_level_walking(walk))
+ continue;
+ if (err)
+ break;
+ }

- split_huge_page_pmd_mm(walk->mm, addr, pmd);
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- goto again;
- err = walk_pte_range(pmd, addr, next, walk);
- if (err)
- break;
- } while (pmd++, addr = next, addr != end);
+ if (walk->pte_entry) {
+ if (walk->vma) {
+ split_huge_page_pmd(walk->vma, addr, pmd);
+ if (pmd_trans_unstable(pmd))
+ goto again;
+ }
+ err = walk_pte_range(pmd, addr, next, walk);
+ if (err)
+ break;
+ }
+ } while (pmd++, addr = next, addr < end);

return err;
}

-static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int walk_pud_range(pgd_t *pgd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
pud_t *pud;
unsigned long next;
@@ -79,6 +106,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
pud = pud_offset(pgd, addr);
do {
next = pud_addr_end(addr, end);
+
if (pud_none_or_clear_bad(pud)) {
if (walk->pte_hole)
err = walk->pte_hole(addr, next, walk);
@@ -86,13 +114,58 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
break;
continue;
}
- if (walk->pud_entry)
+
+ if (walk->pud_entry) {
err = walk->pud_entry(pud, addr, next, walk);
- if (!err && (walk->pmd_entry || walk->pte_entry))
+ if (skip_lower_level_walking(walk))
+ continue;
+ if (err)
+ break;
+ }
+
+ if (walk->pmd_entry || walk->pte_entry) {
err = walk_pmd_range(pud, addr, next, walk);
- if (err)
- break;
- } while (pud++, addr = next, addr != end);
+ if (err)
+ break;
+ }
+ } while (pud++, addr = next, addr < end);
+
+ return err;
+}
+
+static int walk_pgd_range(unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ pgd_t *pgd;
+ unsigned long next;
+ int err = 0;
+
+ pgd = pgd_offset(walk->mm, addr);
+ do {
+ next = pgd_addr_end(addr, end);
+
+ if (pgd_none_or_clear_bad(pgd)) {
+ if (walk->pte_hole)
+ err = walk->pte_hole(addr, next, walk);
+ if (err)
+ break;
+ continue;
+ }
+
+ if (walk->pgd_entry) {
+ err = walk->pgd_entry(pgd, addr, next, walk);
+ if (skip_lower_level_walking(walk))
+ continue;
+ if (err)
+ break;
+ }
+
+ if (walk->pud_entry || walk->pmd_entry || walk->pte_entry) {
+ err = walk_pud_range(pgd, addr, next, walk);
+ if (err)
+ break;
+ }
+ } while (pgd++, addr = next, addr < end);

return err;
}
@@ -105,144 +178,159 @@ static unsigned long hugetlb_entry_end(struct hstate *h, unsigned long addr,
return boundary < end ? boundary : end;
}

-static int walk_hugetlb_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int walk_hugetlb_range(unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
+ struct mm_struct *mm = walk->mm;
+ struct vm_area_struct *vma = walk->vma;
struct hstate *h = hstate_vma(vma);
unsigned long next;
unsigned long hmask = huge_page_mask(h);
pte_t *pte;
int err = 0;
+ spinlock_t *ptl;

do {
next = hugetlb_entry_end(h, addr, end);
pte = huge_pte_offset(walk->mm, addr & hmask);
+ ptl = huge_pte_lock(h, mm, pte);
+ /*
+ * Callers should have their own way to handle swap entries
+ * in walk->hugetlb_entry().
+ */
if (pte && walk->hugetlb_entry)
err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ spin_unlock(ptl);
if (err)
- return err;
+ break;
} while (addr = next, addr != end);
-
- return 0;
+ cond_resched();
+ return err;
}

#else /* CONFIG_HUGETLB_PAGE */
-static int walk_hugetlb_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static inline int walk_hugetlb_range(unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
return 0;
}

#endif /* CONFIG_HUGETLB_PAGE */

+/*
+ * Decide whether we really walk over the current vma on [@start, @end)
+ * or skip it. When we skip it, we set @walk->skip to 1.
+ * The return value is used to control the page table walking to
+ * continue (for zero) or not (for non-zero).
+ *
+ * Default check (only VM_PFNMAP check for now) is used when the caller
+ * doesn't define test_walk() callback.
+ */
+static int walk_page_test(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;

+ if (walk->test_walk)
+ return walk->test_walk(start, end, walk);
+
+ /*
+ * Do not walk over vma(VM_PFNMAP), because we have no valid struct
+ * page backing a VM_PFNMAP range. See also commit a9ff785e4437.
+ */
+ if (vma->vm_flags & VM_PFNMAP)
+ walk->skip = 1;
+ return 0;
+}
+
+static int __walk_page_range(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ int err = 0;
+ struct vm_area_struct *vma = walk->vma;
+
+ if (vma && is_vm_hugetlb_page(vma)) {
+ if (walk->hugetlb_entry)
+ err = walk_hugetlb_range(start, end, walk);
+ } else
+ err = walk_pgd_range(start, end, walk);
+
+ return err;
+}

/**
- * walk_page_range - walk a memory map's page tables with a callback
- * @addr: starting address
- * @end: ending address
- * @walk: set of callbacks to invoke for each level of the tree
+ * walk_page_range - walk page table with caller specific callbacks
+ *
+ * Recursively walk the page table tree of the process represented by
+ * @walk->mm within the virtual address range [@start, @end). In walking,
+ * we can call caller-specific callback functions against each entry.
*
- * Recursively walk the page table for the memory area in a VMA,
- * calling supplied callbacks. Callbacks are called in-order (first
- * PGD, first PUD, first PMD, first PTE, second PTE... second PMD,
- * etc.). If lower-level callbacks are omitted, walking depth is reduced.
+ * Before starting to walk page table, some callers want to check whether
+ * they really want to walk over the vma (for example by checking vm_flags.)
+ * walk_page_test() and @walk->test_walk() do that check.
*
- * Each callback receives an entry pointer and the start and end of the
- * associated range, and a copy of the original mm_walk for access to
- * the ->private or ->mm fields.
+ * If any callback returns a non-zero value, the page table walk is aborted
+ * immediately and the return value is propagated back to the caller.
+ * Note that the meaning of the positive returned value can be defined
+ * by the caller for its own purpose.
*
- * Usually no locks are taken, but splitting transparent huge page may
- * take page table lock. And the bottom level iterator will map PTE
- * directories from highmem if necessary.
+ * If the caller defines multiple callbacks in different levels, the
+ * callbacks are called in depth-first manner. It could happen that
+ * multiple callbacks are called on a address. For example if some caller
+ * defines test_walk(), pmd_entry(), and pte_entry(), then callbacks are
+ * called in the order of test_walk(), pmd_entry(), and pte_entry().
+ * If you don't want to go down to lower level at some point and move to
+ * the next entry in the same level, you set @walk->skip to 1.
+ * For example if you succeed to handle some pmd entry as trans_huge entry,
+ * you need not call walk_pte_range() any more, so set it to avoid that.
+ * We can't determine whether to go down to lower level with the return
+ * value of the callback, because the whole range of return values (0, >0,
+ * and <0) are used up for other meanings.
*
- * If any callback returns a non-zero value, the walk is aborted and
- * the return value is propagated back to the caller. Otherwise 0 is returned.
+ * Each callback can access to the vma over which it is doing page table
+ * walk right now via @walk->vma. @walk->vma is set to NULL in walking
+ * outside a vma. If you want to access to some caller-specific data from
+ * callbacks, @walk->private should be helpful.
*
- * walk->mm->mmap_sem must be held for at least read if walk->hugetlb_entry
- * is !NULL.
+ * The callers should hold @walk->mm->mmap_sem. Note that the lower level
+ * iterators can take page table lock in lowest level iteration and/or
+ * in split_huge_page_pmd().
*/
-int walk_page_range(unsigned long addr, unsigned long end,
+int walk_page_range(unsigned long start, unsigned long end,
struct mm_walk *walk)
{
- pgd_t *pgd;
- unsigned long next;
int err = 0;
+ struct vm_area_struct *vma;
+ unsigned long next;

- if (addr >= end)
- return err;
+ if (start >= end)
+ return -EINVAL;

if (!walk->mm)
return -EINVAL;

VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));

- pgd = pgd_offset(walk->mm, addr);
do {
- struct vm_area_struct *vma = NULL;
-
- next = pgd_addr_end(addr, end);
-
- /*
- * This function was not intended to be vma based.
- * But there are vma special cases to be handled:
- * - hugetlb vma's
- * - VM_PFNMAP vma's
- */
- vma = find_vma(walk->mm, addr);
- if (vma) {
- /*
- * There are no page structures backing a VM_PFNMAP
- * range, so do not allow split_huge_page_pmd().
- */
- if ((vma->vm_start <= addr) &&
- (vma->vm_flags & VM_PFNMAP)) {
- next = vma->vm_end;
- pgd = pgd_offset(walk->mm, next);
- continue;
- }
- /*
- * Handle hugetlb vma individually because pagetable
- * walk for the hugetlb page is dependent on the
- * architecture and we can't handled it in the same
- * manner as non-huge pages.
- */
- if (walk->hugetlb_entry && (vma->vm_start <= addr) &&
- is_vm_hugetlb_page(vma)) {
- if (vma->vm_end < next)
- next = vma->vm_end;
- /*
- * Hugepage is very tightly coupled with vma,
- * so walk through hugetlb entries within a
- * given vma.
- */
- err = walk_hugetlb_range(vma, addr, next, walk);
- if (err)
- break;
- pgd = pgd_offset(walk->mm, next);
+ vma = find_vma(walk->mm, start);
+ if (!vma) { /* after the last vma */
+ walk->vma = NULL;
+ next = end;
+ } else if (start < vma->vm_start) { /* outside the found vma */
+ walk->vma = NULL;
+ next = vma->vm_start;
+ } else { /* inside the found vma */
+ walk->vma = vma;
+ next = vma->vm_end;
+ err = walk_page_test(start, end, walk);
+ if (skip_lower_level_walking(walk))
continue;
- }
- }
-
- if (pgd_none_or_clear_bad(pgd)) {
- if (walk->pte_hole)
- err = walk->pte_hole(addr, next, walk);
if (err)
break;
- pgd++;
- continue;
}
- if (walk->pgd_entry)
- err = walk->pgd_entry(pgd, addr, next, walk);
- if (!err &&
- (walk->pud_entry || walk->pmd_entry || walk->pte_entry))
- err = walk_pud_range(pgd, addr, next, walk);
+ err = __walk_page_range(start, next, walk);
if (err)
break;
- pgd++;
- } while (addr = next, addr < end);
-
+ } while (start = next, start < end);
return err;
}
--
1.8.5.3

2014-02-10 21:45:43

by Naoya Horiguchi

Subject: [PATCH 10/11] pagewalk: remove argument hmask from hugetlb_entry()

hugetlb_entry() no longer uses the argument hmask, so let's remove it now.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 12 ++++++------
include/linux/mm.h | 5 ++---
mm/pagewalk.c | 2 +-
3 files changed, 9 insertions(+), 10 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 8b23bbcc5e04..f819d0d4a0e8 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -1022,8 +1022,7 @@ static void huge_pte_to_pagemap_entry(pagemap_entry_t *pme, struct pagemapread *
}

/* This function walks within one hugetlb entry in the single call */
-static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end,
+static int pagemap_hugetlb(pte_t *pte, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
@@ -1031,6 +1030,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
int err = 0;
int flags2;
pagemap_entry_t pme;
+ unsigned long hmask;

WARN_ON_ONCE(!vma);

@@ -1292,8 +1292,8 @@ static int gather_pmd_stats(pmd_t *pmd, unsigned long addr,
return 0;
}
#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
struct numa_maps *md;
struct page *page;
@@ -1311,8 +1311,8 @@ static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
}

#else
-static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
+static int gather_hugetlb_stats(pte_t *pte, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
{
return 0;
}
diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index 144b08617957..7b6b596a5bf1 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1091,9 +1091,8 @@ struct mm_walk {
unsigned long next, struct mm_walk *walk);
int (*pte_hole)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
- int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long next,
- struct mm_walk *walk);
+ int (*hugetlb_entry)(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk);
int (*test_walk)(unsigned long addr, unsigned long next,
struct mm_walk *walk);
struct mm_struct *mm;
diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 2a88dfa58af6..416e981243b1 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -199,7 +199,7 @@ static int walk_hugetlb_range(unsigned long addr, unsigned long end,
* in walk->hugetlb_entry().
*/
if (pte && walk->hugetlb_entry)
- err = walk->hugetlb_entry(pte, hmask, addr, next, walk);
+ err = walk->hugetlb_entry(pte, addr, next, walk);
spin_unlock(ptl);
if (err)
break;
--
1.8.5.3

2014-02-10 21:45:47

by Naoya Horiguchi

Subject: [PATCH 07/11] memcg: redefine callback functions for page table walker

Move the code around the pte loop in mem_cgroup_count_precharge_pte_range()
into mem_cgroup_count_precharge_pte(), connected to pte_entry().

We don't change the mem_cgroup_move_charge_pte_range() callback for now,
because the 'goto retry' makes the same replacement difficult.

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/memcontrol.c | 71 ++++++++++++++++++++++-----------------------------------
1 file changed, 27 insertions(+), 44 deletions(-)

diff --git v3.14-rc2.orig/mm/memcontrol.c v3.14-rc2/mm/memcontrol.c
index 53385cd4e6f0..a2083c24af63 100644
--- v3.14-rc2.orig/mm/memcontrol.c
+++ v3.14-rc2/mm/memcontrol.c
@@ -6900,30 +6900,29 @@ static inline enum mc_target_type get_mctgt_type_thp(struct vm_area_struct *vma,
}
#endif

-static int mem_cgroup_count_precharge_pte_range(pmd_t *pmd,
+static int mem_cgroup_count_precharge_pte(pte_t *pte,
unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->private;
- pte_t *pte;
+ if (get_mctgt_type(walk->vma, addr, *pte, NULL))
+ mc.precharge++; /* increment precharge temporarily */
+ return 0;
+}
+
+static int mem_cgroup_count_precharge_pmd(pmd_t *pmd,
+ unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
spinlock_t *ptl;

if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
if (get_mctgt_type_thp(vma, addr, *pmd, NULL) == MC_TARGET_PAGE)
mc.precharge += HPAGE_PMD_NR;
spin_unlock(ptl);
- return 0;
+ /* don't call mem_cgroup_count_precharge_pte() */
+ walk->skip = 1;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE)
- if (get_mctgt_type(vma, addr, *pte, NULL))
- mc.precharge++; /* increment precharge temporarily */
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
-
return 0;
}

@@ -6932,18 +6931,14 @@ static unsigned long mem_cgroup_count_precharge(struct mm_struct *mm)
unsigned long precharge;
struct vm_area_struct *vma;

+ struct mm_walk mem_cgroup_count_precharge_walk = {
+ .pmd_entry = mem_cgroup_count_precharge_pmd,
+ .pte_entry = mem_cgroup_count_precharge_pte,
+ .mm = mm,
+ };
down_read(&mm->mmap_sem);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- struct mm_walk mem_cgroup_count_precharge_walk = {
- .pmd_entry = mem_cgroup_count_precharge_pte_range,
- .mm = mm,
- .private = vma,
- };
- if (is_vm_hugetlb_page(vma))
- continue;
- walk_page_range(vma->vm_start, vma->vm_end,
- &mem_cgroup_count_precharge_walk);
- }
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ walk_page_vma(vma, &mem_cgroup_count_precharge_walk);
up_read(&mm->mmap_sem);

precharge = mc.precharge;
@@ -7082,7 +7077,7 @@ static int mem_cgroup_move_charge_pte_range(pmd_t *pmd,
struct mm_walk *walk)
{
int ret = 0;
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
pte_t *pte;
spinlock_t *ptl;
enum mc_target_type target_type;
@@ -7183,6 +7178,10 @@ put: /* get_mctgt_type() gets the page */
static void mem_cgroup_move_charge(struct mm_struct *mm)
{
struct vm_area_struct *vma;
+ struct mm_walk mem_cgroup_move_charge_walk = {
+ .pmd_entry = mem_cgroup_move_charge_pte_range,
+ .mm = mm,
+ };

lru_add_drain_all();
retry:
@@ -7198,24 +7197,8 @@ static void mem_cgroup_move_charge(struct mm_struct *mm)
cond_resched();
goto retry;
}
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- int ret;
- struct mm_walk mem_cgroup_move_charge_walk = {
- .pmd_entry = mem_cgroup_move_charge_pte_range,
- .mm = mm,
- .private = vma,
- };
- if (is_vm_hugetlb_page(vma))
- continue;
- ret = walk_page_range(vma->vm_start, vma->vm_end,
- &mem_cgroup_move_charge_walk);
- if (ret)
- /*
- * means we have consumed all precharges and failed in
- * doing additional charge. Just abandon here.
- */
- break;
- }
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ walk_page_vma(vma, &mem_cgroup_move_charge_walk);
up_read(&mm->mmap_sem);
}

--
1.8.5.3

2014-02-10 21:46:11

by Naoya Horiguchi

Subject: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()

queue_pages_range() currently does page table walking in its own way, so
this patch rewrites it to use walk_page_range().
One difficulty was that queue_pages_range() needed to check vmas to
determine whether to queue pages from a given vma or to skip it.
Now mm_walk has the test_walk() callback for that purpose, so we can do the
replacement cleanly. queue_pages_test_walk() depends not only on the current
vma but also on the previous one, so we use queue_pages->prev to remember it.
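[Editor's sketch: the vma-filtering pattern can be modeled in plain
userspace C. All names below (toy_vma, toy_qp, toy_test_walk, etc.) are
invented stand-ins for vm_area_struct, struct queue_pages, and
queue_pages_test_walk() — a toy model of the control flow, not the actual
patch code.]

```c
#include <assert.h>
#include <stddef.h>

/* Toy vma: a [start, end) range plus a VM_PFNMAP-like flag. */
struct toy_vma { int start, end, pfnmap; };

struct toy_qp {
	struct toy_vma *prev;	/* like queue_pages.prev */
	int queued;
};

/* test_walk()-style hook: remember the previous vma and skip
 * PFNMAP-like vmas instead of queuing pages from them. */
static int toy_test_walk(struct toy_vma *vma, struct toy_qp *qp, int *skip)
{
	qp->prev = vma;
	if (vma->pfnmap)
		*skip = 1;
	return 0;
}

/* Walk an array of vmas, queuing one "page" per address unit
 * in every vma the test hook accepts. */
static int toy_queue_range(struct toy_vma *vmas, int n, struct toy_qp *qp)
{
	for (int i = 0; i < n; i++) {
		int skip = 0;
		int err = toy_test_walk(&vmas[i], qp, &skip);

		if (err)
			return err;
		if (skip)
			continue;	/* vma rejected by the hook */
		for (int a = vmas[i].start; a < vmas[i].end; a++)
			qp->queued++;
	}
	return 0;
}
```

With three vmas of which the middle one is PFNMAP-like, only the first and
last contribute queued pages, and prev ends up pointing at the last vma
tested — the same division of labor the patch achieves with test_walk().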

ChangeLog v2:
- rebase onto mmots
- add VM_PFNMAP check on queue_pages_test_walk()

Signed-off-by: Naoya Horiguchi <[email protected]>
---
mm/mempolicy.c | 255 ++++++++++++++++++++++-----------------------------------
1 file changed, 99 insertions(+), 156 deletions(-)

diff --git v3.14-rc2.orig/mm/mempolicy.c v3.14-rc2/mm/mempolicy.c
index ae3c8f3595d4..b2155b8adbae 100644
--- v3.14-rc2.orig/mm/mempolicy.c
+++ v3.14-rc2/mm/mempolicy.c
@@ -476,140 +476,66 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags);

+struct queue_pages {
+ struct list_head *pagelist;
+ unsigned long flags;
+ nodemask_t *nmask;
+ struct vm_area_struct *prev;
+};
+
/*
* Scan through pages checking if pages follow certain conditions,
* and move them to the pagelist if they do.
*/
-static int queue_pages_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end,
- const nodemask_t *nodes, unsigned long flags,
- void *private)
+static int queue_pages_pte(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
- pte_t *orig_pte;
- pte_t *pte;
- spinlock_t *ptl;
-
- orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- do {
- struct page *page;
- int nid;
+ struct vm_area_struct *vma = walk->vma;
+ struct page *page;
+ struct queue_pages *qp = walk->private;
+ unsigned long flags = qp->flags;
+ int nid;

- if (!pte_present(*pte))
- continue;
- page = vm_normal_page(vma, addr, *pte);
- if (!page)
- continue;
- /*
- * vm_normal_page() filters out zero pages, but there might
- * still be PageReserved pages to skip, perhaps in a VDSO.
- */
- if (PageReserved(page))
- continue;
- nid = page_to_nid(page);
- if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
- continue;
+ if (!pte_present(*pte))
+ return 0;
+ page = vm_normal_page(vma, addr, *pte);
+ if (!page)
+ return 0;
+ /*
+ * vm_normal_page() filters out zero pages, but there might
+ * still be PageReserved pages to skip, perhaps in a VDSO.
+ */
+ if (PageReserved(page))
+ return 0;
+ nid = page_to_nid(page);
+ if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ return 0;

- if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
- migrate_page_add(page, private, flags);
- else
- break;
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(orig_pte, ptl);
- return addr != end;
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ migrate_page_add(page, qp->pagelist, flags);
+ return 0;
}

-static void queue_pages_hugetlb_pmd_range(struct vm_area_struct *vma,
- pmd_t *pmd, const nodemask_t *nodes, unsigned long flags,
- void *private)
+static int queue_pages_hugetlb(pte_t *pte, unsigned long addr,
+ unsigned long next, struct mm_walk *walk)
{
#ifdef CONFIG_HUGETLB_PAGE
+ struct queue_pages *qp = walk->private;
+ unsigned long flags = qp->flags;
int nid;
struct page *page;
- spinlock_t *ptl;

- ptl = huge_pte_lock(hstate_vma(vma), vma->vm_mm, (pte_t *)pmd);
- page = pte_page(huge_ptep_get((pte_t *)pmd));
+ page = pte_page(huge_ptep_get(pte));
nid = page_to_nid(page);
- if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
- goto unlock;
+ if (node_isset(nid, *qp->nmask) == !!(flags & MPOL_MF_INVERT))
+ return 0;
/* With MPOL_MF_MOVE, we migrate only unshared hugepage. */
if (flags & (MPOL_MF_MOVE_ALL) ||
(flags & MPOL_MF_MOVE && page_mapcount(page) == 1))
- isolate_huge_page(page, private);
-unlock:
- spin_unlock(ptl);
+ isolate_huge_page(page, qp->pagelist);
#else
BUG();
#endif
-}
-
-static inline int queue_pages_pmd_range(struct vm_area_struct *vma, pud_t *pud,
- unsigned long addr, unsigned long end,
- const nodemask_t *nodes, unsigned long flags,
- void *private)
-{
- pmd_t *pmd;
- unsigned long next;
-
- pmd = pmd_offset(pud, addr);
- do {
- next = pmd_addr_end(addr, end);
- if (!pmd_present(*pmd))
- continue;
- if (pmd_huge(*pmd) && is_vm_hugetlb_page(vma)) {
- queue_pages_hugetlb_pmd_range(vma, pmd, nodes,
- flags, private);
- continue;
- }
- split_huge_page_pmd(vma, addr, pmd);
- if (pmd_none_or_trans_huge_or_clear_bad(pmd))
- continue;
- if (queue_pages_pte_range(vma, pmd, addr, next, nodes,
- flags, private))
- return -EIO;
- } while (pmd++, addr = next, addr != end);
- return 0;
-}
-
-static inline int queue_pages_pud_range(struct vm_area_struct *vma, pgd_t *pgd,
- unsigned long addr, unsigned long end,
- const nodemask_t *nodes, unsigned long flags,
- void *private)
-{
- pud_t *pud;
- unsigned long next;
-
- pud = pud_offset(pgd, addr);
- do {
- next = pud_addr_end(addr, end);
- if (pud_huge(*pud) && is_vm_hugetlb_page(vma))
- continue;
- if (pud_none_or_clear_bad(pud))
- continue;
- if (queue_pages_pmd_range(vma, pud, addr, next, nodes,
- flags, private))
- return -EIO;
- } while (pud++, addr = next, addr != end);
- return 0;
-}
-
-static inline int queue_pages_pgd_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end,
- const nodemask_t *nodes, unsigned long flags,
- void *private)
-{
- pgd_t *pgd;
- unsigned long next;
-
- pgd = pgd_offset(vma->vm_mm, addr);
- do {
- next = pgd_addr_end(addr, end);
- if (pgd_none_or_clear_bad(pgd))
- continue;
- if (queue_pages_pud_range(vma, pgd, addr, next, nodes,
- flags, private))
- return -EIO;
- } while (pgd++, addr = next, addr != end);
return 0;
}

@@ -642,6 +568,45 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
}
#endif /* CONFIG_NUMA_BALANCING */

+static int queue_pages_test_walk(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct vm_area_struct *vma = walk->vma;
+ struct queue_pages *qp = walk->private;
+ unsigned long endvma = vma->vm_end;
+ unsigned long flags = qp->flags;
+
+ if (endvma > end)
+ endvma = end;
+ if (vma->vm_start > start)
+ start = vma->vm_start;
+
+ if (!(flags & MPOL_MF_DISCONTIG_OK)) {
+ if (!vma->vm_next && vma->vm_end < end)
+ return -EFAULT;
+ if (qp->prev && qp->prev->vm_end < vma->vm_start)
+ return -EFAULT;
+ }
+
+ qp->prev = vma;
+ walk->skip = 1;
+
+ if (vma->vm_flags & VM_PFNMAP)
+ return 0;
+
+ if (flags & MPOL_MF_LAZY) {
+ change_prot_numa(vma, start, endvma);
+ return 0;
+ }
+
+ if ((flags & MPOL_MF_STRICT) ||
+ ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
+ vma_migratable(vma)))
+ /* queue pages from current vma */
+ walk->skip = 0;
+ return 0;
+}
+
/*
* Walk through page tables and collect pages to be migrated.
*
@@ -651,51 +616,29 @@ static unsigned long change_prot_numa(struct vm_area_struct *vma,
*/
static struct vm_area_struct *
queue_pages_range(struct mm_struct *mm, unsigned long start, unsigned long end,
- const nodemask_t *nodes, unsigned long flags, void *private)
+ nodemask_t *nodes, unsigned long flags,
+ struct list_head *pagelist)
{
int err;
- struct vm_area_struct *first, *vma, *prev;
-
-
- first = find_vma(mm, start);
- if (!first)
- return ERR_PTR(-EFAULT);
- prev = NULL;
- for (vma = first; vma && vma->vm_start < end; vma = vma->vm_next) {
- unsigned long endvma = vma->vm_end;
-
- if (endvma > end)
- endvma = end;
- if (vma->vm_start > start)
- start = vma->vm_start;
-
- if (!(flags & MPOL_MF_DISCONTIG_OK)) {
- if (!vma->vm_next && vma->vm_end < end)
- return ERR_PTR(-EFAULT);
- if (prev && prev->vm_end < vma->vm_start)
- return ERR_PTR(-EFAULT);
- }
-
- if (flags & MPOL_MF_LAZY) {
- change_prot_numa(vma, start, endvma);
- goto next;
- }
-
- if ((flags & MPOL_MF_STRICT) ||
- ((flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL)) &&
- vma_migratable(vma))) {
-
- err = queue_pages_pgd_range(vma, start, endvma, nodes,
- flags, private);
- if (err) {
- first = ERR_PTR(err);
- break;
- }
- }
-next:
- prev = vma;
- }
- return first;
+ struct queue_pages qp = {
+ .pagelist = pagelist,
+ .flags = flags,
+ .nmask = nodes,
+ .prev = NULL,
+ };
+ struct mm_walk queue_pages_walk = {
+ .hugetlb_entry = queue_pages_hugetlb,
+ .pte_entry = queue_pages_pte,
+ .test_walk = queue_pages_test_walk,
+ .mm = mm,
+ .private = &qp,
+ };
+
+ err = walk_page_range(start, end, &queue_pages_walk);
+ if (err < 0)
+ return ERR_PTR(err);
+ else
+ return find_vma(mm, start);
}

/*
--
1.8.5.3
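The test_walk()/skip protocol that queue_pages_test_walk() implements above can be sketched as a small userspace model (purely illustrative: the struct layouts and the MF_* flag values below are stand-ins I made up, not kernel definitions):

```c
#include <assert.h>

/* Illustrative stand-ins for the MPOL_MF_* flags; values are arbitrary. */
#define MF_DISCONTIG_OK	0x1UL
#define MF_STRICT	0x2UL
#define MF_MOVE		0x4UL

struct vma {
	unsigned long start, end;
	int migratable;
	struct vma *next;
};

struct walk_state {
	int skip;		/* set to bypass lower-level walking */
	struct vma *prev;	/* previous vma, for contiguity checks */
	unsigned long flags;
};

/*
 * Mirrors the shape of queue_pages_test_walk(): return a negative value
 * to abort the whole walk, otherwise return 0 and use ws->skip to say
 * whether the page-table callbacks should run for this vma.
 */
static int test_walk(struct vma *vma, unsigned long start, unsigned long end,
		     struct walk_state *ws)
{
	(void)start;	/* unused in this toy */

	if (!(ws->flags & MF_DISCONTIG_OK)) {
		if (!vma->next && vma->end < end)
			return -1;	/* range extends past the last vma */
		if (ws->prev && ws->prev->end < vma->start)
			return -1;	/* hole between vmas */
	}
	ws->prev = vma;
	ws->skip = 1;			/* skip this vma by default */
	if ((ws->flags & MF_STRICT) ||
	    ((ws->flags & MF_MOVE) && vma->migratable))
		ws->skip = 0;		/* queue pages from this vma */
	return 0;
}
```

The core walker calls this once per vma, aborts on a negative return, and descends into the entry callbacks only when skip stays clear.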

2014-02-10 21:46:53

by Naoya Horiguchi

Subject: [PATCH 09/11] arch/powerpc/mm/subpage-prot.c: use walk_page_vma() instead of walk_page_range()

We no longer need to use mm_walk->private to pass the vma to the callback
function, because mm_walk->vma carries it.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
arch/powerpc/mm/subpage-prot.c | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)

diff --git v3.14-rc2.orig/arch/powerpc/mm/subpage-prot.c v3.14-rc2/arch/powerpc/mm/subpage-prot.c
index a770df2dae70..cec0af0a935f 100644
--- v3.14-rc2.orig/arch/powerpc/mm/subpage-prot.c
+++ v3.14-rc2/arch/powerpc/mm/subpage-prot.c
@@ -134,7 +134,7 @@ static void subpage_prot_clear(unsigned long addr, unsigned long len)
static int subpage_walk_pmd_entry(pmd_t *pmd, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct vm_area_struct *vma = walk->private;
+ struct vm_area_struct *vma = walk->vma;
split_huge_page_pmd(vma, addr, pmd);
return 0;
}
@@ -163,9 +163,7 @@ static void subpage_mark_vma_nohuge(struct mm_struct *mm, unsigned long addr,
if (vma->vm_start >= (addr + len))
break;
vma->vm_flags |= VM_NOHUGEPAGE;
- subpage_proto_walk.private = vma;
- walk_page_range(vma->vm_start, vma->vm_end,
- &subpage_proto_walk);
+ walk_page_vma(vma, &subpage_proto_walk);
vma = vma->vm_next;
}
}
--
1.8.5.3

2014-02-10 21:45:36

by Naoya Horiguchi

Subject: [PATCH 05/11] pagemap: redefine callback functions for page table walker

pagemap_pte_range(), connected to pmd_entry(), does both the pmd loop and
the pte loop. So this patch moves the pte part into pagemap_pte() on pte_entry().

We remove the VM_SOFTDIRTY check in pagemap_pte_range(), because in the new
page table walker we call __walk_page_range() for each vma separately,
so we never see multiple vmas in a single pgd/pud/pmd/pte loop.

ChangeLog v2:
- remove cond_resched() (moved it to walk_hugetlb_range())
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 76 ++++++++++++++++++++----------------------------------
1 file changed, 28 insertions(+), 48 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 8ecae2f55a97..7ed7c88f0687 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -957,19 +957,33 @@ static inline void thp_pmd_to_pagemap_entry(pagemap_entry_t *pme, struct pagemap
}
#endif

-static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
+static int pagemap_pte(pte_t *pte, unsigned long addr, unsigned long end,
struct mm_walk *walk)
{
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = walk->vma;
struct pagemapread *pm = walk->private;
- spinlock_t *ptl;
- pte_t *pte;
+ pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+
+ if (vma && vma->vm_start <= addr && end <= vma->vm_end) {
+ pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
+ /* unmap before userspace copy */
+ pte_unmap(pte);
+ }
+ return add_to_pagemap(addr, &pme, pm);
+}
+
+static int pagemap_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
+{
int err = 0;
+ struct vm_area_struct *vma = walk->vma;
+ struct pagemapread *pm = walk->private;
pagemap_entry_t pme = make_pme(PM_NOT_PRESENT(pm->v2));
+ spinlock_t *ptl;

- /* find the first VMA at or above 'addr' */
- vma = find_vma(walk->mm, addr);
- if (vma && pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
+ if (!vma)
+ return err;
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
int pmd_flags2;

if ((vma->vm_flags & VM_SOFTDIRTY) || pmd_soft_dirty(*pmd))
@@ -988,41 +1002,9 @@ static int pagemap_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
break;
}
spin_unlock(ptl);
- return err;
- }
-
- if (pmd_trans_unstable(pmd))
- return 0;
- for (; addr != end; addr += PAGE_SIZE) {
- int flags2;
-
- /* check to see if we've left 'vma' behind
- * and need a new, higher one */
- if (vma && (addr >= vma->vm_end)) {
- vma = find_vma(walk->mm, addr);
- if (vma && (vma->vm_flags & VM_SOFTDIRTY))
- flags2 = __PM_SOFT_DIRTY;
- else
- flags2 = 0;
- pme = make_pme(PM_NOT_PRESENT(pm->v2) | PM_STATUS2(pm->v2, flags2));
- }
-
- /* check that 'vma' actually covers this address,
- * and that it isn't a huge page vma */
- if (vma && (vma->vm_start <= addr) &&
- !is_vm_hugetlb_page(vma)) {
- pte = pte_offset_map(pmd, addr);
- pte_to_pagemap_entry(&pme, pm, vma, addr, *pte);
- /* unmap before userspace copy */
- pte_unmap(pte);
- }
- err = add_to_pagemap(addr, &pme, pm);
- if (err)
- return err;
+ /* don't call pagemap_pte() */
+ walk->skip = 1;
}
-
- cond_resched();
-
return err;
}

@@ -1045,12 +1027,11 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
struct mm_walk *walk)
{
struct pagemapread *pm = walk->private;
- struct vm_area_struct *vma;
+ struct vm_area_struct *vma = walk->vma;
int err = 0;
int flags2;
pagemap_entry_t pme;

- vma = find_vma(walk->mm, addr);
WARN_ON_ONCE(!vma);

if (vma && (vma->vm_flags & VM_SOFTDIRTY))
@@ -1058,6 +1039,7 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
else
flags2 = 0;

+ hmask = huge_page_mask(hstate_vma(vma));
for (; addr != end; addr += PAGE_SIZE) {
int offset = (addr & ~hmask) >> PAGE_SHIFT;
huge_pte_to_pagemap_entry(&pme, pm, *pte, offset, flags2);
@@ -1065,9 +1047,6 @@ static int pagemap_hugetlb_range(pte_t *pte, unsigned long hmask,
if (err)
return err;
}
-
- cond_resched();
-
return err;
}
#endif /* HUGETLB_PAGE */
@@ -1134,10 +1113,11 @@ static ssize_t pagemap_read(struct file *file, char __user *buf,
if (!mm || IS_ERR(mm))
goto out_free;

- pagemap_walk.pmd_entry = pagemap_pte_range;
+ pagemap_walk.pte_entry = pagemap_pte;
+ pagemap_walk.pmd_entry = pagemap_pmd;
pagemap_walk.pte_hole = pagemap_pte_hole;
#ifdef CONFIG_HUGETLB_PAGE
- pagemap_walk.hugetlb_entry = pagemap_hugetlb_range;
+ pagemap_walk.hugetlb_entry = pagemap_hugetlb;
#endif
pagemap_walk.mm = mm;
pagemap_walk.private = &pm;
--
1.8.5.3
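The "don't call pagemap_pte()" handshake above, where pmd_entry() sets walk->skip after consuming a huge entry and the core walker resets it, can be modeled in plain userspace C (toy names, not kernel code):

```c
#include <assert.h>

struct mm_walk_model {
	int skip;		/* set by pmd_entry to suppress pte_entry */
	int pmd_calls, pte_calls;
};

static int pmd_entry(int is_huge, struct mm_walk_model *walk)
{
	walk->pmd_calls++;
	if (is_huge)
		walk->skip = 1;	/* whole range handled; no pte walk needed */
	return 0;
}

static int pte_entry(struct mm_walk_model *walk)
{
	walk->pte_calls++;
	return 0;
}

/*
 * Core-walker fragment: test and reset the skip flag, as
 * skip_lower_level_walking() does in patch 01, so skipping one pmd
 * does not affect the walk over the next one.
 */
static int walk_pmd(int is_huge, struct mm_walk_model *walk)
{
	int err = pmd_entry(is_huge, walk);

	if (walk->skip) {
		walk->skip = 0;
		return err;
	}
	if (err)
		return err;
	return pte_entry(walk);
}
```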

2014-02-10 21:47:26

by Naoya Horiguchi

[permalink] [raw]
Subject: [PATCH 03/11] smaps: redefine callback functions for page table walker

smaps_pte_range(), connected to pmd_entry(), does both the pmd loop and the
pte loop. So this patch moves the pte part into smaps_pte() on pte_entry(),
as the name suggests.

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 47 +++++++++++++++++------------------------------
1 file changed, 17 insertions(+), 30 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index fb52b548080d..62eedbe50733 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -423,7 +423,6 @@ const struct file_operations proc_tid_maps_operations = {

#ifdef CONFIG_PROC_PAGE_MONITOR
struct mem_size_stats {
- struct vm_area_struct *vma;
unsigned long resident;
unsigned long shared_clean;
unsigned long shared_dirty;
@@ -437,15 +436,16 @@ struct mem_size_stats {
u64 pss;
};

-
-static void smaps_pte_entry(pte_t ptent, unsigned long addr,
- unsigned long ptent_size, struct mm_walk *walk)
+static int smaps_pte(pte_t *pte, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
- struct vm_area_struct *vma = mss->vma;
+ struct vm_area_struct *vma = walk->vma;
pgoff_t pgoff = linear_page_index(vma, addr);
struct page *page = NULL;
int mapcount;
+ pte_t ptent = *pte;
+ unsigned long ptent_size = end - addr;

if (pte_present(ptent)) {
page = vm_normal_page(vma, addr, ptent);
@@ -462,7 +462,7 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr,
}

if (!page)
- return;
+ return 0;

if (PageAnon(page))
mss->anonymous += ptent_size;
@@ -488,35 +488,22 @@ static void smaps_pte_entry(pte_t ptent, unsigned long addr,
mss->private_clean += ptent_size;
mss->pss += (ptent_size << PSS_SHIFT);
}
+ return 0;
}

-static int smaps_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
- struct mm_walk *walk)
+static int smaps_pmd(pmd_t *pmd, unsigned long addr, unsigned long end,
+ struct mm_walk *walk)
{
struct mem_size_stats *mss = walk->private;
- struct vm_area_struct *vma = mss->vma;
- pte_t *pte;
spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
- smaps_pte_entry(*(pte_t *)pmd, addr, HPAGE_PMD_SIZE, walk);
+ if (pmd_trans_huge_lock(pmd, walk->vma, &ptl) == 1) {
+ smaps_pte((pte_t *)pmd, addr, addr + HPAGE_PMD_SIZE, walk);
spin_unlock(ptl);
mss->anonymous_thp += HPAGE_PMD_SIZE;
- return 0;
+ /* don't call smaps_pte() */
+ walk->skip = 1;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
- /*
- * The mmap_sem held all the way back in m_start() is what
- * keeps khugepaged out of here and from collapsing things
- * in here.
- */
- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE)
- smaps_pte_entry(*pte, addr, PAGE_SIZE, walk);
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
return 0;
}

@@ -581,16 +568,16 @@ static int show_smap(struct seq_file *m, void *v, int is_pid)
struct vm_area_struct *vma = v;
struct mem_size_stats mss;
struct mm_walk smaps_walk = {
- .pmd_entry = smaps_pte_range,
+ .pmd_entry = smaps_pmd,
+ .pte_entry = smaps_pte,
.mm = vma->vm_mm,
+ .vma = vma,
.private = &mss,
};

memset(&mss, 0, sizeof mss);
- mss.vma = vma;
/* mmap_sem is held in m_start */
- if (vma->vm_mm && !is_vm_hugetlb_page(vma))
- walk_page_range(vma->vm_start, vma->vm_end, &smaps_walk);
+ walk_page_vma(vma, &smaps_walk);

show_map_vma(m, vma, is_pid);

--
1.8.5.3

2014-02-10 21:47:21

by Naoya Horiguchi

Subject: [PATCH 02/11] pagewalk: add walk_page_vma()

Introduce walk_page_vma(), which is useful for callers that want to
walk over a single given vma. It is used by later patches.

ChangeLog v4:
- rename skip_check to skip_lower_level_walking

Signed-off-by: Naoya Horiguchi <[email protected]>
---
include/linux/mm.h | 1 +
mm/pagewalk.c | 18 ++++++++++++++++++
2 files changed, 19 insertions(+)

diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
index 4d0bc01de43c..144b08617957 100644
--- v3.14-rc2.orig/include/linux/mm.h
+++ v3.14-rc2/include/linux/mm.h
@@ -1104,6 +1104,7 @@ struct mm_walk {

int walk_page_range(unsigned long addr, unsigned long end,
struct mm_walk *walk);
+int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk);
void free_pgd_range(struct mmu_gather *tlb, unsigned long addr,
unsigned long end, unsigned long floor, unsigned long ceiling);
int copy_page_range(struct mm_struct *dst, struct mm_struct *src,
diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
index 4770558feea8..2a88dfa58af6 100644
--- v3.14-rc2.orig/mm/pagewalk.c
+++ v3.14-rc2/mm/pagewalk.c
@@ -334,3 +334,21 @@ int walk_page_range(unsigned long start, unsigned long end,
} while (start = next, start < end);
return err;
}
+
+int walk_page_vma(struct vm_area_struct *vma, struct mm_walk *walk)
+{
+ int err;
+
+ if (!walk->mm)
+ return -EINVAL;
+
+ VM_BUG_ON(!rwsem_is_locked(&walk->mm->mmap_sem));
+ VM_BUG_ON(!vma);
+ walk->vma = vma;
+ err = walk_page_test(vma->vm_start, vma->vm_end, walk);
+ if (skip_lower_level_walking(walk))
+ return 0;
+ if (err)
+ return err;
+ return __walk_page_range(vma->vm_start, vma->vm_end, walk);
+}
--
1.8.5.3
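A rough userspace sketch of how a caller drives the new interface, one walk per vma instead of an open-coded find_vma() loop (the structures and the 4 KB page constant below are toy stand-ins, not the kernel's):

```c
#include <assert.h>

#define PAGE_SZ	0x1000UL	/* illustrative page size */

struct walk;

struct vma {
	unsigned long start, end;
	struct vma *next;
};

struct walk {
	int (*pte_entry)(unsigned long addr, struct walk *w);
	int visited;
};

/* Toy counterpart of walk_page_vma(): walk exactly one vma's range. */
static int walk_page_vma_model(struct vma *vma, struct walk *w)
{
	unsigned long addr;
	int err = 0;

	for (addr = vma->start; addr < vma->end; addr += PAGE_SZ) {
		err = w->pte_entry(addr, w);
		if (err)
			break;
	}
	return err;
}

/* Example callback: count the pages visited. */
static int count_pages(unsigned long addr, struct walk *w)
{
	(void)addr;
	w->visited++;
	return 0;
}
```

A caller such as clear_refs_write() then just iterates its vma list and invokes the per-vma walker for each one.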

2014-02-10 21:45:33

by Naoya Horiguchi

Subject: [PATCH 06/11] numa_maps: redefine callback functions for page table walker

gather_pte_stats(), connected to pmd_entry(), does both the pmd loop and
the pte loop. So this patch moves the pte part into pte_entry().

ChangeLog v2:
- rebase onto mmots

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 54 ++++++++++++++++++++++++++----------------------------
1 file changed, 26 insertions(+), 28 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 7ed7c88f0687..8b23bbcc5e04 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -1193,7 +1193,6 @@ const struct file_operations proc_pagemap_operations = {
#ifdef CONFIG_NUMA

struct numa_maps {
- struct vm_area_struct *vma;
unsigned long pages;
unsigned long anon;
unsigned long active;
@@ -1259,43 +1258,41 @@ static struct page *can_gather_numa_stats(pte_t pte, struct vm_area_struct *vma,
return page;
}

-static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+static int gather_pte_stats(pte_t *pte, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
- struct numa_maps *md;
- spinlock_t *ptl;
- pte_t *orig_pte;
- pte_t *pte;
+ struct numa_maps *md = walk->private;

- md = walk->private;
+ struct page *page = can_gather_numa_stats(*pte, walk->vma, addr);
+ if (!page)
+ return 0;
+ gather_stats(page, md, pte_dirty(*pte), 1);
+ return 0;
+}
+
+static int gather_pmd_stats(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md = walk->private;
+ struct vm_area_struct *vma = walk->vma;
+ spinlock_t *ptl;

- if (pmd_trans_huge_lock(pmd, md->vma, &ptl) == 1) {
+ if (pmd_trans_huge_lock(pmd, vma, &ptl) == 1) {
pte_t huge_pte = *(pte_t *)pmd;
struct page *page;

- page = can_gather_numa_stats(huge_pte, md->vma, addr);
+ page = can_gather_numa_stats(huge_pte, vma, addr);
if (page)
gather_stats(page, md, pte_dirty(huge_pte),
HPAGE_PMD_SIZE/PAGE_SIZE);
spin_unlock(ptl);
- return 0;
+ /* don't call gather_pte_stats() */
+ walk->skip = 1;
}
-
- if (pmd_trans_unstable(pmd))
- return 0;
- orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- do {
- struct page *page = can_gather_numa_stats(*pte, md->vma, addr);
- if (!page)
- continue;
- gather_stats(page, md, pte_dirty(*pte), 1);
-
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(orig_pte, ptl);
return 0;
}
#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
struct numa_maps *md;
@@ -1314,7 +1311,7 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
}

#else
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+static int gather_hugetlb_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
return 0;
@@ -1344,12 +1341,12 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
/* Ensure we start with an empty set of numa_maps statistics. */
memset(md, 0, sizeof(*md));

- md->vma = vma;
-
- walk.hugetlb_entry = gather_hugetbl_stats;
- walk.pmd_entry = gather_pte_stats;
+ walk.hugetlb_entry = gather_hugetlb_stats;
+ walk.pmd_entry = gather_pmd_stats;
+ walk.pte_entry = gather_pte_stats;
walk.private = md;
walk.mm = mm;
+ walk.vma = vma;

pol = get_vma_policy(task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol);
@@ -1380,6 +1377,7 @@ static int show_numa_map(struct seq_file *m, void *v, int is_pid)
if (is_vm_hugetlb_page(vma))
seq_printf(m, " huge");

+ /* mmap_sem is held by m_start */
walk_page_range(vma->vm_start, vma->vm_end, &walk);

if (!md->pages)
--
1.8.5.3

2014-02-10 21:48:30

by Naoya Horiguchi

Subject: [PATCH 04/11] clear_refs: redefine callback functions for page table walker

Currently clear_refs_pte_range() is connected to pmd_entry() in order to
split thps when found. But this work can now be done in the core page table
walker code, so we have no reason to keep this callback on pmd_entry().
This patch moves the pte handling code onto the pte_entry() callback.

clear_refs_write() has some prechecks that decide whether we really walk over
a given vma. These can be handled by the test_walk() callback, so let's define one.

Signed-off-by: Naoya Horiguchi <[email protected]>
---
fs/proc/task_mmu.c | 82 ++++++++++++++++++++++--------------------------------
1 file changed, 33 insertions(+), 49 deletions(-)

diff --git v3.14-rc2.orig/fs/proc/task_mmu.c v3.14-rc2/fs/proc/task_mmu.c
index 62eedbe50733..8ecae2f55a97 100644
--- v3.14-rc2.orig/fs/proc/task_mmu.c
+++ v3.14-rc2/fs/proc/task_mmu.c
@@ -698,7 +698,6 @@ enum clear_refs_types {
};

struct clear_refs_private {
- struct vm_area_struct *vma;
enum clear_refs_types type;
};

@@ -730,41 +729,43 @@ static inline void clear_soft_dirty(struct vm_area_struct *vma,
#endif
}

-static int clear_refs_pte_range(pmd_t *pmd, unsigned long addr,
+static int clear_refs_pte(pte_t *pte, unsigned long addr,
unsigned long end, struct mm_walk *walk)
{
struct clear_refs_private *cp = walk->private;
- struct vm_area_struct *vma = cp->vma;
- pte_t *pte, ptent;
- spinlock_t *ptl;
+ struct vm_area_struct *vma = walk->vma;
struct page *page;

- split_huge_page_pmd(vma, addr, pmd);
- if (pmd_trans_unstable(pmd))
+ if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
+ clear_soft_dirty(vma, addr, pte);
return 0;
+ }
+ if (!pte_present(*pte))
+ return 0;
+ page = vm_normal_page(vma, addr, *pte);
+ if (!page)
+ return 0;
+ /* Clear accessed and referenced bits. */
+ ptep_test_and_clear_young(vma, addr, pte);
+ ClearPageReferenced(page);
+ return 0;
+}

- pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
- for (; addr != end; pte++, addr += PAGE_SIZE) {
- ptent = *pte;
-
- if (cp->type == CLEAR_REFS_SOFT_DIRTY) {
- clear_soft_dirty(vma, addr, pte);
- continue;
- }
-
- if (!pte_present(ptent))
- continue;
-
- page = vm_normal_page(vma, addr, ptent);
- if (!page)
- continue;
+static int clear_refs_test_walk(unsigned long start, unsigned long end,
+ struct mm_walk *walk)
+{
+ struct clear_refs_private *cp = walk->private;
+ struct vm_area_struct *vma = walk->vma;

- /* Clear accessed and referenced bits. */
- ptep_test_and_clear_young(vma, addr, pte);
- ClearPageReferenced(page);
- }
- pte_unmap_unlock(pte - 1, ptl);
- cond_resched();
+ /*
+ * Writing 1 to /proc/pid/clear_refs affects all pages.
+ * Writing 2 to /proc/pid/clear_refs only affects anonymous pages.
+ * Writing 3 to /proc/pid/clear_refs only affects file mapped pages.
+ */
+ if (cp->type == CLEAR_REFS_ANON && vma->vm_file)
+ walk->skip = 1;
+ if (cp->type == CLEAR_REFS_MAPPED && !vma->vm_file)
+ walk->skip = 1;
return 0;
}

@@ -806,33 +807,16 @@ static ssize_t clear_refs_write(struct file *file, const char __user *buf,
.type = type,
};
struct mm_walk clear_refs_walk = {
- .pmd_entry = clear_refs_pte_range,
+ .pte_entry = clear_refs_pte,
+ .test_walk = clear_refs_test_walk,
.mm = mm,
.private = &cp,
};
down_read(&mm->mmap_sem);
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_start(mm, 0, -1);
- for (vma = mm->mmap; vma; vma = vma->vm_next) {
- cp.vma = vma;
- if (is_vm_hugetlb_page(vma))
- continue;
- /*
- * Writing 1 to /proc/pid/clear_refs affects all pages.
- *
- * Writing 2 to /proc/pid/clear_refs only affects
- * Anonymous pages.
- *
- * Writing 3 to /proc/pid/clear_refs only affects file
- * mapped pages.
- */
- if (type == CLEAR_REFS_ANON && vma->vm_file)
- continue;
- if (type == CLEAR_REFS_MAPPED && !vma->vm_file)
- continue;
- walk_page_range(vma->vm_start, vma->vm_end,
- &clear_refs_walk);
- }
+ for (vma = mm->mmap; vma; vma = vma->vm_next)
+ walk_page_vma(vma, &clear_refs_walk);
if (type == CLEAR_REFS_SOFT_DIRTY)
mmu_notifier_invalidate_range_end(mm, 0, -1);
flush_tlb_mm(mm);
--
1.8.5.3
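The vma precheck that moved into clear_refs_test_walk() reduces to a small decision function. Here is a userspace sketch of just that decision (the enum mirrors the semantics stated in the patch's comment; everything else is illustrative):

```c
#include <assert.h>

enum clear_refs_types {
	CLEAR_REFS_ALL = 1,	/* writing 1: affect all pages */
	CLEAR_REFS_ANON,	/* writing 2: anonymous pages only */
	CLEAR_REFS_MAPPED,	/* writing 3: file-mapped pages only */
};

/*
 * Return 1 when the vma should be skipped, as clear_refs_test_walk()
 * signals via walk->skip; has_file models vma->vm_file != NULL.
 */
static int should_skip(enum clear_refs_types type, int has_file)
{
	if (type == CLEAR_REFS_ANON && has_file)
		return 1;
	if (type == CLEAR_REFS_MAPPED && !has_file)
		return 1;
	return 0;
}
```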

2014-02-10 22:42:05

by Andrew Morton

Subject: Re: [PATCH 00/11 v5] update page table walker

On Mon, 10 Feb 2014 16:44:25 -0500 Naoya Horiguchi <[email protected]> wrote:

> This is ver.5 of page table walker patchset.

text data bss dec hex filename
882373 264146 757256 1903775 1d0c9f mm/built-in.o (before)
881205 264146 757128 1902479 1d078f mm/built-in.o (after)

That worked. But it adds 15 lines to mm/*.[ch] ;)

2014-02-12 05:39:57

by Joonsoo Kim

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

On Mon, Feb 10, 2014 at 04:44:26PM -0500, Naoya Horiguchi wrote:
> This patch updates mm/pagewalk.c to make the code less complex and more maintainable.
> The basic idea is unchanged and there's no userspace visible effect.
>
> Most of existing callback functions need access to vma to handle each entry.
> So we had better add a new member vma in struct mm_walk instead of using
> mm_walk->private, which makes code simpler.
>
> One problem in current page table walker is that we check vma in pgd loop.
> Historically this was introduced to support hugetlbfs in the strange manner.
> It's better and cleaner to do the vma check outside pgd loop.
>
> Another problem is that many users of page table walker now use only
> pmd_entry(), although it does both the pmd walk and the pte walk. This causes code
> duplication and variation among callers, which worsens the maintainability.
>
> One difficulty of code sharing is that the callers want to determine
> whether they try to walk over a specific vma or not in their own way.
> To solve this, this patch introduces test_walk() callback.
>
> When we try to use multiple callbacks in different levels, skip control is
> also important. For example we have thp enabled in normal configuration, and
> we are interested in doing some work for a thp. But sometimes we want to
> split it and handle as normal pages, and in another time user would handle
> both at pmd level and pte level.
> What we need is that when we've done pmd_entry() we want to decide whether
> to go down to pte level handling based on the pmd_entry()'s result. So this
> patch introduces a skip control flag in mm_walk.
> We can't use the returned value for this purpose, because we already
> defined the meaning of whole range of returned values (>0 is to terminate
> page table walk in caller's specific manner, =0 is to continue to walk,
> and <0 is to abort the walk in the general manner.)
>
> ChangeLog v5:
> - fix build error ("mm/pagewalk.c:201: error: 'hmask' undeclared")
>
> ChangeLog v4:
> - add more comment
> - remove verbose variable in walk_page_test()
> - rename skip_check to skip_lower_level_walking
> - rebased onto mmotm-2014-01-09-16-23
>
> ChangeLog v3:
> - rebased onto v3.13-rc3-mmots-2013-12-10-16-38
>
> ChangeLog v2:
> - rebase onto mmots
> - add pte_none() check in walk_pte_range()
> - add cond_resched() in walk_hugetlb_range()
> - add skip_check()
> - do VM_PFNMAP check only when ->test_walk() is not defined (because some
> caller could handle VM_PFNMAP vma. copy_page_range() is an example.)
> - use do-while condition (addr < end) instead of (addr != end)
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---
> include/linux/mm.h | 18 ++-
> mm/pagewalk.c | 352 +++++++++++++++++++++++++++++++++--------------------
> 2 files changed, 235 insertions(+), 135 deletions(-)
>
> diff --git v3.14-rc2.orig/include/linux/mm.h v3.14-rc2/include/linux/mm.h
> index f28f46eade6a..4d0bc01de43c 100644
> --- v3.14-rc2.orig/include/linux/mm.h
> +++ v3.14-rc2/include/linux/mm.h
> @@ -1067,10 +1067,18 @@ void unmap_vmas(struct mmu_gather *tlb, struct vm_area_struct *start_vma,
> * @pte_entry: if set, called for each non-empty PTE (4th-level) entry
> * @pte_hole: if set, called for each hole at all levels
> * @hugetlb_entry: if set, called for each hugetlb entry
> - * *Caution*: The caller must hold mmap_sem() if @hugetlb_entry
> - * is used.
> + * @test_walk: caller specific callback function to determine whether
> + * we walk over the current vma or not. A positive returned
> + * value means "do page table walk over the current vma,"
> + * and a negative one means "abort current page table walk
> + * right now." 0 means "skip the current vma."
> + * @mm: mm_struct representing the target process of page table walk
> + * @vma: vma currently walked
> + * @skip: internal control flag which is set when we skip the lower
> + * level entries.
> + * @private: private data for callbacks' use
> *
> - * (see walk_page_range for more details)
> + * (see the comment on walk_page_range() for more details)
> */
> struct mm_walk {
> int (*pgd_entry)(pgd_t *pgd, unsigned long addr,
> @@ -1086,7 +1094,11 @@ struct mm_walk {
> int (*hugetlb_entry)(pte_t *pte, unsigned long hmask,
> unsigned long addr, unsigned long next,
> struct mm_walk *walk);
> + int (*test_walk)(unsigned long addr, unsigned long next,
> + struct mm_walk *walk);
> struct mm_struct *mm;
> + struct vm_area_struct *vma;
> + int skip;
> void *private;
> };
>
> diff --git v3.14-rc2.orig/mm/pagewalk.c v3.14-rc2/mm/pagewalk.c
> index 2beeabf502c5..4770558feea8 100644
> --- v3.14-rc2.orig/mm/pagewalk.c
> +++ v3.14-rc2/mm/pagewalk.c
> @@ -3,29 +3,58 @@
> #include <linux/sched.h>
> #include <linux/hugetlb.h>
>
> -static int walk_pte_range(pmd_t *pmd, unsigned long addr, unsigned long end,
> - struct mm_walk *walk)
> +/*
> + * Check the current skip status of page table walker.
> + *
> + * Here what I mean by skip is to skip lower level walking, and that was
> + * determined for each entry independently. For example, when walk_pmd_range
> + * handles a pmd_trans_huge we don't have to walk over ptes under that pmd,
> + * and the skipping does not affect the walking over ptes under other pmds.
> + * That's why we reset @walk->skip after tested.
> + */
> +static bool skip_lower_level_walking(struct mm_walk *walk)
> +{
> + if (walk->skip) {
> + walk->skip = 0;
> + return true;
> + }
> + return false;
> +}
> +
> +static int walk_pte_range(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> {
> + struct mm_struct *mm = walk->mm;
> pte_t *pte;
> + pte_t *orig_pte;
> + spinlock_t *ptl;
> int err = 0;
>
> - pte = pte_offset_map(pmd, addr);
> - for (;;) {
> + orig_pte = pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
> + do {
> + if (pte_none(*pte)) {
> + if (walk->pte_hole)
> + err = walk->pte_hole(addr, addr + PAGE_SIZE,
> + walk);
> + if (err)
> + break;
> + continue;

Hello, Naoya.

I know that this is late for review, but I have an opinion about this.

How about removing the walk->pte_hole() function pointer and the related code from
the generic walker? walk->pte_hole() is only used by task_mmu.c, and maintaining
pte_hole code only for task_mmu.c just gives us maintenance overhead and hurts the
readability of the generic code. By removing it, we get a simpler generic walker.

We can implement it without pte_hole() in the generic walker, like below:

walk->dont_skip_hole = 1;
if (pte_none(*pte) && !walk->dont_skip_hole)
	continue;

Then call the proper entry callback function, which can handle the pte_hole cases itself.
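For what it's worth, that dont_skip_hole scheme could be modeled in userspace like this (a toy sketch of the idea only; the names are made up and nothing here is kernel code):

```c
#include <assert.h>

struct walk {
	int dont_skip_hole;	/* caller opts in to seeing empty entries */
	int entries, holes_seen;
};

/*
 * With no separate pte_hole() callback, a caller that cares about
 * holes receives empty entries through the normal pte_entry() path.
 */
static int pte_entry(int pte_present, struct walk *w)
{
	if (!pte_present)
		w->holes_seen++;	/* caller handles the hole itself */
	else
		w->entries++;
	return 0;
}

static void walk_pte_range(const int *ptes, int n, struct walk *w)
{
	int i;

	for (i = 0; i < n; i++) {
		if (!ptes[i] && !w->dont_skip_hole)
			continue;	/* generic walker skips holes */
		pte_entry(ptes[i], w);
	}
}
```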

> + }
> + /*
> + * Callers should have their own way to handle swap entries
> + * in walk->pte_entry().
> + */
> err = walk->pte_entry(pte, addr, addr + PAGE_SIZE, walk);
> if (err)
> break;
> - addr += PAGE_SIZE;
> - if (addr == end)
> - break;
> - pte++;
> - }
> -
> - pte_unmap(pte);
> - return err;
> + } while (pte++, addr += PAGE_SIZE, addr < end);
> + pte_unmap_unlock(orig_pte, ptl);
> + cond_resched();
> + return addr == end ? 0 : err;
> }
>
> -static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> - struct mm_walk *walk)
> +static int walk_pmd_range(pud_t *pud, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> {
> pmd_t *pmd;
> unsigned long next;
> @@ -35,6 +64,7 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> do {
> again:
> next = pmd_addr_end(addr, end);
> +
> if (pmd_none(*pmd)) {
> if (walk->pte_hole)
> err = walk->pte_hole(addr, next, walk);
> @@ -42,35 +72,32 @@ static int walk_pmd_range(pud_t *pud, unsigned long addr, unsigned long end,
> break;
> continue;
> }
> - /*
> - * This implies that each ->pmd_entry() handler
> - * needs to know about pmd_trans_huge() pmds
> - */
> - if (walk->pmd_entry)
> - err = walk->pmd_entry(pmd, addr, next, walk);
> - if (err)
> - break;
>
> - /*
> - * Check this here so we only break down trans_huge
> - * pages when we _need_ to
> - */
> - if (!walk->pte_entry)
> - continue;
> + if (walk->pmd_entry) {
> + err = walk->pmd_entry(pmd, addr, next, walk);
> + if (skip_lower_level_walking(walk))
> + continue;
> + if (err)
> + break;
> + }
>
> - split_huge_page_pmd_mm(walk->mm, addr, pmd);
> - if (pmd_none_or_trans_huge_or_clear_bad(pmd))
> - goto again;
> - err = walk_pte_range(pmd, addr, next, walk);
> - if (err)
> - break;
> - } while (pmd++, addr = next, addr != end);
> + if (walk->pte_entry) {
> + if (walk->vma) {
> + split_huge_page_pmd(walk->vma, addr, pmd);
> + if (pmd_trans_unstable(pmd))
> + goto again;
> + }
> + err = walk_pte_range(pmd, addr, next, walk);
> + if (err)
> + break;
> + }
> + } while (pmd++, addr = next, addr < end);
>
> return err;
> }
>
> -static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> - struct mm_walk *walk)
> +static int walk_pud_range(pgd_t *pgd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> {
> pud_t *pud;
> unsigned long next;
> @@ -79,6 +106,7 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> pud = pud_offset(pgd, addr);
> do {
> next = pud_addr_end(addr, end);
> +
> if (pud_none_or_clear_bad(pud)) {
> if (walk->pte_hole)
> err = walk->pte_hole(addr, next, walk);
> @@ -86,13 +114,58 @@ static int walk_pud_range(pgd_t *pgd, unsigned long addr, unsigned long end,
> break;
> continue;
> }
> - if (walk->pud_entry)
> +
> + if (walk->pud_entry) {
> err = walk->pud_entry(pud, addr, next, walk);
> - if (!err && (walk->pmd_entry || walk->pte_entry))
> + if (skip_lower_level_walking(walk))
> + continue;
> + if (err)
> + break;

Why do you check skip_lower_level_walking() prior to the err check?
I looked through all the patches roughly and found that this doesn't cause any
problem, since err is 0 whenever walk->skip = 1. But checking err first would
be better.
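The point can be shown with a toy model (hypothetical code, not the kernel's;
only the two check orders are compared): if a callback ever returned an error
and set walk->skip at the same time, the skip-first order would silently
swallow the error.

```c
#include <assert.h>

struct toy_walk { int skip; };

/* Models skip_lower_level_walking(): test and reset the flag. */
static int skip_lower(struct toy_walk *walk)
{
	if (walk->skip) {
		walk->skip = 0;
		return 1;
	}
	return 0;
}

/* err-first order: an error always propagates to the caller. */
static int dispatch_err_first(struct toy_walk *walk, int err)
{
	if (err)
		return err;		/* abort the walk */
	if (skip_lower(walk))
		return 0;		/* skip lower levels, keep walking */
	return 0;
}

/* skip-first order (as in the patch): err is lost when skip is set. */
static int dispatch_skip_first(struct toy_walk *walk, int err)
{
	if (skip_lower(walk))
		return 0;		/* error, if any, is swallowed here */
	if (err)
		return err;
	return 0;
}
```

Today no callback does both at once, so the orders behave identically, but the
err-first order is the defensive one.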

Thanks.

2014-02-20 23:48:02

by Sasha Levin

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

Hi Naoya,

This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
but here's the spew:

[ 281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 281.651577] IP: [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[ 281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
[ 281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 281.653869] Dumping ftrace buffer:
[ 281.654430] (ftrace buffer empty)
[ 281.654975] Modules linked in:
[ 281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G W
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[ 281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
[ 281.658503] RIP: 0010:[<ffffffff811a31fc>] [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[ 281.660025] RSP: 0018:ffff880424349ab8 EFLAGS: 00010002
[ 281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
[ 281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
[ 281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
[ 281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 281.660761] FS: 00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
[ 281.660761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
[ 281.660761] Stack:
[ 281.660761] ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
[ 281.660761] 00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
[ 281.660761] ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
[ 281.660761] Call Trace:
[ 281.660761] [<ffffffff81180695>] ? sched_clock_local+0x25/0x90
[ 281.660761] [<ffffffff81180915>] ? sched_clock_cpu+0xc5/0x110
[ 281.660761] [<ffffffff811a3842>] lock_acquire+0x182/0x1d0
[ 281.660761] [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
[ 281.660761] [<ffffffff811a3daa>] ? __lock_release+0x1da/0x1f0
[ 281.660761] [<ffffffff8438ae5b>] _raw_spin_lock+0x3b/0x70
[ 281.660761] [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
[ 281.660761] [<ffffffff812990d8>] walk_pte_range+0xb8/0x170
[ 281.660761] [<ffffffff812993a1>] walk_pmd_range+0x211/0x240
[ 281.660761] [<ffffffff812994fb>] walk_pud_range+0x12b/0x160
[ 281.660761] [<ffffffff81299639>] walk_pgd_range+0x109/0x140
[ 281.660761] [<ffffffff812996a5>] __walk_page_range+0x35/0x40
[ 281.660761] [<ffffffff81299862>] walk_page_range+0xf2/0x130
[ 281.660761] [<ffffffff812a8ccc>] queue_pages_range+0x6c/0x90
[ 281.660761] [<ffffffff812a8d80>] ? queue_pages_hugetlb+0x90/0x90
[ 281.660761] [<ffffffff812a8cf0>] ? queue_pages_range+0x90/0x90
[ 281.660761] [<ffffffff812a8f50>] ? change_prot_numa+0x30/0x30
[ 281.660761] [<ffffffff812ac9f1>] do_mbind+0x311/0x330
[ 281.660761] [<ffffffff811815c1>] ? vtime_account_user+0x91/0xa0
[ 281.660761] [<ffffffff8124f1a8>] ? context_tracking_user_exit+0xa8/0x1c0
[ 281.660761] [<ffffffff812aca99>] SYSC_mbind+0x89/0xb0
[ 281.660761] [<ffffffff812acac9>] SyS_mbind+0x9/0x10
[ 281.660761] [<ffffffff84395360>] tracesys+0xdd/0xe2
[ 281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c
04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83
fe 01 77 0c 89
[ 281.660761] RIP [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
[ 281.660761] RSP <ffff880424349ab8>
[ 281.660761] CR2: 0000000000000018
[ 281.660761] ---[ end trace b6e188d329664196 ]---

Thanks,
Sasha

2014-02-21 04:32:03

by Sasha Levin

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

On 02/20/2014 06:47 PM, Sasha Levin wrote:
> Hi Naoya,
>
> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
> but here's the spew:
>
> [ 281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
> [ 281.651577] IP: [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [ 281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
> [ 281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
> [ 281.653869] Dumping ftrace buffer:
> [ 281.654430] (ftrace buffer empty)
> [ 281.654975] Modules linked in:
> [ 281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G W
> 3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
> [ 281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
> [ 281.658503] RIP: 0010:[<ffffffff811a31fc>] [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [ 281.660025] RSP: 0018:ffff880424349ab8 EFLAGS: 00010002
> [ 281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
> [ 281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
> [ 281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
> [ 281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
> [ 281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
> [ 281.660761] FS: 00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
> [ 281.660761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> [ 281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
> [ 281.660761] Stack:
> [ 281.660761] ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
> [ 281.660761] 00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
> [ 281.660761] ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
> [ 281.660761] Call Trace:
> [ 281.660761] [<ffffffff81180695>] ? sched_clock_local+0x25/0x90
> [ 281.660761] [<ffffffff81180915>] ? sched_clock_cpu+0xc5/0x110
> [ 281.660761] [<ffffffff811a3842>] lock_acquire+0x182/0x1d0
> [ 281.660761] [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [ 281.660761] [<ffffffff811a3daa>] ? __lock_release+0x1da/0x1f0
> [ 281.660761] [<ffffffff8438ae5b>] _raw_spin_lock+0x3b/0x70
> [ 281.660761] [<ffffffff812990d8>] ? walk_pte_range+0xb8/0x170
> [ 281.660761] [<ffffffff812990d8>] walk_pte_range+0xb8/0x170
> [ 281.660761] [<ffffffff812993a1>] walk_pmd_range+0x211/0x240
> [ 281.660761] [<ffffffff812994fb>] walk_pud_range+0x12b/0x160
> [ 281.660761] [<ffffffff81299639>] walk_pgd_range+0x109/0x140
> [ 281.660761] [<ffffffff812996a5>] __walk_page_range+0x35/0x40
> [ 281.660761] [<ffffffff81299862>] walk_page_range+0xf2/0x130
> [ 281.660761] [<ffffffff812a8ccc>] queue_pages_range+0x6c/0x90
> [ 281.660761] [<ffffffff812a8d80>] ? queue_pages_hugetlb+0x90/0x90
> [ 281.660761] [<ffffffff812a8cf0>] ? queue_pages_range+0x90/0x90
> [ 281.660761] [<ffffffff812a8f50>] ? change_prot_numa+0x30/0x30
> [ 281.660761] [<ffffffff812ac9f1>] do_mbind+0x311/0x330
> [ 281.660761] [<ffffffff811815c1>] ? vtime_account_user+0x91/0xa0
> [ 281.660761] [<ffffffff8124f1a8>] ? context_tracking_user_exit+0xa8/0x1c0
> [ 281.660761] [<ffffffff812aca99>] SYSC_mbind+0x89/0xb0
> [ 281.660761] [<ffffffff812acac9>] SyS_mbind+0x9/0x10
> [ 281.660761] [<ffffffff84395360>] tracesys+0xdd/0xe2
> [ 281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c
> 04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83
> fe 01 77 0c 89
> [ 281.660761] RIP [<ffffffff811a31fc>] __lock_acquire+0xbc/0x580
> [ 281.660761] RSP <ffff880424349ab8>
> [ 281.660761] CR2: 0000000000000018
> [ 281.660761] ---[ end trace b6e188d329664196 ]---

Out of curiosity, I'm testing out a new piece of code to make decoding this dump a bit easier. Let
me know if it helped at all. Lines are based on -next from today:

[ 281.650503] BUG: unable to handle kernel NULL pointer dereference at 0000000000000018
[ 281.651577] IP: [<kernel/locking/lockdep.c:3069>] __lock_acquire+0xbc/0x580
[ 281.652453] PGD 40b88d067 PUD 40b88c067 PMD 0
[ 281.653143] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 281.653869] Dumping ftrace buffer:
[ 281.654430] (ftrace buffer empty)
[ 281.654975] Modules linked in:
[ 281.655441] CPU: 4 PID: 12314 Comm: trinity-c361 Tainted: G W
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[ 281.657622] task: ffff8804242ab000 ti: ffff880424348000 task.ti: ffff880424348000
[ 281.658503] RIP: 0010:[<kernel/locking/lockdep.c:3069>] [<kernel/locking/lockdep.c:3069>]
__lock_acquire+0xbc/0x580
[ 281.660025] RSP: 0018:ffff880424349ab8 EFLAGS: 00010002
[ 281.660761] RAX: 0000000000000086 RBX: 0000000000000018 RCX: 0000000000000000
[ 281.660761] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000018
[ 281.660761] RBP: ffff880424349b28 R08: 0000000000000001 R09: 0000000000000000
[ 281.660761] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8804242ab000
[ 281.660761] R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000001
[ 281.660761] FS: 00007f36534b0700(0000) GS:ffff88052bc00000(0000) knlGS:0000000000000000
[ 281.660761] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 281.660761] CR2: 0000000000000018 CR3: 000000040b88e000 CR4: 00000000000006e0
[ 281.660761] Stack:
[ 281.660761] ffff880424349ae8 ffffffff81180695 ffff8804242ab038 0000000000000004
[ 281.660761] 00000000001d8500 ffff88052bdd8500 ffff880424349b18 ffffffff81180915
[ 281.660761] ffffffff876a68b0 ffff8804242ab000 0000000000000000 0000000000000001
[ 281.660761] Call Trace:
[ 281.660761] [<kernel/sched/clock.c:206>] ? sched_clock_local+0x25/0x90
[ 281.660761] [<arch/x86/include/asm/preempt.h:98 kernel/sched/clock.c:312>] ?
sched_clock_cpu+0xc5/0x110
[ 281.660761] [<arch/x86/include/asm/current.h:14 kernel/locking/lockdep.c:3602>]
lock_acquire+0x182/0x1d0
[ 281.660761] [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] ? walk_pte_range+0xb8/0x170
[ 281.660761] [<kernel/locking/lockdep.c:3506>] ? __lock_release+0x1da/0x1f0
[ 281.660761] [<include/linux/spinlock_api_smp.h:143 kernel/locking/spinlock.c:151>]
_raw_spin_lock+0x3b/0x70
[ 281.660761] [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] ? walk_pte_range+0xb8/0x170
[ 281.660761] [<include/linux/spinlock.h:303 mm/pagewalk.c:33>] walk_pte_range+0xb8/0x170
[ 281.660761] [<mm/pagewalk.c:90>] walk_pmd_range+0x211/0x240
[ 281.660761] [<mm/pagewalk.c:128>] walk_pud_range+0x12b/0x160
[ 281.660761] [<mm/pagewalk.c:165>] walk_pgd_range+0x109/0x140
[ 281.660761] [<mm/pagewalk.c:259>] __walk_page_range+0x35/0x40
[ 281.660761] [<mm/pagewalk.c:332>] walk_page_range+0xf2/0x130
[ 281.660761] [<mm/mempolicy.c:637>] queue_pages_range+0x6c/0x90
[ 281.660761] [<mm/mempolicy.c:492>] ? queue_pages_hugetlb+0x90/0x90
[ 281.660761] [<mm/mempolicy.c:521>] ? queue_pages_range+0x90/0x90
[ 281.660761] [<mm/mempolicy.c:573>] ? change_prot_numa+0x30/0x30
[ 281.660761] [<mm/mempolicy.c:1241>] do_mbind+0x311/0x330
[ 281.660761] [<kernel/sched/cputime.c:681>] ? vtime_account_user+0x91/0xa0
[ 281.660761] [<arch/x86/include/asm/atomic.h:26 include/linux/jump_label.h:148
include/trace/events/context_tracking.h:47 kernel/context_tracking.c:178>] ?
context_tracking_user_exit+0xa8/0x1c0
[ 281.660761] [<mm/mempolicy.c:1356>] SYSC_mbind+0x89/0xb0
[ 281.660761] [<mm/mempolicy.c:1340>] SyS_mbind+0x9/0x10
[ 281.660761] [<arch/x86/kernel/entry_64.S:749>] tracesys+0xdd/0xe2
[ 281.660761] Code: c2 04 47 49 85 be fa 0b 00 00 48 c7 c7 bb 85 49 85 e8 d9 7b f9 ff 31 c0 e9 9c
04 00 00 66 90 44 8b 1d a9 b8 ac 04 45 85 db 74 0c <48> 81 3b 40 61 3f 87 75 06 0f 1f 00 45 31 c0 83
fe 01 77 0c 89
[ 281.660761] RIP [<kernel/locking/lockdep.c:3069>] __lock_acquire+0xbc/0x580
[ 281.660761] RSP <ffff880424349ab8>
[ 281.660761] CR2: 0000000000000018
[ 281.660761] ---[ end trace b6e188d329664196 ]---


Thanks,
Sasha

2014-02-21 06:31:03

by Sasha Levin

Subject: Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()

On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
> queue_pages_range() does page table walking in its own way now,
> so this patch rewrites it with walk_page_range().
> One difficulty was that queue_pages_range() needed to check vmas
> to determine whether we queue pages from a given vma or skip it.
> Now we have test_walk() callback in mm_walk for that purpose,
> so we can do the replacement cleanly. queue_pages_test_walk()
> depends on not only the current vma but also the previous one,
> so we use queue_pages->prev to keep it.
>
> ChangeLog v2:
> - rebase onto mmots
> - add VM_PFNMAP check on queue_pages_test_walk()
>
> Signed-off-by: Naoya Horiguchi <[email protected]>
> ---

Hi Naoya,

I'm seeing another spew in today's -next, and it seems to be related to this patch. Here's the spew
(with line numbers instead of kernel addresses):


[ 1411.889835] kernel BUG at mm/hugetlb.c:3580!
[ 1411.890108] invalid opcode: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[ 1411.890468] Dumping ftrace buffer:
[ 1411.890468] (ftrace buffer empty)
[ 1411.890468] Modules linked in:
[ 1411.890468] CPU: 0 PID: 2653 Comm: trinity-c285 Tainted: G W
3.14.0-rc3-next-20140220-sasha-00008-gab7e7ac-dirty #113
[ 1411.890468] task: ffff8801be0cb000 ti: ffff8801e471c000 task.ti: ffff8801e471c000
[ 1411.890468] RIP: 0010:[<mm/hugetlb.c:3580>] [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
[ 1411.890468] RSP: 0018:ffff8801e471dae8 EFLAGS: 00010246
[ 1411.890468] RAX: ffff88012b900000 RBX: ffffea0000000000 RCX: 0000000000000000
[ 1411.890468] RDX: 0000000000000000 RSI: ffff8801be0cbd00 RDI: 0000000000000000
[ 1411.890468] RBP: ffff8801e471daf8 R08: 0000000000000000 R09: 0000000000000000
[ 1411.890468] R10: 0000000000000001 R11: 0000000000000001 R12: ffff8801e471dcf8
[ 1411.890468] R13: ffffffff87d39120 R14: ffff8801e471dbc8 R15: 00007f30b1800000
[ 1411.890468] FS: 00007f30b50bb700(0000) GS:ffff88012bc00000(0000) knlGS:0000000000000000
[ 1411.890468] CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
[ 1411.890468] CR2: 0000000001609a10 CR3: 00000001e4703000 CR4: 00000000000006f0
[ 1411.890468] Stack:
[ 1411.890468] 00007f30b1000000 00007f30b0e00000 ffff8801e471db08 ffffffff812a8d71
[ 1411.890468] ffff8801e471db78 ffffffff81298fb1 00007f30b0d00000 ffff880478a16c38
[ 1411.890468] ffff8802291c6060 ffffffffffe00000 ffffffffffe00000 ffff8804fd7fa7d0
[ 1411.890468] Call Trace:
[ 1411.890468] [<mm/mempolicy.c:540>] queue_pages_hugetlb+0x81/0x90
[ 1411.890468] [<include/linux/spinlock.h:343 mm/pagewalk.c:203>] walk_hugetlb_range+0x111/0x180
[ 1411.890468] [<mm/pagewalk.c:254>] __walk_page_range+0x25/0x40
[ 1411.890468] [<mm/pagewalk.c:332>] walk_page_range+0xf2/0x130
[ 1411.890468] [<mm/mempolicy.c:637>] queue_pages_range+0x6c/0x90
[ 1411.890468] [<mm/mempolicy.c:492>] ? queue_pages_hugetlb+0x90/0x90
[ 1411.890468] [<mm/mempolicy.c:521>] ? queue_pages_range+0x90/0x90
[ 1411.890468] [<mm/mempolicy.c:573>] ? change_prot_numa+0x30/0x30
[ 1411.890468] [<mm/mempolicy.c:1004>] migrate_to_node+0x77/0xc0
[ 1411.890468] [<mm/mempolicy.c:1110>] do_migrate_pages+0x1a8/0x230
[ 1411.890468] [<mm/mempolicy.c:1461>] SYSC_migrate_pages+0x316/0x380
[ 1411.890468] [<include/linux/rcupdate.h:799 mm/mempolicy.c:1407>] ? SYSC_migrate_pages+0xac/0x380
[ 1411.890468] [<kernel/sched/cputime.c:681>] ? vtime_account_user+0x91/0xa0
[ 1411.890468] [<mm/mempolicy.c:1381>] SyS_migrate_pages+0x9/0x10
[ 1411.890468] [<arch/x86/ia32/ia32entry.S:430>] ia32_do_call+0x13/0x13
[ 1411.890468] Code: 4c 8b 6d f8 c9 c3 66 0f 1f 84 00 00 00 00 00 55 48 89 e5 41 54 49 89 f4 53 48
89 fb 48 8b 07 f6 c4 40 75 13 31 f6 e8 84 48 fb ff <0f> 0b 66 90 eb fe 66 0f 1f 44 00 00 8b 4f 1c 48
8d 77 1c 85 c9
[ 1411.890468] RIP [<mm/hugetlb.c:3580>] isolate_huge_page+0x1c/0xb0
[ 1411.890468] RSP <ffff8801e471dae8>


Thanks,
Sasha

2014-02-21 06:43:51

by Sasha Levin

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

On 02/20/2014 10:20 PM, Naoya Horiguchi wrote:
> Hi Sasha,
>
> On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
>> Hi Naoya,
>>
>> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
>> but here's the spew:
>
> Thanks for reporting.
> I'm not sure what caused this bug from the kernel message. But in my guessing,
> it seems that the NULL pointer is deep inside lockdep routine __lock_acquire(),
> so if we find out which pointer was NULL, it might be useful to bisect which
> the problem is (page table walker or lockdep, or both.)

This actually points to walk_pte_range() trying to lock a NULL spinlock. It happens when we call
pte_offset_map_lock() and get a NULL ptl out of pte_lockptr().

> BTW, just from curiosity, in my build environment many of the kernel functions
> are inlined, so should not be shown in kernel message. But in your report
> we can see the symbols like walk_pte_range() and __lock_acquire() which never
> appear in my kernel. How did you do it? I turned off CONFIG_OPTIMIZE_INLINING,
> but didn't make it.

I'm really not sure. I've got a bunch of debug options enabled and it just seems to do the trick.

Try CONFIG_READABLE_ASM maybe?


Thanks,
Sasha

2014-02-21 16:51:25

by Sasha Levin

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

On 02/21/2014 11:35 AM, Naoya Horiguchi wrote:
> On Fri, Feb 21, 2014 at 01:43:20AM -0500, Sasha Levin wrote:
>> On 02/20/2014 10:20 PM, Naoya Horiguchi wrote:
>>> Hi Sasha,
>>>
>>> On Thu, Feb 20, 2014 at 06:47:56PM -0500, Sasha Levin wrote:
>>>> Hi Naoya,
>>>>
>>>> This patch seems to trigger a NULL ptr deref here. I didn't have a change to look into it yet
>>>> but here's the spew:
>>>
>>> Thanks for reporting.
>>> I'm not sure what caused this bug from the kernel message. But in my guessing,
>>> it seems that the NULL pointer is deep inside lockdep routine __lock_acquire(),
>>> so if we find out which pointer was NULL, it might be useful to bisect which
>>> the problem is (page table walker or lockdep, or both.)
>>
>> This actually points to walk_pte_range() trying to lock a NULL spinlock. It happens when we call
>> pte_offset_map_lock() and get a NULL ptl out of pte_lockptr().
>
> I don't think page->ptl was NULL, because if so we would hit the NULL pointer
> dereference outside __lock_acquire() (it's dereferenced in __raw_spin_lock()).
> Maybe page->ptl->dep_map was NULL. I'll dig into it more to find out how we
> failed to set this dep_map thing.

I don't see __raw_spin_lock() dereferencing it before calling __lock_acquire():

static inline void __raw_spin_lock(raw_spinlock_t *lock)
{
	preempt_disable();
	spin_acquire(&lock->dep_map, 0, 0, _RET_IP_);
	LOCK_CONTENDED(lock, do_raw_spin_trylock, do_raw_spin_lock);
}

So after we disable preemption, spin_acquire() is basically a macro that ends up pointing to
lock_acquire().

__raw_spin_lock() would dereference 'lock' only after the lockdep call.

>>> BTW, just from curiosity, in my build environment many of the kernel functions
>>> are inlined, so should not be shown in kernel message. But in your report
>>> we can see the symbols like walk_pte_range() and __lock_acquire() which never
>>> appear in my kernel. How did you do it? I turned off CONFIG_OPTIMIZE_INLINING,
>>> but didn't make it.
>>
>> I'm really not sure. I've got a bunch of debug options enabled and it just seems to do the trick.
>>
>> Try CONFIG_READABLE_ASM maybe?
>
> Hmm, it makes no change, can I have your config?

Sure, attached.


Thanks,
Sasha


Attachments:
config.gz (38.50 kB)

2014-02-21 17:18:49

by Sasha Levin

[permalink] [raw]
Subject: Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()

On 02/21/2014 11:58 AM, Naoya Horiguchi wrote:
> Hi Sasha,
>
> On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
>> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
>>> queue_pages_range() does page table walking in its own way now,
>>> so this patch rewrites it with walk_page_range().
>>> One difficulty was that queue_pages_range() needed to check vmas
>>> to determine whether we queue pages from a given vma or skip it.
>>> Now we have test_walk() callback in mm_walk for that purpose,
>>> so we can do the replacement cleanly. queue_pages_test_walk()
>>> depends on not only the current vma but also the previous one,
>>> so we use queue_pages->prev to keep it.
>>>
>>> ChangeLog v2:
>>> - rebase onto mmots
>>> - add VM_PFNMAP check on queue_pages_test_walk()
>>>
>>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>> ---
>>
>> Hi Naoya,
>>
>> I'm seeing another spew in today's -next, and it seems to be related
>> to this patch. Here's the spew (with line numbers instead of kernel
>> addresses):
>
> Thanks. (line numbers translation is very helpful.)
>
> This bug looks strange to me.
> "kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
> for !PageHead page. But the caller queue_pages_hugetlb() gets the page
> with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!
>
> mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect to
> have dump_page output at this point, is that in your kernel log?

This is usually a sign of a race between that code and thp splitting, see
https://lkml.org/lkml/2013/12/23/457 for example.

I forgot to add the dump_page output to my extraction process and the complete logs are all long gone.
I'll grab it when it happens again.


Thanks,
Sasha

2014-02-23 13:05:44

by Sasha Levin

Subject: Re: [PATCH 11/11] mempolicy: apply page table walker on queue_pages_range()

On 02/21/2014 12:25 PM, Naoya Horiguchi wrote:
> On Fri, Feb 21, 2014 at 12:18:11PM -0500, Sasha Levin wrote:
>> On 02/21/2014 11:58 AM, Naoya Horiguchi wrote:
>>> On Fri, Feb 21, 2014 at 01:30:53AM -0500, Sasha Levin wrote:
>>>> On 02/10/2014 04:44 PM, Naoya Horiguchi wrote:
>>>>> queue_pages_range() does page table walking in its own way now,
>>>>> so this patch rewrites it with walk_page_range().
>>>>> One difficulty was that queue_pages_range() needed to check vmas
>>>>> to determine whether we queue pages from a given vma or skip it.
>>>>> Now we have test_walk() callback in mm_walk for that purpose,
>>>>> so we can do the replacement cleanly. queue_pages_test_walk()
>>>>> depends on not only the current vma but also the previous one,
>>>>> so we use queue_pages->prev to keep it.
>>>>>
>>>>> ChangeLog v2:
>>>>> - rebase onto mmots
>>>>> - add VM_PFNMAP check on queue_pages_test_walk()
>>>>>
>>>>> Signed-off-by: Naoya Horiguchi <[email protected]>
>>>>> ---
>>>>
>>>> Hi Naoya,
>>>>
>>>> I'm seeing another spew in today's -next, and it seems to be related
>>>> to this patch. Here's the spew (with line numbers instead of kernel
>>>> addresses):
>>>
>>> Thanks. (line numbers translation is very helpful.)
>>>
>>> This bug looks strange to me.
>>> "kernel BUG at mm/hugetlb.c:3580" means we try to do isolate_huge_page()
>>> for !PageHead page. But the caller queue_pages_hugetlb() gets the page
>>> with "page = pte_page(huge_ptep_get(pte))", so it should be the head page!
>>>
>>> mm/hugetlb.c:3580 is VM_BUG_ON_PAGE(!PageHead(page), page), so we expect to
>>> have dump_page output at this point, is that in your kernel log?
>>
>> This is usually a sign of a race between that code and thp splitting, see
>> https://lkml.org/lkml/2013/12/23/457 for example.
>
> queue_pages_hugetlb() is for hugetlbfs, not for thp, so I don't think that
> it's related to thp splitting, but I agree it's a race.
>
>> I forgot to add the dump_page output to my extraction process and the complete logs all long gone.
>> I'll grab it when it happens again.
>
> Thank you. It'll be useful.

And here it is:

[ 755.524966] page:ffffea0000000000 count:0 mapcount:1 mapping: (null) index:0x0
[ 755.526067] page flags: 0x0()

Followed by the same stack trace as before.


Thanks,
Sasha

2014-06-02 23:49:21

by Dave Hansen

Subject: Re: [PATCH 01/11] pagewalk: update page table walker core

On 02/10/2014 01:44 PM, Naoya Horiguchi wrote:
> When we try to use multiple callbacks in different levels, skip control is
> also important. For example we have thp enabled in normal configuration, and
> we are interested in doing some work for a thp. But sometimes we want to
> split it and handle as normal pages, and in another time user would handle
> both at pmd level and pte level.
> What we need is that when we've done pmd_entry() we want to decide whether
> to go down to pte level handling based on the pmd_entry()'s result. So this
> patch introduces a skip control flag in mm_walk.
> We can't use the returned value for this purpose, because we already
> defined the meaning of whole range of returned values (>0 is to terminate
> page table walk in caller's specific manner, =0 is to continue to walk,
> and <0 is to abort the walk in the general manner.)

This seems a bit complicated for a case which doesn't exist in practice
in the kernel today. We don't even *have* a single ->pte_entry handler.
Everybody just sets ->pmd_entry and does the splitting and handling of
individual pte entries in there. The only reason it's needed is because
of the later patches in the series, which is kinda goofy.

I'm biased, but I think the abstraction here is done in the wrong place.

Naoya, could you take a look at the new handler I proposed? Would
that help make this simpler?
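For reference, the existing pattern Dave describes — a single ->pmd_entry
handler that handles a huge entry at pmd level and otherwise walks the ptes
itself, with no separate ->pte_entry callback — can be modeled with a toy
walker (hypothetical code, not the kernel API; all names here are made up):

```c
#include <assert.h>

#define PTRS_PER_TOY_PMD 4

/* Toy pmd: either one huge mapping or an array of ptes (nonzero = present). */
struct toy_pmd {
	int huge;
	int ptes[PTRS_PER_TOY_PMD];
};

/*
 * smaps-style handler: when the entry is huge, account the whole range at
 * pmd level and skip the lower level; otherwise iterate the ptes in place.
 */
static int smaps_style_pmd_entry(const struct toy_pmd *pmd, int *pages)
{
	if (pmd->huge) {
		*pages += PTRS_PER_TOY_PMD;	/* account the huge page at once */
		return 0;			/* no lower-level walk needed */
	}
	for (int i = 0; i < PTRS_PER_TOY_PMD; i++)
		if (pmd->ptes[i])
			(*pages)++;		/* account each present pte */
	return 0;
}
```

In this pattern the skip decision lives inside the one handler, which is why
the generic walker never needed a separate skip flag for these callers.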