2011-04-27 23:36:31

by Stephen Wilson

Subject: [PATCH 0/8] avoid allocation in show_numa_map()

Recently a concern was raised[1] that performing an allocation while holding a
reference on a task's mm could lead to a stalemate in the oom killer. The
concern was specific to the goings-on in /proc. Hugh Dickins stated the issue
thusly:

...imagine what happens if the system is out of memory, and the mm
we're looking at is selected for killing by the OOM killer: while we
wait in __get_free_page for more memory, no memory is freed from the
selected mm because it cannot reach exit_mmap while we hold that
reference.

The primary goal of this series is to eliminate repeated allocation/free cycles
currently happening in show_numa_maps() while we hold a reference to an mm.

The strategy is to perform the allocation once when /proc/pid/numa_maps is
opened, before a reference on the target task's mm is taken.
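
Concretely, the end state (patch 8) embeds the statistics buffer in the
per-open private data, roughly:

        struct numa_maps_private {
                struct proc_maps_private proc_maps;
                struct numa_maps md;    /* reused across show_numa_map() calls */
        };

so each show() invocation only needs a memset() rather than a
kzalloc()/kfree() pair while the mm reference is held.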

Unfortunately, show_numa_maps() is implemented in mm/mempolicy.c while the
primary procfs implementation lives in fs/proc/task_mmu.c. This makes
clean cooperation between show_numa_maps() and the other seq_file operations
(start(), stop(), etc) difficult.
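
For reference, the show() callback is currently declared extern in
fs/proc/task_mmu.c while its definition lives a subsystem away in
mm/mempolicy.c, roughly:

        /* fs/proc/task_mmu.c */
        extern int show_numa_map(struct seq_file *m, void *v);

        static const struct seq_operations proc_pid_numa_maps_op = {
                .start  = m_start,
                .next   = m_next,
                .stop   = m_stop,
                .show   = show_numa_map,        /* defined in mm/mempolicy.c */
        };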


Patches 1-5 convert show_numa_maps() to use the generic walk_page_range()
functionality instead of the mempolicy.c specific page table walking logic.
Also, get_vma_policy() is exported. This makes the show_numa_maps()
implementation independent of mempolicy.c.

Patch 6 moves show_numa_maps() and supporting routines over to
fs/proc/task_mmu.c.

Finally, patches 7 and 8 provide minor cleanup and eliminate the troublesome
allocation.


Please note that moving show_numa_maps() into fs/proc/task_mmu.c essentially
reverts 1a75a6c825 and 48fce3429d. Also, please see the discussion at [2]. My
main justifications for moving the code back into task_mmu.c are:

- Having the show() operation "miles away" from the corresponding
seq_file iteration operations is a maintenance burden.

- The need to export ad hoc info like struct proc_maps_private is
eliminated.


These patches are based on v2.6.39-rc5.


Please note that this series is VERY LIGHTLY TESTED. I have been using
CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
another week or two.


--
steve


[1] lkml.org/lkml/2011/4/25/496
[2] marc.info/?t=113149255100001&r=1&w=2


Stephen Wilson (8):
mm: export get_vma_policy()
mm: use walk_page_range() instead of custom page table walking code
mm: remove MPOL_MF_STATS
mm: make gather_stats() type-safe and remove forward declaration
mm: remove check_huge_range()
mm: proc: move show_numa_map() to fs/proc/task_mmu.c
proc: make struct proc_maps_private truly private
proc: allocate storage for numa_maps statistics once


fs/proc/internal.h | 8 ++
fs/proc/task_mmu.c | 190 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/mempolicy.h | 3 +
include/linux/proc_fs.h | 8 --
mm/mempolicy.c | 164 +--------------------------------------
5 files changed, 200 insertions(+), 173 deletions(-)


2011-04-27 23:36:43

by Stephen Wilson

Subject: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code

In the specific case of show_numa_map(), the custom page table walking
logic implemented in mempolicy.c does not provide any special service
beyond that provided by walk_page_range().

Also, converting show_numa_map() to use the generic routine decouples
the function from mempolicy.c, allowing it to be moved out of the mm
subsystem and into fs/proc.
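
In sketch form, the conversion below amounts to registering two callbacks
on a struct mm_walk and handing the vma range to walk_page_range():

        struct mm_walk walk = {
                .pte_entry     = gather_pte_stats,      /* called for each pte in the range */
                .hugetlb_entry = gather_hugetbl_stats,  /* called for hugetlb mappings */
                .private       = md,
                .mm            = vma->vm_mm,
        };

        walk_page_range(vma->vm_start, vma->vm_end, &walk);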

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
1 files changed, 46 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5bfb03e..dfe27e3 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
md->node[page_to_nid(page)]++;
}

+static int gather_pte_stats(pte_t *pte, unsigned long addr,
+ unsigned long pte_size, struct mm_walk *walk)
+{
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ gather_stats(page, walk->private, pte_dirty(*pte));
+ return 0;
+}
+
#ifdef CONFIG_HUGETLB_PAGE
static void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
@@ -2597,12 +2613,35 @@ static void check_huge_range(struct vm_area_struct *vma,
gather_stats(page, md, pte_dirty(*ptep));
}
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ gather_stats(page, walk->private, pte_dirty(*pte));
+ return 0;
+}
+
#else
static inline void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct numa_maps *md)
{
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
#endif

/*
@@ -2615,6 +2654,7 @@ int show_numa_map(struct seq_file *m, void *v)
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
struct mempolicy *pol;
int n;
char buffer[50];
@@ -2626,6 +2666,11 @@ int show_numa_map(struct seq_file *m, void *v)
if (!md)
return 0;

+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pte_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
pol = get_vma_policy(priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
@@ -2642,13 +2687,7 @@ int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " stack");
}

- if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
- seq_printf(m, " huge");
- } else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
- }
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);

if (!md->pages)
goto out;
--
1.7.3.5

2011-04-27 23:36:54

by Stephen Wilson

Subject: [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration

Improve the prototype of gather_stats() to take a struct numa_maps as
argument instead of a generic void *. Update all callers to make the
required type explicit.

Since gather_stats() is not needed before its definition and is
scheduled to be moved out of mempolicy.c, the forward declaration is
removed as well.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/mempolicy.c | 12 +++++++-----
1 files changed, 7 insertions(+), 5 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 63c0d69..d4c0b8d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -456,7 +456,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
},
};

-static void gather_stats(struct page *, void *, int pte_dirty);
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags);

@@ -2538,9 +2537,8 @@ struct numa_maps {
unsigned long node[MAX_NUMNODES];
};

-static void gather_stats(struct page *page, void *private, int pte_dirty)
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
{
- struct numa_maps *md = private;
int count = page_mapcount(page);

md->pages++;
@@ -2568,6 +2566,7 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
static int gather_pte_stats(pte_t *pte, unsigned long addr,
unsigned long pte_size, struct mm_walk *walk)
{
+ struct numa_maps *md;
struct page *page;
int nid;

@@ -2582,7 +2581,8 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
return 0;

- gather_stats(page, walk->private, pte_dirty(*pte));
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
return 0;
}

@@ -2619,6 +2619,7 @@ static void check_huge_range(struct vm_area_struct *vma,
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
+ struct numa_maps *md;
struct page *page;

if (pte_none(*pte))
@@ -2628,7 +2629,8 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
if (!page)
return 0;

- gather_stats(page, walk->private, pte_dirty(*pte));
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
return 0;
}

--
1.7.3.5

2011-04-27 23:36:36

by Stephen Wilson

Subject: [PATCH 1/8] mm: export get_vma_policy()

In commit 48fce3429df84a94766fbbc845fa8450d0715b48 get_vma_policy() was
marked static as all clients were local to mempolicy.c.

However, the decision to generate /proc/pid/numa_maps in the NUMA memory
policy code rather than in the procfs subsystem introduces an artificial
interdependency between the two systems. Exporting get_vma_policy()
once again is the first step toward cleaning up this interdependency.

Signed-off-by: Stephen Wilson <[email protected]>
---
include/linux/mempolicy.h | 3 +++
mm/mempolicy.c | 2 +-
2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 31ac26c..c2f6032 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
unsigned long idx);

+struct mempolicy *get_vma_policy(struct task_struct *tsk,
+ struct vm_area_struct *vma, unsigned long addr);
+
extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 959a8b8..5bfb03e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1489,7 +1489,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
* freeing by another task. It is the caller's responsibility to free the
* extra reference for shared policies.
*/
-static struct mempolicy *get_vma_policy(struct task_struct *task,
+struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
--
1.7.3.5

2011-04-27 23:37:11

by Stephen Wilson

Subject: [PATCH 5/8] mm: remove check_huge_range()

This function has been superseded by gather_hugetbl_stats() and is no
longer needed.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/mempolicy.c | 35 -----------------------------------
1 files changed, 0 insertions(+), 35 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d4c0b8d..c5a4342 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2587,35 +2587,6 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void check_huge_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct numa_maps *md)
-{
- unsigned long addr;
- struct page *page;
- struct hstate *h = hstate_vma(vma);
- unsigned long sz = huge_page_size(h);
-
- for (addr = start; addr < end; addr += sz) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm,
- addr & huge_page_mask(h));
- pte_t pte;
-
- if (!ptep)
- continue;
-
- pte = *ptep;
- if (pte_none(pte))
- continue;
-
- page = pte_page(pte);
- if (!page)
- continue;
-
- gather_stats(page, md, pte_dirty(*ptep));
- }
-}
-
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
@@ -2635,12 +2606,6 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
}

#else
-static inline void check_huge_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct numa_maps *md)
-{
-}
-
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
--
1.7.3.5

2011-04-27 23:37:21

by Stephen Wilson

Subject: [PATCH 3/8] mm: remove MPOL_MF_STATS

Mapping statistics in a NUMA environment are now computed using the
generic walk_page_range() logic. Remove the old/equivalent
functionality.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/mempolicy.c | 10 ++++++----
1 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index dfe27e3..63c0d69 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -99,7 +99,6 @@
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
-#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */

static struct kmem_cache *policy_cache;
static struct kmem_cache *sn_cache;
@@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
continue;

- if (flags & MPOL_MF_STATS)
- gather_stats(page, private, pte_dirty(*pte));
- else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
migrate_page_add(page, private, flags);
else
break;
@@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
unsigned long pte_size, struct mm_walk *walk)
{
struct page *page;
+ int nid;

if (pte_none(*pte))
return 0;
@@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
if (!page)
return 0;

+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ return 0;
+
gather_stats(page, walk->private, pte_dirty(*pte));
return 0;
}
--
1.7.3.5

2011-04-27 23:37:50

by Stephen Wilson

Subject: [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c

Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.

- Having the show() operation "miles away" from the corresponding
seq_file iteration operations is a maintenance burden.

- The need to export ad hoc info like struct proc_maps_private is
eliminated.

- The implementation of show_numa_map() can be improved in a simple
manner by cooperating with the other seq_file operations (start,
stop, etc) -- something that would be messy to do without this
change.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/task_mmu.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++-
mm/mempolicy.c | 168 ---------------------------------------------------
2 files changed, 168 insertions(+), 170 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 2e7addf..9f069d2 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -856,8 +856,174 @@ const struct file_operations proc_pagemap_operations = {
#endif /* CONFIG_PROC_PAGE_MONITOR */

#ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);

+struct numa_maps {
+ unsigned long pages;
+ unsigned long anon;
+ unsigned long active;
+ unsigned long writeback;
+ unsigned long mapcount_max;
+ unsigned long dirty;
+ unsigned long swapcache;
+ unsigned long node[MAX_NUMNODES];
+};
+
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
+{
+ int count = page_mapcount(page);
+
+ md->pages++;
+ if (pte_dirty || PageDirty(page))
+ md->dirty++;
+
+ if (PageSwapCache(page))
+ md->swapcache++;
+
+ if (PageActive(page) || PageUnevictable(page))
+ md->active++;
+
+ if (PageWriteback(page))
+ md->writeback++;
+
+ if (PageAnon(page))
+ md->anon++;
+
+ if (count > md->mapcount_max)
+ md->mapcount_max = count;
+
+ md->node[page_to_nid(page)]++;
+}
+
+static int gather_pte_stats(pte_t *pte, unsigned long addr,
+ unsigned long pte_size, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ struct page *page;
+ int nid;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ return 0;
+
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
+ return 0;
+}
+
+#ifdef CONFIG_HUGETLB_PAGE
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
+ return 0;
+}
+
+#else
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
+#endif
+
+/*
+ * Display pages allocated per node and memory policy via /proc.
+ */
+static int show_numa_map(struct seq_file *m, void *v)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+ struct numa_maps *md;
+ struct file *file = vma->vm_file;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
+ struct mempolicy *pol;
+ int n;
+ char buffer[50];
+
+ if (!mm)
+ return 0;
+
+ md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
+ if (!md)
+ return 0;
+
+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pte_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol, 0);
+ mpol_cond_put(pol);
+
+ seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+
+ if (file) {
+ seq_printf(m, " file=");
+ seq_path(m, &file->f_path, "\n\t= ");
+ } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
+ seq_printf(m, " heap");
+ } else if (vma->vm_start <= mm->start_stack &&
+ vma->vm_end >= mm->start_stack) {
+ seq_printf(m, " stack");
+ }
+
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);
+
+ if (!md->pages)
+ goto out;
+
+ if (md->anon)
+ seq_printf(m, " anon=%lu", md->anon);
+
+ if (md->dirty)
+ seq_printf(m, " dirty=%lu", md->dirty);
+
+ if (md->pages != md->anon && md->pages != md->dirty)
+ seq_printf(m, " mapped=%lu", md->pages);
+
+ if (md->mapcount_max > 1)
+ seq_printf(m, " mapmax=%lu", md->mapcount_max);
+
+ if (md->swapcache)
+ seq_printf(m, " swapcache=%lu", md->swapcache);
+
+ if (md->active < md->pages && !is_vm_hugetlb_page(vma))
+ seq_printf(m, " active=%lu", md->active);
+
+ if (md->writeback)
+ seq_printf(m, " writeback=%lu", md->writeback);
+
+ for_each_node_state(n, N_HIGH_MEMORY)
+ if (md->node[n])
+ seq_printf(m, " N%d=%lu", n, md->node[n]);
+out:
+ seq_putc(m, '\n');
+ kfree(md);
+
+ if (m->count < m->size)
+ m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+ return 0;
+}
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -876,4 +1042,4 @@ const struct file_operations proc_numa_maps_operations = {
.llseek = seq_lseek,
.release = seq_release_private,
};
-#endif
+#endif /* CONFIG_NUMA */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c5a4342..e7fb9d2 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2525,171 +2525,3 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
}
return p - buffer;
}
-
-struct numa_maps {
- unsigned long pages;
- unsigned long anon;
- unsigned long active;
- unsigned long writeback;
- unsigned long mapcount_max;
- unsigned long dirty;
- unsigned long swapcache;
- unsigned long node[MAX_NUMNODES];
-};
-
-static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
-{
- int count = page_mapcount(page);
-
- md->pages++;
- if (pte_dirty || PageDirty(page))
- md->dirty++;
-
- if (PageSwapCache(page))
- md->swapcache++;
-
- if (PageActive(page) || PageUnevictable(page))
- md->active++;
-
- if (PageWriteback(page))
- md->writeback++;
-
- if (PageAnon(page))
- md->anon++;
-
- if (count > md->mapcount_max)
- md->mapcount_max = count;
-
- md->node[page_to_nid(page)]++;
-}
-
-static int gather_pte_stats(pte_t *pte, unsigned long addr,
- unsigned long pte_size, struct mm_walk *walk)
-{
- struct numa_maps *md;
- struct page *page;
- int nid;
-
- if (pte_none(*pte))
- return 0;
-
- page = pte_page(*pte);
- if (!page)
- return 0;
-
- nid = page_to_nid(page);
- if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
- return 0;
-
- md = walk->private;
- gather_stats(page, md, pte_dirty(*pte));
- return 0;
-}
-
-#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
- struct numa_maps *md;
- struct page *page;
-
- if (pte_none(*pte))
- return 0;
-
- page = pte_page(*pte);
- if (!page)
- return 0;
-
- md = walk->private;
- gather_stats(page, md, pte_dirty(*pte));
- return 0;
-}
-
-#else
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
- return 0;
-}
-#endif
-
-/*
- * Display pages allocated per node and memory policy via /proc.
- */
-int show_numa_map(struct seq_file *m, void *v)
-{
- struct proc_maps_private *priv = m->private;
- struct vm_area_struct *vma = v;
- struct numa_maps *md;
- struct file *file = vma->vm_file;
- struct mm_struct *mm = vma->vm_mm;
- struct mm_walk walk = {};
- struct mempolicy *pol;
- int n;
- char buffer[50];
-
- if (!mm)
- return 0;
-
- md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
- if (!md)
- return 0;
-
- walk.hugetlb_entry = gather_hugetbl_stats;
- walk.pte_entry = gather_pte_stats;
- walk.private = md;
- walk.mm = mm;
-
- pol = get_vma_policy(priv->task, vma, vma->vm_start);
- mpol_to_str(buffer, sizeof(buffer), pol, 0);
- mpol_cond_put(pol);
-
- seq_printf(m, "%08lx %s", vma->vm_start, buffer);
-
- if (file) {
- seq_printf(m, " file=");
- seq_path(m, &file->f_path, "\n\t= ");
- } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
- seq_printf(m, " heap");
- } else if (vma->vm_start <= mm->start_stack &&
- vma->vm_end >= mm->start_stack) {
- seq_printf(m, " stack");
- }
-
- walk_page_range(vma->vm_start, vma->vm_end, &walk);
-
- if (!md->pages)
- goto out;
-
- if (md->anon)
- seq_printf(m," anon=%lu",md->anon);
-
- if (md->dirty)
- seq_printf(m," dirty=%lu",md->dirty);
-
- if (md->pages != md->anon && md->pages != md->dirty)
- seq_printf(m, " mapped=%lu", md->pages);
-
- if (md->mapcount_max > 1)
- seq_printf(m, " mapmax=%lu", md->mapcount_max);
-
- if (md->swapcache)
- seq_printf(m," swapcache=%lu", md->swapcache);
-
- if (md->active < md->pages && !is_vm_hugetlb_page(vma))
- seq_printf(m," active=%lu", md->active);
-
- if (md->writeback)
- seq_printf(m," writeback=%lu", md->writeback);
-
- for_each_node_state(n, N_HIGH_MEMORY)
- if (md->node[n])
- seq_printf(m, " N%d=%lu", n, md->node[n]);
-out:
- seq_putc(m, '\n');
- kfree(md);
-
- if (m->count < m->size)
- m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
- return 0;
-}
--
1.7.3.5

2011-04-27 23:38:08

by Stephen Wilson

Subject: [PATCH 8/8] proc: allocate storage for numa_maps statistics once

In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not
a candidate for stack allocation.
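
To put a rough number on it: with MAX_NUMNODES = 1024 (CONFIG_NODES_SHIFT
of 10) on a 64-bit build, the node[] array alone is 1024 * 8 = 8192 bytes,
on the order of an entire kernel stack.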

Instead of going through a kmalloc()+kfree() cycle each time show_numa_map()
is invoked, perform the allocation just once when /proc/pid/numa_maps is
opened.

Performing the allocation when numa_maps is opened, and thus before a
reference to the target task's mm is taken, eliminates a potential
stalemate condition in the oom-killer as originally described by Hugh
Dickins:

... imagine what happens if the system is out of memory, and the mm
we're looking at is selected for killing by the OOM killer: while
we wait in __get_free_page for more memory, no memory is freed
from the selected mm because it cannot reach exit_mmap while we hold
that reference.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/task_mmu.c | 36 +++++++++++++++++++++++++++---------
1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index 9f069d2..1ca3a00 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -868,6 +868,11 @@ struct numa_maps {
unsigned long node[MAX_NUMNODES];
};

+struct numa_maps_private {
+ struct proc_maps_private proc_maps;
+ struct numa_maps md;
+};
+
static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
{
int count = page_mapcount(page);
@@ -949,9 +954,10 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
*/
static int show_numa_map(struct seq_file *m, void *v)
{
- struct proc_maps_private *priv = m->private;
+ struct numa_maps_private *numa_priv = m->private;
+ struct proc_maps_private *proc_priv = &numa_priv->proc_maps;
struct vm_area_struct *vma = v;
- struct numa_maps *md;
+ struct numa_maps *md = &numa_priv->md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct mm_walk walk = {};
@@ -962,16 +968,15 @@ static int show_numa_map(struct seq_file *m, void *v)
if (!mm)
return 0;

- md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
- if (!md)
- return 0;
+ /* Ensure we start with an empty set of numa_maps statistics. */
+ memset(md, 0, sizeof(*md));

walk.hugetlb_entry = gather_hugetbl_stats;
walk.pte_entry = gather_pte_stats;
walk.private = md;
walk.mm = mm;

- pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);

@@ -1018,12 +1023,12 @@ static int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
seq_putc(m, '\n');
- kfree(md);

if (m->count < m->size)
- m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+ m->version = (vma != proc_priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -1033,7 +1038,20 @@ static const struct seq_operations proc_pid_numa_maps_op = {

static int numa_maps_open(struct inode *inode, struct file *file)
{
- return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+ struct numa_maps_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->proc_maps.pid = proc_pid(inode);
+ ret = seq_open(file, &proc_pid_numa_maps_op);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = priv;
+ } else {
+ kfree(priv);
+ }
+ }
+ return ret;
}

const struct file_operations proc_numa_maps_operations = {
--
1.7.3.5

2011-04-27 23:37:54

by Stephen Wilson

Subject: [PATCH 7/8] proc: make struct proc_maps_private truly private

Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps,
there is no need to export struct proc_maps_private to the world. Move
it to fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <[email protected]>
---
fs/proc/internal.h | 8 ++++++++
include/linux/proc_fs.h | 8 --------
2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index c03e8d3..3c39863 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -119,3 +119,11 @@ struct inode *proc_get_inode(struct super_block *, struct proc_dir_entry *);
*/
int proc_readdir(struct file *, void *, filldir_t);
struct dentry *proc_lookup(struct inode *, struct dentry *, struct nameidata *);
+
+struct proc_maps_private {
+ struct pid *pid;
+ struct task_struct *task;
+#ifdef CONFIG_MMU
+ struct vm_area_struct *tail_vma;
+#endif
+};
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 838c114..55ff594 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -286,12 +286,4 @@ static inline struct net *PDE_NET(struct proc_dir_entry *pde)
return pde->parent->data;
}

-struct proc_maps_private {
- struct pid *pid;
- struct task_struct *task;
-#ifdef CONFIG_MMU
- struct vm_area_struct *tail_vma;
-#endif
-};
-
#endif /* _LINUX_PROC_FS_H */
--
1.7.3.5

2011-05-04 23:10:32

by Andrew Morton

Subject: Re: [PATCH 0/8] avoid allocation in show_numa_map()

On Wed, 27 Apr 2011 19:35:41 -0400
Stephen Wilson <[email protected]> wrote:

> Recently a concern was raised[1] that performing an allocation while holding a
> reference on a tasks mm could lead to a stalemate in the oom killer. The
> concern was specific to the goings-on in /proc. Hugh Dickins stated the issue
> thusly:
>
> ...imagine what happens if the system is out of memory, and the mm
> we're looking at is selected for killing by the OOM killer: while we
> wait in __get_free_page for more memory, no memory is freed from the
> selected mm because it cannot reach exit_mmap while we hold that
> reference.
>
> The primary goal of this series is to eliminate repeated allocation/free cycles
> currently happening in show_numa_maps() while we hold a reference to an mm.
>
> The strategy is to perform the allocation once when /proc/pid/numa_maps is
> opened, before a reference on the target tasks mm is taken.
>
> Unfortunately, show_numa_maps() is implemented in mm/mempolicy.c while the
> primary procfs implementation lives in fs/proc/task_mmu.c. This makes
> clean cooperation between show_numa_maps() and the other seq_file operations
> (start(), stop(), etc) difficult.
>
>
> Patches 1-5 convert show_numa_maps() to use the generic walk_page_range()
> functionality instead of the mempolicy.c specific page table walking logic.
> Also, get_vma_policy() is exported. This makes the show_numa_maps()
> implementation independent of mempolicy.c.
>
> Patch 6 moves show_numa_maps() and supporting routines over to
> fs/proc/task_mmu.c.
>
> Finally, patches 7 and 8 provide minor cleanup and eliminates the troublesome
> allocation.
>
>
> Please note that moving show_numa_maps() into fs/proc/task_mmu.c essentially
> reverts 1a75a6c825 and 48fce3429d. Also, please see the discussion at [2]. My
> main justifications for moving the code back into task_mmu.c is:
>
> - Having the show() operation "miles away" from the corresponding
> seq_file iteration operations is a maintenance burden.
>
> - The need to export ad hoc info like struct proc_maps_private is
> eliminated.
>
>
> These patches are based on v2.6.39-rc5.

The patches look reasonable. It would be nice to get some more review
happening (poke).

>
> Please note that this series is VERY LIGHTLY TESTED. I have been using
> CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
> another week or two.

"lightly tested" evokes fear, but the patches don't look too scary to
me.

Did you look at using apply_to_page_range()?

I'm trying to remember why we're carrying both walk_page_range() and
apply_to_page_range() but can't immediately think of a reason.

There's also an apply_to_page_range_batch() in -mm but that code is
broken on PPC and not much is happening with it, so it will probably go
away again.

2011-05-05 02:37:43

by Stephen Wilson

Subject: Re: [PATCH 0/8] avoid allocation in show_numa_map()

On Wed, May 04, 2011 at 04:10:20PM -0700, Andrew Morton wrote:
> On Wed, 27 Apr 2011 19:35:41 -0400
> Stephen Wilson <[email protected]> wrote:
>
> > Recently a concern was raised[1] that performing an allocation while holding a
> > reference on a tasks mm could lead to a stalemate in the oom killer. The
> > concern was specific to the goings-on in /proc. Hugh Dickins stated the issue
> > thusly:
> >
> > ...imagine what happens if the system is out of memory, and the mm
> > we're looking at is selected for killing by the OOM killer: while we
> > wait in __get_free_page for more memory, no memory is freed from the
> > selected mm because it cannot reach exit_mmap while we hold that
> > reference.
> >
> > The primary goal of this series is to eliminate repeated allocation/free cycles
> > currently happening in show_numa_maps() while we hold a reference to an mm.
> >
> > The strategy is to perform the allocation once when /proc/pid/numa_maps is
> > opened, before a reference on the target tasks mm is taken.
> >
> > Unfortunately, show_numa_maps() is implemented in mm/mempolicy.c while the
> > primary procfs implementation lives in fs/proc/task_mmu.c. This makes
> > clean cooperation between show_numa_maps() and the other seq_file operations
> > (start(), stop(), etc) difficult.
> >
> >
> > Patches 1-5 convert show_numa_maps() to use the generic walk_page_range()
> > functionality instead of the mempolicy.c specific page table walking logic.
> > Also, get_vma_policy() is exported. This makes the show_numa_maps()
> > implementation independent of mempolicy.c.
> >
> > Patch 6 moves show_numa_maps() and supporting routines over to
> > fs/proc/task_mmu.c.
> >
> > Finally, patches 7 and 8 provide minor cleanup and eliminates the troublesome
> > allocation.
> >
> >
> > Please note that moving show_numa_maps() into fs/proc/task_mmu.c essentially
> > reverts 1a75a6c825 and 48fce3429d. Also, please see the discussion at [2]. My
> > main justifications for moving the code back into task_mmu.c is:
> >
> > - Having the show() operation "miles away" from the corresponding
> > seq_file iteration operations is a maintenance burden.
> >
> > - The need to export ad hoc info like struct proc_maps_private is
> > eliminated.
> >
> >
> > These patches are based on v2.6.39-rc5.
>
> The patches look reasonable. It would be nice to get some more review
> happening (poke).

If anyone would like me to resend the series please let me know.

> >
> > Please note that this series is VERY LIGHTLY TESTED. I have been using
> > CONFIG_NUMA_EMU=y thus far as I will not have access to a real NUMA system for
> > another week or two.
>
> "lightly tested" evokes fear, but the patches don't look too scary to
> me.

Indeed. I hope to have some real hardware to test the patches on by
the end of the week; fingers crossed. Will certainly address any
issues that come up at that time.


> Did you look at using apply_to_page_range()?

I did not look into it deeply, no. The main reason for using
walk_page_range() was that it supports hugetlb vma's in the same way as
was done in mempolicy.c's check_huge_range(). The algorithm was a very
natural fit so I ran with it.


> I'm trying to remember why we're carrying both walk_page_range() and
> apply_to_page_range() but can't immediately think of a reason.
>
> There's also an apply_to_page_range_batch() in -mm but that code is
> broken on PPC and not much is happening with it, so it will probably go
> away again.


--
steve

2011-05-09 07:38:58

by KOSAKI Motohiro

Subject: Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code

Hello,

sorry for the long delay.

> In the specific case of show_numa_map(), the custom page table walking
> logic implemented in mempolicy.c does not provide any special service
> beyond that provided by walk_page_range().
>
> Also, converting show_numa_map() to use the generic routine decouples
> the function from mempolicy.c, allowing it to be moved out of the mm
> subsystem and into fs/proc.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> 1 files changed, 46 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5bfb03e..dfe27e3 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> md->node[page_to_nid(page)]++;
> }
>
> +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> + unsigned long pte_size, struct mm_walk *walk)
> +{
> + struct page *page;
> +
> + if (pte_none(*pte))
> + return 0;
> +
> + page = pte_page(*pte);
> + if (!page)
> + return 0;

The original check_pte_range() has the following logic:

orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
do {
struct page *page;
int nid;

if (!pte_present(*pte))
continue;
page = vm_normal_page(vma, addr, *pte);
if (!page)
continue;
/*
* vm_normal_page() filters out zero pages, but there might
* still be PageReserved pages to skip, perhaps in a VDSO.
* And we cannot move PageKsm pages sensibly or safely yet.
*/
if (PageReserved(page) || PageKsm(page))
continue;
gather_stats(page, private, pte_dirty(*pte));

Why did you drop so many of these checks? Is it safe?

The other parts look good to me.

2011-05-09 07:40:04

by KOSAKI Motohiro

Subject: Re: [PATCH 1/8] mm: export get_vma_policy()

> diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
> index 31ac26c..c2f6032 100644
> --- a/include/linux/mempolicy.h
> +++ b/include/linux/mempolicy.h
> @@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
> struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
> unsigned long idx);
>
> +struct mempolicy *get_vma_policy(struct task_struct *tsk,
> + struct vm_area_struct *vma, unsigned long addr);
> +
> extern void numa_default_policy(void);
> extern void numa_policy_init(void);
> extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 959a8b8..5bfb03e 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -1489,7 +1489,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
> * freeing by another task. It is the caller's responsibility to free the
> * extra reference for shared policies.
> */
> -static struct mempolicy *get_vma_policy(struct task_struct *task,
> +struct mempolicy *get_vma_policy(struct task_struct *task,
> struct vm_area_struct *vma, unsigned long addr)

Looks reasonable to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>



2011-05-09 07:44:27

by KOSAKI Motohiro

Subject: Re: [PATCH 3/8] mm: remove MPOL_MF_STATS

> Mapping statistics in a NUMA environment is now computed using the
> generic walk_page_range() logic. Remove the old/equivalent
> functionality.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> mm/mempolicy.c | 10 ++++++----
> 1 files changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index dfe27e3..63c0d69 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -99,7 +99,6 @@
> /* Internal flags */
> #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
> #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
> -#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
>
> static struct kmem_cache *policy_cache;
> static struct kmem_cache *sn_cache;
> @@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
> continue;
>
> - if (flags & MPOL_MF_STATS)
> - gather_stats(page, private, pte_dirty(*pte));
> - else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> migrate_page_add(page, private, flags);
> else
> break;

This hunk looks good to me.


> @@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> unsigned long pte_size, struct mm_walk *walk)
> {
> struct page *page;
> + int nid;
>
> if (pte_none(*pte))
> return 0;
> @@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> if (!page)
> return 0;
>
> + nid = page_to_nid(page);
> + if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> + return 0;
> +
> gather_stats(page, walk->private, pte_dirty(*pte));
> return 0;

However, this hunk should be moved into patch [2/8], because 1) it keeps
bisectability, and 2) the description says "Remove the old/equivalent
functionality" but this adds new functionality.


2011-05-09 07:45:43

by KOSAKI Motohiro

Subject: Re: [PATCH 4/8] mm: make gather_stats() type-safe and remove forward declaration

> Improve the prototype of gather_stats() to take a struct numa_maps as
> argument instead of a generic void *. Update all callers to make the
> required type explicit.
>
> Since gather_stats() is not needed before its definition and is
> scheduled to be moved out of mempolicy.c the declaration is removed as
> well.
>
> Signed-off-by: Stephen Wilson <[email protected]>

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-05-09 07:46:42

by KOSAKI Motohiro

Subject: Re: [PATCH 5/8] mm: remove check_huge_range()

> This function has been superseded by gather_hugetbl_stats() and is no
> longer needed.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> mm/mempolicy.c | 35 -----------------------------------
> 1 files changed, 0 insertions(+), 35 deletions(-)

Looks good.
Reviewed-by: KOSAKI Motohiro <[email protected]>



2011-05-09 07:49:21

by KOSAKI Motohiro

Subject: Re: [PATCH 6/8] mm: proc: move show_numa_map() to fs/proc/task_mmu.c

> Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
> issues.
>
> - Having the show() operation "miles away" from the corresponding
> seq_file iteration operations is a maintenance burden.
>
> - The need to export ad hoc info like struct proc_maps_private is
> eliminated.
>
> - The implementation of show_numa_map() can be improved in a simple
> manner by cooperating with the other seq_file operations (start,
> stop, etc) -- something that would be messy to do without this
> change.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> fs/proc/task_mmu.c | 170 +++++++++++++++++++++++++++++++++++++++++++++++++++-
> mm/mempolicy.c | 168 ---------------------------------------------------
> 2 files changed, 168 insertions(+), 170 deletions(-)

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>



2011-05-09 07:51:13

by KOSAKI Motohiro

Subject: Re: [PATCH 7/8] proc: make struct proc_maps_private truly private

> Now that mm/mempolicy.c is no longer implementing /proc/pid/numa_maps
> there is no need to export struct proc_maps_private to the world. Move
> it to fs/proc/internal.h instead.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> fs/proc/internal.h | 8 ++++++++
> include/linux/proc_fs.h | 8 --------
> 2 files changed, 8 insertions(+), 8 deletions(-)

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>


2011-05-09 08:24:35

by KOSAKI Motohiro

Subject: Re: [PATCH 8/8] proc: allocate storage for numa_maps statistics once

> static int numa_maps_open(struct inode *inode, struct file *file)
> {
> - return do_maps_open(inode, file, &proc_pid_numa_maps_op);
> + struct numa_maps_private *priv;
> + int ret = -ENOMEM;
> + priv = kzalloc(sizeof(*priv), GFP_KERNEL);
> + if (priv) {
> + priv->proc_maps.pid = proc_pid(inode);
> + ret = seq_open(file, &proc_pid_numa_maps_op);
> + if (!ret) {
> + struct seq_file *m = file->private_data;
> + m->private = priv;
> + } else {
> + kfree(priv);
> + }
> + }
> + return ret;
> }

Looks good to me.
Reviewed-by: KOSAKI Motohiro <[email protected]>



2011-05-09 19:37:17

by Stephen Wilson

Subject: Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code

On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> Hello,
>
> sorry for the long delay.

Please, no apologies. Thank you for the review!

> > In the specific case of show_numa_map(), the custom page table walking
> > logic implemented in mempolicy.c does not provide any special service
> > beyond that provided by walk_page_range().
> >
> > Also, converting show_numa_map() to use the generic routine decouples
> > the function from mempolicy.c, allowing it to be moved out of the mm
> > subsystem and into fs/proc.
> >
> > Signed-off-by: Stephen Wilson <[email protected]>
> > ---
> > mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > 1 files changed, 46 insertions(+), 7 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index 5bfb03e..dfe27e3 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> > md->node[page_to_nid(page)]++;
> > }
> >
> > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > + unsigned long pte_size, struct mm_walk *walk)
> > +{
> > + struct page *page;
> > +
> > + if (pte_none(*pte))
> > + return 0;
> > +
> > + page = pte_page(*pte);
> > + if (!page)
> > + return 0;
>
> original check_pte_range() has following logic.
>
> orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> do {
> struct page *page;
> int nid;
>
> if (!pte_present(*pte))
> continue;
> page = vm_normal_page(vma, addr, *pte);
> if (!page)
> continue;
> /*
> * vm_normal_page() filters out zero pages, but there might
> * still be PageReserved pages to skip, perhaps in a VDSO.
> * And we cannot move PageKsm pages sensibly or safely yet.
> */
> if (PageReserved(page) || PageKsm(page))
> continue;
> gather_stats(page, private, pte_dirty(*pte));
>
> Why did you drop a lot of check? Is it safe?

I must have been confused. For one, walk_page_range() does not even
lock the pmd entry when iterating over the ptes. I completely
overlooked that fact, and with that, the series is totally broken.

I am currently testing a slightly reworked set based on the following
variation. When finished I will send v2 of the series which will
address all issues raised so far.

Thanks again for the review!



From 013a1e0fc96f8370339209f16d81df4ded40dbf2 Mon Sep 17 00:00:00 2001
From: Stephen Wilson <[email protected]>
Date: Mon, 9 May 2011 14:39:27 -0400
Subject: [PATCH] mm: use walk_page_range() instead of custom page table
walking code

Converting show_numa_map() to use the generic routine decouples
the function from mempolicy.c, allowing it to be moved out of the mm
subsystem and into fs/proc.

Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk
logic implemented by check_pte_range() failed to account for such pages
as they were not applicable to the page migration case.

Signed-off-by: Stephen Wilson <[email protected]>
---
mm/mempolicy.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 5bfb03e..945e85d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2531,6 +2531,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
}

struct numa_maps {
+ struct vm_area_struct *vma;
unsigned long pages;
unsigned long anon;
unsigned long active;
@@ -2568,6 +2569,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
md->node[page_to_nid(page)]++;
}

+static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ pte_t *pte;
+
+ md = walk->private;
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ do {
+ struct page *page;
+ int nid;
+
+ if (!pte_present(*pte))
+ continue;
+
+ page = vm_normal_page(md->vma, addr, *pte);
+ if (!page)
+ continue;
+
+ if (PageReserved(page))
+ continue;
+
+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ continue;
+
+ gather_stats(page, md, pte_dirty(*pte));
+
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ pte_unmap_unlock(orig_pte, ptl);
+ return 0;
+}
+
#ifdef CONFIG_HUGETLB_PAGE
static void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
@@ -2597,12 +2633,35 @@ static void check_huge_range(struct vm_area_struct *vma,
gather_stats(page, md, pte_dirty(*ptep));
}
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ gather_stats(page, walk->private, pte_dirty(*pte));
+ return 0;
+}
+
#else
static inline void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct numa_maps *md)
{
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
#endif

/*
@@ -2615,6 +2674,7 @@ int show_numa_map(struct seq_file *m, void *v)
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
struct mempolicy *pol;
int n;
char buffer[50];
@@ -2626,6 +2686,13 @@ int show_numa_map(struct seq_file *m, void *v)
if (!md)
return 0;

+ md->vma = vma;
+
+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pmd_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
pol = get_vma_policy(priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
@@ -2642,13 +2709,7 @@ int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " stack");
}

- if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
- seq_printf(m, " huge");
- } else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
- }
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);

if (!md->pages)
goto out;
--
1.7.4.4


2011-05-09 19:39:29

by Stephen Wilson

Subject: Re: [PATCH 3/8] mm: remove MPOL_MF_STATS

On Mon, May 09, 2011 at 04:44:24PM +0900, KOSAKI Motohiro wrote:
> > Mapping statistics in a NUMA environment is now computed using the
> > generic walk_page_range() logic. Remove the old/equivalent
> > functionality.
> >
> > Signed-off-by: Stephen Wilson <[email protected]>
> > ---
> > mm/mempolicy.c | 10 ++++++----
> > 1 files changed, 6 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > index dfe27e3..63c0d69 100644
> > --- a/mm/mempolicy.c
> > +++ b/mm/mempolicy.c
> > @@ -99,7 +99,6 @@
> > /* Internal flags */
> > #define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
> > #define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
> > -#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */
> >
> > static struct kmem_cache *policy_cache;
> > static struct kmem_cache *sn_cache;
> > @@ -492,9 +491,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
> > if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
> > continue;
> >
> > - if (flags & MPOL_MF_STATS)
> > - gather_stats(page, private, pte_dirty(*pte));
> > - else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> > + if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
> > migrate_page_add(page, private, flags);
> > else
> > break;
>
> This hunk looks good to me.
>
>
> > @@ -2572,6 +2569,7 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > unsigned long pte_size, struct mm_walk *walk)
> > {
> > struct page *page;
> > + int nid;
> >
> > if (pte_none(*pte))
> > return 0;
> > @@ -2580,6 +2578,10 @@ static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > if (!page)
> > return 0;
> >
> > + nid = page_to_nid(page);
> > + if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> > + return 0;
> > +
> > gather_stats(page, walk->private, pte_dirty(*pte));
> > return 0;
>
> However this hunk should be moved into patch [2/8]. because 1) keeping
> bisectability 2) The description says "Remove the old/equivalent
> functionality." but it added new functionality.

Absolutely. Will move into the proper location and resend the whole
series.

Thanks again,

--
steve

2011-05-10 00:20:42

by KOSAKI Motohiro

Subject: Re: [PATCH 2/8] mm: use walk_page_range() instead of custom page table walking code

> On Mon, May 09, 2011 at 04:38:49PM +0900, KOSAKI Motohiro wrote:
> > Hello,
> >
> > sorry for the long delay.
>
> Please, no apologies. Thank you for the review!
>
> > > In the specific case of show_numa_map(), the custom page table walking
> > > logic implemented in mempolicy.c does not provide any special service
> > > beyond that provided by walk_page_range().
> > >
> > > Also, converting show_numa_map() to use the generic routine decouples
> > > the function from mempolicy.c, allowing it to be moved out of the mm
> > > subsystem and into fs/proc.
> > >
> > > Signed-off-by: Stephen Wilson <[email protected]>
> > > ---
> > > mm/mempolicy.c | 53 ++++++++++++++++++++++++++++++++++++++++++++++-------
> > > 1 files changed, 46 insertions(+), 7 deletions(-)
> > >
> > > diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> > > index 5bfb03e..dfe27e3 100644
> > > --- a/mm/mempolicy.c
> > > +++ b/mm/mempolicy.c
> > > @@ -2568,6 +2568,22 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> > > md->node[page_to_nid(page)]++;
> > > }
> > >
> > > +static int gather_pte_stats(pte_t *pte, unsigned long addr,
> > > + unsigned long pte_size, struct mm_walk *walk)
> > > +{
> > > + struct page *page;
> > > +
> > > + if (pte_none(*pte))
> > > + return 0;
> > > +
> > > + page = pte_page(*pte);
> > > + if (!page)
> > > + return 0;
> >
> > original check_pte_range() has following logic.
> >
> > orig_pte = pte = pte_offset_map_lock(vma->vm_mm, pmd, addr, &ptl);
> > do {
> > struct page *page;
> > int nid;
> >
> > if (!pte_present(*pte))
> > continue;
> > page = vm_normal_page(vma, addr, *pte);
> > if (!page)
> > continue;
> > /*
> > * vm_normal_page() filters out zero pages, but there might
> > * still be PageReserved pages to skip, perhaps in a VDSO.
> > * And we cannot move PageKsm pages sensibly or safely yet.
> > */
> > if (PageReserved(page) || PageKsm(page))
> > continue;
> > gather_stats(page, private, pte_dirty(*pte));
> >
> > Why did you drop a lot of check? Is it safe?
>
> I must have been confused. For one, walk_page_range() does not even
> lock the pmd entry when iterating over the pte's. I completely
> overlooked that fact and so with that, the series is totally broken.
>
> I am currently testing a slightly reworked set based on the following
> variation. When finished I will send v2 of the series which will
> address all issues raised so far.
>
> Thanks again for the review!
>
>
>
> From 013a1e0fc96f8370339209f16d81df4ded40dbf2 Mon Sep 17 00:00:00 2001
> From: Stephen Wilson <[email protected]>
> Date: Mon, 9 May 2011 14:39:27 -0400
> Subject: [PATCH] mm: use walk_page_range() instead of custom page table
> walking code
>
> Converting show_numa_map() to use the generic routine decouples
> the function from mempolicy.c, allowing it to be moved out of the mm
> subsystem and into fs/proc.
>
> Also, include KSM pages in /proc/pid/numa_maps statistics. The pagewalk
> logic implemented by check_pte_range() failed to account for such pages
> as they were not applicable to the page migration case.

Seems like a very reasonable change.

>
> Signed-off-by: Stephen Wilson <[email protected]>
> ---
> mm/mempolicy.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 files changed, 68 insertions(+), 7 deletions(-)
>
> diff --git a/mm/mempolicy.c b/mm/mempolicy.c
> index 5bfb03e..945e85d 100644
> --- a/mm/mempolicy.c
> +++ b/mm/mempolicy.c
> @@ -2531,6 +2531,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
> }
>
> struct numa_maps {
> + struct vm_area_struct *vma;
> unsigned long pages;
> unsigned long anon;
> unsigned long active;
> @@ -2568,6 +2569,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
> md->node[page_to_nid(page)]++;
> }
>
> +static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
> + unsigned long end, struct mm_walk *walk)
> +{
> + struct numa_maps *md;
> + spinlock_t *ptl;
> + pte_t *orig_pte;
> + pte_t *pte;
> +
> + md = walk->private;
> + orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
> + do {
> + struct page *page;
> + int nid;
> +
> + if (!pte_present(*pte))
> + continue;
> +
> + page = vm_normal_page(md->vma, addr, *pte);
> + if (!page)
> + continue;
> +
> + if (PageReserved(page))
> + continue;
> +
> + nid = page_to_nid(page);
> + if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
> + continue;
> +
> + gather_stats(page, md, pte_dirty(*pte));
> +
> + } while (pte++, addr += PAGE_SIZE, addr != end);
> + pte_unmap_unlock(orig_pte, ptl);
> + return 0;
> +}


Looks completely good.
Reviewed-by: KOSAKI Motohiro <[email protected]>

Thank you for the great work!