2011-05-15 22:21:26

by Stephen Wilson

Subject: [PATCH v2 0/9] avoid allocation in show_numa_map()

Hi all,

This is version 2 of a patch series[1] aimed at removing the repeated
allocation/free cycles that show_numa_map() performs while we hold a reference
to an mm. The concern is that performing an allocation while referencing an mm
could lead to a stalemate in the oom killer, as previously explained by Hugh
Dickins[2].

This series addresses all issues raised in the previous round and is organized
as follows:

Patches 1-6 convert show_numa_map() to use the generic walk_page_range()
functionality instead of the mempolicy.c-specific page table walking logic.
Also, get_vma_policy() and mpol_to_str() are exported, making the
show_numa_map() implementation independent of mempolicy.c.

Patch 7 moves show_numa_map() and its supporting routines over to
fs/proc/task_mmu.c.

Finally, patches 8 and 9 provide minor cleanup and eliminate the troublesome
allocation.


These patches are based on mmotm-2011-05-12-15-52 and have been tested on a
dual node NUMA machine.


Thanks,

--
steve

[1] http://lkml.org/lkml/2011/4/27/578
[2] http://lkml.org/lkml/2011/4/25/496


Changes since v1:
- Fix compilation error when CONFIG_TMPFS=n.

- Traverse PTEs with proper locking and checks.


Stephen Wilson (9):
mm: export get_vma_policy()
mm: use walk_page_range() instead of custom page table walking code
mm: remove MPOL_MF_STATS
mm: make gather_stats() type-safe and remove forward declaration
mm: remove check_huge_range()
mm: declare mpol_to_str() when CONFIG_TMPFS=n
mm: proc: move show_numa_map() to fs/proc/task_mmu.c
proc: make struct proc_maps_private truly private
proc: allocate storage for numa_maps statistics once


fs/proc/internal.h        |    7 ++
fs/proc/task_mmu.c        |  204 ++++++++++++++++++++++++++++++++++++++++++++-
include/linux/mempolicy.h |    7 +-
include/linux/proc_fs.h   |    8 --
mm/mempolicy.c            |  164 +-----------------------------------
5 files changed, 215 insertions(+), 175 deletions(-)


2011-05-15 22:21:43

by Stephen Wilson

Subject: [PATCH v2 1/9] mm: export get_vma_policy()

In commit 48fce3429df84a94766fbbc845fa8450d0715b48, get_vma_policy() was
marked static, as all clients were local to mempolicy.c.

However, the decision to generate /proc/pid/numa_maps in the NUMA memory
policy code rather than in the procfs subsystem introduces an artificial
interdependency between the two systems. Exporting get_vma_policy()
once again is the first step toward cleaning up this interdependency.
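
For illustration, a client outside mempolicy.c would follow the usual
get/put pattern (a minimal sketch; show_policy_for() is a hypothetical
name, but the helper calls are exactly those made by show_numa_map()
later in this series):

    /* Hypothetical example: format a vma's policy from outside
     * mempolicy.c via the newly exported get_vma_policy(). */
    static void show_policy_for(struct task_struct *task,
                                struct vm_area_struct *vma)
    {
            struct mempolicy *pol;
            char buffer[50];

            pol = get_vma_policy(task, vma, vma->vm_start);
            mpol_to_str(buffer, sizeof(buffer), pol, 0);
            mpol_cond_put(pol);  /* drop the extra ref on shared policies */
    }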

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
include/linux/mempolicy.h |    3 +++
mm/mempolicy.c            |    2 +-
2 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index 31ac26c..c2f6032 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -199,6 +199,9 @@ void mpol_free_shared_policy(struct shared_policy *p);
struct mempolicy *mpol_shared_policy_lookup(struct shared_policy *sp,
unsigned long idx);

+struct mempolicy *get_vma_policy(struct task_struct *tsk,
+ struct vm_area_struct *vma, unsigned long addr);
+
extern void numa_default_policy(void);
extern void numa_policy_init(void);
extern void mpol_rebind_task(struct task_struct *tsk, const nodemask_t *new,
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 8e57a72..6cc997d 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -1490,7 +1490,7 @@ asmlinkage long compat_sys_mbind(compat_ulong_t start, compat_ulong_t len,
* freeing by another task. It is the caller's responsibility to free the
* extra reference for shared policies.
*/
-static struct mempolicy *get_vma_policy(struct task_struct *task,
+struct mempolicy *get_vma_policy(struct task_struct *task,
struct vm_area_struct *vma, unsigned long addr)
{
struct mempolicy *pol = task->mempolicy;
--
1.7.4.4

2011-05-15 22:22:10

by Stephen Wilson

Subject: [PATCH v2 2/9] mm: use walk_page_range() instead of custom page table walking code

Converting show_numa_map() to use the generic routine decouples
the function from mempolicy.c, allowing it to be moved out of the mm
subsystem and into fs/proc.

Also, include KSM pages in the /proc/pid/numa_maps statistics. The
page-walk logic implemented by check_pte_range() failed to account for
such pages, as they were not applicable to the page migration case.
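
For reference, the conversion boils down to filling in a struct mm_walk
and letting the generic walker drive the callbacks (a condensed sketch of
the hunks below; note that gather_pte_stats() takes the PTE lock itself
via pte_offset_map_lock(), which is the proper-locking change noted in
the v2 changelog):

    struct mm_walk walk = {};

    walk.hugetlb_entry = gather_hugetbl_stats; /* hugetlb VMAs */
    walk.pmd_entry     = gather_pte_stats;     /* per pmd; iterates PTEs */
    walk.private       = md;                   /* the numa_maps accumulator */
    walk.mm            = vma->vm_mm;

    walk_page_range(vma->vm_start, vma->vm_end, &walk);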

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++-----
1 files changed, 68 insertions(+), 7 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 6cc997d..c894671 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2547,6 +2547,7 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
}

struct numa_maps {
+ struct vm_area_struct *vma;
unsigned long pages;
unsigned long anon;
unsigned long active;
@@ -2584,6 +2585,41 @@ static void gather_stats(struct page *page, void *private, int pte_dirty)
md->node[page_to_nid(page)]++;
}

+static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ pte_t *pte;
+
+ md = walk->private;
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ do {
+ struct page *page;
+ int nid;
+
+ if (!pte_present(*pte))
+ continue;
+
+ page = vm_normal_page(md->vma, addr, *pte);
+ if (!page)
+ continue;
+
+ if (PageReserved(page))
+ continue;
+
+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ continue;
+
+ gather_stats(page, md, pte_dirty(*pte));
+
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ pte_unmap_unlock(orig_pte, ptl);
+ return 0;
+}
+
#ifdef CONFIG_HUGETLB_PAGE
static void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
@@ -2613,12 +2649,35 @@ static void check_huge_range(struct vm_area_struct *vma,
gather_stats(page, md, pte_dirty(*ptep));
}
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ gather_stats(page, walk->private, pte_dirty(*pte));
+ return 0;
+}
+
#else
static inline void check_huge_range(struct vm_area_struct *vma,
unsigned long start, unsigned long end,
struct numa_maps *md)
{
}
+
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
#endif

/*
@@ -2631,6 +2690,7 @@ int show_numa_map(struct seq_file *m, void *v)
struct numa_maps *md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
struct mempolicy *pol;
int n;
char buffer[50];
@@ -2642,6 +2702,13 @@ int show_numa_map(struct seq_file *m, void *v)
if (!md)
return 0;

+ md->vma = vma;
+
+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pmd_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
pol = get_vma_policy(priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);
@@ -2658,13 +2725,7 @@ int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " stack");
}

- if (is_vm_hugetlb_page(vma)) {
- check_huge_range(vma, vma->vm_start, vma->vm_end, md);
- seq_printf(m, " huge");
- } else {
- check_pgd_range(vma, vma->vm_start, vma->vm_end,
- &node_states[N_HIGH_MEMORY], MPOL_MF_STATS, md);
- }
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);

if (!md->pages)
goto out;
--
1.7.4.4

2011-05-15 22:21:52

by Stephen Wilson

Subject: [PATCH v2 3/9] mm: remove MPOL_MF_STATS

Mapping statistics in a NUMA environment are now computed using the
generic walk_page_range() logic. Remove the old, equivalent
functionality.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 5 +----
1 files changed, 1 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index c894671..d43d05e 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -100,7 +100,6 @@
/* Internal flags */
#define MPOL_MF_DISCONTIG_OK (MPOL_MF_INTERNAL << 0) /* Skip checks for continuous vmas */
#define MPOL_MF_INVERT (MPOL_MF_INTERNAL << 1) /* Invert check for nodemask */
-#define MPOL_MF_STATS (MPOL_MF_INTERNAL << 2) /* Gather statistics */

static struct kmem_cache *policy_cache;
static struct kmem_cache *sn_cache;
@@ -493,9 +492,7 @@ static int check_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
if (node_isset(nid, *nodes) == !!(flags & MPOL_MF_INVERT))
continue;

- if (flags & MPOL_MF_STATS)
- gather_stats(page, private, pte_dirty(*pte));
- else if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
+ if (flags & (MPOL_MF_MOVE | MPOL_MF_MOVE_ALL))
migrate_page_add(page, private, flags);
else
break;
--
1.7.4.4

2011-05-15 22:22:43

by Stephen Wilson

Subject: [PATCH v2 4/9] mm: make gather_stats() type-safe and remove forward declaration

Improve the prototype of gather_stats() to take a struct numa_maps as
argument instead of a generic void *. Update all callers to make the
required type explicit.

Since gather_stats() is not needed before its definition, and since it is
scheduled to be moved out of mempolicy.c, the forward declaration is
removed as well.
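
The practical gain: passing the wrong pointer type now draws a compiler
diagnostic instead of silently corrupting memory. A hypothetical misuse:

    /* Before: the void * parameter accepted any pointer, so this
     * compiled cleanly and treated the vma as a struct numa_maps. */
    gather_stats(page, vma, pte_dirty(*pte));

    /* After: the compiler warns about passing an argument from an
     * incompatible pointer type unless a struct numa_maps * -- or a
     * void * such as walk->private -- is passed. */
    gather_stats(page, md, pte_dirty(*pte));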

Signed-off-by: Stephen Wilson <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 8 ++++----
1 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index d43d05e..108f858 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -457,7 +457,6 @@ static const struct mempolicy_operations mpol_ops[MPOL_MAX] = {
},
};

-static void gather_stats(struct page *, void *, int pte_dirty);
static void migrate_page_add(struct page *page, struct list_head *pagelist,
unsigned long flags);

@@ -2555,9 +2554,8 @@ struct numa_maps {
unsigned long node[MAX_NUMNODES];
};

-static void gather_stats(struct page *page, void *private, int pte_dirty)
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
{
- struct numa_maps *md = private;
int count = page_mapcount(page);

md->pages++;
@@ -2650,6 +2648,7 @@ static void check_huge_range(struct vm_area_struct *vma,
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
+ struct numa_maps *md;
struct page *page;

if (pte_none(*pte))
@@ -2659,7 +2658,8 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
if (!page)
return 0;

- gather_stats(page, walk->private, pte_dirty(*pte));
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
return 0;
}

--
1.7.4.4

2011-05-15 22:23:00

by Stephen Wilson

Subject: [PATCH v2 5/9] mm: remove check_huge_range()

This function has been superseded by gather_hugetbl_stats() and is no
longer needed.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
mm/mempolicy.c | 35 -----------------------------------
1 files changed, 0 insertions(+), 35 deletions(-)

diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 108f858..231efc8 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2616,35 +2616,6 @@ static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
}

#ifdef CONFIG_HUGETLB_PAGE
-static void check_huge_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct numa_maps *md)
-{
- unsigned long addr;
- struct page *page;
- struct hstate *h = hstate_vma(vma);
- unsigned long sz = huge_page_size(h);
-
- for (addr = start; addr < end; addr += sz) {
- pte_t *ptep = huge_pte_offset(vma->vm_mm,
- addr & huge_page_mask(h));
- pte_t pte;
-
- if (!ptep)
- continue;
-
- pte = *ptep;
- if (pte_none(pte))
- continue;
-
- page = pte_page(pte);
- if (!page)
- continue;
-
- gather_stats(page, md, pte_dirty(*ptep));
- }
-}
-
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
@@ -2664,12 +2635,6 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
}

#else
-static inline void check_huge_range(struct vm_area_struct *vma,
- unsigned long start, unsigned long end,
- struct numa_maps *md)
-{
-}
-
static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
unsigned long addr, unsigned long end, struct mm_walk *walk)
{
--
1.7.4.4

2011-05-15 22:22:28

by Stephen Wilson

Subject: [PATCH v2 6/9] mm: declare mpol_to_str() when CONFIG_TMPFS=n

When CONFIG_TMPFS=n, mpol_to_str() is not declared in mempolicy.h.
However, in the NUMA case, its definition is always compiled.

Since it is not strictly true that tmpfs is the only client, and since
the symbol was always lurking around anyway, export mpol_to_str()
unconditionally. Furthermore, this will allow us to move
show_numa_map() out of mempolicy.c and into the procfs subsystem.

Signed-off-by: Stephen Wilson <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
Cc: Randy Dunlap <[email protected]>
---
include/linux/mempolicy.h | 4 ++--
1 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/include/linux/mempolicy.h b/include/linux/mempolicy.h
index c2f6032..7978eec 100644
--- a/include/linux/mempolicy.h
+++ b/include/linux/mempolicy.h
@@ -231,10 +231,10 @@ int do_migrate_pages(struct mm_struct *mm,

#ifdef CONFIG_TMPFS
extern int mpol_parse_str(char *str, struct mempolicy **mpol, int no_context);
+#endif

extern int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
int no_context);
-#endif

/* Check if a vma is migratable */
static inline int vma_migratable(struct vm_area_struct *vma)
@@ -371,13 +371,13 @@ static inline int mpol_parse_str(char *str, struct mempolicy **mpol,
{
return 1; /* error */
}
+#endif

static inline int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol,
int no_context)
{
return 0;
}
-#endif

#endif /* CONFIG_NUMA */
#endif /* __KERNEL__ */
--
1.7.4.4

2011-05-15 22:22:53

by Stephen Wilson

Subject: [PATCH v2 7/9] mm: proc: move show_numa_map() to fs/proc/task_mmu.c

Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
issues.

- Having the show() operation "miles away" from the corresponding
seq_file iteration operations is a maintenance burden.

- The need to export ad hoc info like struct proc_maps_private is
eliminated.

- The implementation of show_numa_map() can be improved in a simple
manner by cooperating with the other seq_file operations (start,
stop, etc.) -- something that would be messy to do without this
change. The resulting wiring is sketched below.
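
With show_numa_map() living next to the iteration ops, the numa_maps
interface reads as one self-contained seq_file table (a sketch; the
start/next members are visible in the hunk below, and show_numa_map() is
now the local show op):

    static const struct seq_operations proc_pid_numa_maps_op = {
            .start = m_start,
            .next  = m_next,
            .stop  = m_stop,
            .show  = show_numa_map,
    };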

Signed-off-by: Stephen Wilson <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
fs/proc/task_mmu.c |  184 +++++++++++++++++++++++++++++++++++++++++++++++++++-
mm/mempolicy.c     |  183 ---------------------------------------------------
2 files changed, 182 insertions(+), 185 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index cd58813..fbe5ed9 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -859,8 +859,188 @@ const struct file_operations proc_pagemap_operations = {
#endif /* CONFIG_PROC_PAGE_MONITOR */

#ifdef CONFIG_NUMA
-extern int show_numa_map(struct seq_file *m, void *v);

+struct numa_maps {
+ struct vm_area_struct *vma;
+ unsigned long pages;
+ unsigned long anon;
+ unsigned long active;
+ unsigned long writeback;
+ unsigned long mapcount_max;
+ unsigned long dirty;
+ unsigned long swapcache;
+ unsigned long node[MAX_NUMNODES];
+};
+
+static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
+{
+ int count = page_mapcount(page);
+
+ md->pages++;
+ if (pte_dirty || PageDirty(page))
+ md->dirty++;
+
+ if (PageSwapCache(page))
+ md->swapcache++;
+
+ if (PageActive(page) || PageUnevictable(page))
+ md->active++;
+
+ if (PageWriteback(page))
+ md->writeback++;
+
+ if (PageAnon(page))
+ md->anon++;
+
+ if (count > md->mapcount_max)
+ md->mapcount_max = count;
+
+ md->node[page_to_nid(page)]++;
+}
+
+static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
+ unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ spinlock_t *ptl;
+ pte_t *orig_pte;
+ pte_t *pte;
+
+ md = walk->private;
+ orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
+ do {
+ struct page *page;
+ int nid;
+
+ if (!pte_present(*pte))
+ continue;
+
+ page = vm_normal_page(md->vma, addr, *pte);
+ if (!page)
+ continue;
+
+ if (PageReserved(page))
+ continue;
+
+ nid = page_to_nid(page);
+ if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
+ continue;
+
+ gather_stats(page, md, pte_dirty(*pte));
+
+ } while (pte++, addr += PAGE_SIZE, addr != end);
+ pte_unmap_unlock(orig_pte, ptl);
+ return 0;
+}
+#ifdef CONFIG_HUGETLB_PAGE
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ struct numa_maps *md;
+ struct page *page;
+
+ if (pte_none(*pte))
+ return 0;
+
+ page = pte_page(*pte);
+ if (!page)
+ return 0;
+
+ md = walk->private;
+ gather_stats(page, md, pte_dirty(*pte));
+ return 0;
+}
+
+#else
+static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
+ unsigned long addr, unsigned long end, struct mm_walk *walk)
+{
+ return 0;
+}
+#endif
+
+/*
+ * Display pages allocated per node and memory policy via /proc.
+ */
+static int show_numa_map(struct seq_file *m, void *v)
+{
+ struct proc_maps_private *priv = m->private;
+ struct vm_area_struct *vma = v;
+ struct numa_maps *md;
+ struct file *file = vma->vm_file;
+ struct mm_struct *mm = vma->vm_mm;
+ struct mm_walk walk = {};
+ struct mempolicy *pol;
+ int n;
+ char buffer[50];
+
+ if (!mm)
+ return 0;
+
+ md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
+ if (!md)
+ return 0;
+
+ md->vma = vma;
+
+ walk.hugetlb_entry = gather_hugetbl_stats;
+ walk.pmd_entry = gather_pte_stats;
+ walk.private = md;
+ walk.mm = mm;
+
+ pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ mpol_to_str(buffer, sizeof(buffer), pol, 0);
+ mpol_cond_put(pol);
+
+ seq_printf(m, "%08lx %s", vma->vm_start, buffer);
+
+ if (file) {
+ seq_printf(m, " file=");
+ seq_path(m, &file->f_path, "\n\t= ");
+ } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
+ seq_printf(m, " heap");
+ } else if (vma->vm_start <= mm->start_stack &&
+ vma->vm_end >= mm->start_stack) {
+ seq_printf(m, " stack");
+ }
+
+ walk_page_range(vma->vm_start, vma->vm_end, &walk);
+
+ if (!md->pages)
+ goto out;
+
+ if (md->anon)
+ seq_printf(m, " anon=%lu", md->anon);
+
+ if (md->dirty)
+ seq_printf(m, " dirty=%lu", md->dirty);
+
+ if (md->pages != md->anon && md->pages != md->dirty)
+ seq_printf(m, " mapped=%lu", md->pages);
+
+ if (md->mapcount_max > 1)
+ seq_printf(m, " mapmax=%lu", md->mapcount_max);
+
+ if (md->swapcache)
+ seq_printf(m, " swapcache=%lu", md->swapcache);
+
+ if (md->active < md->pages && !is_vm_hugetlb_page(vma))
+ seq_printf(m, " active=%lu", md->active);
+
+ if (md->writeback)
+ seq_printf(m, " writeback=%lu", md->writeback);
+
+ for_each_node_state(n, N_HIGH_MEMORY)
+ if (md->node[n])
+ seq_printf(m, " N%d=%lu", n, md->node[n]);
+out:
+ seq_putc(m, '\n');
+ kfree(md);
+
+ if (m->count < m->size)
+ m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+ return 0;
+}
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -879,4 +1059,4 @@ const struct file_operations proc_numa_maps_operations = {
.llseek = seq_lseek,
.release = seq_release_private,
};
-#endif
+#endif /* CONFIG_NUMA */
diff --git a/mm/mempolicy.c b/mm/mempolicy.c
index 231efc8..8b57173 100644
--- a/mm/mempolicy.c
+++ b/mm/mempolicy.c
@@ -2541,186 +2541,3 @@ int mpol_to_str(char *buffer, int maxlen, struct mempolicy *pol, int no_context)
}
return p - buffer;
}
-
-struct numa_maps {
- struct vm_area_struct *vma;
- unsigned long pages;
- unsigned long anon;
- unsigned long active;
- unsigned long writeback;
- unsigned long mapcount_max;
- unsigned long dirty;
- unsigned long swapcache;
- unsigned long node[MAX_NUMNODES];
-};
-
-static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
-{
- int count = page_mapcount(page);
-
- md->pages++;
- if (pte_dirty || PageDirty(page))
- md->dirty++;
-
- if (PageSwapCache(page))
- md->swapcache++;
-
- if (PageActive(page) || PageUnevictable(page))
- md->active++;
-
- if (PageWriteback(page))
- md->writeback++;
-
- if (PageAnon(page))
- md->anon++;
-
- if (count > md->mapcount_max)
- md->mapcount_max = count;
-
- md->node[page_to_nid(page)]++;
-}
-
-static int gather_pte_stats(pmd_t *pmd, unsigned long addr,
- unsigned long end, struct mm_walk *walk)
-{
- struct numa_maps *md;
- spinlock_t *ptl;
- pte_t *orig_pte;
- pte_t *pte;
-
- md = walk->private;
- orig_pte = pte = pte_offset_map_lock(walk->mm, pmd, addr, &ptl);
- do {
- struct page *page;
- int nid;
-
- if (!pte_present(*pte))
- continue;
-
- page = vm_normal_page(md->vma, addr, *pte);
- if (!page)
- continue;
-
- if (PageReserved(page))
- continue;
-
- nid = page_to_nid(page);
- if (!node_isset(nid, node_states[N_HIGH_MEMORY]))
- continue;
-
- gather_stats(page, md, pte_dirty(*pte));
-
- } while (pte++, addr += PAGE_SIZE, addr != end);
- pte_unmap_unlock(orig_pte, ptl);
- return 0;
-}
-
-#ifdef CONFIG_HUGETLB_PAGE
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
- struct numa_maps *md;
- struct page *page;
-
- if (pte_none(*pte))
- return 0;
-
- page = pte_page(*pte);
- if (!page)
- return 0;
-
- md = walk->private;
- gather_stats(page, md, pte_dirty(*pte));
- return 0;
-}
-
-#else
-static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
- unsigned long addr, unsigned long end, struct mm_walk *walk)
-{
- return 0;
-}
-#endif
-
-/*
- * Display pages allocated per node and memory policy via /proc.
- */
-int show_numa_map(struct seq_file *m, void *v)
-{
- struct proc_maps_private *priv = m->private;
- struct vm_area_struct *vma = v;
- struct numa_maps *md;
- struct file *file = vma->vm_file;
- struct mm_struct *mm = vma->vm_mm;
- struct mm_walk walk = {};
- struct mempolicy *pol;
- int n;
- char buffer[50];
-
- if (!mm)
- return 0;
-
- md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
- if (!md)
- return 0;
-
- md->vma = vma;
-
- walk.hugetlb_entry = gather_hugetbl_stats;
- walk.pmd_entry = gather_pte_stats;
- walk.private = md;
- walk.mm = mm;
-
- pol = get_vma_policy(priv->task, vma, vma->vm_start);
- mpol_to_str(buffer, sizeof(buffer), pol, 0);
- mpol_cond_put(pol);
-
- seq_printf(m, "%08lx %s", vma->vm_start, buffer);
-
- if (file) {
- seq_printf(m, " file=");
- seq_path(m, &file->f_path, "\n\t= ");
- } else if (vma->vm_start <= mm->brk && vma->vm_end >= mm->start_brk) {
- seq_printf(m, " heap");
- } else if (vma->vm_start <= mm->start_stack &&
- vma->vm_end >= mm->start_stack) {
- seq_printf(m, " stack");
- }
-
- walk_page_range(vma->vm_start, vma->vm_end, &walk);
-
- if (!md->pages)
- goto out;
-
- if (md->anon)
- seq_printf(m," anon=%lu",md->anon);
-
- if (md->dirty)
- seq_printf(m," dirty=%lu",md->dirty);
-
- if (md->pages != md->anon && md->pages != md->dirty)
- seq_printf(m, " mapped=%lu", md->pages);
-
- if (md->mapcount_max > 1)
- seq_printf(m, " mapmax=%lu", md->mapcount_max);
-
- if (md->swapcache)
- seq_printf(m," swapcache=%lu", md->swapcache);
-
- if (md->active < md->pages && !is_vm_hugetlb_page(vma))
- seq_printf(m," active=%lu", md->active);
-
- if (md->writeback)
- seq_printf(m," writeback=%lu", md->writeback);
-
- for_each_node_state(n, N_HIGH_MEMORY)
- if (md->node[n])
- seq_printf(m, " N%d=%lu", n, md->node[n]);
-out:
- seq_putc(m, '\n');
- kfree(md);
-
- if (m->count < m->size)
- m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
- return 0;
-}
--
1.7.4.4

2011-05-15 22:23:51

by Stephen Wilson

Subject: [PATCH v2 8/9] proc: make struct proc_maps_private truly private

Now that mm/mempolicy.c no longer implements /proc/pid/numa_maps, there
is no need to export struct proc_maps_private to the world. Move it to
fs/proc/internal.h instead.

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
fs/proc/internal.h      |    7 +++++++
include/linux/proc_fs.h |    8 --------
2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/fs/proc/internal.h b/fs/proc/internal.h
index 96245a1..360223f 100644
--- a/fs/proc/internal.h
+++ b/fs/proc/internal.h
@@ -137,3 +137,10 @@ int proc_setattr(struct dentry *dentry, struct iattr *attr);
extern const struct inode_operations proc_ns_dir_inode_operations;
extern const struct file_operations proc_ns_dir_operations;

+struct proc_maps_private {
+ struct pid *pid;
+ struct task_struct *task;
+#ifdef CONFIG_MMU
+ struct vm_area_struct *tail_vma;
+#endif
+};
diff --git a/include/linux/proc_fs.h b/include/linux/proc_fs.h
index 5bdfd39..6ba9e31 100644
--- a/include/linux/proc_fs.h
+++ b/include/linux/proc_fs.h
@@ -288,12 +288,4 @@ static inline struct net *PDE_NET(struct proc_dir_entry *pde)
return pde->parent->data;
}

-struct proc_maps_private {
- struct pid *pid;
- struct task_struct *task;
-#ifdef CONFIG_MMU
- struct vm_area_struct *tail_vma;
-#endif
-};
-
#endif /* _LINUX_PROC_FS_H */
--
1.7.4.4

2011-05-15 22:23:17

by Stephen Wilson

Subject: [PATCH v2 9/9] proc: allocate storage for numa_maps statistics once

In show_numa_map() we collect statistics into a numa_maps structure.
Since the number of NUMA nodes can be very large, this structure is not
a candidate for stack allocation.

Instead of going through a kmalloc()+kfree() cycle each time
show_numa_map() is invoked, perform the allocation just once when
/proc/pid/numa_maps is opened.

Performing the allocation when numa_maps is opened, and thus before a
reference to the target task's mm is taken, eliminates a potential
stalemate condition in the oom killer as originally described by Hugh
Dickins:

... imagine what happens if the system is out of memory, and the mm
we're looking at is selected for killing by the OOM killer: while
we wait in __get_free_page for more memory, no memory is freed
from the selected mm because it cannot reach exit_mmap while we hold
that reference.
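
The fix follows the standard container pattern: embed the per-open state
in a wrapper around proc_maps_private, allocate the wrapper once in
->open(), and merely reset the statistics on each ->show(). In sketch
form (mirroring the hunks below):

    struct numa_maps_private {
            struct proc_maps_private proc_maps;
            struct numa_maps md;    /* reused by every show_numa_map() call */
    };

    /* numa_maps_open(): one allocation per open file, made before any
     * reference to the target mm is taken. */
    priv = kzalloc(sizeof(*priv), GFP_KERNEL);

    /* show_numa_map(): no allocation; just clear the accumulator. */
    memset(md, 0, sizeof(*md));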

Signed-off-by: Stephen Wilson <[email protected]>
Reviewed-by: KOSAKI Motohiro <[email protected]>
Cc: Hugh Dickins <[email protected]>
Cc: David Rientjes <[email protected]>
Cc: Lee Schermerhorn <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: Christoph Lameter <[email protected]>
---
fs/proc/task_mmu.c | 36 +++++++++++++++++++++++++++---------
1 files changed, 27 insertions(+), 9 deletions(-)

diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
index fbe5ed9..cccc102 100644
--- a/fs/proc/task_mmu.c
+++ b/fs/proc/task_mmu.c
@@ -872,6 +872,11 @@ struct numa_maps {
unsigned long node[MAX_NUMNODES];
};

+struct numa_maps_private {
+ struct proc_maps_private proc_maps;
+ struct numa_maps md;
+};
+
static void gather_stats(struct page *page, struct numa_maps *md, int pte_dirty)
{
int count = page_mapcount(page);
@@ -964,9 +969,10 @@ static int gather_hugetbl_stats(pte_t *pte, unsigned long hmask,
*/
static int show_numa_map(struct seq_file *m, void *v)
{
- struct proc_maps_private *priv = m->private;
+ struct numa_maps_private *numa_priv = m->private;
+ struct proc_maps_private *proc_priv = &numa_priv->proc_maps;
struct vm_area_struct *vma = v;
- struct numa_maps *md;
+ struct numa_maps *md = &numa_priv->md;
struct file *file = vma->vm_file;
struct mm_struct *mm = vma->vm_mm;
struct mm_walk walk = {};
@@ -977,9 +983,8 @@ static int show_numa_map(struct seq_file *m, void *v)
if (!mm)
return 0;

- md = kzalloc(sizeof(struct numa_maps), GFP_KERNEL);
- if (!md)
- return 0;
+ /* Ensure we start with an empty set of numa_maps statistics. */
+ memset(md, 0, sizeof(*md));

md->vma = vma;

@@ -988,7 +993,7 @@ static int show_numa_map(struct seq_file *m, void *v)
walk.private = md;
walk.mm = mm;

- pol = get_vma_policy(priv->task, vma, vma->vm_start);
+ pol = get_vma_policy(proc_priv->task, vma, vma->vm_start);
mpol_to_str(buffer, sizeof(buffer), pol, 0);
mpol_cond_put(pol);

@@ -1035,12 +1040,12 @@ static int show_numa_map(struct seq_file *m, void *v)
seq_printf(m, " N%d=%lu", n, md->node[n]);
out:
seq_putc(m, '\n');
- kfree(md);

if (m->count < m->size)
- m->version = (vma != priv->tail_vma) ? vma->vm_start : 0;
+ m->version = (vma != proc_priv->tail_vma) ? vma->vm_start : 0;
return 0;
}
+
static const struct seq_operations proc_pid_numa_maps_op = {
.start = m_start,
.next = m_next,
@@ -1050,7 +1055,20 @@ static const struct seq_operations proc_pid_numa_maps_op = {

static int numa_maps_open(struct inode *inode, struct file *file)
{
- return do_maps_open(inode, file, &proc_pid_numa_maps_op);
+ struct numa_maps_private *priv;
+ int ret = -ENOMEM;
+ priv = kzalloc(sizeof(*priv), GFP_KERNEL);
+ if (priv) {
+ priv->proc_maps.pid = proc_pid(inode);
+ ret = seq_open(file, &proc_pid_numa_maps_op);
+ if (!ret) {
+ struct seq_file *m = file->private_data;
+ m->private = priv;
+ } else {
+ kfree(priv);
+ }
+ }
+ return ret;
}

const struct file_operations proc_numa_maps_operations = {
--
1.7.4.4

2011-05-18 00:01:48

by KOSAKI Motohiro

Subject: Re: [PATCH v2 4/9] mm: make gather_stats() type-safe and remove forward declaration

(2011/05/16 7:20), Stephen Wilson wrote:
> Improve the prototype of gather_stats() to take a struct numa_maps as
> argument instead of a generic void *. Update all callers to make the
> required type explicit.
>
> Since gather_stats() is not needed before its definition and is
> scheduled to be moved out of mempolicy.c the declaration is removed as
> well.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Lee Schermerhorn <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Christoph Lameter <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>

2011-05-18 00:02:53

by KOSAKI Motohiro

Subject: Re: [PATCH v2 7/9] mm: proc: move show_numa_map() to fs/proc/task_mmu.c

(2011/05/16 7:20), Stephen Wilson wrote:
> Moving show_numa_map() from mempolicy.c to task_mmu.c solves several
> issues.
>
> - Having the show() operation "miles away" from the corresponding
> seq_file iteration operations is a maintenance burden.
>
> - The need to export ad hoc info like struct proc_maps_private is
> eliminated.
>
> - The implementation of show_numa_map() can be improved in a simple
> manner by cooperating with the other seq_file operations (start,
> stop, etc) -- something that would be messy to do without this
> change.
>
> Signed-off-by: Stephen Wilson <[email protected]>
> Cc: KOSAKI Motohiro <[email protected]>
> Cc: Hugh Dickins <[email protected]>
> Cc: David Rientjes <[email protected]>
> Cc: Lee Schermerhorn <[email protected]>
> Cc: Alexey Dobriyan <[email protected]>
> Cc: Christoph Lameter <[email protected]>

Reviewed-by: KOSAKI Motohiro <[email protected]>