2009-12-15 09:12:14

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 0/5] mm rss counting updates


This is version 3 or 4 of the rss counting series. The RFC tag has been removed.

My purpose is to gather more (rss-related) information per process without
a scalability impact (and to improve the oom-killer, etc.).
The whole patch series is organized as

[1/5] clean-up per mm stat counting.
[2/5] making counter (a bit) more scalable with per-thread counting.
[3/5] adding swap counter per mm
[4/5] adding lowmem detection logic
[5/5] adding lowmem usage counter per mm.

Big changes from the previous version are:
- removed the per-cpu counter; added a per-thread counter
- the synchronization point of the counter is moved to memory.c;
  no hooks into ticks or the scheduler.

Now, this series is not as invasive as the previous ones.

cache-miss/page fault with my benchmark on my box is

[Before patch] 4.55 cache-miss/fault
[After patch 2] 3.99 cache-miss/fault
[After all patch] 4.06 cache-miss/fault

From these numbers, I think the swap/lowmem counters can be added.

My test program is attached (unchanged from the previous post).

[Future Plan]
- add CONSTRAINT_LOWMEM oom killer.
- add rss+swap based oom killer (with sysctl ?)
- add some patch for perf ?
- add mm_accessor patch.
- improve page fault scalability, finally.

Thanks,
-Kame


Attachments:
multi-fault.c (1.82 kB)
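For readers without the attachment, the following is only a rough, hypothetical
sketch of the kind of 2-thread fault micro-benchmark described above; it is not
the attached multi-fault.c. The thread count, mapping size and use of
madvise(MADV_DONTNEED) are illustrative assumptions, and cache-misses per fault
would be measured externally (e.g. with perf). Build with: gcc -O2 -pthread.

#define _GNU_SOURCE
#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define NR_THREADS	2
#define AREA_SIZE	(64UL * 1024 * 1024)	/* private anonymous area per thread */

static void *fault_loop(void *arg)
{
	char *buf = mmap(NULL, AREA_SIZE, PROT_READ | PROT_WRITE,
			 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	if (buf == MAP_FAILED)
		return NULL;
	for (;;) {
		/* touch every page: one minor fault per page */
		memset(buf, 1, AREA_SIZE);
		/* drop the pages so the next pass faults them in again */
		madvise(buf, AREA_SIZE, MADV_DONTNEED);
	}
	return NULL;
}

int main(void)
{
	pthread_t th[NR_THREADS];
	int i;

	for (i = 0; i < NR_THREADS; i++)
		pthread_create(&th[i], NULL, fault_loop, NULL);
	for (i = 0; i < NR_THREADS; i++)
		pthread_join(th[i], NULL);
	return 0;
}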

2009-12-15 09:14:26

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 1/5] clean up mm_counter

From: KAMEZAWA Hiroyuki <[email protected]>

Currently, the per-mm statistics counters are defined by macros in sched.h.

This patch modifies them to be
- defined in mm.h as inline functions
- backed by an array instead of macro-based name generation.

This patch is for reducing the size of future patches that modify the
implementation of the per-mm counters.

Changelog: 2009/12/14
- added a struct rss_stat instead of bare counters.
- use memset instead of for() loop.
- rewrite macros into static inline functions.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 4 -
include/linux/mm.h | 104 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 33 +++++++++-----
include/linux/sched.h | 54 ------------------------
kernel/fork.c | 3 -
kernel/tsacct.c | 1
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/memory.c | 56 +++++++++++++++----------
mm/oom_kill.c | 4 -
mm/rmap.c | 10 ++--
mm/swapfile.c | 2
12 files changed, 174 insertions(+), 101 deletions(-)

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
@@ -868,6 +868,110 @@ extern int mprotect_fixup(struct vm_area
*/
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+/*
+ * per-process(per-mm_struct) statistics.
+ */
+#if USE_SPLIT_PTLOCKS
+/*
+ * The mm counters are not protected by its page_table_lock,
+ * so must be incremented atomically.
+ */
+static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ atomic_long_set(&mm->rss_stat.count[member], value);
+}
+
+static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+ return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
+}
+
+static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ atomic_long_add(value, &mm->rss_stat.count[member]);
+}
+
+static inline void inc_mm_counter(struct mm_struct *mm, int member)
+{
+ atomic_long_inc(&mm->rss_stat.count[member]);
+}
+
+static inline void dec_mm_counter(struct mm_struct *mm, int member)
+{
+ atomic_long_dec(&mm->rss_stat.count[member]);
+}
+
+#else /* !USE_SPLIT_PTLOCKS */
+/*
+ * The mm counters are protected by its page_table_lock,
+ * so can be incremented directly.
+ */
+static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ mm->rss_stat.count[member] = value;
+}
+
+static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+ return mm->rss_stat.count[member];
+}
+
+static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ mm->rss_stat.count[member] += value;
+}
+
+static inline void inc_mm_counter(struct mm_struct *mm, int member)
+{
+ mm->rss_stat.count[member]++;
+}
+
+static inline void dec_mm_counter(struct mm_struct *mm, int member)
+{
+ mm->rss_stat.count[member]--;
+}
+
+#endif /* !USE_SPLIT_PTLOCKS */
+
+static inline unsigned long get_mm_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_FILEPAGES) +
+ get_mm_counter(mm, MM_ANONPAGES);
+}
+
+static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
+{
+ return max(mm->hiwater_rss, get_mm_rss(mm));
+}
+
+static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
+{
+ return max(mm->hiwater_vm, mm->total_vm);
+}
+
+static inline void update_hiwater_rss(struct mm_struct *mm)
+{
+ unsigned long _rss = get_mm_rss(mm);
+
+ if ((mm)->hiwater_rss < _rss)
+ (mm)->hiwater_rss = _rss;
+}
+
+static inline void update_hiwater_vm(struct mm_struct *mm)
+{
+ if (mm->hiwater_vm < mm->total_vm)
+ mm->hiwater_vm = mm->total_vm;
+}
+
+static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
+ struct mm_struct *mm)
+{
+ unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
+
+ if (*maxrss < hiwater_rss)
+ *maxrss = hiwater_rss;
+}
+

/*
* A callback you can register to apply pressure to ageable caches.
Index: mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
@@ -24,12 +24,6 @@ struct address_space;

#define USE_SPLIT_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)

-#if USE_SPLIT_PTLOCKS
-typedef atomic_long_t mm_counter_t;
-#else /* !USE_SPLIT_PTLOCKS */
-typedef unsigned long mm_counter_t;
-#endif /* !USE_SPLIT_PTLOCKS */
-
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -199,6 +193,22 @@ struct core_state {
struct completion startup;
};

+enum {
+ MM_FILEPAGES,
+ MM_ANONPAGES,
+ NR_MM_COUNTERS
+};
+
+#if USE_SPLIT_PTLOCKS
+struct mm_rss_stat {
+ atomic_long_t count[NR_MM_COUNTERS];
+};
+#else /* !USE_SPLIT_PTLOCKS */
+struct mm_rss_stat {
+ unsigned long count[NR_MM_COUNTERS];
+};
+#endif /* !USE_SPLIT_PTLOCKS */
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -223,11 +233,6 @@ struct mm_struct {
* by mmlist_lock
*/

- /* Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- mm_counter_t _file_rss;
- mm_counter_t _anon_rss;

unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
@@ -240,6 +245,12 @@ struct mm_struct {

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

+ /*
+ * Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ struct mm_rss_stat rss_stat;
+
struct linux_binfmt *binfmt;

cpumask_t cpu_vm_mask;
Index: mmotm-2.6.32-Dec8-pth/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/sched.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/sched.h
@@ -385,60 +385,6 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct mm_struct *, unsigned long);
extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);

-#if USE_SPLIT_PTLOCKS
-/*
- * The mm counters are not protected by its page_table_lock,
- * so must be incremented atomically.
- */
-#define set_mm_counter(mm, member, value) atomic_long_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic_long_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
-
-#else /* !USE_SPLIT_PTLOCKS */
-/*
- * The mm counters are protected by its page_table_lock,
- * so can be incremented directly.
- */
-#define set_mm_counter(mm, member, value) (mm)->_##member = (value)
-#define get_mm_counter(mm, member) ((mm)->_##member)
-#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
-#define inc_mm_counter(mm, member) (mm)->_##member++
-#define dec_mm_counter(mm, member) (mm)->_##member--
-
-#endif /* !USE_SPLIT_PTLOCKS */
-
-#define get_mm_rss(mm) \
- (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))
-#define update_hiwater_rss(mm) do { \
- unsigned long _rss = get_mm_rss(mm); \
- if ((mm)->hiwater_rss < _rss) \
- (mm)->hiwater_rss = _rss; \
-} while (0)
-#define update_hiwater_vm(mm) do { \
- if ((mm)->hiwater_vm < (mm)->total_vm) \
- (mm)->hiwater_vm = (mm)->total_vm; \
-} while (0)
-
-static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
-{
- return max(mm->hiwater_rss, get_mm_rss(mm));
-}
-
-static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
- struct mm_struct *mm)
-{
- unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
-
- if (*maxrss < hiwater_rss)
- *maxrss = hiwater_rss;
-}
-
-static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
-{
- return max(mm->hiwater_vm, mm->total_vm);
-}

extern void set_dumpable(struct mm_struct *mm, int value);
extern int get_dumpable(struct mm_struct *mm);
Index: mmotm-2.6.32-Dec8-pth/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8-pth/mm/memory.c
@@ -121,6 +121,7 @@ static int __init init_zero_pfn(void)
}
core_initcall(init_zero_pfn);

+
/*
* If a p?d_bad entry is found while walking page tables, report
* the error, before resetting entry to p?d_none. Usually (but
@@ -376,12 +377,18 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}

-static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+static inline void init_rss_vec(int *rss)
{
- if (file_rss)
- add_mm_counter(mm, file_rss, file_rss);
- if (anon_rss)
- add_mm_counter(mm, anon_rss, anon_rss);
+ memset(rss, 0, sizeof(int) * NR_MM_COUNTERS);
+}
+
+static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+{
+ int i;
+
+ for (i = 0; i < NR_MM_COUNTERS; i++)
+ if (rss[i])
+ add_mm_counter(mm, i, rss[i]);
}

/*
@@ -632,7 +639,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (page) {
get_page(page);
page_dup_rmap(page);
- rss[PageAnon(page)]++;
+ if (PageAnon(page))
+ rss[MM_ANONPAGES]++;
+ else
+ rss[MM_FILEPAGES]++;
}

out_set_pte:
@@ -648,11 +658,12 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[2];
+ int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};

again:
- rss[1] = rss[0] = 0;
+ init_rss_vec(rss);
+
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -688,7 +699,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(orig_src_pte);
- add_mm_rss(dst_mm, rss[0], rss[1]);
+ add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();

@@ -816,8 +827,9 @@ static unsigned long zap_pte_range(struc
struct mm_struct *mm = tlb->mm;
pte_t *pte;
spinlock_t *ptl;
- int file_rss = 0;
- int anon_rss = 0;
+ int rss[NR_MM_COUNTERS];
+
+ init_rss_vec(rss);

pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -863,14 +875,14 @@ static unsigned long zap_pte_range(struc
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- anon_rss--;
+ rss[MM_ANONPAGES]--;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
- file_rss--;
+ rss[MM_FILEPAGES]--;
}
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
@@ -893,7 +905,7 @@ static unsigned long zap_pte_range(struc
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

- add_mm_rss(mm, file_rss, anon_rss);
+ add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);

@@ -1527,7 +1539,7 @@ static int insert_page(struct vm_area_st

/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter(mm, file_rss);
+ inc_mm_counter(mm, MM_FILEPAGES);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

@@ -2163,11 +2175,11 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter(mm, file_rss);
- inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter(mm, MM_ANONPAGES);
}
} else
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2600,7 +2612,7 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/

- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2684,7 +2696,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;

- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2838,10 +2850,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter(mm, file_rss);
+ inc_mm_counter(mm, MM_FILEPAGES);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
Index: mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
@@ -65,11 +65,11 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = get_mm_counter(mm, file_rss);
+ *shared = get_mm_counter(mm, MM_FILEPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = *shared + get_mm_counter(mm, anon_rss);
+ *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}

Index: mmotm-2.6.32-Dec8-pth/kernel/fork.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/kernel/fork.c
+++ mmotm-2.6.32-Dec8-pth/kernel/fork.c
@@ -454,8 +454,7 @@ static struct mm_struct * mm_init(struct
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
mm->nr_ptes = 0;
- set_mm_counter(mm, file_rss, 0);
- set_mm_counter(mm, anon_rss, 0);
+ memset(&mm->rss_stat, 0, sizeof(mm->rss_stat));
spin_lock_init(&mm->page_table_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
Index: mmotm-2.6.32-Dec8-pth/kernel/tsacct.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/kernel/tsacct.c
+++ mmotm-2.6.32-Dec8-pth/kernel/tsacct.c
@@ -21,6 +21,7 @@
#include <linux/tsacct_kern.h>
#include <linux/acct.h>
#include <linux/jiffies.h>
+#include <linux/mm.h>

/*
* fill in basic accounting fields
Index: mmotm-2.6.32-Dec8-pth/mm/filemap_xip.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/filemap_xip.c
+++ mmotm-2.6.32-Dec8-pth/mm/filemap_xip.c
@@ -194,7 +194,7 @@ retry:
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
page_cache_release(page);
Index: mmotm-2.6.32-Dec8-pth/mm/fremap.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/fremap.c
+++ mmotm-2.6.32-Dec8-pth/mm/fremap.c
@@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
page_remove_rmap(page);
page_cache_release(page);
update_hiwater_rss(mm);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
}
} else {
if (!pte_file(pte))
Index: mmotm-2.6.32-Dec8-pth/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Dec8-pth/mm/oom_kill.c
@@ -401,8 +401,8 @@ static void __oom_kill_task(struct task_
"vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(p), p->comm,
K(p->mm->total_vm),
- K(get_mm_counter(p->mm, anon_rss)),
- K(get_mm_counter(p->mm, file_rss)));
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);

/*
Index: mmotm-2.6.32-Dec8-pth/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8-pth/mm/rmap.c
@@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page,

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
if (PageAnon(page))
- dec_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_ANONPAGES);
else
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
set_pte_at(mm, address, pte,
swp_entry_to_pte(make_hwpoison_entry(page)));
} else if (PageAnon(page)) {
@@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page,
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_ANONPAGES);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
@@ -857,7 +857,7 @@ int try_to_unmap_one(struct page *page,
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
} else
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);

page_remove_rmap(page);
page_cache_release(page);
@@ -996,7 +996,7 @@ static int try_to_unmap_cluster(unsigned

page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
Index: mmotm-2.6.32-Dec8-pth/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8-pth/mm/swapfile.c
@@ -840,7 +840,7 @@ static int unuse_pte(struct vm_area_stru
goto out;
}

- inc_mm_counter(vma->vm_mm, anon_rss);
+ inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));

2009-12-15 09:16:42

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 2/5] mm : avoid false sharing on mm_counter

From: KAMEZAWA Hiroyuki <[email protected]>

Considering the nature of per-mm stats, the counter is an object shared among
threads and can be a cache-miss point in the page fault path.

This patch adds a per-thread cache for mm_counter. RSS values are accumulated
in a struct in task_struct and synchronized with the mm's counters at certain events.

In this patch, the event is the number of calls to handle_mm_fault();
the per-thread values are folded into the mm every 64 calls.

A rough estimate with a small benchmark on parallel threads (2 threads) shows
[before]
4.5 cache-misses/fault
[after]
4.0 cache-misses/fault
Anyway, the most contended object becomes mmap_sem as the number of threads grows.
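To make the counting pattern concrete, here is a hypothetical userspace
analogue (not kernel code and not part of this patch): each thread accumulates
into a private counter and folds it into the shared atomic only every 64
events, which is the same idea the hunks below implement with task->rss_stat
and mm->rss_stat.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define THRESH		64		/* analogue of TASK_RSS_EVENTS_THRESH */
#define NR_EVENTS	1000000

static atomic_long shared_rss;		/* analogue of mm->rss_stat (shared, atomic) */

/* analogue of add_mm_counter_fast() + check_sync_rss_stat() */
static void count_one(long *local, int *events)
{
	(*local)++;				/* private: no cacheline bouncing */
	if (++(*events) >= THRESH) {		/* fold back once per 64 events */
		atomic_fetch_add(&shared_rss, *local);
		*local = 0;
		*events = 0;
	}
}

static void *worker(void *arg)
{
	long local = 0;
	int events = 0;
	int i;

	for (i = 0; i < NR_EVENTS; i++)
		count_one(&local, &events);
	atomic_fetch_add(&shared_rss, local);	/* final sync, like sync_mm_rss() */
	return NULL;
}

int main(void)
{
	pthread_t th[2];
	int i;

	for (i = 0; i < 2; i++)
		pthread_create(&th[i], NULL, worker, NULL);
	for (i = 0; i < 2; i++)
		pthread_join(th[i], NULL);
	printf("total = %ld\n", atomic_load(&shared_rss));
	return 0;
}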

Changelog: 2009/12/15
- added Documentation
- removed all hooks from the scheduler and ticks.
- added an event counter instead of them.
- made the counter per-thread rather than per-cpu. This removes much
  complicated code.
- added SPLIT_RSS_COUNTING instead of reusing USE_SPLIT_PTLOCKS.

Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/filesystems/proc.txt | 6 ++
fs/exec.c | 1
include/linux/mm.h | 8 +--
include/linux/mm_types.h | 6 ++
include/linux/sched.h | 4 +
kernel/exit.c | 3 -
mm/memory.c | 94 +++++++++++++++++++++++++++++++++----
7 files changed, 107 insertions(+), 15 deletions(-)

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
@@ -200,9 +200,15 @@ enum {
};

#if USE_SPLIT_PTLOCKS
+#define SPLIT_RSS_COUNTING
struct mm_rss_stat {
atomic_long_t count[NR_MM_COUNTERS];
};
+/* per-thread cached information, */
+struct task_rss_stat {
+ int events; /* for synchronization threshold */
+ int count[NR_MM_COUNTERS];
+};
#else /* !USE_SPLIT_PTLOCKS */
struct mm_rss_stat {
unsigned long count[NR_MM_COUNTERS];
Index: mmotm-2.6.32-Dec8-pth/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/sched.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/sched.h
@@ -1222,7 +1222,9 @@ struct task_struct {
struct plist_node pushable_tasks;

struct mm_struct *mm, *active_mm;
-
+#if defined(SPLIT_RSS_COUNTING)
+ struct task_rss_stat rss_stat;
+#endif
/* task state */
int exit_state;
int exit_code, exit_signal;
Index: mmotm-2.6.32-Dec8-pth/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8-pth/mm/memory.c
@@ -122,6 +122,79 @@ static int __init init_zero_pfn(void)
core_initcall(init_zero_pfn);


+#if defined(SPLIT_RSS_COUNTING)
+
+void __sync_task_rss_stat(struct task_struct *task, struct mm_struct *mm)
+{
+ int i;
+
+ for (i = 0; i < NR_MM_COUNTERS; i++) {
+ if (task->rss_stat.count[i]) {
+ add_mm_counter(mm, i, task->rss_stat.count[i]);
+ task->rss_stat.count[i] = 0;
+ }
+ }
+ task->rss_stat.events = 0;
+}
+
+static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
+{
+ struct task_struct *task = current;
+
+ if (likely(task->mm == mm))
+ task->rss_stat.count[member] += val;
+ else
+ add_mm_counter(mm, member, val);
+}
+#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,1)
+#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,-1)
+
+/* sync counter once per 64 page faults */
+#define TASK_RSS_EVENTS_THRESH (64)
+static void check_sync_rss_stat(struct task_struct *task)
+{
+ if (unlikely(task != current))
+ return;
+ if (unlikely(task->rss_stat.events++ > TASK_RSS_EVENTS_THRESH))
+ __sync_task_rss_stat(task, task->mm);
+}
+
+unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+ long val = 0;
+
+ /*
+ * Don't use task->mm here, to avoid having to use get_task_mm().
+ * The caller must guarantee task->mm is not invalid.
+ */
+ val = atomic_long_read(&mm->rss_stat.count[member]);
+ /*
+ * The counter is updated in an asynchronous manner and may go negative,
+ * but a negative value is never what users expect to see.
+ */
+ if (val < 0)
+ return 0;
+ return (unsigned long)val;
+}
+
+void sync_mm_rss(struct task_struct *task, struct mm_struct *mm)
+{
+ __sync_task_rss_stat(task, mm);
+}
+#else
+
+#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
+#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
+
+static void check_sync_rss_stat(struct task_struct *task)
+{
+}
+
+void sync_mm_rss(struct task_struct *task, struct mm_struct *mm)
+{
+}
+#endif
+
/*
* If a p?d_bad entry is found while walking page tables, report
* the error, before resetting entry to p?d_none. Usually (but
@@ -386,6 +459,8 @@ static inline void add_mm_rss_vec(struct
{
int i;

+ if (current->mm == mm)
+ sync_mm_rss(current, mm);
for (i = 0; i < NR_MM_COUNTERS; i++)
if (rss[i])
add_mm_counter(mm, i, rss[i]);
@@ -1539,7 +1614,7 @@ static int insert_page(struct vm_area_st

/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

@@ -2175,11 +2250,11 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter(mm, MM_FILEPAGES);
- inc_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
}
} else
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2612,7 +2687,7 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/

- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2696,7 +2771,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;

- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2850,10 +2925,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
@@ -3031,6 +3106,9 @@ int handle_mm_fault(struct mm_struct *mm

count_vm_event(PGFAULT);

+ /* do counter updates before entering really critical section. */
+ check_sync_rss_stat(current);
+
if (unlikely(is_vm_hugetlb_page(vma)))
return hugetlb_fault(mm, vma, address, flags);

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
@@ -871,7 +871,7 @@ int __get_user_pages_fast(unsigned long
/*
* per-process(per-mm_struct) statistics.
*/
-#if USE_SPLIT_PTLOCKS
+#if defined(SPLIT_RSS_COUNTING)
/*
* The mm counters are not protected by its page_table_lock,
* so must be incremented atomically.
@@ -881,10 +881,7 @@ static inline void set_mm_counter(struct
atomic_long_set(&mm->rss_stat.count[member], value);
}

-static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
-{
- return (unsigned long)atomic_long_read(&mm->rss_stat.count[member]);
-}
+unsigned long get_mm_counter(struct mm_struct *mm, int member);

static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
{
@@ -972,6 +969,7 @@ static inline void setmax_mm_hiwater_rss
*maxrss = hiwater_rss;
}

+void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);

/*
* A callback you can register to apply pressure to ageable caches.
Index: mmotm-2.6.32-Dec8-pth/fs/exec.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/fs/exec.c
+++ mmotm-2.6.32-Dec8-pth/fs/exec.c
@@ -702,6 +702,7 @@ static int exec_mmap(struct mm_struct *m
/* Notify parent that we're no longer interested in the old VM */
tsk = current;
old_mm = current->mm;
+ sync_mm_rss(tsk, old_mm);
mm_release(tsk, old_mm);

if (old_mm) {
Index: mmotm-2.6.32-Dec8-pth/kernel/exit.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/kernel/exit.c
+++ mmotm-2.6.32-Dec8-pth/kernel/exit.c
@@ -944,7 +944,8 @@ NORET_TYPE void do_exit(long code)
preempt_count());

acct_update_integrals(tsk);
-
+ /* sync mm's RSS info before statistics gathering */
+ sync_mm_rss(tsk, tsk->mm);
group_dead = atomic_dec_and_test(&tsk->signal->live);
if (group_dead) {
hrtimer_cancel(&tsk->signal->real_timer);
Index: mmotm-2.6.32-Dec8-pth/Documentation/filesystems/proc.txt
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/Documentation/filesystems/proc.txt
+++ mmotm-2.6.32-Dec8-pth/Documentation/filesystems/proc.txt
@@ -189,6 +189,12 @@ memory usage. Its seven fields are expla
contains details information about the process itself. Its fields are
explained in Table 1-4.

+(for SMP CONFIG users)
+For making accounting scalable, RSS-related information is handled in an
+asynchronous manner and the value may not be very precise. To see a precise
+snapshot of a given moment, you can look at /proc/<pid>/smaps and scan the
+page table. It's slow but very precise.
+
Table 1-2: Contents of the statm files (as of 2.6.30-rc7)
..............................................................................
Field Content

2009-12-15 09:17:19

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 3/5] mm: count swap usage


One of the frequent questions from users about memory management is
how many swap entries are used by each process, and this information will
give some hints to the oom-killer.

Although we can count the number of swap entries per process by scanning
/proc/<pid>/smaps, this is very slow and not suitable for the usual
process-information handlers such as 'ps' or 'top'.
(ps and top are already slow enough..)

This patch adds a swap-entry counter to mm_counter, updated at
each swap event. The information is exported via the /proc/<pid>/status file as

[kamezawa@bluextal memory]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2910
Pid: 2910
PPid: 2823
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 432 kB
VmRSS: 432 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <=============== this.
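As a hypothetical usage example (not part of this patch), a tool could pick up
the new field by reading /proc/<pid>/status, e.g.:

#include <stdio.h>
#include <string.h>

int main(int argc, char **argv)
{
	char path[64], line[256];
	FILE *f;

	/* default to the caller's own status file */
	snprintf(path, sizeof(path), "/proc/%s/status",
		 argc > 1 ? argv[1] : "self");
	f = fopen(path, "r");
	if (!f) {
		perror(path);
		return 1;
	}
	while (fgets(line, sizeof(line), f))
		if (!strncmp(line, "VmSwap:", 7))
			fputs(line, stdout);	/* e.g. "VmSwap:        0 kB" */
	fclose(f);
	return 0;
}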

Changelog: 2009/12/14
- removed a bad comment.
- Added Documentation

Reviewed-by: Minchan Kim <[email protected]>
Reviewed-by: Christoph Lameter <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/filesystems/proc.txt | 2 ++
fs/proc/task_mmu.c | 9 ++++++---
include/linux/mm_types.h | 1 +
mm/memory.c | 16 ++++++++++++----
mm/rmap.c | 1 +
mm/swapfile.c | 1 +
6 files changed, 23 insertions(+), 7 deletions(-)

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
@@ -196,6 +196,7 @@ struct core_state {
enum {
MM_FILEPAGES,
MM_ANONPAGES,
+ MM_SWAPENTS,
NR_MM_COUNTERS
};

Index: mmotm-2.6.32-Dec8-pth/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8-pth/mm/memory.c
@@ -679,7 +679,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
- if (is_write_migration_entry(entry) &&
+ if (likely(!non_swap_entry(entry)))
+ rss[MM_SWAPENTS]++;
+ else if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both parent
@@ -974,9 +976,14 @@ static unsigned long zap_pte_range(struc
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else if
- (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
- print_bad_pte(vma, addr, ptent, NULL);
+ } else {
+ swp_entry_t entry = pte_to_swp_entry(ptent);
+
+ if (!non_swap_entry(entry))
+ rss[MM_SWAPENTS]--;
+ if (unlikely(!free_swap_and_cache(entry)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ }
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));

@@ -2688,6 +2695,7 @@ static int do_swap_page(struct mm_struct
*/

inc_mm_counter_fast(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
Index: mmotm-2.6.32-Dec8-pth/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8-pth/mm/rmap.c
@@ -840,6 +840,7 @@ int try_to_unmap_one(struct page *page,
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
Index: mmotm-2.6.32-Dec8-pth/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8-pth/mm/swapfile.c
@@ -840,6 +840,7 @@ static int unuse_pte(struct vm_area_stru
goto out;
}

+ dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
Index: mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
@@ -16,7 +16,7 @@

void task_mem(struct seq_file *m, struct mm_struct *mm)
{
- unsigned long data, text, lib;
+ unsigned long data, text, lib, swap;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;

/*
@@ -36,6 +36,7 @@ void task_mem(struct seq_file *m, struct
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+ swap = get_mm_counter(mm, MM_SWAPENTS);
seq_printf(m,
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
@@ -46,7 +47,8 @@ void task_mem(struct seq_file *m, struct
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
- "VmPTE:\t%8lu kB\n",
+ "VmPTE:\t%8lu kB\n"
+ "VmSwap:\t%8lu kB\n",
hiwater_vm << (PAGE_SHIFT-10),
(total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
@@ -54,7 +56,8 @@ void task_mem(struct seq_file *m, struct
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
+ (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ swap << (PAGE_SHIFT-10));
}

unsigned long task_vsize(struct mm_struct *mm)
Index: mmotm-2.6.32-Dec8-pth/Documentation/filesystems/proc.txt
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/Documentation/filesystems/proc.txt
+++ mmotm-2.6.32-Dec8-pth/Documentation/filesystems/proc.txt
@@ -164,6 +164,7 @@ read the file /proc/PID/status:
VmExe: 68 kB
VmLib: 1412 kB
VmPTE: 20 kb
+ VmSwap: 0 kB
Threads: 1
SigQ: 0/28578
SigPnd: 0000000000000000
@@ -220,6 +221,7 @@ Table 1-2: Contents of the statm files (
VmExe size of text segment
VmLib size of shared library code
VmPTE size of page table entries
+ VmSwap size of swap usage (the number of swap entries in use)
Threads number of threads
SigQ number of signals queued/max. number for queue
SigPnd bitmap of pending signals for the thread

2009-12-15 09:18:25

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 4/5] mm : add lowmem detection logic

From: KAMEZAWA Hiroyuki <[email protected]>

The final purpose of this patch is to improve oom/memory-shortage detection.
In general, there are OOM cases where lowmem is exhausted. What this lowmem
means depends on the situation, but in general a limited amount of memory
reserved for some special use is lowmem.

This patch adds an integer lowmem_zone, which is initialized to -1.
If zone_idx(zone) <= lowmem_zone, the zone is lowmem.

This patch uses the simple definition that a zone for special use is lowmem,
without taking the amount of memory into account.

For example,
- if HIGHMEM is used, NORMAL is lowmem.
- if the system has both NORMAL and DMA32, DMA32 is lowmem.
- when the system consists of only one zone, there is no lowmem.

This will be used for lowmem accounting per mm_struct, and that information
will be used by the oom-killer.
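For illustration only, here is a hypothetical userspace sketch (not the kernel
code added below) that applies the same classification rule to the populated
zones listed in /proc/zoneinfo; the parsing details are assumptions of this
sketch.

#include <stdio.h>
#include <string.h>

int main(void)
{
	char line[256], zone[32] = "";
	long present;
	int has_normal = 0, has_high_or_movable = 0, has_dma32 = 0, has_dma = 0;
	FILE *f = fopen("/proc/zoneinfo", "r");

	if (!f)
		return 1;
	while (fgets(line, sizeof(line), f)) {
		/* remember which zone the following fields belong to */
		if (sscanf(line, "Node %*d, zone %31s", zone) == 1)
			continue;
		/* "present" pages > 0 means the zone is populated */
		if (sscanf(line, " present %ld", &present) == 1 && present > 0) {
			if (!strcmp(zone, "Normal"))
				has_normal = 1;
			else if (!strcmp(zone, "HighMem") || !strcmp(zone, "Movable"))
				has_high_or_movable = 1;
			else if (!strcmp(zone, "DMA32"))
				has_dma32 = 1;
			else if (!strcmp(zone, "DMA"))
				has_dma = 1;
		}
	}
	fclose(f);
	/* same ordering rule as the kernel-side detection below */
	if (has_high_or_movable)
		puts("lowmem zone: Normal");
	else if (has_dma32 && has_normal)
		puts("lowmem zone: DMA32");
	else if (has_dma && has_normal)
		puts("lowmem zone: DMA");
	else
		puts("no special lowmem zone; the system looks flat");
	return 0;
}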

Q: Why don't you use policy_zone?
A: It's NUMA-only. I want to use a unified approach for detecting lowmem,
and policy_zone sounds like it is "for mempolicy"..

Concerns or TODO:
- Now, we have policy_zone if CONFIG_NUMA=y. Maybe we can make it
  #define policy_zone (lowmem_zone + 1)
  or remove it. But that should be done in a separate patch.

Changelog: 2009/12/14
- no change.
Changelog: 2009/12/09
- stop using policy_zone and use unified definition on each config.

Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mm.h | 9 +++++++
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
@@ -583,6 +583,15 @@ static inline void set_page_links(struct
}

/*
+ * Check a page is in lower zone
+ */
+extern int lowmem_zone;
+static inline bool is_lowmem_page(struct page *page)
+{
+ return page_zonenum(page) <= lowmem_zone;
+}
+
+/*
* Some inline functions in vmstat.h depend on page_zone()
*/
#include <linux/vmstat.h>
Index: mmotm-2.6.32-Dec8-pth/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Dec8-pth/mm/page_alloc.c
@@ -2311,6 +2311,59 @@ static void zoneref_set_zone(struct zone
zoneref->zone_idx = zone_idx(zone);
}

+/* the zone is lowmem if zone_idx(zone) <= lowmem_zone */
+int lowmem_zone __read_mostly;
+/*
+ * Find out LOWMEM zone on this host. LOWMEM means a zone for special use
+ * and its size seems small and precious than other zones. For example,
+ * NORMAL zone is considered to be LOWMEM on a host which has HIGHMEM.
+ *
+ * This lowmem zone is determined by zone ordering and equipped memory layout.
+ * The amount of memory is not taken into account now.
+ */
+static void find_lowmem_zone(void)
+{
+ unsigned long pages[MAX_NR_ZONES];
+ struct zone *zone;
+ int idx;
+
+ for (idx = 0; idx < MAX_NR_ZONES; idx++)
+ pages[idx] = 0;
+ /* count the number of pages */
+ for_each_populated_zone(zone) {
+ idx = zone_idx(zone);
+ pages[idx] += zone->present_pages;
+ }
+ /* If We have HIGHMEM...we ignore ZONE_MOVABLE in this case. */
+#ifdef CONFIG_HIGHMEM
+ if (pages[ZONE_HIGHMEM]) {
+ lowmem_zone = ZONE_NORMAL;
+ return;
+ }
+#endif
+ /* If We have MOVABLE zone...which works like HIGHMEM. */
+ if (pages[ZONE_MOVABLE]) {
+ lowmem_zone = ZONE_NORMAL;
+ return;
+ }
+#ifdef CONFIG_ZONE_DMA32
+ /* If we have DMA32 and there is ZONE_NORMAL...*/
+ if (pages[ZONE_DMA32] && pages[ZONE_NORMAL]) {
+ lowmem_zone = ZONE_DMA32;
+ return;
+ }
+#endif
+#ifdef CONFIG_ZONE_DMA
+ /* If we have DMA and there is ZONE_NORMAL...*/
+ if (pages[ZONE_DMA] && pages[ZONE_NORMAL]) {
+ lowmem_zone = ZONE_DMA;
+ return;
+ }
+#endif
+ lowmem_zone = -1;
+ return;
+}
+
/*
* Builds allocation fallback zone lists.
*
@@ -2790,12 +2843,21 @@ void build_all_zonelists(void)
else
page_group_by_mobility_disabled = 0;

+ find_lowmem_zone();
+
printk("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
nr_online_nodes,
zonelist_order_name[current_zonelist_order],
page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
+
+ if (lowmem_zone >= 0)
+ printk("LOWMEM zone is detected as %s\n",
+ zone_names[lowmem_zone]);
+ else
+ printk("There are no special LOWMEM. The system seems flat\n");
+
#ifdef CONFIG_NUMA
printk("Policy zone: %s\n", zone_names[policy_zone]);
#endif

2009-12-15 09:19:34

by Kamezawa Hiroyuki

Subject: [mmotm][PATCH 5/5] mm : count lowmem rss

From: KAMEZAWA Hiroyuki <[email protected]>

Some OOM kills are caused by memory shortage in the lowmem area; for example,
ZONE_NORMAL being exhausted on an x86-32/HIGHMEM kernel.

Currently, the oom-killer has no lowmem usage information for processes and
selects victim processes based on global memory usage information.
In bad cases, this can cause chains of kills of innocent processes without
progress, an oom serial killer.

To make the oom-killer lowmem-aware, this patch adds counters for accounting
lowmem usage per process. (Patches for the oom-killer itself are not included here.)

Adding a counter is easy, but one concern is the cost of the new counter.
This patch does not add extra counting operations; it only adds an "if" check
for whether a page is lowmem.
With a micro benchmark, there is almost no regression.

Changelog: 2009/12/14
- made the get_xx_rss() helpers out-of-line (not inlined) functions.

Reviewed-by: Minchan Kim <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 4 +-
include/linux/mm.h | 27 ++++++++++++---
include/linux/mm_types.h | 7 ++--
mm/filemap_xip.c | 2 -
mm/fremap.c | 2 -
mm/memory.c | 80 ++++++++++++++++++++++++++++++++++++-----------
mm/oom_kill.c | 8 ++--
mm/rmap.c | 10 +++--
mm/swapfile.c | 2 -
9 files changed, 105 insertions(+), 37 deletions(-)

Index: mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm_types.h
@@ -194,11 +194,14 @@ struct core_state {
};

enum {
- MM_FILEPAGES,
- MM_ANONPAGES,
+ MM_FILEPAGES, /* file rss is MM_FILEPAGES + MM_FILE_LOWPAGES */
+ MM_ANONPAGES, /* anon rss is MM_ANONPAGES + MM_ANON_LOWPAGES */
+ MM_FILE_LOWPAGES, /* pages from lower zones in file rss*/
+ MM_ANON_LOWPAGES, /* pages from lower zones in anon rss*/
MM_SWAPENTS,
NR_MM_COUNTERS
};
+#define LOWMEM_COUNTER 2

#if USE_SPLIT_PTLOCKS
#define SPLIT_RSS_COUNTING
Index: mmotm-2.6.32-Dec8-pth/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8-pth/mm/memory.c
@@ -137,7 +137,7 @@ void __sync_task_rss_stat(struct task_st
task->rss_stat.events = 0;
}

-static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
+static void __add_mm_counter_fast(struct mm_struct *mm, int member, int val)
{
struct task_struct *task = current;

@@ -146,8 +146,17 @@ static void add_mm_counter_fast(struct m
else
add_mm_counter(mm, member, val);
}
-#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,1)
-#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,-1)
+static void add_mm_counter_fast(struct mm_struct *mm, int member,
+ int val, struct page *page)
+{
+ if (is_lowmem_page(page))
+ member += LOWMEM_COUNTER;
+ __add_mm_counter_fast(mm, member, val);
+}
+#define inc_mm_counter_fast(mm, member, page)\
+ add_mm_counter_fast(mm, member,1, page)
+#define dec_mm_counter_fast(mm, member, page)\
+ add_mm_counter_fast(mm, member,-1, page)

/* sync counter once per 64 page faults */
#define TASK_RSS_EVENTS_THRESH (64)
@@ -183,8 +192,9 @@ void sync_mm_rss(struct task_struct *tas
}
#else

-#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
-#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
+#define inc_mm_counter_fast(mm, member, page) inc_mm_counter_page(mm, member, page)
+#define dec_mm_counter_fast(mm, member, page) dec_mm_counter_page(mm, member, page)
+#define __add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)

static void check_sync_rss_stat(struct task_struct *task)
{
@@ -195,6 +205,30 @@ void sync_mm_rss(struct task_struct *tas
}
#endif

+unsigned long get_file_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_FILEPAGES)
+ + get_mm_counter(mm, MM_FILE_LOWPAGES);
+}
+
+unsigned long get_anon_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_ANONPAGES)
+ + get_mm_counter(mm, MM_ANON_LOWPAGES);
+}
+
+unsigned long get_low_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_ANON_LOWPAGES)
+ + get_mm_counter(mm, MM_FILE_LOWPAGES);
+}
+
+unsigned long get_mm_rss(struct mm_struct *mm)
+{
+ return get_file_rss(mm) + get_anon_rss(mm);
+}
+
+
/*
* If a p?d_bad entry is found while walking page tables, report
* the error, before resetting entry to p?d_none. Usually (but
@@ -714,12 +748,17 @@ copy_one_pte(struct mm_struct *dst_mm, s

page = vm_normal_page(vma, addr, pte);
if (page) {
+ int type;
+
get_page(page);
page_dup_rmap(page);
if (PageAnon(page))
- rss[MM_ANONPAGES]++;
+ type = MM_ANONPAGES;
else
- rss[MM_FILEPAGES]++;
+ type = MM_FILEPAGES;
+ if (is_lowmem_page(page))
+ type += LOWMEM_COUNTER;
+ rss[type]++;
}

out_set_pte:
@@ -905,6 +944,7 @@ static unsigned long zap_pte_range(struc
pte_t *pte;
spinlock_t *ptl;
int rss[NR_MM_COUNTERS];
+ int type;

init_rss_vec(rss);

@@ -952,15 +992,18 @@ static unsigned long zap_pte_range(struc
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- rss[MM_ANONPAGES]--;
+ type = MM_ANONPAGES;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
- rss[MM_FILEPAGES]--;
+ type = MM_FILEPAGES;
}
+ if (is_lowmem_page(page))
+ type += LOWMEM_COUNTER;
+ rss[type]--;
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
@@ -1621,7 +1664,7 @@ static int insert_page(struct vm_area_st

/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES, page);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));

@@ -2257,11 +2300,12 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter_fast(mm, MM_FILEPAGES);
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_FILEPAGES, old_page);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, new_page);
}
} else
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, new_page);
+
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2694,8 +2738,9 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/

- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
+ /* SWAPENTS counter is not related to page..then use bare call */
+ __add_mm_counter_fast(mm, MM_SWAPENTS, -1);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2779,7 +2824,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;

- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2933,10 +2978,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES, page);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
Index: mmotm-2.6.32-Dec8-pth/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8-pth/mm/rmap.c
@@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page,

if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
if (PageAnon(page))
- dec_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_page(mm, MM_ANONPAGES, page);
else
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
set_pte_at(mm, address, pte,
swp_entry_to_pte(make_hwpoison_entry(page)));
} else if (PageAnon(page)) {
@@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page,
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_page(mm, MM_ANONPAGES, page);
inc_mm_counter(mm, MM_SWAPENTS);
} else if (PAGE_MIGRATION) {
/*
@@ -858,7 +858,7 @@ int try_to_unmap_one(struct page *page,
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
} else
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);

page_remove_rmap(page);
page_cache_release(page);
@@ -998,6 +998,8 @@ static int try_to_unmap_cluster(unsigned
page_remove_rmap(page);
page_cache_release(page);
dec_mm_counter(mm, MM_FILEPAGES);
+ if (is_lowmem_page(page))
+ dec_mm_counter(mm, MM_FILEPAGES);
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
Index: mmotm-2.6.32-Dec8-pth/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8-pth/mm/swapfile.c
@@ -841,7 +841,7 @@ static int unuse_pte(struct vm_area_stru
}

dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ inc_mm_counter_page(vma->vm_mm, MM_ANONPAGES, page);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: mmotm-2.6.32-Dec8-pth/mm/filemap_xip.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/filemap_xip.c
+++ mmotm-2.6.32-Dec8-pth/mm/filemap_xip.c
@@ -194,7 +194,7 @@ retry:
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
page_cache_release(page);
Index: mmotm-2.6.32-Dec8-pth/mm/fremap.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/fremap.c
+++ mmotm-2.6.32-Dec8-pth/mm/fremap.c
@@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
page_remove_rmap(page);
page_cache_release(page);
update_hiwater_rss(mm);
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
}
} else {
if (!pte_file(pte))
Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
@@ -939,11 +939,10 @@ static inline void dec_mm_counter(struct

#endif /* !USE_SPLIT_PTLOCKS */

-static inline unsigned long get_mm_rss(struct mm_struct *mm)
-{
- return get_mm_counter(mm, MM_FILEPAGES) +
- get_mm_counter(mm, MM_ANONPAGES);
-}
+unsigned long get_mm_rss(struct mm_struct *mm);
+unsigned long get_file_rss(struct mm_struct *mm);
+unsigned long get_anon_rss(struct mm_struct *mm);
+unsigned long get_low_rss(struct mm_struct *mm);

static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
{
@@ -978,6 +977,23 @@ static inline void setmax_mm_hiwater_rss
*maxrss = hiwater_rss;
}

+/* Utility for lowmem counting */
+static inline void
+inc_mm_counter_page(struct mm_struct *mm, int member, struct page *page)
+{
+ if (unlikely(is_lowmem_page(page)))
+ member += LOWMEM_COUNTER;
+ inc_mm_counter(mm, member);
+}
+
+static inline void
+dec_mm_counter_page(struct mm_struct *mm, int member, struct page *page)
+{
+ if (unlikely(is_lowmem_page(page)))
+ member += LOWMEM_COUNTER;
+ dec_mm_counter(mm, member);
+}
+
void sync_mm_rss(struct task_struct *task, struct mm_struct *mm);

/*
@@ -1034,6 +1050,7 @@ int __pmd_alloc(struct mm_struct *mm, pu
int __pte_alloc(struct mm_struct *mm, pmd_t *pmd, unsigned long address);
int __pte_alloc_kernel(pmd_t *pmd, unsigned long address);

+
/*
* The following ifdef needed to get the 4level-fixup.h header to work.
* Remove it when 4level-fixup.h has been removed.
Index: mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8-pth/fs/proc/task_mmu.c
@@ -68,11 +68,11 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = get_mm_counter(mm, MM_FILEPAGES);
+ *shared = get_file_rss(mm);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+ *resident = *shared + get_anon_rss(mm);
return mm->total_vm;
}

Index: mmotm-2.6.32-Dec8-pth/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Dec8-pth.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Dec8-pth/mm/oom_kill.c
@@ -398,11 +398,13 @@ static void __oom_kill_task(struct task_

if (verbose)
printk(KERN_ERR "Killed process %d (%s) "
- "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB "
+ "lowmem %lukB\n",
task_pid_nr(p), p->comm,
K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ K(get_anon_rss(p->mm)),
+ K(get_file_rss(p->mm)),
+ K(get_low_rss(p->mm)));
task_unlock(p);

/*

2009-12-15 15:25:24

by Christoph Lameter

Subject: Re: [mmotm][PATCH 2/5] mm : avoid false sharing on mm_counter

On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:

> #if USE_SPLIT_PTLOCKS
> +#define SPLIT_RSS_COUNTING
> struct mm_rss_stat {
> atomic_long_t count[NR_MM_COUNTERS];
> };
> +/* per-thread cached information, */
> +struct task_rss_stat {
> + int events; /* for synchronization threshold */

Why count events? Just always increment the task counters and fold them
at appropriate points into mm_struct. Or get rid of the mm_struct counters
and only sum them up on the fly if needed?

Add a pointer to thread rss_stat structure to mm_struct and remove the
counters? If the task has only one thread then the pointer points to the
accurate data (most frequent case). Otherwise it can be NULL and then we
calculate it on the fly?

> +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> +{
> + struct task_struct *task = current;
> +
> + if (likely(task->mm == mm))
> + task->rss_stat.count[member] += val;
> + else
> + add_mm_counter(mm, member, val);
> +}
> +#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,1)
> +#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,-1)
> +

Code will be much simpler if you always increment the task counts.

2009-12-15 16:54:14

by Kamezawa Hiroyuki

Subject: Re: [mmotm][PATCH 2/5] mm : avoid false sharing on mm_counter

Christoph Lameter wrote:
> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
>> #if USE_SPLIT_PTLOCKS
>> +#define SPLIT_RSS_COUNTING
>> struct mm_rss_stat {
>> atomic_long_t count[NR_MM_COUNTERS];
>> };
>> +/* per-thread cached information, */
>> +struct task_rss_stat {
>> + int events; /* for synchronization threshold */
>
> Why count events? Just always increment the task counters and fold them
> at appropriate points into mm_struct.

I used an event counter because I think this patch is the _easy_ version of all
I've written since November. I'd like to start from a simple one rather than
code which is invasive and can cause complicated discussion.

This event counter is very simple and everything we do can be folded under /mm.
To be honest, I'd like to move the synchronization point to the tick or
schedule(), but for now, I'd like to start from this.
The point of this patch is "splitting" mm_counter counting and removing
false sharing. The problem of counter synchronization can be
discussed later.

As you know, I have an extreme version using percpu etc...but it's not
too late to think of the best counter after removing the false sharing
of mmap_sem. When measuring page-fault speed with more than 4 threads,
most of the time is spent on false sharing of mmap_sem and this counter's
scalability is not a problem. (So, my test program just uses 2 threads.)

Considering the trade-off, I'd like to start from the "implement all under /mm"
implementation. We can revisit and modify this after the mmap_sem problem is
fixed.

If you recommend dropping this and just posting 1,3,4,5, I'll do so.

> Or get rid of the mm_struct counters and only sum them up on the fly if
> needed?
>
Getting rid of the mm_struct counters is impossible because of get_user_pages(),
kswapd, vmscan, etc. (for now).

Then, we have 3 choices:
1. leave atomic counters on mm_struct.
2. add a pointer to some thread's counter in mm_struct.
3. use a percpu counter on mm_struct.

With 2, we'd have to take care of the atomicity of updating the per-thread
counter, so I didn't choose it. With 3, a percpu counter, as you did, seems
attractive. But there is a scalability problem on the read side, and we'd
need some synchronization point to avoid a read-side regression even with
a percpu counter.

Considering the memory footprint, the benefit of a per-thread counter is
that we can put it near the cache line holding task->mm, so it doesn't
cost an extra cache miss (if the counter is small enough).
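
For reference, the whole per-thread cache is tiny. A sketch of its layout
(the count[] member is my assumption about what follows the quoted
"events" field, and the placement in task_struct is the point argued
above, not a quote from the patch):

	struct task_rss_stat {
		int events;			/* events since the last fold */
		int count[NR_MM_COUNTERS];	/* plain ints, no atomics needed */
	};

	/*
	 * In task_struct it can sit right next to ->mm, e.g.:
	 *
	 *	struct mm_struct *mm, *active_mm;
	 *	struct task_rss_stat rss_stat;
	 *
	 * so the fast-path update only dirties task-local memory.
	 */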


> Add a pointer to thread rss_stat structure to mm_struct and remove the
> counters? If the task has only one thread then the pointer points to the
> accurate data (most frequent case). Otherwise it can be NULL and then we
> calculate it on the fly?
>
get_user_pages(), vmscan, kvm, etc. touch another process's page tables,
so updates can come from tasks that don't own the mm and must still land
in a shared per-mm counter.

>> +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
>> +{
>> +	struct task_struct *task = current;
>> +
>> +	if (likely(task->mm == mm))
>> +		task->rss_stat.count[member] += val;
>> +	else
>> +		add_mm_counter(mm, member, val);
>> +}
>> +#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,1)
>> +#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member,-1)
>> +
>
> Code will be much simpler if you always increment the task counts.
>
Yes, I know; I tried that but failed. A bigger patch would probably be
required.

The result this patch shows is not bad, even though there is still room
for improvement.

Thanks,
-Kame

2009-12-15 23:31:27

by Minchan Kim

[permalink] [raw]
Subject: Re: [mmotm][PATCH 1/5] clean up mm_counter

On Tue, 15 Dec 2009 18:11:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, per-mm statistics counter is defined by macro in sched.h
>
> This patch modifies it to
> - defined in mm.h as inlinf functions
> - use array instead of macro's name creation.
>
> This patch is for reducing patch size in future patch to modify
> implementation of per-mm counter.
>
> Changelog: 2009/12/14
> - added a struct rss_stat instead of bare counters.
> - use memset instead of for() loop.
> - rewrite macros into static inline functions.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> fs/proc/task_mmu.c | 4 -
> include/linux/mm.h | 104 +++++++++++++++++++++++++++++++++++++++++++++++
> include/linux/mm_types.h | 33 +++++++++-----
> include/linux/sched.h | 54 ------------------------
> kernel/fork.c | 3 -
> kernel/tsacct.c | 1
> mm/filemap_xip.c | 2
> mm/fremap.c | 2
> mm/memory.c | 56 +++++++++++++++----------
> mm/oom_kill.c | 4 -
> mm/rmap.c | 10 ++--
> mm/swapfile.c | 2
> 12 files changed, 174 insertions(+), 101 deletions(-)
>
> Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
> ===================================================================
> --- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
> +++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
> @@ -868,6 +868,110 @@ extern int mprotect_fixup(struct vm_area
> */
> int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> struct page **pages);
> +/*
> + * per-process(per-mm_struct) statistics.
> + */
> +#if USE_SPLIT_PTLOCKS
> +/*
> + * The mm counters are not protected by its page_table_lock,
> + * so must be incremented atomically.
> + */
> +static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
> +{
> + atomic_long_set(&mm->rss_stat.count[member], value);
> +}

I can't find mm->rss_stat in this patch.
Maybe it's part of the next patch.
That could break bisection.

Otherwise, Looks good to me.

Reviewed-by: Minchan Kim <[email protected]>

--
Kind regards,
Minchan Kim

2009-12-15 23:55:09

by Minchan Kim

[permalink] [raw]
Subject: Re: [mmotm][PATCH 2/5] mm : avoid false sharing on mm_counter

Hi, Christoph.

On Tue, 15 Dec 2009 09:25:01 -0600 (CST)
Christoph Lameter <[email protected]> wrote:

> On Tue, 15 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > #if USE_SPLIT_PTLOCKS
> > +#define SPLIT_RSS_COUNTING
> > struct mm_rss_stat {
> > atomic_long_t count[NR_MM_COUNTERS];
> > };
> > +/* per-thread cached information, */
> > +struct task_rss_stat {
> > + int events; /* for synchronization threshold */
>
> Why count events? Just always increment the task counters and fold them
> at appropriate points into mm_struct. Or get rid of the mm_struct counters
> and only sum them up on the fly if needed?

We are having a hard time finding the appropriate points you mention.
That's because we want to remove the read-side overhead without any
regression, which I think is why Kame removed the update hook at
schedule time.

Although that hook has almost no overhead, I don't want the mm counters
to become stale just because synchronization depends on reaching a
schedule point. As an extreme case, if a process makes many faults in
its time slice and is not preempted (e.g. an RT task), we could show
stale counters.

With the event threshold, merging the counters stays bounded: a thread's
unfolded deltas never exceed 64 events in the worst case.

In this respect, I like this idea.

--
Kind regards,
Minchan Kim

2009-12-15 23:56:29

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [mmotm][PATCH 1/5] clean up mm_counter

On Wed, 16 Dec 2009 08:25:29 +0900
Minchan Kim <[email protected]> wrote:

> On Tue, 15 Dec 2009 18:11:16 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, per-mm statistics counter is defined by macro in sched.h
> >
> > This patch modifies it to
> > - defined in mm.h as inlinf functions
> > - use array instead of macro's name creation.
> >
> > This patch is for reducing patch size in future patch to modify
> > implementation of per-mm counter.
> >
> > Changelog: 2009/12/14
> > - added a struct rss_stat instead of bare counters.
> > - use memset instead of for() loop.
> > - rewrite macros into static inline functions.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > fs/proc/task_mmu.c | 4 -
> > include/linux/mm.h | 104 +++++++++++++++++++++++++++++++++++++++++++++++
> > include/linux/mm_types.h | 33 +++++++++-----
> > include/linux/sched.h | 54 ------------------------
> > kernel/fork.c | 3 -
> > kernel/tsacct.c | 1
> > mm/filemap_xip.c | 2
> > mm/fremap.c | 2
> > mm/memory.c | 56 +++++++++++++++----------
> > mm/oom_kill.c | 4 -
> > mm/rmap.c | 10 ++--
> > mm/swapfile.c | 2
> > 12 files changed, 174 insertions(+), 101 deletions(-)
> >
> > Index: mmotm-2.6.32-Dec8-pth/include/linux/mm.h
> > ===================================================================
> > --- mmotm-2.6.32-Dec8-pth.orig/include/linux/mm.h
> > +++ mmotm-2.6.32-Dec8-pth/include/linux/mm.h
> > @@ -868,6 +868,110 @@ extern int mprotect_fixup(struct vm_area
> > */
> > int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
> > struct page **pages);
> > +/*
> > + * per-process(per-mm_struct) statistics.
> > + */
> > +#if USE_SPLIT_PTLOCKS
> > +/*
> > + * The mm counters are not protected by its page_table_lock,
> > + * so must be incremented atomically.
> > + */
> > +static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
> > +{
> > + atomic_long_set(&mm->rss_stat.count[member], value);
> > +}
>
> I can't find mm->rss_stat in this patch.
> Maybe it's part of next patch.

It's in mm_types.h

@@ -223,11 +233,6 @@ struct mm_struct {
* by mmlist_lock
*/

- /* Special counters, in some configurations protected by the
- * page_table_lock, in other configurations by being atomic.
- */
- mm_counter_t _file_rss;
- mm_counter_t _anon_rss;

unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
@@ -240,6 +245,12 @@ struct mm_struct {

unsigned long saved_auxv[AT_VECTOR_SIZE]; /* for /proc/PID/auxv */

+ /*
+ * Special counters, in some configurations protected by the
+ * page_table_lock, in other configurations by being atomic.
+ */
+ struct mm_rss_stat rss_stat;
+
struct linux_binfmt *binfmt;

rss_stat is moved to a somewhat higher address in mm_struct to keep it
away from the false-sharing storm around mmap_sem.


> Otherwise, Looks good to me.
>
> Reviewed-by: Minchan Kim <[email protected]>
>

Thank you for all your help for this series.

Regards,
-Kame