For better OOM handling and per-process statistics, I'd like to add new
counters to mm_struct: one for swap entries and one for lowmem usage.
But simply adding new counters makes the page fault path fatter and adds more
cache misses. So, before going further, it's better to modify the per-mm counters themselves.
This is an updated version of the percpu cached mm counter.
Main changes from the previous version are:
- if there were no page faults, no sync at the scheduler.
- removed synchronization at tick.
- added SWAPENTS counter.
- added lowmem_rss counter.
(I added Ingo to CC: because this patch set adds hooks to schedule(); please tell me
if you see any concerns with patch [2/5]...)
In general, maintaining a shared counter without frequent atomic ops can be done
in the following ways:
(1) use a simple percpu counter and calculate its sum at read time.
(2) use a cached percpu counter with some invalidation/synchronization points.
Because the read cost of this per-mm counter is important, this set uses (2),
and the synchronization point is schedule(). schedule() is a good point to
synchronize per-process per-cpu cached information (a small userspace sketch
of this approach follows below).
I wanted to avoid adding hooks to schedule(), but
- handling all of the per-cpu caches without such a hook requires complicated refcount handling.
- taskacct requires full synchronization of the cached counter information.
  An IPI at each task exit()? That's silly.
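For illustration, here is a minimal userspace sketch of approach (2): a
thread-local cache (standing in for the per-cpu cache) that is flushed to the
shared counter only at explicit synchronization points. All names are made up
for the example; this is a model of the idea, not kernel code.

/* cached counter model: build with gcc -std=c11 */
#include <stdatomic.h>
#include <stdio.h>

static atomic_long shared_counter;        /* the counter everyone reads */
static _Thread_local long cached_delta;   /* per-thread (per-cpu) cache */

static void counter_add_fast(long val)
{
	cached_delta += val;              /* no atomic op on the hot path */
}

static void counter_sync(void)            /* called at a sync point */
{
	if (cached_delta) {
		atomic_fetch_add(&shared_counter, cached_delta);
		cached_delta = 0;
	}
}

static long counter_read(void)            /* cheap read; may lag slightly */
{
	return atomic_load(&shared_counter);
}

int main(void)
{
	counter_add_fast(1);
	counter_add_fast(1);
	counter_sync();
	printf("%ld\n", counter_read());  /* prints 2 */
	return 0;
}

The read side stays a single load and the hot path touches only thread-local
memory; the price is that readers can lag until the next sync point, which is
exactly the trade-off made by putting the sync into schedule().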
Following is the cost of these patches, measured on a 2-socket x86-64 host:
the number of page faults caused by 2 threads in 60 seconds,
one thread per socket. (The test program follows this mail.)
This patch set only covers the SPLIT_PTLOCKS=y case.
[Before] (mmotm-2.6.32-Dec8)
Performance counter stats for './multi-fault 2' (5 runs):
45122351 page-faults ( +- 1.125% )
989608571 cache-references ( +- 1.198% )
205308558 cache-misses ( +- 0.159% )
29263096648639268 bus-cycles ( +- 0.004% )
60.003427500 seconds time elapsed ( +- 0.003% )
4.55 misses/fault (cache-misses / page-faults = 205308558 / 45122351)
[After patch 2/5] (percpu cached counter)
Performance counter stats for './multi-fault 2' (5 runs):
46997471 page-faults ( +- 0.720% )
1004100076 cache-references ( +- 0.734% )
180959964 cache-misses ( +- 0.374% )
29263437363580464 bus-cycles ( +- 0.002% )
60.003315683 seconds time elapsed ( +- 0.004% )
3.85 misses/fault
[After patch 5/5] (adds 2 more counters: swapents and lowmem)
Performance counter stats for './multi-fault 2' (5 runs):
45976947 page-faults ( +- 0.405% )
992296954 cache-references ( +- 0.860% )
183961537 cache-misses ( +- 0.473% )
29261902069414016 bus-cycles ( +- 0.002% )
60.001403261 seconds time elapsed ( +- 0.000% )
4.00 misses/fault
Just out of curiosity, this is the result when SPLIT_PTLOCKS is not enabled.
Performance counter stats for './multi-fault 2' (5 runs):
20329544 page-faults ( +- 0.795% )
1041624126 cache-references ( +- 1.153% )
160983634 cache-misses ( +- 3.188% )
29217349673892936 bus-cycles ( +- 0.035% )
60.004098210 seconds time elapsed ( +- 0.003% )
Too bad ;(
(Off topic) Why is SPLIT_PTLOCKS disabled when DEBUG_SPINLOCK=y?
Thanks,
-Kame
From: KAMEZAWA Hiroyuki <[email protected]>
Now, the per-mm statistics counters are defined by macros in sched.h.
This patch modifies them to
- define them in mm.h as inline functions
- use an array instead of macro-based name creation, to make it easier to add
  new counters.
This patch exists to reduce the size of a future patch that modifies the
implementation of the per-mm counters.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 4 -
include/linux/mm.h | 95 +++++++++++++++++++++++++++++++++++++++++++++++
include/linux/mm_types.h | 21 ++++++----
include/linux/sched.h | 54 --------------------------
kernel/fork.c | 6 +-
kernel/tsacct.c | 1
mm/filemap_xip.c | 2
mm/fremap.c | 2
mm/memory.c | 58 +++++++++++++++++-----------
mm/oom_kill.c | 4 -
mm/rmap.c | 10 ++--
mm/swapfile.c | 2
12 files changed, 161 insertions(+), 98 deletions(-)
Index: mmotm-2.6.32-Dec8/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8/include/linux/mm.h
@@ -868,6 +868,101 @@ extern int mprotect_fixup(struct vm_area
*/
int __get_user_pages_fast(unsigned long start, int nr_pages, int write,
struct page **pages);
+/*
+ * per-process(per-mm_struct) statistics.
+ */
+#if USE_SPLIT_PTLOCKS
+/*
+ * The mm counters are not protected by its page_table_lock,
+ * so must be incremented atomically.
+ */
+static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ atomic_long_set(&(mm)->counters[member], value);
+}
+
+static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+ return (unsigned long)atomic_long_read(&(mm)->counters[member]);
+}
+
+static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ atomic_long_add(value, &(mm)->counters[member]);
+}
+
+static inline void inc_mm_counter(struct mm_struct *mm, int member)
+{
+ atomic_long_inc(&(mm)->counters[member]);
+}
+
+static inline void dec_mm_counter(struct mm_struct *mm, int member)
+{
+ atomic_long_dec(&(mm)->counters[member]);
+}
+
+#else /* !USE_SPLIT_PTLOCKS */
+/*
+ * The mm counters are protected by its page_table_lock,
+ * so can be incremented directly.
+ */
+static inline void set_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ mm->counters[member] = value;
+}
+
+static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
+{
+ return mm->counters[member];
+}
+
+static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
+{
+ mm->counters[member] += value;
+}
+
+static inline void inc_mm_counter(struct mm_struct *mm, int member)
+{
+ mm->counters[member]++;
+}
+
+static inline void dec_mm_counter(struct mm_struct *mm, int member)
+{
+ mm->counters[member]--;
+}
+
+#endif /* !USE_SPLIT_PTLOCKS */
+
+#define get_mm_rss(mm) \
+ (get_mm_counter(mm, MM_FILEPAGES) + get_mm_counter(mm, MM_ANONPAGES))
+#define update_hiwater_rss(mm) do { \
+ unsigned long _rss = get_mm_rss(mm); \
+ if ((mm)->hiwater_rss < _rss) \
+ (mm)->hiwater_rss = _rss; \
+} while (0)
+#define update_hiwater_vm(mm) do { \
+ if ((mm)->hiwater_vm < (mm)->total_vm) \
+ (mm)->hiwater_vm = (mm)->total_vm; \
+} while (0)
+
+static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
+{
+ return max(mm->hiwater_rss, get_mm_rss(mm));
+}
+
+static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
+ struct mm_struct *mm)
+{
+ unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
+
+ if (*maxrss < hiwater_rss)
+ *maxrss = hiwater_rss;
+}
+
+static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
+{
+ return max(mm->hiwater_vm, mm->total_vm);
+}
/*
* A callback you can register to apply pressure to ageable caches.
Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
@@ -24,12 +24,6 @@ struct address_space;
#define USE_SPLIT_PTLOCKS (NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS)
-#if USE_SPLIT_PTLOCKS
-typedef atomic_long_t mm_counter_t;
-#else /* !USE_SPLIT_PTLOCKS */
-typedef unsigned long mm_counter_t;
-#endif /* !USE_SPLIT_PTLOCKS */
-
/*
* Each physical page in the system has a struct page associated with
* it to keep track of whatever it is we are using the page for at the
@@ -199,6 +193,18 @@ struct core_state {
struct completion startup;
};
+#if USE_SPLIT_PTLOCKS
+typedef atomic_long_t mm_counter_t;
+#else /* !USE_SPLIT_PTLOCKS */
+typedef unsigned long mm_counter_t;
+#endif /* !USE_SPLIT_PTLOCKS */
+
+enum {
+ MM_FILEPAGES,
+ MM_ANONPAGES,
+ NR_MM_COUNTERS
+};
+
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
struct rb_root mm_rb;
@@ -226,8 +232,7 @@ struct mm_struct {
/* Special counters, in some configurations protected by the
* page_table_lock, in other configurations by being atomic.
*/
- mm_counter_t _file_rss;
- mm_counter_t _anon_rss;
+ mm_counter_t counters[NR_MM_COUNTERS];
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
Index: mmotm-2.6.32-Dec8/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/sched.h
+++ mmotm-2.6.32-Dec8/include/linux/sched.h
@@ -385,60 +385,6 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct mm_struct *, unsigned long);
extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
-#if USE_SPLIT_PTLOCKS
-/*
- * The mm counters are not protected by its page_table_lock,
- * so must be incremented atomically.
- */
-#define set_mm_counter(mm, member, value) atomic_long_set(&(mm)->_##member, value)
-#define get_mm_counter(mm, member) ((unsigned long)atomic_long_read(&(mm)->_##member))
-#define add_mm_counter(mm, member, value) atomic_long_add(value, &(mm)->_##member)
-#define inc_mm_counter(mm, member) atomic_long_inc(&(mm)->_##member)
-#define dec_mm_counter(mm, member) atomic_long_dec(&(mm)->_##member)
-
-#else /* !USE_SPLIT_PTLOCKS */
-/*
- * The mm counters are protected by its page_table_lock,
- * so can be incremented directly.
- */
-#define set_mm_counter(mm, member, value) (mm)->_##member = (value)
-#define get_mm_counter(mm, member) ((mm)->_##member)
-#define add_mm_counter(mm, member, value) (mm)->_##member += (value)
-#define inc_mm_counter(mm, member) (mm)->_##member++
-#define dec_mm_counter(mm, member) (mm)->_##member--
-
-#endif /* !USE_SPLIT_PTLOCKS */
-
-#define get_mm_rss(mm) \
- (get_mm_counter(mm, file_rss) + get_mm_counter(mm, anon_rss))
-#define update_hiwater_rss(mm) do { \
- unsigned long _rss = get_mm_rss(mm); \
- if ((mm)->hiwater_rss < _rss) \
- (mm)->hiwater_rss = _rss; \
-} while (0)
-#define update_hiwater_vm(mm) do { \
- if ((mm)->hiwater_vm < (mm)->total_vm) \
- (mm)->hiwater_vm = (mm)->total_vm; \
-} while (0)
-
-static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
-{
- return max(mm->hiwater_rss, get_mm_rss(mm));
-}
-
-static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
- struct mm_struct *mm)
-{
- unsigned long hiwater_rss = get_mm_hiwater_rss(mm);
-
- if (*maxrss < hiwater_rss)
- *maxrss = hiwater_rss;
-}
-
-static inline unsigned long get_mm_hiwater_vm(struct mm_struct *mm)
-{
- return max(mm->hiwater_vm, mm->total_vm);
-}
extern void set_dumpable(struct mm_struct *mm, int value);
extern int get_dumpable(struct mm_struct *mm);
Index: mmotm-2.6.32-Dec8/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8/mm/memory.c
@@ -376,12 +376,21 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}
-static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+static inline void init_rss_vec(int *rss)
{
- if (file_rss)
- add_mm_counter(mm, file_rss, file_rss);
- if (anon_rss)
- add_mm_counter(mm, anon_rss, anon_rss);
+ int i;
+
+ for (i = 0; i < NR_MM_COUNTERS; i++)
+ rss[i] = 0;
+}
+
+static inline void add_mm_rss_vec(struct mm_struct *mm, int *rss)
+{
+ int i;
+
+ for (i = 0; i < NR_MM_COUNTERS; i++)
+ if (rss[i])
+ add_mm_counter(mm, i, rss[i]);
}
/*
@@ -632,7 +641,10 @@ copy_one_pte(struct mm_struct *dst_mm, s
if (page) {
get_page(page);
page_dup_rmap(page);
- rss[PageAnon(page)]++;
+ if (PageAnon(page))
+ rss[MM_ANONPAGES]++;
+ else
+ rss[MM_FILEPAGES]++;
}
out_set_pte:
@@ -648,11 +660,12 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[2];
+ int rss[NR_MM_COUNTERS];
swp_entry_t entry = (swp_entry_t){0};
again:
- rss[1] = rss[0] = 0;
+ init_rss_vec(rss);
+
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -688,7 +701,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(orig_src_pte);
- add_mm_rss(dst_mm, rss[0], rss[1]);
+ add_mm_rss_vec(dst_mm, rss);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -816,8 +829,9 @@ static unsigned long zap_pte_range(struc
struct mm_struct *mm = tlb->mm;
pte_t *pte;
spinlock_t *ptl;
- int file_rss = 0;
- int anon_rss = 0;
+ int rss[NR_MM_COUNTERS];
+
+ init_rss_vec(rss);
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -863,14 +877,14 @@ static unsigned long zap_pte_range(struc
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- anon_rss--;
+ rss[MM_ANONPAGES]--;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
- file_rss--;
+ rss[MM_FILEPAGES]--;
}
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
@@ -893,7 +907,7 @@ static unsigned long zap_pte_range(struc
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
- add_mm_rss(mm, file_rss, anon_rss);
+ add_mm_rss_vec(mm, rss);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
@@ -1527,7 +1541,7 @@ static int insert_page(struct vm_area_st
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter(mm, file_rss);
+ inc_mm_counter(mm, MM_FILEPAGES);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
@@ -2163,11 +2177,11 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter(mm, file_rss);
- inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter(mm, MM_ANONPAGES);
}
} else
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2600,7 +2614,7 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2684,7 +2698,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2838,10 +2852,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter(mm, file_rss);
+ inc_mm_counter(mm, MM_FILEPAGES);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
Index: mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
@@ -65,11 +65,11 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = get_mm_counter(mm, file_rss);
+ *shared = get_mm_counter(mm, MM_FILEPAGES);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = *shared + get_mm_counter(mm, anon_rss);
+ *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
return mm->total_vm;
}
Index: mmotm-2.6.32-Dec8/kernel/fork.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/kernel/fork.c
+++ mmotm-2.6.32-Dec8/kernel/fork.c
@@ -446,6 +446,8 @@ static void mm_init_aio(struct mm_struct
static struct mm_struct * mm_init(struct mm_struct * mm, struct task_struct *p)
{
+ int i;
+
atomic_set(&mm->mm_users, 1);
atomic_set(&mm->mm_count, 1);
init_rwsem(&mm->mmap_sem);
@@ -454,8 +456,8 @@ static struct mm_struct * mm_init(struct
(current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
mm->core_state = NULL;
mm->nr_ptes = 0;
- set_mm_counter(mm, file_rss, 0);
- set_mm_counter(mm, anon_rss, 0);
+ for (i = 0; i < NR_MM_COUNTERS; i++)
+ set_mm_counter(mm, i, 0);
spin_lock_init(&mm->page_table_lock);
mm->free_area_cache = TASK_UNMAPPED_BASE;
mm->cached_hole_size = ~0UL;
Index: mmotm-2.6.32-Dec8/kernel/tsacct.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/kernel/tsacct.c
+++ mmotm-2.6.32-Dec8/kernel/tsacct.c
@@ -21,6 +21,7 @@
#include <linux/tsacct_kern.h>
#include <linux/acct.h>
#include <linux/jiffies.h>
+#include <linux/mm.h>
/*
* fill in basic accounting fields
Index: mmotm-2.6.32-Dec8/mm/filemap_xip.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/filemap_xip.c
+++ mmotm-2.6.32-Dec8/mm/filemap_xip.c
@@ -194,7 +194,7 @@ retry:
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
page_cache_release(page);
Index: mmotm-2.6.32-Dec8/mm/fremap.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/fremap.c
+++ mmotm-2.6.32-Dec8/mm/fremap.c
@@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
page_remove_rmap(page);
page_cache_release(page);
update_hiwater_rss(mm);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
}
} else {
if (!pte_file(pte))
Index: mmotm-2.6.32-Dec8/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Dec8/mm/oom_kill.c
@@ -401,8 +401,8 @@ static void __oom_kill_task(struct task_
"vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
task_pid_nr(p), p->comm,
K(p->mm->total_vm),
- K(get_mm_counter(p->mm, anon_rss)),
- K(get_mm_counter(p->mm, file_rss)));
+ K(get_mm_counter(p->mm, MM_ANONPAGES)),
+ K(get_mm_counter(p->mm, MM_FILEPAGES)));
task_unlock(p);
/*
Index: mmotm-2.6.32-Dec8/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8/mm/rmap.c
@@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page,
if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
if (PageAnon(page))
- dec_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_ANONPAGES);
else
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
set_pte_at(mm, address, pte,
swp_entry_to_pte(make_hwpoison_entry(page)));
} else if (PageAnon(page)) {
@@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page,
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, MM_ANONPAGES);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
@@ -857,7 +857,7 @@ int try_to_unmap_one(struct page *page,
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
} else
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
page_remove_rmap(page);
page_cache_release(page);
@@ -996,7 +996,7 @@ static int try_to_unmap_cluster(unsigned
page_remove_rmap(page);
page_cache_release(page);
- dec_mm_counter(mm, file_rss);
+ dec_mm_counter(mm, MM_FILEPAGES);
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
Index: mmotm-2.6.32-Dec8/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8/mm/swapfile.c
@@ -840,7 +840,7 @@ static int unuse_pte(struct vm_area_stru
goto out;
}
- inc_mm_counter(vma->vm_mm, anon_rss);
+ inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
From: KAMEZAWA Hiroyuki <[email protected]>
Now, mm's counter information is updated by atomic_long_xxx() functions if
USE_SPLIT_PTLOCKS is defined. This causes cache misses when page faults happen
simultaneously on plural CPUs. (This is true of almost all process-shared objects...)
Considering more detailed accounting of per-mm page usage, one of the problems is
the cost of this counter.
This patch implements a per-cpu mm cache. This per-cpu cache is loosely
synchronized with mm's counters. The current design is:
- prepare a per-cpu object, curr_mmc. curr_mmc contains a pointer to an mm and an
  array of counters.
- At page fault,
  * if curr_mmc.mm != NULL, update the cached counters in curr_mmc.
  * if curr_mmc.mm == NULL, set curr_mmc.mm = current->mm and account 1.
- At schedule(),
  * if curr_mmc.mm != NULL, synchronize and invalidate the cached information.
  * if curr_mmc.mm == NULL, nothing to do.
By this:
- no atomic ops, which tend to cause cache misses, under the page table lock.
- mm->counters are synchronized when schedule() is called.
- no downside on the read side.
Concern:
- added cost to schedule().
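As a reference, this is a small userspace model of the three fast-path branches
described above; the per-cpu state is modeled with thread-local variables, and
all names (mm_model, curr_mm, curr_val, ...) are illustrative, not kernel APIs.

/* fast-path model: build with gcc -std=c11 */
#include <stdatomic.h>
#include <stdio.h>

struct mm_model {
	atomic_long counter;			/* stands in for mm->counters[i] */
};

static _Thread_local struct mm_model *curr_mm;	/* like curr_mmc.mm */
static _Thread_local long curr_val;		/* like curr_mmc.counters[i] */

static void add_counter_fast(struct mm_model *mm, struct mm_model *current_mm,
			     long val)
{
	if (curr_mm == mm) {			/* fast path: cache hit */
		curr_val += val;
	} else if (mm == current_mm) {		/* 1st fault in this period; the
						 * cache was flushed and cleared
						 * at the last sync point */
		curr_mm = mm;
		curr_val = val;
	} else {				/* side path (like get_user_pages()) */
		atomic_fetch_add(&mm->counter, val);
	}
}

static void sync_counters(void)			/* the schedule()-time hook */
{
	if (curr_mm) {
		atomic_fetch_add(&curr_mm->counter, curr_val);
		curr_mm = NULL;
		curr_val = 0;
	}
}

static struct mm_model the_mm;			/* zero-initialized shared "mm" */

int main(void)
{
	add_counter_fast(&the_mm, &the_mm, 1);	/* 1st fault */
	add_counter_fast(&the_mm, &the_mm, 1);	/* fast path */
	sync_counters();			/* flushed at the sync point */
	printf("%ld\n", atomic_load(&the_mm.counter));	/* prints 2 */
	return 0;
}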
Micro Benchmark:
measured the number of page faults with 2 threads on 2 sockets.
Before:
Performance counter stats for './multi-fault 2' (5 runs):
45122351 page-faults ( +- 1.125% )
989608571 cache-references ( +- 1.198% )
205308558 cache-misses ( +- 0.159% )
29263096648639268 bus-cycles ( +- 0.004% )
60.003427500 seconds time elapsed ( +- 0.003% )
After:
Performance counter stats for './multi-fault 2' (5 runs):
46997471 page-faults ( +- 0.720% )
1004100076 cache-references ( +- 0.734% )
180959964 cache-misses ( +- 0.374% )
29263437363580464 bus-cycles ( +- 0.002% )
60.003315683 seconds time elapsed ( +- 0.004% )
cache-misses/page-faults is reduced from 4.55 misses/fault to 3.85 misses/fault.
This microbenchmark doesn't reflect usual behavior (page fault -> madvise(DONTNEED)),
but reducing the cache-miss cost sounds good to me even if the gain is small.
Changelog 2009/12/09:
- loosely update curr_mmc.mm at the 1st page fault.
- removed hooks in the tick (update_process_times).
- exported curr_mmc and check curr_mmc.mm directly.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mm.h | 37 ++++++++++++++++++++++++++++
include/linux/mm_types.h | 12 +++++++++
kernel/exit.c | 3 +-
kernel/sched.c | 6 ++++
mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++-------
5 files changed, 108 insertions(+), 10 deletions(-)
Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
@@ -297,4 +297,16 @@ struct mm_struct {
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
#define mm_cpumask(mm) (&(mm)->cpu_vm_mask)
+#if USE_SPLIT_PTLOCKS
+/*
+ * percpu object used for caching thread->mm information.
+ */
+struct pcp_mm_cache {
+ struct mm_struct *mm;
+ unsigned long counters[NR_MM_COUNTERS];
+};
+
+DECLARE_PER_CPU(struct pcp_mm_cache, curr_mmc);
+#endif
+
#endif /* _LINUX_MM_TYPES_H */
Index: mmotm-2.6.32-Dec8/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8/include/linux/mm.h
@@ -883,7 +883,16 @@ static inline void set_mm_counter(struct
static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
{
- return (unsigned long)atomic_long_read(&(mm)->counters[member]);
+ long ret;
+ /*
+ * Because this counter is loosely synchronized with percpu cached
+ * information, it's possible that value gets to be minus. For user's
+ * convenience/sanity, avoid returning minus.
+ */
+ ret = atomic_long_read(&(mm)->counters[member]);
+ if (unlikely(ret < 0))
+ return 0;
+ return (unsigned long)ret;
}
static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
@@ -900,6 +909,25 @@ static inline void dec_mm_counter(struct
{
atomic_long_dec(&(mm)->counters[member]);
}
+extern void __sync_mm_counters(struct mm_struct *mm);
+/* Called under non-preemptable context, for syncing cached information */
+static inline void sync_mm_counters_atomic(void)
+{
+ struct mm_struct *mm;
+
+ mm = percpu_read(curr_mmc.mm);
+ if (mm) {
+ __sync_mm_counters(mm);
+ percpu_write(curr_mmc.mm, NULL);
+ }
+}
+/* called at thread exit */
+static inline void exit_mm_counters(void)
+{
+ preempt_disable();
+ sync_mm_counters_atomic();
+ preempt_enable();
+}
#else /* !USE_SPLIT_PTLOCKS */
/*
@@ -931,6 +959,13 @@ static inline void dec_mm_counter(struct
mm->counters[member]--;
}
+static inline void sync_mm_counters_atomic(void)
+{
+}
+
+static inline void exit_mm_counters(void)
+{
+}
#endif /* !USE_SPLIT_PTLOCKS */
#define get_mm_rss(mm) \
Index: mmotm-2.6.32-Dec8/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8/mm/memory.c
@@ -121,6 +121,50 @@ static int __init init_zero_pfn(void)
}
core_initcall(init_zero_pfn);
+#if USE_SPLIT_PTLOCKS
+
+DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
+
+void __sync_mm_counters(struct mm_struct *mm)
+{
+ struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
+ int i;
+
+ for (i = 0; i < NR_MM_COUNTERS; i++) {
+ if (mmc->counters[i] != 0) {
+ atomic_long_add(mmc->counters[i], &mm->counters[i]);
+ mmc->counters[i] = 0;
+ }
+ }
+ return;
+}
+/*
+ * This add_mm_counter_fast() works well only when it is expected that
+ * mm == current->mm. So, use of this function is limited to memory.c.
+ * add_mm_counter_fast() is called under the page table lock.
+ */
+static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
+{
+ struct mm_struct *cached = percpu_read(curr_mmc.mm);
+
+ if (likely(cached == mm)) { /* fast path */
+ percpu_add(curr_mmc.counters[member], val);
+ } else if (mm == current->mm) { /* 1st page fault in this period */
+ percpu_write(curr_mmc.mm, mm);
+ percpu_write(curr_mmc.counters[member], val);
+ } else /* page fault via side-path context (get_user_pages()) */
+ add_mm_counter(mm, member, val);
+}
+
+#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
+#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
+#else
+
+#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
+#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
+
+#endif
+
/*
* If a p?d_bad entry is found while walking page tables, report
* the error, before resetting entry to p?d_none. Usually (but
@@ -1541,7 +1585,7 @@ static int insert_page(struct vm_area_st
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
@@ -2177,11 +2221,11 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter(mm, MM_FILEPAGES);
- inc_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
}
} else
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2614,7 +2658,7 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2698,7 +2742,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2852,10 +2896,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
Index: mmotm-2.6.32-Dec8/kernel/sched.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/kernel/sched.c
+++ mmotm-2.6.32-Dec8/kernel/sched.c
@@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
trace_sched_switch(rq, prev, next);
mm = next->mm;
oldmm = prev->active_mm;
+
/*
* For paravirt, this is coupled with an exit in switch_to to
* combine the page table reload and the switch backend into
@@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
if (sched_feat(HRTICK))
hrtick_clear(rq);
+ /*
+ * sync/invaldidate per-cpu cached mm related information
+ * before taling rq->lock. (see include/linux/mm.h)
+ */
+ sync_mm_counters_atomic();
spin_lock_irq(&rq->lock);
update_rq_clock(rq);
Index: mmotm-2.6.32-Dec8/kernel/exit.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/kernel/exit.c
+++ mmotm-2.6.32-Dec8/kernel/exit.c
@@ -942,7 +942,8 @@ NORET_TYPE void do_exit(long code)
printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
current->comm, task_pid_nr(current),
preempt_count());
-
+ /* synchronize per-cpu cached mm related information before account */
+ exit_mm_counters();
acct_update_integrals(tsk);
group_dead = atomic_dec_and_test(&tsk->signal->live);
* KAMEZAWA Hiroyuki <[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, mm's counter information is updated by atomic_long_xxx()
> functions if USE_SPLIT_PTLOCKS is defined. This causes cache misses when
> page faults happen simultaneously on plural CPUs. (This is true of almost
> all process-shared objects...)
>
> Considering more detailed accounting of per-mm page usage, one of the
> problems is the cost of this counter.
I'd really like these kinds of stats available via the tool you used to
develop this patchset:
> After:
> Performance counter stats for './multi-fault 2' (5 runs):
>
> 46997471 page-faults ( +- 0.720% )
> 1004100076 cache-references ( +- 0.734% )
> 180959964 cache-misses ( +- 0.374% )
> 29263437363580464 bus-cycles ( +- 0.002% )
>
> 60.003315683 seconds time elapsed ( +- 0.004% )
>
> cache-misses/page-faults is reduced from 4.55 misses/fault to 3.85 misses/fault
I.e. why not expose these stats via perf events and counts as well,
beyond the current (rather minimal) set of MM stats perf supports
currently?
That way we'd get a _lot_ of interesting per task mm stats available via
perf stat (and maybe they can be profiled as well via perf record), and
we could perhaps avoid uglies like having to hack hooks into sched.c:
> + /*
> + * sync/invaldidate per-cpu cached mm related information
> + * before taling rq->lock. (see include/linux/mm.h)
(minor typo: s/taling/taking )
> + */
> + sync_mm_counters_atomic();
>
> spin_lock_irq(&rq->lock);
> update_rq_clock(rq);
It's not a simple task i guess since this per mm counting business has
grown its own variant which takes time to rearchitect, plus i'm sure
there's performance issues to solve if such a model is exposed via perf,
but users and developers would be _very_ well served by such
capabilities:
- clean, syscall based API available to monitor tasks, workloads and
CPUs. (or the whole system)
- sampling (profiling)
- tracing, post-process scripting via Perl plugins
etc.
Ingo
One of the frequent questions from users about memory management is
how many swap entries are used by each process, and this information will
also give some hints to the oom-killer.
Although we can count the number of swap entries per process by scanning
/proc/<pid>/smaps, this is very slow and not good for the usual process-information
tools that work like 'ps' or 'top'.
(ps and top are already slow enough..)
This patch adds a counter for swap entries to the mm counters and updates it at
each swap event. The information is exported via the /proc/<pid>/status file as
[kamezawa@bluextal ~]$ cat /proc/self/status
Name: cat
State: R (running)
Tgid: 2904
Pid: 2904
PPid: 2862
TracerPid: 0
Uid: 500 500 500 500
Gid: 500 500 500 500
FDSize: 256
Groups: 500
VmPeak: 82696 kB
VmSize: 82696 kB
VmLck: 0 kB
VmHWM: 504 kB
VmRSS: 504 kB
VmData: 172 kB
VmStk: 84 kB
VmExe: 48 kB
VmLib: 1568 kB
VmPTE: 40 kB
VmSwap: 0 kB <============== this.
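Just as a reference for consumers, a ps/top-like tool only needs to grep the
new field out of /proc/<pid>/status. A minimal sketch (assuming a kernel with
this patch applied; otherwise the field is absent and -1 is returned):

#include <stdio.h>
#include <string.h>
#include <unistd.h>

static long read_vmswap_kb(pid_t pid)
{
	char path[64], line[256];
	long kb = -1;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%d/status", (int)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;
	while (fgets(line, sizeof(line), f)) {
		if (strncmp(line, "VmSwap:", 7) == 0) {
			sscanf(line + 7, "%ld", &kb);	/* value is in kB */
			break;
		}
	}
	fclose(f);
	return kb;
}

int main(void)
{
	printf("VmSwap of self: %ld kB\n", read_vmswap_kb(getpid()));
	return 0;
}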
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 9 ++++++---
include/linux/mm_types.h | 1 +
mm/memory.c | 16 ++++++++++++----
mm/rmap.c | 3 ++-
mm/swapfile.c | 1 +
5 files changed, 22 insertions(+), 8 deletions(-)
Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
@@ -202,6 +202,7 @@ typedef unsigned long mm_counter_t;
enum {
MM_FILEPAGES,
MM_ANONPAGES,
+ MM_SWAPENTS,
NR_MM_COUNTERS
};
Index: mmotm-2.6.32-Dec8/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8/mm/memory.c
@@ -650,7 +650,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
- if (is_write_migration_entry(entry) &&
+ if (likely(!non_swap_entry(entry)))
+ rss[MM_SWAPENTS]++;
+ else if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both parent
@@ -945,9 +947,14 @@ static unsigned long zap_pte_range(struc
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else if
- (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
- print_bad_pte(vma, addr, ptent, NULL);
+ } else {
+ swp_entry_t entry = pte_to_swp_entry(ptent);
+
+ if (!non_swap_entry(entry))
+ rss[MM_SWAPENTS]--;
+ if (unlikely(!free_swap_and_cache(entry)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ }
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
@@ -2659,6 +2666,7 @@ static int do_swap_page(struct mm_struct
*/
inc_mm_counter_fast(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_SWAPENTS);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
Index: mmotm-2.6.32-Dec8/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8/mm/rmap.c
@@ -814,7 +814,7 @@ int try_to_unmap_one(struct page *page,
update_hiwater_rss(mm);
if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
- if (PageAnon(page))
+ if (PageAnon(page)) /* Not increments swapents counter */
dec_mm_counter(mm, MM_ANONPAGES);
else
dec_mm_counter(mm, MM_FILEPAGES);
@@ -840,6 +840,7 @@ int try_to_unmap_one(struct page *page,
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, MM_ANONPAGES);
+ inc_mm_counter(mm, MM_SWAPENTS);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
Index: mmotm-2.6.32-Dec8/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8/mm/swapfile.c
@@ -840,6 +840,7 @@ static int unuse_pte(struct vm_area_stru
goto out;
}
+ dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
Index: mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
@@ -16,7 +16,7 @@
void task_mem(struct seq_file *m, struct mm_struct *mm)
{
- unsigned long data, text, lib;
+ unsigned long data, text, lib, swap;
unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
/*
@@ -36,6 +36,7 @@ void task_mem(struct seq_file *m, struct
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+ swap = get_mm_counter(mm, MM_SWAPENTS);
seq_printf(m,
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
@@ -46,7 +47,8 @@ void task_mem(struct seq_file *m, struct
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
- "VmPTE:\t%8lu kB\n",
+ "VmPTE:\t%8lu kB\n"
+ "VmSwap:\t%8lu kB\n",
hiwater_vm << (PAGE_SHIFT-10),
(total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
@@ -54,7 +56,8 @@ void task_mem(struct seq_file *m, struct
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
+ (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ swap << (PAGE_SHIFT-10));
}
unsigned long task_vsize(struct mm_struct *mm)
From: KAMEZAWA Hiroyuki <[email protected]>
The final purpose of this patch is to improve oom/memory-shortage detection.
In general there are OOM cases where lowmem is exhausted. What this lowmem
means depends on the situation, but in general, a limited amount of memory
reserved for some special use is lowmem.
This patch adds an integer lowmem_zone, which is initialized to -1.
If zone_idx(zone) <= lowmem_zone, the zone is lowmem.
This patch uses the simple definition that a zone for special use is lowmem,
without taking the amount of memory into account.
For example,
- if HIGHMEM is used, NORMAL is lowmem.
- if the system has both NORMAL and DMA32, DMA32 is lowmem.
- when the system consists of only one zone, there is no lowmem.
This will be used for lowmem accounting per mm_struct, and its information
will be used by the oom-killer.
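A tiny standalone model of the rule, using an assumed zone ordering
(DMA < DMA32 < NORMAL < HIGHMEM); only the "zone index <= lowmem_zone" test
mirrors the patch, everything else is illustrative:

#include <assert.h>

enum { ZONE_DMA, ZONE_DMA32, ZONE_NORMAL, ZONE_HIGHMEM };

static int lowmem_zone = ZONE_NORMAL;	/* e.g. a host with HIGHMEM */

static int is_lowmem(int zone_idx)
{
	return zone_idx <= lowmem_zone;
}

int main(void)
{
	assert(is_lowmem(ZONE_DMA));
	assert(is_lowmem(ZONE_NORMAL));
	assert(!is_lowmem(ZONE_HIGHMEM));

	lowmem_zone = -1;		/* single-zone system: nothing is lowmem */
	assert(!is_lowmem(ZONE_DMA));
	return 0;
}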
Changelog: 2009/12/09
- stop using policy_zone and use unified definition on each config.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mm.h | 9 +++++++
mm/page_alloc.c | 62 +++++++++++++++++++++++++++++++++++++++++++++++++++++
2 files changed, 71 insertions(+)
Index: mmotm-2.6.32-Dec8/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8/include/linux/mm.h
@@ -583,6 +583,15 @@ static inline void set_page_links(struct
}
/*
+ * Check a page is in lower zone
+ */
+extern int lowmem_zone;
+static inline bool is_lowmem_page(struct page *page)
+{
+ return page_zonenum(page) <= lowmem_zone;
+}
+
+/*
* Some inline functions in vmstat.h depend on page_zone()
*/
#include <linux/vmstat.h>
Index: mmotm-2.6.32-Dec8/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Dec8/mm/page_alloc.c
@@ -2311,6 +2311,59 @@ static void zoneref_set_zone(struct zone
zoneref->zone_idx = zone_idx(zone);
}
+/* the zone is lowmem if zone_idx(zone) <= lowmem_zone */
+int lowmem_zone __read_mostly;
+/*
+ * Find out LOWMEM zone on this host. LOWMEM means a zone for special use
+ * and its size seems small and precious than other zones. For example,
+ * NORMAL zone is considered to be LOWMEM on a host which has HIGHMEM.
+ *
+ * This lowmem zone is determined by zone ordering and equipped memory layout.
+ * The amount of memory is not taken into account now.
+ */
+static void find_lowmem_zone(void)
+{
+ unsigned long pages[MAX_NR_ZONES];
+ struct zone *zone;
+ int idx;
+
+ for (idx = 0; idx < MAX_NR_ZONES; idx++)
+ pages[idx] = 0;
+ /* count the number of pages */
+ for_each_populated_zone(zone) {
+ idx = zone_idx(zone);
+ pages[idx] += zone->present_pages;
+ }
+ /* If We have HIGHMEM...we ignore ZONE_MOVABLE in this case. */
+#ifdef CONFIG_HIGHMEM
+ if (pages[ZONE_HIGHMEM]) {
+ lowmem_zone = ZONE_NORMAL;
+ return;
+ }
+#endif
+ /* If We have MOVABLE zone...which works like HIGHMEM. */
+ if (pages[ZONE_MOVABLE]) {
+ lowmem_zone = ZONE_NORMAL;
+ return;
+ }
+#ifdef CONFIG_ZONE_DMA32
+ /* If we have DMA32 and there is ZONE_NORMAL...*/
+ if (pages[ZONE_DMA32] && pages[ZONE_NORMAL]) {
+ lowmem_zone = ZONE_DMA32;
+ return;
+ }
+#endif
+#ifdef CONFIG_ZONE_DMA
+ /* If we have DMA and there is ZONE_NORMAL...*/
+ if (pages[ZONE_DMA] && pages[ZONE_NORMAL]) {
+ lowmem_zone = ZONE_DMA;
+ return;
+ }
+#endif
+ lowmem_zone = -1;
+ return;
+}
+
/*
* Builds allocation fallback zone lists.
*
@@ -2790,12 +2843,21 @@ void build_all_zonelists(void)
else
page_group_by_mobility_disabled = 0;
+ find_lowmem_zone();
+
printk("Built %i zonelists in %s order, mobility grouping %s. "
"Total pages: %ld\n",
nr_online_nodes,
zonelist_order_name[current_zonelist_order],
page_group_by_mobility_disabled ? "off" : "on",
vm_total_pages);
+
+ if (lowmem_zone >= 0)
+ printk("LOWMEM zone is detected as %s\n",
+ zone_names[lowmem_zone]);
+ else
+ printk("There are no special LOWMEM. The system seems flat\n");
+
#ifdef CONFIG_NUMA
printk("Policy zone: %s\n", zone_names[policy_zone]);
#endif
From: KAMEZAWA Hiroyuki <[email protected]>
Some cases of OOM kill are caused by memory shortage in the lowmem area. For example,
ZONE_NORMAL is exhausted on an x86-32/HIGHMEM kernel.
Now, the oom-killer has no lowmem usage information for processes and
selects victim processes based on global memory usage information.
In bad cases, this can cause chains of kills of innocent processes without
progress: an oom serial killer.
To make the oom-killer lowmem-aware, this patch adds counters for accounting
lowmem usage per process. (Patches for the oom-killer are not included here.)
Adding a counter is easy, but one concern is the cost of the new counter.
The following are the results of a micro-benchmark of parallel page faults.
A bigger page fault number indicates better scalability.
(measured in a USE_SPLIT_PTLOCKS environment)
[Before lowmem counter]
Performance counter stats for './multi-fault 2' (5 runs):
46997471 page-faults ( +- 0.720% )
1004100076 cache-references ( +- 0.734% )
180959964 cache-misses ( +- 0.374% )
29263437363580464 bus-cycles ( +- 0.002% )
60.003315683 seconds time elapsed ( +- 0.004% )
3.85 misses/fault
[After lowmem counter]
Performance counter stats for './multi-fault 2' (5 runs):
45976947 page-faults ( +- 0.405% )
992296954 cache-references ( +- 0.860% )
183961537 cache-misses ( +- 0.473% )
29261902069414016 bus-cycles ( +- 0.002% )
60.001403261 seconds time elapsed ( +- 0.000% )
4.00 misses/fault.
So, a small cost is added, but I think this is within a reasonable range.
If you have a good idea to improve this number, it's welcome.
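For reference, the "member += LOWMEM_COUNTER" trick used in this patch relies
on the lowmem variants sitting exactly LOWMEM_COUNTER slots after the regular
counters. A standalone illustration (mirroring the enum layout added to
mm_types.h in this patch; everything else is just example code):

#include <assert.h>

enum {
	MM_FILEPAGES,
	MM_ANONPAGES,
	MM_FILE_LOWPAGES,
	MM_ANON_LOWPAGES,
	MM_SWAPENTS,
	NR_MM_COUNTERS
};
#define LOWMEM_COUNTER 2

/* pick the counter slot for a page; is_lowmem stands in for is_lowmem_page() */
static int pick_counter(int member, int is_lowmem)
{
	if (is_lowmem)
		member += LOWMEM_COUNTER;
	return member;
}

int main(void)
{
	assert(pick_counter(MM_FILEPAGES, 1) == MM_FILE_LOWPAGES);
	assert(pick_counter(MM_ANONPAGES, 1) == MM_ANON_LOWPAGES);
	assert(pick_counter(MM_ANONPAGES, 0) == MM_ANONPAGES);
	return 0;
}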
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 4 +--
include/linux/mm.h | 38 +++++++++++++++++++++++++++++-
include/linux/mm_types.h | 7 +++--
mm/filemap_xip.c | 2 -
mm/fremap.c | 2 -
mm/memory.c | 59 +++++++++++++++++++++++++++++++++--------------
mm/oom_kill.c | 8 +++---
mm/rmap.c | 10 ++++---
mm/swapfile.c | 2 -
9 files changed, 100 insertions(+), 32 deletions(-)
Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
@@ -200,11 +200,14 @@ typedef unsigned long mm_counter_t;
#endif /* !USE_SPLIT_PTLOCKS */
enum {
- MM_FILEPAGES,
- MM_ANONPAGES,
+ MM_FILEPAGES, /* file's rss is MM_FILEPAGES + MM_FILE_LOWPAGES */
+ MM_ANONPAGES, /* anon's rss is MM_ANONPAGES + MM_ANON_LOWPAGES */
+ MM_FILE_LOWPAGES, /* pages from lower zones in file rss */
+ MM_ANON_LOWPAGES, /* pages from lower zones in anon rss */
MM_SWAPENTS,
NR_MM_COUNTERS
};
+#define LOWMEM_COUNTER 2
struct mm_struct {
struct vm_area_struct * mmap; /* list of VMAs */
Index: mmotm-2.6.32-Dec8/mm/memory.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/memory.c
+++ mmotm-2.6.32-Dec8/mm/memory.c
@@ -156,12 +156,26 @@ static void add_mm_counter_fast(struct m
add_mm_counter(mm, member, val);
}
-#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
-#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
+static void add_mm_counter_page_fast(struct mm_struct *mm,
+ int member, int val, struct page *page)
+{
+ if (unlikely(is_lowmem_page(page)))
+ member += LOWMEM_COUNTER;
+ return add_mm_counter_fast(mm, member, val);
+}
+
+#define inc_mm_counter_fast(mm, member, page) \
+ add_mm_counter_page_fast(mm, member, 1, page)
+#define dec_mm_counter_fast(mm, member, page) \
+ add_mm_counter_page_fast(mm, member, -1, page)
#else
-#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
-#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
+#define add_mm_counter_fast(mm, member, val) add_mm_counter(mm, member, val)
+
+#define inc_mm_counter_fast(mm, member, page)\
+ inc_mm_counter_page(mm, member, page)
+#define dec_mm_counter_fast(mm, member, page)\
+ dec_mm_counter_page(mm, member, page)
#endif
@@ -685,12 +699,17 @@ copy_one_pte(struct mm_struct *dst_mm, s
page = vm_normal_page(vma, addr, pte);
if (page) {
+ int type;
+
get_page(page);
page_dup_rmap(page);
if (PageAnon(page))
- rss[MM_ANONPAGES]++;
+ type = MM_ANONPAGES;
else
- rss[MM_FILEPAGES]++;
+ type = MM_FILEPAGES;
+ if (is_lowmem_page(page))
+ type += LOWMEM_COUNTER;
+ rss[type]++;
}
out_set_pte:
@@ -876,6 +895,7 @@ static unsigned long zap_pte_range(struc
pte_t *pte;
spinlock_t *ptl;
int rss[NR_MM_COUNTERS];
+ int type;
init_rss_vec(rss);
@@ -923,15 +943,18 @@ static unsigned long zap_pte_range(struc
set_pte_at(mm, addr, pte,
pgoff_to_pte(page->index));
if (PageAnon(page))
- rss[MM_ANONPAGES]--;
+ type = MM_ANONPAGES;
else {
if (pte_dirty(ptent))
set_page_dirty(page);
if (pte_young(ptent) &&
likely(!VM_SequentialReadHint(vma)))
mark_page_accessed(page);
- rss[MM_FILEPAGES]--;
+ type = MM_FILEPAGES;
}
+ if (is_lowmem_page(page))
+ type += LOWMEM_COUNTER;
+ rss[type]--;
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
@@ -1592,7 +1615,7 @@ static int insert_page(struct vm_area_st
/* Ok, finally just insert the thing.. */
get_page(page);
- inc_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES, page);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
@@ -2228,11 +2251,12 @@ gotten:
if (likely(pte_same(*page_table, orig_pte))) {
if (old_page) {
if (!PageAnon(old_page)) {
- dec_mm_counter_fast(mm, MM_FILEPAGES);
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ dec_mm_counter_fast(mm, MM_FILEPAGES, old_page);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, new_page);
}
} else
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, new_page);
+
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2665,8 +2689,9 @@ static int do_swap_page(struct mm_struct
* discarded at swap_free().
*/
- inc_mm_counter_fast(mm, MM_ANONPAGES);
- dec_mm_counter_fast(mm, MM_SWAPENTS);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
+ /* SWAPENTS counter is not related to page..then use bare call */
+ add_mm_counter_fast(mm, MM_SWAPENTS, -1);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2750,7 +2775,7 @@ static int do_anonymous_page(struct mm_s
if (!pte_none(*page_table))
goto release;
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2904,10 +2929,10 @@ static int __do_fault(struct mm_struct *
if (flags & FAULT_FLAG_WRITE)
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
if (anon) {
- inc_mm_counter_fast(mm, MM_ANONPAGES);
+ inc_mm_counter_fast(mm, MM_ANONPAGES, page);
page_add_new_anon_rmap(page, vma, address);
} else {
- inc_mm_counter_fast(mm, MM_FILEPAGES);
+ inc_mm_counter_fast(mm, MM_FILEPAGES, page);
page_add_file_rmap(page);
if (flags & FAULT_FLAG_WRITE) {
dirty_page = page;
Index: mmotm-2.6.32-Dec8/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/rmap.c
+++ mmotm-2.6.32-Dec8/mm/rmap.c
@@ -815,9 +815,9 @@ int try_to_unmap_one(struct page *page,
if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
if (PageAnon(page)) /* Not increments swapents counter */
- dec_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_page(mm, MM_ANONPAGES, page);
else
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
set_pte_at(mm, address, pte,
swp_entry_to_pte(make_hwpoison_entry(page)));
} else if (PageAnon(page)) {
@@ -839,7 +839,7 @@ int try_to_unmap_one(struct page *page,
list_add(&mm->mmlist, &init_mm.mmlist);
spin_unlock(&mmlist_lock);
}
- dec_mm_counter(mm, MM_ANONPAGES);
+ dec_mm_counter_page(mm, MM_ANONPAGES, page);
inc_mm_counter(mm, MM_SWAPENTS);
} else if (PAGE_MIGRATION) {
/*
@@ -858,7 +858,7 @@ int try_to_unmap_one(struct page *page,
entry = make_migration_entry(page, pte_write(pteval));
set_pte_at(mm, address, pte, swp_entry_to_pte(entry));
} else
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
page_remove_rmap(page);
page_cache_release(page);
@@ -998,6 +998,8 @@ static int try_to_unmap_cluster(unsigned
page_remove_rmap(page);
page_cache_release(page);
dec_mm_counter(mm, MM_FILEPAGES);
+ if (is_lowmem_page(page))
+ dec_mm_counter(mm, MM_FILEPAGES);
(*mapcount)--;
}
pte_unmap_unlock(pte - 1, ptl);
Index: mmotm-2.6.32-Dec8/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/swapfile.c
+++ mmotm-2.6.32-Dec8/mm/swapfile.c
@@ -841,7 +841,7 @@ static int unuse_pte(struct vm_area_stru
}
dec_mm_counter(vma->vm_mm, MM_SWAPENTS);
- inc_mm_counter(vma->vm_mm, MM_ANONPAGES);
+ inc_mm_counter_page(vma->vm_mm, MM_ANONPAGES, page);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: mmotm-2.6.32-Dec8/mm/filemap_xip.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/filemap_xip.c
+++ mmotm-2.6.32-Dec8/mm/filemap_xip.c
@@ -194,7 +194,7 @@ retry:
flush_cache_page(vma, address, pte_pfn(*pte));
pteval = ptep_clear_flush_notify(vma, address, pte);
page_remove_rmap(page);
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
BUG_ON(pte_dirty(pteval));
pte_unmap_unlock(pte, ptl);
page_cache_release(page);
Index: mmotm-2.6.32-Dec8/mm/fremap.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/fremap.c
+++ mmotm-2.6.32-Dec8/mm/fremap.c
@@ -40,7 +40,7 @@ static void zap_pte(struct mm_struct *mm
page_remove_rmap(page);
page_cache_release(page);
update_hiwater_rss(mm);
- dec_mm_counter(mm, MM_FILEPAGES);
+ dec_mm_counter_page(mm, MM_FILEPAGES, page);
}
} else {
if (!pte_file(pte))
Index: mmotm-2.6.32-Dec8/include/linux/mm.h
===================================================================
--- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
+++ mmotm-2.6.32-Dec8/include/linux/mm.h
@@ -977,8 +977,27 @@ static inline void exit_mm_counters(void
}
#endif /* !USE_SPLIT_PTLOCKS */
+static inline unsigned long get_file_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_FILEPAGES) +
+ get_mm_counter(mm, MM_FILE_LOWPAGES);
+}
+
+static inline unsigned long get_anon_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_ANONPAGES) +
+ get_mm_counter(mm, MM_ANON_LOWPAGES);
+}
+
+static inline unsigned long get_low_rss(struct mm_struct *mm)
+{
+ return get_mm_counter(mm, MM_FILE_LOWPAGES) +
+ get_mm_counter(mm, MM_ANON_LOWPAGES);
+}
+
#define get_mm_rss(mm) \
- (get_mm_counter(mm, MM_FILEPAGES) + get_mm_counter(mm, MM_ANONPAGES))
+ (get_file_rss(mm) + get_anon_rss(mm))
+
#define update_hiwater_rss(mm) do { \
unsigned long _rss = get_mm_rss(mm); \
if ((mm)->hiwater_rss < _rss) \
@@ -1008,6 +1027,23 @@ static inline unsigned long get_mm_hiwat
return max(mm->hiwater_vm, mm->total_vm);
}
+/* Utility for lowmem counting */
+static inline void
+inc_mm_counter_page(struct mm_struct *mm, int member, struct page *page)
+{
+ if (unlikely(is_lowmem_page(page)))
+ member += LOWMEM_COUNTER;
+ inc_mm_counter(mm, member);
+}
+
+static inline void
+dec_mm_counter_page(struct mm_struct *mm, int member, struct page *page)
+{
+ if (unlikely(is_lowmem_page(page)))
+ member += LOWMEM_COUNTER;
+ dec_mm_counter(mm, member);
+}
+
/*
* A callback you can register to apply pressure to ageable caches.
*
Index: mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Dec8/fs/proc/task_mmu.c
@@ -68,11 +68,11 @@ unsigned long task_vsize(struct mm_struc
int task_statm(struct mm_struct *mm, int *shared, int *text,
int *data, int *resident)
{
- *shared = get_mm_counter(mm, MM_FILEPAGES);
+ *shared = get_file_rss(mm);
*text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK))
>> PAGE_SHIFT;
*data = mm->total_vm - mm->shared_vm;
- *resident = *shared + get_mm_counter(mm, MM_ANONPAGES);
+ *resident = *shared + get_anon_rss(mm);
return mm->total_vm;
}
Index: mmotm-2.6.32-Dec8/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Dec8.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Dec8/mm/oom_kill.c
@@ -398,11 +398,13 @@ static void __oom_kill_task(struct task_
if (verbose)
printk(KERN_ERR "Killed process %d (%s) "
- "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB\n",
+ "vsz:%lukB, anon-rss:%lukB, file-rss:%lukB "
+ "lowmem %lukB\n",
task_pid_nr(p), p->comm,
K(p->mm->total_vm),
- K(get_mm_counter(p->mm, MM_ANONPAGES)),
- K(get_mm_counter(p->mm, MM_FILEPAGES)));
+ K(get_anon_rss(p->mm)),
+ K(get_file_rss(p->mm)),
+ K(get_low_rss(p->mm)));
task_unlock(p);
/*
This is the test program I used. It is tuned for 4 cores/socket.
==
/*
* multi-fault.c :: causes 60 seconds of parallel page faults in multiple threads.
* % gcc -O2 -o multi-fault multi-fault.c -lpthread
* % multi-fault <number of threads, one per socket>
*/
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>	/* atoi */
#include <pthread.h>
#include <sched.h>
#include <unistd.h>	/* sleep */
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <signal.h>
#define CORE_PER_SOCK 4
#define NR_THREADS 8
pthread_t threads[NR_THREADS];
/*
* To avoid contention on the page table lock, the FAULT area is
* sparse. If FAULT_LENGTH is too large for your CPUs, decrease it.
*/
#define MMAP_LENGTH (8 * 1024 * 1024)
#define FAULT_LENGTH (2 * 1024 * 1024)
void *mmap_area[NR_THREADS];
#define PAGE_SIZE 4096
pthread_barrier_t barrier;
int name[NR_THREADS];
void segv_handler(int sig)
{
sleep(100);
}
void *worker(void *data)
{
cpu_set_t set;
int cpu;
cpu = *(int *)data;
CPU_ZERO(&set);
CPU_SET(cpu, &set);
sched_setaffinity(0, sizeof(set), &set);
cpu /= CORE_PER_SOCK;
while (1) {
char *c;
char *start = mmap_area[cpu];
char *end = mmap_area[cpu] + FAULT_LENGTH;
pthread_barrier_wait(&barrier);
//printf("fault into %p-%p\n",start, end);
for (c = start; c < end; c += PAGE_SIZE)
*c = 0;
pthread_barrier_wait(&barrier);
madvise(start, FAULT_LENGTH, MADV_DONTNEED);
}
return NULL;
}
int main(int argc, char *argv[])
{
int i, ret;
unsigned int num;
if (argc < 2)
return 0;
num = atoi(argv[1]);
pthread_barrier_init(&barrier, NULL, num);
mmap_area[0] = mmap(NULL, MMAP_LENGTH * num, PROT_WRITE|PROT_READ,
MAP_PRIVATE | MAP_ANONYMOUS, 0, 0);
for (i = 1; i < num; i++) {
mmap_area[i] = mmap_area[i - 1]+ MMAP_LENGTH;
}
for (i = 0; i < num; ++i) {
name[i] = i * CORE_PER_SOCK;
ret = pthread_create(&threads[i], NULL, worker, &name[i]);
if (ret) {	/* pthread_create() returns an error number on failure, not -1 */
perror("pthread create");
return 0;
}
}
sleep(60);
return 0;
}
On Thu, 10 Dec 2009 08:54:54 +0100
Ingo Molnar <[email protected]> wrote:
>
> * KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, mm's counter information is updated by atomic_long_xxx()
> > functions if USE_SPLIT_PTLOCKS is defined. This causes cache misses when
> > page faults happen simultaneously on plural CPUs. (This is true of almost
> > all process-shared objects...)
> >
> > Considering more detailed accounting of per-mm page usage, one of the
> > problems is the cost of this counter.
>
> I'd really like these kinds of stats available via the tool you used to
> develop this patchset:
>
> > After:
> > Performance counter stats for './multi-fault 2' (5 runs):
> >
> > 46997471 page-faults ( +- 0.720% )
> > 1004100076 cache-references ( +- 0.734% )
> > 180959964 cache-misses ( +- 0.374% )
> > 29263437363580464 bus-cycles ( +- 0.002% )
> >
> > 60.003315683 seconds time elapsed ( +- 0.004% )
> >
> > cache-misses/page-faults is reduced from 4.55 misses/fault to 3.85 misses/fault
>
> I.e. why not expose these stats via perf events and counts as well,
> beyond the current (rather minimal) set of MM stats perf supports
> currently?
>
> That way we'd get a _lot_ of interesting per task mm stats available via
> perf stat (and maybe they can be profiled as well via perf record), and
> we could perhaps avoid uglies like having to hack hooks into sched.c:
>
As I wrote in 0/5, this is ultimately for the oom-killer, for "kernel internal use",
not for users' perf events.
- http://marc.info/?l=linux-mm&m=125714672531121&w=2
And Christoph has concerns about cache misses on this counter.
- http://archives.free.net.ph/message/20091104.191441.1098b93c.ja.html
This patch is for replacing atomic_long_add() with a percpu counter.
> > + /*
> > + * sync/invaldidate per-cpu cached mm related information
> > + * before taling rq->lock. (see include/linux/mm.h)
>
> (minor typo: s/taling/taking )
>
Oh, thanks.
> > + */
> > + sync_mm_counters_atomic();
> >
> > spin_lock_irq(&rq->lock);
> > update_rq_clock(rq);
>
> It's not a simple task i guess since this per mm counting business has
> grown its own variant which takes time to rearchitect, plus i'm sure
> there's performance issues to solve if such a model is exposed via perf,
> but users and developers would be _very_ well served by such
> capabilities:
>
> - clean, syscall based API available to monitor tasks, workloads and
> CPUs. (or the whole system)
>
> - sampling (profiling)
>
> - tracing, post-process scripting via Perl plugins
>
I'm sorry if I miss your point... are you saying we should remove all mm_counters
completely and remake them under perf? If so, wouldn't some proc files
(/proc/<pid>/statm etc.) be corrupted?
Thanks,
-Kame
* KAMEZAWA Hiroyuki <[email protected]> wrote:
> I'm sorry If I miss your point...are you saying remove all mm_counter
> completely and remake them under perf ? If so, some proc file
> (/proc/<pid>/statm etc) will be corrupted ?
No, i'm not suggesting that - i'm just suggesting that right now MM
stats are not very well suited to be exposed via perf. If we wanted to
measure/sample the information in /proc/<pid>/statm it just wouldn't be
possible. We have a few events like pagefaults and a few tracepoints as
well - but more would be possible IMO.
Ingo
On Thu, 10 Dec 2009 09:33:10 +0100
Ingo Molnar <[email protected]> wrote:
>
> * KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > I'm sorry If I miss your point...are you saying remove all mm_counter
> > completely and remake them under perf ? If so, some proc file
> > (/proc/<pid>/statm etc) will be corrupted ?
>
> No, i'm not suggesting that - i'm just suggesting that right now MM
> stats are not very well suited to be exposed via perf. If we wanted to
> measure/sample the information in /proc/<pid>/statm it just wouldnt be
> possible. We have a few events like pagefaults and a few tracepoints as
> well - but more would be possible IMO.
>
Ah, ok. More events would be useful.
This patch itself is for reducing (not increasing) cache misses in the page fault path.
And the counters I'll add are for task monitoring, like ps or top, and for improving
the OOM killer - not for counting events, but for showing current _usage_ to users
via procfs or to the oom killer.
I'll continue to make an effort to find a better synchronization scheme
rather than adding a hook to schedule(), but...
Thanks,
-Kame
On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
> This patch modifies it to
> - Define them in mm.h as inline functions
> - Use array instead of macro's name creation. For making easier to add
> new coutners.
Reviewed-by: Christoph Lameter <[email protected]>
> @@ -454,8 +456,8 @@ static struct mm_struct * mm_init(struct
> (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
> mm->core_state = NULL;
> mm->nr_ptes = 0;
> - set_mm_counter(mm, file_rss, 0);
> - set_mm_counter(mm, anon_rss, 0);
> + for (i = 0; i < NR_MM_COUNTERS; i++)
> + set_mm_counter(mm, i, 0);
memset? Or add a clear_mm_counter function? This also occurred earlier in
init_rss_vec().
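Something like this, perhaps (just a sketch; the helper name is made up):

	static inline void mm_clear_counters(struct mm_struct *mm)
	{
		int i;

		for (i = 0; i < NR_MM_COUNTERS; i++)
			set_mm_counter(mm, i, 0);
	}

Then mm_init() and init_rss_vec() could share that instead of open coding the loop.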
On Thu, 10 Dec 2009, Ingo Molnar wrote:
> I.e. why not expose these stats via perf events and counts as well,
> beyond the current (rather minimal) set of MM stats perf supports
> currently?
Certainly one can write perf events that do a similar thing but that is
beyond the scope of the work here. Those numbers are just the result of a test
program. The point here is to avoid page fault regressions while introducing new
process-specific counters that are then used by other VM code to make decisions
about a process.
On Thu, 10 Dec 2009, Ingo Molnar wrote:
>
> No, i'm not suggesting that - i'm just suggesting that right now MM
> stats are not very well suited to be exposed via perf. If we wanted to
> measure/sample the information in /proc/<pid>/statm it just wouldnt be
> possible. We have a few events like pagefaults and a few tracepoints as
> well - but more would be possible IMO.
Vital MM stats are exposed via /proc/<pid> interfaces. Performance
monitoring is something optional; the MM/VM stats are used for VM decisions on
memory and process handling.
* Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, Ingo Molnar wrote:
>
> >
> > No, i'm not suggesting that - i'm just suggesting that right now MM
> > stats are not very well suited to be exposed via perf. If we wanted to
> > measure/sample the information in /proc/<pid>/statm it just wouldnt be
> > possible. We have a few events like pagefaults and a few tracepoints as
> > well - but more would be possible IMO.
>
> vital MM stats are exposed via /proc/<pid> interfaces. Performance
> monitoring is something optional MM VM stats are used for VM decision
> on memory and process handling.
You list a few facts here but what is your point?
Ingo
On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
> Now, mm's counter information is updated by atomic_long_xxx() functions if
> USE_SPLIT_PTLOCKS is defined. This causes cache-miss when page faults happens
> simultaneously in prural cpus. (Almost all process-shared objects is...)
s/prural cpus/multiple cpus simultaneously/?
> This patch implements per-cpu mm cache. This per-cpu cache is loosely
> synchronized with mm's counter. Current design is..
Some more explanation about the role of the per cpu data would be useful.
For each cpu we keep a set of counters that can be incremented using per
cpu operations. curr_mc points to the mm struct that is currently using
the per cpu counters on a specific cpu?
> - prepare per-cpu object curr_mmc. curr_mmc containes pointer to mm and
> array of counters.
> - At page fault,
> * if curr_mmc.mm != NULL, update curr_mmc.mm counter.
> * if curr_mmc.mm == NULL, fill curr_mmc.mm = current->mm and account 1.
> - At schedule()
> * if curr_mm.mm != NULL, synchronize and invalidate cached information.
> * if curr_mmc.mm == NULL, nothing to do.
Sounds like a very good idea that could be expanded and used for other
things, like tracking the amount of memory used on a specific NUMA node, in
the future. Through that we may get to a scheduler that can schedule with
awareness of where the memory of a process is actually located.
> By this.
> - no atomic ops, which tends to cache-miss, under page table lock.
> - mm->counters are synchronized when schedule() is called.
> - No bad thing to read-side.
>
> Concern:
> - added cost to schedule().
That is only a simple check right? Are we already touching that cacheline
in schedule? Or could that structure be placed near other stuff touched by the
scheduler?
>
> +#if USE_SPLIT_PTLOCKS
> +
> +DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +
> +void __sync_mm_counters(struct mm_struct *mm)
> +{
> + struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
> + int i;
> +
> + for (i = 0; i < NR_MM_COUNTERS; i++) {
> + if (mmc->counters[i] != 0) {
Omit != 0?
if you change mmc->curr_mc then there is no need to set mmc->counters[0]
to zero right? add_mm_counter_fast will set the counter to 1 next?
> +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> +{
> + struct mm_struct *cached = percpu_read(curr_mmc.mm);
> +
> + if (likely(cached == mm)) { /* fast path */
> + percpu_add(curr_mmc.counters[member], val);
> + } else if (mm == current->mm) { /* 1st page fault in this period */
> + percpu_write(curr_mmc.mm, mm);
> + percpu_write(curr_mmc.counters[member], val);
> + } else /* page fault via side-path context (get_user_pages()) */
> + add_mm_counter(mm, member, val);
So get_user_pages() will not be accelerated.
> Index: mmotm-2.6.32-Dec8/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> +++ mmotm-2.6.32-Dec8/kernel/sched.c
> @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> trace_sched_switch(rq, prev, next);
> mm = next->mm;
> oldmm = prev->active_mm;
> +
> /*
> * For paravirt, this is coupled with an exit in switch_to to
> * combine the page table reload and the switch backend into
Extraneous new line.
> @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
>
> if (sched_feat(HRTICK))
> hrtick_clear(rq);
> + /*
> + * sync/invaldidate per-cpu cached mm related information
> + * before taling rq->lock. (see include/linux/mm.h)
> + */
> + sync_mm_counters_atomic();
>
> spin_lock_irq(&rq->lock);
> update_rq_clock(rq);
Could the per cpu counter stuff be placed into rq to avoid
touching another cacheline?
On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
> Index: mmotm-2.6.32-Dec8/mm/rmap.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/mm/rmap.c
> +++ mmotm-2.6.32-Dec8/mm/rmap.c
> @@ -814,7 +814,7 @@ int try_to_unmap_one(struct page *page,
> update_hiwater_rss(mm);
>
> if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> - if (PageAnon(page))
> + if (PageAnon(page)) /* Not increments swapents counter */
> dec_mm_counter(mm, MM_ANONPAGES);
Remove the comment. It's not helping.
Reviewed-by: Christoph Lameter <[email protected]>
On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
> This patch adds an integer lowmem_zone, which is initialized to -1.
> If zone_idx(zone) <= lowmem_zone, the zone is lowmem.
There is already a policy_zone in mempolicy.h. A zone is lowmem if its zone
number is lower than policy_zone. Can we avoid adding another zone
limiter?
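Just to illustrate the difference, a sketch (not from the patches; the helper
name is made up):

	static inline int is_lowmem_zone(struct zone *zone)
	{
		return zone_idx(zone) <= lowmem_zone;	/* this patchset */
	     /* return zone_idx(zone) <  policy_zone;	   reusing mempolicy's limiter */
	}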
On Thu, 10 Dec 2009, Ingo Molnar wrote:
>
> * Christoph Lameter <[email protected]> wrote:
>
> > On Thu, 10 Dec 2009, Ingo Molnar wrote:
> >
> > >
> > > No, i'm not suggesting that - i'm just suggesting that right now MM
> > > stats are not very well suited to be exposed via perf. If we wanted to
> > > measure/sample the information in /proc/<pid>/statm it just wouldnt be
> > > possible. We have a few events like pagefaults and a few tracepoints as
> > > well - but more would be possible IMO.
> >
> > vital MM stats are exposed via /proc/<pid> interfaces. Performance
> > monitoring is something optional MM VM stats are used for VM decision
> > on memory and process handling.
>
> You list a few facts here but what is your point?
The stats are exposed already in a well defined way. Exposing via perf is
outside of the scope of his work.
* Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, Ingo Molnar wrote:
>
> >
> > * Christoph Lameter <[email protected]> wrote:
> >
> > > On Thu, 10 Dec 2009, Ingo Molnar wrote:
> > >
> > > > No, i'm not suggesting that - i'm just suggesting that right now
> > > > MM stats are not very well suited to be exposed via perf. If we
> > > > wanted to measure/sample the information in /proc/<pid>/statm it
> > > > just wouldnt be possible. We have a few events like pagefaults
> > > > and a few tracepoints as well - but more would be possible IMO.
> > >
> > > vital MM stats are exposed via /proc/<pid> interfaces. Performance
> > > monitoring is something optional MM VM stats are used for VM
> > > decision on memory and process handling.
> >
> > You list a few facts here but what is your point?
>
> The stats are exposed already in a well defined way. [...]
They are exposed in a well defined but limited way: you cannot profile
based on those stats, you cannot measure them across a workload
transparently at precise task boundaries and you cannot trace based on
those stats.
For example, just via the simple page fault events we can today do
things like:
aldebaran:~> perf stat -e minor-faults /bin/bash -c "echo hello"
hello
Performance counter stats for '/bin/bash -c echo hello':
292 minor-faults
0.000884744 seconds time elapsed
aldebaran:~> perf record -e minor-faults -c 1 -f -g firefox
Error: cannot open display: :0
[ perf record: Woken up 3 times to write data ]
[ perf record: Captured and wrote 0.324 MB perf.data (~14135 samples) ]
aldebaran:~> perf report
no symbols found in /bin/sed, maybe install a debug package?
# Samples: 5312
#
# Overhead Command Shared Object Symbol
# ........ .............. ........................................ ......
#
    12.54%  firefox  ld-2.10.90.so             [.] _dl_relocate_object
|
--- _dl_relocate_object
dl_open_worker
_dl_catch_error
dlopen_doit
0x7fffdf8c6562
0x68733d54524f5053
     4.95%  firefox  libc-2.10.90.so           [.] __GI_memset
|
--- __GI_memset
...
I.e. 12.54% of the pagefaults in the firefox startup occur in the
dlopen_doit()->_dl_catch_error()->dl_open_worker()->_dl_relocate_object()
call path. 4.95% happen in __GI_memset() - etc.
> [...] Exposing via perf is outside of the scope of his work.
Please give some thought to intelligent instrumentation solutions, and
please think "outside of the scope" of your usual routine.
Thanks,
Ingo
On Thu, 10 Dec 2009 11:30:46 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > This patch modifies it to
> > - Define them in mm.h as inline functions
> > - Use array instead of macro's name creation. For making easier to add
> > new coutners.
>
> Reviewed-by: Christoph Lameter <[email protected]>
>
> > @@ -454,8 +456,8 @@ static struct mm_struct * mm_init(struct
> > (current->mm->flags & MMF_INIT_MASK) : default_dump_filter;
> > mm->core_state = NULL;
> > mm->nr_ptes = 0;
> > - set_mm_counter(mm, file_rss, 0);
> > - set_mm_counter(mm, anon_rss, 0);
> > + for (i = 0; i < NR_MM_COUNTERS; i++)
> > + set_mm_counter(mm, i, 0);
>
>
> memset? Or add a clear_mm_counter function? This also occurred earlier in
> init_rss_vec().
>
Ok, I'll try some cleaner code.
Thanks,
-Kame
On Fri, Dec 11, 2009 at 2:30 AM, Christoph Lameter
<[email protected]> wrote:
> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
>> This patch modifies it to
>> - Define them in mm.h as inline functions
>> - Use array instead of macro's name creation. For making easier to add
>> new coutners.
>
> Reviewed-by: Christoph Lameter <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
Except for what Christoph pointed out, it looks good to me.
--
Kind regards,
Minchan Kim
On Thu, 10 Dec 2009 19:54:59 +0100
Ingo Molnar <[email protected]> wrote:
> > [...] Exposing via perf is outside of the scope of his work.
>
> Please make thoughts about intelligent instrumentation solutions, and
> please think "outside of the scope" of your usual routine.
>
I'm sorry that I don't fully understand your suggestion...
This patch is for _usage_ counters (which can increase/decrease and can be
updated in a batched manner), but you are not talking about usage counters,
rather about the lack of (useful) _event_ counters in the page fault path.
If so, yes, I agree that the current events are not enough.
If not, hmm?
More event counters I can think of around mm/page-fault are the following:
- fault to new anon pages
  + a new anon page is from a remote node.
- fault to file-backed area
  + a file page is from a remote node.
- copy_on_write
  + a new anon page is from a remote node.
  + copy-on-write to the zero page.
- make page write (make page dirty)
- search vma (find_vma() is called and goes into the rb-tree lookup)
- swap-in (necessary?)
- get_user_pages() is called to snoop another process's memory.
I wonder whether adding an event and inserting perf_sw_event(PERF_COUNT_SW....)
is enough for adding event counters... but is there good documentation of
this hook?
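For reference, the only in-tree example I know of is the page-fault one in the
arch fault handler; from memory (so the exact arguments may differ) it looks
roughly like this:

	/* e.g. in arch/x86/mm/fault.c */
	perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, 0, regs, address);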
Thanks,
-Kame
Thank you for the review.
On Thu, 10 Dec 2009 11:51:24 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > Now, mm's counter information is updated by atomic_long_xxx() functions if
> > USE_SPLIT_PTLOCKS is defined. This causes cache-miss when page faults happens
> > simultaneously in prural cpus. (Almost all process-shared objects is...)
>
> s/prural cpus/multiple cpus simultaneously/?
>
Ah, I see. I often make this mistake, sorry.
> > This patch implements per-cpu mm cache. This per-cpu cache is loosely
> > synchronized with mm's counter. Current design is..
>
> Some more explanation about the role of the per cpu data would be useful.
>
I see.
> For each cpu we keep a set of counters that can be incremented using per
> cpu operations. curr_mc points to the mm struct that is currently using
> the per cpu counters on a specific cpu?
>
Yes, precisely. The per-cpu curr_mmc.mm points to the mm_struct of the current
thread if a page fault has occurred since the last schedule().
> > - prepare per-cpu object curr_mmc. curr_mmc containes pointer to mm and
> > array of counters.
> > - At page fault,
> > * if curr_mmc.mm != NULL, update curr_mmc.mm counter.
> > * if curr_mmc.mm == NULL, fill curr_mmc.mm = current->mm and account 1.
> > - At schedule()
> > * if curr_mm.mm != NULL, synchronize and invalidate cached information.
> > * if curr_mmc.mm == NULL, nothing to do.
>
> Sounds like a very good idea that could be expanded and used for other
> things like tracking the amount of memory used on a specific NUMA node in
> the future. Through that we may get to a schedule that can schedule with
> an awareness where the memory of a process is actually located.
>
Hmm. Expanding it to per-node stats?
> > By this.
> > - no atomic ops, which tends to cache-miss, under page table lock.
> > - mm->counters are synchronized when schedule() is called.
> > - No bad thing to read-side.
> >
> > Concern:
> > - added cost to schedule().
>
> That is only a simple check right?
yes.
> Are we already touching that cacheline in schedule?
0000000000010040 l O .data.percpu 0000000000000050 vmstat_work
00000000000100a0 g O .data.percpu 0000000000000030 curr_mmc
00000000000100e0 l O .data.percpu 0000000000000030 vmap_block_queue
Hmm...not touched unless a page fault occurs.
> Or place that structure near other stuff touched by the scheduer?
>
I'll think about that.
> >
> > +#if USE_SPLIT_PTLOCKS
> > +
> > +DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> > +
> > +void __sync_mm_counters(struct mm_struct *mm)
> > +{
> > + struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
> > + int i;
> > +
> > + for (i = 0; i < NR_MM_COUNTERS; i++) {
> > + if (mmc->counters[i] != 0) {
>
> Omit != 0?
>
> if you change mmc->curr_mc then there is no need to set mmc->counters[0]
> to zero right? add_mm_counter_fast will set the counter to 1 next?
>
Yes, I can omit that.
> > +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> > +{
> > + struct mm_struct *cached = percpu_read(curr_mmc.mm);
> > +
> > + if (likely(cached == mm)) { /* fast path */
> > + percpu_add(curr_mmc.counters[member], val);
> > + } else if (mm == current->mm) { /* 1st page fault in this period */
> > + percpu_write(curr_mmc.mm, mm);
> > + percpu_write(curr_mmc.counters[member], val);
> > + } else /* page fault via side-path context (get_user_pages()) */
> > + add_mm_counter(mm, member, val);
>
> So get_user pages will not be accellerated.
>
Yes, but I guess it's not a fast path. I'll mention that in the patch description.
> > Index: mmotm-2.6.32-Dec8/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> > +++ mmotm-2.6.32-Dec8/kernel/sched.c
> > @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> > trace_sched_switch(rq, prev, next);
> > mm = next->mm;
> > oldmm = prev->active_mm;
> > +
> > /*
> > * For paravirt, this is coupled with an exit in switch_to to
> > * combine the page table reload and the switch backend into
>
> Extraneous new line.
>
will fix.
> > @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
> >
> > if (sched_feat(HRTICK))
> > hrtick_clear(rq);
> > + /*
> > + * sync/invaldidate per-cpu cached mm related information
> > + * before taling rq->lock. (see include/linux/mm.h)
> > + */
> > + sync_mm_counters_atomic();
> >
> > spin_lock_irq(&rq->lock);
> > update_rq_clock(rq);
>
> Could the per cpu counter stuff be placed into rq to avoid
> touching another cacheline?
>
I will try and check how it can be done without annoying people.
Thanks,
-Kame
On Thu, 10 Dec 2009 11:55:25 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > Index: mmotm-2.6.32-Dec8/mm/rmap.c
> > ===================================================================
> > --- mmotm-2.6.32-Dec8.orig/mm/rmap.c
> > +++ mmotm-2.6.32-Dec8/mm/rmap.c
> > @@ -814,7 +814,7 @@ int try_to_unmap_one(struct page *page,
> > update_hiwater_rss(mm);
> >
> > if (PageHWPoison(page) && !(flags & TTU_IGNORE_HWPOISON)) {
> > - if (PageAnon(page))
> > + if (PageAnon(page)) /* Not increments swapents counter */
> > dec_mm_counter(mm, MM_ANONPAGES);
>
> Remove comment. Its not helping.
>
ok.
> Reviewed-by: Christoph Lameter <[email protected]>
Thank you,
-Kame
On Thu, 10 Dec 2009 11:59:11 -0600 (CST)
Christoph Lameter <[email protected]> wrote:
> On Thu, 10 Dec 2009, KAMEZAWA Hiroyuki wrote:
>
> > This patch adds an integer lowmem_zone, which is initialized to -1.
> > If zone_idx(zone) <= lowmem_zone, the zone is lowmem.
>
> There is already a policy_zone in mempolicy.h. lowmem is if the zone
> number is lower than policy_zone. Can we avoid adding another zone
> limiter?
>
My previous version (one month ago) did that. In this set, I tried to use a
unified approach for all CONFIG_NUMA/HIGHMEM/flat configurations.
Hmm, how about adding the following kind of patch after this
#define policy_zone (lowmem_zone + 1)
and removing the policy_zone variable? I think the name "policy_zone" implies
"this is for mempolicy, NUMA" and I don't think it is a good name for generic use.
Thanks,
-Kame
Hi, Kame.
It looks better than the older one. :)
On Thu, Dec 10, 2009 at 4:34 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, mm's counter information is updated by atomic_long_xxx() functions if
> USE_SPLIT_PTLOCKS is defined. This causes cache-miss when page faults happens
> simultaneously in prural cpus. (Almost all process-shared objects is...)
>
> Considering accounting per-mm page usage more, one of problems is cost of
> this counter.
>
> This patch implements per-cpu mm cache. This per-cpu cache is loosely
> synchronized with mm's counter. Current design is..
>
> - prepare per-cpu object curr_mmc. curr_mmc containes pointer to mm and
> array of counters.
> - At page fault,
> * if curr_mmc.mm != NULL, update curr_mmc.mm counter.
> * if curr_mmc.mm == NULL, fill curr_mmc.mm = current->mm and account 1.
> - At schedule()
> * if curr_mm.mm != NULL, synchronize and invalidate cached information.
> * if curr_mmc.mm == NULL, nothing to do.
>
> By this.
> - no atomic ops, which tends to cache-miss, under page table lock.
> - mm->counters are synchronized when schedule() is called.
> - No bad thing to read-side.
>
> Concern:
> - added cost to schedule().
>
> Micro Benchmark:
> measured the number of page faults with 2 threads on 2 sockets.
>
> Before:
> Performance counter stats for './multi-fault 2' (5 runs):
>
> 45122351 page-faults ( +- 1.125% )
> 989608571 cache-references ( +- 1.198% )
> 205308558 cache-misses ( +- 0.159% )
> 29263096648639268 bus-cycles ( +- 0.004% )
>
> 60.003427500 seconds time elapsed ( +- 0.003% )
>
> After:
> Performance counter stats for './multi-fault 2' (5 runs):
>
> 46997471 page-faults ( +- 0.720% )
> 1004100076 cache-references ( +- 0.734% )
> 180959964 cache-misses ( +- 0.374% )
> 29263437363580464 bus-cycles ( +- 0.002% )
>
> 60.003315683 seconds time elapsed ( +- 0.004% )
>
> cachemiss/page faults is reduced from 4.55 miss/faults to be 3.85miss/faults
>
> This microbenchmark doesn't reflect usual behavior (page fault -> madvise(DONTNEED)),
> but reducing the cache-miss cost sounds good to me even if the gain is very small.
>
> Changelog 2009/12/09:
> - loosely update curr_mmc.mm at the 1st page fault.
> - removed hooks in tick.(update_process_times)
> - exported curr_mmc and check curr_mmc.mm directly.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/mm.h | 37 ++++++++++++++++++++++++++++
> include/linux/mm_types.h | 12 +++++++++
> kernel/exit.c | 3 +-
> kernel/sched.c | 6 ++++
> mm/memory.c | 60 ++++++++++++++++++++++++++++++++++++++++-------
> 5 files changed, 108 insertions(+), 10 deletions(-)
>
> Index: mmotm-2.6.32-Dec8/include/linux/mm_types.h
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/include/linux/mm_types.h
> +++ mmotm-2.6.32-Dec8/include/linux/mm_types.h
> @@ -297,4 +297,16 @@ struct mm_struct {
> /* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
> #define mm_cpumask(mm) (&(mm)->cpu_vm_mask)
>
> +#if USE_SPLIT_PTLOCKS
> +/*
> + * percpu object used for caching thread->mm information.
> + */
> +struct pcp_mm_cache {
> + struct mm_struct *mm;
> + unsigned long counters[NR_MM_COUNTERS];
> +};
> +
> +DECLARE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +#endif
> +
> #endif /* _LINUX_MM_TYPES_H */
> Index: mmotm-2.6.32-Dec8/include/linux/mm.h
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/include/linux/mm.h
> +++ mmotm-2.6.32-Dec8/include/linux/mm.h
> @@ -883,7 +883,16 @@ static inline void set_mm_counter(struct
>
> static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> {
> - return (unsigned long)atomic_long_read(&(mm)->counters[member]);
> + long ret;
> + /*
> + * Because this counter is loosely synchronized with percpu cached
> + * information, it's possible that value gets to be minus. For user's
> + * convenience/sanity, avoid returning minus.
> + */
> + ret = atomic_long_read(&(mm)->counters[member]);
> + if (unlikely(ret < 0))
> + return 0;
> + return (unsigned long)ret;
> }
Now, your sync point is only at task-switch time.
So we can't show exact numbers if a lot of mm counting happens
in a short time (i.e., before a context switch).
Doesn't that matter?
>
> static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
> @@ -900,6 +909,25 @@ static inline void dec_mm_counter(struct
> {
> atomic_long_dec(&(mm)->counters[member]);
> }
> +extern void __sync_mm_counters(struct mm_struct *mm);
> +/* Called under non-preemptable context, for syncing cached information */
> +static inline void sync_mm_counters_atomic(void)
> +{
> + struct mm_struct *mm;
> +
> + mm = percpu_read(curr_mmc.mm);
> + if (mm) {
> + __sync_mm_counters(mm);
> + percpu_write(curr_mmc.mm, NULL);
> + }
> +}
> +/* called at thread exit */
> +static inline void exit_mm_counters(void)
> +{
> + preempt_disable();
> + sync_mm_counters_atomic();
> + preempt_enable();
> +}
>
> #else /* !USE_SPLIT_PTLOCKS */
> /*
> @@ -931,6 +959,13 @@ static inline void dec_mm_counter(struct
> mm->counters[member]--;
> }
>
> +static inline void sync_mm_counters_atomic(void)
> +{
> +}
> +
> +static inline void exit_mm_counters(void)
> +{
> +}
> #endif /* !USE_SPLIT_PTLOCKS */
>
> #define get_mm_rss(mm) \
> Index: mmotm-2.6.32-Dec8/mm/memory.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/mm/memory.c
> +++ mmotm-2.6.32-Dec8/mm/memory.c
> @@ -121,6 +121,50 @@ static int __init init_zero_pfn(void)
> }
> core_initcall(init_zero_pfn);
>
> +#if USE_SPLIT_PTLOCKS
> +
> +DEFINE_PER_CPU(struct pcp_mm_cache, curr_mmc);
> +
> +void __sync_mm_counters(struct mm_struct *mm)
> +{
> + struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
> + int i;
> +
> + for (i = 0; i < NR_MM_COUNTERS; i++) {
The cost depends on NR_MM_COUNTERS.
Now it's low, but we might add more counters to pcp_mm_cache later.
Then, if none of the many counters has changed, we shouldn't loop over them
unnecessarily; we could avoid this with a change flag in pcp_mm_cache.
But the compare/update overhead of a change flag is also ugly, so it would be
rather overkill for now. How about leaving a NOTE?
/*
 * NOTE: we have to rethink how to reduce the overhead if we start to
 * add many counters to pcp_mm_cache.
 */
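Just to illustrate the idea, a rough sketch of such a change-flag variant; the
'dirty' bitmask and the percpu_or() call are made up for illustration and are
not in the patch:

	struct pcp_mm_cache {
		struct mm_struct *mm;
		unsigned long dirty;	/* bitmask of counters touched this period */
		unsigned long counters[NR_MM_COUNTERS];
	};

	/* add_mm_counter_fast() would additionally do something like
	 *	percpu_or(curr_mmc.dirty, 1UL << member);
	 */

	void __sync_mm_counters(struct mm_struct *mm)
	{
		struct pcp_mm_cache *mmc = &per_cpu(curr_mmc, smp_processor_id());
		int i;

		for (i = 0; i < NR_MM_COUNTERS; i++) {
			if (!(mmc->dirty & (1UL << i)))
				continue;	/* skip untouched counters */
			atomic_long_add(mmc->counters[i], &mm->counters[i]);
			mmc->counters[i] = 0;
		}
		mmc->dirty = 0;
	}

As said above, probably overkill while NR_MM_COUNTERS stays small.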
> + if (mmc->counters[i] != 0) {
> + atomic_long_add(mmc->counters[i], &mm->counters[i]);
> + mmc->counters[i] = 0;
> + }
> + }
> + return;
> +}
> +/*
> + * This add_mm_counter_fast() works well only when it's expexted that
expexted => expected :)
> + * mm == current->mm. So, use of this function is limited under memory.c
> + * This add_mm_counter_fast() is called under page table lock.
> + */
> +static void add_mm_counter_fast(struct mm_struct *mm, int member, int val)
> +{
> + struct mm_struct *cached = percpu_read(curr_mmc.mm);
> +
> + if (likely(cached == mm)) { /* fast path */
> + percpu_add(curr_mmc.counters[member], val);
> + } else if (mm == current->mm) { /* 1st page fault in this period */
> + percpu_write(curr_mmc.mm, mm);
> + percpu_write(curr_mmc.counters[member], val);
> + } else /* page fault via side-path context (get_user_pages()) */
> + add_mm_counter(mm, member, val);
> +}
> +
> +#define inc_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, 1)
> +#define dec_mm_counter_fast(mm, member) add_mm_counter_fast(mm, member, -1)
> +#else
> +
> +#define inc_mm_counter_fast(mm, member) inc_mm_counter(mm, member)
> +#define dec_mm_counter_fast(mm, member) dec_mm_counter(mm, member)
> +
> +#endif
> +
> /*
> * If a p?d_bad entry is found while walking page tables, report
> * the error, before resetting entry to p?d_none. Usually (but
> @@ -1541,7 +1585,7 @@ static int insert_page(struct vm_area_st
>
> /* Ok, finally just insert the thing.. */
> get_page(page);
> - inc_mm_counter(mm, MM_FILEPAGES);
> + inc_mm_counter_fast(mm, MM_FILEPAGES);
> page_add_file_rmap(page);
> set_pte_at(mm, addr, pte, mk_pte(page, prot));
>
> @@ -2177,11 +2221,11 @@ gotten:
> if (likely(pte_same(*page_table, orig_pte))) {
> if (old_page) {
> if (!PageAnon(old_page)) {
> - dec_mm_counter(mm, MM_FILEPAGES);
> - inc_mm_counter(mm, MM_ANONPAGES);
> + dec_mm_counter_fast(mm, MM_FILEPAGES);
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> }
> } else
> - inc_mm_counter(mm, MM_ANONPAGES);
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> flush_cache_page(vma, address, pte_pfn(orig_pte));
> entry = mk_pte(new_page, vma->vm_page_prot);
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> @@ -2614,7 +2658,7 @@ static int do_swap_page(struct mm_struct
> * discarded at swap_free().
> */
>
> - inc_mm_counter(mm, MM_ANONPAGES);
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> pte = mk_pte(page, vma->vm_page_prot);
> if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
> pte = maybe_mkwrite(pte_mkdirty(pte), vma);
> @@ -2698,7 +2742,7 @@ static int do_anonymous_page(struct mm_s
> if (!pte_none(*page_table))
> goto release;
>
> - inc_mm_counter(mm, MM_ANONPAGES);
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> page_add_new_anon_rmap(page, vma, address);
> setpte:
> set_pte_at(mm, address, page_table, entry);
> @@ -2852,10 +2896,10 @@ static int __do_fault(struct mm_struct *
> if (flags & FAULT_FLAG_WRITE)
> entry = maybe_mkwrite(pte_mkdirty(entry), vma);
> if (anon) {
> - inc_mm_counter(mm, MM_ANONPAGES);
> + inc_mm_counter_fast(mm, MM_ANONPAGES);
> page_add_new_anon_rmap(page, vma, address);
> } else {
> - inc_mm_counter(mm, MM_FILEPAGES);
> + inc_mm_counter_fast(mm, MM_FILEPAGES);
> page_add_file_rmap(page);
> if (flags & FAULT_FLAG_WRITE) {
> dirty_page = page;
> Index: mmotm-2.6.32-Dec8/kernel/sched.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> +++ mmotm-2.6.32-Dec8/kernel/sched.c
> @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> trace_sched_switch(rq, prev, next);
> mm = next->mm;
> oldmm = prev->active_mm;
> +
> /*
> * For paravirt, this is coupled with an exit in switch_to to
> * combine the page table reload and the switch backend into
> @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
>
> if (sched_feat(HRTICK))
> hrtick_clear(rq);
> + /*
> + * sync/invaldidate per-cpu cached mm related information
> + * before taling rq->lock. (see include/linux/mm.h)
taling => taking
> + */
> + sync_mm_counters_atomic();
This is my concern above:
before the process is scheduled out, we could get stale info.
Is that not a realistic problem?
>
> spin_lock_irq(&rq->lock);
> update_rq_clock(rq);
> Index: mmotm-2.6.32-Dec8/kernel/exit.c
> ===================================================================
> --- mmotm-2.6.32-Dec8.orig/kernel/exit.c
> +++ mmotm-2.6.32-Dec8/kernel/exit.c
> @@ -942,7 +942,8 @@ NORET_TYPE void do_exit(long code)
> printk(KERN_INFO "note: %s[%d] exited with preempt_count %d\n",
> current->comm, task_pid_nr(current),
> preempt_count());
> -
> + /* synchronize per-cpu cached mm related information before account */
> + exit_mm_counters();
> acct_update_integrals(tsk);
>
> group_dead = atomic_dec_and_test(&tsk->signal->live);
>
>
--
Kind regards,
Minchan Kim
On Fri, 11 Dec 2009 09:40:07 +0900
Minchan Kim <[email protected]> wrote:
> > static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> > {
> > - return (unsigned long)atomic_long_read(&(mm)->counters[member]);
> > + long ret;
> > + /*
> > + * Because this counter is loosely synchronized with percpu cached
> > + * information, it's possible that value gets to be minus. For user's
> > + * convenience/sanity, avoid returning minus.
> > + */
> > + ret = atomic_long_read(&(mm)->counters[member]);
> > + if (unlikely(ret < 0))
> > + return 0;
> > + return (unsigned long)ret;
> > }
>
> Now, your sync point is only task switching time.
> So we can't show exact number if many counting of mm happens
> in short time.(ie, before context switching).
> It isn't matter?
>
I think it doesn't matter, for 2 reasons.
1. Considering servers which require continuous memory usage monitoring
   via ps/top: when there are 2000 processes, "ps -elf" takes 0.8 sec.
   Because system admins know that gathering process information consumes
   some amount of cpu resource, they will not do it very frequently. (I hope.)
2. When chains of page faults occur continuously over a period, the monitor
   of memory usage just sees a snapshot of the current numbers, and "a snapshot
   of which moment" is always random. No one can get a precise number in that
   kind of situation.
> >
> > static inline void add_mm_counter(struct mm_struct *mm, int member, long value)
<snip>
> > Index: mmotm-2.6.32-Dec8/kernel/sched.c
> > ===================================================================
> > --- mmotm-2.6.32-Dec8.orig/kernel/sched.c
> > +++ mmotm-2.6.32-Dec8/kernel/sched.c
> > @@ -2858,6 +2858,7 @@ context_switch(struct rq *rq, struct tas
> > trace_sched_switch(rq, prev, next);
> > mm = next->mm;
> > oldmm = prev->active_mm;
> > +
> > /*
> > * For paravirt, this is coupled with an exit in switch_to to
> > * combine the page table reload and the switch backend into
> > @@ -5477,6 +5478,11 @@ need_resched_nonpreemptible:
> >
> > if (sched_feat(HRTICK))
> > hrtick_clear(rq);
> > + /*
> > + * sync/invaldidate per-cpu cached mm related information
> > + * before taling rq->lock. (see include/linux/mm.h)
>
> taling => taking
>
> > + */
> > + sync_mm_counters_atomic();
>
> It's my above concern.
> before the process schedule out, we could get the wrong info.
> It's not realistic problem?
>
I think not, for now.
Thanks,
-Kame
On Thu, Dec 10, 2009 at 4:59 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
>
> One of the frequent questions from users about memory management is
> how many swap entries are used by each process. And this information will
> give some hints to the oom-killer.
>
> Although we can count the number of swap entries per process by scanning
> /proc/<pid>/smaps, this is very slow and not good for the usual process
> information handlers which work like 'ps' or 'top'.
> (ps and top are already slow enough..)
>
> This patch adds a counter of swapents to mm_counter and update is at
> each swap events. Information is exported via /proc/<pid>/status file as
>
> [kamezawa@bluextal ~]$ cat /proc/self/status
> Name: cat
> State: R (running)
> Tgid: 2904
> Pid: 2904
> PPid: 2862
> TracerPid: 0
> Uid: 500 500 500 500
> Gid: 500 500 500 500
> FDSize: 256
> Groups: 500
> VmPeak: 82696 kB
> VmSize: 82696 kB
> VmLck: 0 kB
> VmHWM: 504 kB
> VmRSS: 504 kB
> VmData: 172 kB
> VmStk: 84 kB
> VmExe: 48 kB
> VmLib: 1568 kB
> VmPTE: 40 kB
> VmSwap: 0 kB <============== this.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim
On Thu, Dec 10, 2009 at 5:00 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> The final purpose of this patch is to improve oom/memory-shortage detection.
> In general there are OOM cases where lowmem is exhausted. What
> this lowmem means is determined by the situation, but in general,
> a limited amount of memory for some special use is lowmem.
>
> This patch adds an integer lowmem_zone, which is initialized to -1.
> If zone_idx(zone) <= lowmem_zone, the zone is lowmem.
>
> This patch uses simple definition that the zone for special use is the lowmem.
> Not taking the amount of memory into account.
>
> For example,
> - if HIGHMEM is used, NORMAL is lowmem.
> - If the system has both NORMAL and DMA32, DMA32 is lowmem.
> - When the system consists of only one zone, there is no lowmem.
>
> This will be used for lowmem accounting per mm_struct and its information
> will be used for oom-killer.
>
> Changelog: 2009/12/09
> - stop using policy_zone and use unified definition on each config.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
I like this better than the policy_zone version.
--
Kind regards,
Minchan Kim
On Thu, Dec 10, 2009 at 5:01 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Some cases of OOM kill are caused by memory shortage in a lowmem area. For example,
> NORMAL_ZONE is exhausted on an x86-32/HIGHMEM kernel.
>
> Now, the oom-killer has no lowmem usage information for processes and
> selects victim processes based on global memory usage information.
> In bad cases, this can cause chains of kills of innocent processes without
> progress, an oom-serial-killer.
>
> To make the oom-killer lowmem aware, this patch adds counters for accounting
> lowmem usage per process. (Patches for the oom-killer are not included here.)
>
> Adding a counter is easy, but one concern is the cost of the new counter.
>
> The following is the result of a micro-benchmark of parallel page faults.
> A bigger page fault number indicates better scalability.
> (measured in a USE_SPLIT_PTLOCKS environment)
> [Before lowmem counter]
> Performance counter stats for './multi-fault 2' (5 runs):
>
> 46997471 page-faults ( +- 0.720% )
> 1004100076 cache-references ( +- 0.734% )
> 180959964 cache-misses ( +- 0.374% )
> 29263437363580464 bus-cycles ( +- 0.002% )
>
> 60.003315683 seconds time elapsed ( +- 0.004% )
>
> 3.85 miss/faults
> [After lowmem counter]
> Performance counter stats for './multi-fault 2' (5 runs):
>
> 45976947 page-faults ( +- 0.405% )
> 992296954 cache-references ( +- 0.860% )
> 183961537 cache-misses ( +- 0.473% )
> 29261902069414016 bus-cycles ( +- 0.002% )
>
> 60.001403261 seconds time elapsed ( +- 0.000% )
>
> 4.0 miss/faults.
>
> Then, a small cost is added. But I think this is within a reasonable
> range.
>
> If you have a good idea to improve this number, it's welcome.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Reviewed-by: Minchan Kim <[email protected]>
--
Kind regards,
Minchan Kim
On Fri, Dec 11, 2009 at 9:51 AM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> On Fri, 11 Dec 2009 09:40:07 +0900
> Minchan Kim <[email protected]> wrote:
>> > static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
>> > {
>> > - return (unsigned long)atomic_long_read(&(mm)->counters[member]);
>> > + long ret;
>> > + /*
>> > + * Because this counter is loosely synchronized with percpu cached
>> > + * information, it's possible that value gets to be minus. For user's
>> > + * convenience/sanity, avoid returning minus.
>> > + */
>> > + ret = atomic_long_read(&(mm)->counters[member]);
>> > + if (unlikely(ret < 0))
>> > + return 0;
>> > + return (unsigned long)ret;
>> > }
>>
>> Now, your sync point is only task switching time.
>> So we can't show exact number if many counting of mm happens
>> in short time.(ie, before context switching).
>> It isn't matter?
>>
> I think it's not a matter from 2 reasons.
>
> 1. Now, considering servers which requires continuous memory usage monitoring
> as ps/top, when there are 2000 processes, "ps -elf" takes 0.8sec.
> Because system admins know that gathering process information consumes
> some amount of cpu resource, they will not do that so frequently.(I hope)
>
> 2. When chains of page faults occur continously in a period, the monitor
> of memory usage just see a snapshot of current numbers and "snapshot of what
> moment" is at random, always. No one can get precise number in that kind of situation.
>
Yes, I understand that.
But we did rss updating in batches until now, too, and it was also stale;
your patch only makes the stale period longer.
Hmm, I hope people don't expect the mm counts to be precise.
I have seen many people believe that a snapshot of the mm counters is the
real value in embedded systems.
They want to know the exact memory usage of the system.
Maybe embedded systems don't use SPLIT_PTLOCKS, so there would be no regression there.
At least, I would like to add a comment saying "It's not a precise value." to
statm's documentation.
Of course, it's off topic. :)
Thanks for commenting. Kame.
--
Kind regards,
Minchan Kim
On Fri, 11 Dec 2009 10:25:03 +0900
Minchan Kim <[email protected]> wrote:
> On Fri, Dec 11, 2009 at 9:51 AM, KAMEZAWA Hiroyuki
> <[email protected]> wrote:
> > On Fri, 11 Dec 2009 09:40:07 +0900
> > Minchan Kim <[email protected]> wrote:
> >> > static inline unsigned long get_mm_counter(struct mm_struct *mm, int member)
> >> > {
> >> > - return (unsigned long)atomic_long_read(&(mm)->counters[member]);
> >> > + long ret;
> >> > + /*
> >> > + * Because this counter is loosely synchronized with percpu cached
> >> > + * information, it's possible that value gets to be minus. For user's
> >> > + * convenience/sanity, avoid returning minus.
> >> > + */
> >> > + ret = atomic_long_read(&(mm)->counters[member]);
> >> > + if (unlikely(ret < 0))
> >> > + return 0;
> >> > + return (unsigned long)ret;
> >> > }
> >>
> >> Now, your sync point is only task switching time.
> >> So we can't show exact number if many counting of mm happens
> >> in short time.(ie, before context switching).
> >> It isn't matter?
> >>
> > I think it's not a matter from 2 reasons.
> >
> > 1. Now, considering servers which requires continuous memory usage monitoring
> > as ps/top, when there are 2000 processes, "ps -elf" takes 0.8sec.
> > Because system admins know that gathering process information consumes
> > some amount of cpu resource, they will not do that so frequently.(I hope)
> >
> > 2. When chains of page faults occur continously in a period, the monitor
> > of memory usage just see a snapshot of current numbers and "snapshot of what
> > moment" is at random, always. No one can get precise number in that kind of situation.
> >
>
> Yes. I understand that.
>
> But we did rss updating as batch until now.
> It was also stale. Just only your patch make stale period longer.
> Hmm. I hope people don't expect mm count is precise.
>
I hope so, too...
> I saw the many people believed sanpshot of mm counting is real in
> embedded system.
> They want to know the exact memory usage in system.
> Maybe embedded system doesn't use SPLIT_LOCK so that there is no regression.
>
> At least, I would like to add comment "It's not precise value." on
> statm's Documentation.
Ok, I will do that.
> Of course, It's off topic. :)
>
> Thanks for commenting. Kame.
Thank you for review.
Regards,
-Kame
On Fri, 11 Dec 2009, KAMEZAWA Hiroyuki wrote:
> Hmm, How about adding following kind of patch after this
>
> #define policy_zone (lowmem_zone + 1)
>
> and remove policy_zone ? I think the name of "policy_zone" implies
> "this is for mempolicy, NUMA" and don't think good name for generic use.
Good idea, but let's hear Lee's opinion about this one too.