Hi, as discussed in "Memory overcommit" threads, I started rewrite.
This is just to show "I started" (not just chatting or sleeping ;)
The implementations are not fixed yet, so feel free to make any comments.
This set is meant to be a minimum change set, I think. Richer functionality
can be implemented on top of this.
All patches are against "mm-of-the-moment snapshot 2009-11-01-10-01"
Patches are organized as
(1) pass oom-killer more information, classification and fix mempolicy case.
(2) counting swap usage
(3) counting lowmem usage
(4) fork bomb detector/killer
(5) check expansion of total_vm
(6) rewrite __badness().
passed small tests on x86-64 boxes.
Thanks,
-Kame
From: KAMEZAWA Hiroyuki <[email protected]>
Rewrite oom constraint handling to be up to date.
(1) Currently, oom_constraint and other easily available information are
ignored at badness calculation. Pass them in.
(2) Add more classes of oom constraint: _MEMCG and _LOWMEM.
This is just an interface change and doesn't add new logic at this stage.
(3) Pass nodemask to oom_kill. alloc_pages() has been totally rewritten and
takes a nodemask as its argument, so mempolicy no longer has its own
private zonelist. Passing the nodemask to out_of_memory() is therefore
necessary. But pagefault_out_of_memory() doesn't have enough information;
we should revisit this later.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
drivers/char/sysrq.c | 2 -
fs/proc/base.c | 4 +-
include/linux/oom.h | 8 +++-
mm/oom_kill.c | 101 +++++++++++++++++++++++++++++++++++++++------------
mm/page_alloc.c | 2 -
5 files changed, 88 insertions(+), 29 deletions(-)
Index: mmotm-2.6.32-Nov2/include/linux/oom.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/oom.h
+++ mmotm-2.6.32-Nov2/include/linux/oom.h
@@ -10,23 +10,27 @@
#ifdef __KERNEL__
#include <linux/types.h>
+#include <linux/nodemask.h>
struct zonelist;
struct notifier_block;
/*
- * Types of limitations to the nodes from which allocations may occur
+ * Types of limitations to zones from which allocations may occur
*/
enum oom_constraint {
CONSTRAINT_NONE,
+ CONSTRAINT_LOWMEM,
CONSTRAINT_CPUSET,
CONSTRAINT_MEMORY_POLICY,
+ CONSTRAINT_MEMCG
};
extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
-extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
+extern void out_of_memory(struct zonelist *zonelist,
+ gfp_t gfp_mask, int order, nodemask_t *mask);
extern int register_oom_notifier(struct notifier_block *nb);
extern int unregister_oom_notifier(struct notifier_block *nb);
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -27,6 +27,7 @@
#include <linux/notifier.h>
#include <linux/memcontrol.h>
#include <linux/security.h>
+#include <linux/mempolicy.h>
int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
@@ -55,6 +56,8 @@ static int has_intersects_mems_allowed(s
* badness - calculate a numeric value for how bad this task has been
* @p: task struct of which task we should calculate
* @uptime: current uptime in seconds
+ * @constraint: type of oom_kill region
+ * @mem: set if called by memory cgroup
*
* The formula used is relatively simple and documented inline in the
* function. The main rationale is that we want to select a good task
@@ -70,7 +73,9 @@ static int has_intersects_mems_allowed(s
* of least surprise ... (be careful when you change it)
*/
-unsigned long badness(struct task_struct *p, unsigned long uptime)
+static unsigned long __badness(struct task_struct *p,
+ unsigned long uptime, enum oom_constraint constraint,
+ struct mem_cgroup *mem)
{
unsigned long points, cpu_time, run_time;
struct mm_struct *mm;
@@ -193,30 +198,68 @@ unsigned long badness(struct task_struct
return points;
}
+/* for /proc */
+unsigned long global_badness(struct task_struct *p, unsigned long uptime)
+{
+ return __badness(p, uptime, CONSTRAINT_NONE, NULL);
+}
+
+
/*
* Determine the type of allocation constraint.
*/
-static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
- gfp_t gfp_mask)
-{
+
#ifdef CONFIG_NUMA
+static inline enum oom_constraint guess_oom_context(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *nodemask)
+{
struct zone *zone;
struct zoneref *z;
enum zone_type high_zoneidx = gfp_zone(gfp_mask);
- nodemask_t nodes = node_states[N_HIGH_MEMORY];
+ enum oom_constraint ret = CONSTRAINT_NONE;
- for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
- if (cpuset_zone_allowed_softwall(zone, gfp_mask))
- node_clear(zone_to_nid(zone), nodes);
- else
+ /*
+ * In numa environ, almost all allocation will be against NORMAL zone.
+ * But some small area, ex)GFP_DMA for ia64 or GFP_DMA32 for x86-64
+ * can cause OOM. We can use policy_zone for checking lowmem.
+ */
+ if (high_zoneidx < policy_zone)
+ return CONSTRAINT_LOWMEM;
+ /*
+ * Now, only mempolicy specifies nodemask. But if nodemask
+ * covers all nodes, this oom is global oom.
+ */
+ if (nodemask && !nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
+ ret = CONSTRAINT_MEMORY_POLICY;
+ /*
+ * If not __GFP_THISNODE, zonelist containes all nodes. And if
+ * zonelist contains a zone which isn't allowed under cpuset, we assume
+ * this allocation failure is caused by cpuset's constraint.
+ * Note: all nodes are scanned if nodemask=NULL.
+ */
+ for_each_zone_zonelist_nodemask(zone,
+ z, zonelist, high_zoneidx, nodemask) {
+ if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
return CONSTRAINT_CPUSET;
+ }
+ return ret;
+}
- if (!nodes_empty(nodes))
- return CONSTRAINT_MEMORY_POLICY;
-#endif
-
+#elif defined(CONFIG_HIGHMEM)
+static inline enum oom_constraint
+guess_oom_context(struct zonelist *zonelist, gfp_t gfp_mask, nodemask_t *mask)
+{
+ if (gfp_mask & __GFP_HIGHMEM)
+		return CONSTRAINT_NONE;
+ return CONSTRAINT_LOWMEM;
+}
+#else
+static inline enum oom_constraint guess_oom_context(struct zonelist *zonelist,
+ gfp_t gfp_mask, nodemask_t *mask)
+{
return CONSTRAINT_NONE;
}
+#endif
/*
* Simple selection loop. We chose the process with the highest
@@ -225,7 +268,8 @@ static inline enum oom_constraint constr
* (not docbooked, we don't want this one cluttering up the manual)
*/
static struct task_struct *select_bad_process(unsigned long *ppoints,
- struct mem_cgroup *mem)
+ enum oom_constraint constraint,
+ struct mem_cgroup *mem)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
@@ -281,7 +325,7 @@ static struct task_struct *select_bad_pr
if (p->signal->oom_adj == OOM_DISABLE)
continue;
- points = badness(p, uptime.tv_sec);
+ points = __badness(p, uptime.tv_sec, constraint, mem);
if (points > *ppoints || !chosen) {
chosen = p;
*ppoints = points;
@@ -443,7 +487,7 @@ void mem_cgroup_out_of_memory(struct mem
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, mem);
+ p = select_bad_process(&points, CONSTRAINT_MEMCG, mem);
if (PTR_ERR(p) == -1UL)
goto out;
@@ -525,7 +569,8 @@ void clear_zonelist_oom(struct zonelist
/*
* Must be called with tasklist_lock held for read.
*/
-static void __out_of_memory(gfp_t gfp_mask, int order)
+static void __out_of_memory(gfp_t gfp_mask, enum oom_constraint constraint,
+ int order, nodemask_t *mask)
{
struct task_struct *p;
unsigned long points;
@@ -539,7 +584,7 @@ retry:
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, NULL);
+ p = select_bad_process(&points, constraint, NULL);
if (PTR_ERR(p) == -1UL)
return;
@@ -580,7 +625,12 @@ void pagefault_out_of_memory(void)
panic("out of memory from page fault. panic_on_oom is selected.\n");
read_lock(&tasklist_lock);
- __out_of_memory(0, 0); /* unknown gfp_mask and order */
+ /*
+ * Considering nature of pages required for page-fault,this must be
+ * global OOM (if not cpuset...). Then, CONSTRAINT_NONE is correct.
+ * zonelist, nodemasks are unknown...
+ */
+ __out_of_memory(0, CONSTRAINT_NONE, 0, NULL);
read_unlock(&tasklist_lock);
/*
@@ -597,13 +647,15 @@ rest_and_return:
* @zonelist: zonelist pointer
* @gfp_mask: memory allocation flags
* @order: amount of memory being requested as a power of 2
+ * @nodemask: nodemask the page allocator was called with.
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
* OR try to be smart about which process to kill. Note that we
* don't have to be perfect here, we just have to be good.
*/
-void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order)
+void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
+ int order, nodemask_t *nodemask)
{
unsigned long freed = 0;
enum oom_constraint constraint;
@@ -622,7 +674,7 @@ void out_of_memory(struct zonelist *zone
* Check if there were limitations on the allocation (only relevant for
* NUMA) that may require different handling.
*/
- constraint = constrained_alloc(zonelist, gfp_mask);
+ constraint = guess_oom_context(zonelist, gfp_mask, nodemask);
read_lock(&tasklist_lock);
switch (constraint) {
@@ -630,7 +682,7 @@ void out_of_memory(struct zonelist *zone
oom_kill_process(current, gfp_mask, order, 0, NULL,
"No available memory (MPOL_BIND)");
break;
-
+ case CONSTRAINT_LOWMEM:
case CONSTRAINT_NONE:
if (sysctl_panic_on_oom) {
dump_header(gfp_mask, order, NULL);
@@ -638,7 +690,10 @@ void out_of_memory(struct zonelist *zone
}
/* Fall-through */
case CONSTRAINT_CPUSET:
- __out_of_memory(gfp_mask, order);
+ __out_of_memory(gfp_mask, constraint, order, nodemask);
+ break;
+ case CONSTRAINT_MEMCG: /* never happens. but for warning.*/
+ BUG();
break;
}
Index: mmotm-2.6.32-Nov2/fs/proc/base.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/fs/proc/base.c
+++ mmotm-2.6.32-Nov2/fs/proc/base.c
@@ -442,7 +442,7 @@ static const struct file_operations proc
#endif
/* The badness from the OOM killer */
-unsigned long badness(struct task_struct *p, unsigned long uptime);
+unsigned long global_badness(struct task_struct *p, unsigned long uptime);
static int proc_oom_score(struct task_struct *task, char *buffer)
{
unsigned long points;
@@ -450,7 +450,7 @@ static int proc_oom_score(struct task_st
do_posix_clock_monotonic_gettime(&uptime);
read_lock(&tasklist_lock);
- points = badness(task->group_leader, uptime.tv_sec);
+ points = global_badness(task->group_leader, uptime.tv_sec);
read_unlock(&tasklist_lock);
return sprintf(buffer, "%lu\n", points);
}
Index: mmotm-2.6.32-Nov2/mm/page_alloc.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/page_alloc.c
+++ mmotm-2.6.32-Nov2/mm/page_alloc.c
@@ -1669,7 +1669,7 @@ __alloc_pages_may_oom(gfp_t gfp_mask, un
goto out;
/* Exhausted what can be done so it's blamo time */
- out_of_memory(zonelist, gfp_mask, order);
+ out_of_memory(zonelist, gfp_mask, order, nodemask);
out:
clear_zonelist_oom(zonelist, gfp_mask);
Index: mmotm-2.6.32-Nov2/drivers/char/sysrq.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/drivers/char/sysrq.c
+++ mmotm-2.6.32-Nov2/drivers/char/sysrq.c
@@ -339,7 +339,7 @@ static struct sysrq_key_op sysrq_term_op
static void moom_callback(struct work_struct *ignored)
{
- out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0);
+ out_of_memory(node_zonelist(0, GFP_KERNEL), GFP_KERNEL, 0, NULL);
}
static DECLARE_WORK(moom_work, moom_callback);
From: KAMEZAWA Hiroyuki <[email protected]>
Now, anon_rss and file_rss are counted as RSS and exported via /proc.
RSS usage is important information, but another piece of information
users often ask for is "usage of swap" (according to our user support team).
This patch counts swap entry usage per process and shows it via
/proc/<pid>/status. I think the status file is robust against a new entry,
so it is the first candidate.
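For illustration, with this patch the tail of /proc/<pid>/status grows one
line (the numbers below are made up, not from a real run):
  VmPTE:        36 kB
  VmSwap:     2048 kB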
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
fs/proc/task_mmu.c | 10 +++++++---
include/linux/mm_types.h | 1 +
mm/memory.c | 30 +++++++++++++++++++++---------
mm/rmap.c | 1 +
mm/swapfile.c | 1 +
5 files changed, 31 insertions(+), 12 deletions(-)
Index: mmotm-2.6.32-Nov2/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Nov2/include/linux/mm_types.h
@@ -228,6 +228,7 @@ struct mm_struct {
*/
mm_counter_t _file_rss;
mm_counter_t _anon_rss;
+ mm_counter_t _swap_usage;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
Index: mmotm-2.6.32-Nov2/mm/memory.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/memory.c
+++ mmotm-2.6.32-Nov2/mm/memory.c
@@ -376,12 +376,15 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}
-static inline void add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss)
+static inline void
+add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss, int swap_usage)
{
if (file_rss)
add_mm_counter(mm, file_rss, file_rss);
if (anon_rss)
add_mm_counter(mm, anon_rss, anon_rss);
+ if (swap_usage)
+ add_mm_counter(mm, swap_usage, swap_usage);
}
/*
@@ -597,7 +600,9 @@ copy_one_pte(struct mm_struct *dst_mm, s
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
- if (is_write_migration_entry(entry) &&
+ if (!is_migration_entry(entry))
+ rss[2]++;
+ else if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both parent
@@ -648,11 +653,11 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[2];
+ int rss[3];
swp_entry_t entry = (swp_entry_t){0};
again:
- rss[1] = rss[0] = 0;
+ rss[2] = rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -688,7 +693,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(orig_src_pte);
- add_mm_rss(dst_mm, rss[0], rss[1]);
+ add_mm_rss(dst_mm, rss[0], rss[1], rss[2]);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -818,6 +823,7 @@ static unsigned long zap_pte_range(struc
spinlock_t *ptl;
int file_rss = 0;
int anon_rss = 0;
+ int swap_usage = 0;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -887,13 +893,18 @@ static unsigned long zap_pte_range(struc
if (pte_file(ptent)) {
if (unlikely(!(vma->vm_flags & VM_NONLINEAR)))
print_bad_pte(vma, addr, ptent, NULL);
- } else if
- (unlikely(!free_swap_and_cache(pte_to_swp_entry(ptent))))
- print_bad_pte(vma, addr, ptent, NULL);
+ } else {
+ swp_entry_t ent = pte_to_swp_entry(ptent);
+
+ if (!is_migration_entry(ent))
+ swap_usage--;
+ if (unlikely(!free_swap_and_cache(ent)))
+ print_bad_pte(vma, addr, ptent, NULL);
+ }
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
- add_mm_rss(mm, file_rss, anon_rss);
+ add_mm_rss(mm, file_rss, anon_rss, swap_usage);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
@@ -2595,6 +2606,7 @@ static int do_swap_page(struct mm_struct
*/
inc_mm_counter(mm, anon_rss);
+ dec_mm_counter(mm, swap_usage);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
Index: mmotm-2.6.32-Nov2/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/swapfile.c
+++ mmotm-2.6.32-Nov2/mm/swapfile.c
@@ -837,6 +837,7 @@ static int unuse_pte(struct vm_area_stru
}
inc_mm_counter(vma->vm_mm, anon_rss);
+ dec_mm_counter(vma->vm_mm, swap_usage);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
Index: mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/fs/proc/task_mmu.c
+++ mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
@@ -17,7 +17,7 @@
void task_mem(struct seq_file *m, struct mm_struct *mm)
{
unsigned long data, text, lib;
- unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
+ unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, swap;
/*
* Note: to minimize their overhead, mm maintains hiwater_vm and
@@ -36,6 +36,8 @@ void task_mem(struct seq_file *m, struct
data = mm->total_vm - mm->shared_vm - mm->stack_vm;
text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
+
+ swap = get_mm_counter(mm, swap_usage);
seq_printf(m,
"VmPeak:\t%8lu kB\n"
"VmSize:\t%8lu kB\n"
@@ -46,7 +48,8 @@ void task_mem(struct seq_file *m, struct
"VmStk:\t%8lu kB\n"
"VmExe:\t%8lu kB\n"
"VmLib:\t%8lu kB\n"
- "VmPTE:\t%8lu kB\n",
+ "VmPTE:\t%8lu kB\n"
+ "VmSwap:\t%8lu kB\n",
hiwater_vm << (PAGE_SHIFT-10),
(total_vm - mm->reserved_vm) << (PAGE_SHIFT-10),
mm->locked_vm << (PAGE_SHIFT-10),
@@ -54,7 +57,8 @@ void task_mem(struct seq_file *m, struct
total_rss << (PAGE_SHIFT-10),
data << (PAGE_SHIFT-10),
mm->stack_vm << (PAGE_SHIFT-10), text, lib,
- (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10);
+ (PTRS_PER_PTE*sizeof(pte_t)*mm->nr_ptes) >> 10,
+ swap << (PAGE_SHIFT - 10));
}
unsigned long task_vsize(struct mm_struct *mm)
Index: mmotm-2.6.32-Nov2/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/rmap.c
+++ mmotm-2.6.32-Nov2/mm/rmap.c
@@ -834,6 +834,7 @@ static int try_to_unmap_one(struct page
spin_unlock(&mmlist_lock);
}
dec_mm_counter(mm, anon_rss);
+ inc_mm_counter(mm, swap_usage);
} else if (PAGE_MIGRATION) {
/*
* Store the pfn of the page in a special migration
From: KAMEZAWA Hiroyuki <[email protected]>
Count lowmem RSS per mm_struct. Lowmem here means...
for NUMA, pages in a zone < policy_zone.
for HIGHMEM x86, pages in the NORMAL zone.
for others, all pages are lowmem.
Now, lower_zone_protection[] works very well for protecting lowmem, but
the possibility of a lowmem OOM is not 0 even under good protection in the
kernel. (In fact, it can be configured by sysctl. When we keep it high, there
will be tons of not-for-use memory, but the system will be protected against
the rare event of a lowmem OOM.)
Considering an x86 system with 2G of memory, NORMAL is 856MB and HIGHMEM is
1.1GB ... we can't keep lower_zone_protection too high.
This patch counts the number of lowmem pages mapped by each user process.
A later patch will use this value for the OOM calculation.
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mempolicy.h | 21 +++++++++++++++++++++
include/linux/mm_types.h | 1 +
mm/memory.c | 32 ++++++++++++++++++++++++++------
mm/rmap.c | 2 ++
mm/swapfile.c | 2 ++
5 files changed, 52 insertions(+), 6 deletions(-)
Index: mmotm-2.6.32-Nov2/include/linux/mempolicy.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/mempolicy.h
+++ mmotm-2.6.32-Nov2/include/linux/mempolicy.h
@@ -240,6 +240,13 @@ static inline int vma_migratable(struct
return 1;
}
+static inline int is_lowmem_page(struct page *page)
+{
+ if (unlikely(page_zonenum(page) < policy_zone))
+ return 1;
+ return 0;
+}
+
#else
struct mempolicy {};
@@ -356,6 +363,20 @@ static inline int mpol_to_str(char *buff
}
#endif
+#ifdef CONFIG_HIGHMEM
+static inline int is_lowmem_page(struct page *page)
+{
+ if (page_zonenum(page) == ZONE_HIGHMEM)
+ return 0;
+ return 1;
+}
+#else
+static inline int is_lowmem_page(struct page *page)
+{
+ return 1;
+}
+#endif
+
#endif /* CONFIG_NUMA */
#endif /* __KERNEL__ */
Index: mmotm-2.6.32-Nov2/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Nov2/include/linux/mm_types.h
@@ -229,6 +229,7 @@ struct mm_struct {
mm_counter_t _file_rss;
mm_counter_t _anon_rss;
mm_counter_t _swap_usage;
+ mm_counter_t _low_rss;
unsigned long hiwater_rss; /* High-watermark of RSS usage */
unsigned long hiwater_vm; /* High-water virtual memory usage */
Index: mmotm-2.6.32-Nov2/mm/memory.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/memory.c
+++ mmotm-2.6.32-Nov2/mm/memory.c
@@ -376,8 +376,9 @@ int __pte_alloc_kernel(pmd_t *pmd, unsig
return 0;
}
-static inline void
-add_mm_rss(struct mm_struct *mm, int file_rss, int anon_rss, int swap_usage)
+
+static inline void add_mm_rss(struct mm_struct *mm,
+ int file_rss, int anon_rss, int swap_usage, int low_rss)
{
if (file_rss)
add_mm_counter(mm, file_rss, file_rss);
@@ -385,6 +386,8 @@ add_mm_rss(struct mm_struct *mm, int fil
add_mm_counter(mm, anon_rss, anon_rss);
if (swap_usage)
add_mm_counter(mm, swap_usage, swap_usage);
+ if (low_rss)
+ add_mm_counter(mm, low_rss, low_rss);
}
/*
@@ -638,6 +641,8 @@ copy_one_pte(struct mm_struct *dst_mm, s
get_page(page);
page_dup_rmap(page);
rss[PageAnon(page)]++;
+ if (is_lowmem_page(page))
+ rss[3]++;
}
out_set_pte:
@@ -653,11 +658,11 @@ static int copy_pte_range(struct mm_stru
pte_t *src_pte, *dst_pte;
spinlock_t *src_ptl, *dst_ptl;
int progress = 0;
- int rss[3];
+ int rss[4];
swp_entry_t entry = (swp_entry_t){0};
again:
- rss[2] = rss[1] = rss[0] = 0;
+ rss[3] = rss[2] = rss[1] = rss[0] = 0;
dst_pte = pte_alloc_map_lock(dst_mm, dst_pmd, addr, &dst_ptl);
if (!dst_pte)
return -ENOMEM;
@@ -693,7 +698,7 @@ again:
arch_leave_lazy_mmu_mode();
spin_unlock(src_ptl);
pte_unmap_nested(orig_src_pte);
- add_mm_rss(dst_mm, rss[0], rss[1], rss[2]);
+ add_mm_rss(dst_mm, rss[0], rss[1], rss[2], rss[3]);
pte_unmap_unlock(orig_dst_pte, dst_ptl);
cond_resched();
@@ -824,6 +829,7 @@ static unsigned long zap_pte_range(struc
int file_rss = 0;
int anon_rss = 0;
int swap_usage = 0;
+ int low_rss = 0;
pte = pte_offset_map_lock(mm, pmd, addr, &ptl);
arch_enter_lazy_mmu_mode();
@@ -878,6 +884,8 @@ static unsigned long zap_pte_range(struc
mark_page_accessed(page);
file_rss--;
}
+ if (is_lowmem_page(page))
+ low_rss--;
page_remove_rmap(page);
if (unlikely(page_mapcount(page) < 0))
print_bad_pte(vma, addr, ptent, page);
@@ -904,7 +912,7 @@ static unsigned long zap_pte_range(struc
pte_clear_not_present_full(mm, addr, pte, tlb->fullmm);
} while (pte++, addr += PAGE_SIZE, (addr != end && *zap_work > 0));
- add_mm_rss(mm, file_rss, anon_rss, swap_usage);
+ add_mm_rss(mm, file_rss, anon_rss, swap_usage, low_rss);
arch_leave_lazy_mmu_mode();
pte_unmap_unlock(pte - 1, ptl);
@@ -1539,6 +1547,8 @@ static int insert_page(struct vm_area_st
/* Ok, finally just insert the thing.. */
get_page(page);
inc_mm_counter(mm, file_rss);
+ if (is_lowmem_page(page))
+ inc_mm_counter(mm, low_rss);
page_add_file_rmap(page);
set_pte_at(mm, addr, pte, mk_pte(page, prot));
@@ -2179,6 +2189,10 @@ gotten:
}
} else
inc_mm_counter(mm, anon_rss);
+ if (old_page && is_lowmem_page(old_page))
+ dec_mm_counter(mm, low_rss);
+ if (is_lowmem_page(new_page))
+ inc_mm_counter(mm, low_rss);
flush_cache_page(vma, address, pte_pfn(orig_pte));
entry = mk_pte(new_page, vma->vm_page_prot);
entry = maybe_mkwrite(pte_mkdirty(entry), vma);
@@ -2607,6 +2621,8 @@ static int do_swap_page(struct mm_struct
inc_mm_counter(mm, anon_rss);
dec_mm_counter(mm, swap_usage);
+ if (is_lowmem_page(page))
+ inc_mm_counter(mm, low_rss);
pte = mk_pte(page, vma->vm_page_prot);
if ((flags & FAULT_FLAG_WRITE) && reuse_swap_page(page)) {
pte = maybe_mkwrite(pte_mkdirty(pte), vma);
@@ -2691,6 +2707,8 @@ static int do_anonymous_page(struct mm_s
goto release;
inc_mm_counter(mm, anon_rss);
+ if (is_lowmem_page(page))
+ inc_mm_counter(mm, low_rss);
page_add_new_anon_rmap(page, vma, address);
setpte:
set_pte_at(mm, address, page_table, entry);
@@ -2854,6 +2872,8 @@ static int __do_fault(struct mm_struct *
get_page(dirty_page);
}
}
+ if (is_lowmem_page(page))
+ inc_mm_counter(mm, low_rss);
set_pte_at(mm, address, page_table, entry);
/* no need to invalidate: a not-present page won't be cached */
Index: mmotm-2.6.32-Nov2/mm/rmap.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/rmap.c
+++ mmotm-2.6.32-Nov2/mm/rmap.c
@@ -854,6 +854,8 @@ static int try_to_unmap_one(struct page
} else
dec_mm_counter(mm, file_rss);
+ if (is_lowmem_page(page))
+ dec_mm_counter(mm, low_rss);
page_remove_rmap(page);
page_cache_release(page);
Index: mmotm-2.6.32-Nov2/mm/swapfile.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/swapfile.c
+++ mmotm-2.6.32-Nov2/mm/swapfile.c
@@ -838,6 +838,8 @@ static int unuse_pte(struct vm_area_stru
inc_mm_counter(vma->vm_mm, anon_rss);
dec_mm_counter(vma->vm_mm, swap_usage);
+ if (is_lowmem_page(page))
+ inc_mm_counter(vma->vm_mm, low_rss);
get_page(page);
set_pte_at(vma->vm_mm, addr, pte,
pte_mkold(mk_pte(page, vma->vm_page_prot)));
From: KAMEZAWA Hiroyuki <[email protected]>
This patch implements an easy fork-bomb detector.
Now, the fork-bomb detecting logic checks the sum of all children's total_vm,
but it tends to estimate badly and task launchers are easily killed by
mistake. This patch uses a new algorithm.
First, select_bad_process() is changed to scan from the newest process.
For each process whose runtime is below FORK_BOMB_RUNTIME_THRESH (5 min),
the process gets score +1 and adds the sum of all its children's scores to
itself. By this, we can measure the size of a recently created process tree.
If the process tree is large enough (> 12.5% of nr_procs), we assume it is a
fork bomb and kill it. 12.5% seems small, but we're under an OOM situation
and this is not a small number.
BTW, checking for a fork bomb only at OOM means that this check is done
only after most processes are swapped out. Hmm.. is there a good
place to add a hook?
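(For testing, a trivial user-space program along these lines should trip the
detector once memory runs short; this is only my example, not part of the
patch.)
==
/* forkbomb-test.c: hypothetical test case, not part of this series.
 * The parent forks ~100 children per second; each child dirties 16MB
 * of anonymous memory and sleeps, so the box eventually hits OOM while
 * the process tree is still young.
 */
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	for (;;) {
		if (fork() == 0) {
			char *p = mmap(NULL, 16 << 20, PROT_READ | PROT_WRITE,
				       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
			if (p != MAP_FAILED)
				memset(p, 1, 16 << 20);
			pause();	/* keep the memory mapped */
		}
		usleep(10000);
	}
	return 0;
}
==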
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 5 +
include/linux/mm_types.h | 2
include/linux/sched.h | 8 ++
mm/memcontrol.c | 7 ++
mm/oom_kill.c | 149 ++++++++++++++++++++++++++++++++++-----------
5 files changed, 137 insertions(+), 34 deletions(-)
Index: mmotm-2.6.32-Nov2/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Nov2/include/linux/mm_types.h
@@ -289,6 +289,8 @@ struct mm_struct {
#ifdef CONFIG_MMU_NOTIFIER
struct mmu_notifier_mm *mmu_notifier_mm;
#endif
+ /* For OOM, fork-bomb detector */
+ unsigned long bomb_score;
};
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
Index: mmotm-2.6.32-Nov2/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/sched.h
+++ mmotm-2.6.32-Nov2/include/linux/sched.h
@@ -2176,6 +2176,14 @@ static inline unsigned long wait_task_in
#define for_each_process(p) \
for (p = &init_task ; (p = next_task(p)) != &init_task ; )
+/*
+ * This macro scans the task list in reverse order. It is not RCU safe;
+ * tasklist_lock must be held. This is useful when you want to find
+ * younger processes early.
+ */
+#define for_each_process_reverse(p) \
+ list_for_each_entry_reverse(p, &init_task.tasks, tasks)
+
extern bool current_is_single_threaded(void);
/*
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -79,7 +79,6 @@ static unsigned long __badness(struct ta
{
unsigned long points, cpu_time, run_time;
struct mm_struct *mm;
- struct task_struct *child;
int oom_adj = p->signal->oom_adj;
struct task_cputime task_time;
unsigned long utime;
@@ -112,21 +111,6 @@ static unsigned long __badness(struct ta
return ULONG_MAX;
/*
- * Processes which fork a lot of child processes are likely
- * a good choice. We add half the vmsize of the children if they
- * have an own mm. This prevents forking servers to flood the
- * machine with an endless amount of children. In case a single
- * child is eating the vast majority of memory, adding only half
- * to the parents will make the child our kill candidate of choice.
- */
- list_for_each_entry(child, &p->children, sibling) {
- task_lock(child);
- if (child->mm != mm && child->mm)
- points += get_mm_rss(child->mm)/2 + 1;
- task_unlock(child);
- }
-
- /*
* CPU time is in tens of seconds and run time is in thousands
* of seconds. There is no particular reason for this other than
* that it turned out to work very well in practice.
@@ -262,24 +246,92 @@ static inline enum oom_constraint guess_
#endif
/*
+ * Easy fork-bomb detector.
+ */
+/* 5 minutes for non-forkbomb processes */
+#define FORK_BOMB_RUNTIME_THRESH (5 * 60)
+
+static bool check_fork_bomb(struct task_struct *p, int uptime, int nr_procs)
+{
+ struct task_struct *child;
+ int runtime = uptime - p->start_time.tv_sec;
+ int bomb_score;
+ struct mm_struct *mm;
+ bool ret = false;
+
+ if (runtime > FORK_BOMB_RUNTIME_THRESH)
+ return ret;
+ /*
+	 * Because we scan from newer processes, we can calculate the tree's
+	 * score just by summing up the children's scores.
+ */
+ mm = get_task_mm(p);
+ if (!mm)
+ return ret;
+
+ bomb_score = 0;
+ list_for_each_entry(child, &p->children, sibling) {
+ task_lock(child);
+ if (child->mm && child->mm != mm)
+ bomb_score += child->mm->bomb_score;
+ task_unlock(child);
+ }
+ mm->bomb_score = bomb_score + 1;
+ /*
+	 * Now we have estimated the size of the recently created process
+	 * tree. If it's big, we treat it as a fork bomb. This is a heuristic,
+	 * but we set the threshold to 12.5% of all procs we scan.
+	 * This number may be a little small.. but we're under an OOM situation.
+	 *
+	 * Discussion: on a HIGHMEM system, should this number be smaller?..
+ */
+ if (bomb_score > nr_procs/8) {
+ ret = true;
+ printk(KERN_WARNING "Possible fork-bomb detected : %d(%s)",
+ p->pid, p->comm);
+ }
+ mmput(mm);
+ return ret;
+}
+
+/*
* Simple selection loop. We chose the process with the highest
* number of 'points'. We expect the caller will lock the tasklist.
*
* (not docbooked, we don't want this one cluttering up the manual)
*/
+
static struct task_struct *select_bad_process(unsigned long *ppoints,
- enum oom_constraint constraint,
- struct mem_cgroup *mem)
+ enum oom_constraint constraint, struct mem_cgroup *mem, int *fork_bomb)
{
struct task_struct *p;
struct task_struct *chosen = NULL;
struct timespec uptime;
+ int nr_proc;
+
*ppoints = 0;
+ *fork_bomb = 0;
do_posix_clock_monotonic_gettime(&uptime);
- for_each_process(p) {
+ switch (constraint) {
+ case CONSTRAINT_MEMCG:
+ /* This includes # of threads...but...*/
+ nr_proc = memory_cgroup_task_count(mem);
+ break;
+ default:
+ nr_proc = nr_processes();
+ break;
+ }
+ /*
+	 * We're under read_lock(&tasklist_lock). At OOM, what we suspect are
+	 * young processes.... considering fork bombs. So we scan the task list
+	 * in reverse order. (This is safe because we're under the lock.)
+ */
+ for_each_process_reverse(p) {
unsigned long points;
+ if (*ppoints == ULONG_MAX)
+ break;
/*
* skip kernel threads and tasks which have already released
* their mm.
@@ -324,11 +376,17 @@ static struct task_struct *select_bad_pr
if (p->signal->oom_adj == OOM_DISABLE)
continue;
-
- points = __badness(p, uptime.tv_sec, constraint, mem);
- if (points > *ppoints || !chosen) {
+ if (check_fork_bomb(p, uptime.tv_sec, nr_proc)) {
chosen = p;
- *ppoints = points;
+ *ppoints = ULONG_MAX;
+ *fork_bomb = 1;
+ }
+ if (*ppoints < ULONG_MAX) {
+ points = __badness(p, uptime.tv_sec, constraint, mem);
+ if (points > *ppoints || !chosen) {
+ chosen = p;
+ *ppoints = points;
+ }
}
}
@@ -448,9 +506,17 @@ static int oom_kill_task(struct task_str
return 0;
}
+static int is_forkbomb_family(struct task_struct *c, struct task_struct *p)
+{
+ for (c = c->real_parent; c != &init_task; c = c->real_parent)
+ if (c == p)
+ return 1;
+ return 0;
+}
+
static int oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
- unsigned long points, struct mem_cgroup *mem,
- const char *message)
+ unsigned long points, struct mem_cgroup *mem,int fork_bomb,
+ const char *message)
{
struct task_struct *c;
@@ -468,12 +534,25 @@ static int oom_kill_process(struct task_
printk(KERN_ERR "%s: kill process %d (%s) score %li or a child\n",
message, task_pid_nr(p), p->comm, points);
-
- /* Try to kill a child first */
+ if (fork_bomb) {
+ printk(KERN_ERR "possible fork-bomb is detected. kill them\n");
+ /* We need to kill the youngest one, at least */
+ rcu_read_lock();
+ for_each_process_reverse(c) {
+ if (c == p)
+ break;
+ if (is_forkbomb_family(c, p)) {
+ oom_kill_task(c);
+ break;
+ }
+ }
+ rcu_read_unlock();
+ }
+ /* Try to kill a child first. If fork-bomb, kill all. */
list_for_each_entry(c, &p->children, sibling) {
if (c->mm == p->mm)
continue;
- if (!oom_kill_task(c))
+ if (!oom_kill_task(c) && !fork_bomb)
return 0;
}
return oom_kill_task(p);
@@ -483,18 +562,19 @@ static int oom_kill_process(struct task_
void mem_cgroup_out_of_memory(struct mem_cgroup *mem, gfp_t gfp_mask)
{
unsigned long points = 0;
+ int fork_bomb = 0;
struct task_struct *p;
read_lock(&tasklist_lock);
retry:
- p = select_bad_process(&points, CONSTRAINT_MEMCG, mem);
+ p = select_bad_process(&points, CONSTRAINT_MEMCG, mem, &fork_bomb);
if (PTR_ERR(p) == -1UL)
goto out;
if (!p)
p = current;
- if (oom_kill_process(p, gfp_mask, 0, points, mem,
+ if (oom_kill_process(p, gfp_mask, 0, points, mem, fork_bomb,
"Memory cgroup out of memory"))
goto retry;
out:
@@ -574,9 +654,10 @@ static void __out_of_memory(gfp_t gfp_ma
{
struct task_struct *p;
unsigned long points;
+ int fork_bomb;
if (sysctl_oom_kill_allocating_task)
- if (!oom_kill_process(current, gfp_mask, order, 0, NULL,
+ if (!oom_kill_process(current, gfp_mask, order, 0, NULL, 0,
"Out of memory (oom_kill_allocating_task)"))
return;
retry:
@@ -584,7 +665,7 @@ retry:
* Rambo mode: Shoot down a process and hope it solves whatever
* issues we may have.
*/
- p = select_bad_process(&points, constraint, NULL);
+ p = select_bad_process(&points, constraint, NULL, &fork_bomb);
if (PTR_ERR(p) == -1UL)
return;
@@ -596,7 +677,7 @@ retry:
panic("Out of memory and no killable processes...\n");
}
- if (oom_kill_process(p, gfp_mask, order, points, NULL,
+ if (oom_kill_process(p, gfp_mask, order, points, NULL, fork_bomb,
"Out of memory"))
goto retry;
}
@@ -679,7 +760,7 @@ void out_of_memory(struct zonelist *zone
switch (constraint) {
case CONSTRAINT_MEMORY_POLICY:
- oom_kill_process(current, gfp_mask, order, 0, NULL,
+ oom_kill_process(current, gfp_mask, order, 0, NULL, 0,
"No available memory (MPOL_BIND)");
break;
case CONSTRAINT_LOWMEM:
Index: mmotm-2.6.32-Nov2/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/memcontrol.h
+++ mmotm-2.6.32-Nov2/include/linux/memcontrol.h
@@ -126,6 +126,7 @@ void mem_cgroup_update_file_mapped(struc
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
int zid);
+int memory_cgroup_task_count(struct mem_cgroup *mem);
#else /* CONFIG_CGROUP_MEM_RES_CTLR */
struct mem_cgroup;
@@ -299,6 +300,10 @@ unsigned long mem_cgroup_soft_limit_recl
return 0;
}
+static int memory_cgroup_task_count(struct mem_cgroup *mem)
+{
+ return 0;
+}
#endif /* CONFIG_CGROUP_MEM_CONT */
#endif /* _LINUX_MEMCONTROL_H */
Index: mmotm-2.6.32-Nov2/mm/memcontrol.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/memcontrol.c
+++ mmotm-2.6.32-Nov2/mm/memcontrol.c
@@ -1223,6 +1223,13 @@ static void record_last_oom(struct mem_c
mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
}
+int memory_cgroup_task_count(struct mem_cgroup *mem)
+{
+ struct cgroup *cg = mem->css.cgroup;
+
+ return cgroup_task_count(cg);
+}
+
/*
* Currently used to update mapped file statistics, but the routine can be
* generalized to update other statistics as well.
From: KAMEZAWA Hiroyuki <[email protected]>
When considering the oom-kill algorithm, we can't avoid taking runtime
into account. But this can add too big a bonus to a slow memory leaker.
To add a penalty for slow memory leakers, we record the jiffies of
the last mm->hiwater_vm expansion. That catches processes which leak
memory periodically.
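Just to illustrate the intended use (this helper is mine, not part of the
patch): a consumer such as the badness rewrite later in this series can turn
the stamp into "seconds since the task last grew", tolerating jiffies
wraparound, roughly like this:
==
/* sketch only, not in the patch */
static unsigned long vm_quiet_seconds(struct mm_struct *mm)
{
	unsigned long quiet;

	if (likely(jiffies >= mm->last_vm_expansion))
		quiet = jiffies - mm->last_vm_expansion;
	else	/* jiffies wrapped around */
		quiet = ULONG_MAX - mm->last_vm_expansion + jiffies;
	return jiffies_to_msecs(quiet) / 1000;
}
==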
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/mm_types.h | 2 ++
include/linux/sched.h | 4 +++-
2 files changed, 5 insertions(+), 1 deletion(-)
Index: mmotm-2.6.32-Nov2/include/linux/mm_types.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/mm_types.h
+++ mmotm-2.6.32-Nov2/include/linux/mm_types.h
@@ -291,6 +291,8 @@ struct mm_struct {
#endif
/* For OOM, fork-bomb detector */
unsigned long bomb_score;
+	/* set to jiffies when total_vm was last expanded (see sched.h) */
+ unsigned long last_vm_expansion;
};
/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
Index: mmotm-2.6.32-Nov2/include/linux/sched.h
===================================================================
--- mmotm-2.6.32-Nov2.orig/include/linux/sched.h
+++ mmotm-2.6.32-Nov2/include/linux/sched.h
@@ -422,8 +422,10 @@ extern void arch_unmap_area_topdown(stru
(mm)->hiwater_rss = _rss; \
} while (0)
#define update_hiwater_vm(mm) do { \
- if ((mm)->hiwater_vm < (mm)->total_vm) \
+ if ((mm)->hiwater_vm < (mm)->total_vm) { \
(mm)->hiwater_vm = (mm)->total_vm; \
+ (mm)->last_vm_expansion = jiffies; \
+ }\
} while (0)
static inline unsigned long get_mm_hiwater_rss(struct mm_struct *mm)
From: KAMEZAWA Hiroyuki <[email protected]>
Rewrite the __badness() heuristics.
Now we have much more useful information for badness; use it.
This patch also tones down the overly strong cputime and runtime bonuses.
- Use "constraint" to choose the base value.
  CPUSET: RSS tends to be unbalanced between nodes, and we don't have a
  per-node RSS value.... use total_vm instead.
  LOWMEM: we need to kill a process which has low_rss.
  MEMCG, NONE: use RSS+SWAP as the base value.
- Runtime bonus.
  The runtime bonus is 0.1% of the base value per second, up to 50%.
  For NONE/MEMCG, total_vm - shared_vm is used here to take the requested
  amount of memory into account. This may be bigger than the base value.
- cputime bonus: removed.
- Last-expansion bonus.
  If the last mmap() call which expanded hiwater_vm was far in the past,
  the task gets a bonus: 0.1% per second, up to 25%.
- The nice bonus was removed. (We have oom_adj, and ROOT is checked.)
- Import code from KOSAKI's patch which coalesces the capability checks.
A rough sketch of how these adjustments combine is given below.
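(A rough user-space sketch, not part of the patch, of how the run-time and
quiet-time adjustments described above combine with the base value; the
50%/25% caps follow the description, the real code may differ in detail.)
==
/*
 * base : per-constraint base value (RSS+swap, low_rss or total_vm)
 * vm   : value the adjustments are scaled by
 *        (total_vm - shared_vm for NONE/MEMCG)
 * Adjustments are in units of 1/10000 of "vm".
 */
static unsigned long sketch_score(unsigned long base, unsigned long vm,
				  long runtime_sec, long quiet_sec)
{
	long runtime = 5000 - runtime_sec;	/* young task => penalty */
	long quiet = 2500 - quiet_sec;		/* recent growth => penalty */
	long adj;

	if (runtime < -5000)
		runtime = -5000;		/* bonus capped at 50% of vm */
	if (quiet < -2500)
		quiet = -2500;			/* bonus capped at 25% of vm */
	adj = (long)vm * runtime / 10000 + (long)vm * quiet / 10000;
	if (adj < 0 && -adj > (long)base)
		return 0;
	return base + adj;
}
==
E.g. with base = vm = 100000 pages, a 10-second-old task scores about 175000,
while a task that has run quietly for hours bottoms out around 25000 (the 75%
maximum bonus).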
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/oom_kill.c | 124 +++++++++++++++++++++++++++++++++++++++-------------------
1 file changed, 84 insertions(+), 40 deletions(-)
Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
===================================================================
--- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
+++ mmotm-2.6.32-Nov2/mm/oom_kill.c
@@ -77,12 +77,10 @@ static unsigned long __badness(struct ta
unsigned long uptime, enum oom_constraint constraint,
struct mem_cgroup *mem)
{
- unsigned long points, cpu_time, run_time;
+ unsigned long points;
+ long runtime, quiet_time, penalty;
struct mm_struct *mm;
int oom_adj = p->signal->oom_adj;
- struct task_cputime task_time;
- unsigned long utime;
- unsigned long stime;
if (oom_adj == OOM_DISABLE)
return 0;
@@ -93,11 +91,28 @@ static unsigned long __badness(struct ta
task_unlock(p);
return 0;
}
-
- /*
- * The memory size of the process is the basis for the badness.
- */
- points = get_mm_rss(mm);
+ switch (constraint) {
+ case CONSTRAINT_CPUSET:
+ /*
+		 * The size of RSS/SWAP is highly affected by the cpuset's
+		 * configuration rather than by the result of memory reclaim,
+		 * so we use the VM size here instead of RSS.
+		 * (We don't have per-node RSS counting now.)
+ */
+ points = mm->total_vm;
+ break;
+ case CONSTRAINT_LOWMEM:
+ points = get_mm_counter(mm, low_rss);
+ break;
+ case CONSTRAINT_MEMCG:
+ case CONSTRAINT_NONE:
+ points = get_mm_counter(mm, anon_rss);
+ points += get_mm_counter(mm, swap_usage);
+ break;
+ default: /* mempolicy will not come here */
+ BUG();
+ break;
+ }
/*
* After this unlock we can no longer dereference local variable `mm'
@@ -109,53 +124,82 @@ static unsigned long __badness(struct ta
*/
if (p->flags & PF_OOM_ORIGIN)
return ULONG_MAX;
-
/*
- * CPU time is in tens of seconds and run time is in thousands
- * of seconds. There is no particular reason for this other than
- * that it turned out to work very well in practice.
- */
- thread_group_cputime(p, &task_time);
- utime = cputime_to_jiffies(task_time.utime);
- stime = cputime_to_jiffies(task_time.stime);
- cpu_time = (utime + stime) >> (SHIFT_HZ + 3);
-
+	 * Check the process's behavior and VM activity, and give a bonus or
+	 * penalty.
+ */
+ runtime = uptime - p->start_time.tv_sec;
+ penalty = 0;
+ /*
+	 * At OOM, younger processes tend to be the bad ones, and there is no
+	 * good reason to kill a process which worked very well before the OOM.
+	 * This adds a short-runtime penalty of at most 50% of the vm size,
+	 * and a long-running process gets a bonus of up to 50% of its vm size.
+	 * If a process has run for 1 sec, it gets a 0.1% bonus.
+ *
+ * We just check run_time here.
+ */
+ runtime = 5000 - runtime;
+ if (runtime < -5000)
+ runtime = -5000;
+ switch (constraint) {
+ case CONSTRAINT_LOWMEM:
+ /* If LOWMEM OOM, seeing total_vm is wrong */
+		penalty = points * runtime / 10000;
+		break;
+	case CONSTRAINT_CPUSET:
+		penalty = mm->total_vm * runtime / 10000;
+		break;
+	default:
+		/* use total_vm - shared size as the base of the bonus */
+		penalty = (mm->total_vm - mm->shared_vm) * runtime / 10000;
+ break;
+ }
- if (uptime >= p->start_time.tv_sec)
- run_time = (uptime - p->start_time.tv_sec) >> 10;
+ if (likely(jiffies > mm->last_vm_expansion))
+ quiet_time = jiffies - mm->last_vm_expansion;
else
- run_time = 0;
-
- if (cpu_time)
- points /= int_sqrt(cpu_time);
- if (run_time)
- points /= int_sqrt(int_sqrt(run_time));
+ quiet_time = ULONG_MAX - mm->last_vm_expansion + jiffies;
+ quiet_time = jiffies_to_msecs(quiet_time)/1000;
/*
- * Niced processes are most likely less important, so double
- * their badness points.
+	 * If a process recently expanded its (highest) vm size, it gets a
+	 * penalty. This is for catching slow memory leakers. 12.5% is half of
+	 * the runtime penalty.
*/
- if (task_nice(p) > 0)
- points *= 2;
+ quiet_time = 2500 - quiet_time;
+ if (quiet_time < -2500)
+ quiet_time = -1250;
+
+ switch (constraint) {
+ case CONSTRAINT_LOWMEM:
+ /* If LOWMEM OOM, seeing total_vm is wrong */
+ penalty += points * quiet_time / 10000;
+ break;
+ case CONSTRAINT_CPUSET:
+ penalty += mm->total_vm * quiet_time / 10000;
+ break;
+ default:
+ penalty += (mm->total_vm - mm->shared_vm) * quiet_time / 10000;
+ break;
+ }
+ /*
+ * If an old process was quiet, it gets 75% of bonus at maximum.
+ */
+ if ((penalty < 0) && (-penalty > points))
+ return 0;
+ points += penalty;
/*
* Superuser processes are usually more important, so we make it
* less likely that we kill those.
*/
if (has_capability_noaudit(p, CAP_SYS_ADMIN) ||
+ has_capability_noaudit(p, CAP_SYS_RAWIO) ||
has_capability_noaudit(p, CAP_SYS_RESOURCE))
points /= 4;
/*
- * We don't want to kill a process with direct hardware access.
- * Not only could that mess up the hardware, but usually users
- * tend to only have this flag set on applications they think
- * of as important.
- */
- if (has_capability_noaudit(p, CAP_SYS_RAWIO))
- points /= 4;
-
- /*
* If p's nodes don't overlap ours, it may still help to kill p
* because p may have allocated or otherwise mapped memory on
* this node before. However it will be less likely.
On Mon, 2 Nov 2009 16:27:16 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:
> -
> - /* Try to kill a child first */
> + if (fork_bomb) {
> + printk(KERN_ERR "possible fork-bomb is detected. kill them\n");
> + /* We need to kill the youngest one, at least */
> + rcu_read_lock();
> + for_each_process_reverse(c) {
> + if (c == p)
> + break;
> + if (is_forkbomb_family(c, p)) {
> + oom_kill_task(c);
> + break;
> + }
> + }
> + rcu_read_unlock();
> + }
Kosaki said we should kill everything under the tree and that the "break" is
unnecessary here. I nearly agree with him.. after some experiments.
But it seems the biggest problem is the latency caused by swap-out... before
deciding on OOM....
Thanks,
-Kame
Hi, Kame.
I looked over the patch series.
It's a rather big change to the OOM killer.
I see you and David want to remake the OOM killer from scratch.
But it makes it harder for testers to test.
I like your idea of a fork-bomb detector.
Can't we use it without a big change to the as-is OOM heuristics?
Anyway, I need time to dive into the code and test it.
Maybe this weekend.
Thanks for the great effort. :)
On Mon, Nov 2, 2009 at 4:22 PM, KAMEZAWA Hiroyuki
<[email protected]> wrote:
> Hi, as discussed in "Memory overcommit" threads, I started rewrite.
>
> This is just to show "I started" (not just chatting or sleeping ;)
>
> The implementations are not fixed yet, so feel free to make any comments.
> This set is meant to be a minimum change set, I think. Richer functionality
> can be implemented on top of this.
>
> All patches are against "mm-of-the-moment snapshot 2009-11-01-10-01"
>
> Patches are organized as
>
> (1) pass oom-killer more information, classification and fix mempolicy case.
> (2) counting swap usage
> (3) counting lowmem usage
> (4) fork bomb detector/killer
> (5) check expansion of total_vm
> (6) rewrite __badness().
>
> passed small tests on x86-64 boxes.
>
> Thanks,
> -Kame
>
>
--
Kind regards,
Minchan Kim
Minchan Kim wrote:
> Hi, Kame.
>
> I looked over the patch series.
> It's a rather big change to the OOM killer.
yes, bigger than I expected.
> I see you and David want to remake the OOM killer from scratch.
> But it makes it harder for testers to test.
>
I see. Maybe I have to separate this into 2 or 3 stages.
> I like your idea of a fork-bomb detector.
> Can't we use it without a big change to the as-is OOM heuristics?
>
Yes, this is a big change. And I'll cut out the usable parts ;)
Maybe I'll drop most of the changes in patch 6's heuristics part.
(But the selection of the baseline for LOWMEM is not so bad.)
What I want in the early stage is:
- a fix for mempolicy (we need to pass the nodemask)
- swap counting (regardless of oom)
- low_rss counting (if admitted...)
- fork-bomb detector
Let me think about how to make the patch set small and easy to test.
> Anyway, I need time to dive into the code and test it.
> Maybe this weekend.
>
> Thanks for the great effort. :)
>
Thank you for review.
Regards,
-Kame
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> /*
> - * Types of limitations to the nodes from which allocations may occur
> + * Types of limitations to zones from which allocations may occur
> */
"Types of limitations that may cause OOMs"? MEMCG limitations are not zone
based.
> */
>
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +static unsigned long __badness(struct task_struct *p,
> + unsigned long uptime, enum oom_constraint constraint,
> + struct mem_cgroup *mem)
> {
> unsigned long points, cpu_time, run_time;
> struct mm_struct *mm;
Why rename this function? You are adding a global_badness anyways.
> + /*
> + * In numa environ, almost all allocation will be against NORMAL zone.
The typical allocations will be against the policy_zone! SGI IA64 (and
others) have policy_zone == GFP_DMA.
> + * But some small area, ex)GFP_DMA for ia64 or GFP_DMA32 for x86-64
> + * can cause OOM. We can use policy_zone for checking lowmem.
> + */
Simply say that we are checking if the zone constraint is below the policy
zone?
> + * Now, only mempolicy specifies nodemask. But if nodemask
> + * covers all nodes, this oom is global oom.
> + */
> + if (nodemask && !nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
> + ret = CONSTRAINT_MEMORY_POLICY;
Huh? A cpuset can also restrict the nodes?
> + /*
> + * If not __GFP_THISNODE, zonelist containes all nodes. And if
Don't see any __GFP_THISNODE checks here.
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> read_lock(&tasklist_lock);
> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
> + /*
> + * Considering nature of pages required for page-fault,this must be
> + * global OOM (if not cpuset...). Then, CONSTRAINT_NONE is correct.
> + * zonelist, nodemasks are unknown...
> + */
> + __out_of_memory(0, CONSTRAINT_NONE, 0, NULL);
> read_unlock(&tasklist_lock);
Page faults can occur on processes that have memory restrictions.
Submit this patch independently? I think it is generally useful.
I don't think this patch will work in !NUMA, but it's useful there too. Can
you make this work in general?
Thanks! Your review is very helpful around NUMA.
Christoph Lameter wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
>> /*
>> - * Types of limitations to the nodes from which allocations may occur
>> + * Types of limitations to zones from which allocations may occur
>> */
>
> "Types of limitations that may cause OOMs"? MEMCG limitations are not zone
> based.
>
ah, will rewrite.
>> */
>>
>> -unsigned long badness(struct task_struct *p, unsigned long uptime)
>> +static unsigned long __badness(struct task_struct *p,
>> + unsigned long uptime, enum oom_constraint constraint,
>> + struct mem_cgroup *mem)
>> {
>> unsigned long points, cpu_time, run_time;
>> struct mm_struct *mm;
>
> Why rename this function? You are adding a global_badness anyways.
>
Just because of the history of my own updates... i.e., a mistake.
No reason. Sorry.
>
>> + /*
>> + * In numa environ, almost all allocation will be against NORMAL zone.
>
> The typical allocations will be against the policy_zone! SGI IA64 (and
> others) have policy_zone == GFP_DMA.
>
Hmm? OK. I thought GFP_DMA for ia64 was the below-4G zone.
If all memory is GFP_DMA (as on ppc), that means there is no lowmem.
I'll just rewrite the above comment as
"typical allocation will be against policy_zone".
>> + * But some small area, ex)GFP_DMA for ia64 or GFP_DMA32 for x86-64
>> + * can cause OOM. We can use policy_zone for checking lowmem.
>> + */
>
> Simply say that we are checking if the zone constraint is below the policy
> zone?
>
OK, will rewrite. It was too verbose just because policy_zone isn't well known.
>> + * Now, only mempolicy specifies nodemask. But if nodemask
>> + * covers all nodes, this oom is global oom.
>> + */
>> + if (nodemask && !nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
>> + ret = CONSTRAINT_MEMORY_POLICY;
>
> Huh? A cpuset can also restrict the nodes?
>
cpuset doesn't pass a nodemask for allocation (now).
It checks its nodemask in get_page_from_freelist(), internally.
>> + /*
>> + * If not __GFP_THISNODE, zonelist containes all nodes. And if
>
> Don't see any __GFP_THISNODE checks here.
>
If __GFP_THISNODE, the zonelist includes the local node only. Then the
zonelist/nodemask check will catch it and the result will be
CONSTRAINT_MEMORY_POLICY. Then... hmm.... should we add CONSTRAINT_THISNODE?
>> panic("out of memory from page fault. panic_on_oom is selected.\n");
>>
>> read_lock(&tasklist_lock);
>> - __out_of_memory(0, 0); /* unknown gfp_mask and order */
>> + /*
>> + * Considering nature of pages required for page-fault,this must be
>> + * global OOM (if not cpuset...). Then, CONSTRAINT_NONE is correct.
>> + * zonelist, nodemasks are unknown...
>> + */
>> + __out_of_memory(0, CONSTRAINT_NONE, 0, NULL);
>> read_unlock(&tasklist_lock);
>
> Page faults can occur on processes that have memory restrictions.
>
Yes, the comments are bad; I will rewrite them. But we don't have any useful
information here. Fixing pagefault_out_of_memory() is on my to-do list.
It seems wrong.
But a condition unclear to me is when VM_FAULT_OOM can be returned
without an oom-kill... so please give me time.
Thanks,
-Kame
Christoph Lameter wrote:
>
> Submit this patch independently? I think it is generally useful.
>
Thanks, I will do so.
Regards,
-Kame
Christoph Lameter wrote:
>
> I don't think this patch will work in !NUMA, but it's useful there too. Can
> you make this work in general?
>
for NUMA
==
+static inline int is_lowmem_page(struct page *page)
+{
+ if (unlikely(page_zonenum(page) < policy_zone))
+ return 1;
+ return 0;
+}
==
is used. Doesn't this work well?
This check means:
If there is enough memory:
  on my ia64 box ZONE_DMA (<4G) is caught, on an x86-64 box GFP_DMA32 is caught.
If there is little memory (typically < 4G):
  the ia64 box has no lowmem, on the x86-64 box GFP_DMA is caught.
If all zones are the policy zone (ppc):
  there is no lowmem zone.
Because the "amount of memory" changes the answer to "which is lowmem?",
I used the policy zone. If this usage is not appropriate, I'll add something new.
BTW, is it better to export this value from somewhere?
Thanks,
-Kame
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> Now, anon_rss and file_rss are counted as RSS and exported via /proc.
> RSS usage is important information, but another piece of information
> users often ask for is "usage of swap" (according to our user support team).
>
> This patch counts swap entry usage per process and shows it via
> /proc/<pid>/status. I think the status file is robust against a new entry,
> so it is the first candidate.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
Acked-by: David Rientjes <[email protected]>
Thanks! I think this should be added to -mm now while the remainder of
your patchset is developed and reviewed; it's helpful as an independent
change.
> Index: mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
> ===================================================================
> --- mmotm-2.6.32-Nov2.orig/fs/proc/task_mmu.c
> +++ mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
> @@ -17,7 +17,7 @@
> void task_mem(struct seq_file *m, struct mm_struct *mm)
> {
> unsigned long data, text, lib;
> - unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
> + unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, swap;
>
> /*
> * Note: to minimize their overhead, mm maintains hiwater_vm and
> @@ -36,6 +36,8 @@ void task_mem(struct seq_file *m, struct
> data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
> lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> +
> + swap = get_mm_counter(mm, swap_usage);
> seq_printf(m,
> "VmPeak:\t%8lu kB\n"
> "VmSize:\t%8lu kB\n"
Not sure about this newline though.
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Rewrite oom constraint handling to be up to date.
>
> (1) Currently, oom_constraint and other easily available information are
> ignored at badness calculation. Pass them in.
>
> (2) Add more classes of oom constraint: _MEMCG and _LOWMEM.
> This is just an interface change and doesn't add new logic at this stage.
>
> (3) Pass nodemask to oom_kill. alloc_pages() has been totally rewritten and
> takes a nodemask as its argument, so mempolicy no longer has its own
> private zonelist. Passing the nodemask to out_of_memory() is therefore
> necessary. But pagefault_out_of_memory() doesn't have enough information;
> we should revisit this later.
>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> drivers/char/sysrq.c | 2 -
> fs/proc/base.c | 4 +-
> include/linux/oom.h | 8 +++-
> mm/oom_kill.c | 101 +++++++++++++++++++++++++++++++++++++++------------
> mm/page_alloc.c | 2 -
> 5 files changed, 88 insertions(+), 29 deletions(-)
>
> Index: mmotm-2.6.32-Nov2/include/linux/oom.h
> ===================================================================
> --- mmotm-2.6.32-Nov2.orig/include/linux/oom.h
> +++ mmotm-2.6.32-Nov2/include/linux/oom.h
> @@ -10,23 +10,27 @@
> #ifdef __KERNEL__
>
> #include <linux/types.h>
> +#include <linux/nodemask.h>
>
> struct zonelist;
> struct notifier_block;
>
> /*
> - * Types of limitations to the nodes from which allocations may occur
> + * Types of limitations to zones from which allocations may occur
> */
> enum oom_constraint {
> CONSTRAINT_NONE,
> + CONSTRAINT_LOWMEM,
> CONSTRAINT_CPUSET,
> CONSTRAINT_MEMORY_POLICY,
> + CONSTRAINT_MEMCG
> };
>
> extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
>
> -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
> +extern void out_of_memory(struct zonelist *zonelist,
> + gfp_t gfp_mask, int order, nodemask_t *mask);
> extern int register_oom_notifier(struct notifier_block *nb);
> extern int unregister_oom_notifier(struct notifier_block *nb);
>
> Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> @@ -27,6 +27,7 @@
> #include <linux/notifier.h>
> #include <linux/memcontrol.h>
> #include <linux/security.h>
> +#include <linux/mempolicy.h>
>
> int sysctl_panic_on_oom;
> int sysctl_oom_kill_allocating_task;
> @@ -55,6 +56,8 @@ static int has_intersects_mems_allowed(s
> * badness - calculate a numeric value for how bad this task has been
> * @p: task struct of which task we should calculate
> * @uptime: current uptime in seconds
> + * @constraint: type of oom_kill region
> + * @mem: set if called by memory cgroup
> *
> * The formula used is relatively simple and documented inline in the
> * function. The main rationale is that we want to select a good task
> @@ -70,7 +73,9 @@ static int has_intersects_mems_allowed(s
> * of least surprise ... (be careful when you change it)
> */
>
> -unsigned long badness(struct task_struct *p, unsigned long uptime)
> +static unsigned long __badness(struct task_struct *p,
> + unsigned long uptime, enum oom_constraint constraint,
> + struct mem_cgroup *mem)
> {
> unsigned long points, cpu_time, run_time;
> struct mm_struct *mm;
> @@ -193,30 +198,68 @@ unsigned long badness(struct task_struct
> return points;
> }
>
> +/* for /proc */
> +unsigned long global_badness(struct task_struct *p, unsigned long uptime)
> +{
> + return __badness(p, uptime, CONSTRAINT_NONE, NULL);
> +}
I don't understand why this is necessary; CONSTRAINT_NONE should be
available to proc_oom_score() via linux/oom.h. It would probably be
better not to rename badness() and to use it directly.
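Roughly, that would mean fs/proc/base.c ends up with something like the
sketch below (illustration only, assuming badness() keeps its name and is
exported with the widened four-argument prototype):

	static int proc_oom_score(struct task_struct *task, char *buffer)
	{
		unsigned long points;
		struct timespec uptime;

		do_posix_clock_monotonic_gettime(&uptime);
		read_lock(&tasklist_lock);
		/* global view: no constraint, no memcg */
		points = badness(task, uptime.tv_sec, CONSTRAINT_NONE, NULL);
		read_unlock(&tasklist_lock);
		return sprintf(buffer, "%lu\n", points);
	}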
> +
> +
> /*
> * Determine the type of allocation constraint.
> */
> -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> - gfp_t gfp_mask)
> -{
> +
> #ifdef CONFIG_NUMA
> +static inline enum oom_constraint guess_oom_context(struct zonelist *zonelist,
> + gfp_t gfp_mask, nodemask_t *nodemask)
Why is this renamed from constrained_alloc()? If the new code is really a
guess, we probably shouldn't be altering the oom killing behavior to kill
innocent tasks if it's wrong.
> +{
> struct zone *zone;
> struct zoneref *z;
> enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> + enum oom_constraint ret = CONSTRAINT_NONE;
>
> - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> - node_clear(zone_to_nid(zone), nodes);
> - else
> + /*
> + * In numa environ, almost all allocation will be against NORMAL zone.
> + * But some small area, ex)GFP_DMA for ia64 or GFP_DMA32 for x86-64
> + * can cause OOM. We can use policy_zone for checking lowmem.
> + */
> + if (high_zoneidx < policy_zone)
> + return CONSTRAINT_LOWMEM;
> + /*
> + * Now, only mempolicy specifies nodemask. But if nodemask
> + * covers all nodes, this oom is global oom.
> + */
> + if (nodemask && !nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
> + ret = CONSTRAINT_MEMORY_POLICY;
> + /*
> + * If not __GFP_THISNODE, zonelist containes all nodes. And if
> + * zonelist contains a zone which isn't allowed under cpuset, we assume
> + * this allocation failure is caused by cpuset's constraint.
> + * Note: all nodes are scanned if nodemask=NULL.
> + */
> + for_each_zone_zonelist_nodemask(zone,
> + z, zonelist, high_zoneidx, nodemask) {
> + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> return CONSTRAINT_CPUSET;
> + }
This could probably be written as
int nid;
if (nodemask)
for_each_node_mask(nid, *nodemask)
if (!__cpuset_node_allowed_softwall(nid, gfp_mask))
return CONSTRAINT_CPUSET;
and then you don't need the struct zoneref or struct zone.
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Count lowmem rss per mm_struct. Lowmem here means...
>
> for NUMA, pages in a zone < policy_zone.
> for HIGHMEM x86, pages in NORMAL zone.
> for others, all pages are lowmem.
>
> Now, lower_zone_protection[] works very well for protecting lowmem, but the
> possibility of lowmem-oom is not 0 even under good protection in the kernel.
> (In fact, it can be configured by sysctl. When we keep it high, there
> will be tons of unusable memory, but the system will be protected against
> the rare event of lowmem-oom.)
Right, lowmem isn't addressed currently by the oom killer. Adding this
constraint will probably make the heuristics much harder to write and
understand. It's not always clear that we want to kill a task using
lowmem just because another task needs some, for instance. Do you think
we'll need a way to defer killing any task if no task is heuristically
found to be hogging lowmem?
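As a reference point, the classification described in the quoted changelog
could be expressed as a small helper along these lines (a sketch only; the
name and placement are not taken from patch 3):

	/* needs linux/mm.h, linux/mmzone.h and linux/mempolicy.h */
	static inline int page_is_lowmem(struct page *page)
	{
	#ifdef CONFIG_NUMA
		/* for NUMA, lowmem means zones below policy_zone */
		return page_zonenum(page) < policy_zone;
	#elif defined(CONFIG_HIGHMEM)
		/* for HIGHMEM configs, everything below ZONE_HIGHMEM counts */
		return page_zonenum(page) < ZONE_HIGHMEM;
	#else
		/* otherwise, all pages are lowmem */
		return 1;
	#endif
	}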
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> When considering the oom-kill algorithm, we can't avoid taking runtime
> into account. But this can add too big a bonus to a slow memory leaker.
> To add a penalty for slow memory leakers, we record the jiffies of
> the last mm->hiwater_vm expansion. That catches processes which leak
> memory periodically.
>
No, it doesn't, it simply measures the last time the hiwater mark was
increased. That could have increased by a single page in the last tick
with no increase in memory consumption over the past year, and then it's
unfairly biased against for quiet_time in the new oom kill heuristic
(patch 6). Using this as part of the badness scoring is ill conceived
because it doesn't necessarily indicate a memory leaking task, just one
that has recently allocated memory.
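For context, a minimal sketch of what "record jiffies of the last
mm->hiwater_vm expansion" amounts to (inferred from the changelog; the
last_vm_expansion field and hook are hypothetical, not the actual patch 5
hunk):

	/* sketch: stamp the mm each time the high-water mark moves up */
	static inline void update_hiwater_vm(struct mm_struct *mm)
	{
		if (mm->hiwater_vm < mm->total_vm) {
			mm->hiwater_vm = mm->total_vm;
			mm->last_vm_expansion = jiffies;	/* hypothetical field */
		}
	}

Which makes the objection above concrete: a single-page expansion in the
last tick refreshes the stamp even if the task has barely grown all year.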
On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
> Hi, as discussed in "Memory overcommit" threads, I started rewrite.
>
> This is just for showing "I started" (not just chatting or sleeping ;)
>
> All implementations are not fixed yet, so feel free to make any comments.
> This set is the minimum change set, I think. Richer functions
> can be implemented based on this.
>
> All patches are against "mm-of-the-moment snapshot 2009-11-01-10-01"
>
> Patches are organized as
>
> (1) pass oom-killer more information, classification and fix mempolicy case.
> (2) counting swap usage
> (3) counting lowmem usage
> (4) fork bomb detector/killer
> (5) check expansion of total_vm
> (6) rewrite __badness().
>
> passed small tests on x86-64 boxes.
>
Thanks for looking into improving the oom killer!
I think it would be easier to merge the four different concepts you have
here:
- counting for swap usage (patch 2),
- oom killer constraint reorganization (patches 1 and 3),
- fork bomb detector (patch 4), and
- heuristic changes (patches 5 and 6)
into separate patchsets and get them merged one at a time. I think patch
2 can easily be merged into -mm now, and patches 1 and 3 could be merged
after being cleaned up. We'll probably need more discussion on the rest.
Patches 1 and 6 have whitespace damage, btw.
On Tue, 3 Nov 2009 12:34:13 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Hi, as discussed in "Memory overcommit" threads, I started rewrite.
> >
> > This is just for showing "I started" (not just chatting or sleeping ;)
> >
> > All implementations are not fixed yet, so feel free to make any comments.
> > This set is the minimum change set, I think. Richer functions
> > can be implemented based on this.
> >
> > All patches are against "mm-of-the-moment snapshot 2009-11-01-10-01"
> >
> > Patches are organized as
> >
> > (1) pass oom-killer more information, classification and fix mempolicy case.
> > (2) counting swap usage
> > (3) counting lowmem usage
> > (4) fork bomb detector/killer
> > (5) check expansion of total_vm
> > (6) rewrite __badness().
> >
> > passed small tests on x86-64 boxes.
> >
>
> Thanks for looking into improving the oom killer!
>
Thank you for the review.
> I think it would be easier to merge the four different concepts you have
> here:
>
> - counting for swap usage (patch 2),
>
> - oom killer constraint reorganization (patches 1 and 3),
>
> - fork bomb detector (patch 4), and
>
> - heuristic changes (patches 5 and 6)
>
> into separate patchsets and get them merged one at a time.
Yes, I will do so. I think we share the same view of the final image.
> I think patch 2 can easily be merged into -mm now, and patches 1 and 3 could
> be merged after being cleaned up.
OK, maybe patch 1 should be split up further.
>We'll probably need more discussion on the rest.
>
agreed.
> Patches 1 and 6 have whitespace damage, btw.
Oh, will fix.
Thanks,
-Kame
On Tue, 3 Nov 2009 12:18:40 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Rewrite oom constraint to be up to date.
> >
> > (1) Now, at badness calculation, oom_constraint and other information
> > (which is easily available) are ignored. Pass them.
> >
> > (2) Add more classes of oom constraint: _MEMCG and _LOWMEM.
> > This is just an interface change and doesn't add new logic at this stage.
> >
> > (3) Pass nodemask to oom_kill. Now alloc_pages() is totally rewritten and
> > takes a nodemask as its argument. Because of this, mempolicy doesn't have its
> > own private zonelist, so passing the nodemask to out_of_memory() is necessary.
> > But pagefault_out_of_memory() doesn't have enough information. We should
> > revisit this later.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > drivers/char/sysrq.c | 2 -
> > fs/proc/base.c | 4 +-
> > include/linux/oom.h | 8 +++-
> > mm/oom_kill.c | 101 +++++++++++++++++++++++++++++++++++++++------------
> > mm/page_alloc.c | 2 -
> > 5 files changed, 88 insertions(+), 29 deletions(-)
> >
> > Index: mmotm-2.6.32-Nov2/include/linux/oom.h
> > ===================================================================
> > --- mmotm-2.6.32-Nov2.orig/include/linux/oom.h
> > +++ mmotm-2.6.32-Nov2/include/linux/oom.h
> > @@ -10,23 +10,27 @@
> > #ifdef __KERNEL__
> >
> > #include <linux/types.h>
> > +#include <linux/nodemask.h>
> >
> > struct zonelist;
> > struct notifier_block;
> >
> > /*
> > - * Types of limitations to the nodes from which allocations may occur
> > + * Types of limitations to zones from which allocations may occur
> > */
> > enum oom_constraint {
> > CONSTRAINT_NONE,
> > + CONSTRAINT_LOWMEM,
> > CONSTRAINT_CPUSET,
> > CONSTRAINT_MEMORY_POLICY,
> > + CONSTRAINT_MEMCG
> > };
> >
> > extern int try_set_zone_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> > extern void clear_zonelist_oom(struct zonelist *zonelist, gfp_t gfp_flags);
> >
> > -extern void out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask, int order);
> > +extern void out_of_memory(struct zonelist *zonelist,
> > + gfp_t gfp_mask, int order, nodemask_t *mask);
> > extern int register_oom_notifier(struct notifier_block *nb);
> > extern int unregister_oom_notifier(struct notifier_block *nb);
> >
> > Index: mmotm-2.6.32-Nov2/mm/oom_kill.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov2.orig/mm/oom_kill.c
> > +++ mmotm-2.6.32-Nov2/mm/oom_kill.c
> > @@ -27,6 +27,7 @@
> > #include <linux/notifier.h>
> > #include <linux/memcontrol.h>
> > #include <linux/security.h>
> > +#include <linux/mempolicy.h>
> >
> > int sysctl_panic_on_oom;
> > int sysctl_oom_kill_allocating_task;
> > @@ -55,6 +56,8 @@ static int has_intersects_mems_allowed(s
> > * badness - calculate a numeric value for how bad this task has been
> > * @p: task struct of which task we should calculate
> > * @uptime: current uptime in seconds
> > + * @constraint: type of oom_kill region
> > + * @mem: set if called by memory cgroup
> > *
> > * The formula used is relatively simple and documented inline in the
> > * function. The main rationale is that we want to select a good task
> > @@ -70,7 +73,9 @@ static int has_intersects_mems_allowed(s
> > * of least surprise ... (be careful when you change it)
> > */
> >
> > -unsigned long badness(struct task_struct *p, unsigned long uptime)
> > +static unsigned long __badness(struct task_struct *p,
> > + unsigned long uptime, enum oom_constraint constraint,
> > + struct mem_cgroup *mem)
> > {
> > unsigned long points, cpu_time, run_time;
> > struct mm_struct *mm;
> > @@ -193,30 +198,68 @@ unsigned long badness(struct task_struct
> > return points;
> > }
> >
> > +/* for /proc */
> > +unsigned long global_badness(struct task_struct *p, unsigned long uptime)
> > +{
> > + return __badness(p, uptime, CONSTRAINT_NONE, NULL);
> > +}
>
> I don't understand why this is necessary, CONSTRAINT_NONE should be
> available to proc_oom_score() via linux/oom.h. It would probably be
> better to not rename badness() and use it directly.
>
> > +
> > +
> > /*
> > * Determine the type of allocation constraint.
> > */
> > -static inline enum oom_constraint constrained_alloc(struct zonelist *zonelist,
> > - gfp_t gfp_mask)
> > -{
> > +
> > #ifdef CONFIG_NUMA
> > +static inline enum oom_constraint guess_oom_context(struct zonelist *zonelist,
> > + gfp_t gfp_mask, nodemask_t *nodemask)
>
> Why is this renamed from constrained_alloc()? If the new code is really a
> guess, we probably shouldn't be altering the oom killing behavior to kill
> innocent tasks if it's wrong.
>
No particular reason; this just comes from my modification history.
I'll revert this part.
> > +{
> > struct zone *zone;
> > struct zoneref *z;
> > enum zone_type high_zoneidx = gfp_zone(gfp_mask);
> > - nodemask_t nodes = node_states[N_HIGH_MEMORY];
> > + enum oom_constraint ret = CONSTRAINT_NONE;
> >
> > - for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
> > - if (cpuset_zone_allowed_softwall(zone, gfp_mask))
> > - node_clear(zone_to_nid(zone), nodes);
> > - else
> > + /*
> > + * In numa environ, almost all allocation will be against NORMAL zone.
> > + * But some small area, ex)GFP_DMA for ia64 or GFP_DMA32 for x86-64
> > + * can cause OOM. We can use policy_zone for checking lowmem.
> > + */
> > + if (high_zoneidx < policy_zone)
> > + return CONSTRAINT_LOWMEM;
> > + /*
> > + * Now, only mempolicy specifies nodemask. But if nodemask
> > + * covers all nodes, this oom is global oom.
> > + */
> > + if (nodemask && !nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
> > + ret = CONSTRAINT_MEMORY_POLICY;
> > + /*
> > + * If not __GFP_THISNODE, zonelist containes all nodes. And if
> > + * zonelist contains a zone which isn't allowed under cpuset, we assume
> > + * this allocation failure is caused by cpuset's constraint.
> > + * Note: all nodes are scanned if nodemask=NULL.
> > + */
> > + for_each_zone_zonelist_nodemask(zone,
> > + z, zonelist, high_zoneidx, nodemask) {
> > + if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
> > return CONSTRAINT_CPUSET;
> > + }
>
> This could probably be written as
>
> int nid;
> if (nodemask)
> for_each_node_mask(nid, *nodemask)
> if (!__cpuset_node_allowed_softwall(nid, gfp_mask))
> return CONSTRAINT_CPUSET;
>
> and then you don't need the struct zoneref or struct zone.
IIUC, the typical allocation under cpuset has nodemask=NULL, so we'll have
to scan the zonelist. cpuset doesn't pass a nodemask when calling
alloc_pages() because of its softwall/hardwall feature and its hierarchical
nature.
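To make the two cases concrete, a rough sketch (hypothetical, not taken from
the patch) that keeps the zonelist walk for the nodemask=NULL cpuset case and
uses for_each_node_mask() only when a mempolicy nodemask is given:

	#ifdef CONFIG_NUMA
	static enum oom_constraint constrained_alloc(struct zonelist *zonelist,
					gfp_t gfp_mask, nodemask_t *nodemask)
	{
		struct zone *zone;
		struct zoneref *z;
		enum zone_type high_zoneidx = gfp_zone(gfp_mask);
		int nid;

		/* request for a zone below policy_zone: lowmem oom */
		if (high_zoneidx < policy_zone)
			return CONSTRAINT_LOWMEM;

		/* mempolicy is the only caller passing a nodemask today */
		if (nodemask) {
			for_each_node_mask(nid, *nodemask)
				if (!cpuset_node_allowed_softwall(nid, gfp_mask))
					return CONSTRAINT_CPUSET;
			if (!nodes_equal(node_states[N_HIGH_MEMORY], *nodemask))
				return CONSTRAINT_MEMORY_POLICY;
			return CONSTRAINT_NONE;
		}

		/* cpuset case: nodemask is NULL, so the zonelist must be walked */
		for_each_zone_zonelist(zone, z, zonelist, high_zoneidx)
			if (!cpuset_zone_allowed_softwall(zone, gfp_mask))
				return CONSTRAINT_CPUSET;

		return CONSTRAINT_NONE;
	}
	#endif /* CONFIG_NUMA */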
Thanks,
-Kame
On Tue, 3 Nov 2009 11:47:23 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > Now, anon_rss and file_rss are counted as RSS and exported via /proc.
> > RSS usage is important information, but one more piece of information which
> > is often asked for by users is "usage of swap" (our user support team says).
> >
> > This patch counts swap entry usage per process and shows it via
> > /proc/<pid>/status. I think the status file is robust against a new entry,
> > so it is the first candidate.
> >
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> Acked-by: David Rientjes <[email protected]>
>
> Thanks! I think this should be added to -mm now while the remainder of
> your patchset is developed and reviewed; it's helpful as an independent
> change.
>
> > Index: mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
> > ===================================================================
> > --- mmotm-2.6.32-Nov2.orig/fs/proc/task_mmu.c
> > +++ mmotm-2.6.32-Nov2/fs/proc/task_mmu.c
> > @@ -17,7 +17,7 @@
> > void task_mem(struct seq_file *m, struct mm_struct *mm)
> > {
> > unsigned long data, text, lib;
> > - unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss;
> > + unsigned long hiwater_vm, total_vm, hiwater_rss, total_rss, swap;
> >
> > /*
> > * Note: to minimize their overhead, mm maintains hiwater_vm and
> > @@ -36,6 +36,8 @@ void task_mem(struct seq_file *m, struct
> > data = mm->total_vm - mm->shared_vm - mm->stack_vm;
> > text = (PAGE_ALIGN(mm->end_code) - (mm->start_code & PAGE_MASK)) >> 10;
> > lib = (mm->exec_vm << (PAGE_SHIFT-10)) - text;
> > +
> > + swap = get_mm_counter(mm, swap_usage);
> > seq_printf(m,
> > "VmPeak:\t%8lu kB\n"
> > "VmSize:\t%8lu kB\n"
>
> Not sure about this newline though.
I'll clean this up more. Thank you for pointing it out.
Thanks,
-Kame
On Tue, 3 Nov 2009 12:24:01 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Count lowmem rss per mm_struct. Lowmem here means...
> >
> > for NUMA, pages in a zone < policy_zone.
> > for HIGHMEM x86, pages in NORMAL zone.
> > for others, all pages are lowmem.
> >
> > Now, lower_zone_protection[] works very well for protecting lowmem, but the
> > possibility of lowmem-oom is not 0 even under good protection in the kernel.
> > (In fact, it can be configured by sysctl. When we keep it high, there
> > will be tons of unusable memory, but the system will be protected against
> > the rare event of lowmem-oom.)
>
> Right, lowmem isn't addressed currently by the oom killer. Adding this
> constraint will probably make the heuristics much harder to write and
> understand. It's not always clear that we want to kill a task using
> lowmem just because another task needs some, for instance.
The same can be said about all oom-kills ;)
> Do you think we'll need a way to defer killing any task if no task is
> heuristically found to be hogging lowmem?
Yes, I think so, but my position is a bit different.
In the typical x86-32 server case, with 4-8GB of memory, most memory usage
is highmem. So, if we have no knowledge of lowmem, multiple innocent processes
will be killed at every 30-second round of oom-kill.
My final goal is migrating lowmem pages to highmem, as kswapd-migration or
oom-migration. A total rewrite will be required for this in the future.
Thanks,
-Kame
On Tue, 3 Nov 2009 12:29:39 -0800 (PST)
David Rientjes <[email protected]> wrote:
> On Mon, 2 Nov 2009, KAMEZAWA Hiroyuki wrote:
>
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > When considering the oom-kill algorithm, we can't avoid taking runtime
> > into account. But this can add too big a bonus to a slow memory leaker.
> > To add a penalty for slow memory leakers, we record the jiffies of
> > the last mm->hiwater_vm expansion. That catches processes which leak
> > memory periodically.
> >
>
> No, it doesn't, it simply measures the last time the hiwater mark was
> increased. That could have increased by a single page in the last tick
> with no increase in memory consumption over the past year, and then it's
> unfairly biased against for quiet_time in the new oom kill heuristic
> (patch 6). Using this as part of the badness scoring is ill conceived
> because it doesn't necessarily indicate a memory leaking task, just one
> that has recently allocated memory.
Hmm. Maybe I can rewrite this as "has periodic expansion happened or not" code.
Okay, this patch itself will be dropped.
If you find a better algorithm, let me know.
Thanks,
-Kame