Date: Wed, 24 Feb 2010 16:59:21 +0900
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
To: "linux-mm@kvack.org" <linux-mm@kvack.org>
Cc: "linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
       "balbir@linux.vnet.ibm.com" <balbir@linux.vnet.ibm.com>,
       "nishimura@mxp.nes.nec.co.jp" <nishimura@mxp.nes.nec.co.jp>,
       rientjes@google.com
Subject: [RFC][PATCH 1/2] memcg: oom kill handling improvement
Message-Id: <20100224165921.cb091a4f.kamezawa.hiroyu@jp.fujitsu.com>
Organization: FUJITSU Co. LTD.
Mime-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 14607
Lines: 500

These are dump of patches just for showing concept, what I want to do.
But not tested. please see if you have free time. (you can ignore ;)

Anyway, this will HUNK to the latest mmotm, Kirill's work is merged.

This is not related to David's work. I don't hesitate to rebase mine
to the mmotm if his one is merged, it's easy.
But I'm not sure his one goes to mm soon. 

1st patch is for better handling oom-kill under memcg.
2nd patch is oom-notifier and oom-kill-disable for memcg knob.

I'm sorry that I'll be absent tomorrow.
==
From: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>

This is updated version of oom handling improvement om memcg.
But all codes are totaly renewed. This may not be sophisiticated well but
enough for showing idea.

This patch does following things.
  * set "memcg is under OOM" if somone gets into OOM under a memcg.
    like zone's OOM lock, tree-of-memcg is marked as under OOM.
    By this. simlutabeous OOM kill in a tree will not happen.

  * When other threads try to reclaim memory or call oom-kill, it
    checks its own target memcg is under oom or not. If someone
    calls oom-killer already, the thread will be queued to waitq.

  * At some event which makes room for new memory, threads on waitq
    are waken up.
    ** A page (or chunk of pages) are unchraged.
    ** A task is moved.
    ** limit is enlarged.

And this patch also allows to check "current's memcg is changed or not"
while charging.

Considering what admin/daemon can do when it notice OOM,
  * kill a process
  * move a process (to other cgroup which has free area)
  * remove a file (on tmpfs or some)
  * enlarge limit
I think all chances for wakeing up waiters are covered by these.

After this patch, memcg's accounting will not fail in usual path.
If all tasks are OOM_DISABLE, memcg may hang. But admin can have
several options described in above. So, oom notifier+freeze should be
implemented.

TODO: maybe not difficult.
  * Add oom notifier. (can reuse memory.threashold ?)
  * Add a switch for oom-freeze rather than oom-kill.

Cc: David Rientjes <rientjes@google.com>
Cc: Balbir Singh <balbir@in.ibm.com>
Cc: Daisuke Nishimura <nishimura@mxp.nes.nec.co.jp>
Signed-off-by: KAMEZAWA Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com>
---
 include/linux/memcontrol.h |    6 -
 mm/memcontrol.c            |  208 +++++++++++++++++++++++++++++++++++----------
 mm/oom_kill.c              |   11 --
 3 files changed, 167 insertions(+), 58 deletions(-)

Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -200,7 +200,6 @@ struct mem_cgroup {
 	 * Should the accounting and control be hierarchical, per subtree?
 	 */
 	bool use_hierarchy;
-	unsigned long	last_oom_jiffies;
 	atomic_t	refcnt;
 
 	unsigned int	swappiness;
@@ -223,6 +222,8 @@ struct mem_cgroup {
 	 */
 	unsigned long 	move_charge_at_immigrate;
 
+	/* counting ongoing OOM requests under sub hierarchy */
+	atomic_t oom_count;
 	/*
 	 * percpu counter.
 	 */
@@ -1096,6 +1097,89 @@ done:
 }
 
 /*
+ * set/under memcg_oom counting is done under mutex.
+ */
+static DEFINE_MUTEX(memcg_oom_mutex);
+static DECLARE_WAIT_QUEUE_HEAD(memcg_oom_waitq);
+
+static int set_memcg_oom_cb(struct mem_cgroup *mem, void *data)
+{
+	int *max_count = (int*)data;
+	int count = atomic_inc_return(&mem->oom_count);
+	if (*max_count < count)
+		*max_count = count;
+	return 0;
+}
+
+static int unset_memcg_oom_cb(struct mem_cgroup *mem, void *data)
+{
+	atomic_set(&mem->oom_count, 0);
+	return 0;
+}
+
+static bool memcg_under_oom(struct mem_cgroup *mem)
+{
+	if (atomic_read(&mem->oom_count))
+		return true;
+	return false;
+}
+
+static void memcg_oom_wait(struct mem_cgroup *mem)
+{
+	DEFINE_WAIT(wait);
+
+	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
+	if (memcg_under_oom(mem))
+		schedule();
+	finish_wait(&memcg_oom_waitq, &wait);
+}
+
+static void memcg_oom_wake(void)
+{
+	/* This may wake up unnecessary tasks..but it's not big problem */
+	wake_up_all(&memcg_oom_waitq);
+}
+/*
+ * Check there are ongoing oom-kill in this hierarchy or not.
+ * If now under oom-kill, wait for some event to restart job.
+ */
+static bool memcg_handle_oom(struct mem_cgroup *mem, gfp_t mask)
+{
+	int oom_count = 0;
+	DEFINE_WAIT(wait);
+	/*
+	 * Considering hierarchy (below)
+	 * /A
+	 *   /01
+	 *   /02
+	 * If 01 or 02 is under oom-kill, oom-kill in A should wait.
+	 * If "A" is under oom-kill, oom-kill in 01 and 02 should wait.
+	 * (task in 01/02 can be killed.)
+	 * But if 01 is under oom-kill, 02's oom-kill is independent from it.
+	 */
+	prepare_to_wait(&memcg_oom_waitq, &wait, TASK_INTERRUPTIBLE);
+	mem_cgroup_walk_tree(mem, &oom_count, set_memcg_oom_cb);
+	/* Am I the 1st oom killer in this sub hierarchy ? */
+	if (oom_count == 1) {
+		finish_wait(&memcg_oom_waitq, &wait);
+		mem_cgroup_out_of_memory(mem, mask);
+		mem_cgroup_walk_tree(mem, NULL, unset_memcg_oom_cb);
+	} else {
+		/*
+		 * Wakeup is called when
+		 * 1. pages are uncharged. (by killed, or removal of a file)
+		 * 2. limit is enlarged.
+		 * 3. a task is moved.
+		 */
+		schedule();
+		finish_wait(&memcg_oom_waitq, &wait);
+	}
+	if (test_thread_flag(TIF_MEMDIE))
+		return false;
+	return true;
+}
+
+/*
  * This function returns the number of memcg under hierarchy tree. Returns
  * 1(self count) if no children.
  */
@@ -1234,34 +1318,6 @@ static int mem_cgroup_hierarchical_recla
 	return total;
 }
 
-bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	bool ret = false;
-	struct mem_cgroup *mem;
-	struct mm_struct *mm;
-
-	rcu_read_lock();
-	mm = task->mm;
-	if (!mm)
-		mm = &init_mm;
-	mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
-	if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
-		ret = true;
-	rcu_read_unlock();
-	return ret;
-}
-
-static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
-{
-	mem->last_oom_jiffies = jiffies;
-	return 0;
-}
-
-static void record_last_oom(struct mem_cgroup *mem)
-{
-	mem_cgroup_walk_tree(mem, NULL, record_last_oom_cb);
-}
-
 /*
  * Currently used to update mapped file statistics, but the routine can be
  * generalized to update other statistics as well.
@@ -1419,6 +1475,7 @@ static int __cpuinit memcg_stock_cpu_cal
 	return NOTIFY_OK;
 }
 
+
 /*
  * Unlike exported interface, "oom" parameter is added. if oom==true,
  * oom-killer can be invoked.
@@ -1427,17 +1484,21 @@ static int __mem_cgroup_try_charge(struc
 			gfp_t gfp_mask, struct mem_cgroup **memcg,
 			bool oom, struct page *page)
 {
-	struct mem_cgroup *mem, *mem_over_limit;
-	int nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	struct mem_cgroup *mem, *mem_over_limit, *recorded;
+	int nr_retries, csize;
 	struct res_counter *fail_res;
-	int csize = CHARGE_SIZE;
+
+start:
+	nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+	recorded = *memcg;
+	csize = CHARGE_SIZE;
+	mem = NULL;
 
 	if (unlikely(test_thread_flag(TIF_MEMDIE))) {
 		/* Don't account this! */
 		*memcg = NULL;
 		return 0;
 	}
-
 	/*
 	 * We always charge the cgroup the mm_struct belongs to.
 	 * The mm_struct's mem_cgroup changes on task migration if the
@@ -1489,6 +1550,12 @@ static int __mem_cgroup_try_charge(struc
 		}
 		if (!(gfp_mask & __GFP_WAIT))
 			goto nomem;
+		/* already in OOM ? */
+		if (memcg_under_oom(mem_over_limit)) {
+			/* Don't add too much pressure to the host */
+			memcg_oom_wait(mem_over_limit);
+			goto retry;
+		}
 
 		ret = mem_cgroup_hierarchical_reclaim(mem_over_limit, NULL,
 						gfp_mask, flags);
@@ -1549,11 +1616,15 @@ static int __mem_cgroup_try_charge(struc
 		}
 
 		if (!nr_retries--) {
-			if (oom) {
-				mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
-				record_last_oom(mem_over_limit);
-			}
-			goto nomem;
+
+			if (!oom)
+				goto nomem;
+			/* returnes false if current is killed */
+			if (memcg_handle_oom(mem_over_limit, gfp_mask))
+				goto retry;
+			/* For smooth oom-kill of current, return 0 */
+			css_put(&mem->css);
+			return 0;
 		}
 	}
 	if (csize > PAGE_SIZE)
@@ -1572,6 +1643,15 @@ done:
 nomem:
 	css_put(&mem->css);
 	return -ENOMEM;
+
+retry:
+	/*
+	 * current's mem_cgroup can be moved while we're waiting for
+	 * memory reclaim or OOM-Kill.
+	 */
+	*memcg = recorded;
+	css_put(&mem->css);
+	goto start;
 }
 
 /*
@@ -1589,6 +1669,9 @@ static void __mem_cgroup_cancel_charge(s
 		VM_BUG_ON(test_bit(CSS_ROOT, &mem->css.flags));
 		WARN_ON_ONCE(count > INT_MAX);
 		__css_put(&mem->css, (int)count);
+
+		if (memcg_under_oom(mem))
+			memcg_oom_wake();
 	}
 	/* we don't need css_put for root */
 }
@@ -2061,6 +2144,10 @@ direct_uncharge:
 	res_counter_uncharge(&mem->res, PAGE_SIZE);
 	if (uncharge_memsw)
 		res_counter_uncharge(&mem->memsw, PAGE_SIZE);
+	/* Slow path to check OOM waiters */
+	if (!current->memcg_batch.do_batch || batch->memcg != mem)
+		if (memcg_under_oom(mem))
+			memcg_oom_wake();
 	return;
 }
 
@@ -2200,6 +2287,9 @@ void mem_cgroup_uncharge_end(void)
 		res_counter_uncharge(&batch->memcg->res, batch->bytes);
 	if (batch->memsw_bytes)
 		res_counter_uncharge(&batch->memcg->memsw, batch->memsw_bytes);
+
+	if (memcg_under_oom(batch->memcg))
+		memcg_oom_wake();
 	/* forget this pointer (for sanity check) */
 	batch->memcg = NULL;
 }
@@ -2408,8 +2498,7 @@ void mem_cgroup_end_migration(struct mem
 
 /*
  * A call to try to shrink memory usage on charge failure at shmem's swapin.
- * Calling hierarchical_reclaim is not enough because we should update
- * last_oom_jiffies to prevent pagefault_out_of_memory from invoking global OOM.
+ * Calling hierarchical_reclaim is not enough because we have to hand oom-kill.
  * Moreover considering hierarchy, we should reclaim from the mem_over_limit,
  * not from the memcg which this page would be charged to.
  * try_charge_swapin does all of these works properly.
@@ -2440,7 +2529,8 @@ static int mem_cgroup_resize_limit(struc
 	u64 memswlimit;
 	int ret = 0;
 	int children = mem_cgroup_count_children(memcg);
-	u64 curusage, oldusage;
+	u64 curusage, oldusage, curlimit;
+	int enlarge = 0;
 
 	/*
 	 * For keeping hierarchical_reclaim simple, how long we should retry
@@ -2451,6 +2541,7 @@ static int mem_cgroup_resize_limit(struc
 
 	oldusage = res_counter_read_u64(&memcg->res, RES_USAGE);
 
+
 	while (retry_count) {
 		if (signal_pending(current)) {
 			ret = -EINTR;
@@ -2468,6 +2559,9 @@ static int mem_cgroup_resize_limit(struc
 			mutex_unlock(&set_limit_mutex);
 			break;
 		}
+		curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+		if (curlimit < val)
+			enlarge = 1;
 		ret = res_counter_set_limit(&memcg->res, val);
 		if (!ret) {
 			if (memswlimit == val)
@@ -2477,8 +2571,20 @@ static int mem_cgroup_resize_limit(struc
 		}
 		mutex_unlock(&set_limit_mutex);
 
-		if (!ret)
+		if (!ret) {
+			/*
+			 * If we enlarge limit of memcg under OOM,
+			 * wake up waiters.
+			 */
+			if (enlarge && memcg_under_oom(memcg))
+				memcg_oom_wake();
+			break;
+		}
+		/* Under OOM ? If so, don't add more pressure. */
+		if (memcg_under_oom(memcg)) {
+			ret = -EBUSY;
 			break;
+		}
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_SHRINK);
@@ -2497,9 +2603,10 @@ static int mem_cgroup_resize_memsw_limit
 					unsigned long long val)
 {
 	int retry_count;
-	u64 memlimit, oldusage, curusage;
+	u64 memlimit, oldusage, curusage, curlimit;
 	int children = mem_cgroup_count_children(memcg);
 	int ret = -EBUSY;
+	int enlarge;
 
 	/* see mem_cgroup_resize_res_limit */
  	retry_count = children * MEM_CGROUP_RECLAIM_RETRIES;
@@ -2521,6 +2628,9 @@ static int mem_cgroup_resize_memsw_limit
 			mutex_unlock(&set_limit_mutex);
 			break;
 		}
+		curlimit = res_counter_read_u64(&memcg->res, RES_LIMIT);
+		if (curlimit < val)
+			enlarge = 1;
 		ret = res_counter_set_limit(&memcg->memsw, val);
 		if (!ret) {
 			if (memlimit == val)
@@ -2530,8 +2640,15 @@ static int mem_cgroup_resize_memsw_limit
 		}
 		mutex_unlock(&set_limit_mutex);
 
-		if (!ret)
+		if (!ret) {
+			if (enlarge && memcg_under_oom(memcg))
+				memcg_oom_wake();
 			break;
+		}
+		if (memcg_under_oom(memcg)) {
+			ret = -EBUSY;
+			continue;
+		}
 
 		mem_cgroup_hierarchical_reclaim(memcg, NULL, GFP_KERNEL,
 						MEM_CGROUP_RECLAIM_NOSWAP |
@@ -3859,6 +3976,9 @@ one_by_one:
 			ret = -EINTR;
 			break;
 		}
+		/* Undo precharges if there is ongoing OOM */
+		if (memcg_under_oom(mem))
+			return -ENOMEM;
 		if (!batch_count--) {
 			batch_count = PRECHARGE_COUNT_AT_ONCE;
 			cond_resched();
Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -487,6 +487,9 @@ retry:
 		goto retry;
 out:
 	read_unlock(&tasklist_lock);
+	/* give a chance to die for selected process */
+	if (!test_thread_flag(TIF_MEMDIE))
+		schedule_timeout_uninterruptible(1);
 }
 #endif
 
@@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
 		/* Got some memory back in the last second. */
 		return;
 
-	/*
-	 * If this is from memcg, oom-killer is already invoked.
-	 * and not worth to go system-wide-oom.
-	 */
-	if (mem_cgroup_oom_called(current))
-		goto rest_and_return;
-
 	if (sysctl_panic_on_oom)
 		panic("out of memory from page fault. panic_on_oom is selected.\n");
 
@@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
 	 * Give "p" a good chance of killing itself before we
 	 * retry to allocate memory.
 	 */
-rest_and_return:
 	if (!test_thread_flag(TIF_MEMDIE))
 		schedule_timeout_uninterruptible(1);
 }
Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
+++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
 	return false;
 }
 
-extern bool mem_cgroup_oom_called(struct task_struct *task);
 void mem_cgroup_update_file_mapped(struct page *page, int val);
 unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
 						gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
 	return true;
 }
 
-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
-	return false;
-}
-
 static inline int
 mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
 {

--
To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
the body of a message to majordomo@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Please read the FAQ at  http://www.tux.org/lkml/