2010-02-23 03:06:46

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH] memcg: page fault oom improvement

Nishimura-san, could you review and test your extreme test case with this ?

==
From: KAMEZAWA Hiroyuki <[email protected]>

Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
random oom-killer should be called. Considering memcg, it handles
OOM-kill in its own logic, there was a problem as "oom-killer called
twice" problem.

By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
in pagefault_oom_killer shouldn't kill some (random) task if
memcg's oom-killer already killed anyone.
That was done by comapring current jiffies and last oom jiffies of memcg.

I thought that easy fix was enough, but Nishimura could write a test case
where checking jiffies is not enough. So, my fix was not enough.
This is a fix of above commit.

This new one does this.
* memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
* If someone is calling oom-killer, wait for it in try_charge().
* If TIF_MEMDIE is set as a result of try_charge(), return 0 and
allow process to make progress (and die.)
* removed hook in pagefault_out_of_memory.

By this, pagefult_out_of_memory will be never called if memcg's oom-killer
is called and scattered codes are now in memcg's charge logic again.

TODO:
If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
anyway. We need to investigate it whether there is a case.

Cc: David Rientjes <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
mm/oom_kill.c | 11 +++--------
2 files changed, 26 insertions(+), 26 deletions(-)

Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
return total;
}

-bool mem_cgroup_oom_called(struct task_struct *task)
+DEFINE_MUTEX(memcg_oom_mutex);
+bool mem_cgroup_oom_called(struct mem_cgroup *mem)
{
- bool ret = false;
- struct mem_cgroup *mem;
- struct mm_struct *mm;
-
- rcu_read_lock();
- mm = task->mm;
- if (!mm)
- mm = &init_mm;
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
- ret = true;
- rcu_read_unlock();
- return ret;
+ if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+ return true;
+ return false;
}

static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
@@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
}

if (!nr_retries--) {
- if (oom) {
- mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+ int oom_kill_called;
+ if (!oom)
+ goto nomem;
+ mutex_lock(&memcg_oom_mutex);
+ oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
+ if (!oom_kill_called)
record_last_oom(mem_over_limit);
- }
- goto nomem;
+ mutex_unlock(&memcg_oom_mutex);
+ if (!oom_kill_called)
+ mem_cgroup_out_of_memory(mem_over_limit,
+ gfp_mask);
+ else /* give a chance to die for other tasks */
+ schedule_timeout(1);
+ nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ /* Killed myself ? */
+ if (!test_thread_flag(TIF_MEMDIE))
+ continue;
+ /* For smooth oom-kill of current, return 0 */
+ return 0;
}
}
if (csize > PAGE_SIZE)
Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -487,6 +487,9 @@ retry:
goto retry;
out:
read_unlock(&tasklist_lock);
+ /* give a chance to die for selected process */
+ if (test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
}
#endif

@@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
/* Got some memory back in the last second. */
return;

- /*
- * If this is from memcg, oom-killer is already invoked.
- * and not worth to go system-wide-oom.
- */
- if (mem_cgroup_oom_called(current))
- goto rest_and_return;
-
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

@@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
* Give "p" a good chance of killing itself before we
* retry to allocate memory.
*/
-rest_and_return:
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_uninterruptible(1);
}


2010-02-23 05:07:30

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement

On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> Nishimura-san, could you review and test your extreme test case with this ?
>
Thank you for your patch.
I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
but I'll try more anyway.

A few comments are inlined.

> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
> random oom-killer should be called. Considering memcg, it handles
> OOM-kill in its own logic, there was a problem as "oom-killer called
> twice" problem.
>
> By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
> in pagefault_oom_killer shouldn't kill some (random) task if
> memcg's oom-killer already killed anyone.
> That was done by comapring current jiffies and last oom jiffies of memcg.
>
> I thought that easy fix was enough, but Nishimura could write a test case
> where checking jiffies is not enough. So, my fix was not enough.
> This is a fix of above commit.
>
> This new one does this.
> * memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
> * If someone is calling oom-killer, wait for it in try_charge().
> * If TIF_MEMDIE is set as a result of try_charge(), return 0 and
> allow process to make progress (and die.)
> * removed hook in pagefault_out_of_memory.
>
> By this, pagefult_out_of_memory will be never called if memcg's oom-killer
> is called and scattered codes are now in memcg's charge logic again.
>
> TODO:
> If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
> anyway. We need to investigate it whether there is a case.
>
> Cc: David Rientjes <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
> mm/oom_kill.c | 11 +++--------
> 2 files changed, 26 insertions(+), 26 deletions(-)
>
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
> return total;
> }
>
> -bool mem_cgroup_oom_called(struct task_struct *task)
> +DEFINE_MUTEX(memcg_oom_mutex);
it can be static.

> +bool mem_cgroup_oom_called(struct mem_cgroup *mem)
> {
> - bool ret = false;
> - struct mem_cgroup *mem;
> - struct mm_struct *mm;
> -
> - rcu_read_lock();
> - mm = task->mm;
> - if (!mm)
> - mm = &init_mm;
> - mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> - ret = true;
> - rcu_read_unlock();
> - return ret;
> + if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> + return true;
> + return false;
> }
>
> static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> @@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
> }
>
> if (!nr_retries--) {
> - if (oom) {
> - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> + int oom_kill_called;
> + if (!oom)
> + goto nomem;
> + mutex_lock(&memcg_oom_mutex);
> + oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
> + if (!oom_kill_called)
> record_last_oom(mem_over_limit);
> - }
> - goto nomem;
> + mutex_unlock(&memcg_oom_mutex);
> + if (!oom_kill_called)
> + mem_cgroup_out_of_memory(mem_over_limit,
> + gfp_mask);
> + else /* give a chance to die for other tasks */
> + schedule_timeout(1);
> + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + /* Killed myself ? */
> + if (!test_thread_flag(TIF_MEMDIE))
> + continue;
> + /* For smooth oom-kill of current, return 0 */
> + return 0;
> }
> }
> if (csize > PAGE_SIZE)
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -487,6 +487,9 @@ retry:
> goto retry;
> out:
> read_unlock(&tasklist_lock);
> + /* give a chance to die for selected process */
> + if (test_thread_flag(TIF_MEMDIE))
> + schedule_timeout_uninterruptible(1);
> }
> #endif
>
I think it should be "if (!test_thread_flag(TIF_MEMDIE))".


Thanks,
Daisuke Nishimura.

> @@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
> /* Got some memory back in the last second. */
> return;
>
> - /*
> - * If this is from memcg, oom-killer is already invoked.
> - * and not worth to go system-wide-oom.
> - */
> - if (mem_cgroup_oom_called(current))
> - goto rest_and_return;
> -
> if (sysctl_panic_on_oom)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> @@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
> * Give "p" a good chance of killing itself before we
> * retry to allocate memory.
> */
> -rest_and_return:
> if (!test_thread_flag(TIF_MEMDIE))
> schedule_timeout_uninterruptible(1);
> }
>

2010-02-23 06:10:28

by Balbir Singh

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement

* KAMEZAWA Hiroyuki <[email protected]> [2010-02-23 12:03:15]:

> Nishimura-san, could you review and test your extreme test case with this ?
>
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
> random oom-killer should be called. Considering memcg, it handles
> OOM-kill in its own logic, there was a problem as "oom-killer called
> twice" problem.
>
> By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
> in pagefault_oom_killer shouldn't kill some (random) task if
> memcg's oom-killer already killed anyone.
> That was done by comapring current jiffies and last oom jiffies of memcg.
>
> I thought that easy fix was enough, but Nishimura could write a test case
> where checking jiffies is not enough. So, my fix was not enough.
> This is a fix of above commit.
>
> This new one does this.
> * memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
> * If someone is calling oom-killer, wait for it in try_charge().
> * If TIF_MEMDIE is set as a result of try_charge(), return 0 and
> allow process to make progress (and die.)
> * removed hook in pagefault_out_of_memory.
>
> By this, pagefult_out_of_memory will be never called if memcg's oom-killer
> is called and scattered codes are now in memcg's charge logic again.
>
> TODO:
> If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
> anyway. We need to investigate it whether there is a case.
>
> Cc: David Rientjes <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>

I've not reviewed David's latest OOM killer changes. Are these changes based on top of
what is going to come in with David's proposal?
--
Three Cheers,
Balbir

2010-02-23 06:15:58

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement

On Tue, 23 Feb 2010 11:40:20 +0530
Balbir Singh <[email protected]> wrote:

> * KAMEZAWA Hiroyuki <[email protected]> [2010-02-23 12:03:15]:
>
> > Nishimura-san, could you review and test your extreme test case with this ?
> >
> > ==
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
> > random oom-killer should be called. Considering memcg, it handles
> > OOM-kill in its own logic, there was a problem as "oom-killer called
> > twice" problem.
> >
> > By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
> > in pagefault_oom_killer shouldn't kill some (random) task if
> > memcg's oom-killer already killed anyone.
> > That was done by comapring current jiffies and last oom jiffies of memcg.
> >
> > I thought that easy fix was enough, but Nishimura could write a test case
> > where checking jiffies is not enough. So, my fix was not enough.
> > This is a fix of above commit.
> >
> > This new one does this.
> > * memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
> > * If someone is calling oom-killer, wait for it in try_charge().
> > * If TIF_MEMDIE is set as a result of try_charge(), return 0 and
> > allow process to make progress (and die.)
> > * removed hook in pagefault_out_of_memory.
> >
> > By this, pagefult_out_of_memory will be never called if memcg's oom-killer
> > is called and scattered codes are now in memcg's charge logic again.
> >
> > TODO:
> > If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
> > anyway. We need to investigate it whether there is a case.
> >
> > Cc: David Rientjes <[email protected]>
> > Cc: Balbir Singh <[email protected]>
> > Cc: Daisuke Nishimura <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
>
> I've not reviewed David's latest OOM killer changes. Are these changes based on top of
> what is going to come in with David's proposal?

About this change. no. This is an independent patch.
But through these a few month work, I(we) noticed page_fault_out_of_memory() is
dangerous and VM_FALUT_OOM should not be returned as much as possible.
About memcg, it's not necessary to return VM_FAULT_OOM when we know oom-killer
is called.

This fix itself is straightforward. But difficult thing here is test case, I think.

Thanks,
-Kame

2010-02-23 06:26:25

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement

On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > Nishimura-san, could you review and test your extreme test case with this ?
> >
> Thank you for your patch.
> I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> but I'll try more anyway.
>
I can triggered the problem in mmotm.

I'll continue my test with your patch applied.


Thanks,
Daisuke Nishimura.

2010-02-23 06:30:20

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 15:21:16 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > Nishimura-san, could you review and test your extreme test case with this ?
> > >
> > Thank you for your patch.
> > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > but I'll try more anyway.
> >
> I can triggered the problem in mmotm.
>
> I'll continue my test with your patch applied.
>

Thank you. Updated one here.

From: KAMEZAWA Hiroyuki <[email protected]>

Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
random oom-killer should be called. Considering memcg, it handles
OOM-kill in its own logic, there was a problem as "oom-killer called
twice" problem.

By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
in pagefault_oom_killer shouldn't kill some (random) task if
memcg's oom-killer already killed someone.
That was done by comapring current jiffies and last oom jiffies of memcg.

I thought that easy fix was enough, but Nishimura could write a test case
where checking jiffies is not enough. This is a fix of above commit.

This new one does this.
* memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
* If someone is calling oom-killer, wait for it in try_charge().
* If TIF_MEMDIE is set as a result of try_charge(), return 0 and
allow process to make progress (and die.)
* removed hook in pagefault_out_of_memory.

By this, pagefult_out_of_memory will be never called if memcg's oom-killer
is called.

TODO:
If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
anyway. We need to investigate it whether there is a case.

Changelog: 2010/02/23
* fixed MEMDIE condition check.
* making internal symbols to be static.

Cc: David Rientjes <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
mm/oom_kill.c | 11 +++--------
2 files changed, 26 insertions(+), 26 deletions(-)

Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
return total;
}

-bool mem_cgroup_oom_called(struct task_struct *task)
+static DEFINE_MUTEX(memcg_oom_mutex);
+static bool mem_cgroup_oom_called(struct mem_cgroup *mem)
{
- bool ret = false;
- struct mem_cgroup *mem;
- struct mm_struct *mm;
-
- rcu_read_lock();
- mm = task->mm;
- if (!mm)
- mm = &init_mm;
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
- ret = true;
- rcu_read_unlock();
- return ret;
+ if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+ return true;
+ return false;
}

static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
@@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
}

if (!nr_retries--) {
- if (oom) {
- mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+ int oom_kill_called;
+ if (!oom)
+ goto nomem;
+ mutex_lock(&memcg_oom_mutex);
+ oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
+ if (!oom_kill_called)
record_last_oom(mem_over_limit);
- }
- goto nomem;
+ mutex_unlock(&memcg_oom_mutex);
+ if (!oom_kill_called)
+ mem_cgroup_out_of_memory(mem_over_limit,
+ gfp_mask);
+ else /* give a chance to die for other tasks */
+ schedule_timeout(1);
+ nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ /* Killed myself ? */
+ if (!test_thread_flag(TIF_MEMDIE))
+ continue;
+ /* For smooth oom-kill of current, return 0 */
+ return 0;
}
}
if (csize > PAGE_SIZE)
Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -487,6 +487,9 @@ retry:
goto retry;
out:
read_unlock(&tasklist_lock);
+ /* give a chance to die for selected process */
+ if (!test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
}
#endif

@@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
/* Got some memory back in the last second. */
return;

- /*
- * If this is from memcg, oom-killer is already invoked.
- * and not worth to go system-wide-oom.
- */
- if (mem_cgroup_oom_called(current))
- goto rest_and_return;
-
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

@@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
* Give "p" a good chance of killing itself before we
* retry to allocate memory.
*/
-rest_and_return:
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_uninterruptible(1);
}

2010-02-23 06:58:34

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 15:26:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> On Tue, 23 Feb 2010 15:21:16 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > Nishimura-san, could you review and test your extreme test case with this ?
> > > >
> > > Thank you for your patch.
> > > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > > but I'll try more anyway.
> > >
> > I can triggered the problem in mmotm.
> >
> > I'll continue my test with your patch applied.
> >
>
> Thank you. Updated one here.
>
Unfortunately, we need one more fix to avoid build error: remove the declaration
of mem_cgroup_oom_called() from memcontrol.h.

Thanks,
Daisuke Nishimura.

> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
> random oom-killer should be called. Considering memcg, it handles
> OOM-kill in its own logic, there was a problem as "oom-killer called
> twice" problem.
>
> By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
> in pagefault_oom_killer shouldn't kill some (random) task if
> memcg's oom-killer already killed someone.
> That was done by comapring current jiffies and last oom jiffies of memcg.
>
> I thought that easy fix was enough, but Nishimura could write a test case
> where checking jiffies is not enough. This is a fix of above commit.
>
> This new one does this.
> * memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
> * If someone is calling oom-killer, wait for it in try_charge().
> * If TIF_MEMDIE is set as a result of try_charge(), return 0 and
> allow process to make progress (and die.)
> * removed hook in pagefault_out_of_memory.
>
> By this, pagefult_out_of_memory will be never called if memcg's oom-killer
> is called.
>
> TODO:
> If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
> anyway. We need to investigate it whether there is a case.
>
> Changelog: 2010/02/23
> * fixed MEMDIE condition check.
> * making internal symbols to be static.
>
> Cc: David Rientjes <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
> mm/oom_kill.c | 11 +++--------
> 2 files changed, 26 insertions(+), 26 deletions(-)
>
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
> return total;
> }
>
> -bool mem_cgroup_oom_called(struct task_struct *task)
> +static DEFINE_MUTEX(memcg_oom_mutex);
> +static bool mem_cgroup_oom_called(struct mem_cgroup *mem)
> {
> - bool ret = false;
> - struct mem_cgroup *mem;
> - struct mm_struct *mm;
> -
> - rcu_read_lock();
> - mm = task->mm;
> - if (!mm)
> - mm = &init_mm;
> - mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> - ret = true;
> - rcu_read_unlock();
> - return ret;
> + if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> + return true;
> + return false;
> }
>
> static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> @@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
> }
>
> if (!nr_retries--) {
> - if (oom) {
> - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> + int oom_kill_called;
> + if (!oom)
> + goto nomem;
> + mutex_lock(&memcg_oom_mutex);
> + oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
> + if (!oom_kill_called)
> record_last_oom(mem_over_limit);
> - }
> - goto nomem;
> + mutex_unlock(&memcg_oom_mutex);
> + if (!oom_kill_called)
> + mem_cgroup_out_of_memory(mem_over_limit,
> + gfp_mask);
> + else /* give a chance to die for other tasks */
> + schedule_timeout(1);
> + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + /* Killed myself ? */
> + if (!test_thread_flag(TIF_MEMDIE))
> + continue;
> + /* For smooth oom-kill of current, return 0 */
> + return 0;
> }
> }
> if (csize > PAGE_SIZE)
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -487,6 +487,9 @@ retry:
> goto retry;
> out:
> read_unlock(&tasklist_lock);
> + /* give a chance to die for selected process */
> + if (!test_thread_flag(TIF_MEMDIE))
> + schedule_timeout_uninterruptible(1);
> }
> #endif
>
> @@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
> /* Got some memory back in the last second. */
> return;
>
> - /*
> - * If this is from memcg, oom-killer is already invoked.
> - * and not worth to go system-wide-oom.
> - */
> - if (mem_cgroup_oom_called(current))
> - goto rest_and_return;
> -
> if (sysctl_panic_on_oom)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> @@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
> * Give "p" a good chance of killing itself before we
> * retry to allocate memory.
> */
> -rest_and_return:
> if (!test_thread_flag(TIF_MEMDIE))
> schedule_timeout_uninterruptible(1);
> }
>

2010-02-23 07:10:46

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 15:55:43 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 23 Feb 2010 15:26:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > On Tue, 23 Feb 2010 15:21:16 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > Nishimura-san, could you review and test your extreme test case with this ?
> > > > >
> > > > Thank you for your patch.
> > > > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > > > but I'll try more anyway.
> > > >
> > > I can triggered the problem in mmotm.
> > >
> > > I'll continue my test with your patch applied.
> > >
> >
> > Thank you. Updated one here.
> >
> Unfortunately, we need one more fix to avoid build error: remove the declaration
> of mem_cgroup_oom_called() from memcontrol.h.
>
Ouch, I missed to add memcontrol.h to quilt's reflesh set..
This is updated one. Anyway, I'd like to wait for the next mmotm.
We already have several changes.

-Kame
==
From: KAMEZAWA Hiroyuki <[email protected]>

Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
random oom-killer should be called. Considering memcg, it handles
OOM-kill in its own logic, there was a problem as "oom-killer called
twice" problem.

By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
in pagefault_oom_killer shouldn't kill some (random) task if
memcg's oom-killer already killed someone.
That was done by comapring current jiffies and last oom jiffies of memcg.

I thought that easy fix was enough, but Nishimura could write a test case
where checking jiffies is not enough. This is a fix of above commit.

This new one does this.
* memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
* If someone is calling oom-killer, wait for it in try_charge().
* If TIF_MEMDIE is set as a result of try_charge(), return 0 and
allow process to make progress (and die.)
* removed hook in pagefault_out_of_memory.

By this, pagefult_out_of_memory will be never called if memcg's oom-killer
is called.

TODO:
If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
anyway. We need to investigate it whether there is a case.

Changelog: 2010/02/23
* fixed MEMDIE condition check.
* making internal symbols to be static.

Cc: David Rientjes <[email protected]>
Cc: Balbir Singh <[email protected]>
Cc: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
include/linux/memcontrol.h | 6 ------
mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
mm/oom_kill.c | 11 +++--------
3 files changed, 26 insertions(+), 32 deletions(-)

Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
+++ mmotm-2.6.33-Feb11/mm/memcontrol.c
@@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
return total;
}

-bool mem_cgroup_oom_called(struct task_struct *task)
+static DEFINE_MUTEX(memcg_oom_mutex);
+static bool mem_cgroup_oom_called(struct mem_cgroup *mem)
{
- bool ret = false;
- struct mem_cgroup *mem;
- struct mm_struct *mm;
-
- rcu_read_lock();
- mm = task->mm;
- if (!mm)
- mm = &init_mm;
- mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
- if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
- ret = true;
- rcu_read_unlock();
- return ret;
+ if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
+ return true;
+ return false;
}

static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
@@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
}

if (!nr_retries--) {
- if (oom) {
- mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
+ int oom_kill_called;
+ if (!oom)
+ goto nomem;
+ mutex_lock(&memcg_oom_mutex);
+ oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
+ if (!oom_kill_called)
record_last_oom(mem_over_limit);
- }
- goto nomem;
+ mutex_unlock(&memcg_oom_mutex);
+ if (!oom_kill_called)
+ mem_cgroup_out_of_memory(mem_over_limit,
+ gfp_mask);
+ else /* give a chance to die for other tasks */
+ schedule_timeout(1);
+ nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
+ /* Killed myself ? */
+ if (!test_thread_flag(TIF_MEMDIE))
+ continue;
+ /* For smooth oom-kill of current, return 0 */
+ return 0;
}
}
if (csize > PAGE_SIZE)
Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -487,6 +487,9 @@ retry:
goto retry;
out:
read_unlock(&tasklist_lock);
+ /* give a chance to die for selected process */
+ if (!test_thread_flag(TIF_MEMDIE))
+ schedule_timeout_uninterruptible(1);
}
#endif

@@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
/* Got some memory back in the last second. */
return;

- /*
- * If this is from memcg, oom-killer is already invoked.
- * and not worth to go system-wide-oom.
- */
- if (mem_cgroup_oom_called(current))
- goto rest_and_return;
-
if (sysctl_panic_on_oom)
panic("out of memory from page fault. panic_on_oom is selected.\n");

@@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
* Give "p" a good chance of killing itself before we
* retry to allocate memory.
*/
-rest_and_return:
if (!test_thread_flag(TIF_MEMDIE))
schedule_timeout_uninterruptible(1);
}
Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
===================================================================
--- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
+++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
@@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
return false;
}

-extern bool mem_cgroup_oom_called(struct task_struct *task);
void mem_cgroup_update_file_mapped(struct page *page, int val);
unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
gfp_t gfp_mask, int nid,
@@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
return true;
}

-static inline bool mem_cgroup_oom_called(struct task_struct *task)
-{
- return false;
-}
-
static inline int
mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
{

2010-02-23 08:42:10

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 16:07:14 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 23 Feb 2010 15:55:43 +0900
> Daisuke Nishimura <[email protected]> wrote:
>
> > On Tue, 23 Feb 2010 15:26:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > On Tue, 23 Feb 2010 15:21:16 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > > > On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > > Nishimura-san, could you review and test your extreme test case with this ?
> > > > > >
> > > > > Thank you for your patch.
> > > > > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > > > > but I'll try more anyway.
> > > > >
> > > > I can triggered the problem in mmotm.
> > > >
> > > > I'll continue my test with your patch applied.
> > > >
> > >
> > > Thank you. Updated one here.
> > >
> > Unfortunately, we need one more fix to avoid build error: remove the declaration
> > of mem_cgroup_oom_called() from memcontrol.h.
> >
> Ouch, I missed to add memcontrol.h to quilt's reflesh set..
> This is updated one. Anyway, I'd like to wait for the next mmotm.
> We already have several changes.
>

After reviewing again, we may be able to remove memcg->oom_jiffies.
Because select_bad_process() returns -1 if there is a TIF_MEMDIE task,
no oom-kill will happen if a tasks is being killed.

But a concern is simultaneous calls of out-of-memory. I think mutex will
be necessary. I'll check tomorrow, again.

Thanks,
-Kame



> -Kame
> ==
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Now, because of page_fault_oom_kill, returning VM_FAULT_OOM means
> random oom-killer should be called. Considering memcg, it handles
> OOM-kill in its own logic, there was a problem as "oom-killer called
> twice" problem.
>
> By commit a636b327f731143ccc544b966cfd8de6cb6d72c6, I added a check
> in pagefault_oom_killer shouldn't kill some (random) task if
> memcg's oom-killer already killed someone.
> That was done by comapring current jiffies and last oom jiffies of memcg.
>
> I thought that easy fix was enough, but Nishimura could write a test case
> where checking jiffies is not enough. This is a fix of above commit.
>
> This new one does this.
> * memcg's try_charge() never returns -ENOMEM if oom-killer is allowed.
> * If someone is calling oom-killer, wait for it in try_charge().
> * If TIF_MEMDIE is set as a result of try_charge(), return 0 and
> allow process to make progress (and die.)
> * removed hook in pagefault_out_of_memory.
>
> By this, pagefult_out_of_memory will be never called if memcg's oom-killer
> is called.
>
> TODO:
> If __GFP_WAIT is not specified in gfp_mask flag, VM_FAULT_OOM will return
> anyway. We need to investigate it whether there is a case.
>
> Changelog: 2010/02/23
> * fixed MEMDIE condition check.
> * making internal symbols to be static.
>
> Cc: David Rientjes <[email protected]>
> Cc: Balbir Singh <[email protected]>
> Cc: Daisuke Nishimura <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> include/linux/memcontrol.h | 6 ------
> mm/memcontrol.c | 41 +++++++++++++++++++++++------------------
> mm/oom_kill.c | 11 +++--------
> 3 files changed, 26 insertions(+), 32 deletions(-)
>
> Index: mmotm-2.6.33-Feb11/mm/memcontrol.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/memcontrol.c
> +++ mmotm-2.6.33-Feb11/mm/memcontrol.c
> @@ -1234,21 +1234,12 @@ static int mem_cgroup_hierarchical_recla
> return total;
> }
>
> -bool mem_cgroup_oom_called(struct task_struct *task)
> +static DEFINE_MUTEX(memcg_oom_mutex);
> +static bool mem_cgroup_oom_called(struct mem_cgroup *mem)
> {
> - bool ret = false;
> - struct mem_cgroup *mem;
> - struct mm_struct *mm;
> -
> - rcu_read_lock();
> - mm = task->mm;
> - if (!mm)
> - mm = &init_mm;
> - mem = mem_cgroup_from_task(rcu_dereference(mm->owner));
> - if (mem && time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> - ret = true;
> - rcu_read_unlock();
> - return ret;
> + if (time_before(jiffies, mem->last_oom_jiffies + HZ/10))
> + return true;
> + return false;
> }
>
> static int record_last_oom_cb(struct mem_cgroup *mem, void *data)
> @@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
> }
>
> if (!nr_retries--) {
> - if (oom) {
> - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> + int oom_kill_called;
> + if (!oom)
> + goto nomem;
> + mutex_lock(&memcg_oom_mutex);
> + oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
> + if (!oom_kill_called)
> record_last_oom(mem_over_limit);
> - }
> - goto nomem;
> + mutex_unlock(&memcg_oom_mutex);
> + if (!oom_kill_called)
> + mem_cgroup_out_of_memory(mem_over_limit,
> + gfp_mask);
> + else /* give a chance to die for other tasks */
> + schedule_timeout(1);
> + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> + /* Killed myself ? */
> + if (!test_thread_flag(TIF_MEMDIE))
> + continue;
> + /* For smooth oom-kill of current, return 0 */
> + return 0;
> }
> }
> if (csize > PAGE_SIZE)
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -487,6 +487,9 @@ retry:
> goto retry;
> out:
> read_unlock(&tasklist_lock);
> + /* give a chance to die for selected process */
> + if (!test_thread_flag(TIF_MEMDIE))
> + schedule_timeout_uninterruptible(1);
> }
> #endif
>
> @@ -601,13 +604,6 @@ void pagefault_out_of_memory(void)
> /* Got some memory back in the last second. */
> return;
>
> - /*
> - * If this is from memcg, oom-killer is already invoked.
> - * and not worth to go system-wide-oom.
> - */
> - if (mem_cgroup_oom_called(current))
> - goto rest_and_return;
> -
> if (sysctl_panic_on_oom)
> panic("out of memory from page fault. panic_on_oom is selected.\n");
>
> @@ -619,7 +615,6 @@ void pagefault_out_of_memory(void)
> * Give "p" a good chance of killing itself before we
> * retry to allocate memory.
> */
> -rest_and_return:
> if (!test_thread_flag(TIF_MEMDIE))
> schedule_timeout_uninterruptible(1);
> }
> Index: mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/include/linux/memcontrol.h
> +++ mmotm-2.6.33-Feb11/include/linux/memcontrol.h
> @@ -124,7 +124,6 @@ static inline bool mem_cgroup_disabled(v
> return false;
> }
>
> -extern bool mem_cgroup_oom_called(struct task_struct *task);
> void mem_cgroup_update_file_mapped(struct page *page, int val);
> unsigned long mem_cgroup_soft_limit_reclaim(struct zone *zone, int order,
> gfp_t gfp_mask, int nid,
> @@ -258,11 +257,6 @@ static inline bool mem_cgroup_disabled(v
> return true;
> }
>
> -static inline bool mem_cgroup_oom_called(struct task_struct *task)
> -{
> - return false;
> -}
> -
> static inline int
> mem_cgroup_inactive_anon_is_low(struct mem_cgroup *memcg)
> {
>
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
>

2010-02-23 10:58:35

by Daisuke Nishimura

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 17:38:35 +0900
KAMEZAWA Hiroyuki <[email protected]> wrote:

> On Tue, 23 Feb 2010 16:07:14 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 23 Feb 2010 15:55:43 +0900
> > Daisuke Nishimura <[email protected]> wrote:
> >
> > > On Tue, 23 Feb 2010 15:26:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > On Tue, 23 Feb 2010 15:21:16 +0900
> > > > Daisuke Nishimura <[email protected]> wrote:
> > > >
> > > > > On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > > > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > > > Nishimura-san, could you review and test your extreme test case with this ?
> > > > > > >
> > > > > > Thank you for your patch.
> > > > > > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > > > > > but I'll try more anyway.
> > > > > >
> > > > > I can triggered the problem in mmotm.
> > > > >
> > > > > I'll continue my test with your patch applied.
> > > > >
> > > >
> > > > Thank you. Updated one here.
> > > >
> > > Unfortunately, we need one more fix to avoid build error: remove the declaration
> > > of mem_cgroup_oom_called() from memcontrol.h.
> > >
> > Ouch, I missed to add memcontrol.h to quilt's reflesh set..
> > This is updated one. Anyway, I'd like to wait for the next mmotm.
> > We already have several changes.
> >
>
> After reviewing again, we may be able to remove memcg->oom_jiffies.
> Because select_bad_process() returns -1 if there is a TIF_MEMDIE task,
> no oom-kill will happen if a tasks is being killed.
>
> But a concern is simultaneous calls of out-of-memory. I think mutex will
> be necessary. I'll check tomorrow, again.
>
I see.

I have one more point.

> > @@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
> > }
> >
> > if (!nr_retries--) {
> > - if (oom) {
> > - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> > + int oom_kill_called;
> > + if (!oom)
> > + goto nomem;
> > + mutex_lock(&memcg_oom_mutex);
> > + oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
> > + if (!oom_kill_called)
> > record_last_oom(mem_over_limit);
> > - }
> > - goto nomem;
> > + mutex_unlock(&memcg_oom_mutex);
> > + if (!oom_kill_called)
> > + mem_cgroup_out_of_memory(mem_over_limit,
> > + gfp_mask);
> > + else /* give a chance to die for other tasks */
> > + schedule_timeout(1);
> > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > + /* Killed myself ? */
> > + if (!test_thread_flag(TIF_MEMDIE))
> > + continue;
> > + /* For smooth oom-kill of current, return 0 */
> > + return 0;
We must call css_put() and reset *memcg to NULL before returning 0.
Otherwise, following commit_charge will commits the page(i.e. set PCG_USED)
while we've not charged res_counter.
(In fact, I saw res_counter underflow warnings(res_counter.c:72).)


Thanks,
Daisuke Nishimura.

2010-02-23 22:49:27

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010, KAMEZAWA Hiroyuki wrote:

> Ouch, I missed to add memcontrol.h to quilt's reflesh set..
> This is updated one. Anyway, I'd like to wait for the next mmotm.
> We already have several changes.
>

I think it would be better to just remove mem_cgroup_out_of_memory() and
make it go through out_of_memory() by specifying a non-NULL pointer to a
struct mem_cgroup. We don't need the duplication in code that these two
functions have and then we can begin to have some consistency with how to
deal with panic_on_oom.

It would be much better to prefer killing current in pagefault oom
conditions, as the final patch in my oom killer rewrite does, if it is
killable. If not, we scan the tasklist and find another suitable
candidate. If current is bound to a memcg, we pass that to
select_bad_process() so that we only kill other tasks from the same
cgroup.

This allows us to hijack the TIF_MEMDIE bit to detect when there is a
parallel pagefault oom killing when the oom killer hasn't necessarily been
invoked to kill a system-wide task (it's simply killing current, by
default, and giving it access to memory reserves). Then, we can change
out_of_memory(), which also now handles memcg oom conditions, to always
scan the tasklist first (including for mempolicy and cpuset constrained
ooms), check for any candidates that have TIF_MEMDIE, and return
ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
from needlessly killing memcg tasks. panic_on_oom would only panic after
the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
pagefault ooms are exempt from that sysctl.

Anyway, do you think it would be possible to rebase on mmotm with my oom
killer rewrite patches? They're at
http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite

2010-02-24 00:02:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 20:00:52 +0900
Daisuke Nishimura <[email protected]> wrote:

> On Tue, 23 Feb 2010 17:38:35 +0900
> KAMEZAWA Hiroyuki <[email protected]> wrote:
>
> > On Tue, 23 Feb 2010 16:07:14 +0900
> > KAMEZAWA Hiroyuki <[email protected]> wrote:
> >
> > > On Tue, 23 Feb 2010 15:55:43 +0900
> > > Daisuke Nishimura <[email protected]> wrote:
> > >
> > > > On Tue, 23 Feb 2010 15:26:50 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > On Tue, 23 Feb 2010 15:21:16 +0900
> > > > > Daisuke Nishimura <[email protected]> wrote:
> > > > >
> > > > > > On Tue, 23 Feb 2010 14:02:18 +0900, Daisuke Nishimura <[email protected]> wrote:
> > > > > > > On Tue, 23 Feb 2010 12:03:15 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> > > > > > > > Nishimura-san, could you review and test your extreme test case with this ?
> > > > > > > >
> > > > > > > Thank you for your patch.
> > > > > > > I don't know why, but the problem seems not so easy to cause in mmotm as in 2.6.32.8,
> > > > > > > but I'll try more anyway.
> > > > > > >
> > > > > > I can triggered the problem in mmotm.
> > > > > >
> > > > > > I'll continue my test with your patch applied.
> > > > > >
> > > > >
> > > > > Thank you. Updated one here.
> > > > >
> > > > Unfortunately, we need one more fix to avoid build error: remove the declaration
> > > > of mem_cgroup_oom_called() from memcontrol.h.
> > > >
> > > Ouch, I missed to add memcontrol.h to quilt's reflesh set..
> > > This is updated one. Anyway, I'd like to wait for the next mmotm.
> > > We already have several changes.
> > >
> >
> > After reviewing again, we may be able to remove memcg->oom_jiffies.
> > Because select_bad_process() returns -1 if there is a TIF_MEMDIE task,
> > no oom-kill will happen if a tasks is being killed.
> >
> > But a concern is simultaneous calls of out-of-memory. I think mutex will
> > be necessary. I'll check tomorrow, again.
> >
> I see.
>
> I have one more point.
>
> > > @@ -1549,11 +1540,25 @@ static int __mem_cgroup_try_charge(struc
> > > }
> > >
> > > if (!nr_retries--) {
> > > - if (oom) {
> > > - mem_cgroup_out_of_memory(mem_over_limit, gfp_mask);
> > > + int oom_kill_called;
> > > + if (!oom)
> > > + goto nomem;
> > > + mutex_lock(&memcg_oom_mutex);
> > > + oom_kill_called = mem_cgroup_oom_called(mem_over_limit);
> > > + if (!oom_kill_called)
> > > record_last_oom(mem_over_limit);
> > > - }
> > > - goto nomem;
> > > + mutex_unlock(&memcg_oom_mutex);
> > > + if (!oom_kill_called)
> > > + mem_cgroup_out_of_memory(mem_over_limit,
> > > + gfp_mask);
> > > + else /* give a chance to die for other tasks */
> > > + schedule_timeout(1);
> > > + nr_retries = MEM_CGROUP_RECLAIM_RETRIES;
> > > + /* Killed myself ? */
> > > + if (!test_thread_flag(TIF_MEMDIE))
> > > + continue;
> > > + /* For smooth oom-kill of current, return 0 */
> > > + return 0;
> We must call css_put() and reset *memcg to NULL before returning 0.
> Otherwise, following commit_charge will commits the page(i.e. set PCG_USED)
> while we've not charged res_counter.
> (In fact, I saw res_counter underflow warnings(res_counter.c:72).)
>
Ah, ok. I'll do.

Thanks,
-Kame

>
> Thanks,
> Daisuke Nishimura.

2010-02-24 00:12:12

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 14:49:12 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Tue, 23 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > Ouch, I missed to add memcontrol.h to quilt's reflesh set..
> > This is updated one. Anyway, I'd like to wait for the next mmotm.
> > We already have several changes.
> >
>
> I think it would be better to just remove mem_cgroup_out_of_memory() and
> make it go through out_of_memory() by specifying a non-NULL pointer to a
> struct mem_cgroup. We don't need the duplication in code that these two
> functions have and then we can begin to have some consistency with how to
> deal with panic_on_oom.
>
> It would be much better to prefer killing current in pagefault oom
> conditions, as the final patch in my oom killer rewrite does, if it is
> killable. If not, we scan the tasklist and find another suitable
> candidate. If current is bound to a memcg, we pass that to
> select_bad_process() so that we only kill other tasks from the same
> cgroup.
Adding new argument to out_of_memory ?

>
> This allows us to hijack the TIF_MEMDIE bit to detect when there is a
> parallel pagefault oom killing when the oom killer hasn't necessarily been
> invoked to kill a system-wide task (it's simply killing current, by
> default, and giving it access to memory reserves). Then, we can change
> out_of_memory(), which also now handles memcg oom conditions, to always
> scan the tasklist first (including for mempolicy and cpuset constrained
> ooms), check for any candidates that have TIF_MEMDIE, and return
> ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
> from needlessly killing memcg tasks. panic_on_oom would only panic after
> the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
> pagefault ooms are exempt from that sysctl.
>
Sorry, I see your concern but I'd like not to do clean-up and bug-fix at
the same time.

I think clean up after fix is easy in this case.


> Anyway, do you think it would be possible to rebase on mmotm with my oom
> killer rewrite patches?
> They're at
> http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
>

I can wait until your patch are merged if necessary. But it seems there will
not be much confliction.


Thanks,
-Kame

2010-02-24 01:42:52

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Wed, 24 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > I think it would be better to just remove mem_cgroup_out_of_memory() and
> > make it go through out_of_memory() by specifying a non-NULL pointer to a
> > struct mem_cgroup. We don't need the duplication in code that these two
> > functions have and then we can begin to have some consistency with how to
> > deal with panic_on_oom.
> >
> > It would be much better to prefer killing current in pagefault oom
> > conditions, as the final patch in my oom killer rewrite does, if it is
> > killable. If not, we scan the tasklist and find another suitable
> > candidate. If current is bound to a memcg, we pass that to
> > select_bad_process() so that we only kill other tasks from the same
> > cgroup.
> Adding new argument to out_of_memory ?
>

Right, the pointer to pass into select_bad_process() to filter by memcg.

> >
> > This allows us to hijack the TIF_MEMDIE bit to detect when there is a
> > parallel pagefault oom killing when the oom killer hasn't necessarily been
> > invoked to kill a system-wide task (it's simply killing current, by
> > default, and giving it access to memory reserves). Then, we can change
> > out_of_memory(), which also now handles memcg oom conditions, to always
> > scan the tasklist first (including for mempolicy and cpuset constrained
> > ooms), check for any candidates that have TIF_MEMDIE, and return
> > ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
> > from needlessly killing memcg tasks. panic_on_oom would only panic after
> > the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
> > pagefault ooms are exempt from that sysctl.
> >
> Sorry, I see your concern but I'd like not to do clean-up and bug-fix at
> the same time.
>
> I think clean up after fix is easy in this case.
>

If you develop on top of my oom killer rewrite, pagefault ooms already
attempt to kill current first and then defer back to killing another task
if current is unkillable. That means that panic_on_oom must be redefined:
we _must_ now scan the entire tasklist looking for eligible tasks with the
TIF_MEMDIE bit set before panicking in _all_ oom conditions. Otherwise,
it is possible to needlessly panic when the result of a pagefault oom
(killing current) would lead to future memory freeing. The previous
VM_FAULT_OOM behavior before we used the oom killer was to kill current,
there was no consideration given to panic_on_oom for those cases. So
pagefault_out_of_memory() must now try to kill current first and then
leave panic_on_oom to be dealt with in out_of_memory() if the tasklist
scan doesn't show any pagefault oom victims.

2010-02-24 01:52:11

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 17:42:33 -0800 (PST)
David Rientjes <[email protected]> wrote:

> > >
> > > This allows us to hijack the TIF_MEMDIE bit to detect when there is a
> > > parallel pagefault oom killing when the oom killer hasn't necessarily been
> > > invoked to kill a system-wide task (it's simply killing current, by
> > > default, and giving it access to memory reserves). Then, we can change
> > > out_of_memory(), which also now handles memcg oom conditions, to always
> > > scan the tasklist first (including for mempolicy and cpuset constrained
> > > ooms), check for any candidates that have TIF_MEMDIE, and return
> > > ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
> > > from needlessly killing memcg tasks. panic_on_oom would only panic after
> > > the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
> > > pagefault ooms are exempt from that sysctl.
> > >
> > Sorry, I see your concern but I'd like not to do clean-up and bug-fix at
> > the same time.
> >
> > I think clean up after fix is easy in this case.
> >
>
> If you develop on top of my oom killer rewrite, pagefault ooms already
> attempt to kill current first and then defer back to killing another task
> if current is unkillable.

After my fix, page_fault_out_of_memory is never called. (because memcg doesn't
return needless failure.)

Then, that's not point in this thread.

Thanks,
-Kame

2010-02-24 02:26:34

by David Rientjes

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Wed, 24 Feb 2010, KAMEZAWA Hiroyuki wrote:

> > > > This allows us to hijack the TIF_MEMDIE bit to detect when there is a
> > > > parallel pagefault oom killing when the oom killer hasn't necessarily been
> > > > invoked to kill a system-wide task (it's simply killing current, by
> > > > default, and giving it access to memory reserves). Then, we can change
> > > > out_of_memory(), which also now handles memcg oom conditions, to always
> > > > scan the tasklist first (including for mempolicy and cpuset constrained
> > > > ooms), check for any candidates that have TIF_MEMDIE, and return
> > > > ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
> > > > from needlessly killing memcg tasks. panic_on_oom would only panic after
> > > > the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
> > > > pagefault ooms are exempt from that sysctl.
> > > >
> > > Sorry, I see your concern but I'd like not to do clean-up and bug-fix at
> > > the same time.
> > >
> > > I think clean up after fix is easy in this case.
> > >
> >
> > If you develop on top of my oom killer rewrite, pagefault ooms already
> > attempt to kill current first and then defer back to killing another task
> > if current is unkillable.
>
> After my fix, page_fault_out_of_memory is never called. (because memcg doesn't
> return needless failure.)
>

Of course it's called, it's called from the pagefault handler whenever we
return VM_FAULT_OOM. Whenever that happens, we'd needlessly panic the
machine for panic_on_oom if we didn't do the tasklist scan and check for
eligible tasks with TIF_MEMDIE set because it prefers to kill current
first in pagefault conditions without consideration given to the sysctl.
pagefault_out_of_memory() has changed radically with my rewrite, so I'd
encourage you to develop on top of that where I've completely removed
mem_cgroup_oom_called() and memcg->last_oom_jiffies already because
they're nonsense.

My patches are available from
http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite

Thanks.

2010-02-24 02:28:45

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] memcg: page fault oom improvement v2

On Tue, 23 Feb 2010 18:26:17 -0800 (PST)
David Rientjes <[email protected]> wrote:

> On Wed, 24 Feb 2010, KAMEZAWA Hiroyuki wrote:
>
> > > > > This allows us to hijack the TIF_MEMDIE bit to detect when there is a
> > > > > parallel pagefault oom killing when the oom killer hasn't necessarily been
> > > > > invoked to kill a system-wide task (it's simply killing current, by
> > > > > default, and giving it access to memory reserves). Then, we can change
> > > > > out_of_memory(), which also now handles memcg oom conditions, to always
> > > > > scan the tasklist first (including for mempolicy and cpuset constrained
> > > > > ooms), check for any candidates that have TIF_MEMDIE, and return
> > > > > ERR_PTR(-1UL) if so. That catches the parallel pagefault oom conditions
> > > > > from needlessly killing memcg tasks. panic_on_oom would only panic after
> > > > > the tasklist scan has completed and returned != ERR_PTR(-1UL), meaning
> > > > > pagefault ooms are exempt from that sysctl.
> > > > >
> > > > Sorry, I see your concern but I'd like not to do clean-up and bug-fix at
> > > > the same time.
> > > >
> > > > I think clean up after fix is easy in this case.
> > > >
> > >
> > > If you develop on top of my oom killer rewrite, pagefault ooms already
> > > attempt to kill current first and then defer back to killing another task
> > > if current is unkillable.
> >
> > After my fix, page_fault_out_of_memory is never called. (because memcg doesn't
> > return needless failure.)
> >
>
> Of course it's called, it's called from the pagefault handler whenever we
> return VM_FAULT_OOM. Whenever that happens, we'd needlessly panic the
> machine for panic_on_oom if we didn't do the tasklist scan and check for
> eligible tasks with TIF_MEMDIE set because it prefers to kill current
> first in pagefault conditions without consideration given to the sysctl.
> pagefault_out_of_memory() has changed radically with my rewrite, so I'd
> encourage you to develop on top of that where I've completely removed
> mem_cgroup_oom_called() and memcg->last_oom_jiffies already because
> they're nonsense.
>
> My patches are available from
> http://www.kernel.org/pub/linux/kernel/people/rientjes/oom-killer-rewrite
>

I do by myself.

Bye.
-Kame

> Thanks.
>