LinuxLists.cc - [PATCH] memcg: handle panic_on

2010-02-17 06:08:19

Subject: [PATCH] memcg: handle panic_on_oom=always case

Tested on mmotm-Feb11.

Balbir-san, Nishimura-san, I'd like a review from both of you.

==

From: KAMEZAWA Hiroyuki <[email protected]>

Currently, if panic_on_oom=2, the whole system panics even if the OOM happened
in a constrained situation (such as cpuset or mempolicy).
In other words, panic_on_oom=2 means panic_on_oom_always.

However, memcg doesn't check the panic_on_oom flag. This patch adds that check.
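
As a rough illustration of the semantics described above, the following is a small userspace model of the decision, not the kernel code itself. The enum and the function name oom_should_panic() are hypothetical and exist only for this sketch; the behavior for values 0 and 1 follows the description in Documentation/sysctl/vm.txt, and the memcg case for value 2 is what this patch adds.

```c
#include <stdbool.h>

/* Hypothetical labels for where the OOM was detected (illustration only). */
enum oom_scope {
	OOM_SCOPE_SYSTEM,	/* the whole machine is out of memory */
	OOM_SCOPE_CPUSET,	/* allocation restricted by a cpuset */
	OOM_SCOPE_MEMPOLICY,	/* allocation restricted by a mempolicy */
	OOM_SCOPE_MEMCG,	/* a memory cgroup hit its limit */
};

/*
 * Userspace model of the policy, not kernel code.
 * Returns true when the given panic_on_oom value means "panic"
 * for an OOM detected in the given scope.
 */
static bool oom_should_panic(int panic_on_oom, enum oom_scope scope)
{
	if (panic_on_oom == 2)
		return true;	/* "always": every scope, memcg included */
	if (panic_on_oom == 1)
		return scope == OOM_SCOPE_SYSTEM;	/* constrained OOMs kill a task */
	return false;		/* 0 (default) or anything else: kill a task */
}
```

With this model, oom_should_panic(2, OOM_SCOPE_MEMCG) is true while oom_should_panic(1, OOM_SCOPE_MEMCG) is false, which matches the intent that only the "always" mode escalates a memcg OOM to a system panic.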

Some may doubt how useful this is. kdump+panic_on_oom=2 is the
last-resort tool for investigating what happened on an OOM-ed system. If a task is killed,
the system recovers and the used memory is freed, so few hints remain
about what happened. On a mission-critical system, OOM should never happen,
so investigation after an OOM is very important.
panic_on_oom=2+kdump therefore helps avoid the next OOM by providing
precise information via a snapshot.
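
For anyone who wants to confirm that a machine is actually running in this mode before relying on it, here is a minimal userspace sketch. It assumes the setting is exposed as an integer in a procfs-style file (on Linux, /proc/sys/vm/panic_on_oom); the helper names are hypothetical, and the code does not check whether kdump itself is armed.

```c
#include <stdio.h>

/*
 * Read an integer sysctl value from a procfs-style file.
 * Returns the value, or -1 if the file cannot be opened or parsed.
 */
static int read_panic_on_oom(const char *path)
{
	FILE *f = fopen(path, "r");
	int value;

	if (!f)
		return -1;
	if (fscanf(f, "%d", &value) != 1)
		value = -1;
	fclose(f);
	return value;
}

/* Returns 1 only when the "always panic" mode (2) is active, else 0. */
static int oom_snapshot_mode_enabled(const char *path)
{
	return read_panic_on_oom(path) == 2;
}
```

A caller would typically pass "/proc/sys/vm/panic_on_oom" as the path; taking the path as a parameter keeps the helper testable on systems where that file does not exist.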

TODO:
- Since memcg exists to isolate the system's memory usage, an oom-notifier and
freeze_at_oom (or rest_at_oom) should be implemented. Then a management
daemon can do a similar job (as kdump) in a safer way, or take a snapshot
per cgroup.

CC: Balbir Singh <[email protected]>
CC: Daisuke Nishimura <[email protected]>
CC: David Rientjes <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 2 ++
Documentation/sysctl/vm.txt | 5 ++++-
mm/oom_kill.c | 2 ++
3 files changed, 8 insertions(+), 1 deletion(-)

Index: mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.33-Feb11.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
@@ -182,6 +182,8 @@ list.
NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.

+Note2: When panic_on_oom is set to "2", the whole system will panic.
+
2. Locking

The memory controller uses the following hierarchy
Index: mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
===================================================================
--- mmotm-2.6.33-Feb11.orig/Documentation/sysctl/vm.txt
+++ mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
@@ -573,11 +573,14 @@ Because other nodes' memory may be free.
may be not fatal yet.

If this is set to 2, the kernel panics compulsorily even on the
-above-mentioned.
+above-mentioned. Even oom happens under memoyr cgroup, the whole
+system panics.

The default value is 0.
1 and 2 are for failover of clustering. Please select either
according to your policy of failover.
+2 seems too strong but panic_on_oom=2+kdump gives you very strong
+tool to investigate a system which should never cause OOM.

=============================================================

Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -471,6 +471,8 @@ void mem_cgroup_out_of_memory(struct mem
 	unsigned long points = 0;
 	struct task_struct *p;

+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, mem);

2010-02-17 06:53:51

by Daisuke Nishimura

Subject: Re: [PATCH] memcg: handle panic_on_oom=always case

On Wed, 17 Feb 2010 15:04:45 +0900, KAMEZAWA Hiroyuki <[email protected]> wrote:
> tested on mmotm-Feb11.
>
> Balbir-san, Nishimura-san, I want review from both of you.
>
I've read only part of the original patch set so far, but I agree with the direction:
a memcg OOM should panic the system when panic_on_oom==2, and should not panic when panic_on_oom==1.

Reviewed-by: Daisuke Nishimura <[email protected]>

Thanks,
Daisuke Nishimura.

2010-02-17 08:45:37

by Nick Piggin

Subject: Re: [PATCH] memcg: handle panic_on_oom=always case

On Wed, Feb 17, 2010 at 03:04:45PM +0900, KAMEZAWA Hiroyuki wrote:
> tested on mmotm-Feb11.
>
> Balbir-san, Nishimura-san, I want review from both of you.
>
> ==
>
> From: KAMEZAWA Hiroyuki <[email protected]>
>
> Currently, if panic_on_oom=2, the whole system panics even if the OOM happened
> in a constrained situation (such as cpuset or mempolicy).
> In other words, panic_on_oom=2 means panic_on_oom_always.
>
> However, memcg doesn't check the panic_on_oom flag. This patch adds that check.
>
> Some may doubt how useful this is. kdump+panic_on_oom=2 is the
> last-resort tool for investigating what happened on an OOM-ed system. If a task is killed,
> the system recovers and the used memory is freed, so few hints remain
> about what happened. On a mission-critical system, OOM should never happen,
> so investigation after an OOM is very important.
> panic_on_oom=2+kdump therefore helps avoid the next OOM by providing
> precise information via a snapshot.

No, I don't doubt it is useful, and I think this is probably the simplest
and most useful semantic. So thanks for doing this.

I hate to pick nits in a trivial patch but I will anyway:

> TODO:
> - Since memcg exists to isolate the system's memory usage, an oom-notifier and
> freeze_at_oom (or rest_at_oom) should be implemented. Then a management
> daemon can do a similar job (as kdump) in a safer way, or take a snapshot
> per cgroup.
>
> CC: Balbir Singh <[email protected]>
> CC: Daisuke Nishimura <[email protected]>
> CC: David Rientjes <[email protected]>
> Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> ---
> Documentation/cgroups/memory.txt | 2 ++
> Documentation/sysctl/vm.txt | 5 ++++-
> mm/oom_kill.c | 2 ++
> 3 files changed, 8 insertions(+), 1 deletion(-)
>
> Index: mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/Documentation/cgroups/memory.txt
> +++ mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
> @@ -182,6 +182,8 @@ list.
> NOTE: Reclaim does not work for the root cgroup, since we cannot set any
> limits on the root cgroup.
>
> +Note2: When panic_on_oom is set to "2", the whole system will panic.
> +

Maybe:

NOTE2: When panic_on_oom is set to "2", the whole system will panic in
case of an oom event in any cgroup.

> 2. Locking
>
> The memory controller uses the following hierarchy
> Index: mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/Documentation/sysctl/vm.txt
> +++ mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
> @@ -573,11 +573,14 @@ Because other nodes' memory may be free.
> may be not fatal yet.
>
> If this is set to 2, the kernel panics compulsorily even on the
> -above-mentioned.
> +above-mentioned. Even oom happens under memoyr cgroup, the whole
> +system panics.
memory

>
> The default value is 0.
> 1 and 2 are for failover of clustering. Please select either
> according to your policy of failover.
> +2 seems too strong but panic_on_oom=2+kdump gives you very strong
> +tool to investigate a system which should never cause OOM.

I don't think you need to say 2 seems too strong because, as you rightly
say, it has real uses. The hint about using it to investigate OOM
conditions is good though.

>
> =============================================================
>
> Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> ===================================================================
> --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> @@ -471,6 +471,8 @@ void mem_cgroup_out_of_memory(struct mem
> unsigned long points = 0;
> struct task_struct *p;
>
> + if (sysctl_panic_on_oom == 2)
> + panic("out of memory(memcg). panic_on_oom is selected.\n");
> read_lock(&tasklist_lock);
> retry:
> p = select_bad_process(&points, mem);

2010-02-17 08:54:38

by Kamezawa Hiroyuki

Subject: Re: [PATCH] memcg: handle panic_on_oom=always case

On Wed, 17 Feb 2010 19:45:26 +1100
Nick Piggin <[email protected]> wrote:

> On Wed, Feb 17, 2010 at 03:04:45PM +0900, KAMEZAWA Hiroyuki wrote:
> > tested on mmotm-Feb11.
> >
> > Balbir-san, Nishimura-san, I want review from both of you.
> >
> > ==
> >
> > From: KAMEZAWA Hiroyuki <[email protected]>
> >
> > Currently, if panic_on_oom=2, the whole system panics even if the OOM happened
> > in a constrained situation (such as cpuset or mempolicy).
> > In other words, panic_on_oom=2 means panic_on_oom_always.
> >
> > However, memcg doesn't check the panic_on_oom flag. This patch adds that check.
> >
> > Some may doubt how useful this is. kdump+panic_on_oom=2 is the
> > last-resort tool for investigating what happened on an OOM-ed system. If a task is killed,
> > the system recovers and the used memory is freed, so few hints remain
> > about what happened. On a mission-critical system, OOM should never happen,
> > so investigation after an OOM is very important.
> > panic_on_oom=2+kdump therefore helps avoid the next OOM by providing
> > precise information via a snapshot.
>
> No, I don't doubt it is useful, and I think this is probably the simplest
> and most useful semantic. So thanks for doing this.
>
Thank you for the review.

> I hate to pick nits in a trivial patch but I will anyway:
>
>
> > TODO:
> > - Since memcg exists to isolate the system's memory usage, an oom-notifier and
> > freeze_at_oom (or rest_at_oom) should be implemented. Then a management
> > daemon can do a similar job (as kdump) in a safer way, or take a snapshot
> > per cgroup.
> >
> > CC: Balbir Singh <[email protected]>
> > CC: Daisuke Nishimura <[email protected]>
> > CC: David Rientjes <[email protected]>
> > Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
> > ---
> > Documentation/cgroups/memory.txt | 2 ++
> > Documentation/sysctl/vm.txt | 5 ++++-
> > mm/oom_kill.c | 2 ++
> > 3 files changed, 8 insertions(+), 1 deletion(-)
> >
> > Index: mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
> > ===================================================================
> > --- mmotm-2.6.33-Feb11.orig/Documentation/cgroups/memory.txt
> > +++ mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
> > @@ -182,6 +182,8 @@ list.
> > NOTE: Reclaim does not work for the root cgroup, since we cannot set any
> > limits on the root cgroup.
> >
> > +Note2: When panic_on_oom is set to "2", the whole system will panic.
> > +
>
> Maybe:
>
> NOTE2: When panic_on_oom is set to "2", the whole system will panic in
> case of an oom event in any cgroup.
>

ok.

> > 2. Locking
> >
> > The memory controller uses the following hierarchy
> > Index: mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
> > ===================================================================
> > --- mmotm-2.6.33-Feb11.orig/Documentation/sysctl/vm.txt
> > +++ mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
> > @@ -573,11 +573,14 @@ Because other nodes' memory may be free.
> > may be not fatal yet.
> >
> > If this is set to 2, the kernel panics compulsorily even on the
> > -above-mentioned.
> > +above-mentioned. Even oom happens under memoyr cgroup, the whole
> > +system panics.
> memory
>
> >
> > The default value is 0.
> > 1 and 2 are for failover of clustering. Please select either
> > according to your policy of failover.
> > +2 seems too strong but panic_on_oom=2+kdump gives you very strong
> > +tool to investigate a system which should never cause OOM.
>
> I don't think you need to say 2 seems too strong because, as you rightly
> say, it has real uses. The hint about using it to investigate OOM
> conditions is good though.
>

ok. I'll update this patch.

Thanks,
-Kame

> >
> > =============================================================
> >
> > Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
> > ===================================================================
> > --- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
> > +++ mmotm-2.6.33-Feb11/mm/oom_kill.c
> > @@ -471,6 +471,8 @@ void mem_cgroup_out_of_memory(struct mem
> > unsigned long points = 0;
> > struct task_struct *p;
> >
> > + if (sysctl_panic_on_oom == 2)
> > + panic("out of memory(memcg). panic_on_oom is selected.\n");
> > read_lock(&tasklist_lock);
> > retry:
> > p = select_bad_process(&points, mem);
>

2010-02-17 09:07:41

by Kamezawa Hiroyuki

Subject: [PATCH] memcg: handle panic_on_oom=always case v2

Documentation is updated.
==
From: KAMEZAWA Hiroyuki <[email protected]>

Currently, if panic_on_oom=2, the whole system panics even if the OOM happened
in a constrained situation (such as cpuset or mempolicy).
In other words, panic_on_oom=2 means panic_on_oom_always.

However, memcg doesn't check the panic_on_oom flag. This patch adds that check.

BTW, how is it useful?

kdump+panic_on_oom=2 is the last-resort tool for investigating what happened on an OOM-ed
system. When a task is killed, the system recovers and few hints remain
about what happened. On a mission-critical system, OOM should never happen.
panic_on_oom=2+kdump therefore helps avoid the next OOM by providing
precise information via a snapshot.

TODO:
- Since memcg exists to isolate the system's memory usage, an oom-notifier and
freeze_at_oom (or rest_at_oom) should be implemented. Then a management
daemon can do a similar job (as kdump) or take a snapshot per cgroup.

Changelog:
- rewrote documentation.

CC: Balbir Singh <[email protected]>
CC: David Rientjes <[email protected]>
CC: Nick Piggin <[email protected]>
Reviewed-by: Daisuke Nishimura <[email protected]>
Signed-off-by: KAMEZAWA Hiroyuki <[email protected]>
---
Documentation/cgroups/memory.txt | 5 ++++-
Documentation/sysctl/vm.txt | 5 ++++-
mm/oom_kill.c | 2 ++
3 files changed, 10 insertions(+), 2 deletions(-)

Index: mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
===================================================================
--- mmotm-2.6.33-Feb11.orig/Documentation/cgroups/memory.txt
+++ mmotm-2.6.33-Feb11/Documentation/cgroups/memory.txt
@@ -182,6 +182,8 @@ list.
NOTE: Reclaim does not work for the root cgroup, since we cannot set any
limits on the root cgroup.

+Note2: When panic_on_oom is set to "2", the whole system will panic.
+
2. Locking

The memory controller uses the following hierarchy
@@ -379,7 +381,8 @@ The feature can be disabled by
NOTE1: Enabling/disabling will fail if the cgroup already has other
cgroups created below it.

-NOTE2: This feature can be enabled/disabled per subtree.
+NOTE2: When panic_on_oom is set to "2", the whole system will panic in
+case of an oom event in any cgroup.

7. Soft limits

Index: mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
===================================================================
--- mmotm-2.6.33-Feb11.orig/Documentation/sysctl/vm.txt
+++ mmotm-2.6.33-Feb11/Documentation/sysctl/vm.txt
@@ -573,11 +573,14 @@ Because other nodes' memory may be free.
may be not fatal yet.

If this is set to 2, the kernel panics compulsorily even on the
-above-mentioned.
+above-mentioned. Even oom happens under memory cgroup, the whole
+system panics.

The default value is 0.
1 and 2 are for failover of clustering. Please select either
according to your policy of failover.
+panic_on_oom=2+kdump gives you very strong tool to investigate
+why oom happens. You can get snapshot.

=============================================================

Index: mmotm-2.6.33-Feb11/mm/oom_kill.c
===================================================================
--- mmotm-2.6.33-Feb11.orig/mm/oom_kill.c
+++ mmotm-2.6.33-Feb11/mm/oom_kill.c
@@ -471,6 +471,8 @@ void mem_cgroup_out_of_memory(struct mem
 	unsigned long points = 0;
 	struct task_struct *p;

+	if (sysctl_panic_on_oom == 2)
+		panic("out of memory(memcg). panic_on_oom is selected.\n");
 	read_lock(&tasklist_lock);
 retry:
 	p = select_bad_process(&points, mem);