Date: Fri, 19 Jun 2015 17:36:21 +0200
From: Michal Hocko
To: Tetsuo Handa
Cc: linux-mm@kvack.org, rientjes@google.com, hannes@cmpxchg.org,
	tj@kernel.org, akpm@linux-foundation.org, linux-kernel@vger.kernel.org
Subject: Re: [RFC -v2] panic_on_oom_timeout
Message-ID: <20150619153620.GI4913@dhcp22.suse.cz>
References: <20150617121104.GD25056@dhcp22.suse.cz>
 <201506172131.EFE12444.JMLFOSVOHFOtFQ@I-love.SAKURA.ne.jp>
 <20150617125127.GF25056@dhcp22.suse.cz>
 <201506172259.EAI00575.OFQtVFFSHMOLJO@I-love.SAKURA.ne.jp>
 <20150617154159.GJ25056@dhcp22.suse.cz>
 <201506192030.CAH00597.FQVOtFFLOJMHOS@I-love.SAKURA.ne.jp>
In-Reply-To: <201506192030.CAH00597.FQVOtFFLOJMHOS@I-love.SAKURA.ne.jp>
User-Agent: Mutt/1.5.23 (2014-03-12)

On Fri 19-06-15 20:30:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Fixed in my local version. I will post the new version of the patch
> > after we settle with the approach.
> >
> I'd like to see now,

Sure, see below.

[...]
> But oom_victims is incremented via mark_oom_victim() for both global OOM
> and non-global OOM, isn't it? Then, I think that the more difficult part is
> exit_oom_victim().

Yes, you cannot tell which OOM context has killed a particular task. And
it shouldn't even matter, I believe - see below.

> We can hit a sequence like
>
> (1) Task1 in memcg1 hits memcg OOM.
> (2) Task1 gets TIF_MEMDIE and increments oom_victims.
> (3) Task2 hits global OOM.
> (4) Task2 activates 10 seconds of timeout.
> (5) Task2 gets TIF_MEMDIE and increments oom_victims.
> (6) Task2 remained unkillable for 1 second since (5).
> (7) Task2 calls exit_oom_victim().
> (8) Task2 drops TIF_MEMDIE and decrements oom_victims.
> (9) panic_on_oom_timer is not deactivated because oom_victims > 0.
> (10) Task1 remains unkillable for 10 seconds since (2).
> (11) panic_on_oom_timer expires and the system will panic while
>      the system is no longer under global OOM.

I find this highly unlikely, but yes, it is possible. If it really
matters we can check the watermarks on all zones from the timer handler
and bail out if at least one of them is OK.
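A minimal sketch of that check, for illustration only - this is not part
of the patch below; some_zone_above_min_wmark() is a made-up helper and
the classzone_idx/alloc_flags details are simplified:

	/* Hypothetical helper - not in the posted patch. */
	static bool some_zone_above_min_wmark(void)
	{
		struct zone *zone;

		/* Is any populated zone above its min watermark again? */
		for_each_populated_zone(zone)
			if (zone_watermark_ok(zone, 0, min_wmark_pages(zone), 0, 0))
				return true;

		return false;
	}

	static void delayed_panic_on_oom(unsigned long unused)
	{
		/* The global OOM condition was resolved in the meantime. */
		if (some_zone_above_min_wmark())
			return;

		panic("Out of memory: system-wide panic_on_oom is enabled\n");
	}

Note that this would inherit the higher-order caveat mentioned below: an
order-0 watermark check may look fine even though the OOM was triggered
by a larger-order allocation.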
[...]
> (1) Task1 in memcg1 hits memcg OOM.
> (2) Task1 gets TIF_MEMDIE and increments oom_victims.
> (3) Task2 hits system OOM.
> (4) Task2 activates 10 seconds of timeout.
> (5) Task2 gets TIF_MEMDIE and increments oom_victims.
> (6) Task1 remained unkillable for 9 seconds since (2).
> (7) Task1 calls exit_oom_victim().
> (8) Task1 drops TIF_MEMDIE and decrements oom_victims.
> (9) panic_on_oom_timer is deactivated.
> (10) Task3 hits system OOM.
> (11) Task3 again activates 10 seconds of timeout.
> (12) Task2 remains unkillable for 19 seconds since (5).
> (13) panic_on_oom_timer expires and the system will panic, but
>      the expected timeout is 10 seconds while the actual timeout is
>      19 seconds.
>
> if we deactivate panic_on_oom_timer like
>
> void exit_oom_victim(void)
> {
> 	clear_thread_flag(TIF_MEMDIE);
>
> +	del_timer(&panic_on_oom_timer);
> 	if (!atomic_dec_return(&oom_victims))
> 		wake_up_all(&oom_victims_wait);
> }

Yes, I was thinking about this as well, because the primary assumption
of the OOM killer is that the victim will release some memory. And it
doesn't matter whether it was the constrained or the global OOM killer.
So the above looks good at first sight; I am just afraid it is too
relaxed for cases where many tasks are sharing the mm.

> If we want to avoid premature or over-delayed timeout, I think we need to
> update the timeout at exit_oom_victim() by doing something like
>
> void exit_oom_victim(void)
> {
> 	clear_thread_flag(TIF_MEMDIE);
>
> +	/*
> +	 * If the current thread got TIF_MEMDIE due to global OOM, we need to
> +	 * update panic_on_oom_timer to "jiffies till the nearest timeout
> +	 * of all threads which got TIF_MEMDIE due to global OOM" and
> +	 * delete panic_on_oom_timer if "there are no more threads which
> +	 * got TIF_MEMDIE due to global OOM".
> +	 */
> +	if (/* Was I OOM-killed due to global OOM? */) {
> +		mutex_lock(&oom_lock); /* oom_lock needed for avoiding race. */
> +		if (/* Am I the last thread? */)
> +			del_timer(&panic_on_oom_timer);
> +		else
> +			mod_timer(&panic_on_oom_timer,
> +				  /* jiffies of the nearest timeout */);
> +		mutex_unlock(&oom_lock);
> +	}

I think you are missing an important point. The context which has caused
the killing is not important. As mentioned above, even a constrained OOM
kill will relieve the global OOM condition as well. The primary problem
we have is that there is no reliable under_oom() check and we simply try
to approximate it by heuristics which work well enough in most cases. I
admit that oom_victims is not ideal and there might be better
heuristics. As mentioned above, we could check the watermarks on all
zones and cancel the timer if at least one of them allows for an
allocation. But there are downsides to that approach as well, because
the OOM killer could have been triggered by a higher-order allocation
and we might still be under OOM for those requests. There is no simple
answer here, I am afraid. So let's focus on being reasonably good and
simple rather than complex and still not perfect.

> 	if (!atomic_dec_return(&oom_victims))
> 		wake_up_all(&oom_victims_wait);
> }
>
> but we don't have a hint for finding the global OOM victims among all
> TIF_MEMDIE threads, nor for when the nearest timeout among all global
> OOM victims is. Do we need to keep track of each global OOM victim's
> timeout (e.g. "struct task_struct"->memdie_start)?

I do not think this will help anything. It will just lead to a different
set of corner cases. E.g.

1) mark_oom_victim(T1) - memdie_start = jiffies
2) fatal_signal_pending(T2) - memdie_start = jiffies + delta
3) T2 releases memory - no OOM anymore
4) out_of_memory - check_memdie_timeout(T1) - KABOOM

[...]
> Well, do we really need to set TIF_MEMDIE for non-global OOM victims?

As already said, a non-global OOM victim will free some memory as well,
so the global OOM killer shouldn't kill new tasks if there is a chance
that another victim will release memory. TIF_MEMDIE acts as a lock.
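To illustrate, this is roughly how the victim scan enforces that lock
(simplified from mm/oom_kill.c of this era; the remaining checks are
elided):

	enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
			unsigned long totalpages, const nodemask_t *nodemask,
			bool force_kill)
	{
		if (oom_unkillable_task(task, NULL, nodemask))
			return OOM_SCAN_CONTINUE;

		/*
		 * This task already has access to memory reserves and is
		 * being killed. Don't allow any other task to access the
		 * reserves - abort the scan instead of selecting a new
		 * victim, no matter which OOM context set the flag.
		 */
		if (test_tsk_thread_flag(task, TIF_MEMDIE) && !force_kill)
			return OOM_SCAN_ABORT;

		return OOM_SCAN_OK;	/* remaining checks elided */
	}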
> I'm wondering how a {memcg,cpuset,mempolicy} OOM stall can occur,
> because there is enough memory (unless global OOM runs concurrently)
> for any operations (e.g. XFS filesystem's writeback, workqueue) which
> non-global OOM victims might depend on to make forward progress.

Same as in the global case. The victim is uninterruptibly blocked on a
resource/lock/event.

[...]
> By the way, I think we can replace
>
> 	if (!atomic_dec_return(&oom_victims))
>
> with
>
> 	if (atomic_dec_and_test(&oom_victims))
>
> . But this logic puzzles me. The number of threads that are killed by
> the OOM killer can be larger than the value of oom_victims. This means
> that there might be fatal_signal_pending() threads even after
> oom_victims drops to 0. Why is waiting for only TIF_MEMDIE threads at
> oom_killer_disable() considered sufficient?

Please have a look at c32b3cbe0d06 ("oom, PM: make OOM detection in the
freezer path raceless") which imho explains it.
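From memory, the relevant part of that commit looks roughly like this
(simplified, using the oom_lock naming from the current tree): once the
OOM killer is disabled under oom_lock, no new TIF_MEMDIE victims can
show up, and only TIF_MEMDIE holders ever get access to memory reserves,
so waiting for oom_victims to drop to zero is sufficient - the remaining
fatal_signal_pending() threads never had reserve access in the first
place:

	bool oom_killer_disable(void)
	{
		/*
		 * Make sure not to race with an ongoing OOM killer
		 * invocation so that all forthcoming victims are accounted.
		 */
		if (mutex_lock_killable(&oom_lock))
			return false;

		oom_killer_disabled = true;
		mutex_unlock(&oom_lock);

		/* Only TIF_MEMDIE holders are counted in oom_victims. */
		wait_event(oom_victims_wait, !atomic_read(&oom_victims));

		return true;
	}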
---
From a25885b588a245c58ec6e7b38172b6f553f45538 Mon Sep 17 00:00:00 2001
From: Michal Hocko
Date: Tue, 9 Jun 2015 16:15:42 +0200
Subject: [PATCH] oom: implement panic_on_oom_timeout for panic_on_oom=1

The OOM killer is a desperate last-resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% correct and
may result in an unusable or locked-up system.

The panic_on_oom sysctl knob allows setting the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. to reduce the downtime to a
predictable amount of time, or to get a crash dump of the system and
debug the issue post-mortem.

panic_on_oom is, however, a big hammer in many situations when the OOM
condition could be resolved in a reasonable time. So it would be good to
have some middle ground and allow the OOM killer to do its job, but have
a failover when things go wrong and it is not able to make any further
progress for a considerable amount of time.

This patch implements a panic_on_oom_timeout sysctl which is active only
when panic_on_oom=1 and which configures a maximum timeout for the OOM
killer to resolve the OOM situation. If the system is still under OOM
after the timeout expires, it will panic the system. A reasonably chosen
timeout can protect from transient OOM conditions and still provides a
predictable time frame for the OOM condition.

The feature is implemented by a timer which is scheduled when the OOM
condition is declared for the first time (i.e. when no timer is
scheduled yet) in check_panic_on_oom, and it is canceled in
exit_oom_victim once the oom_victims count drops to zero. For this time
period the OOM killer cannot kill new tasks; it only allows exiting or
killed tasks to access memory reserves (and to increase the oom_victims
counter via mark_oom_victim) in order to make progress, so it is
reasonable to consider an elevated oom_victims count an ongoing OOM
condition.

A timeout for panic_on_oom=2 is not currently supported, because it
would make the code more complicated and it is not even clear whether
this combination is useful. panic_on_oom=2 is disputable on its own,
because the system is still usable at the time, so the administrator can
intervene to resolve the OOM situation (e.g. relax NUMA restrictions,
increase the memory limit for the OOM memcg, reboot/panic the system, or
simply selectively kill a task which might be blocking the OOM killer
from making progress).

The log will then contain something like:
[   32.071128] Out of memory: Kill process 2998 (mem_eater) score 42 or sacrifice child
[   32.074022] Killed process 2998 (mem_eater) total-vm:14172kB, anon-rss:628kB, file-rss:4kB
[   32.091849] mem_eater invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[   32.103139] mem_eater cpuset=/ mems_allowed=0
[...]
[   32.354388] Killed process 3001 (mem_eater) total-vm:14172kB, anon-rss:580kB, file-rss:4kB
[   33.088078] panic_on_oom timeout 1s has expired
[   33.089062] Mem-Info:
[   33.089557] active_anon:7991 inactive_anon:8106 isolated_anon:665
[   33.089557]  active_file:183 inactive_file:125 isolated_file:0
[   33.089557]  unevictable:0 dirty:0 writeback:1037 unstable:0
[   33.089557]  slab_reclaimable:2492 slab_unreclaimable:3253
[   33.089557]  mapped:275 shmem:0 pagetables:631 bounce:0
[   33.089557]  free:400 free_pcp:20 free_cma:0
[...]
[   33.092023] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[   33.092023]
[   33.092023] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0-oomtimeout5-00001-g667aff689092 #582
[   33.092023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[   33.092023]  0000000000000101 ffff880007803da8 ffffffff81538b24 ffffffff8108a645
[   33.092023]  ffffffff817ee27d ffff880007803e28 ffffffff81537ac6 ffff880007803e08
[   33.092023]  ffff880000000008 ffff880007803e38 ffff880007803dd8 ffffffff8108a645
[   33.092023] Call Trace:
[   33.092023]  [] dump_stack+0x4f/0x7b
[   33.092023]  [] ? console_unlock+0x337/0x366
[   33.092023]  [] panic+0xbe/0x1eb
[   33.092023]  [] ? console_unlock+0x337/0x366
[   33.092023]  [] ? console_unlock+0x337/0x366
[   33.092023]  [] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[   33.092023]  [] delayed_panic_on_oom+0x35/0x35
[   33.092023]  [] call_timer_fn+0xa7/0x1d0
[   33.092023]  [] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[   33.092023]  [] run_timer_softirq+0x22d/0x2a1
[   33.092023]  [] __do_softirq+0x141/0x322
[   33.092023]  [] irq_exit+0x6f/0xb6

TODO: Documentation update

Signed-off-by: Michal Hocko
---
 include/linux/oom.h |  1 +
 kernel/sysctl.c     |  8 ++++++++
 mm/oom_kill.c       | 44 ++++++++++++++++++++++++++++++++++++++++++--
 3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 061e0ffd3493..6884c8dc37a0 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
 extern int sysctl_oom_dump_tasks;
 extern int sysctl_oom_kill_allocating_task;
 extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
 #endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d6fff89b78db..3ac2e5d0b1e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = {
 		.extra2		= &two,
 	},
 	{
+		.procname	= "panic_on_oom_timeout",
+		.data		= &sysctl_panic_on_oom_timeout,
+		.maxlen		= sizeof(sysctl_panic_on_oom_timeout),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec_minmax,
+		.extra1		= &zero,
+	},
+	{
 		.procname	= "oom_kill_allocating_task",
 		.data		= &sysctl_oom_kill_allocating_task,
 		.maxlen		= sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fb1275e200..fab631f812f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
 #include

 int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
 int sysctl_oom_kill_allocating_task;
 int sysctl_oom_dump_tasks = 1;
@@ -430,6 +431,16 @@ void mark_oom_victim(struct task_struct *tsk)
 	atomic_inc(&oom_victims);
 }

+static void delayed_panic_on_oom(unsigned long unused)
+{
+	/* We are in the irq context so we cannot call dump_header */
+	pr_info("panic_on_oom timeout %ds has expired\n", sysctl_panic_on_oom_timeout);
+	show_mem(SHOW_MEM_FILTER_NODES);
+	panic("Out of memory: system-wide panic_on_oom is enabled\n");
+}
+
+static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0);
+
 /**
  * exit_oom_victim - note the exit of an OOM victim
  */
@@ -437,8 +448,10 @@ void exit_oom_victim(void)
 {
 	clear_thread_flag(TIF_MEMDIE);

-	if (!atomic_dec_return(&oom_victims))
+	if (!atomic_dec_return(&oom_victims)) {
+		del_timer(&panic_on_oom_timer);
 		wake_up_all(&oom_victims_wait);
+	}
 }

 /**
@@ -539,7 +552,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	p = find_lock_task_mm(victim);
 	if (!p) {
 		put_task_struct(victim);
-		return;
+		goto no_oom_victim;
 	} else if (victim != p) {
 		get_task_struct(p);
 		put_task_struct(victim);
@@ -581,6 +594,15 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
 	do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
 	put_task_struct(victim);
+	return;
+no_oom_victim:
+	/*
+	 * If there is no oom victim selected (e.g. we might have raced with
+	 * an exiting task) then cancel the delayed oom panic handler.
+	 */
+	if (!atomic_read(&oom_victims))
+		del_timer(&panic_on_oom_timer);
+
 }

 #undef K
@@ -624,6 +646,24 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
 		if (constraint != CONSTRAINT_NONE)
 			return;
 	}
+
+	if (sysctl_panic_on_oom_timeout) {
+		if (sysctl_panic_on_oom > 1) {
+			pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
+		} else {
+			/*
+			 * Only schedule the delayed panic_on_oom when this is
+			 * the first OOM triggered. oom_lock will protect us
+			 * from races
+			 */
+			if (timer_pending(&panic_on_oom_timer))
+				return;
+
+			mod_timer(&panic_on_oom_timer,
+					jiffies + (sysctl_panic_on_oom_timeout * HZ));
+			return;
+		}
+	}
 	dump_header(NULL, gfp_mask, order, memcg, nodemask);
 	panic("Out of memory: %s panic_on_oom is enabled\n",
 		sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
-- 
2.1.4

-- 
Michal Hocko
SUSE Labs