2015-06-09 17:03:23

by Michal Hocko

Subject: [RFC] panic_on_oom_timeout

Hi,
during the last iteration of the timeout based oom killer discussion
(http://marc.info/?l=linux-mm&m=143351457601723) I proposed to introduce
panic_on_oom_timeout as an extension of panic_on_oom, rather than an oom
timeout which would allow the OOM killer to select another oom victim
and keep doing so until the OOM is resolved or the system panics due to
the depletion of potential oom victims.

My main rationale for going the panic_on_oom_timeout way is that this
approach leads to much more predictable behavior, because the system
gets back to a usable state after a given amount of time plus the
reboot time. On the other hand, if the other approach were chosen, there
would be no guarantee that another victim would be in any better
situation than the original one. In fact there might be many tasks
blocked on a single lock (e.g. i_mutex) and the oom killer doesn't have
any way to find out which task to kill in order to make progress. The
result would be an N*timeout period during which the system is basically
unusable, with N unknown to the admin.

I think it is more appropriate to shut such a system down when such a
corner case is hit rather than struggle for a basically unbounded amount
of time.

Thoughts? An RFC implementing this is below. It is quite trivial and
I've tried to test it a bit. I will add the missing pieces if this looks
like a way to go.

There are obviously places in the oom killer and the page allocator path
which could be improved, but this patch doesn't try to address them. It
just provides a reasonable very last resort for when things go really
wrong.
---
From 35b7cff442326c609cdbb78757ef46e6d0ca0c61 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 9 Jun 2015 16:15:42 +0200
Subject: [RFC] oom: implement panic_on_oom_timeout

The OOM killer is a desperate last-resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% reliable and
may result in an unusable or locked-up system.

The panic_on_oom sysctl knob allows setting the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. reducing the downtime to a predictable
amount of time, or allowing a crash dump of the system to be taken so
the issue can be debugged post-mortem.

panic_on_oom is, however, a big hammer in many situations when the
OOM condition could be resolved in a reasonable time. So it would be
good to have some middle ground and allow the OOM killer to do its job
but have a failover when things go wrong and it is not able to make any
further progress for a considerable amount of time.

This patch implements the panic_on_oom_timeout sysctl which is active
only when panic_on_oom!=0 and which configures a maximum timeout for
the OOM killer to resolve the OOM situation. If the system is still
under OOM after the timeout expires, the system is panicked as per the
panic_on_oom configuration. A reasonably chosen timeout can both absorb
transient OOM conditions and provide a predictable time frame for the
OOM condition.

The feature is implemented as a delayed work which is scheduled when
the OOM condition is declared for the first time (oom_victims is still
zero) in out_of_memory and which is canceled in exit_oom_victim once
the oom_victims count drops back to zero. During this time period the
OOM killer cannot kill new tasks and only allows exiting or killed
tasks to access memory reserves (and increase the oom_victims counter
via mark_oom_victim) in order to make progress, so it is reasonable to
consider the elevated oom_victims count an ongoing OOM condition.
The log will then contain something like:
[ 904.144494] run_test.sh invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[ 904.145854] run_test.sh cpuset=/ mems_allowed=0
[ 904.146651] CPU: 0 PID: 5244 Comm: run_test.sh Not tainted 4.0.0-oomtimeout2-00001-g3b4737913602 #575
[...]
[ 905.147523] panic_on_oom timeout 1s has expired
[ 905.150049] kworker/0:1 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
[ 905.154572] kworker/0:1 cpuset=/ mems_allowed=0
[...]
[ 905.503378] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled

TODO: Documentation update
TODO: check all potential paths which might skip mark_oom_victim
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/oom.h | 1 +
kernel/sysctl.c | 8 ++++++
mm/oom_kill.c | 75 ++++++++++++++++++++++++++++++++++++++++++++++++++---
3 files changed, 80 insertions(+), 4 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 061e0ffd3493..6884c8dc37a0 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d6fff89b78db..3ac2e5d0b1e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = {
.extra2 = &two,
},
{
+ .procname = "panic_on_oom_timeout",
+ .data = &sysctl_panic_on_oom_timeout,
+ .maxlen = sizeof(sysctl_panic_on_oom_timeout),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "oom_kill_allocating_task",
.data = &sysctl_oom_kill_allocating_task,
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fb1275e200..9b1ac69caa24 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -35,11 +35,14 @@
#include <linux/freezer.h>
#include <linux/ftrace.h>
#include <linux/ratelimit.h>
+#include <linux/nodemask.h>
+#include <linux/lockdep.h>

#define CREATE_TRACE_POINTS
#include <trace/events/oom.h>

int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

@@ -430,6 +433,9 @@ void mark_oom_victim(struct task_struct *tsk)
atomic_inc(&oom_victims);
}

+static void delayed_panic_on_oom(struct work_struct *w);
+static DECLARE_DELAYED_WORK(panic_on_oom_work, delayed_panic_on_oom);
+
/**
* exit_oom_victim - note the exit of an OOM victim
*/
@@ -437,8 +443,10 @@ void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

- if (!atomic_dec_return(&oom_victims))
+ if (!atomic_dec_return(&oom_victims)) {
+ cancel_delayed_work(&panic_on_oom_work);
wake_up_all(&oom_victims_wait);
+ }
}

/**
@@ -538,6 +546,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

p = find_lock_task_mm(victim);
if (!p) {
+ /* TODO cancel delayed_panic_on_oom */
put_task_struct(victim);
return;
} else if (victim != p) {
@@ -606,6 +615,62 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
nodemask, message);
}

+static void panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+ int order, const nodemask_t *nodemask,
+ struct mem_cgroup *memcg)
+{
+ dump_header(NULL, gfp_mask, order, memcg, nodemask);
+ panic("Out of memory: %s panic_on_oom is enabled\n",
+ sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
+}
+
+static struct oom_ctx {
+ enum oom_constraint constraint;
+ gfp_t gfp_mask;
+ int order;
+ nodemask_t nodemask;
+ struct mem_cgroup *memcg;
+ int timeout;
+} oom_ctx;
+
+static void delayed_panic_on_oom(struct work_struct *w)
+{
+ pr_info("panic_on_oom timeout %ds has expired\n", oom_ctx.timeout);
+ panic_on_oom(oom_ctx.constraint, oom_ctx.gfp_mask, oom_ctx.order,
+ &oom_ctx.nodemask, oom_ctx.memcg);
+}
+
+void schedule_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
+ int order, const nodemask_t *nodemask,
+ struct mem_cgroup *memcg, int timeout)
+{
+ lockdep_assert_held(&oom_lock);
+
+ /*
+ * Only schedule the delayed panic_on_oom when this is the first OOM
+ * triggered. oom_lock will protect us from races
+ */
+ if (atomic_read(&oom_victims))
+ return;
+
+ oom_ctx.constraint = constraint;
+ oom_ctx.gfp_mask = gfp_mask;
+ oom_ctx.order = order;
+ if (nodemask)
+ oom_ctx.nodemask = *nodemask;
+ else
+ memset(&oom_ctx.nodemask, 0, sizeof(oom_ctx.nodemask));
+
+ /*
+ * The killed task should pin the memcg and the delayed work either
+ * expires or is canceled before the victim exits.
+ */
+ oom_ctx.memcg = memcg;
+ oom_ctx.timeout = timeout;
+
+ schedule_delayed_work(&panic_on_oom_work, timeout*HZ);
+}
+
/*
* Determines whether the kernel must panic because of the panic_on_oom sysctl.
*/
@@ -624,9 +689,11 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
if (constraint != CONSTRAINT_NONE)
return;
}
- dump_header(NULL, gfp_mask, order, memcg, nodemask);
- panic("Out of memory: %s panic_on_oom is enabled\n",
- sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
+ if (!sysctl_panic_on_oom_timeout)
+ panic_on_oom(constraint, gfp_mask, order, nodemask, memcg);
+ else
+ schedule_panic_on_oom(constraint, gfp_mask, order, nodemask,
+ memcg, sysctl_panic_on_oom_timeout);
}

static BLOCKING_NOTIFIER_HEAD(oom_notify_list);
--
2.1.4

--
Michal Hocko
SUSE Labs


2015-06-10 12:21:36

by Tetsuo Handa

Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote:
> Hi,
> during the last iteration of the timeout based oom killer discussion
> (http://marc.info/?l=linux-mm&m=143351457601723) I've proposed to
> introduce panic_on_oom_timeout as an extension to panic_on_oom rather
> than oom timeout which would allow OOM killer to select another oom
> victim and do that until the OOM is resolved or the system panics due to
> potential oom victims depletion.

I welcome the timeout, but I have several questions about implementation.

>
> My main rationale for going panic_on_oom_timeout way is that this
> approach will lead to much more predictable behavior because the system
> will get to a usable state after given amount of time + reboot time.
> On the other hand, if the other approach was chosen then there is no
> guarantee that another victim would be in any better situation than the
> original one. In fact there might be many tasks blocked on a single lock
> (e.g. i_mutex) and the oom killer doesn't have any way to find out which
> task to kill in order to make the progress. The result would be
> N*timeout time period when the system is basically unusable and the N is
> unknown to the admin.

My version ( http://marc.info/?l=linux-mm&m=143239200805478 ) implemented
two timeouts: /proc/sys/vm/memdie_task_skip_secs for choosing the next OOM
victim and /proc/sys/vm/memdie_task_panic_secs for triggering the panic.
Therefore, the result is no longer an N*timeout time period.

>
> I think that it is more appropriate to shut such a system down when such
> a corner case is hit rather than struggle for basically unbounded amount
> of time.

Ditto. Not unbounded amount of time.

>
> Thoughts? An RFC implementing this is below. It is quite trivial and
> I've tried to test it a bit. I will add the missing pieces if this looks
> like a way to go.
>
> There are obviously places in the oom killer and the page allocator path
> which could be improved and this patch doesn't try to put them aside. It
> is just providing a reasonable the very last resort when things go
> really wrong.
> ---
> From 35b7cff442326c609cdbb78757ef46e6d0ca0c61 Mon Sep 17 00:00:00 2001
> From: Michal Hocko <[email protected]>
> Date: Tue, 9 Jun 2015 16:15:42 +0200
> Subject: [RFC] oom: implement panic_on_oom_timeout
>
> OOM killer is a desparate last resort reclaim attempt to free some
> memory. It is based on heuristics which will never be 100% and may
> result in an unusable or a locked up system.
>
> panic_on_oom sysctl knob allows to set the OOM policy to panic the
> system instead of trying to resolve the OOM condition. This might be
> useful for several reasons - e.g. reduce the downtime to a predictable
> amount of time, allow to get a crash dump of the system and debug the
> issue post-mortem.
>
> panic_on_oom is, however, a big hammer in many situations when the
> OOM condition could be resolved in a reasonable time. So it would be
> good to have some middle ground and allow the OOM killer to do its job
> but have a failover when things go wrong and it is not able to make any
> further progress for a considerable amount of time.
>
> This patch implements panic_on_oom_timeout sysctl which is active
> only when panic_on_oom!=0 and it configures a maximum timeout for
> the OOM killer to resolve the OOM situation. If the system is still
> under OOM after the timeout expires it will panic the system as per
> panic_on_oom configuration. A reasonably chosen timeout can protect from
> both temporal OOM conditions and allows to have a predictable time frame
> for the OOM condition.

Since your version uses oom_ctx as a global context (it acts as a lock
because it is only assigned when atomic_read(&oom_victims) == 0) without
holding a refcount, it cannot safely handle an OOM race like:

(1) p1 in memcg1 calls out_of_memory().
(2) memcg1 is copied to oom_ctx.memcg and 5 seconds of timeout starts.
(3) mark_oom_victim(p1) is called.
(4) p1 takes 3 seconds for some reason.
(5) p2 in memcg2 calls out_of_memory().
(6) mark_oom_victim(p2) is called.
(7) p1 calls unmark_oom_victim().
(8) all threads in memcg1 exits and memcg1 is released.
(9) p2 takes 2 seconds for some reason.
(10) The 5 second timeout expires even though each individual delay was
less than 5 seconds!?
(11) panic_on_oom tries to dereference oom_ctx.memcg, which is the already
released memcg1, resulting in an oops (see the pinning sketch below).
But panic() will not be called if panic_on_oops == 0 because the
workqueue callback runs in a sleepable context!?
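
For illustration only (not code from either patch), a minimal sketch of
the kind of pinning your version would need, assuming oom_kill.c can take
a css reference on the memcg; the helper names below are made up:

/*
 * Hypothetical helpers: pin the memcg before publishing it in oom_ctx so
 * that delayed_panic_on_oom() cannot dereference an already freed memcg
 * (step 11 above).  They would be called from schedule_panic_on_oom()
 * and from the cancel path in exit_oom_victim() respectively.
 */
static void oom_ctx_set_memcg(struct mem_cgroup *memcg)
{
	if (memcg)
		css_get(&memcg->css);	/* keep it alive for the delayed work */
	oom_ctx.memcg = memcg;
}

static void oom_ctx_clear_memcg(void)
{
	if (oom_ctx.memcg)
		css_put(&oom_ctx.memcg->css);	/* pairs with oom_ctx_set_memcg() */
	oom_ctx.memcg = NULL;
}

Even with such pinning, steps (1)-(10) above still show a single global
timeout spanning two unrelated OOM domains; only the use-after-free in
(11) would be avoided.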

Since my version uses a per "struct task_struct" variable (memdie_start),
the 5 second timeout is checked for each memory cgroup individually. It
can avoid unnecessary panic() calls if nobody needs to call
out_of_memory() again (probably because somebody volunteered memory) when
the OOM victim cannot be terminated for some reason. If we want a
distinction between "the entire system is under OOM" and "some memory
cgroup is under OOM" because the former is urgent but the latter is less
urgent, it can be modified to allow different timeout periods for
system-wide OOM and cgroup OOM. Finally, it can give a hint about "in
what sequence threads got stuck" and "which thread took 5 seconds" when
analyzing a vmcore.

>
> The feature is implemented as a delayed work which is scheduled when
> the OOM condition is declared for the first time (oom_victims is still
> zero) in out_of_memory and it is canceled in exit_oom_victim after
> the oom_victims count drops down to zero. For this time period OOM
> killer cannot kill new tasks and it only allows exiting or killed
> tasks to access memory reserves (and increase oom_victims counter via
> mark_oom_victim) in order to make a progress so it is reasonable to
> consider the elevated oom_victims count as an ongoing OOM condition

By the way, what guarantees that panic_on_oom_work is executed under the
OOM condition? The moom_work used by SysRq-f sometimes cannot be executed
because some work which is queued ahead of moom_work stalls for an
unbounded amount of time looping inside the memory allocator. Therefore,
my version used DEFINE_TIMER() rather than DECLARE_DELAYED_WORK() in
order to make sure that the callback is called as soon as the timeout
expires.
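
Not the actual code from my patch, just a minimal sketch of the timer
based variant, using the DEFINE_TIMER()/mod_timer() API as it looks in
kernels of this era, to show that the callback does not depend on a
congested workqueue:

static void panic_on_oom_timer_fn(unsigned long unused)
{
	/* runs from (soft)irq context, independent of any worker pool */
	panic("Out of memory: panic_on_oom timeout expired\n");
}

static DEFINE_TIMER(panic_on_oom_timer, panic_on_oom_timer_fn, 0, 0);

static void start_panic_on_oom_timer(int timeout)
{
	mod_timer(&panic_on_oom_timer, jiffies + timeout * HZ);
}

static void stop_panic_on_oom_timer(void)
{
	del_timer(&panic_on_oom_timer);
}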

>
> The log will then contain something like:
> [ 904.144494] run_test.sh invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> [ 904.145854] run_test.sh cpuset=/ mems_allowed=0
> [ 904.146651] CPU: 0 PID: 5244 Comm: run_test.sh Not tainted 4.0.0-oomtimeout2-00001-g3b4737913602 #575
> [...]
> [ 905.147523] panic_on_oom timeout 1s has expired
> [ 905.150049] kworker/0:1 invoked oom-killer: gfp_mask=0x280da, order=0, oom_score_adj=0
> [ 905.154572] kworker/0:1 cpuset=/ mems_allowed=0
> [...]
> [ 905.503378] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
>
> TODO: Documentation update
> TODO: check all potential paths which might skip mark_oom_victim
> Signed-off-by: Michal Hocko <[email protected]>

2015-06-10 14:28:15

by Michal Hocko

Subject: Re: [RFC] panic_on_oom_timeout

On Wed 10-06-15 21:20:58, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > This patch implements panic_on_oom_timeout sysctl which is active
> > only when panic_on_oom!=0 and it configures a maximum timeout for
> > the OOM killer to resolve the OOM situation. If the system is still
> > under OOM after the timeout expires it will panic the system as per
> > panic_on_oom configuration. A reasonably chosen timeout can protect from
> > both temporal OOM conditions and allows to have a predictable time frame
> > for the OOM condition.
>
> Since your version uses the oom_ctx as a global lock (it acts as a lock
> because it is assigned when atomic_read(&oom_victims) == 0) without
> holding a refcount, you cannot safely handle OOM race like
>
> (1) p1 in memcg1 calls out_of_memory().
> (2) memcg1 is copied to oom_ctx.memcg and 5 seconds of timeout starts.
> (3) mark_oom_victim(p1) is called.
> (4) p1 takes 3 seconds for some reason.
> (5) p2 in memcg2 calls out_of_memory().
> (6) mark_oom_victim(p2) is called.
> (7) p1 calls unmark_oom_victim().
> (8) all threads in memcg1 exits and memcg1 is released.
> (9) p2 takes 2 seconds for some reason.
> (10) 5 seconds of timeout expires despite individual delay was less than
> 5 seconds!?

Yes, this is certainly possible, and a similar thing would happen with
mempolicy/cpusets. To be honest, I haven't considered panic_on_oom=2 much
for this RFC (I am sorry, I should have mentioned that in the changelog
explicitly). The semantic of such a timeout is not clear to me. OOM
domains might intersect and it is not clear which domain will benefit
from killing a particular task. Say Node1 and Node2 are in the OOM
nodemask while a task had memory only on Node2, so Node1 stays OOM.
Should we keep ticking the timer for Node1? Sure, we can optimistically
cancel that timer and hope some other allocation would trigger OOM on
Node1 again, but wouldn't that break the semantic of the knob?

Memcg case would be easier to handle because we have the memcg context
which can hold the timer and memcg_oom_recover would be a natural place
to stop the delayed panic for the affected hierarchy.
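
Just a sketch (no such code has been posted) of what anchoring the
delayed panic in the memcg itself could look like; "oom_panic_work" is an
imagined member of struct mem_cgroup, not an existing field:

static void memcg_schedule_delayed_panic(struct mem_cgroup *memcg,
					 int timeout)
{
	/* one pending panic per OOMing hierarchy */
	schedule_delayed_work(&memcg->oom_panic_work, timeout * HZ);
}

static void memcg_cancel_delayed_panic(struct mem_cgroup *memcg)
{
	/* called from memcg_oom_recover() once the hierarchy has recovered */
	cancel_delayed_work(&memcg->oom_panic_work);
}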

Btw. panic_on_oom=2 is kind of weird and I am not sure it was well
thought through when introduced. 2b744c01a54f ("mm: fix handling of
panic_on_oom when cpusets are in use") is far from specific about
usecases. It is hardly a failover feature because the system as a whole
is not OOM and the administrator still has a chance to perform steps to
resolve the potential lockup or thrashing from the global context (e.g.
by relaxing restrictions or even rebooting cleanly).

That being said, I recognize this is a problem and it makes the timeout
implementation more complicated. The question is do we actually care or
should we start simple and ignore the timeout for panic_on_oom=2 and
implement it after somebody shows up with a reasonable usecase for this
conf+timeout?

> (11) panic_on_oom tries to dereference oom_ctx.memcg which is already
> released memcg1, resulting in oops.

Yes this is obviously buggy. Thanks for pointing this out.

> But panic() will not be called
> if panic_on_oops == 0 because workqueue callback is a sleepable
> context!?
>
> Since my version uses per a "struct task_struct" variable (memdie_start),
> 5 seconds of timeout is checked for individual memory cgroup. It can avoid
> unnecessary panic() calls if nobody needs to call out_of_memory() again
> (probably because somebody volunteered memory) when the OOM victim cannot
> be terminated for some reason. If we want distinction between "the entire
> system is under OOM" and "some memory cgroup is under OOM" because the
> former is urgent but the latter is less urgent, it can be modified to
> allow different timeout period for system-wide OOM and cgroup OOM.
> Finally, it can give a hint for "in what sequence threads got stuck" and
> "which thread did take 5 seconds" when analyzing vmcore.

I will have a look at how you have implemented that, but separate timeouts
sound like major over-engineering. Also note that global vs. memcg OOM
is not a sufficient distinction because there are other oom domains, as
mentioned above.

> > The feature is implemented as a delayed work which is scheduled when
> > the OOM condition is declared for the first time (oom_victims is still
> > zero) in out_of_memory and it is canceled in exit_oom_victim after
> > the oom_victims count drops down to zero. For this time period OOM
> > killer cannot kill new tasks and it only allows exiting or killed
> > tasks to access memory reserves (and increase oom_victims counter via
> > mark_oom_victim) in order to make a progress so it is reasonable to
> > consider the elevated oom_victims count as an ongoing OOM condition
>
> By the way, what guarantees that the panic_on_oom_work is executed under
> OOM condition?

Tejun would be much better at explaining this, but my understanding of the
delayed workqueue code is that it doesn't depend on memory allocations.

> The moom_work used by SysRq-f sometimes cannot be executed
> because some work which is processed before the moom_work is processed is
> stalled for unbounded amount of time due to looping inside the memory
> allocator.

Wouldn't the wq code pick up another worker thread to execute the work?
There is also a rescuer thread as the last resort AFAIR.

> Therefore, my version used DEFINE_TIMER() than
> DECLARE_DELAYED_WORK() in order to make sure that the callback shall be
> called as soon as timeout expires.

I do not care that much whether it is timer or delayed work which would
be used.

Thanks!
--
Michal Hocko
SUSE Labs

2015-06-10 16:02:42

by Michal Hocko

Subject: Re: [RFC] panic_on_oom_timeout

On Wed 10-06-15 16:28:01, Michal Hocko wrote:
> On Wed 10-06-15 21:20:58, Tetsuo Handa wrote:
[...]
> > Since my version uses per a "struct task_struct" variable (memdie_start),
> > 5 seconds of timeout is checked for individual memory cgroup. It can avoid
> > unnecessary panic() calls if nobody needs to call out_of_memory() again
> > (probably because somebody volunteered memory) when the OOM victim cannot
> > be terminated for some reason. If we want distinction between "the entire
> > system is under OOM" and "some memory cgroup is under OOM" because the
> > former is urgent but the latter is less urgent, it can be modified to
> > allow different timeout period for system-wide OOM and cgroup OOM.
> > Finally, it can give a hint for "in what sequence threads got stuck" and
> > "which thread did take 5 seconds" when analyzing vmcore.
>
> I will have a look how you have implemented that but separate timeouts
> sound like a major over engineering. Also note that global vs. memcg OOM
> is not sufficient because there are other oom domains as mentioned above.

Your patch is doing way too many things at once :/ So let me just focus
on the "panic if a task is stuck with TIF_MEMDIE for too long" part. It
looks like an alternative to the approach I've chosen. It doesn't consider
the allocation restriction, so a locked up cpuset/NUMA node might panic
the whole system, which doesn't sound like a good idea, but that is easily
fixable. Could you tear just this part out and repost it so that we can
compare the two approaches?

panic_on_oom=2 would still be weird because some nodes might stay in an
OOM condition without triggering the panic, but maybe this is acceptable.

Thanks!
--
Michal Hocko
SUSE Labs

2015-06-11 13:13:23

by Tetsuo Handa

Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote:
> > > The feature is implemented as a delayed work which is scheduled when
> > > the OOM condition is declared for the first time (oom_victims is still
> > > zero) in out_of_memory and it is canceled in exit_oom_victim after
> > > the oom_victims count drops down to zero. For this time period OOM
> > > killer cannot kill new tasks and it only allows exiting or killed
> > > tasks to access memory reserves (and increase oom_victims counter via
> > > mark_oom_victim) in order to make a progress so it is reasonable to
> > > consider the elevated oom_victims count as an ongoing OOM condition
> >
> > By the way, what guarantees that the panic_on_oom_work is executed under
> > OOM condition?
>
> Tejun would be much better to explain this but my understanding of the
> delayed workqueue code is that it doesn't depend on memory allocations.
>

Thanks to Tejun's commit 3494fc30846d ("workqueue: dump workqueues on
sysrq-t") we can confirm that moom_work stays pending on the
"workqueue events: flags=0x0" workqueue (shown below).

> > The moom_work used by SysRq-f sometimes cannot be executed
> > because some work which is processed before the moom_work is processed is
> > stalled for unbounded amount of time due to looping inside the memory
> > allocator.
>
> Wouldn't wq code pick up another worker thread to execute the work.
> There is also a rescuer thread as the last resort AFAIR.
>

Below is an example of moom_work lockup in v4.1-rc7 from
http://I-love.SAKURA.ne.jp/tmp/serial-20150611.txt.xz

----------
[ 171.710406] sysrq: SysRq : Manual OOM execution
[ 171.720193] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[ 171.722699] kworker/2:9 cpuset=/ mems_allowed=0
[ 171.724603] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
[ 171.726817] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 171.729727] Workqueue: events moom_callback
(...snipped...)
[ 258.302016] sysrq: SysRq : Manual OOM execution
[ 258.311360] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
[ 258.313340] kworker/2:9 cpuset=/ mems_allowed=0
[ 258.314809] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
[ 258.316579] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
[ 258.319083] Workqueue: events moom_callback
(...snipped...)
[ 301.362635] Showing busy workqueues and worker pools:
[ 301.364421] workqueue events: flags=0x0
[ 301.365988] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[ 301.368753] pending: vmstat_shepherd, vmstat_update, fb_deferred_io_work, vmpressure_work_fn
[ 301.371402] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=3/256
[ 301.373460] pending: e1000_watchdog [e1000], vmstat_update, push_to_pool
[ 301.375757] workqueue events_power_efficient: flags=0x80
[ 301.377563] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 301.379625] pending: fb_flashcursor
[ 301.381236] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 301.383271] pending: neigh_periodic_work
[ 301.384951] workqueue events_freezable_power_: flags=0x84
[ 301.386792] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 301.388880] in-flight: 62:disk_events_workfn
[ 301.390634] workqueue writeback: flags=0x4e
[ 301.392200] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 301.394137] in-flight: 358:bdi_writeback_workfn bdi_writeback_workfn
[ 301.396272] workqueue xfs-data/sda1: flags=0xc
[ 301.397903] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=10/256
[ 301.400013] in-flight: 7733:xfs_end_io, 26:xfs_end_io, 7729:xfs_end_io, 2737:xfs_end_io, 7731:xfs_end_io, 423:xfs_end_io, 7734:xfs_end_io, 7732:xfs_end_io, 7735:xfs_end_io, 65:xfs_end_io
[ 301.405064] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=8/256 MAYDAY
[ 301.407320] in-flight: 134:xfs_end_io, 10998:xfs_end_io, 11007:xfs_end_io, 20:xfs_end_io, 448:xfs_end_io, 11005:xfs_end_io, 232:xfs_end_io, 10997:xfs_end_io
[ 301.411922] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=56/256 MAYDAY
[ 301.414122] in-flight: 10999:xfs_end_io, 11004:xfs_end_io, 11018:xfs_end_io, 11025:xfs_end_io, 11001:xfs_end_io, 11020:xfs_end_io, 7728:xfs_end_io, 7730:xfs_end_io, 11011:xfs_end_io, 215:xfs_end_io, 11023:xfs_end_io, 11003:xfs_end_io, 11024:xfs_end_io, 10995:xfs_end_io, 11000:xfs_end_io, 11017:xfs_end_io, 11010:xfs_end_io, 43:xfs_end_io, 11015:xfs_end_io, 11009:xfs_end_io, 10996:xfs_end_io, 14:xfs_end_io, 11026:xfs_end_io, 371(RESCUER):xfs_end_io xfs_end_io xfs_end_io, 11006:xfs_end_io, 11014:xfs_end_io, 11022:xfs_end_io, 11002:xfs_end_io, 11019:xfs_end_io, 11008:xfs_end_io, 11012:xfs_end_io, 11021:xfs_end_io
[ 301.428743] pending: xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io
[ 301.436022] workqueue xfs-cil/sda1: flags=0xc
[ 301.437850] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 301.440083] pending: xlog_cil_push_work BAR(5327)
[ 301.442106] workqueue xfs-log/sda1: flags=0x14
[ 301.444250] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[ 301.446479] in-flight: 5327:xfs_log_worker
[ 301.448327] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=4 idle: 173 64 4
[ 301.450865] pool 1: cpus=0 node=0 flags=0x0 nice=-20 workers=3 idle: 2016 5
[ 301.453311] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=32 manager: 11027
[ 301.455735] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=10 manager: 11013 idle: 11016
[ 301.458373] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=13 idle: 7736 7738 7737
[ 301.460919] pool 16: cpus=0-7 flags=0x4 nice=0 workers=32 idle: 357 356 355 354 353 352 351 350 349 348 347 346 345 344 343 342 341 340 339 338 337 336 335 334 263 269 264 60 6 360 359
[ 417.532993] sysrq: SysRq : Manual OOM execution
[ 437.388767] sysrq: SysRq : Manual OOM execution
[ 455.257690] sysrq: SysRq : Manual OOM execution
[ 458.381422] sysrq: SysRq : Manual OOM execution
[ 463.411448] sysrq: SysRq : Show State
(...snipped...)
[ 464.441490] kswapd0 D 0000000000000002 0 45 2 0x00000000
[ 464.444050] ffff88007cdecf50 ffffea0000118a00 ffff880078798000 ffff88007fc8eb00
[ 464.446114] ffff88007fc8eb00 0000000100027b48 ffff88007fac39a0 0000000000000017
[ 464.448212] ffffffff8161d37a ffff880078797a68 ffffffff8161fa83 0000000000000005
[ 464.450265] Call Trace:
[ 464.451304] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 464.452928] [<ffffffff8161fa83>] ? schedule_timeout+0x113/0x1b0
[ 464.454609] [<ffffffff810c2b10>] ? migrate_timer_list+0x60/0x60
[ 464.456297] [<ffffffff8161ca89>] ? io_schedule_timeout+0x99/0x100
[ 464.458031] [<ffffffff8112df99>] ? congestion_wait+0x79/0xd0
[ 464.459670] [<ffffffff810a31c0>] ? wait_woken+0x80/0x80
[ 464.461224] [<ffffffff8112396b>] ? shrink_inactive_list+0x4bb/0x530
[ 464.463011] [<ffffffff811243a1>] ? shrink_lruvec+0x571/0x780
[ 464.464636] [<ffffffff811246a2>] ? shrink_zone+0xf2/0x300
[ 464.466230] [<ffffffff8112572d>] ? kswapd+0x50d/0x940
[ 464.467752] [<ffffffff81125220>] ? mem_cgroup_shrink_node_zone+0xd0/0xd0
[ 464.469565] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 464.471053] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 464.472810] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 464.474395] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 466.620387] xfs-data/sda1 D ffff88007fc56700 0 371 2 0x00000000
[ 466.622286] Workqueue: xfs-data/sda1 xfs_end_io
[ 466.623683] ffff8800775b69c0 ffff88007fcd6700 ffff880035448000 ffff8800775b69c0
[ 466.625720] ffff88007b582ca8 ffffffff00000000 fffffffe00000001 0000000000000002
[ 466.628214] ffffffff8161d37a ffff88007b582c90 ffffffff8161f79d ffff88007fbf1f00
[ 466.630233] Call Trace:
[ 466.631241] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 466.632745] [<ffffffff8161f79d>] ? rwsem_down_write_failed+0x20d/0x3c0
[ 466.634520] [<ffffffff8130ce63>] ? call_rwsem_down_write_failed+0x13/0x20
[ 466.636387] [<ffffffff8161ef64>] ? down_write+0x24/0x40
[ 466.637912] [<ffffffff8122bb50>] ? xfs_setfilesize+0x20/0xa0
[ 466.639506] [<ffffffff8122bcbc>] ? xfs_end_io+0x5c/0x90
[ 466.641088] [<ffffffff8107ef51>] ? process_one_work+0x131/0x350
[ 466.642737] [<ffffffff8107f834>] ? rescuer_thread+0x194/0x3d0
[ 466.644360] [<ffffffff8107f6a0>] ? worker_thread+0x530/0x530
[ 466.646023] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 466.647516] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 466.649269] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 466.650848] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 472.440941] Showing busy workqueues and worker pools:
[ 472.442719] workqueue events: flags=0x0
[ 472.444290] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=6/256
[ 472.446497] pending: vmstat_update, console_callback, flush_to_ldisc, moom_callback, sysrq_reinject_alt_sysrq, push_to_pool
[ 472.449749] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[ 472.451860] pending: vmstat_shepherd, vmstat_update, fb_deferred_io_work, vmpressure_work_fn
[ 472.454529] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 472.456602] pending: vmstat_update, e1000_watchdog [e1000]
[ 472.458743] workqueue events_freezable: flags=0x4
[ 472.460456] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 472.462526] pending: vmballoon_work [vmw_balloon]
[ 472.464896] workqueue events_power_efficient: flags=0x80
[ 472.466700] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 472.468884] pending: fb_flashcursor
[ 472.470482] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 472.472605] pending: neigh_periodic_work
[ 472.474316] workqueue events_freezable_power_: flags=0x84
[ 472.476126] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 472.478274] in-flight: 62:disk_events_workfn
[ 472.480016] workqueue writeback: flags=0x4e
[ 472.481679] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 472.483606] in-flight: 358:bdi_writeback_workfn bdi_writeback_workfn
[ 472.485733] workqueue xfs-data/sda1: flags=0xc
[ 472.487382] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=10/256
[ 472.489442] in-flight: 7733:xfs_end_io, 26:xfs_end_io, 7729:xfs_end_io, 2737:xfs_end_io, 7731:xfs_end_io, 423:xfs_end_io, 7734:xfs_end_io, 7732:xfs_end_io, 7735:xfs_end_io, 65:xfs_end_io
[ 472.494498] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=8/256 MAYDAY
[ 472.496662] in-flight: 134:xfs_end_io, 10998:xfs_end_io, 11007:xfs_end_io, 20:xfs_end_io, 448:xfs_end_io, 11005:xfs_end_io, 232:xfs_end_io, 10997:xfs_end_io
[ 472.501281] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=56/256 MAYDAY
[ 472.503479] in-flight: 10999:xfs_end_io, 11004:xfs_end_io, 11018:xfs_end_io, 11038:xfs_end_io, 11025:xfs_end_io, 11001:xfs_end_io, 11020:xfs_end_io, 7728:xfs_end_io, 7730:xfs_end_io, 11027:xfs_end_io, 11011:xfs_end_io, 215:xfs_end_io, 11035:xfs_end_io, 11023:xfs_end_io, 11003:xfs_end_io, 11032:xfs_end_io, 11024:xfs_end_io, 10995:xfs_end_io, 11036:xfs_end_io, 11000:xfs_end_io, 11017:xfs_end_io, 11028:xfs_end_io, 11010:xfs_end_io, 43:xfs_end_io, 11015:xfs_end_io, 11034:xfs_end_io, 11030:xfs_end_io, 11009:xfs_end_io, 10996:xfs_end_io, 14:xfs_end_io, 11026:xfs_end_io, 371(RESCUER):xfs_end_io xfs_end_io xfs_end_io, 11037:xfs_end_io, 11006:xfs_end_io, 11014:xfs_end_io, 11022:xfs_end_io, 11002:xfs_end_io, 11019:xfs_end_io, 11033:xfs_end_io, 11031:xfs_end_io, 11008:xfs_end_io, 11012:xfs_end_io, 11021:xfs_end_io
[ 472.523473] pending: xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io
[ 472.527333] workqueue xfs-cil/sda1: flags=0xc
[ 472.529211] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 472.531450] pending: xlog_cil_push_work BAR(5327)
[ 472.533532] workqueue xfs-log/sda1: flags=0x14
[ 472.535342] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[ 472.537617] in-flight: 5327:xfs_log_worker
[ 472.539663] workqueue xfs-eofblocks/sda1: flags=0x4
[ 472.541540] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 472.543880] in-flight: 11029:xfs_eofblocks_worker
[ 472.545912] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=3 idle: 173 64
[ 472.548397] pool 1: cpus=0 node=0 flags=0x0 nice=-20 workers=3 idle: 2016 5
[ 472.550855] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=43 manager: 11039
[ 472.553317] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=11 idle: 11013 11016
[ 472.555879] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=13 idle: 7736 7738 7737
[ 472.558907] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 357 356
[ 522.490286] sysrq: SysRq : Manual OOM execution
[ 525.045576] sysrq: SysRq : Manual OOM execution
[ 527.096215] sysrq: SysRq : Manual OOM execution
[ 529.154109] sysrq: SysRq : Manual OOM execution
[ 531.284654] sysrq: SysRq : Manual OOM execution
[ 593.591208] sysrq: SysRq : Show State
(...snipped...)
[ 594.564502] kswapd0 D 0000000000000003 0 45 2 0x00000000
[ 594.566443] ffff88007cdecf50 ffffea000117f540 ffff880078798000 ffff88007fcceb00
[ 594.568680] ffff88007fcceb00 0000000100047b1c ffff88007fac39a0 0000000000000017
[ 594.570753] ffffffff8161d37a ffff880078797a68 ffffffff8161fa83 0000000000000005
[ 594.572826] Call Trace:
[ 594.573882] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 594.575433] [<ffffffff8161fa83>] ? schedule_timeout+0x113/0x1b0
[ 594.577136] [<ffffffff810c2b10>] ? migrate_timer_list+0x60/0x60
[ 594.578828] [<ffffffff8161ca89>] ? io_schedule_timeout+0x99/0x100
[ 594.580553] [<ffffffff8112df99>] ? congestion_wait+0x79/0xd0
[ 594.582194] [<ffffffff810a31c0>] ? wait_woken+0x80/0x80
[ 594.583760] [<ffffffff8112396b>] ? shrink_inactive_list+0x4bb/0x530
[ 594.585557] [<ffffffff811243a1>] ? shrink_lruvec+0x571/0x780
[ 594.587182] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 594.588796] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 594.590387] [<ffffffff811246a2>] ? shrink_zone+0xf2/0x300
[ 594.591956] [<ffffffff8112572d>] ? kswapd+0x50d/0x940
[ 594.593450] [<ffffffff81125220>] ? mem_cgroup_shrink_node_zone+0xd0/0xd0
[ 594.595277] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 594.596761] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 594.598509] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 594.600105] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 596.773310] xfs-data/sda1 D ffff88007fc56700 0 371 2 0x00000000
[ 596.775200] Workqueue: xfs-data/sda1 xfs_end_io
[ 596.776610] ffff8800775b69c0 ffff88007fcd6700 ffff880035448000 ffff8800775b69c0
[ 596.778636] ffff88007b582ca8 ffffffff00000000 fffffffe00000001 0000000000000002
[ 596.780652] ffffffff8161d37a ffff88007b582c90 ffffffff8161f79d ffff88007fbf1f00
[ 596.782772] Call Trace:
[ 596.783780] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 596.785277] [<ffffffff8161f79d>] ? rwsem_down_write_failed+0x20d/0x3c0
[ 596.787034] [<ffffffff8130ce63>] ? call_rwsem_down_write_failed+0x13/0x20
[ 596.788875] [<ffffffff8161ef64>] ? down_write+0x24/0x40
[ 596.790398] [<ffffffff8122bb50>] ? xfs_setfilesize+0x20/0xa0
[ 596.792010] [<ffffffff8122bcbc>] ? xfs_end_io+0x5c/0x90
[ 596.793577] [<ffffffff8107ef51>] ? process_one_work+0x131/0x350
[ 596.795231] [<ffffffff8107f834>] ? rescuer_thread+0x194/0x3d0
[ 596.796854] [<ffffffff8107f6a0>] ? worker_thread+0x530/0x530
[ 596.798473] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 596.799956] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 596.801704] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 596.803290] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 602.433643] Showing busy workqueues and worker pools:
[ 602.435435] workqueue events: flags=0x0
[ 602.436985] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=6/256
[ 602.439035] pending: vmstat_update, console_callback, flush_to_ldisc, moom_callback, sysrq_reinject_alt_sysrq, push_to_pool
[ 602.442192] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[ 602.444276] pending: vmstat_shepherd, vmstat_update, fb_deferred_io_work, vmpressure_work_fn
[ 602.446899] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 602.448983] pending: vmstat_update, e1000_watchdog [e1000]
[ 602.451019] workqueue events_freezable: flags=0x4
[ 602.452741] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 602.454894] pending: vmballoon_work [vmw_balloon]
[ 602.456739] workqueue events_power_efficient: flags=0x80
[ 602.458558] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 602.460611] pending: fb_flashcursor
[ 602.462207] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 602.464257] pending: neigh_periodic_work
[ 602.465957] workqueue events_freezable_power_: flags=0x84
[ 602.467766] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 602.469802] in-flight: 62:disk_events_workfn
[ 602.471535] workqueue writeback: flags=0x4e
[ 602.473167] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 602.475070] in-flight: 358:bdi_writeback_workfn bdi_writeback_workfn
[ 602.477233] workqueue xfs-data/sda1: flags=0xc
[ 602.478832] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=10/256
[ 602.480845] in-flight: 7733:xfs_end_io, 26:xfs_end_io, 7729:xfs_end_io, 2737:xfs_end_io, 7731:xfs_end_io, 423:xfs_end_io, 7734:xfs_end_io, 7732:xfs_end_io, 7735:xfs_end_io, 65:xfs_end_io
[ 602.486167] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=8/256 MAYDAY
[ 602.488307] in-flight: 134:xfs_end_io, 10998:xfs_end_io, 11007:xfs_end_io, 20:xfs_end_io, 448:xfs_end_io, 11005:xfs_end_io, 232:xfs_end_io, 10997:xfs_end_io
[ 602.492909] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=56/256 MAYDAY
[ 602.495132] in-flight: 10999:xfs_end_io, 11004:xfs_end_io, 11018:xfs_end_io, 11038:xfs_end_io, 11025:xfs_end_io, 11001:xfs_end_io, 11020:xfs_end_io, 7728:xfs_end_io, 7730:xfs_end_io, 11027:xfs_end_io, 11011:xfs_end_io, 215:xfs_end_io, 11035:xfs_end_io, 11023:xfs_end_io, 11003:xfs_end_io, 11032:xfs_end_io, 11024:xfs_end_io, 10995:xfs_end_io, 11036:xfs_end_io, 11000:xfs_end_io, 11017:xfs_end_io, 11028:xfs_end_io, 11010:xfs_end_io, 43:xfs_end_io, 11015:xfs_end_io, 11034:xfs_end_io, 11030:xfs_end_io, 11009:xfs_end_io, 10996:xfs_end_io, 14:xfs_end_io, 11026:xfs_end_io, 371(RESCUER):xfs_end_io xfs_end_io xfs_end_io, 11037:xfs_end_io, 11006:xfs_end_io, 11014:xfs_end_io, 11022:xfs_end_io, 11002:xfs_end_io, 11019:xfs_end_io, 11033:xfs_end_io, 11031:xfs_end_io, 11008:xfs_end_io, 11012:xfs_end_io, 11021:xfs_end_io
[ 602.515036] pending: xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io
[ 602.519005] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 602.521255] pending: xfs_end_io
[ 602.522969] workqueue xfs-cil/sda1: flags=0xc
[ 602.524847] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 602.527565] pending: xlog_cil_push_work BAR(5327)
[ 602.529603] workqueue xfs-log/sda1: flags=0x14
[ 602.531414] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[ 602.533669] in-flight: 5327:xfs_log_worker
[ 602.535572] workqueue xfs-eofblocks/sda1: flags=0x4
[ 602.537513] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 602.539703] in-flight: 11029:xfs_eofblocks_worker
[ 602.541740] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=3 idle: 173 64
[ 602.544233] pool 1: cpus=0 node=0 flags=0x0 nice=-20 workers=3 idle: 2016 5
[ 602.546746] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=43 manager: 11039
[ 602.549164] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=11 idle: 11013 11016
[ 602.551756] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=13 idle: 7736 7738 7737
[ 602.554295] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 357 356
(...snipped...)
[ 682.791133] kswapd0 D 0000000000000002 0 45 2 0x00000000
[ 682.793129] ffff88007cdecf50 ffffea0001e98d00 ffff880078798000 ffff88007fc8eb00
[ 682.795229] ffff88007fc8eb00 000000010005cf73 ffff88007fac39a0 0000000000000017
[ 682.797305] ffffffff8161d37a ffff880078797a68 ffffffff8161fa83 0000000000000005
[ 682.799379] Call Trace:
[ 682.800437] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 682.801976] [<ffffffff8161fa83>] ? schedule_timeout+0x113/0x1b0
[ 682.803677] [<ffffffff810c2b10>] ? migrate_timer_list+0x60/0x60
[ 682.805379] [<ffffffff8161ca89>] ? io_schedule_timeout+0x99/0x100
[ 682.807110] [<ffffffff8112df99>] ? congestion_wait+0x79/0xd0
[ 682.808752] [<ffffffff810a31c0>] ? wait_woken+0x80/0x80
[ 682.810318] [<ffffffff8112396b>] ? shrink_inactive_list+0x4bb/0x530
[ 682.812073] [<ffffffff811243a1>] ? shrink_lruvec+0x571/0x780
[ 682.813721] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 682.815324] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 682.816966] [<ffffffff811246a2>] ? shrink_zone+0xf2/0x300
[ 682.818536] [<ffffffff8112572d>] ? kswapd+0x50d/0x940
[ 682.820027] [<ffffffff81125220>] ? mem_cgroup_shrink_node_zone+0xd0/0xd0
[ 682.821847] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 682.823331] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 682.825072] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 682.826649] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 685.013561] xfs-data/sda1 D ffff88007fc56700 0 371 2 0x00000000
[ 685.015492] Workqueue: xfs-data/sda1 xfs_end_io
[ 685.016894] ffff8800775b69c0 ffff88007fcd6700 ffff880035448000 ffff8800775b69c0
[ 685.018898] ffff88007b582ca8 ffffffff00000000 fffffffe00000001 0000000000000002
[ 685.020943] ffffffff8161d37a ffff88007b582c90 ffffffff8161f79d ffff88007fbf1f00
[ 685.022936] Call Trace:
[ 685.023945] [<ffffffff8161d37a>] ? schedule+0x2a/0x80
[ 685.025459] [<ffffffff8161f79d>] ? rwsem_down_write_failed+0x20d/0x3c0
[ 685.027232] [<ffffffff8130ce63>] ? call_rwsem_down_write_failed+0x13/0x20
[ 685.029035] [<ffffffff8161ef64>] ? down_write+0x24/0x40
[ 685.030566] [<ffffffff8122bb50>] ? xfs_setfilesize+0x20/0xa0
[ 685.032195] [<ffffffff8122bcbc>] ? xfs_end_io+0x5c/0x90
[ 685.033760] [<ffffffff8107ef51>] ? process_one_work+0x131/0x350
[ 685.035400] [<ffffffff8107f834>] ? rescuer_thread+0x194/0x3d0
[ 685.037046] [<ffffffff8107f6a0>] ? worker_thread+0x530/0x530
[ 685.038663] [<ffffffff81084658>] ? kthread+0xc8/0xe0
[ 685.040142] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
[ 685.041883] [<ffffffff81620f62>] ? ret_from_fork+0x42/0x70
[ 685.043466] [<ffffffff81084590>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 690.761839] Showing busy workqueues and worker pools:
[ 690.763617] workqueue events: flags=0x0
[ 690.765158] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=6/256
[ 690.767223] pending: vmstat_update, console_callback, flush_to_ldisc, moom_callback, sysrq_reinject_alt_sysrq, push_to_pool
[ 690.770382] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=4/256
[ 690.772458] pending: vmstat_shepherd, vmstat_update, fb_deferred_io_work, vmpressure_work_fn
[ 690.775199] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
[ 690.777285] pending: vmstat_update, e1000_watchdog [e1000]
[ 690.779309] workqueue events_freezable: flags=0x4
[ 690.781031] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 690.783117] pending: vmballoon_work [vmw_balloon]
[ 690.784964] workqueue events_power_efficient: flags=0x80
[ 690.786755] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 690.788835] pending: fb_flashcursor
[ 690.790445] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 690.792508] pending: neigh_periodic_work
[ 690.794188] workqueue events_freezable_power_: flags=0x84
[ 690.795981] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 690.798029] in-flight: 62:disk_events_workfn
[ 690.799767] workqueue writeback: flags=0x4e
[ 690.801324] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 690.803281] in-flight: 358:bdi_writeback_workfn bdi_writeback_workfn
[ 690.805415] workqueue xfs-data/sda1: flags=0xc
[ 690.807041] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=10/256
[ 690.809080] in-flight: 7733:xfs_end_io, 26:xfs_end_io, 7729:xfs_end_io, 2737:xfs_end_io, 7731:xfs_end_io, 423:xfs_end_io, 7734:xfs_end_io, 7732:xfs_end_io, 7735:xfs_end_io, 65:xfs_end_io
[ 690.814195] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=8/256 MAYDAY
[ 690.816359] in-flight: 134:xfs_end_io, 10998:xfs_end_io, 11007:xfs_end_io, 20:xfs_end_io, 448:xfs_end_io, 11005:xfs_end_io, 232:xfs_end_io, 10997:xfs_end_io
[ 690.820985] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=56/256 MAYDAY
[ 690.823218] in-flight: 10999:xfs_end_io, 11004:xfs_end_io, 11018:xfs_end_io, 11038:xfs_end_io, 11025:xfs_end_io, 11001:xfs_end_io, 11020:xfs_end_io, 7728:xfs_end_io, 7730:xfs_end_io, 11027:xfs_end_io, 11011:xfs_end_io, 215:xfs_end_io, 11035:xfs_end_io, 11023:xfs_end_io, 11003:xfs_end_io, 11032:xfs_end_io, 11024:xfs_end_io, 10995:xfs_end_io, 11036:xfs_end_io, 11000:xfs_end_io, 11017:xfs_end_io, 11028:xfs_end_io, 11010:xfs_end_io, 43:xfs_end_io, 11015:xfs_end_io, 11034:xfs_end_io, 11030:xfs_end_io, 11009:xfs_end_io, 10996:xfs_end_io, 14:xfs_end_io, 11026:xfs_end_io, 371(RESCUER):xfs_end_io xfs_end_io xfs_end_io, 11037:xfs_end_io, 11006:xfs_end_io, 11014:xfs_end_io, 11022:xfs_end_io, 11002:xfs_end_io, 11019:xfs_end_io, 11033:xfs_end_io, 11031:xfs_end_io, 11008:xfs_end_io, 11012:xfs_end_io, 11021:xfs_end_io
[ 690.843071] pending: xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io, xfs_end_io
[ 690.846995] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 690.849221] pending: xfs_end_io
[ 690.850963] workqueue xfs-cil/sda1: flags=0xc
[ 690.852757] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256
[ 690.855025] pending: xlog_cil_push_work BAR(5327)
[ 690.857038] workqueue xfs-log/sda1: flags=0x14
[ 690.858847] pwq 1: cpus=0 node=0 flags=0x0 nice=-20 active=1/256
[ 690.861085] in-flight: 5327:xfs_log_worker
[ 690.862980] workqueue xfs-eofblocks/sda1: flags=0x4
[ 690.864859] pwq 4: cpus=2 node=0 flags=0x0 nice=0 active=1/256
[ 690.867052] in-flight: 11029:xfs_eofblocks_worker
[ 690.869050] pool 0: cpus=0 node=0 flags=0x0 nice=0 workers=3 idle: 173 64
[ 690.871468] pool 1: cpus=0 node=0 flags=0x0 nice=-20 workers=3 idle: 2016 5
[ 690.873964] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=43 manager: 11039
[ 690.876395] pool 4: cpus=2 node=0 flags=0x0 nice=0 workers=11 idle: 11013 11016
[ 690.878886] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=13 idle: 7736 7738 7737
[ 690.881421] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 357 356
[ 707.584171] sysrq: SysRq : Resetting
----------

using a reproducer shown below.

----------
#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <sched.h>

/* keep generating file data so that writeback/XFS end_io work piles up */
static int file_writer(void *unused)
{
	static char buffer[1048576] = { };
	const int fd = open("/tmp/file",
			    O_WRONLY | O_CREAT | O_APPEND, 0600);

	while (write(fd, buffer, sizeof(buffer)) == sizeof(buffer));
	return 0;
}

/* allocate as much anonymous memory as possible, then touch it all */
static int memory_consumer(void *unused)
{
	const int fd = open("/dev/zero", O_RDONLY);
	unsigned long size;
	char *buf = NULL;

	sleep(3);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);

		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
	return 0;
}

int main(int argc, char *argv[])
{
	/* small stacks; the pointer passed to clone() is the stack top */
	clone(file_writer, malloc(4 * 1024) + 4 * 1024,
	      CLONE_SIGHAND | CLONE_VM, NULL);
	clone(file_writer, malloc(4 * 1024) + 4 * 1024,
	      CLONE_SIGHAND | CLONE_VM, NULL);
	clone(memory_consumer, malloc(4 * 1024) + 4 * 1024,
	      CLONE_SIGHAND | CLONE_VM, NULL);
	pause();
	return 0;
}
----------

FYI, a similar moom_work lockup in 4.1.0-rc3+ can be found at
http://I-love.SAKURA.ne.jp/tmp/serial-20150516-1.txt.xz .

----------
[ 463.070568] sysrq: SysRq : Manual OOM execution
[ 469.189784] sysrq: SysRq : Show Blocked State
(...snipped...)
[ 469.305369] kswapd0 D ffff88007fc96480 0 45 2 0x00000000
[ 469.307259] ffff88007cde4f50 ffffea00014a6cc0 ffff880078774000 ffff88007fc8e8c0
[ 469.309273] ffff88007fc8e8c0 00000001000291e6 ffff88007fac39a0 0000000000000017
[ 469.311419] ffffffff8160e92a ffff880078773a68 ffffffff81611033 0000000000000015
[ 469.313475] Call Trace:
[ 469.314537] [<ffffffff8160e92a>] ? schedule+0x2a/0x80
[ 469.316090] [<ffffffff81611033>] ? schedule_timeout+0x113/0x1b0
[ 469.317775] [<ffffffff810c2990>] ? migrate_timer_list+0x60/0x60
[ 469.319470] [<ffffffff8160e01f>] ? io_schedule_timeout+0x9f/0x120
[ 469.321189] [<ffffffff8112db49>] ? congestion_wait+0x79/0xd0
[ 469.322827] [<ffffffff810a3040>] ? wait_woken+0x80/0x80
[ 469.324378] [<ffffffff811234fb>] ? shrink_inactive_list+0x4bb/0x530
[ 469.326122] [<ffffffff81123f31>] ? shrink_lruvec+0x571/0x780
[ 469.327745] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 469.329343] [<ffffffff8100c671>] ? __switch_to+0x201/0x540
[ 469.330917] [<ffffffff81124232>] ? shrink_zone+0xf2/0x300
[ 469.332481] [<ffffffff811252bd>] ? kswapd+0x50d/0x940
[ 469.333981] [<ffffffff81124db0>] ? mem_cgroup_shrink_node_zone+0xd0/0xd0
[ 469.335794] [<ffffffff81084588>] ? kthread+0xc8/0xe0
[ 469.337280] [<ffffffff810844c0>] ? kthread_create_on_node+0x190/0x190
[ 469.339069] [<ffffffff816124e2>] ? ret_from_fork+0x42/0x70
[ 469.340658] [<ffffffff810844c0>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 469.409864] xfs-data/sda1 D 0000000000000003 0 374 2 0x00000000
[ 469.411722] Workqueue: xfs-data/sda1 xfs_end_io
[ 469.413083] ffff880077565820 ffff88007fc56480 ffff8800001bc000 ffff880077565820
[ 469.415097] ffff88007b9d20a8 ffffffff00000000 fffffffe00000001 0000000000000002
[ 469.417032] ffffffff8160e92a ffff88007b9d2090 ffffffff81610d4d ffff88007fbf1d80
[ 469.419083] Call Trace:
[ 469.420159] [<ffffffff8160e92a>] ? schedule+0x2a/0x80
[ 469.421722] [<ffffffff81610d4d>] ? rwsem_down_write_failed+0x20d/0x3c0
[ 469.423520] [<ffffffff812feb03>] ? call_rwsem_down_write_failed+0x13/0x20
[ 469.425370] [<ffffffff81610514>] ? down_write+0x24/0x40
[ 469.426915] [<ffffffff8122b720>] ? xfs_setfilesize+0x20/0xa0
[ 469.428562] [<ffffffff8122b88c>] ? xfs_end_io+0x5c/0x90
[ 469.430202] [<ffffffff8107ee81>] ? process_one_work+0x131/0x350
[ 469.431864] [<ffffffff8107f764>] ? rescuer_thread+0x194/0x3d0
[ 469.433497] [<ffffffff8107f5d0>] ? worker_thread+0x530/0x530
[ 469.435106] [<ffffffff81084588>] ? kthread+0xc8/0xe0
[ 469.436574] [<ffffffff810844c0>] ? kthread_create_on_node+0x190/0x190
[ 469.438331] [<ffffffff816124e2>] ? ret_from_fork+0x42/0x70
[ 469.439935] [<ffffffff810844c0>] ? kthread_create_on_node+0x190/0x190
(...snipped...)
[ 496.158109] sysrq: SysRq : Manual OOM execution
[ 500.707551] sysrq: SysRq : Manual OOM execution
[ 501.965086] sysrq: SysRq : Manual OOM execution
[ 505.905775] sysrq: SysRq : Manual OOM execution
[ 507.051127] sysrq: SysRq : Manual OOM execution
[ 508.556863] sysrq: SysRq : Show State
(...snipped...)
[ 515.536393] Showing busy workqueues and worker pools:
[ 515.538185] workqueue events: flags=0x0
[ 515.539758] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[ 515.541872] pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[ 515.546684] workqueue events_power_efficient: flags=0x80
[ 515.548589] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 515.550829] pending: neigh_periodic_work, check_lifetime
[ 515.552884] workqueue events_freezable_power_: flags=0x84
[ 515.554742] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 515.556846] in-flight: 3837:disk_events_workfn
[ 515.558665] workqueue writeback: flags=0x4e
[ 515.560291] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 515.562271] in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[ 515.564544] workqueue xfs-data/sda1: flags=0xc
[ 515.566265] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[ 515.568359] in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[ 515.571018] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 515.573113] in-flight: 179:xfs_end_io
[ 515.574782] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[ 515.577230] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[ 515.579488] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
(...snipped...)
[ 574.910809] Showing busy workqueues and worker pools:
[ 574.912540] workqueue events: flags=0x0
[ 574.914048] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=8/256
[ 574.916022] pending: vmpressure_work_fn, console_callback, vmstat_update, flush_to_ldisc, push_to_pool, moom_callback, sysrq_reinject_alt_sysrq, fb_deferred_io_work
[ 574.920570] workqueue events_power_efficient: flags=0x80
[ 574.922458] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=2/256
[ 574.924436] pending: neigh_periodic_work, check_lifetime
[ 574.926362] workqueue events_freezable_power_: flags=0x84
[ 574.928083] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=1/256
[ 574.930102] in-flight: 3837:disk_events_workfn
[ 574.931957] workqueue writeback: flags=0x4e
[ 574.933755] pwq 16: cpus=0-7 flags=0x4 nice=0 active=2/256
[ 574.935876] in-flight: 3812:bdi_writeback_workfn bdi_writeback_workfn
[ 574.938001] workqueue xfs-data/sda1: flags=0xc
[ 574.939621] pwq 6: cpus=3 node=0 flags=0x0 nice=0 active=4/256
[ 574.941571] in-flight: 374(RESCUER):xfs_end_io, 3759:xfs_end_io, 26:xfs_end_io, 3836:xfs_end_io
[ 574.944138] pwq 2: cpus=1 node=0 flags=0x0 nice=0 active=1/256
[ 574.946216] in-flight: 179:xfs_end_io
[ 574.947863] pool 2: cpus=1 node=0 flags=0x0 nice=0 workers=4 idle: 3790 237 3820
[ 574.950170] pool 6: cpus=3 node=0 flags=0x0 nice=0 workers=5 manager: 219
[ 574.952372] pool 16: cpus=0-7 flags=0x4 nice=0 workers=3 idle: 356 357
[ 618.816885] sysrq: SysRq : Resetting
----------

> > Therefore, my version used DEFINE_TIMER() rather than
> > DECLARE_DELAYED_WORK() in order to make sure that the callback is
> > called as soon as the timeout expires.
>
> I do not care that much whether it is timer or delayed work which would
> be used.
>

I do care that SysRq-f's moom_work is sometimes unable to call moom_callback(),
and that out_of_memory(true) called from moom_callback() can fail to choose
another mm struct. This forces administrators to fall back to SysRq-{i,c,b},
as I explained in my version.

I think that the basic rule for handling OOM condition is that "Do not trust
anybody, for even the kswapd and rescuer threads can be waiting for lock or
allocation. Decide when to give up at administrator's own risk, and choose
another mm struct or call panic()".

2015-06-11 14:18:21

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

On Thu 11-06-15 22:12:40, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > > The moom_work used by SysRq-f sometimes cannot be executed
> > > because some work which is processed before the moom_work is processed is
> > > stalled for unbounded amount of time due to looping inside the memory
> > > allocator.
> >
> > Wouldn't wq code pick up another worker thread to execute the work.
> > There is also a rescuer thread as the last resort AFAIR.
> >
>
> Below is an example of moom_work lockup in v4.1-rc7 from
> http://I-love.SAKURA.ne.jp/tmp/serial-20150611.txt.xz
>
> ----------
> [ 171.710406] sysrq: SysRq : Manual OOM execution
> [ 171.720193] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> [ 171.722699] kworker/2:9 cpuset=/ mems_allowed=0
> [ 171.724603] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
> [ 171.726817] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> [ 171.729727] Workqueue: events moom_callback
> (...snipped...)
> [ 258.302016] sysrq: SysRq : Manual OOM execution

Wow, this is a _lot_. I was aware that workqueues might be overloaded.
We have seen that in real loads and that led to
http://marc.info/?l=linux-kernel&m=141456398425553 where the rescuer
didn't handle pending work properly. I thought that the fix helped in
the end. But 1.5 minutes is indeed unexpected for me.

This of course disqualifies DELAYED_WORK for anything that has at least
reasonable time expectations which is the case here.
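
To illustrate the difference (a minimal sketch, not taken from either patch;
the names are made up): a plain timer fires from the timer softirq as soon as
it expires, while a delayed work item additionally has to wait until a worker
thread or the rescuer picks it up:

#include <linux/timer.h>
#include <linux/workqueue.h>
#include <linux/jiffies.h>
#include <linux/printk.h>

/* Fires from the timer softirq as soon as the timeout expires. */
static void oom_timeout_fn(unsigned long unused)
{
	pr_warn("oom timeout expired\n");
}
static DEFINE_TIMER(oom_timeout_timer, oom_timeout_fn, 0, 0);

/* Runs only once a worker thread (or the rescuer) picks it up. */
static void oom_timeout_work_fn(struct work_struct *unused)
{
	pr_warn("oom timeout work finally ran\n");
}
static DECLARE_DELAYED_WORK(oom_timeout_work, oom_timeout_work_fn);

static void arm_oom_timeout(unsigned long secs)
{
	mod_timer(&oom_timeout_timer, jiffies + secs * HZ);
	schedule_delayed_work(&oom_timeout_work, secs * HZ);
}

With an overloaded events workqueue like the one shown above, only the timer
is guaranteed to run close to the deadline.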

[...]

> I think that the basic rule for handling OOM condition is that "Do not trust
> anybody, for even the kswapd and rescuer threads can be waiting for lock or
> allocation. Decide when to give up at administrator's own risk, and choose
> another mm struct or call panic()".

Yes I agree with this and checking the timeout at select_bad_process
time sounds much more plausible in that regard.

Thanks!
--
Michal Hocko
SUSE Labs

2015-06-11 14:46:06

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote:
> On Thu 11-06-15 22:12:40, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > > The moom_work used by SysRq-f sometimes cannot be executed
> > > > because some work which is processed before the moom_work is processed is
> > > > stalled for unbounded amount of time due to looping inside the memory
> > > > allocator.
> > >
> > > Wouldn't wq code pick up another worker thread to execute the work.
> > > There is also a rescuer thread as the last resort AFAIR.
> > >
> >
> > Below is an example of moom_work lockup in v4.1-rc7 from
> > http://I-love.SAKURA.ne.jp/tmp/serial-20150611.txt.xz
> >
> > ----------
> > [ 171.710406] sysrq: SysRq : Manual OOM execution
> > [ 171.720193] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > [ 171.722699] kworker/2:9 cpuset=/ mems_allowed=0
> > [ 171.724603] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
> > [ 171.726817] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> > [ 171.729727] Workqueue: events moom_callback
> > (...snipped...)
> > [ 258.302016] sysrq: SysRq : Manual OOM execution
>
> Wow, this is a _lot_. I was aware that workqueues might be overloaded.
> We have seen that in real loads and that led to
> http://marc.info/?l=linux-kernel&m=141456398425553 where the rescuer
> didn't handle pending work properly. I thought that the fix helped in
> the end. But 1.5 minutes is indeed unexpected for me.

Excuse me, but you misunderstood the log. The logs for uptime = 171 and
uptime = 258 are cases where SysRq-f (indicated by "sysrq: SysRq : Manual
OOM execution" message) immediately invoked the OOM killer (indicated by
"invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0" message).

What you should check is uptime > 301. Until I do SysRq-b at uptime = 707,
the "sysrq: SysRq : Manual OOM execution" message is printed but the
"invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0" message is not
printed. During this period (so far 5 minutes, presumably forever),
moom_callback() remained pending.

>
> This of course disqualifies DELAYED_WORK for anything that has at least
> reasonable time expectations which is the case here.

2015-06-11 15:38:35

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

On Thu 11-06-15 23:45:26, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > On Thu 11-06-15 22:12:40, Tetsuo Handa wrote:
> > > Michal Hocko wrote:
> > [...]
> > > > > The moom_work used by SysRq-f sometimes cannot be executed
> > > > > because some work which is processed before the moom_work is processed is
> > > > > stalled for unbounded amount of time due to looping inside the memory
> > > > > allocator.
> > > >
> > > > Wouldn't wq code pick up another worker thread to execute the work.
> > > > There is also a rescuer thread as the last resort AFAIR.
> > > >
> > >
> > > Below is an example of moom_work lockup in v4.1-rc7 from
> > > http://I-love.SAKURA.ne.jp/tmp/serial-20150611.txt.xz
> > >
> > > ----------
> > > [ 171.710406] sysrq: SysRq : Manual OOM execution
> > > [ 171.720193] kworker/2:9 invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0
> > > [ 171.722699] kworker/2:9 cpuset=/ mems_allowed=0
> > > [ 171.724603] CPU: 2 PID: 11016 Comm: kworker/2:9 Not tainted 4.1.0-rc7 #3
> > > [ 171.726817] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 07/31/2013
> > > [ 171.729727] Workqueue: events moom_callback
> > > (...snipped...)
> > > [ 258.302016] sysrq: SysRq : Manual OOM execution
> >
> > Wow, this is a _lot_. I was aware that workqueues might be overloaded.
> > We have seen that in real loads and that led to
> > http://marc.info/?l=linux-kernel&m=141456398425553 where the rescuer
> > didn't handle pending work properly. I thought that the fix helped in
> > the end. But 1.5 minutes is indeed unexpected for me.
>
> Excuse me, but you misunderstood the log.

Yes, I've misread the log (I interpreted (...) as the wait time and didn't
pay closer attention to what was the triggering and what was the invocation
part). Sorry about that.
[...]
> What you should check is uptime > 301. Until I do SysRq-b at uptime = 707,
> the "sysrq: SysRq : Manual OOM execution" message is printed but the
> "invoked oom-killer: gfp_mask=0xd0, order=0, oom_score_adj=0" message is not
> printed. During this period (so far 5 minutes, presumably forever),
> moom_callback() remained pending.

OK, I can see it now. So this is even worse than a latency caused by
overloaded workqueues. A long lag would already be a sufficient reason to
disqualify DELAYED_WORK, but this makes it a complete no-go.
--
Michal Hocko
SUSE Labs

2015-06-12 15:23:39

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote:
> > Since my version uses a per-"struct task_struct" variable (memdie_start),
> > the 5 second timeout is checked for each memory cgroup individually. It can avoid
> > unnecessary panic() calls if nobody needs to call out_of_memory() again
> > (probably because somebody volunteered memory) when the OOM victim cannot
> > be terminated for some reason. If we want distinction between "the entire
> > system is under OOM" and "some memory cgroup is under OOM" because the
> > former is urgent but the latter is less urgent, it can be modified to
> > allow different timeout period for system-wide OOM and cgroup OOM.
> > Finally, it can give a hint for "in what sequence threads got stuck" and
> > "which thread did take 5 seconds" when analyzing vmcore.
>
> I will have a look at how you have implemented that, but separate timeouts
> sound like major over-engineering. Also note that global vs. memcg OOM
> is not sufficient because there are other oom domains as mentioned above.

We also need to consider mempolicy and cpusets, right? I'm unfamiliar with
NUMA systems, but I guess that mempolicy OOM is a situation where "some
memory node is under OOM" and cpusets OOM is a situation where "memory
cannot be reclaimed/allocated without borrowing cpus outside of the given
cpusets".

Michal Hocko wrote:
> Your patch is doing way too many things at once :/ So let me just focus
> on the "panic if a task is stuck with TIF_MEMDIE for too long". It looks
> like an alternative to the approach I've chosen. It doesn't consider
> the allocation restriction so a locked up cpuset/numa node(s) might
> panic the system which doesn't sound like a good idea but that is easily
> fixable. Could you tear just this part out and repost it so that we can
> compare the two approaches?

Sure. Here is a "tear just this part out" version. I think that most
administrators will no longer need to set panic_on_oom > 0 once they set
adequate values for these timeouts.
------------------------------------------------------------
>From e59b64683827151a35257384352c70bce61babdd Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Fri, 12 Jun 2015 23:56:18 +0900
Subject: [RFC] oom: implement memdie_task_panic_secs

OOM killer is a desperate last resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% and may
result in an unusable or a locked up system.

panic_on_oom sysctl knob allows to set the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. reduce the downtime to a predictable
amount of time, allow to get a crash dump of the system and debug the
issue post-mortem.

panic_on_oom is, however, a big hammer in many situations when the
OOM condition could be resolved in a reasonable time. So it would be
good to have some middle ground and allow the OOM killer to do its job
but have a failover when things go wrong and it is not able to make any
further progress for a considerable amount of time.

This patch implements system_memdie_panic_secs sysctl which configures
a maximum timeout for the OOM killer to resolve the OOM situation.
If the system is still under OOM (i.e. the OOM victim cannot release
memory) after the timeout expires, it will panic the system. A
reasonably chosen timeout can protect from both temporal OOM conditions
and allows to have a predictable time frame for the OOM condition.

Since there are memcg OOM, cpuset OOM and mempolicy OOM in addition to
system OOM, this patch also implements
{memcg,cpuset,mempolicy}_memdie_panic_secs. These allow the administrator
to use different timeout settings for each type of OOM, because for the
non-system cases the administrator still has a chance to perform steps to
resolve the potential lockup or thrashing from the global context
(e.g. by relaxing restrictions or even rebooting cleanly).

Signed-off-by: Tetsuo Handa <[email protected]>
---
include/linux/oom.h | 8 +++++
include/linux/sched.h | 1 +
kernel/sysctl.c | 39 ++++++++++++++++++++++++
mm/oom_kill.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 131 insertions(+)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7..f69e0dd 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -99,4 +99,12 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_system_memdie_panic_secs;
+#ifdef CONFIG_MEMCG
+extern unsigned long sysctl_memcg_memdie_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+extern unsigned long sysctl_cpuset_memdie_panic_secs;
+extern unsigned long sysctl_mempolicy_memdie_panic_secs;
+#endif
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d505bca..333bb3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1784,6 +1784,7 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+ unsigned long memdie_start;
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c566b56..0c5261f 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -144,6 +144,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
#endif

+/* Used by proc_doulongvec_minmax of sysctl_*_memdie_panic_secs. */
+static unsigned long wait_timeout_max = (LONG_MAX/HZ);
+
#ifdef CONFIG_INOTIFY_USER
#include <linux/inotify.h>
#endif
@@ -1535,6 +1538,42 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
+ {
+ .procname = "system_memdie_panic_secs",
+ .data = &sysctl_system_memdie_panic_secs,
+ .maxlen = sizeof(sysctl_system_memdie_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#ifdef CONFIG_MEMCG
+ {
+ .procname = "memcg_memdie_panic_secs",
+ .data = &sysctl_memcg_memdie_panic_secs,
+ .maxlen = sizeof(sysctl_memcg_memdie_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#endif
+#ifdef CONFIG_NUMA
+ {
+ .procname = "cpuset_memdie_panic_secs",
+ .data = &sysctl_cpuset_memdie_panic_secs,
+ .maxlen = sizeof(sysctl_cpuset_memdie_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+ {
+ .procname = "mempolicy_memdie_panic_secs",
+ .data = &sysctl_mempolicy_memdie_panic_secs,
+ .maxlen = sizeof(sysctl_mempolicy_memdie_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#endif
{ }
};

diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..40d7b6d0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,14 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

+unsigned long sysctl_system_memdie_panic_secs;
+#ifdef CONFIG_MEMCG
+unsigned long sysctl_memcg_memdie_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+unsigned long sysctl_cpuset_memdie_panic_secs;
+unsigned long sysctl_mempolicy_memdie_panic_secs;
+#endif
DEFINE_MUTEX(oom_lock);

#ifdef CONFIG_NUMA
@@ -118,6 +126,72 @@ found:
return t;
}

+/**
+ * check_memdie_task - check that the task is not stuck with TIF_MEMDIE set.
+ *
+ * @p: Pointer to "struct task_struct".
+ * @memcg: Pointer to "struct mem_cgroup". May be NULL.
+ * @nodemask: Pointer to "nodemask_t". May be NULL.
+ *
+ * Setting the TIF_MEMDIE flag on @p disables the OOM killer. However, @p could
+ * get stuck due to a dependency which is invisible to the OOM killer. When @p
+ * gets stuck, the system will stall for an unpredictable duration (presumably
+ * forever) because the OOM killer is kept disabled.
+ *
+ * If @p remained stuck for
+ * /proc/sys/vm/{system,memcg,cpuset,mempolicy}_memdie_panic_secs seconds,
+ * this function triggers kernel panic.
+ * Setting 0 to {memcg,cpuset,mempolicy}_memdie_panic_secs causes
+ * respective interfaces to use system_memdie_panic_secs setting.
+ * Setting 0 to system_memdie_panic_secs disables this check.
+ */
+static void check_memdie_task(struct task_struct *p, struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ unsigned long start = p->memdie_start;
+ unsigned long spent;
+ unsigned long timeout = 0;
+
+ /* If task does not have TIF_MEMDIE flag, there is nothing to do. */
+ if (!start)
+ return;
+ spent = jiffies - start;
+#ifdef CONFIG_MEMCG
+ /* task_in_mem_cgroup(p, memcg) is true. */
+ if (memcg)
+ timeout = sysctl_memcg_memdie_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+ /* has_intersects_mems_allowed(p, nodemask) is true. */
+ else if (nodemask)
+ timeout = sysctl_mempolicy_memdie_panic_secs;
+ else
+ timeout = sysctl_cpuset_memdie_panic_secs;
+#endif
+ if (!timeout)
+ timeout = sysctl_system_memdie_panic_secs;
+ /* If timeout is disabled, there is nothing to do. */
+ if (!timeout)
+ return;
+#ifdef CONFIG_NUMA
+ {
+ struct task_struct *t;
+
+ rcu_read_lock();
+ for_each_thread(p, t) {
+ start = t->memdie_start;
+ if (start && time_after(spent, timeout * HZ))
+ break;
+ }
+ rcu_read_unlock();
+ }
+#endif
+ if (time_before(spent, timeout * HZ))
+ return;
+ panic("Out of memory: %s (%u) did not die within %lu seconds.\n",
+ p->comm, p->pid, timeout);
+}
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -135,6 +209,7 @@ static bool oom_unkillable_task(struct task_struct *p,
if (!has_intersects_mems_allowed(p, nodemask))
return true;

+ check_memdie_task(p, memcg, nodemask);
return false;
}

@@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly;
*/
void mark_oom_victim(struct task_struct *tsk)
{
+ unsigned long start;
+
WARN_ON(oom_killer_disabled);
/* OOM killer might race with memcg OOM */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
return;
+ /* Set current time for the check_memdie_task() check. */
+ start = jiffies;
+ if (!start)
+ start = 1;
+ tsk->memdie_start = start;
/*
* Make sure that the task is woken up from uninterruptible sleep
* if it is frozen because OOM killer wouldn't be able to free
@@ -435,6 +517,7 @@ void mark_oom_victim(struct task_struct *tsk)
*/
void exit_oom_victim(void)
{
+ current->memdie_start = 0;
clear_thread_flag(TIF_MEMDIE);

if (!atomic_dec_return(&oom_victims))
--
1.8.3.1
------------------------------------------------------------

By the way, with the introduction of a per-"struct task_struct" variable, I
think that we can replace the TIF_MEMDIE checks with memdie_start checks via

test_tsk_thread_flag(p, TIF_MEMDIE) => p->memdie_start

test_and_clear_thread_flag(TIF_MEMDIE) => xchg(&current->memdie_start, 0)

test_and_set_tsk_thread_flag(p, TIF_MEMDIE)
=> xchg(&p->memdie_start, jiffies (or 1 if jiffies == 0))

though the above patch did not replace TIF_MEMDIE in order to focus on one thing.
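
For illustration, hypothetical wrappers for that substitution might look like
this (a sketch only; they assume the memdie_start field added by the patch
above and are not part of it):

#include <linux/sched.h>
#include <linux/jiffies.h>

/* Hypothetical wrappers; the posted patch keeps TIF_MEMDIE and only adds
 * memdie_start next to it. memdie_start == 0 means "not an OOM victim". */
static inline bool task_is_oom_victim(struct task_struct *p)
{
	return p->memdie_start != 0;
}

static inline bool task_clear_oom_victim(struct task_struct *p)
{
	return xchg(&p->memdie_start, 0UL) != 0;
}

static inline bool task_set_oom_victim(struct task_struct *p)
{
	unsigned long now = jiffies ? jiffies : 1;

	return xchg(&p->memdie_start, now) != 0;
}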

2015-06-15 12:45:30

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

On Sat 13-06-15 00:23:00, Tetsuo Handa wrote:
[...]
> >From e59b64683827151a35257384352c70bce61babdd Mon Sep 17 00:00:00 2001
> From: Tetsuo Handa <[email protected]>
> Date: Fri, 12 Jun 2015 23:56:18 +0900
> Subject: [RFC] oom: implement memdie_task_panic_secs
>
> OOM killer is a desperate last resort reclaim attempt to free some
> memory. It is based on heuristics which will never be 100% and may
> result in an unusable or a locked up system.
>
> panic_on_oom sysctl knob allows to set the OOM policy to panic the
> system instead of trying to resolve the OOM condition. This might be
> useful for several reasons - e.g. reduce the downtime to a predictable
> amount of time, allow to get a crash dump of the system and debug the
> issue post-mortem.
>
> panic_on_oom is, however, a big hammer in many situations when the
> OOM condition could be resolved in a reasonable time. So it would be
> good to have some middle ground and allow the OOM killer to do its job
> but have a failover when things go wrong and it is not able to make any
> further progress for a considerable amount of time.
>
> This patch implements system_memdie_panic_secs sysctl which configures
> a maximum timeout for the OOM killer to resolve the OOM situation.
> If the system is still under OOM (i.e. the OOM victim cannot release
> memory) after the timeout expires, it will panic the system. A
> reasonably chosen timeout can protect from both temporal OOM conditions
> and allows to have a predictable time frame for the OOM condition.
>
> Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM,
> this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs .

I really hate having so many knobs. What would they be good for? Why
cannot you simply use a single timeout and decide whether to panic or
not based on panic_on_oom value? Or do you have any strong reason to
put this aside from panic_on_oom?

> These will allow administrator to use different timeout settings for
> each type of OOM, for administrator still has chance to perform steps
> to resolve the potential lockup or trashing from the global context
> (e.g. by relaxing restrictions or even rebooting cleanly).
>
> Signed-off-by: Tetsuo Handa <[email protected]>
> ---
> include/linux/oom.h | 8 +++++
> include/linux/sched.h | 1 +
> kernel/sysctl.c | 39 ++++++++++++++++++++++++
> mm/oom_kill.c | 83 +++++++++++++++++++++++++++++++++++++++++++++++++++
> 4 files changed, 131 insertions(+)
>
[...]
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index dff991e..40d7b6d0 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
[...]
> +static void check_memdie_task(struct task_struct *p, struct mem_cgroup *memcg,
> + const nodemask_t *nodemask)
> +{
> + unsigned long start = p->memdie_start;
> + unsigned long spent;
> + unsigned long timeout = 0;
> +
> + /* If task does not have TIF_MEMDIE flag, there is nothing to do. */
> + if (!start)
> + return;
> + spent = jiffies - start;
> +#ifdef CONFIG_MEMCG
> + /* task_in_mem_cgroup(p, memcg) is true. */
> + if (memcg)
> + timeout = sysctl_cgroup_memdie_panic_secs;
> +#endif
> +#ifdef CONFIG_NUMA
> + /* has_intersects_mems_allowed(p, nodemask) is true. */
> + else if (nodemask)
> + timeout = sysctl_mempolicy_memdie_panic_secs;
> + else
> + timeout = sysctl_cpuset_memdie_panic_secs;
> +#endif
> + if (!timeout)
> + timeout = sysctl_system_memdie_panic_secs;
> + /* If timeout is disabled, there is nothing to do. */
> + if (!timeout)
> + return;
> +#ifdef CONFIG_NUMA
> + {
> + struct task_struct *t;
> +
> + rcu_read_lock();
> + for_each_thread(p, t) {
> + start = t->memdie_start;
> + if (start && time_after(spent, timeout * HZ))
> + break;
> + }
> + rcu_read_unlock();

This doesn't make any sense to me. What are you trying to achieve here?
Why would you want to check all threads and do that only for CONFIG_NUMA
and even then do a noop if the timeout expired?

> + }
> +#endif
> + if (time_before(spent, timeout * HZ))
> + return;

I think the whole function is way too complicated without a good reason.
Why don't you simply store the expire time when marking the task as an
oom victim, compare it with the current jiffies using time_after, and
be done with it? This is a few lines of code...
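
Roughly something like this is what I have in mind (a sketch only; the
oom_expire field and the single timeout value are assumptions, not something
taken from your patch):

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Sketch: assumes a tsk->oom_expire field and one timeout in seconds. */
static void set_oom_expire(struct task_struct *tsk, unsigned long timeout_secs)
{
	/* done once, at mark_oom_victim() time */
	tsk->oom_expire = jiffies + timeout_secs * HZ;
}

static void check_oom_expire(struct task_struct *tsk)
{
	if (tsk->oom_expire && time_after(jiffies, tsk->oom_expire))
		panic("Out of memory: %s (%d) did not exit in time\n",
		      tsk->comm, tsk->pid);
}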

> + panic("Out of memory: %s (%u) did not die within %lu seconds.\n",
> + p->comm, p->pid, timeout);

It would be better to sync this message with what check_panic_on_oom
does.

> +}
> +
> /* return true if the task is not adequate as candidate victim task. */
> static bool oom_unkillable_task(struct task_struct *p,
> struct mem_cgroup *memcg, const nodemask_t *nodemask)
> @@ -135,6 +209,7 @@ static bool oom_unkillable_task(struct task_struct *p,
> if (!has_intersects_mems_allowed(p, nodemask))
> return true;
>
> + check_memdie_task(p, memcg, nodemask);

This is not sufficient. oom_scan_process_thread would break out from the
loop when encountering the first TIF_MEMDIE task and could have missed
an older one later in the task_list.
Besides that oom_unkillable_task doesn't sound like a good match to
evaluate this logic. I would expect it to be in oom_scan_process_thread.

> return false;
> }
>
> @@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly;
> */
> void mark_oom_victim(struct task_struct *tsk)
> {
> + unsigned long start;
> +
> WARN_ON(oom_killer_disabled);
> /* OOM killer might race with memcg OOM */
> if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> return;
> + /* Set current time for is_killable_memdie_task() check. */
> + start = jiffies;
> + if (!start)
> + start = 1;
> + tsk->memdie_start = start;

I would rather go with tsk->oom_expire = jiffies + timeout and set the
timeout depending on panic_on_oom value (which would require nodemask
and memcg parameters here).

> /*
> * Make sure that the task is woken up from uninterruptible sleep
> * if it is frozen because OOM killer wouldn't be able to free
> @@ -435,6 +517,7 @@ void mark_oom_victim(struct task_struct *tsk)
> */
> void exit_oom_victim(void)
> {
> + current->memdie_start = 0;

Is this really needed? The OOM killer shouldn't see the task because it has
already released its mm. oom_scan_process_thread checks mm after it checks
TIF_MEMDIE, so we can race theoretically, but this shouldn't matter much.
If a task is still visible after the timeout then there obviously was a
problem in making progress.

> clear_thread_flag(TIF_MEMDIE);
>
> if (!atomic_dec_return(&oom_victims))
> --
> 1.8.3.1
> ------------------------------------------------------------
>
> By the way, with introduction of per "struct task_struct" variable, I think
> that we can replace TIF_MEMDIE checks with memdie_start checks via
>
> test_tsk_thread_flag(p, TIF_MEMDIE) => p->memdie_start
>
> test_and_clear_thread_flag(TIF_MEMDIE) => xchg(&current->memdie_start, 0)
>
> test_and_set_tsk_thread_flag(p, TIF_MEMDIE)
> => xchg(&p->memdie_start, jiffies (or 1 if jiffies == 0))
>
> though above patch did not replace TIF_MEMDIE in order to focus on one thing.

I fail to see a direct advantage other than saving one bit in flags. Is
something asking for it?
--
Michal Hocko
SUSE Labs

2015-06-16 13:14:58

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote:
> > This patch implements system_memdie_panic_secs sysctl which configures
> > a maximum timeout for the OOM killer to resolve the OOM situation.
> > If the system is still under OOM (i.e. the OOM victim cannot release
> > memory) after the timeout expires, it will panic the system. A
> > reasonably chosen timeout can protect from both temporal OOM conditions
> > and allows to have a predictable time frame for the OOM condition.
> >
> > Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM,
> > this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs .
>
> I really hate having so many knobs. What would they be good for? Why
> cannot you simply use a single timeout and decide whether to panic or
> not based on panic_on_oom value? Or do you have any strong reason to
> put this aside from panic_on_oom?
>

The reason would depend on

(a) whether {memcg,cpuset,mempolicy} OOM stall is possible

(b) what {memcg,cpuset,mempolicy} users want to do when (a) is possible
and {memcg,cpuset,mempolicy} OOM stall occurred

.

Since memcg OOM is less critical than system OOM (the administrator still has
a chance to perform steps to resolve the OOM state), we could give a longer
timeout (e.g. 600 seconds) for memcg OOM while giving a shorter timeout (e.g.
10 seconds) for system OOM. But if (a) is impossible, trying to configure
a different timeout for non-system OOM stalls makes no sense.



> > +#ifdef CONFIG_NUMA
> > + {
> > + struct task_struct *t;
> > +
> > + rcu_read_lock();
> > + for_each_thread(p, t) {
> > + start = t->memdie_start;
> > + if (start && time_after(spent, timeout * HZ))
> > + break;
> > + }
> > + rcu_read_unlock();
>
> This doesn't make any sense to me. What are you trying to achieve here?
> Why would you want to check all threads and do that only for CONFIG_NUMA
> and even then do a noop if the timeout expired?
>
> > + }
> > +#endif

This block tried to mimic what has_intersects_mems_allowed() does.
Since TIF_MEMDIE is set on only one thread rather than on all threads in a
process, I thought that I needed to check all threads of a process when
searching for a TIF_MEMDIE thread.

But I forgot that for_each_process_thread() in select_bad_process() already
checks all threads of all processes. Thus, this block would be garbage
because checking all threads of a process here is unnecessary.



> > @@ -135,6 +209,7 @@ static bool oom_unkillable_task(struct task_struct *p,
> > if (!has_intersects_mems_allowed(p, nodemask))
> > return true;
> >
> > + check_memdie_task(p, memcg, nodemask);
>
> This is not sufficient. oom_scan_process_thread would break out from the
> loop when encountering the first TIF_MEMDIE task and could have missed
> an older one later in the task_list.

Indeed, not sufficient for this "tear just this part out" version.

My concern is to allow a timeout for

(1) choosing the next OOM victim if the previous OOM victim fails to release memory
(2) triggering a kernel panic if none of the OOM victims can release memory
(3) invoking the OOM killer for !__GFP_FS allocations

in the case of a system OOM stall, because we can see that any mechanism can be
unreliable due to e.g. workqueues not being processed or kswapd / rescuer
threads making no progress.

My "full" version tried to be sufficient because it makes
oom_unkillable_task() return true when timer for (1) expires. But indeed,
still not sufficient because older victim's timer for (2) could fail to
expire until younger victim's timer for (1) expires.

> Besides that oom_unkillable_task doesn't sound like a good match to
> evaluate this logic. I would expect it to be in oom_scan_process_thread.

Well, select_bad_process() which calls oom_scan_process_thread() would
break out from the loop when encountering the first TIF_MEMDIE task.
We need to change

case OOM_SCAN_ABORT:
rcu_read_unlock();
return (struct task_struct *)(-1UL);

to defer returning (-1UL) when a TIF_MEMDIE thread is found, in order to
make sure that all TIF_MEMDIE threads are examined for the timeout. With that
change made,

if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
/*** this location ***/
if (!force_kill)
return OOM_SCAN_ABORT;
}

in oom_scan_process_thread() will be an appropriate place for evaluating
this logic.



> > @@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly;
> > */
> > void mark_oom_victim(struct task_struct *tsk)
> > {
> > + unsigned long start;
> > +
> > WARN_ON(oom_killer_disabled);
> > /* OOM killer might race with memcg OOM */
> > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > return;
> > + /* Set current time for is_killable_memdie_task() check. */
> > + start = jiffies;
> > + if (!start)
> > + start = 1;
> > + tsk->memdie_start = start;
>
> I would rather go with tsk->oom_expire = jiffies + timeout and set the
> timeout depending on panic_on_oom value (which would require nodemask
> and memcg parameters here).
>

I got lost when making the distinction between mempolicy/cpuset OOM and
system OOM. I consider that memcg OOM means memcg != NULL.
I consider that system OOM means memcg == NULL && nodemask == NULL.
But has_intersects_mems_allowed() treats it as mempolicy OOM if
nodemask != NULL and as cpuset OOM if nodemask == NULL.

If (a) is possible and we want to configure different timeouts for
cpuset OOM stalls and system OOM stalls, where is the distinction between
cpuset OOM and system OOM? Are cpuset OOM and system OOM identical?



> > @@ -435,6 +517,7 @@ void mark_oom_victim(struct task_struct *tsk)
> > */
> > void exit_oom_victim(void)
> > {
> > + current->memdie_start = 0;
>
> Is this really needed? The OOM killer shouldn't see the task because it has
> already released its mm. oom_scan_process_thread checks mm after it checks
> TIF_MEMDIE, so we can race theoretically, but this shouldn't matter much.

Only to show that we could replace TIF_MEMDIE with memdie_start.

> If a task is still visible after the timeout then there obviously was a
> problem in making progress.
>

I can see that an OOM victim is still visible after the timeout in the
system OOM case. That's why I want a timeout for (1), (2) and (3).
(This RFC focuses only on the timeout for (2).)



> > By the way, with introduction of per "struct task_struct" variable, I think
> > that we can replace TIF_MEMDIE checks with memdie_start checks via
> >
> > test_tsk_thread_flag(p, TIF_MEMDIE) => p->memdie_start
> >
> > test_and_clear_thread_flag(TIF_MEMDIE) => xchg(&current->memdie_start, 0)
> >
> > test_and_set_tsk_thread_flag(p, TIF_MEMDIE)
> > => xchg(&p->memdie_start, jiffies (or 1 if jiffies == 0))
> >
> > though above patch did not replace TIF_MEMDIE in order to focus on one thing.
>
> I fail to see a direct advantage other than saving one bit in flags. Is
> something asking for it?

So far, nothing but saving one bit in flags.

If we remove TIF_MEMDIE, we could reuse that bit as TIF_MEMALLOC_STALLING,
which would be similar to TIF_NEED_RESCHED. TIF_MEMALLOC_STALLING would be
set via a timer which is enabled at

/*
* Try direct compaction. The first pass is asynchronous. Subsequent
* attempts after direct reclaim are synchronous
*/

in __alloc_pages_slowpath() in order to indicate that the current thread is
spending too much time inside the memory allocator (a sign of failing to
make forward progress, and a hint for triggering (3) above).
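
As a rough illustration only (TIF_MEMALLOC_STALLING and the helpers below are
hypothetical and do not exist upstream), the idea would look something like:

#include <linux/timer.h>
#include <linux/sched.h>
#include <linux/jiffies.h>

/* Illustration only: flag the task if an allocation runs for too long. */
struct alloc_stall_watch {
	struct timer_list timer;
	struct task_struct *task;
};

static void alloc_stall_timeout(unsigned long data)
{
	struct alloc_stall_watch *w = (struct alloc_stall_watch *)data;

	/* TIF_MEMALLOC_STALLING is hypothetical; it is not defined upstream. */
	set_tsk_thread_flag(w->task, TIF_MEMALLOC_STALLING);
}

/* Armed in __alloc_pages_slowpath() before entering compaction/reclaim. */
static void alloc_stall_start(struct alloc_stall_watch *w, unsigned long secs)
{
	w->task = current;
	setup_timer(&w->timer, alloc_stall_timeout, (unsigned long)w);
	mod_timer(&w->timer, jiffies + secs * HZ);
}

/* Disarmed once the allocation succeeds or gives up. */
static void alloc_stall_stop(struct alloc_stall_watch *w)
{
	del_timer_sync(&w->timer);
	clear_tsk_thread_flag(current, TIF_MEMALLOC_STALLING);
}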

2015-06-16 13:47:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

On Tue 16-06-15 22:14:28, Tetsuo Handa wrote:
> Michal Hocko wrote:
> > > This patch implements system_memdie_panic_secs sysctl which configures
> > > a maximum timeout for the OOM killer to resolve the OOM situation.
> > > If the system is still under OOM (i.e. the OOM victim cannot release
> > > memory) after the timeout expires, it will panic the system. A
> > > reasonably chosen timeout can protect from both temporal OOM conditions
> > > and allows to have a predictable time frame for the OOM condition.
> > >
> > > Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM,
> > > this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs .
> >
> > I really hate having so many knobs. What would they be good for? Why
> > cannot you simply use a single timeout and decide whether to panic or
> > not based on panic_on_oom value? Or do you have any strong reason to
> > put this aside from panic_on_oom?
> >
>
> The reason would depend on
>
> (a) whether {memcg,cpuset,mempolicy} OOM stall is possible
>
> (b) what {memcg,cpuset,mempolicy} users want to do when (a) is possible
> and {memcg,cpuset,mempolicy} OOM stall occurred

The system as such is still usable. And an administrator might
intervene. E.g. enlarge the memcg limit or relax the numa restrictions
for the same purpose.

> Since memcg OOM is less critical than system OOM because administrator still
> has chance to perform steps to resolve the OOM state, we could give longer
> timeout (e.g. 600 seconds) for memcg OOM while giving shorter timeout (e.g.
> 10 seconds) for system OOM. But if (a) is impossible, trying to configure
> different timeout for non-system OOM stall makes no sense.

I still do not see any point in separate timeouts.

Again, panic_on_oom=2 sounds very dubious to me, as already mentioned. Life
would be so much easier if we simply start by supporting
panic_on_oom=1 for now. It would be a simple timer (as we cannot use
DELAYED_WORK) which would just panic the machine after a timeout. We
wouldn't have a full oom report but that shouldn't matter much because
the original one would be in the log. Yes we could race with mempolicy
resp. memcg OOM killers when canceling the timer but does this matter
much? Dunno to be honest but having a simpler solution sounds much more
attractive to me. Shouldn't we go this way first?

[...]

> But I forgot that for_each_process_thread() in select_bad_process() already
> checked all threads of all processes. Thus, this block would be a garbage
> because checking all threads of a process here is unnecessary.

Exactly.

[...]
> > Besides that oom_unkillable_task doesn't sound like a good match to
> > evaluate this logic. I would expect it to be in oom_scan_process_thread.
>
> Well, select_bad_process() which calls oom_scan_process_thread() would
> break out from the loop when encountering the first TIF_MEMDIE task.
> We need to change
>
> case OOM_SCAN_ABORT:
> rcu_read_unlock();
> return (struct task_struct *)(-1UL);
>
> to defer returning of (-1UL) when a TIF_MEMDIE thread was found, in order to
> make sure that all TIF_MEMDIE threads are examined for timeout. With that
> change made,
>
> if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> /*** this location ***/
> if (!force_kill)
> return OOM_SCAN_ABORT;
> }
>
> in oom_scan_process_thread() will be an appropriate place for evaluating
> this logic.

You can also keep select_bad_process untouched and simply check the
remaining TIF_MEMDIE tasks in oom_scan_process_thread (only if the timeout is
> 0, of course, so most configurations will be unaffected).
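
For illustration, the check could be a tiny helper called from
oom_scan_process_thread() for each TIF_MEMDIE thread it encounters (a sketch
only, reusing the memdie_start field from your patch and assuming a single
timeout value in seconds):

#include <linux/sched.h>
#include <linux/jiffies.h>
#include <linux/kernel.h>

/* Sketch only: panic if a TIF_MEMDIE task has been stuck for too long. */
static void check_memdie_timeout(struct task_struct *task,
				 unsigned long timeout_secs)
{
	unsigned long start = task->memdie_start;	/* from your patch */

	if (!timeout_secs || !start)
		return;
	if (time_after(jiffies, start + timeout_secs * HZ))
		panic("Out of memory: %s (%d) stuck with TIF_MEMDIE for more than %lu seconds\n",
		      task->comm, task->pid, timeout_secs);
}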

> > > @@ -416,10 +491,17 @@ bool oom_killer_disabled __read_mostly;
> > > */
> > > void mark_oom_victim(struct task_struct *tsk)
> > > {
> > > + unsigned long start;
> > > +
> > > WARN_ON(oom_killer_disabled);
> > > /* OOM killer might race with memcg OOM */
> > > if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
> > > return;
> > > + /* Set current time for is_killable_memdie_task() check. */
> > > + start = jiffies;
> > > + if (!start)
> > > + start = 1;
> > > + tsk->memdie_start = start;
> >
> > I would rather go with tsk->oom_expire = jiffies + timeout and set the
> > timeout depending on panic_on_oom value (which would require nodemask
> > and memcg parameters here).
> >
>
> I got lost when making distinction between mempolicy/cpuset OOM and
> system OOM. I consider that memcg OOM is memcg != NULL.

right

> I consider that system OOM is memcg == NULL && nodemask == NULL.
> But has_intersects_mems_allowed() thinks as mempolicy OOM if
> nodemask != NULL and thinks as cpuset OOM if nodemask == NULL.

cpuset is an interface to enforce a nodemask. So nodemask != NULL is an OOM
caused by a NUMA policy.

[...]
--
Michal Hocko
SUSE Labs

2015-06-17 12:12:59

by Michal Hocko

[permalink] [raw]
Subject: [RFC -v2] panic_on_oom_timeout

Hi,
I was thinking about this and I am more and more convinced that we
shouldn't care about panic_on_oom=2 configuration for now and go with
the simplest solution first. I have revisited my original patch and
replaced delayed work by a timer based on the feedback from Tetsuo.

I think we can rely on timers. A downside would be that we cannot dump
the full OOM report from the IRQ context because we rely on task_lock
which is not IRQ safe. But I do not think we really need it. An OOM
report will be in the log already most of the time and show_mem will
tell us the current memory situation.

What do you think?
---
>From cec53a806cc19b691d6a63544a1283aa90cf92f1 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 9 Jun 2015 16:15:42 +0200
Subject: [PATCH] oom: implement panic_on_oom_timeout for panic_on_oom=1

OOM killer is a desperate last resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% and may
result in an unusable or a locked up system.

panic_on_oom sysctl knob allows to set the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. reduce the downtime to a predictable
amount of time, allow to get a crash dump of the system and debug the
issue post-mortem.

panic_on_oom is, however, a big hammer in many situations when the
OOM condition could be resolved in a reasonable time. So it would be
good to have some middle ground and allow the OOM killer to do its job
but have a failover when things go wrong and it is not able to make any
further progress for a considerable amount of time.

This patch implements panic_on_oom_timeout sysctl which is active only
when panic_on_oom=1 and it configures a maximum timeout for the OOM
killer to resolve the OOM situation. If the system is still under OOM
after the timeout expires it will panic the system. A reasonably chosen
timeout can protect from both temporal OOM conditions and allows to have
a predictable time frame for the OOM condition.

The feature is implemented by a timer which is scheduled when the OOM
condition is declared for the first time (oom_victims is still zero)
in out_of_memory and it is canceled in exit_oom_victim after the
oom_victims count drops down to zero.
For this time period OOM killer cannot kill new tasks and it only
allows exiting or killed tasks to access memory reserves (and increase
oom_victims counter via mark_oom_victim) in order to make a progress
so it is reasonable to consider the elevated oom_victims count as an
ongoing OOM condition.

Timeout for panic_on_oom=2 is not currently supported because it
would make the code more complicated and it is not even clear whether
this combination is useful. panic_on_oom=2 is disputable on its own
because the system is still usable at the time so the administrator can
intervene to resolve the OOM situation (e.g. relax NUMA restrictions,
increase memory limit for the oom memcg, reboot/panic the system or
simply selectively kill a task which might be blocking oom killer from
making progress).

The log will then contain something like:
[ 32.071128] Out of memory: Kill process 2998 (mem_eater) score 42 or sacrifice child
[ 32.074022] Killed process 2998 (mem_eater) total-vm:14172kB, anon-rss:628kB, file-rss:4kB
[ 32.091849] mem_eater invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 32.103139] mem_eater cpuset=/ mems_allowed=0
[...]
[ 32.354388] Killed process 3001 (mem_eater) total-vm:14172kB, anon-rss:580kB, file-rss:4kB
[ 33.088078] panic_on_oom timeout 1s has expired
[ 33.089062] Mem-Info:
[ 33.089557] active_anon:7991 inactive_anon:8106 isolated_anon:665
[ 33.089557] active_file:183 inactive_file:125 isolated_file:0
[ 33.089557] unevictable:0 dirty:0 writeback:1037 unstable:0
[ 33.089557] slab_reclaimable:2492 slab_unreclaimable:3253
[ 33.089557] mapped:275 shmem:0 pagetables:631 bounce:0
[ 33.089557] free:400 free_pcp:20 free_cma:0
[...]
[ 33.092023] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 33.092023]
[ 33.092023] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0-oomtimeout5-00001-g667aff689092 #582
[ 33.092023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 33.092023] 0000000000000101 ffff880007803da8 ffffffff81538b24 ffffffff8108a645
[ 33.092023] ffffffff817ee27d ffff880007803e28 ffffffff81537ac6 ffff880007803e08
[ 33.092023] ffff880000000008 ffff880007803e38 ffff880007803dd8 ffffffff8108a645
[ 33.092023] Call Trace:
[ 33.092023] <IRQ> [<ffffffff81538b24>] dump_stack+0x4f/0x7b
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff81537ac6>] panic+0xbe/0x1eb
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[ 33.092023] [<ffffffff810fc596>] delayed_panic_on_oom+0x35/0x35
[ 33.092023] [<ffffffff8109866f>] call_timer_fn+0xa7/0x1d0
[ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[ 33.092023] [<ffffffff81098eb6>] run_timer_softirq+0x22d/0x2a1
[ 33.092023] [<ffffffff810463f3>] __do_softirq+0x141/0x322
[ 33.092023] [<ffffffff810467af>] irq_exit+0x6f/0xb6

TODO: Documentation update
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/oom.h | 1 +
kernel/sysctl.c | 8 ++++++++
mm/oom_kill.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 061e0ffd3493..6884c8dc37a0 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d6fff89b78db..3ac2e5d0b1e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = {
.extra2 = &two,
},
{
+ .procname = "panic_on_oom_timeout",
+ .data = &sysctl_panic_on_oom_timeout,
+ .maxlen = sizeof(sysctl_panic_on_oom_timeout),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "oom_kill_allocating_task",
.data = &sysctl_oom_kill_allocating_task,
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fb1275e200..d48e02627ad0 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
#include <trace/events/oom.h>

int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

@@ -430,6 +431,16 @@ void mark_oom_victim(struct task_struct *tsk)
atomic_inc(&oom_victims);
}

+static void delayed_panic_on_oom(unsigned long unused)
+{
+ /* We are in the irq context so we cannot call dump_header */
+ pr_info("panic_on_oom timeout %ds has expired\n", sysctl_panic_on_oom_timeout);
+ show_mem(SHOW_MEM_FILTER_NODES);
+ panic("Out of memory: system-wide panic_on_oom is enabled\n");
+}
+
+static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0);
+
/**
* exit_oom_victim - note the exit of an OOM victim
*/
@@ -437,8 +448,10 @@ void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

- if (!atomic_dec_return(&oom_victims))
+ if (!atomic_dec_return(&oom_victims)) {
+ del_timer(&panic_on_oom_timer);
wake_up_all(&oom_victims_wait);
+ }
}

/**
@@ -539,7 +552,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
- return;
+ goto no_oom_victim;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
@@ -581,6 +594,15 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
put_task_struct(victim);
+ return;
+no_oom_victim:
+ /*
+ * If there is no oom victim selected (e.g. we might have raced with
+ * an exiting task) then cancel delayed oom panic handler.
+ */
+ if (!atomic_read(&oom_victims))
+ del_timer(&panic_on_oom_timer);
+
}
#undef K

@@ -624,6 +646,24 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
if (constraint != CONSTRAINT_NONE)
return;
}
+
+ if (sysctl_panic_on_oom_timeout) {
+ if (sysctl_panic_on_oom > 1) {
+ pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
+ } else {
+ /*
+ * Only schedule the delayed panic_on_oom when this is
+ * the first OOM triggered. oom_lock will protect us
+ * from races
+ */
+ if (atomic_read(&oom_victims))
+ return;
+
+ mod_timer(&panic_on_oom_timer,
+ jiffies + (sysctl_panic_on_oom_timeout * HZ));
+ return;
+ }
+ }
dump_header(NULL, gfp_mask, order, memcg, nodemask);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
--
2.1.4


--
Michal Hocko
SUSE Labs

2015-06-17 12:17:20

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

Michal Hocko wrote a few minutes ago:
> Subject: [RFC -v2] panic_on_oom_timeout

Oops, we raced...

Michal Hocko wrote:
> On Tue 16-06-15 22:14:28, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> > > > This patch implements system_memdie_panic_secs sysctl which configures
> > > > a maximum timeout for the OOM killer to resolve the OOM situation.
> > > > If the system is still under OOM (i.e. the OOM victim cannot release
> > > > memory) after the timeout expires, it will panic the system. A
> > > > reasonably chosen timeout can protect from both temporal OOM conditions
> > > > and allows to have a predictable time frame for the OOM condition.
> > > >
> > > > Since there are memcg OOM, cpuset OOM, mempolicy OOM as with system OOM,
> > > > this patch also implements {memcg,cpuset,mempolicy}_memdie_panic_secs .
> > >
> > > I really hate having so many knobs. What would they be good for? Why
> > > cannot you simply use a single timeout and decide whether to panic or
> > > not based on panic_on_oom value? Or do you have any strong reason to
> > > put this aside from panic_on_oom?
> > >
> >
> > The reason would depend on
> >
> > (a) whether {memcg,cpuset,mempolicy} OOM stall is possible
> >
> > (b) what {memcg,cpuset,mempolicy} users want to do when (a) is possible
> > and {memcg,cpuset,mempolicy} OOM stall occurred
>
> The system as such is still usable. And an administrator might
> intervene. E.g. enlarge the memcg limit or relax the numa restrictions
> for the same purpose.
>
> > Since memcg OOM is less critical than system OOM because administrator still
> > has chance to perform steps to resolve the OOM state, we could give longer
> > timeout (e.g. 600 seconds) for memcg OOM while giving shorter timeout (e.g.
> > 10 seconds) for system OOM. But if (a) is impossible, trying to configure
> > different timeout for non-system OOM stall makes no sense.
>
> I still do not see any point in separate timeouts.
>
I think that the administrator cannot configure an adequate timeout if we don't
allow separate timeouts.

> Again panic_on_oom=2 sounds very dubious to me as already mentioned. The
> life would be so much easier if we simply start by supporting
> panic_on_oom=1 for now. It would be a simple timer (as we cannot use
> DELAYED_WORK) which would just panic the machine after a timeout. We

My patch recommends that administrators stop setting panic_on_oom to a non-zero
value and start setting separate timeouts, one for system OOM (a short
timeout) and the other for non-system OOM (a long timeout).

How does my patch involve panic_on_oom?
My patch does not care about the dubious panic_on_oom=2.



> > > Besides that oom_unkillable_task doesn't sound like a good match to
> > > evaluate this logic. I would expect it to be in oom_scan_process_thread.
> >
> > Well, select_bad_process() which calls oom_scan_process_thread() would
> > break out from the loop when encountering the first TIF_MEMDIE task.
> > We need to change
> >
> > case OOM_SCAN_ABORT:
> > rcu_read_unlock();
> > return (struct task_struct *)(-1UL);
> >
> > to defer returning of (-1UL) when a TIF_MEMDIE thread was found, in order to
> > make sure that all TIF_MEMDIE threads are examined for timeout. With that
> > change made,
> >
> > if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> > /*** this location ***/
> > if (!force_kill)
> > return OOM_SCAN_ABORT;
> > }
> >
> > in oom_scan_process_thread() will be an appropriate place for evaluating
> > this logic.
>
> You can also keep select_bad_process untouched and simply check the
> remaining TIF_MEMDIE tasks in oom_scan_process_thread (if the timeout is > 0
> of course so the most configurations will be unaffected).

Most configurations will be unaffected because there is usually no
TIF_MEMDIE thread. But if something went wrong and there were 100 TIF_MEMDIE
threads out of 10000 threads, traversing the tasklist from
oom_scan_process_thread() whenever a TIF_MEMDIE thread is found sounds
wasteful to me. If we keep traversing from select_bad_process(), the number
of threads to check remains 10000.

----------------------------------------
>From abc7d9dcf76ec32844d131ac6d6cf8d1c06427c2 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Wed, 17 Jun 2015 21:05:06 +0900
Subject: [RFC] oom: implement memdie_task_panic_secs

OOM killer is a desperate last resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% and may
result in an unusable or a locked up system.

panic_on_oom sysctl knob allows to set the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. reduce the downtime to a predictable
amount of time, allow to get a crash dump of the system and debug the
issue post-mortem.

panic_on_oom is, however, a big hammer in many situations when the
OOM condition could be resolved in a reasonable time. So it would be
good to have some middle ground and allow the OOM killer to do its job
but have a failover when things go wrong and it is not able to make any
further progress for a considerable amount of time.

This patch implements system_oom_panic_secs sysctl which configures
a maximum timeout for the OOM killer to resolve the OOM situation.
If the system is still under OOM (i.e. the OOM victim cannot release
memory) after the timeout expires, it will panic the system. A
reasonably chosen timeout can protect from both temporal OOM conditions
and allows to have a predictable time frame for the OOM condition.

Since there are memcg OOM and mempolicy OOM in addition to system OOM,
this patch also implements {memcg,mempolicy}_oom_panic_secs. These allow
the administrator to use different timeout settings for each type of OOM,
because for the non-system cases the administrator still has a chance to
perform steps to resolve the potential lockup or thrashing from the global
context (e.g. by relaxing restrictions or even rebooting cleanly).

Note that this patch obsoletes panic_on_oom. In other words, this
patch recommends that administrators keep panic_on_oom at 0 and set
{system,memcg,mempolicy}_oom_panic_secs to non-0.

Signed-off-by: Tetsuo Handa <[email protected]>
---
include/linux/oom.h | 7 ++++++
include/linux/sched.h | 1 +
kernel/sysctl.c | 31 +++++++++++++++++++++++
mm/memcontrol.c | 13 +++++-----
mm/oom_kill.c | 68 ++++++++++++++++++++++++++++++++++++++++++++++++---
5 files changed, 111 insertions(+), 9 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7..1c7637b 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -99,4 +99,11 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern unsigned long sysctl_system_oom_panic_secs;
+#ifdef CONFIG_MEMCG
+extern unsigned long sysctl_memcg_oom_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+extern unsigned long sysctl_mempolicy_oom_panic_secs;
+#endif
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/include/linux/sched.h b/include/linux/sched.h
index d505bca..333bb3a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1784,6 +1784,7 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+ unsigned long memdie_start;
};

/* Future-safe accessor for struct task_struct's cpus_allowed. */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c566b56..27c7c93 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -144,6 +144,9 @@ static const int cap_last_cap = CAP_LAST_CAP;
static unsigned long hung_task_timeout_max = (LONG_MAX/HZ);
#endif

+/* Used by proc_doulongvec_minmax of sysctl_*_oom_panic_secs. */
+static unsigned long wait_timeout_max = (LONG_MAX/HZ);
+
#ifdef CONFIG_INOTIFY_USER
#include <linux/inotify.h>
#endif
@@ -1535,6 +1538,34 @@ static struct ctl_table vm_table[] = {
.mode = 0644,
.proc_handler = proc_doulongvec_minmax,
},
+ {
+ .procname = "system_oom_panic_secs",
+ .data = &sysctl_system_oom_panic_secs,
+ .maxlen = sizeof(sysctl_system_oom_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#ifdef CONFIG_MEMCG
+ {
+ .procname = "memcg_oom_panic_secs",
+ .data = &sysctl_memcg_oom_panic_secs,
+ .maxlen = sizeof(sysctl_memcg_oom_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#endif
+#ifdef CONFIG_NUMA
+ {
+ .procname = "mempolicy_oom_panic_secs",
+ .data = &sysctl_mempolicy_oom_panic_secs,
+ .maxlen = sizeof(sysctl_mempolicy_oom_panic_secs),
+ .mode = 0644,
+ .proc_handler = proc_doulongvec_minmax,
+ .extra2 = &wait_timeout_max,
+ },
+#endif
{ }
};

diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c5..b68f3a4 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1550,6 +1550,7 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
unsigned long totalpages;
unsigned int points = 0;
struct task_struct *chosen = NULL;
+ bool memdie_pending = false;

mutex_lock(&oom_lock);

@@ -1583,11 +1584,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
case OOM_SCAN_CONTINUE:
continue;
case OOM_SCAN_ABORT:
- css_task_iter_end(&it);
- mem_cgroup_iter_break(memcg, iter);
- if (chosen)
- put_task_struct(chosen);
- goto unlock;
+ memdie_pending = true;
+ continue;
case OOM_SCAN_OK:
break;
};
@@ -1608,7 +1606,10 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
css_task_iter_end(&it);
}

- if (chosen) {
+ if (memdie_pending) {
+ if (chosen)
+ put_task_struct(chosen);
+ } else if (chosen) {
points = chosen_points * 1000 / totalpages;
oom_kill_process(chosen, gfp_mask, order, points, totalpages,
memcg, NULL, "Memory cgroup out of memory");
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..51e127c 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -43,6 +43,13 @@ int sysctl_panic_on_oom;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

+unsigned long sysctl_system_oom_panic_secs;
+#ifdef CONFIG_MEMCG
+unsigned long sysctl_memcg_oom_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+unsigned long sysctl_mempolicy_oom_panic_secs;
+#endif
DEFINE_MUTEX(oom_lock);

#ifdef CONFIG_NUMA
@@ -118,6 +125,55 @@ found:
return t;
}

+/**
+ * check_memdie_task - check that a task is not stuck with TIF_MEMDIE set.
+ *
+ * @p: Pointer to "struct task_struct".
+ * @memcg: Pointer to "struct mem_cgroup". May be NULL.
+ * @nodemask: Pointer to "nodemask_t". May be NULL.
+ *
+ * Setting the TIF_MEMDIE flag on @p disables the OOM killer. However, @p could
+ * get stuck due to a dependency which is invisible to the OOM killer. When @p
+ * gets stuck, the system will stall for an unpredictable duration (presumably
+ * forever) because the OOM killer is kept disabled.
+ *
+ * If @p has remained stuck for
+ * /proc/sys/vm/{system,memcg,mempolicy}_oom_panic_secs seconds,
+ * this function triggers a kernel panic.
+ * Setting {memcg,mempolicy}_oom_panic_secs to 0 makes the respective
+ * OOM types fall back to the system_oom_panic_secs setting.
+ * Setting system_oom_panic_secs to 0 disables this check.
+ */
+static void check_memdie_task(struct task_struct *p, struct mem_cgroup *memcg,
+ const nodemask_t *nodemask)
+{
+ unsigned long start = p->memdie_start;
+ unsigned long spent;
+ unsigned long timeout = 0;
+
+ /* If task does not have TIF_MEMDIE flag, there is nothing to do. */
+ if (!start)
+ return;
+ spent = jiffies - start;
+#ifdef CONFIG_MEMCG
+ if (memcg)
+ timeout = sysctl_memcg_oom_panic_secs;
+#endif
+#ifdef CONFIG_NUMA
+ if (nodemask)
+ timeout = sysctl_mempolicy_oom_panic_secs;
+#endif
+ if (!timeout)
+ timeout = sysctl_system_oom_panic_secs;
+ /* If timeout is disabled, there is nothing to do. */
+ if (!timeout)
+ return;
+ if (time_before(spent, timeout * HZ))
+ return;
+ panic("Out of memory: %s (%u) did not die within %lu seconds.\n",
+ p->comm, p->pid, timeout);
+}
+
/* return true if the task is not adequate as candidate victim task. */
static bool oom_unkillable_task(struct task_struct *p,
struct mem_cgroup *memcg, const nodemask_t *nodemask)
@@ -135,6 +191,7 @@ static bool oom_unkillable_task(struct task_struct *p,
if (!has_intersects_mems_allowed(p, nodemask))
return true;

+ check_memdie_task(p, memcg, nodemask);
return false;
}

@@ -299,6 +356,7 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
struct task_struct *g, *p;
struct task_struct *chosen = NULL;
unsigned long chosen_points = 0;
+ bool memdie_pending = false;

rcu_read_lock();
for_each_process_thread(g, p) {
@@ -313,8 +371,8 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
case OOM_SCAN_CONTINUE:
continue;
case OOM_SCAN_ABORT:
- rcu_read_unlock();
- return (struct task_struct *)(-1UL);
+ memdie_pending = true;
+ continue;
case OOM_SCAN_OK:
break;
};
@@ -328,7 +386,9 @@ static struct task_struct *select_bad_process(unsigned int *ppoints,
chosen = p;
chosen_points = points;
}
- if (chosen)
+ if (memdie_pending)
+ chosen = (struct task_struct *)(-1UL);
+ else if (chosen)
get_task_struct(chosen);
rcu_read_unlock();

@@ -420,6 +480,8 @@ void mark_oom_victim(struct task_struct *tsk)
/* OOM killer might race with memcg OOM */
if (test_and_set_tsk_thread_flag(tsk, TIF_MEMDIE))
return;
+ /* Set current time for check_memdie_task() check. */
+ tsk->memdie_start = jiffies;
/*
* Make sure that the task is woken up from uninterruptible sleep
* if it is frozen because OOM killer wouldn't be able to free
--
1.8.3.1

2015-06-17 12:32:06

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> Hi,
> I was thinking about this and I am more and more convinced that we
> shouldn't care about panic_on_oom=2 configuration for now and go with
> the simplest solution first. I have revisited my original patch and
> replaced delayed work by a timer based on the feedback from Tetsuo.
>

To me, obsoleting panic_on_oom > 0 sounds cleaner.

> I think we can rely on timers. A downside would be that we cannot dump
> the full OOM report from the IRQ context because we rely on task_lock
> which is not IRQ safe. But I do not think we really need it. An OOM
> report will be in the log already most of the time and show_mem will
> tell us the current memory situation.
>
> What do you think?

We can rely on timers, but we can't rely on global timer.

> + if (sysctl_panic_on_oom_timeout) {
> + if (sysctl_panic_on_oom > 1) {
> + pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
> + } else {
> + /*
> + * Only schedule the delayed panic_on_oom when this is
> + * the first OOM triggered. oom_lock will protect us
> + * from races
> + */
> + if (atomic_read(&oom_victims))
> + return;
> +
> + mod_timer(&panic_on_oom_timer,
> + jiffies + (sysctl_panic_on_oom_timeout * HZ));
> + return;
> + }
> + }

Since this version uses global panic_on_oom_timer, you cannot handle
OOM race like below.

(1) p1 in memcg1 calls out_of_memory().
(2) 5 seconds of timeout is started by p1.
(3) p1 takes 3 seconds for some reason.
(4) p2 in memcg2 calls out_of_memory().
(5) p1 calls unmark_oom_victim() but timer continues.
(6) p2 takes 2 seconds for some reason.
(7) 5 seconds of timeout expires even though each individual delay was less than
5 seconds.

2015-06-17 12:37:06

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC] panic_on_oom_timeout

On Wed 17-06-15 21:16:37, Tetsuo Handa wrote:
> Michal Hocko wrote a few minutes ago:
> > Subject: [RFC -v2] panic_on_oom_timeout
>
> Oops, we raced...
>
> Michal Hocko wrote:
> > On Tue 16-06-15 22:14:28, Tetsuo Handa wrote:
[...]
> > > Since memcg OOM is less critical than system OOM because administrator still
> > > has chance to perform steps to resolve the OOM state, we could give longer
> > > timeout (e.g. 600 seconds) for memcg OOM while giving shorter timeout (e.g.
> > > 10 seconds) for system OOM. But if (a) is impossible, trying to configure
> > > different timeout for non-system OOM stall makes no sense.
> >
> > I still do not see any point for a separate timeouts.
> >
> I think that administrator cannot configure adequate timeout if we don't allow
> separate timeouts.

Why? What prevents a user space policy if the system is still usable?

> > Again panic_on_oom=2 sounds very dubious to me as already mentioned. The
> > life would be so much easier if we simply start by supporting
> > panic_on_oom=1 for now. It would be a simple timer (as we cannot use
> > DELAYED_WORK) which would just panic the machine after a timeout. We
>
> My patch recommends administrators to stop setting panic_on_oom to a non-zero
> value and to start setting separate timeouts, one for system OOM (short
> timeout) and the other for non-system OOM (long timeout).
>
> How does my patch involve panic_on_oom ?

You are panicking the system on an OOM condition. It feels natural to bind a
timeout based policy to this knob.

> My patch does not care about dubious panic_on_oom=2.

Yes, it replaces it by additional timeouts which seems an overkill to
me.

> > > > Besides that oom_unkillable_task doesn't sound like a good match to
> > > > evaluate this logic. I would expect it to be in oom_scan_process_thread.
> > >
> > > Well, select_bad_process() which calls oom_scan_process_thread() would
> > > break out from the loop when encountering the first TIF_MEMDIE task.
> > > We need to change
> > >
> > > case OOM_SCAN_ABORT:
> > > rcu_read_unlock();
> > > return (struct task_struct *)(-1UL);
> > >
> > > to defer returning of (-1UL) when a TIF_MEMDIE thread was found, in order to
> > > make sure that all TIF_MEMDIE threads are examined for timeout. With that
> > > change made,
> > >
> > > if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
> > > /*** this location ***/
> > > if (!force_kill)
> > > return OOM_SCAN_ABORT;
> > > }
> > >
> > > in oom_scan_process_thread() will be an appropriate place for evaluating
> > > this logic.
> >
> > You can also keep select_bad_process untouched and simply check the
> > remaining TIF_MEMDIE tasks in oom_scan_process_thread (if the timeout is > 0
> > of course so the most configurations will be unaffected).
>
> Most configurations will be unaffected because there is usually no
> TIF_MEMDIE thread. But if something went wrong and there were 100 TIF_MEMDIE
> threads out of 10000 threads, traversing the tasklist from
> oom_scan_process_thread() whenever finding a TIF_MEMDIE thread sounds
> wasteful to me. If we keep traversing from select_bad_process(), the number
> of threads to check remains 10000.

Yes, but the code would be uglier and duplicated for memcg and global case.
Also this is an extremely slow path so optimization to skip scanning
some tasks is not worth making the code more obscure.

[...]
> @@ -1583,11 +1584,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
> case OOM_SCAN_CONTINUE:
> continue;
> case OOM_SCAN_ABORT:
> - css_task_iter_end(&it);
> - mem_cgroup_iter_break(memcg, iter);
> - if (chosen)
> - put_task_struct(chosen);
> - goto unlock;
> + memdie_pending = true;
> + continue;
> case OOM_SCAN_OK:
> break;
> };

OOM_SCAN_ABORT can be returned even for a !TIF_MEMDIE task, so you might
force a victim selection in a case where there is an exiting task and we
could have delayed the actual killing.
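
For reference, oom_scan_process_thread() has two OOM_SCAN_ABORT paths; the
following is a from-memory sketch of the relevant lines (not a verbatim
quote), and it is the second one that matters here:

        if (test_tsk_thread_flag(task, TIF_MEMDIE)) {
                if (!force_kill)
                        return OOM_SCAN_ABORT; /* a victim is already dying */
        }
        /* ... */
        if (task_will_free_mem(task) && !force_kill)
                return OOM_SCAN_ABORT; /* exiting task without TIF_MEMDIE */
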
--
Michal Hocko
SUSE Labs

2015-06-17 12:51:42

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

On Wed 17-06-15 21:31:21, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > I think we can rely on timers. A downside would be that we cannot dump
> > the full OOM report from the IRQ context because we rely on task_lock
> > which is not IRQ safe. But I do not think we really need it. An OOM
> > report will be in the log already most of the time and show_mem will
> > tell us the current memory situation.
> >
> > What do you think?
>
> We can rely on timers, but we can't rely on global timer.

Why not?

>
> > + if (sysctl_panic_on_oom_timeout) {
> > + if (sysctl_panic_on_oom > 1) {
> > + pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
> > + } else {
> > + /*
> > + * Only schedule the delayed panic_on_oom when this is
> > + * the first OOM triggered. oom_lock will protect us
> > + * from races
> > + */
> > + if (atomic_read(&oom_victims))
> > + return;
> > +
> > + mod_timer(&panic_on_oom_timer,
> > + jiffies + (sysctl_panic_on_oom_timeout * HZ));
> > + return;
> > + }
> > + }
>
> Since this version uses global panic_on_oom_timer, you cannot handle
> OOM race like below.
>
> (1) p1 in memcg1 calls out_of_memory().
> (2) 5 seconds of timeout is started by p1.
> (3) p1 takes 3 seconds for some reason.
> (4) p2 in memcg2 calls out_of_memory().
> (5) p1 calls unmark_oom_victim() but timer continues.
> (6) p2 takes 2 seconds for some reason.
> (7) 5 seconds of timeout expires even though each individual delay was less than
> 5 seconds.

Yes it is not intended to handle such a race. Timeout is completely
ignored for panic_on_oom=2 and a constrained oom context doesn't trigger
this path for panic_on_oom=1.

But you have a point that we could have
- constrained OOM which elevates oom_victims
- global OOM killer strikes but wouldn't start the timer

This is certainly possible and timer_pending(&panic_on_oom) replacing
oom_victims check should help here. I will think about this some more.
But this sounds like a minor detail.

The important thing is to decide what is the reasonable way forward. We
have two implementations of panic based timeout. So we should decide
- Should be the timeout bound to panic_on_oom?
- Should we care about constrained OOM contexts?
- If yes should they use the same timeout?
- If yes should each memcg be able to define its own timeout?

My thinking is that it should be bound to panic_on_oom=1 only until we
hear from somebody actually asking for a constrained oom and even then
do not allow for too large configuration space (e.g. no per-memcg
timeout) or have separate mempolicy vs. memcg timeouts.

Let's start simple and make things more complicated later!

--
Michal Hocko
SUSE Labs

2015-06-17 13:24:40

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

On Wed 17-06-15 14:51:27, Michal Hocko wrote:
[...]
> The important thing is to decide what is the reasonable way forward. We
> have two implementations of panic based timeout. So we should decide

And the most obvious question, of course.
- Should we add a panic timeout at all?

> - Should be the timeout bound to panic_on_oom?
> - Should we care about constrained OOM contexts?
> - If yes should they use the same timeout?
> - If yes should each memcg be able to define its own timeout?
^ no

> My thinking is that it should be bound to panic_on_oom=1 only until we
> hear from somebody actually asking for a constrained oom and even then
> do not allow for too large configuration space (e.g. no per-memcg
> timeout) or have separate mempolicy vs. memcg timeouts.
>
> Let's start simple and make things more complicated later!

--
Michal Hocko
SUSE Labs

2015-06-17 14:00:32

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> > > + if (sysctl_panic_on_oom_timeout) {
> > > + if (sysctl_panic_on_oom > 1) {
> > > + pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
> > > + } else {
> > > + /*
> > > + * Only schedule the delayed panic_on_oom when this is
> > > + * the first OOM triggered. oom_lock will protect us
> > > + * from races
> > > + */
> > > + if (atomic_read(&oom_victims))
> > > + return;
> > > +
> > > + mod_timer(&panic_on_oom_timer,
> > > + jiffies + (sysctl_panic_on_oom_timeout * HZ));
> > > + return;
> > > + }
> > > + }
> >
> > Since this version uses global panic_on_oom_timer, you cannot handle
> > OOM race like below.
> >
> > (1) p1 in memcg1 calls out_of_memory().
> > (2) 5 seconds of timeout is started by p1.
> > (3) p1 takes 3 seconds for some reason.
> > (4) p2 in memcg2 calls out_of_memory().
> > (5) p1 calls unmark_oom_victim() but timer continues.
> > (6) p2 takes 2 seconds for some reason.
> > (7) 5 seconds of timeout expires even though each individual delay was less than
> > 5 seconds.
>
> Yes it is not intended to handle such a race. Timeout is completely
> ignored for panic_on_oom=2 and a constrained oom context doesn't trigger
> this path for panic_on_oom=1.
>
Oops.

> But you have a point that we could have
> - constrained OOM which elevates oom_victims
> - global OOM killer strikes but wouldn't start the timer
>
> This is certainly possible and timer_pending(&panic_on_oom) replacing
> oom_victims check should help here. I will think about this some more.

Yes, please.



> The important thing is to decide what is the reasonable way forward. We
> have two implementations of panic based timeout. So we should decide
> - Should we add a panic timeout at all?
> - Should be the timeout bound to panic_on_oom?
> - Should we care about constrained OOM contexts?
> - If yes should they use the same timeout?
> - If no should each memcg be able to define its own timeout?
>
Exactly.

> My thinking is that it should be bound to panic_on_oom=1 only until we
> hear from somebody actually asking for a constrained oom and even then
> do not allow for too large configuration space (e.g. no per-memcg
> timeout) or have separate mempolicy vs. memcg timeouts.
>
My implementation comes from providing debugging hints when analyzing the
vmcore of a stalled system. I'm posting logs of stalls after the global OOM
killer struck because it is easy to reproduce. But what I have a problem with
is when a system stalls before the OOM killer strikes (I saw many cases
on customers' enterprise servers), for we don't have hints for guessing
whether the memory allocator is the cause or not. Thus, my version tried to
emit warning messages using sysctl_memalloc_task_warn_secs .

The ability to take care of constrained OOM contexts is a side effect of
using a per "struct task_struct" variable. Even if we come to a conclusion
that we should not add a timeout for panic, I hope that a timeout for warning
about memory allocation stalls is added.
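
What I have in mind for the warning part is roughly the following untested
sketch. sysctl_memalloc_task_warn_secs is a hypothetical knob from my version
(not an existing interface), and alloc_start would be the jiffies recorded
when __alloc_pages_slowpath() was entered:

/* Untested sketch: warn about a stalling allocation instead of panicking. */
unsigned long sysctl_memalloc_task_warn_secs;

static void check_memalloc_stall(unsigned long alloc_start)
{
        unsigned long timeout = sysctl_memalloc_task_warn_secs * HZ;

        if (!timeout)
                return;         /* warning disabled */
        if (time_before(jiffies, alloc_start + timeout))
                return;         /* not stalled for long enough yet */
        pr_warn("MemAlloc: %s (%u) stalled for %lu seconds\n",
                current->comm, current->pid, (jiffies - alloc_start) / HZ);
        dump_stack();
}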

> Let's start simple and make things more complicated later!

I think we have a mismatch about what the timeout counters are for.

2015-06-17 15:42:07

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

On Wed 17-06-15 22:59:54, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > But you have a point that we could have
> > - constrained OOM which elevates oom_victims
> > - global OOM killer strikes but wouldn't start the timer
> >
> > This is certainly possible and timer_pending(&panic_on_oom) replacing
> > oom_victims check should help here. I will think about this some more.
>
> Yes, please.

Fixed in my local version. I will post the new version of the patch
after we settle with the approach.

> > The important thing is to decide what is the reasonable way forward. We
> > have two implementations of panic based timeout. So we should decide
> > - Should we add a panic timeout at all?
> > - Should be the timeout bound to panic_on_oom?
> > - Should we care about constrained OOM contexts?
> > - If yes should they use the same timeout?
> > - If no should each memcg be able to define its own timeout?
> >
> Exactly.
>
> > My thinking is that it should be bound to panic_on_oom=1 only until we
> > hear from somebody actually asking for a constrained oom and even then
> > do not allow for too large configuration space (e.g. no per-memcg
> > timeout) or have separate mempolicy vs. memcg timeouts.
>
> My implementation comes from providing debugging hints when analyzing
> vmcore of a stalled system. I'm posting logs of stalls after global OOM
> killer struck because it is easy to reproduce. But what I have problem
> is when a system stalled before the OOM killer strikes (I saw many cases
> for customer's enterprise servers), for we don't have hints for guessing
> whether memory allocator is the cause or not. Thus, my version tried to
> emit warning messages using sysctl_memalloc_task_warn_secs .

I can understand your frustration but I believe that a debugability is
a separate topic and we should start by defining a reasonable _policy_
so that an administrator has a way to handle potential OOM stalls
reasonably and with a well defined semantic.

> Ability to take care of constrained OOM contexts is a side effect of use of
> per a "struct task_struct" variable. Even if we come to a conclusion that
> we should not add a timeout for panic, I hope that a timeout for warning
> about memory allocation stalls is added.
>
> > Let's start simple and make things more complicated later!
>
> I think we have a mismatch about what the timeout counters are for.

--
Michal Hocko
SUSE Labs

2015-06-19 11:30:51

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> On Wed 17-06-15 22:59:54, Tetsuo Handa wrote:
> > Michal Hocko wrote:
> [...]
> > > But you have a point that we could have
> > > - constrained OOM which elevates oom_victims
> > > - global OOM killer strikes but wouldn't start the timer
> > >
> > > This is certainly possible and timer_pending(&panic_on_oom) replacing
> > > oom_victims check should help here. I will think about this some more.
> >
> > Yes, please.
>
> Fixed in my local version. I will post the new version of the patch
> after we settle with the approach.
>

I'd like to see it now, for it looks to me that it is very difficult to expire
your timeout with reasonable precision.

I think that you changed

/*
* Only schedule the delayed panic_on_oom when this is
* the first OOM triggered. oom_lock will protect us
* from races
*/
if (atomic_read(&oom_victims))
return;

to something like

/*
* Schedule the delayed panic_on_oom if timer is not active.
* oom_lock will protect us from races.
*/
if (timer_pending(&panic_on_oom_timer))
return;

. But such change alone sounds far from sufficient.

We are trying to activate the panic_on_oom_timer only when
sysctl_panic_on_oom == 1 and a global OOM occurred, because you add
this block after

if (likely(!sysctl_panic_on_oom))
return;
if (sysctl_panic_on_oom != 2) {
/*
* panic_on_oom == 1 only affects CONSTRAINT_NONE, the kernel
* does not panic for cpuset, mempolicy, or memcg allocation
* failures.
*/
if (constraint != CONSTRAINT_NONE)
return;
}

check. That part is fine.

But oom_victims is incremented via mark_oom_victim() for both global OOM
and non-global OOM, isn't it? Then, I think that more difficult part is
exit_oom_victim().

We can hit a sequence like

(1) Task1 in memcg1 hits memcg OOM.
(2) Task1 gets TIF_MEMDIE and increments oom_victims.
(3) Task2 hits global OOM.
(4) Task2 activates 10 seconds of timeout.
(5) Task2 gets TIF_MEMDIE and increments oom_victims.
(6) Task2 remained unkillable for 1 second since (5).
(7) Task2 calls exit_oom_victim().
(8) Task2 drops TIF_MEMDIE and decrements oom_victims.
(9) panic_on_oom_timer is not deactivated because oom_victims > 0.
(10) Task1 remains unkillable for 10 seconds since (2).
(11) panic_on_oom_timer expires and the system will panic while
the system is no longer under global OOM.

if we deactivate panic_on_oom_timer like

void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

- if (!atomic_dec_return(&oom_victims))
+ if (!atomic_dec_return(&oom_victims)) {
+ del_timer(&panic_on_oom_timer);
wake_up_all(&oom_victims_wait);
+ }
}

.

On the other hand, we can hit a sequence like

(1) Task1 in memcg1 hits memcg OOM.
(2) Task1 gets TIF_MEMDIE and increments oom_victims.
(3) Task2 hits system OOM.
(4) Task2 activates 10 seconds of timeout.
(5) Task2 gets TIF_MEMDIE and increments oom_victims.
(6) Task1 remained unkillable for 9 seconds since (2).
(7) Task1 calls exit_oom_victim().
(8) Task1 drops TIF_MEMDIE and decrements oom_victims.
(9) panic_on_oom_timer is deactivated.
(10) Task3 hits system OOM.
(11) Task3 again activates 10 seconds of timeout.
(12) Task2 remains unkillable for 19 seconds since (5).
(13) panic_on_oom_timer expires and the system will panic, but
the expected timeout is 10 seconds while actual timeout is
19 seconds.

if we deactivate panic_on_oom_timer like

void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

+ del_timer(&panic_on_oom_timer);
if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
}

.

If we want to avoid premature or over-delayed timeout, I think we need to
update timeout at exit_oom_victim() by doing something like

void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

+ /*
+ * If current thread got TIF_MEMDIE due to global OOM, we need to
+ * update panic_on_oom_timer to "jiffies till the nearest timeout
+ * of all threads which got TIF_MEMDIE due to global OOM" and
+ * delete panic_on_oom_timer if "there is no more threads which
+ * got TIF_MEMDIE due to global OOM".
+ */
+ if (/* Was I OOM-killed due to global OOM? */) {
+ mutex_lock(&oom_lock); /* oom_lock needed for avoiding race. */
+ if (/* Am I the last thread ? */) {
+ del_timer(&panic_on_oom_timer);
+ else
+ mod_timer(&panic_on_oom_timer,
+ /* jiffies of the nearest timeout */);
+ mutex_unlock(&oom_lock);
+ }
if (!atomic_dec_return(&oom_victims))
wake_up_all(&oom_victims_wait);
}

but we don't have hint for finding global OOM victims from all TIF_MEMDIE
threads and when is the nearest timeout among all global OOM victims. We
need to keep track of per global OOM victim's timeout (e.g.
"struct task_struct"->memdie_start ) ?

Moreover, mark_oom_victim(current) at

if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
goto out;
}

lacks information for "Was I OOM-killed due to global OOM?". We need to keep
track of per mm timeout (e.g. "struct mm_struct"->memdie_start ) so that
we can recalculate the nearest timeout among all global OOM victims?
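
For example, a helper like the one below could remember the oldest timestamp
per mm (just a sketch; the memdie_start field in "struct mm_struct" is
hypothetical and locking is ignored):

/*
 * Sketch only: mm->memdie_start is a hypothetical field remembering when
 * the first global OOM victim using this mm got TIF_MEMDIE, so that
 * exit_oom_victim() could recalculate the nearest pending timeout.
 */
static void mm_mark_memdie_start(struct mm_struct *mm, bool global_oom)
{
        if (!global_oom || !mm)
                return;
        /* keep the oldest timestamp so that the timeout is not extended */
        if (!mm->memdie_start)
                mm->memdie_start = jiffies;
}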



Well, do we really need to set TIF_MEMDIE to non-global OOM victims?
I'm wondering how {memcg,cpuset,mempolicy} OOM stall can occur because
there is enough memory (unless global OOM runs concurrently) for any
operations (e.g. XFS filesystem's writeback, workqueue) which non-global
OOM victims might depend on to make forward progress.

> > The reason would depend on
> >
> > (a) whether {memcg,cpuset,mempolicy} OOM stall is possible
> >
> > (b) what {memcg,cpuset,mempolicy} users want to do when (a) is possible
> > and {memcg,cpuset,mempolicy} OOM stall occurred
>
> The system as such is still usable. And an administrator might
> intervene. E.g. enlarge the memcg limit or relax the numa restrictions
> for the same purpose.

If we set TIF_MEMDIE to only global OOM victims, the problem of "when to call
del_timer(&panic_on_oom_timer) at exit_oom_victim()" will go away.



By the way, I think we can replace

if (!atomic_dec_return(&oom_victims))

with

if (atomic_dec_and_test(&oom_victims))

. But this logic puzzles me. The number of threads that are killed by
the OOM killer can be larger than value of oom_victims. This means that
there might be fatal_signal_pending() threads even after oom_victims drops
to 0. Why waiting for only TIF_MEMDIE threads at oom_killer_disable() is
considered sufficient?

2015-06-19 15:36:41

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

On Fri 19-06-15 20:30:10, Tetsuo Handa wrote:
> Michal Hocko wrote:
[...]
> > Fixed in my local version. I will post the new version of the patch
> > after we settle with the approach.
> >
>
> I'd like to see it now,

Sure see below

[...]
> But oom_victims is incremented via mark_oom_victim() for both global OOM
> and non-global OOM, isn't it? Then, I think that more difficult part is
> exit_oom_victim().

Yes you cannot tell which OOM context has killed a particular task. And
it even shouldn't matter, I believe - see below.

> We can hit a sequence like
>
> (1) Task1 in memcg1 hits memcg OOM.
> (2) Task1 gets TIF_MEMDIE and increments oom_victims.
> (3) Task2 hits global OOM.
> (4) Task2 activates 10 seconds of timeout.
> (5) Task2 gets TIF_MEMDIE and increments oom_victims.
> (6) Task2 remained unkillable for 1 second since (5).
> (7) Task2 calls exit_oom_victim().
> (8) Task2 drops TIF_MEMDIE and decrements oom_victims.
> (9) panic_on_oom_timer is not deactivated because oom_victims > 0.
> (10) Task1 remains unkillable for 10 seconds since (2).
> (11) panic_on_oom_timer expires and the system will panic while
> the system is no longer under global OOM.

I find this highly unlikely but yes this is possible. If it really
matters we can check watermarks on all zones and bail out if at least
one is OK from the timer handler.
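
Roughly something like the following in the timer handler (untested sketch;
only an order-0 check, which has the downside discussed further below):

/* Untested sketch: skip the delayed panic if any zone looks usable again. */
static bool any_zone_above_min_watermark(void)
{
        struct zone *zone;

        for_each_populated_zone(zone) {
                /* order-0 check only; a high-order OOM needs more care */
                if (zone_watermark_ok(zone, 0, min_wmark_pages(zone), 0, 0))
                        return true;
        }
        return false;
}

static void delayed_panic_on_oom(unsigned long unused)
{
        if (any_zone_above_min_watermark())
                return; /* some memory was freed in the meantime, do not panic */
        pr_info("panic_on_oom timeout %ds has expired\n",
                sysctl_panic_on_oom_timeout);
        show_mem(SHOW_MEM_FILTER_NODES);
        panic("Out of memory: system-wide panic_on_oom is enabled\n");
}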

[...]
> (1) Task1 in memcg1 hits memcg OOM.
> (2) Task1 gets TIF_MEMDIE and increments oom_victims.
> (3) Task2 hits system OOM.
> (4) Task2 activates 10 seconds of timeout.
> (5) Task2 gets TIF_MEMDIE and increments oom_victims.
> (6) Task1 remained unkillable for 9 seconds since (2).
> (7) Task1 calls exit_oom_victim().
> (8) Task1 drops TIF_MEMDIE and decrements oom_victims.
> (9) panic_on_oom_timer is deactivated.
> (10) Task3 hits system OOM.
> (11) Task3 again activates 10 seconds of timeout.
> (12) Task2 remains unkillable for 19 seconds since (5).
> (13) panic_on_oom_timer expires and the system will panic, but
> the expected timeout is 10 seconds while actual timeout is
> 19 seconds.
>
> if we deactivate panic_on_oom_timer like
>
> void exit_oom_victim(void)
> {
> clear_thread_flag(TIF_MEMDIE);
>
> + del_timer(&panic_on_oom_timer);
> if (!atomic_dec_return(&oom_victims))
> wake_up_all(&oom_victims_wait);
> }

Yes I was thinking about this as well because the primary assumption
of the OOM killer is that the victim will release some memory. And it
doesn't matter whether the OOM killer was constrained or the global
one. So the above looks good at first sight, I am just afraid it is too
relaxed for cases where many tasks are sharing mm.

> If we want to avoid premature or over-delayed timeout, I think we need to
> update timeout at exit_oom_victim() by doing something like
>
> void exit_oom_victim(void)
> {
> clear_thread_flag(TIF_MEMDIE);
>
> + /*
> + * If current thread got TIF_MEMDIE due to global OOM, we need to
> + * update panic_on_oom_timer to "jiffies till the nearest timeout
> + * of all threads which got TIF_MEMDIE due to global OOM" and
> + * delete panic_on_oom_timer if "there is no more threads which
> + * got TIF_MEMDIE due to global OOM".
> + */
> + if (/* Was I OOM-killed due to global OOM? */) {
> + mutex_lock(&oom_lock); /* oom_lock needed for avoiding race. */
> + if (/* Am I the last thread ? */) {
> + del_timer(&panic_on_oom_timer);
> + else
> + mod_timer(&panic_on_oom_timer,
> + /* jiffies of the nearest timeout */);
> + mutex_unlock(&oom_lock);
> + }

I think you are missing an important point. The context which has caused
the killing is not important. As mentioned above even constrained OOM
killer will relieve the global OOM condition as well.

The primary problem that we have is that we do not have any reliable
under_oom() check and we simply try to approximate it by heuristics
which work well enough in most cases. I admit that oom_victims is not
the ideal one and there might be better. As mentioned above we can check
watermarks on all zones and cancel the timer if at least one allows for
an allocation. But there are surely downsides of that approach as well
because the OOM killer could have been triggered for a higher order
allocation and we might be still under OOM for those requests.

There is no simple answer here I am afraid. So let's focus on being
reasonably good and simple rather than complex and still not perfect.

> if (!atomic_dec_return(&oom_victims))
> wake_up_all(&oom_victims_wait);
> }
>
> but we don't have hint for finding global OOM victims from all TIF_MEMDIE
> threads and when is the nearest timeout among all global OOM victims.

> We need to keep track of per global OOM victim's timeout (e.g. "struct
> task_struct"->memdie_start ) ?

I do not think this will help anything. It will just lead to a different
set of corner cases. E.g.

1) mark_oom_victim(T1) memdie_start = jiffies
2) fatal_signal_pending(T2) memdie_start = jiffies + delta
3) T2 releases memory - No OOM anymore
4) out_of_memory - check_memdie_timeout(T1) - KABOOM

[...]

> Well, do we really need to set TIF_MEMDIE to non-global OOM victims?

As already said non-global OOM victim will free some memory as well so
the global OOM killer shouldn't kill new tasks if there is a chance
that another victim will release memory. TIF_MEMDIE acts as a lock.

> I'm wondering how {memcg,cpuset,mempolicy} OOM stall can occur because
> there is enough memory (unless global OOM runs concurrently) for any
> operations (e.g. XFS filesystem's writeback, workqueue) which non-global
> OOM victims might depend on to make forward progress.

Same as for the global case. The victim is uninterruptibly blocked on
a resource/lock/event.

[...]
> By the way, I think we can replace
>
> if (!atomic_dec_return(&oom_victims))
>
> with
>
> if (atomic_dec_and_test(&oom_victims))
>
> . But this logic puzzles me. The number of threads that are killed by
> the OOM killer can be larger than value of oom_victims. This means that
> there might be fatal_signal_pending() threads even after oom_victims drops
> to 0. Why waiting for only TIF_MEMDIE threads at oom_killer_disable() is
> considered sufficient?

Please have a look at c32b3cbe0d06 ("oom, PM: make OOM detection in the
freezer path raceless") which imho explains it.

---
>From a25885b588a245c58ec6e7b38172b6f553f45538 Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Tue, 9 Jun 2015 16:15:42 +0200
Subject: [PATCH] oom: implement panic_on_oom_timeout for panic_on_oom=1

OOM killer is a desperate last resort reclaim attempt to free some
memory. It is based on heuristics which will never be 100% and may
result in an unusable or a locked up system.

panic_on_oom sysctl knob allows to set the OOM policy to panic the
system instead of trying to resolve the OOM condition. This might be
useful for several reasons - e.g. reduce the downtime to a predictable
amount of time, allow to get a crash dump of the system and debug the
issue post-mortem.

panic_on_oom is, however, a big hammer in many situations when the
OOM condition could be resolved in a reasonable time. So it would be
good to have some middle ground and allow the OOM killer to do its job
but have a failover when things go wrong and it is not able to make any
further progress for a considerable amount of time.

This patch implements panic_on_oom_timeout sysctl which is active only
when panic_on_oom=1 and it configures a maximum timeout for the OOM
killer to resolve the OOM situation. If the system is still under OOM
after the timeout expires it will panic the system. A reasonably chosen
timeout can protect from temporary OOM conditions and also provides
a predictable time frame for the OOM condition.

The feature is implemented by a timer which is scheduled when the OOM
condition is declared for the first time (there is no timer scheduled
yet) in check_panic_on_oom and it is canceled in exit_oom_victim after
the oom_victims count drops down to zero.
For this time period OOM killer cannot kill new tasks and it only
allows exiting or killed tasks to access memory reserves (and increase
oom_victims counter via mark_oom_victim) in order to make a progress
so it is reasonable to consider the elevated oom_victims count as an
ongoing OOM condition.

Timeout for panic_on_oom=2 is not currently supported because it
would make the code more complicated and it is not even clear whether
this combination is useful. panic_on_oom=2 is disputable on its own
because the system is still usable at the time so the administrator can
intervene to resolve the OOM situation (e.g. relax NUMA restrictions,
increase memory limit for the oom memcg, reboot/panic the system or
simply selectively kill a task which might be blocking oom killer from
making progress).

The log will then contain something like:
[ 32.071128] Out of memory: Kill process 2998 (mem_eater) score 42 or sacrifice child
[ 32.074022] Killed process 2998 (mem_eater) total-vm:14172kB, anon-rss:628kB, file-rss:4kB
[ 32.091849] mem_eater invoked oom-killer: gfp_mask=0x201da, order=0, oom_score_adj=0
[ 32.103139] mem_eater cpuset=/ mems_allowed=0
[...]
[ 32.354388] Killed process 3001 (mem_eater) total-vm:14172kB, anon-rss:580kB, file-rss:4kB
[ 33.088078] panic_on_oom timeout 1s has expired
[ 33.089062] Mem-Info:
[ 33.089557] active_anon:7991 inactive_anon:8106 isolated_anon:665
[ 33.089557] active_file:183 inactive_file:125 isolated_file:0
[ 33.089557] unevictable:0 dirty:0 writeback:1037 unstable:0
[ 33.089557] slab_reclaimable:2492 slab_unreclaimable:3253
[ 33.089557] mapped:275 shmem:0 pagetables:631 bounce:0
[ 33.089557] free:400 free_pcp:20 free_cma:0
[...]
[ 33.092023] Kernel panic - not syncing: Out of memory: system-wide panic_on_oom is enabled
[ 33.092023]
[ 33.092023] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.0.0-oomtimeout5-00001-g667aff689092 #582
[ 33.092023] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150428_134905-gandalf 04/01/2014
[ 33.092023] 0000000000000101 ffff880007803da8 ffffffff81538b24 ffffffff8108a645
[ 33.092023] ffffffff817ee27d ffff880007803e28 ffffffff81537ac6 ffff880007803e08
[ 33.092023] ffff880000000008 ffff880007803e38 ffff880007803dd8 ffffffff8108a645
[ 33.092023] Call Trace:
[ 33.092023] <IRQ> [<ffffffff81538b24>] dump_stack+0x4f/0x7b
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff81537ac6>] panic+0xbe/0x1eb
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff8108a645>] ? console_unlock+0x337/0x366
[ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[ 33.092023] [<ffffffff810fc596>] delayed_panic_on_oom+0x35/0x35
[ 33.092023] [<ffffffff8109866f>] call_timer_fn+0xa7/0x1d0
[ 33.092023] [<ffffffff810fc561>] ? ftrace_raw_output_oom_score_adj_update+0x6d/0x6d
[ 33.092023] [<ffffffff81098eb6>] run_timer_softirq+0x22d/0x2a1
[ 33.092023] [<ffffffff810463f3>] __do_softirq+0x141/0x322
[ 33.092023] [<ffffffff810467af>] irq_exit+0x6f/0xb6

TODO: Documentation update
Signed-off-by: Michal Hocko <[email protected]>
---
include/linux/oom.h | 1 +
kernel/sysctl.c | 8 ++++++++
mm/oom_kill.c | 44 ++++++++++++++++++++++++++++++++++++++++++--
3 files changed, 51 insertions(+), 2 deletions(-)

diff --git a/include/linux/oom.h b/include/linux/oom.h
index 061e0ffd3493..6884c8dc37a0 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -100,4 +100,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index d6fff89b78db..3ac2e5d0b1e2 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1141,6 +1141,14 @@ static struct ctl_table vm_table[] = {
.extra2 = &two,
},
{
+ .procname = "panic_on_oom_timeout",
+ .data = &sysctl_panic_on_oom_timeout,
+ .maxlen = sizeof(sysctl_panic_on_oom_timeout),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "oom_kill_allocating_task",
.data = &sysctl_oom_kill_allocating_task,
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index d7fb1275e200..fab631f812f5 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
#include <trace/events/oom.h>

int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

@@ -430,6 +431,16 @@ void mark_oom_victim(struct task_struct *tsk)
atomic_inc(&oom_victims);
}

+static void delayed_panic_on_oom(unsigned long unused)
+{
+ /* We are in the irq context so we cannot call dump_header */
+ pr_info("panic_on_oom timeout %ds has expired\n", sysctl_panic_on_oom_timeout);
+ show_mem(SHOW_MEM_FILTER_NODES);
+ panic("Out of memory: system-wide panic_on_oom is enabled\n");
+}
+
+static DEFINE_TIMER(panic_on_oom_timer, delayed_panic_on_oom, 0, 0);
+
/**
* exit_oom_victim - note the exit of an OOM victim
*/
@@ -437,8 +448,10 @@ void exit_oom_victim(void)
{
clear_thread_flag(TIF_MEMDIE);

- if (!atomic_dec_return(&oom_victims))
+ if (!atomic_dec_return(&oom_victims)) {
+ del_timer(&panic_on_oom_timer);
wake_up_all(&oom_victims_wait);
+ }
}

/**
@@ -539,7 +552,7 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
p = find_lock_task_mm(victim);
if (!p) {
put_task_struct(victim);
- return;
+ goto no_oom_victim;
} else if (victim != p) {
get_task_struct(p);
put_task_struct(victim);
@@ -581,6 +594,15 @@ static void __oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

do_send_sig_info(SIGKILL, SEND_SIG_FORCED, victim, true);
put_task_struct(victim);
+ return;
+no_oom_victim:
+ /*
+ * If there is no oom victim selected (e.g. we might have raced with
+ * an exiting task) then cancel delayed oom panic handler.
+ */
+ if (!atomic_read(&oom_victims))
+ del_timer(&panic_on_oom_timer);
+
}
#undef K

@@ -624,6 +646,24 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
if (constraint != CONSTRAINT_NONE)
return;
}
+
+ if (sysctl_panic_on_oom_timeout) {
+ if (sysctl_panic_on_oom > 1) {
+ pr_warn("panic_on_oom_timeout is ignored for panic_on_oom=2\n");
+ } else {
+ /*
+ * Only schedule the delayed panic_on_oom when no timer
+ * is pending yet. oom_lock will protect us
+ * from races
+ */
+ if (timer_pending(&panic_on_oom_timer))
+ return;
+
+ mod_timer(&panic_on_oom_timer,
+ jiffies + (sysctl_panic_on_oom_timeout * HZ));
+ return;
+ }
+ }
dump_header(NULL, gfp_mask, order, memcg, nodemask);
panic("Out of memory: %s panic_on_oom is enabled\n",
sysctl_panic_on_oom == 2 ? "compulsory" : "system-wide");
--
2.1.4

--
Michal Hocko
SUSE Labs

2015-06-19 18:55:25

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> Yes I was thinking about this as well because the primary assumption
> of the OOM killer is that the victim will release some memory. And it
> doesn't matter whether the OOM killer was constrained or the global
> one. So the above looks good at first sight, I am just afraid it is too
> relaxed for cases where many tasks are sharing mm.

Excuse me for again exceeding the scope of your patch (only trying to handle
sysctl_panic_on_oom == 1 case), but I think my patch (trying to also handle
sysctl_panic_on_oom == 0 case) will not be too relaxed for cases where many
tasks are sharing mm, for my approach can set task_struct->memdie_start
to jiffies for all tasks sharing mm.

> The primary problem that we have is that we do not have any reliable
> under_oom() check and we simply try to approximate it by heuristics
> which work well enough in most cases. I admit that oom_victims is not
> the ideal one and there might be better. As mentioned above we can check
> watermarks on all zones and cancel the timer if at least one allows for
> an allocation. But there are surely downsides of that approach as well
> because the OOM killer could have been triggered for a higher order
> allocation and we might be still under OOM for those requests.

In my approach, threads calling out_of_memory() are considered a somewhat
reliable under_oom() check. I think this is more reliable than checking
watermarks because the caller of out_of_memory() has declared that the
watermark is still too low for that caller.

> > We need to keep track of per global OOM victim's timeout (e.g. "struct
> > task_struct"->memdie_start ) ?
>
> I do not think this will help anything. It will just lead to a different
> set of corner cases. E.g.
>
> 1) mark_oom_victim(T1) memdie_start = jiffies
> 2) fatal_signal_pending(T2) memdie_start = jiffies + delta
> 3) T2 releases memory - No OOM anymore
> 4) out_of_memory - check_memdie_timeout(T1) - KABOOM

Two possible corner cases for my approach are shown below.

One case is that the system can not panic of threads are unable to call
out_of_memory() for some reason.

The other case is that the system will panic if a sequence shown below
occurred.

(1) First round of OOM state begins.
(2) Somebody calls out_of_memory().
(3) OOM victim threads assign jiffies to their memdie_start.
(4) Somebody else releases memory before timeout expires.
(5) There comes a moment where nobody needs to call out_of_memory()
because watermark is no longer low.
(6) First round of OOM state ends.
(7) Some of OOM victim threads remain stuck for some reason.
(8) There comes a moment where somebody needs to call out_of_memory()
because watermark is again low.
(9) Second round of OOM state begins.
(10) Somebody calls out_of_memory().
(11) The caller of out_of_memory() finds memdie_start which was
assigned by first round of OOM state. But since the caller of
out_of_memory() cannot tell whether memdie_start is assigned by
first round of OOM state or not, the caller will use memdie_start
as if assigned by second round of OOM state.
(12) The timeout comes earlier than it should be.

If (7) does not last till (12), we will not hit this case.

If we can distinguish round number of OOM state (e.g. srcu_read_lock() at
memory allocation entry and srcu_read_unlock() at memory allocation return,
while synchronize_srcu() from kernel thread for OOM-killer), we will not hit
this case because (11) will not occur. Well, maybe just comparing
current->oom_start (assigned before calling out_of_memory() for the first
time of this memory allocation request) and victim->memdie_start can do it?

if (time_after(current->oom_start, victim->memdie_start)) {
if (time_after(jiffies, current->oom_start + timeout))
panic();
} else {
if (time_after(jiffies, victim->memdie_start + timeout))
panic();
}

Well, if we don't need it for analysis purposes,

if (time_after(jiffies, oom_start + sysctl_panic_on_oom_timeout * HZ))
panic();

(that is, pass the jiffies as of calling out_of_memory() for the first time
of this memory allocation request as an argument to out_of_memory(), and
compare at check_panic_on_oom()) is sufficient? Very simple implementation
because we do not use mod_timer()/del_timer().

2015-06-20 07:58:13

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Tetsuo Handa wrote:
> One case is that the system can not panic of threads are unable to call
> out_of_memory() for some reason.
^ if

> Well, if we don't need it for analysis purposes,
>
> if (time_after(jiffies, oom_start + sysctl_panic_on_oom_timeout * HZ))
> panic();
>
> (that is, pass the jiffies as of calling out_of_memory() for the first time
> of this memory allocation request as an argument to out_of_memory(), and
> compare at check_panic_on_oom()) is sufficient? Very simple implementation
> because we do not use mod_timer()/del_timer().

Here is an untested patch.

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index b5b4278..4c64b92 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -355,7 +355,7 @@ static void moom_callback(struct work_struct *ignored)
{
mutex_lock(&oom_lock);
if (!out_of_memory(node_zonelist(first_memory_node, GFP_KERNEL),
- GFP_KERNEL, 0, NULL, true))
+ GFP_KERNEL, 0, NULL, true, NULL))
pr_info("OOM request ignored because killer is disabled\n");
mutex_unlock(&oom_lock);
}
diff --git a/include/linux/oom.h b/include/linux/oom.h
index 7deecb7..75525e9 100644
--- a/include/linux/oom.h
+++ b/include/linux/oom.h
@@ -64,14 +64,16 @@ extern void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,

extern void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
int order, const nodemask_t *nodemask,
- struct mem_cgroup *memcg);
+ struct mem_cgroup *memcg,
+ const unsigned long *oom_start);

extern enum oom_scan_t oom_scan_process_thread(struct task_struct *task,
unsigned long totalpages, const nodemask_t *nodemask,
bool force_kill);

extern bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *mask, bool force_kill);
+ int order, nodemask_t *mask, bool force_kill,
+ const unsigned long *oom_start);

extern void exit_oom_victim(void);

@@ -99,4 +101,5 @@ static inline bool task_will_free_mem(struct task_struct *task)
extern int sysctl_oom_dump_tasks;
extern int sysctl_oom_kill_allocating_task;
extern int sysctl_panic_on_oom;
+extern int sysctl_panic_on_oom_timeout;
#endif /* _INCLUDE_LINUX_OOM_H */
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index c566b56..74a1b68 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1162,6 +1162,14 @@ static struct ctl_table vm_table[] = {
.extra2 = &two,
},
{
+ .procname = "panic_on_oom_timeout",
+ .data = &sysctl_panic_on_oom_timeout,
+ .maxlen = sizeof(sysctl_panic_on_oom_timeout),
+ .mode = 0644,
+ .proc_handler = proc_dointvec_minmax,
+ .extra1 = &zero,
+ },
+ {
.procname = "oom_kill_allocating_task",
.data = &sysctl_oom_kill_allocating_task,
.maxlen = sizeof(sysctl_oom_kill_allocating_task),
diff --git a/mm/memcontrol.c b/mm/memcontrol.c
index acb93c5..ab1ae3e 100644
--- a/mm/memcontrol.c
+++ b/mm/memcontrol.c
@@ -1563,7 +1563,8 @@ static void mem_cgroup_out_of_memory(struct mem_cgroup *memcg, gfp_t gfp_mask,
goto unlock;
}

- check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg);
+ check_panic_on_oom(CONSTRAINT_MEMCG, gfp_mask, order, NULL, memcg,
+ NULL);
totalpages = mem_cgroup_get_limit(memcg) ? : 1;
for_each_mem_cgroup_tree(iter, memcg) {
struct css_task_iter it;
diff --git a/mm/oom_kill.c b/mm/oom_kill.c
index dff991e..9d30f2e 100644
--- a/mm/oom_kill.c
+++ b/mm/oom_kill.c
@@ -40,6 +40,7 @@
#include <trace/events/oom.h>

int sysctl_panic_on_oom;
+int sysctl_panic_on_oom_timeout;
int sysctl_oom_kill_allocating_task;
int sysctl_oom_dump_tasks = 1;

@@ -602,7 +603,8 @@ void oom_kill_process(struct task_struct *p, gfp_t gfp_mask, int order,
*/
void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
int order, const nodemask_t *nodemask,
- struct mem_cgroup *memcg)
+ struct mem_cgroup *memcg,
+ const unsigned long *oom_start)
{
if (likely(!sysctl_panic_on_oom))
return;
@@ -614,6 +616,14 @@ void check_panic_on_oom(enum oom_constraint constraint, gfp_t gfp_mask,
*/
if (constraint != CONSTRAINT_NONE)
return;
+ /*
+ * panic_on_oom_timeout only affects panic_on_oom == 1 and
+ * CONSTRAINT_NONE.
+ */
+ if (sysctl_panic_on_oom_timeout && oom_start &&
+ time_before(jiffies,
+ *oom_start + sysctl_panic_on_oom_timeout * HZ))
+ return;
}
dump_header(NULL, gfp_mask, order, memcg, nodemask);
panic("Out of memory: %s panic_on_oom is enabled\n",
@@ -641,6 +651,8 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
* @order: amount of memory being requested as a power of 2
* @nodemask: nodemask passed to page allocator
* @force_kill: true if a task must be killed, even if others are exiting
+ * @oom_start: Pointer to jiffies as of calling this function for the first
+ * time of this memory allocation request. Ignored if NULL.
*
* If we run out of memory, we have the choice between either
* killing a random task (bad), letting the system crash (worse)
@@ -648,7 +660,8 @@ EXPORT_SYMBOL_GPL(unregister_oom_notifier);
* don't have to be perfect here, we just have to be good.
*/
bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
- int order, nodemask_t *nodemask, bool force_kill)
+ int order, nodemask_t *nodemask, bool force_kill,
+ const unsigned long *oom_start)
{
const nodemask_t *mpol_mask;
struct task_struct *p;
@@ -687,7 +700,8 @@ bool out_of_memory(struct zonelist *zonelist, gfp_t gfp_mask,
constraint = constrained_alloc(zonelist, gfp_mask, nodemask,
&totalpages);
mpol_mask = (constraint == CONSTRAINT_MEMORY_POLICY) ? nodemask : NULL;
- check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL);
+ check_panic_on_oom(constraint, gfp_mask, order, mpol_mask, NULL,
+ oom_start);

if (sysctl_oom_kill_allocating_task && current->mm &&
!oom_unkillable_task(current, NULL, nodemask) &&
@@ -734,7 +748,7 @@ void pagefault_out_of_memory(void)
if (!mutex_trylock(&oom_lock))
return;

- if (!out_of_memory(NULL, 0, 0, NULL, false)) {
+ if (!out_of_memory(NULL, 0, 0, NULL, false, NULL)) {
/*
* There shouldn't be any user tasks runnable while the
* OOM killer is disabled, so the current task has to
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 73aa335..3a75fe8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2678,7 +2678,9 @@ void warn_alloc_failed(gfp_t gfp_mask, int order, const char *fmt, ...)

static inline struct page *
__alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
- const struct alloc_context *ac, unsigned long *did_some_progress)
+ const struct alloc_context *ac,
+ unsigned long *did_some_progress,
+ unsigned long *oom_start)
{
struct page *page;

@@ -2731,7 +2733,10 @@ __alloc_pages_may_oom(gfp_t gfp_mask, unsigned int order,
goto out;
}
/* Exhausted what can be done so it's blamo time */
- if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false)
+ if (!*oom_start)
+ *oom_start = jiffies;
+ if (out_of_memory(ac->zonelist, gfp_mask, order, ac->nodemask, false,
+ oom_start)
|| WARN_ON_ONCE(gfp_mask & __GFP_NOFAIL))
*did_some_progress = 1;
out:
@@ -2968,6 +2973,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
int contended_compaction = COMPACT_CONTENDED_NONE;
+ unsigned long oom_start = 0;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3122,7 +3128,8 @@ retry:
}

/* Reclaim has failed us, start killing things */
- page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress);
+ page = __alloc_pages_may_oom(gfp_mask, order, ac, &did_some_progress,
+ &oom_start);
if (page)
goto got_pg;

2015-07-29 11:55:50

by Michal Hocko

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

On Wed 17-06-15 15:24:27, Michal Hocko wrote:
> On Wed 17-06-15 14:51:27, Michal Hocko wrote:
> [...]
> > The important thing is to decide what is the reasonable way forward. We
> > have two implementations of panic based timeout. So we should decide
>
> And the most obvious question, of course.
> - Should we add a panic timeout at all?
>
> > - Should be the timeout bound to panic_on_oom?
> > - Should we care about constrained OOM contexts?
> > - If yes should they use the same timeout?
> > - If yes should each memcg be able to define its own timeout?
> ^ no
>
> > My thinking is that it should be bound to panic_on_oom=1 only until we
> > hear from somebody actually asking for a constrained oom and even then
> > do not allow for too large configuration space (e.g. no per-memcg
> > timeout) or have separate mempolicy vs. memcg timeouts.
> >
> > Let's start simple and make things more complicated later!

Any more ideas/thoughts on this?
--
Michal Hocko
SUSE Labs

2015-07-29 14:09:57

by Tetsuo Handa

[permalink] [raw]
Subject: Re: [RFC -v2] panic_on_oom_timeout

Michal Hocko wrote:
> On Wed 17-06-15 15:24:27, Michal Hocko wrote:
> > On Wed 17-06-15 14:51:27, Michal Hocko wrote:
> > [...]
> > > The important thing is to decide what is the reasonable way forward. We
> > > have two implementations of panic based timeout. So we should decide
> >
> > And the most obvious question, of course.
> > - Should we add a panic timeout at all?
> >
> > > - Should be the timeout bound to panic_on_oom?
> > > - Should we care about constrained OOM contexts?
> > > - If yes should they use the same timeout?
> > > - If yes should each memcg be able to define its own timeout?
> > ^ no
> >
> > > My thinking is that it should be bound to panic_on_oom=1 only until we
> > > hear from somebody actually asking for a constrained oom and even then
> > > do not allow for too large configuration space (e.g. no per-memcg
> > > timeout) or have separate mempolicy vs. memcg timeouts.
> > >
> > > Let's start simple and make things more complicated later!
>
> Any more ideas/thoughts on this?

No ideas/thoughts from my side.



By the way, the "set TIF_MEMDIE upon calling out_of_memory() when TIF_MEMDIE
was not set by previous out_of_memory() because oom_kill_process() chose a
different thread" logic

if (current->mm &&
(fatal_signal_pending(current) || task_will_free_mem(current))) {
mark_oom_victim(current);
goto out;
}

sounds broken to me, because GFP_NOFS allocations do not call
out_of_memory() in the first place.

Say, Process1 has two threads called Thread1 and Thread2. Thread1 was blocked
at unkillable lock and Thread2 was doing GFP_NOFS allocation from syscall
context (e.g. codes under security/ directory) when TIF_MEMDIE was set on
Thread1.

While failing GFP_NOFS allocation for ext4 filesystem's operations damages
the filesystem, failing GFP_NOFS allocation from syscall context will not
damage the filesystem. Therefore, Thread2 should be able to fail GFP_NOFS
allocations rather than wait for TIF_MEMDIE forever (which will never be set
because the logic above does not apply to GFP_NOFS allocations).

I didn't imagine kmalloc_killable() when I wrote "(3) Replace kmalloc()
with kmalloc_nofail() and kmalloc_noretry()." at
http://marc.info/?l=linux-mm&m=142408937117294 . But I came to feel that
introducing GFP_KILLABLE (retry unless fatal_signal_pending()) which is
between GFP_NORETRY (don't retry) and GFP_NOFAIL (retry forever) might help
reduce the possibility of stalling a multi-threaded OOM victim process.
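
As an illustration only (__GFP_KILLABLE is a hypothetical flag and the hook
point in __alloc_pages_slowpath() is hand-picked, nothing like this exists
in mainline):

        /*
         * Illustration only: __GFP_KILLABLE would sit between __GFP_NORETRY
         * (give up immediately) and __GFP_NOFAIL (retry forever). Somewhere
         * in __alloc_pages_slowpath(), before deciding to retry:
         */
        if ((gfp_mask & __GFP_KILLABLE) && fatal_signal_pending(current))
                goto nopage;    /* fail the allocation instead of looping */

That way a sibling thread of an OOM victim which is doing a GFP_NOFS
allocation from syscall context could fail it and exit, instead of waiting
forever for a TIF_MEMDIE which will never come.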



Other than that, my ideas/thoughts are staying at
http://marc.info/?l=linux-mm&m=143239200805478 .

Please continue CC'ing me because I'm not subscribed to linux-mm ML.