2015-12-12 15:33:13

by Tetsuo Handa

Subject: [PATCH v4] mm,oom: Add memory allocation watchdog kernel thread.

From 2804913f4d21a20a154b93d5437c21e52bf761a1 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Sun, 13 Dec 2015 00:02:29 +0900
Subject: [PATCH v4] mm/oom: Add memory allocation watchdog kernel thread.

This patch adds a kernel thread which periodically reports the number
of memory allocating tasks, dying tasks and OOM victim tasks when some
task is spending too much time inside __alloc_pages_slowpath().

Changes from v1:

(1) Use per-"struct task_struct" variables. This allows a vmcore to
remember information about the last memory allocation request, which
is useful for understanding last-minute behavior of the kernel.

(2) Report using an accurate timeout. This increases the possibility
of successfully reporting before watchdog timers reset the machine.

(3) Show memory information (SysRq-m). This makes it easier to
understand the reason for the stall.

(4) Show both $state_of_allocation and $state_of_task in the same
line. This makes it easier to grep the output.

(5) Minimize duration of spinlock held by the kernel thread.

Changes from v2:

(1) Print a sequence number. This makes it easier to know whether
memory allocation is succeeding (i.e. it looks like a livelock but
is making forward progress) or not.

(2) Replace the spinlock with a cheaper seqlock_t-like sequence
number based method. Callers no longer contend on a lock, and the
major overhead on the caller side becomes two smp_wmb() calls
instead of read_lock()/read_unlock().

(3) Print "exiting" instead of "dying" if an OOM victim is stalling
at do_exit(), because SIGKILL is removed before arriving at do_exit().

(4) Moved the explanation to Documentation/malloc-watchdog.txt.

Changes from v3:

(1) Avoid stalls even if there are so many tasks to report.

Signed-off-by: Tetsuo Handa <[email protected]>
---
Documentation/malloc-watchdog.txt | 139 +++++++++++++++++++++
include/linux/sched.h | 25 ++++
kernel/fork.c | 4 +
mm/Kconfig | 10 ++
mm/page_alloc.c | 254 ++++++++++++++++++++++++++++++++++++++
5 files changed, 432 insertions(+)
create mode 100644 Documentation/malloc-watchdog.txt

diff --git a/Documentation/malloc-watchdog.txt b/Documentation/malloc-watchdog.txt
new file mode 100644
index 0000000..599d751
--- /dev/null
+++ b/Documentation/malloc-watchdog.txt
@@ -0,0 +1,139 @@
+=========================================
+Memory allocation watchdog kernel thread.
+=========================================
+
+
+- What is it?
+
+This kernel thread resembles the khungtaskd kernel thread, but this
+kernel thread warns that memory allocation requests are stalling, in
+order to catch unexplained hangups/reboots caused by memory allocation
+stalls.
+
+
+- Why is it needed?
+
+Currently, when something goes wrong inside a memory allocation request,
+the system stalls with either 100% CPU usage (if memory allocating
+tasks are busy looping) or 0% CPU usage (if memory allocating tasks
+are waiting for file data to be flushed to storage).
+But /proc/sys/kernel/hung_task_warnings is not helpful because memory
+allocating tasks are unlikely to sleep in uninterruptible state for
+/proc/sys/kernel/hung_task_timeout_secs seconds.
+
+People are reporting hangup problems, but we are forcing them to use
+kernels without a means to find out what was happening. Such a means
+is expected to work without knowledge of the tracepoints
+functionality, is expected to run without allocating memory, is
+expected to dump output without an administrator's intervention, and
+is expected to work before watchdog timers reset the machine. Without
+this kernel thread, it is extremely hard to figure out that the
+system hung up due to memory allocation stalls.
+
+
+- How to configure it?
+
+Build kernels with CONFIG_MEMALLOC_WATCHDOG=y.
+
+The default scan interval is 10 seconds. The scan interval can be changed
+by passing an integer value to the kmallocwd boot parameter. For example,
+passing kmallocwd=30 will emit the first stall warning after 30 seconds,
+and subsequent warnings every 30 seconds.
+
+Even if you disable this kernel thread by passing the kmallocwd=0 boot
+parameter, information about the last memory allocation request is kept.
+That is, you will get some hint about last-minute behavior of the kernel
+when you analyze a vmcore (or a memory snapshot of a virtual machine).
+
+
+- How are memory allocation stalls reported?
+
+There are two types of memory allocation stalls: one is that we fail to
+solve OOM conditions after the OOM killer is invoked, the other is that
+we fail to solve OOM conditions before the OOM killer is invoked.
+
+The former case is that the OOM killer chose an OOM victim but the
+chosen victim is unable to make forward progress. Although the OOM
+victim receives TIF_MEMDIE from the OOM killer, TIF_MEMDIE helps only
+if the OOM victim was doing memory allocation. That is, if the OOM
+victim was blocked on an unkillable lock (e.g. mutex_lock(&inode->i_mutex)
+or down_read(&mm->mmap_sem)), the system will hang up upon a global OOM
+condition. This kernel thread will report such a situation by printing a
+
+ MemAlloc-Info: $X stalling task, $Y dying task, $Z victim task.
+
+line where $X > 0 and $Y > 0 and $Z > 0, followed by at most $X + $Y
+lines of
+
+ MemAlloc: $name($pid) $state_of_allocation $state_of_task
+
+where $name and $pid are the comm name and pid of a task.
+
+$state_of_allocation is reported only when that task is stalling inside
+__alloc_pages_slowpath(), in "seq=$seq gfp=$gfp order=$order delay=$delay"
+format, where $seq is the sequence number of the allocation request, $gfp
+is the gfp flags used for that allocation request, $order is the order,
+and $delay is the jiffies elapsed since entering __alloc_pages_slowpath().
+
+$state_of_task is reported only when that task is dying, as a combination
+of "uninterruptible" (that task is in uninterruptible sleep, likely due
+to an uninterruptible lock), "exiting" (that task arrived at the
+do_exit() function), "dying" (that task has a pending SIGKILL) and
+"victim" (that task received TIF_MEMDIE; likely only 1 task).
+
+The latter case has three possibilities. The first possibility is that
+the system is simply overloaded (not a livelock, but progress is too
+slow to wait for). You can check the seq=$seq field of each reported
+process. If $seq is increasing over time, it is not a livelock. The
+second possibility is that at least one task is doing a __GFP_FS ||
+__GFP_NOFAIL memory allocation request but the operation for reclaiming
+memory is not working as expected for an unknown reason (a livelock),
+which will not invoke the OOM killer. The third possibility is that all
+ongoing memory allocation requests are !__GFP_FS && !__GFP_NOFAIL,
+which do not invoke the OOM killer. This kernel thread will report such
+a situation with $X > 0, $Y >= 0 and $Z = 0.
+
+
+- What do the messages look like?
+
+An example of MemAlloc lines (a grep of dmesg output) is shown below.
+You can use a serial console and/or netconsole to save these messages
+when the system is stalling.
+
+ [ 78.402510] MemAlloc-Info: 7 stalling task, 1 dying task, 1 victim task.
+ [ 78.404691] MemAlloc: kthreadd(2) seq=6 gfp=0x27000c0 order=2 delay=9931 uninterruptible
+ [ 78.451201] MemAlloc: systemd-journal(478) seq=73 gfp=0x24201ca order=0 delay=9842
+ [ 78.497058] MemAlloc: irqbalance(747) seq=4 gfp=0x24201ca order=0 delay=7454
+ [ 78.542291] MemAlloc: crond(969) seq=18 gfp=0x24201ca order=0 delay=9842
+ [ 78.586270] MemAlloc: vmtoolsd(1912) seq=64 gfp=0x24201ca order=0 delay=9847
+ [ 78.631516] MemAlloc: oom-write(3786) seq=25322 gfp=0x24280ca order=0 delay=10000 uninterruptible
+ [ 78.676193] MemAlloc: write(3787) seq=46308 gfp=0x2400240 order=0 delay=9847 uninterruptible exiting
+ [ 78.755351] MemAlloc: write(3788) uninterruptible dying victim
+ [ 88.854456] MemAlloc-Info: 8 stalling task, 1 dying task, 1 victim task.
+ [ 88.856533] MemAlloc: kthreadd(2) seq=6 gfp=0x27000c0 order=2 delay=20383 uninterruptible
+ [ 88.900375] MemAlloc: systemd-journal(478) seq=73 gfp=0x24201ca order=0 delay=20294 uninterruptible
+ [ 88.952300] MemAlloc: irqbalance(747) seq=4 gfp=0x24201ca order=0 delay=17906 uninterruptible
+ [ 88.997542] MemAlloc: crond(969) seq=18 gfp=0x24201ca order=0 delay=20294
+ [ 89.041480] MemAlloc: vmtoolsd(1912) seq=64 gfp=0x24201ca order=0 delay=20299
+ [ 89.090096] MemAlloc: nmbd(3709) seq=9 gfp=0x24201ca order=0 delay=13855
+ [ 89.142032] MemAlloc: oom-write(3786) seq=25322 gfp=0x24280ca order=0 delay=20452
+ [ 89.177999] MemAlloc: write(3787) seq=46308 gfp=0x2400240 order=0 delay=20299 exiting
+ [ 89.254554] MemAlloc: write(3788) uninterruptible dying victim
+ [ 99.353664] MemAlloc-Info: 11 stalling task, 1 dying task, 1 victim task.
+ [ 99.356044] MemAlloc: kthreadd(2) seq=6 gfp=0x27000c0 order=2 delay=30882 uninterruptible
+ [ 99.403609] MemAlloc: systemd-journal(478) seq=73 gfp=0x24201ca order=0 delay=30793 uninterruptible
+ [ 99.449469] MemAlloc: irqbalance(747) seq=4 gfp=0x24201ca order=0 delay=28405
+ [ 99.493474] MemAlloc: crond(969) seq=18 gfp=0x24201ca order=0 delay=30793 uninterruptible
+ [ 99.536027] MemAlloc: vmtoolsd(1912) seq=64 gfp=0x24201ca order=0 delay=30798 uninterruptible
+ [ 99.582630] MemAlloc: master(3682) seq=2 gfp=0x24201ca order=0 delay=10886
+ [ 99.626574] MemAlloc: nmbd(3709) seq=9 gfp=0x24201ca order=0 delay=24354
+ [ 99.669191] MemAlloc: smbd(3737) seq=2 gfp=0x24201ca order=0 delay=7130
+ [ 99.714555] MemAlloc: smbd(3753) seq=2 gfp=0x24201ca order=0 delay=6616 uninterruptible
+ [ 99.758412] MemAlloc: oom-write(3786) seq=25322 gfp=0x24280ca order=0 delay=30951
+ [ 99.793156] MemAlloc: write(3787) seq=46308 gfp=0x2400240 order=0 delay=30798 uninterruptible exiting
+ [ 99.871842] MemAlloc: write(3788) uninterruptible dying victim
+
+You can check whether memory allocations are making forward progress.
+You can check where memory allocations are stalling using the stack
+trace of each reported task, which follows its MemAlloc line. You can
+check memory information (SysRq-m), which follows the end of MemAlloc lines.
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b76e39..039b04d 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1379,6 +1379,28 @@ struct tlbflush_unmap_batch {
bool writable;
};

+struct memalloc_info {
+ /* For locking and progress monitoring. */
+ unsigned int sequence;
+ /*
+ * 0: not doing __GFP_RECLAIM allocation.
+ * 1: doing non-recursive __GFP_RECLAIM allocation.
+ * 2: doing recursive __GFP_RECLAIM allocation.
+ */
+ u8 valid;
+ /*
+ * bit 0: Will be reported as OOM victim.
+ * bit 1: Will be reported as dying task.
+ * bit 2: Will be reported as stalling task.
+ */
+ u8 type;
+ /* Started time in jiffies as of valid == 1. */
+ unsigned long start;
+ /* Requested order and gfp flags as of valid == 1. */
+ unsigned int order;
+ gfp_t gfp;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1822,6 +1844,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_MEMALLOC_WATCHDOG
+ struct memalloc_info memalloc;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/kernel/fork.c b/kernel/fork.c
index 8cb287a..aed1c89 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1414,6 +1414,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->sequential_io_avg = 0;
#endif

+#ifdef CONFIG_MEMALLOC_WATCHDOG
+ p->memalloc.sequence = 0;
+#endif
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/mm/Kconfig b/mm/Kconfig
index 97a4e06..df05f85 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -668,3 +668,13 @@ config ZONE_DEVICE

config FRAME_VECTOR
bool
+
+config MEMALLOC_WATCHDOG
+ bool "Memory allocation stalling watchdog"
+ default n
+ help
+ This option emits warning messages and traces when memory
+ allocation requests are stalling, in order to catch unexplained
+ hangups/reboots caused by memory allocation stalls.
+
+ See Documentation/malloc-watchdog.txt for more information.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bac8842d..5ff89ae 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -62,6 +62,7 @@
#include <linux/sched/rt.h>
#include <linux/page_owner.h>
#include <linux/kthread.h>
+#include <linux/console.h>

#include <asm/sections.h>
#include <asm/tlbflush.h>
@@ -3199,6 +3200,257 @@ got_pg:
return page;
}

+#ifdef CONFIG_MEMALLOC_WATCHDOG
+
+static unsigned long kmallocwd_timeout = 10 * HZ; /* Default scan interval. */
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task: A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+ const unsigned long expire)
+{
+ const struct memalloc_info *m = &task->memalloc;
+
+ /*
+ * If start_memalloc_timer() is updating "struct memalloc_info" now,
+ * we can ignore this task because the timeout cannot have expired
+ * so soon after the update completes.
+ */
+ if (!m->valid || (m->sequence & 1))
+ return false;
+ smp_rmb(); /* Block start_memalloc_timer(). */
+ memalloc.start = m->start;
+ memalloc.order = m->order;
+ memalloc.gfp = m->gfp;
+ smp_rmb(); /* Unblock start_memalloc_timer(). */
+ memalloc.sequence = m->sequence;
+ /*
+ * If start_memalloc_timer() started updating it while we read it,
+ * we can ignore it for the same reason.
+ */
+ if (!m->valid || (memalloc.sequence & 1))
+ return false;
+ /* This is a valid "struct memalloc_info". Check for timeout. */
+ return time_after_eq(expire, memalloc.start);
+}
+
+/*
+ * kmallocwd - A kernel thread for monitoring memory allocation stalls.
+ *
+ * @unused: Not used.
+ *
+ * This kernel thread does not terminate.
+ */
+static int kmallocwd(void *unused)
+{
+ char buf[128];
+ struct task_struct *g, *p;
+ unsigned long now;
+ unsigned long expire;
+ unsigned int sigkill_pending;
+ unsigned int memdie_pending;
+ unsigned int stalling_tasks;
+
+ restart:
+ /* Sleep until stalled tasks are found. */
+ while (1) {
+ /*
+ * If memory allocations are not stalling, the value of t after
+ * this for_each_process_thread() loop should remain close to
+ * kmallocwd_timeout. Also, we sleep for kmallocwd_timeout
+ * before retrying if memory allocations are stalling.
+ * Therefore, this while() loop won't waste too many CPU cycles
+ * by sleeping for too short a period.
+ */
+ long t = kmallocwd_timeout;
+ const unsigned long delta = t - jiffies;
+ /*
+ * We might see outdated values in "struct memalloc_info" here.
+ * We will recheck later using is_stalling_task().
+ */
+ preempt_disable();
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ if (likely(!p->memalloc.valid))
+ continue;
+ t = min_t(long, t, p->memalloc.start + delta);
+ if (unlikely(t <= 0))
+ goto stalling;
+ }
+ rcu_read_unlock();
+ preempt_enable();
+ schedule_timeout_interruptible(t);
+ }
+ stalling:
+ rcu_read_unlock();
+ preempt_enable();
+ cond_resched();
+ now = jiffies;
+ /*
+ * Report tasks that stalled for more than half of timeout duration
+ * because such tasks might be correlated with tasks that already
+ * stalled for full timeout duration.
+ */
+ expire = now - kmallocwd_timeout / 2;
+ /* Count stalling tasks, dying and victim tasks. */
+ sigkill_pending = 0;
+ memdie_pending = 0;
+ stalling_tasks = 0;
+ preempt_disable();
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ u8 type = 0;
+
+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ type |= 1;
+ memdie_pending++;
+ }
+ if (fatal_signal_pending(p)) {
+ type |= 2;
+ sigkill_pending++;
+ }
+ if (is_stalling_task(p, expire)) {
+ type |= 4;
+ stalling_tasks++;
+ }
+ p->memalloc.type = type;
+ }
+ rcu_read_unlock();
+ preempt_enable();
+ if (!stalling_tasks)
+ goto restart;
+ /* Report stalling tasks, dying and victim tasks. */
+ pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+ stalling_tasks, sigkill_pending, memdie_pending);
+ cond_resched();
+ preempt_disable();
+ rcu_read_lock();
+ restart_report:
+ for_each_process_thread(g, p) {
+ bool can_cont;
+ u8 type = p->memalloc.type;
+
+ if (likely(!type))
+ continue;
+ p->memalloc.type = 0;
+ buf[0] = '\0';
+ /*
+ * Recheck stalling tasks in case they called
+ * stop_memalloc_timer() meanwhile.
+ */
+ if (type & 4) {
+ if (is_stalling_task(p, expire)) {
+ snprintf(buf, sizeof(buf),
+ " seq=%u gfp=0x%x order=%u delay=%lu",
+ memalloc.sequence >> 1, memalloc.gfp,
+ memalloc.order, now - memalloc.start);
+ } else {
+ type &= ~4;
+ if (!type)
+ continue;
+ }
+ }
+ /*
+ * Victim tasks get pending SIGKILL removed before arriving at
+ * do_exit(). Therefore, print " exiting" instead of " dying".
+ */
+ pr_warn("MemAlloc: %s(%u)%s%s%s%s%s\n", p->comm, p->pid, buf,
+ (p->state & TASK_UNINTERRUPTIBLE) ?
+ " uninterruptible" : "",
+ (p->flags & PF_EXITING) ? " exiting" : "",
+ (type & 2) ? " dying" : "",
+ (type & 1) ? " victim" : "");
+ sched_show_task(p);
+ debug_show_held_locks(p);
+ /*
+ * Since there could be thousands of tasks to report, we always
+ * sleep and try to flush printk() buffer after each report, in
+ * order to avoid RCU stalls and reduce possibility of messages
+ * being dropped by continuous printk() flood.
+ *
+ * Since not yet reported tasks have p->memalloc.type > 0, we
+ * can simply restart this loop in case "g" or "p" went away.
+ */
+ get_task_struct(g);
+ get_task_struct(p);
+ rcu_read_unlock();
+ preempt_enable();
+ schedule_timeout_interruptible(1);
+ console_lock();
+ console_unlock();
+ preempt_disable();
+ rcu_read_lock();
+ can_cont = pid_alive(g) && pid_alive(p);
+ put_task_struct(p);
+ put_task_struct(g);
+ if (!can_cont)
+ goto restart_report;
+ }
+ rcu_read_unlock();
+ preempt_enable();
+ cond_resched();
+ /* Show memory information. (SysRq-m) */
+ show_mem(0);
+ /* Sleep until next timeout duration. */
+ schedule_timeout_interruptible(kmallocwd_timeout);
+ goto restart;
+ return 0; /* To suppress "no return statement" compiler warning. */
+}
+
+static int __init start_kmallocwd(void)
+{
+ if (kmallocwd_timeout)
+ kthread_run(kmallocwd, NULL, "kmallocwd");
+ return 0;
+}
+late_initcall(start_kmallocwd);
+
+static int __init kmallocwd_config(char *str)
+{
+ if (kstrtoul(str, 10, &kmallocwd_timeout) == 0)
+ kmallocwd_timeout = min(kmallocwd_timeout * HZ,
+ (unsigned long) LONG_MAX);
+ return 0;
+}
+__setup("kmallocwd=", kmallocwd_config);
+
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+ struct memalloc_info *m = &current->memalloc;
+
+ /* We don't check for stalls for !__GFP_RECLAIM allocations. */
+ if (!(gfp_mask & __GFP_RECLAIM))
+ return;
+ /* We don't check for stalls for nested __GFP_RECLAIM allocations */
+ if (!m->valid) {
+ m->sequence++;
+ smp_wmb(); /* Block is_stalling_task(). */
+ m->start = jiffies;
+ m->order = order;
+ m->gfp = gfp_mask;
+ smp_wmb(); /* Unblock is_stalling_task(). */
+ m->sequence++;
+ }
+ m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_RECLAIM)
+ current->memalloc.valid--;
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -3266,7 +3518,9 @@ retry_cpuset:
alloc_mask = memalloc_noio_flags(gfp_mask);
ac.spread_dirty_pages = false;

+ start_memalloc_timer(alloc_mask, order);
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+ stop_memalloc_timer(alloc_mask);
}

if (kmemcheck_enabled && page)
--
1.8.3.1


2015-12-12 17:00:47

by Johannes Weiner

Subject: Re: [PATCH v4] mm,oom: Add memory allocation watchdog kernel thread.

On Sun, Dec 13, 2015 at 12:33:04AM +0900, Tetsuo Handa wrote:
> +Currently, when something went wrong inside memory allocation request,
> +the system will stall with either 100% CPU usage (if memory allocating
> +tasks are doing busy loop) or 0% CPU usage (if memory allocating tasks
> +are waiting for file data to be flushed to storage).
> +But /proc/sys/kernel/hung_task_warnings is not helpful because memory
> +allocating tasks unlikely sleep in uninterruptible state for
> +/proc/sys/kernel/hung_task_timeout_secs seconds.

Yes, this is very annoying. Other tasks in the system get dumped out
as they are blocked for too long, but not the allocating task itself
as it's busy looping.

That being said, I'm not entirely sure why we need a daemon to do this,
which then requires us to duplicate allocation state into task_struct.
There is no scenario where the allocating task is not moving at all
anymore, right? So can't we dump the allocation state from within the
allocator and leave the rest to the hung task detector?

diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 05ef7fb..fbfc581 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3004,6 +3004,7 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
enum migrate_mode migration_mode = MIGRATE_ASYNC;
bool deferred_compaction = false;
int contended_compaction = COMPACT_CONTENDED_NONE;
+ unsigned int nr_tries = 0;

/*
* In the slowpath, we sanity check order to avoid ever trying to
@@ -3033,6 +3034,9 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
goto nopage;

retry:
+ if (++nr_tries % 1000 == 0)
+ warn_alloc_failed(gfp_mask, order, "Potential GFP deadlock\n");
+
if (gfp_mask & __GFP_KSWAPD_RECLAIM)
wake_all_kswapds(order, ac);

Basing it on nr_tries alone might be too crude and take too long
when each cycle spends time waiting for IO. However, if that is a
problem we can make it time-based instead, like your memalloc_timer,
to catch tasks that spend too much time in a single alloc attempt.

> + start_memalloc_timer(alloc_mask, order);
> page = __alloc_pages_slowpath(alloc_mask, order, &ac);
> + stop_memalloc_timer(alloc_mask);

2015-12-13 14:26:57

by Tetsuo Handa

Subject: Re: [PATCH v4] mm,oom: Add memory allocation watchdog kernel thread.

Johannes Weiner wrote:
> On Sun, Dec 13, 2015 at 12:33:04AM +0900, Tetsuo Handa wrote:
> > +Currently, when something went wrong inside memory allocation request,
> > +the system will stall with either 100% CPU usage (if memory allocating
> > +tasks are doing busy loop) or 0% CPU usage (if memory allocating tasks
> > +are waiting for file data to be flushed to storage).
> > +But /proc/sys/kernel/hung_task_warnings is not helpful because memory
> > +allocating tasks unlikely sleep in uninterruptible state for
> > +/proc/sys/kernel/hung_task_timeout_secs seconds.
>
> Yes, this is very annoying. Other tasks in the system get dumped out
> as they are blocked for too long, but not the allocating task itself
> as it's busy looping.
>

Off-topic, but judging from my experience at a support center, people are
not utilizing the hung task detector well. Since
/proc/sys/kernel/hung_task_warnings defaults to 10 and people use it with
the default value, the hung task detector reports nothing because
hung_task_warnings is likely already 0 by the time their systems hang up
after many days/months...

> That being said, I'm not entirely sure why we need daemon to do this,
> which then requires us to duplicate allocation state to task_struct.

I thought about doing this using a timer interrupt (i.e.
add_timer()/del_timer()) because allocating tasks can be blocked at locks
(e.g. mutex_lock()) or loops (e.g. too_many_isolated() in
shrink_inactive_list()).

[ 99.793156] MemAlloc: write(3787) seq=46308 gfp=0x2400240 order=0 delay=30798 uninterruptible exiting
[ 99.795428] write D ffff880075c974b8 0 3787 3786 0x20020086
[ 99.797381] ffff880075c974b8 ffff880035d98000 ffff8800777995c0 ffff880075c98000
[ 99.799459] ffff880075c974f0 ffff88007fc10240 00000000fffcf09e 0000000000000000
[ 99.801518] ffff880075c974d0 ffffffff816f4127 ffff88007fc10240 ffff880075c97578
[ 99.803571] Call Trace:
[ 99.804605] [<ffffffff816f4127>] schedule+0x37/0x90
[ 99.806102] [<ffffffff816f8427>] schedule_timeout+0x117/0x1c0
[ 99.807738] [<ffffffff810dfbc0>] ? init_timer_key+0x40/0x40
[ 99.809332] [<ffffffff816f8554>] schedule_timeout_uninterruptible+0x24/0x30
[ 99.811199] [<ffffffff81147eba>] __alloc_pages_nodemask+0x8ba/0xc80
[ 99.812940] [<ffffffff8118f2c6>] alloc_pages_current+0x96/0x1b0
[ 99.814621] [<ffffffff812e3e64>] xfs_buf_allocate_memory+0x170/0x29f
[ 99.816373] [<ffffffff812ac204>] xfs_buf_get_map+0xe4/0x140
[ 99.817981] [<ffffffff812ac7e9>] xfs_buf_read_map+0x29/0xd0
[ 99.819603] [<ffffffff812d65c7>] xfs_trans_read_buf_map+0x97/0x1a0
[ 99.821326] [<ffffffff81286b53>] xfs_btree_read_buf_block.constprop.28+0x73/0xc0
[ 99.823262] [<ffffffff81286c1b>] xfs_btree_lookup_get_block+0x7b/0xf0
[ 99.825029] [<ffffffff8128b15b>] xfs_btree_lookup+0xbb/0x500
[ 99.826662] [<ffffffff8127561c>] ? xfs_allocbt_init_cursor+0x3c/0xc0
[ 99.828430] [<ffffffff81273c3b>] xfs_free_ag_extent+0x6b/0x5f0
[ 99.830053] [<ffffffff81274fb4>] xfs_free_extent+0xf4/0x120
[ 99.831734] [<ffffffff812d6b31>] xfs_trans_free_extent+0x21/0x60
[ 99.833453] [<ffffffff812a8eca>] xfs_bmap_finish+0xfa/0x120
[ 99.835087] [<ffffffff812bda7d>] xfs_itruncate_extents+0x10d/0x190
[ 99.836812] [<ffffffff812a9bf5>] xfs_free_eofblocks+0x1d5/0x240
[ 99.838494] [<ffffffff812bdc9f>] xfs_release+0x8f/0x150
[ 99.840017] [<ffffffff812af330>] xfs_file_release+0x10/0x20
[ 99.841593] [<ffffffff811c1018>] __fput+0xb8/0x230
[ 99.843071] [<ffffffff811c11c9>] ____fput+0x9/0x10
[ 99.844449] [<ffffffff8108d922>] task_work_run+0x72/0xa0
[ 99.845968] [<ffffffff81071a91>] do_exit+0x2f1/0xb50
[ 99.847406] [<ffffffff81072377>] do_group_exit+0x47/0xc0
[ 99.848883] [<ffffffff8107dc22>] get_signal+0x222/0x7e0
[ 99.850406] [<ffffffff8100f362>] do_signal+0x32/0x670
[ 99.851900] [<ffffffff8106a3c7>] ? syscall_slow_exit_work+0x4b/0x10d
[ 99.853600] [<ffffffff811c1b9c>] ? __sb_end_write+0x1c/0x20
[ 99.855142] [<ffffffff8106a31a>] ? exit_to_usermode_loop+0x2e/0x90
[ 99.856778] [<ffffffff8106a338>] exit_to_usermode_loop+0x4c/0x90
[ 99.858399] [<ffffffff810036f2>] do_syscall_32_irqs_off+0x122/0x190
[ 99.860055] [<ffffffff816fbc38>] entry_INT80_compat+0x38/0x50
[ 99.861630] 3 locks held by write/3787:
[ 99.863151] #0: (sb_internal){.+.+.?}, at: [<ffffffff811c2b3c>] __sb_start_write+0xcc/0xe0
[ 99.865391] #1: (&(&ip->i_iolock)->mr_lock){++++++}, at: [<ffffffff812bb629>] xfs_ilock_nowait+0x59/0x140
[ 99.867910] #2: (&xfs_nondir_ilock_class){++++--}, at: [<ffffffff812bb4ff>] xfs_ilock+0x7f/0xe0
[ 99.871842] MemAlloc: write(3788) uninterruptible dying victim
[ 99.873491] write D ffff8800793cbcb8 0 3788 3786 0x20120084
[ 99.875414] ffff8800793cbcb8 ffffffff81c11500 ffff8800766b0000 ffff8800793cc000
[ 99.877443] ffff88007a9e76b0 ffff8800766b0000 0000000000000246 00000000ffffffff
[ 99.879460] ffff8800793cbcd0 ffffffff816f4127 ffff88007a9e76a8 ffff8800793cbce0
[ 99.881472] Call Trace:
[ 99.882543] [<ffffffff816f4127>] schedule+0x37/0x90
[ 99.883993] [<ffffffff816f4450>] schedule_preempt_disabled+0x10/0x20
[ 99.885726] [<ffffffff816f51db>] mutex_lock_nested+0x17b/0x3e0
[ 99.887342] [<ffffffff812b0b7f>] ? xfs_file_buffered_aio_write+0x5f/0x1f0
[ 99.889125] [<ffffffff812b0b7f>] xfs_file_buffered_aio_write+0x5f/0x1f0
[ 99.890874] [<ffffffff812b0d94>] xfs_file_write_iter+0x84/0x140
[ 99.892498] [<ffffffff811be7b7>] __vfs_write+0xc7/0x100
[ 99.894013] [<ffffffff811bf21d>] vfs_write+0x9d/0x190
[ 99.895482] [<ffffffff811deb0a>] ? __fget_light+0x6a/0x90
[ 99.897020] [<ffffffff811c0133>] SyS_write+0x53/0xd0
[ 99.898469] [<ffffffff8100362f>] do_syscall_32_irqs_off+0x5f/0x190
[ 99.900133] [<ffffffff816fbc38>] entry_INT80_compat+0x38/0x50
[ 99.901752] 2 locks held by write/3788:
[ 99.903061] #0: (sb_writers#8){.+.+.+}, at: [<ffffffff811c2b3c>] __sb_start_write+0xcc/0xe0
[ 99.905412] #1: (&sb->s_type->i_mutex_key#12){+.+.+.}, at: [<ffffffff812b0b7f>] xfs_file_buffered_aio_write+0x5f/0x1f0

But I realized that trying to printk() from a timer interrupt generates
unreadable/corrupted dumps because there could be thousands of tasks to
report (e.g. when we entered an OOM livelock), while we don't want to
busy-loop inside a timer interrupt to serialize printk().

[ 211.563810] MemAlloc-Info: 3914 stalling task, 0 dying task, 1 victim task.

The same thing would happen with the warn_alloc_failed() approach when
many tasks called warn_alloc_failed() at the same time...

> There is no scenario where the allocating task is not moving at all
> anymore, right? So can't we dump the allocation state from within the
> allocator and leave the rest to the hung task detector?

I don't think we can reliably get information with the hung task
detector approach.

Duplicating allocation state into task_struct allows us to keep
information about the last memory allocation request. That is, we will
get some hint for understanding last-minute behavior of the kernel when
we analyze a vmcore (or a memory snapshot of a virtualized machine).

Besides that, duplicating allocation state into task_struct will allow
the OOM killer (a task calling select_bad_process()) to check whether a
candidate is stuck
(e.g. http://lkml.kernel.org/r/[email protected] ),
compared to the current situation (i.e. checking whether the candidate
already has TIF_MEMDIE or not).

"[PATCH] mm/oom_kill.c: don't kill TASK_UNINTERRUPTIBLE tasks" tried to
judge it using the candidate's task state but was not accepted. I already
showed that it is not difficult to defeat the OOM reaper using the
"mmap_sem livelock case" and the "least memory consuming victim case".
We will eventually need to consider timeout-based selection of the next
OOM victim...

Even if we don't use a daemon, I think that duplicating allocation state
itself is helpful.

2015-12-16 15:11:39

by Tetsuo Handa

Subject: Re: [PATCH v4] mm,oom: Add memory allocation watchdog kernel thread.

Here is a different version.
Is this version better than creating a dedicated kernel thread?
----------
From c9f61902977b04d24d809d2b853a5682fc3c41e8 Mon Sep 17 00:00:00 2001
From: Tetsuo Handa <[email protected]>
Date: Thu, 17 Dec 2015 00:02:37 +0900
Subject: [PATCH draft] mm,oom: Add memory allocation watchdog kernel thread.

The oom_reaper kernel thread currently proposed by Michal Hocko tries to
reclaim additional memory by preemptively reaping the anonymous or
swapped-out memory owned by the OOM victim, under the assumption that
such memory won't be needed when its owner is killed and kicked from
userspace anyway, in the good hope that the preemptively reaped memory
is sufficient for the OOM victim to exit.

However, it has been shown by Tetsuo Handa that it is not that hard to
construct workloads which break this hope, and the OOM victim then takes
an unbounded amount of time to exit because oom_reaper does not choose
subsequent OOM victims.

Since currently we are not going to implement timeout based subsequent
OOM victim selection, this patch implements a watchdog for emitting
warning messages when memory allocation is stalling, in case oom_reaper
did not help. The khungtaskd kernel thread is reused for this purpose,
which saves one kernel thread and avoids printk() collisions between
hang task checking and memory allocation stall checking.

Signed-off-by: Tetsuo Handa <[email protected]>
---
include/linux/sched.h | 25 +++++
include/linux/sched/sysctl.h | 3 +
kernel/fork.c | 4 +
kernel/hung_task.c | 240 +++++++++++++++++++++++++++++++++++++++++--
kernel/sysctl.c | 10 ++
lib/Kconfig.debug | 24 +++++
mm/page_alloc.c | 33 ++++++
7 files changed, 330 insertions(+), 9 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 7b76e39..f903368 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1379,6 +1379,28 @@ struct tlbflush_unmap_batch {
bool writable;
};

+struct memalloc_info {
+ /* For locking and progress monitoring. */
+ unsigned int sequence;
+ /*
+ * 0: not doing __GFP_RECLAIM allocation.
+ * 1: doing non-recursive __GFP_RECLAIM allocation.
+ * 2: doing recursive __GFP_RECLAIM allocation.
+ */
+ u8 valid;
+ /*
+ * bit 0: Will be reported as OOM victim.
+ * bit 1: Will be reported as dying task.
+ * bit 2: Will be reported as stalling task.
+ */
+ u8 type;
+ /* Start time in jiffies, recorded when valid becomes 1. */
+ unsigned long start;
+ /* Requested order and gfp flags as of valid == 1. */
+ unsigned int order;
+ gfp_t gfp;
+};
+
struct task_struct {
volatile long state; /* -1 unrunnable, 0 runnable, >0 stopped */
void *stack;
@@ -1822,6 +1844,9 @@ struct task_struct {
unsigned long task_state_change;
#endif
int pagefault_disabled;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+ struct memalloc_info memalloc;
+#endif
/* CPU-specific state of this task */
struct thread_struct thread;
/*
diff --git a/include/linux/sched/sysctl.h b/include/linux/sched/sysctl.h
index c9e4731..fb3004a 100644
--- a/include/linux/sched/sysctl.h
+++ b/include/linux/sched/sysctl.h
@@ -9,6 +9,9 @@ extern int sysctl_hung_task_warnings;
extern int proc_dohung_task_timeout_secs(struct ctl_table *table, int write,
void __user *buffer,
size_t *lenp, loff_t *ppos);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+extern unsigned long sysctl_memalloc_task_timeout_secs;
+#endif
#else
/* Avoid need for ifdefs elsewhere in the code */
enum { sysctl_hung_task_timeout_secs = 0 };
diff --git a/kernel/fork.c b/kernel/fork.c
index 8cb287a..13a1b76 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1414,6 +1414,10 @@ static struct task_struct *copy_process(unsigned long clone_flags,
p->sequential_io_avg = 0;
#endif

+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+ p->memalloc.sequence = 0;
+#endif
+
/* Perform scheduler related setup. Assign this task to a CPU. */
retval = sched_fork(clone_flags, p);
if (retval)
diff --git a/kernel/hung_task.c b/kernel/hung_task.c
index e0f90c2..8550cc8 100644
--- a/kernel/hung_task.c
+++ b/kernel/hung_task.c
@@ -16,6 +16,7 @@
#include <linux/export.h>
#include <linux/sysctl.h>
#include <linux/utsname.h>
+#include <linux/console.h>
#include <trace/events/sched.h>

/*
@@ -72,6 +73,207 @@ static struct notifier_block panic_block = {
.notifier_call = hung_task_panic,
};

+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+/*
+ * Zero means infinite timeout - no checking done:
+ */
+unsigned long __read_mostly sysctl_memalloc_task_timeout_secs =
+ CONFIG_DEFAULT_MEMALLOC_TASK_TIMEOUT;
+static struct memalloc_info memalloc; /* Filled by is_stalling_task(). */
+
+static long memalloc_timeout_jiffies(unsigned long last_checked, long timeout)
+{
+ struct task_struct *g, *p;
+ long t;
+ unsigned long delta;
+
+ /* timeout of 0 will disable the watchdog */
+ if (!timeout)
+ return MAX_SCHEDULE_TIMEOUT;
+ /* Wait for at least the timeout duration. */
+ t = last_checked - jiffies + timeout * HZ;
+ if (t > 0)
+ return t;
+ /* Calculate how much longer to wait. */
+ t = timeout * HZ;
+ delta = t - jiffies; /* "start + delta" below is the remaining time. */
+
+ /*
+ * We might see outdated values in "struct memalloc_info" here.
+ * We will recheck later using is_stalling_task().
+ */
+ preempt_disable();
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ if (likely(!p->memalloc.valid))
+ continue;
+ t = min_t(long, t, p->memalloc.start + delta);
+ if (unlikely(t <= 0))
+ goto stalling;
+ }
+ stalling:
+ rcu_read_unlock();
+ preempt_enable();
+ return t;
+}
+
+/**
+ * is_stalling_task - Check and copy a task's memalloc variable.
+ *
+ * @task: A task to check.
+ * @expire: Timeout in jiffies.
+ *
+ * Returns true if a task is stalling, false otherwise.
+ */
+static bool is_stalling_task(const struct task_struct *task,
+ const unsigned long expire)
+{
+ const struct memalloc_info *m = &task->memalloc;
+
+ /*
+ * If start_memalloc_timer() is updating "struct memalloc_info" now,
+ * we can ignore this task, because the timeout cannot expire
+ * immediately after the update completes.
+ */
+ if (!m->valid || (m->sequence & 1))
+ return false;
+ smp_rmb(); /* Block start_memalloc_timer(). */
+ memalloc.start = m->start;
+ memalloc.order = m->order;
+ memalloc.gfp = m->gfp;
+ smp_rmb(); /* Unblock start_memalloc_timer(). */
+ memalloc.sequence = m->sequence;
+ /*
+ * If start_memalloc_timer() started updating it while we read it,
+ * we can ignore it for the same reason.
+ */
+ if (!m->valid || (memalloc.sequence & 1))
+ return false;
+ /* This is a valid "struct memalloc_info". Check for timeout. */
+ return time_after_eq(expire, memalloc.start);
+}
+
+/* Check for memory allocation stalls. */
+static void check_memalloc_stalling_tasks(unsigned long timeout)
+{
+ char buf[128];
+ struct task_struct *g, *p;
+ unsigned long now;
+ unsigned long expire;
+ unsigned int sigkill_pending;
+ unsigned int memdie_pending;
+ unsigned int stalling_tasks;
+
+ cond_resched();
+ now = jiffies;
+ /*
+ * Report tasks that have stalled for more than half of the timeout
+ * duration, because such tasks might be correlated with tasks that
+ * have already stalled for the full duration.
+ */
+ expire = now - timeout * (HZ / 2);
+ /* Count stalling tasks, dying and victim tasks. */
+ sigkill_pending = 0;
+ memdie_pending = 0;
+ stalling_tasks = 0;
+ preempt_disable();
+ rcu_read_lock();
+ for_each_process_thread(g, p) {
+ u8 type = 0;
+
+ if (test_tsk_thread_flag(p, TIF_MEMDIE)) {
+ type |= 1;
+ memdie_pending++;
+ }
+ if (fatal_signal_pending(p)) {
+ type |= 2;
+ sigkill_pending++;
+ }
+ if (is_stalling_task(p, expire)) {
+ type |= 4;
+ stalling_tasks++;
+ }
+ p->memalloc.type = type;
+ }
+ rcu_read_unlock();
+ preempt_enable();
+ if (!stalling_tasks)
+ return;
+ /* Report stalling tasks, dying and victim tasks. */
+ pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+ stalling_tasks, sigkill_pending, memdie_pending);
+ cond_resched();
+ preempt_disable();
+ rcu_read_lock();
+ restart_report:
+ for_each_process_thread(g, p) {
+ bool can_cont;
+ u8 type;
+
+ if (likely(!p->memalloc.type))
+ continue;
+ p->memalloc.type = 0;
+ /* Recheck in case state changed meanwhile. */
+ type = 0;
+ if (test_tsk_thread_flag(p, TIF_MEMDIE))
+ type |= 1;
+ if (fatal_signal_pending(p))
+ type |= 2;
+ if (is_stalling_task(p, expire)) {
+ type |= 4;
+ snprintf(buf, sizeof(buf),
+ " seq=%u gfp=0x%x order=%u delay=%lu",
+ memalloc.sequence >> 1, memalloc.gfp,
+ memalloc.order, now - memalloc.start);
+ } else {
+ buf[0] = '\0';
+ }
+ if (unlikely(!type))
+ continue;
+ /*
+ * Victim tasks have the pending SIGKILL removed before arriving at
+ * do_exit(). Therefore, print " exiting" instead of " dying".
+ */
+ pr_warn("MemAlloc: %s(%u)%s%s%s%s%s\n", p->comm, p->pid, buf,
+ (p->state & TASK_UNINTERRUPTIBLE) ?
+ " uninterruptible" : "",
+ (p->flags & PF_EXITING) ? " exiting" : "",
+ (type & 2) ? " dying" : "",
+ (type & 1) ? " victim" : "");
+ sched_show_task(p);
+ debug_show_held_locks(p);
+ /*
+ * Since there could be thousands of tasks to report, we always
+ * sleep and try to flush the printk() buffer after each report,
+ * in order to avoid RCU stalls and to reduce the possibility of
+ * messages being dropped by a continuous printk() flood.
+ *
+ * Since not-yet-reported tasks have p->memalloc.type > 0, we
+ * can simply restart this loop in case "g" or "p" went away.
+ */
+ get_task_struct(g);
+ get_task_struct(p);
+ rcu_read_unlock();
+ preempt_enable();
+ schedule_timeout_interruptible(1);
+ console_lock();
+ console_unlock();
+ preempt_disable();
+ rcu_read_lock();
+ can_cont = pid_alive(g) && pid_alive(p);
+ put_task_struct(p);
+ put_task_struct(g);
+ if (!can_cont)
+ goto restart_report;
+ }
+ rcu_read_unlock();
+ preempt_enable();
+ cond_resched();
+ /* Show memory information. (SysRq-m) */
+ show_mem(0);
+}
+#endif /* CONFIG_DETECT_MEMALLOC_STALL_TASK */
+
static void check_hung_task(struct task_struct *t, unsigned long timeout)
{
unsigned long switch_count = t->nvcsw + t->nivcsw;
@@ -185,10 +387,12 @@ static void check_hung_uninterruptible_tasks(unsigned long timeout)
rcu_read_unlock();
}

-static unsigned long timeout_jiffies(unsigned long timeout)
+static unsigned long hung_timeout_jiffies(long last_checked, long timeout)
{
/* timeout of 0 will disable the watchdog */
- return timeout ? timeout * HZ : MAX_SCHEDULE_TIMEOUT;
+ if (!timeout)
+ return MAX_SCHEDULE_TIMEOUT;
+ return last_checked - jiffies + timeout * HZ;
}

/*
@@ -224,18 +428,36 @@ EXPORT_SYMBOL_GPL(reset_hung_task_detector);
*/
static int watchdog(void *dummy)
{
+ unsigned long hung_last_checked = jiffies;
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+ unsigned long stall_last_checked = hung_last_checked;
+#endif
+
set_user_nice(current, 0);

for ( ; ; ) {
unsigned long timeout = sysctl_hung_task_timeout_secs;
-
- while (schedule_timeout_interruptible(timeout_jiffies(timeout)))
- timeout = sysctl_hung_task_timeout_secs;
-
- if (atomic_xchg(&reset_hung_task, 0))
+ long t = hung_timeout_jiffies(hung_last_checked, timeout);
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+ unsigned long timeout2 = sysctl_memalloc_task_timeout_secs;
+ long t2 = memalloc_timeout_jiffies(stall_last_checked,
+ timeout2);
+
+ if (t2 <= 0) {
+ check_memalloc_stalling_tasks(timeout2);
+ stall_last_checked = jiffies;
continue;
-
- check_hung_uninterruptible_tasks(timeout);
+ }
+#else
+ long t2 = t;
+#endif
+ if (t <= 0) {
+ if (!atomic_xchg(&reset_hung_task, 0))
+ check_hung_uninterruptible_tasks(timeout);
+ hung_last_checked = jiffies;
+ continue;
+ }
+ schedule_timeout_interruptible(min(t, t2));
}

return 0;
diff --git a/kernel/sysctl.c b/kernel/sysctl.c
index 0d6edb5..aac2a20 100644
--- a/kernel/sysctl.c
+++ b/kernel/sysctl.c
@@ -1061,6 +1061,16 @@ static struct ctl_table kern_table[] = {
.proc_handler = proc_dointvec_minmax,
.extra1 = &neg_one,
},
+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+ {
+ .procname = "memalloc_task_timeout_secs",
+ .data = &sysctl_memalloc_task_timeout_secs,
+ .maxlen = sizeof(unsigned long),
+ .mode = 0644,
+ .proc_handler = proc_dohung_task_timeout_secs,
+ .extra2 = &hung_task_timeout_max,
+ },
+#endif
#endif
#ifdef CONFIG_COMPAT
{
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index efa0f5f..3be59f9 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -820,6 +820,30 @@ config BOOTPARAM_HUNG_TASK_PANIC_VALUE
default 0 if !BOOTPARAM_HUNG_TASK_PANIC
default 1 if BOOTPARAM_HUNG_TASK_PANIC

+config DETECT_MEMALLOC_STALL_TASK
+ bool "Detect tasks stalling inside memory allocator"
+ default n
+ depends on DETECT_HUNG_TASK
+ help
+ This option emits warning messages and traces when memory
+ allocation requests are stalling, in order to catch unexplained
+ hangups/reboots caused by memory allocation stalls.
+
+config DEFAULT_MEMALLOC_TASK_TIMEOUT
+ int "Default timeout for stalling task detection (in seconds)"
+ depends on DETECT_MEMALLOC_STALL_TASK
+ default 10
+ help
+ This option controls the default timeout (in seconds) used
+ to determine when a task has become non-responsive and should
+ be considered stalling inside the memory allocator.
+
+ It can be adjusted at runtime by writing to
+ /proc/sys/kernel/memalloc_task_timeout_secs (i.e. the
+ kernel.memalloc_task_timeout_secs sysctl).
+
+ A timeout of 0 disables the check. The default is 10 seconds.
+
config WQ_WATCHDOG
bool "Detect Workqueue Stalls"
depends on DEBUG_KERNEL
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index bac8842d..a0cf4b3 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -3199,6 +3199,37 @@ got_pg:
return page;
}

+#ifdef CONFIG_DETECT_MEMALLOC_STALL_TASK
+static void start_memalloc_timer(const gfp_t gfp_mask, const int order)
+{
+ struct memalloc_info *m = &current->memalloc;
+
+ /* We don't check for stalls for !__GFP_RECLAIM allocations. */
+ if (!(gfp_mask & __GFP_RECLAIM))
+ return;
+ /* For nested __GFP_RECLAIM allocations, keep the outermost start time. */
+ if (!m->valid) {
+ m->sequence++;
+ smp_wmb(); /* Block is_stalling_task(). */
+ m->start = jiffies;
+ m->order = order;
+ m->gfp = gfp_mask;
+ smp_wmb(); /* Unblock is_stalling_task(). */
+ m->sequence++;
+ }
+ m->valid++;
+}
+
+static void stop_memalloc_timer(const gfp_t gfp_mask)
+{
+ if (gfp_mask & __GFP_RECLAIM)
+ current->memalloc.valid--;
+}
+#else
+#define start_memalloc_timer(gfp_mask, order) do { } while (0)
+#define stop_memalloc_timer(gfp_mask) do { } while (0)
+#endif
+
/*
* This is the 'heart' of the zoned buddy allocator.
*/
@@ -3266,7 +3297,9 @@ retry_cpuset:
alloc_mask = memalloc_noio_flags(gfp_mask);
ac.spread_dirty_pages = false;

+ start_memalloc_timer(alloc_mask, order);
page = __alloc_pages_slowpath(alloc_mask, order, &ac);
+ stop_memalloc_timer(alloc_mask);
}

if (kmemcheck_enabled && page)
--
1.8.3.1