2015-11-02 15:01:44

by Michal Hocko

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

On Fri 23-10-15 13:26:49, Tejun Heo wrote:
> Hello,
>
> So, something like the following. Just compile tested, but this is
> essentially a partial revert of 3270476a6c0c ("workqueue: reimplement
> WQ_HIGHPRI using a separate worker_pool") - resurrecting the old
> WQ_HIGHPRI implementation under WQ_IMMEDIATE, so we know this works.
> If, for some reason, it gets decided against simply adding a one jiffy
> sleep, please let me know. I'll verify the operation and post a
> proper patch. That said, given that this probably needs a -stable
> backport and vmstat is likely to be the only user (busy loops are
> really rare in the kernel after all), I think the better approach
> would be reinstating the short sleep.

As already pointed out, I really detest a short sleep and would prefer a
way to tell the WQ what we really need. vmstat is not the only user; OOM
sysrq will need this special treatment as well. While zone_reclaimable
can be fixed in an easy patch
(http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp)
which is perfectly suited for the stable backport, OOM sysrq, and indeed
any sysrq which runs from WQ context, should be as robust as possible and
shouldn't rely on all the code running from WQ context to issue a sleep
to get unstuck. So I definitely support something like this patch.

I am still not sure whether other WQ_MEM_RECLAIM users need this flag as
well, because I am not familiar with their implementations, but at least
vmstat and sysrq should use it, and they should be safe to do so without
risk of breaking anything AFAICS.

Thanks!
--
Michal Hocko
SUSE Labs


2015-11-02 19:20:58

by Tejun Heo

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

On Mon, Nov 02, 2015 at 04:01:37PM +0100, Michal Hocko wrote:
...
> which is perfectly suited for the stable backport, OOM sysrq, and indeed
> any sysrq which runs from WQ context, should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

Well, sysrq wouldn't run successfully either on a cpu which is busy
looping with preemption off. I don't think this calls for a new flag to
modify workqueue behavior, especially given that missing such a flag
would lead to the same kind of lockup. It's a shitty solution. If
the possibility of sysrq getting stuck behind concurrency management
is an issue, queueing them on an unbound or highpri workqueue should
be good enough.
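
For instance, something like this should do (untested sketch; workers of
system_unbound_wq are not concurrency-managed, so a busy-looping work
item elsewhere can't delay the OOM work):

	/* moom_work itself stays as-is; only the target workqueue changes. */
	static void sysrq_handle_moom(int key)
	{
		queue_work(system_unbound_wq, &moom_work);
	}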

Thanks.

--
tejun

2015-11-03 02:32:13

by Tetsuo Handa

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

Tejun Heo wrote:
> If
> the possibility of sysrq getting stuck behind concurrency management
> is an issue, queueing them on an unbound or highpri workqueue should
> be good enough.

Regarding SysRq-f, we could do something like below. Though I think that
converting the OOM killer into a dedicated kernel thread would allow us
to do more things (e.g. Oleg's memory zapping code, my timeout based next
victim selection).

diff --git a/drivers/tty/sysrq.c b/drivers/tty/sysrq.c
index 5381a72..46b951aa 100644
--- a/drivers/tty/sysrq.c
+++ b/drivers/tty/sysrq.c
@@ -47,6 +47,7 @@
 #include <linux/syscalls.h>
 #include <linux/of.h>
 #include <linux/rcupdate.h>
+#include <linux/kthread.h>
 
 #include <asm/ptrace.h>
 #include <asm/irq_regs.h>
@@ -351,27 +352,35 @@ static struct sysrq_key_op sysrq_term_op = {
 	.enable_mask	= SYSRQ_ENABLE_SIGNAL,
 };
 
-static void moom_callback(struct work_struct *ignored)
+static DECLARE_WAIT_QUEUE_HEAD(moom_wait);
+
+static int moom_callback(void *unused)
 {
 	const gfp_t gfp_mask = GFP_KERNEL;
-	struct oom_control oc = {
-		.zonelist = node_zonelist(first_memory_node, gfp_mask),
-		.nodemask = NULL,
-		.gfp_mask = gfp_mask,
-		.order = -1,
-	};
-
-	mutex_lock(&oom_lock);
-	if (!out_of_memory(&oc))
-		pr_info("OOM request ignored because killer is disabled\n");
-	mutex_unlock(&oom_lock);
+	DEFINE_WAIT(wait);
+
+	while (1) {
+		struct oom_control oc = {
+			.zonelist = node_zonelist(first_memory_node, gfp_mask),
+			.nodemask = NULL,
+			.gfp_mask = gfp_mask,
+			.order = -1,
+		};
+
+		prepare_to_wait(&moom_wait, &wait, TASK_INTERRUPTIBLE);
+		schedule();
+		finish_wait(&moom_wait, &wait);
+		mutex_lock(&oom_lock);
+		if (!out_of_memory(&oc))
+			pr_info("OOM request ignored because killer is disabled\n");
+		mutex_unlock(&oom_lock);
+	}
+	return 0;
 }
 
-static DECLARE_WORK(moom_work, moom_callback);
-
 static void sysrq_handle_moom(int key)
 {
-	schedule_work(&moom_work);
+	wake_up(&moom_wait);
 }
 static struct sysrq_key_op sysrq_moom_op = {
 	.handler	= sysrq_handle_moom,
@@ -1116,6 +1125,9 @@ static inline void sysrq_init_procfs(void)
 
 static int __init sysrq_init(void)
 {
+	struct task_struct *task = kthread_run(moom_callback, NULL,
+					       "manual_oom");
+	BUG_ON(IS_ERR(task));
 	sysrq_init_procfs();
 
 	if (sysrq_on())

2015-11-03 19:43:31

by Tejun Heo

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

Hello, Tetsuo.

On Tue, Nov 03, 2015 at 11:32:06AM +0900, Tetsuo Handa wrote:
> Tejun Heo wrote:
> > If
> > the possibility of sysrq getting stuck behind concurrency management
> > is an issue, queueing them on an unbound or highpri workqueue should
> > be good enough.
>
> Regarding SysRq-f, we could do something like below. Though I think that
> converting the OOM killer into a dedicated kernel thread would allow us
> to do more things (e.g. Oleg's memory zapping code, my timeout based next
> victim selection).

I'm not sure doing anything to sysrq-f is warranted. If workqueue
can't make forward progress due to memory exhaustion, OOM will be
triggered anyway. Getting stuck behind concurrency management isn't
that different a failure mode from getting stuck behind a busy loop with
preemption off. We should just plug them at the source. If
necessary, what we can do is add a stall watchdog (which can probably be
combined with the usual watchdog) so that it can better point out the
culprit.

Thanks.

--
tejun

2015-11-05 15:00:00

by Tetsuo Handa

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

Michal Hocko wrote:
> As already pointed out, I really detest a short sleep and would prefer a
> way to tell the WQ what we really need. vmstat is not the only user; OOM
> sysrq will need this special treatment as well. While zone_reclaimable
> can be fixed in an easy patch
> (http://lkml.kernel.org/r/201510212126.JIF90648.HOOFJVFQLMStOF%40I-love.SAKURA.ne.jp)
> which is perfectly suited for the stable backport, OOM sysrq, and indeed
> any sysrq which runs from WQ context, should be as robust as possible and
> shouldn't rely on all the code running from WQ context to issue a sleep
> to get unstuck. So I definitely support something like this patch.

I still prefer a short sleep, though from a different perspective.

I tested the above patch with the patch below applied

----------------------------------------
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index d0499ff..54bedd8 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -2992,6 +2992,53 @@ static inline bool is_thp_gfp_mask(gfp_t gfp_mask)
 	return (gfp_mask & (GFP_TRANSHUGE | __GFP_KSWAPD_RECLAIM)) == GFP_TRANSHUGE;
 }
 
+static atomic_t stall_tasks;
+
+static int kmallocwd(void *unused)
+{
+	struct task_struct *g, *p;
+	unsigned int sigkill_pending;
+	unsigned int memdie_pending;
+	unsigned int stalling_tasks;
+
+ not_stalling: /* Healthy case. */
+	schedule_timeout_interruptible(HZ);
+	if (likely(!atomic_read(&stall_tasks)))
+		goto not_stalling;
+ maybe_stalling: /* Maybe something is wrong. Let's check. */
+	/* Count stalling tasks, dying and victim tasks. */
+	sigkill_pending = 0;
+	memdie_pending = 0;
+	stalling_tasks = atomic_read(&stall_tasks);
+	preempt_disable();
+	rcu_read_lock();
+	for_each_process_thread(g, p) {
+		if (test_tsk_thread_flag(p, TIF_MEMDIE))
+			memdie_pending++;
+		if (fatal_signal_pending(p))
+			sigkill_pending++;
+	}
+	rcu_read_unlock();
+	preempt_enable();
+	pr_warn("MemAlloc-Info: %u stalling task, %u dying task, %u victim task.\n",
+		stalling_tasks, sigkill_pending, memdie_pending);
+	show_workqueue_state();
+	schedule_timeout_interruptible(10 * HZ);
+	if (atomic_read(&stall_tasks))
+		goto maybe_stalling;
+	goto not_stalling;
+	return 0; /* To suppress "no return statement" compiler warning. */
+}
+
+static int __init start_kmallocwd(void)
+{
+	struct task_struct *task = kthread_run(kmallocwd, NULL,
+					       "kmallocwd");
+	BUG_ON(IS_ERR(task));
+	return 0;
+}
+late_initcall(start_kmallocwd);
+
 static inline struct page *
 __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 			struct alloc_context *ac)
@@ -3004,6 +3051,8 @@ __alloc_pages_slowpath(gfp_t gfp_mask, unsigned int order,
 	enum migrate_mode migration_mode = MIGRATE_ASYNC;
 	bool deferred_compaction = false;
 	int contended_compaction = COMPACT_CONTENDED_NONE;
+	unsigned long start = jiffies;
+	bool stall_counted = false;
 
 	/*
 	 * In the slowpath, we sanity check order to avoid ever trying to
@@ -3095,6 +3144,11 @@ retry:
 	if (test_thread_flag(TIF_MEMDIE) && !(gfp_mask & __GFP_NOFAIL))
 		goto nopage;
 
+	if (!stall_counted && time_after(jiffies, start + 10 * HZ)) {
+		atomic_inc(&stall_tasks);
+		stall_counted = true;
+	}
+
 	/*
 	 * Try direct compaction. The first pass is asynchronous. Subsequent
 	 * attempts after direct reclaim are synchronous
@@ -3188,6 +3242,8 @@ noretry:
 nopage:
 	warn_alloc_failed(gfp_mask, order, NULL);
 got_pg:
+	if (stall_counted)
+		atomic_dec(&stall_tasks);
 	return page;
 }
----------------------------------------

using a crazy stress program. (Not a TIF_MEMDIE stall.)

----------------------------------------
#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <signal.h>
#include <fcntl.h>

static void child(void)
{
	char *buf = NULL;
	unsigned long size = 0;
	const int fd = open("/dev/zero", O_RDONLY);
	for (size = 1048576; size < 512UL * (1 << 30); size <<= 1) {
		char *cp = realloc(buf, size);
		if (!cp) {
			size >>= 1;
			break;
		}
		buf = cp;
	}
	read(fd, buf, size); /* Will cause OOM due to overcommit */
}

int main(int argc, char *argv[])
{
	if (argc > 1) {
		int i;
		char buffer[4096];
		for (i = 0; i < 1000; i++) {
			if (fork() == 0) {
				sleep(20);
				memset(buffer, 0, sizeof(buffer));
				_exit(0);
			}
		}
		child();
		return 0;
	}
	signal(SIGCLD, SIG_IGN);
	while (1) {
		switch (fork()) {
		case 0:
			execl("/proc/self/exe", argv[0], "1", NULL);
			_exit(0);
		case -1:
			sleep(1);
		}
	}
	return 0;
}
----------------------------------------

Note the intervals between OOM killer invocations.
(Complete log is at http://I-love.SAKURA.ne.jp/tmp/serial-20151105.txt.xz .)
----------------------------------------
[ 74.260621] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[ 75.069510] exe invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
[ 79.062507] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[ 80.464618] MemAlloc-Info: 459 stalling task, 0 dying task, 0 victim task.
[ 90.482731] MemAlloc-Info: 699 stalling task, 0 dying task, 0 victim task.
[ 100.503633] MemAlloc-Info: 3972 stalling task, 0 dying task, 0 victim task.
[ 110.534937] MemAlloc-Info: 4097 stalling task, 0 dying task, 0 victim task.
[ 120.535740] MemAlloc-Info: 4098 stalling task, 0 dying task, 0 victim task.
[ 130.563961] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 140.593108] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
[ 150.617960] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
[ 160.639131] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 170.659915] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 172.597736] exe invoked oom-killer: gfp_mask=0x24280ca, order=0, oom_score_adj=0
[ 180.680650] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 190.705534] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 200.724567] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 210.745397] MemAlloc-Info: 4065 stalling task, 0 dying task, 0 victim task.
[ 220.769501] MemAlloc-Info: 4092 stalling task, 0 dying task, 0 victim task.
[ 230.791530] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 240.816711] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 250.836724] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 260.860257] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 270.883573] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 280.910072] MemAlloc-Info: 4088 stalling task, 0 dying task, 0 victim task.
[ 290.931988] MemAlloc-Info: 4092 stalling task, 0 dying task, 0 victim task.
[ 300.955543] MemAlloc-Info: 4099 stalling task, 0 dying task, 0 victim task.
[ 308.212307] exe invoked oom-killer: gfp_mask=0x24200ca, order=0, oom_score_adj=0
[ 310.977057] MemAlloc-Info: 3988 stalling task, 0 dying task, 0 victim task.
[ 320.999353] MemAlloc-Info: 4096 stalling task, 0 dying task, 0 victim task.
----------------------------------------

See? The memory allocation requests cannot constantly invoke the OOM killer,
because the CPU cycles wasted in the sleep-less retry loop come close to
mutually blocking other tasks once the number of tasks doing memory
allocation requests exceeds the number of available CPUs. We should be
careful not to defer invocation of the OOM killer too much.

If a short sleep patch
( http://lkml.kernel.org/r/[email protected] )
is applied in addition to the above patches, the memory allocation requests
can constantly invoke the OOM-killer.

By using a short sleep, some task might be able to do some useful
computation which does not involve a __GFP_WAIT memory allocation.

We don't need to defer workqueue items which do not involve a __GFP_WAIT
memory allocation. By allowing workqueue items to be processed (by using
a short sleep), some task might release memory when its workqueue item is
processed.

Therefore, not only to keep the vmstat counters up to date, but also to
avoid wasting CPU cycles, I prefer a short sleep.
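
The shape of that change is roughly as follows (just a sketch; the actual
short sleep patch is at the link above):

	/*
	 * In the allocator retry path: instead of spinning through
	 * cond_resched(), give the CPU up for one jiffy so that pending
	 * workqueue items (vmstat updates, work items which might free
	 * memory) actually get a chance to run.
	 */
	schedule_timeout_uninterruptible(1);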

2015-11-05 17:45:42

by Christoph Lameter

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

On Thu, 5 Nov 2015, Tetsuo Handa wrote:

> memory allocation. By allowing workqueue items to be processed (by using
> a short sleep), some task might release memory when its workqueue item is
> processed.
>
> Therefore, not only to keep the vmstat counters up to date, but also to
> avoid wasting CPU cycles, I prefer a short sleep.

Sorry, but we need workqueue processing for the vmstat counters that is
independent of other submitted requests that may block. Adding sleep /
schedule points everywhere to accomplish this is not the right approach.

2015-11-06 00:16:56

by Tejun Heo

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

Hello,

On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> Sorry, but we need workqueue processing for the vmstat counters that is

I made this analogy before, but this is similar to looping with
preemption off. If anything on a workqueue stays RUNNING w/o making
forward progress, it's buggy. I'd venture to say that any code which busy
loops without making forward progress on a time scale noticeable to
human beings is borderline buggy too. If things need to be retried on
that time scale, putting in a short sleep between trials is a sensible
thing to do. There's no point in occupying the cpu and burning cycles
without making forward progress.
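
In other words, the sane pattern is something like the following, where
make_progress() is just a stand-in for whatever is being retried:

	/* Busy-waiting in the RUNNING state starves workqueue concurrency
	 * management and burns cycles for nothing; sleeping a jiffy
	 * between trials lets everyone else run. */
	while (!make_progress())
		schedule_timeout_interruptible(1);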

These things actually matter. The freezer used to burn cycles this way
and was really good at burning off the last remaining battery reserve
during emergency hibernation if freezing took some amount of time.

It is true that, as it currently stands, this is error-prone, as
workqueue can't detect these conditions and warn about them. The same
goes for workqueues which sit in the memory reclaim path but forget
WQ_MEM_RECLAIM. I'm going to add lockup detection similar to the
softlockup detector, but that's a different issue, so please update the
code.

Thanks.

--
tejun

2015-11-11 15:44:30

by Michal Hocko

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

On Thu 05-11-15 19:16:48, Tejun Heo wrote:
> Hello,
>
> On Thu, Nov 05, 2015 at 11:45:42AM -0600, Christoph Lameter wrote:
> > Sorry, but we need workqueue processing for the vmstat counters that is
>
> I made this analogy before, but this is similar to looping with
> preemption off. If anything on a workqueue stays RUNNING w/o making
> forward progress, it's buggy. I'd venture to say that any code which busy
> loops without making forward progress on a time scale noticeable to
> human beings is borderline buggy too.

Well, the caller asked for memory but the request cannot succeed. Due
to the memory allocator semantics we cannot fail the request, so we have
to loop. If we had an event to wait for we would do so, of course.

Now wrt. the small sleep. We used to do that and called
congestion_wait(HZ/50) before retrying. This proved to cause stalls
during high memory pressure; see 0e093d99763e ("writeback: do not sleep
on the congestion queue if there are no congested BDIs or if significant
congestion is not being encountered in the current zone"). I do not
really remember what CONFIG_HZ was in those reports, but it is quite
possible it was 250. So there is a risk of (partially) re-introducing
those stalls with the patch from Tetsuo
(http://lkml.kernel.org/r/[email protected]).

If we really have to do a short sleep, though, then I would suggest
sticking it into wait_iff_congested rather than spreading it into more
places, and reducing it only to worker threads. This should be much
safer. Thoughts?
---
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
--
Michal Hocko
SUSE Labs

2015-11-11 16:03:41

by Michal Hocko

Subject: Re: [PATCH] mm,vmscan: Use accurate values for zone_reclaimable() checks

With the full changelog and the vmstat update, for reference.
---
From 9492966a552751e6d7a63e9aafb87e35992b840a Mon Sep 17 00:00:00 2001
From: Michal Hocko <[email protected]>
Date: Wed, 11 Nov 2015 16:45:53 +0100
Subject: [PATCH] mm, vmstat: Allow WQ concurrency to discover memory reclaim
doesn't make any progress

Tetsuo Handa has reported that the system might basically livelock in an
OOM condition without triggering the OOM killer. The issue is caused by
an internal dependency of the direct reclaim on vmstat counter updates
(via zone_reclaimable) which are performed from the workqueue context.
If all the current workers get assigned to an allocation request,
though, they will be looping inside the allocator trying to reclaim
memory, but zone_reclaimable can see stalled numbers, so it will consider
a zone reclaimable even though it has been scanned way too much. The WQ
concurrency logic will not consider this situation a congested workqueue,
because it relies on the worker having to sleep in such a situation.
This also means that it doesn't try to spawn new workers or invoke
the rescuer thread if one is assigned to the queue.

In order to fix this issue we need to do two things. First, we have to
let the wq concurrency code know that we are in trouble, so we have to do
a short sleep. In order to prevent the issues handled by 0e093d99763e
("writeback: do not sleep on the congestion queue if there are no
congested BDIs or if significant congestion is not being encountered in
the current zone") we limit the sleep only to worker threads, which are
the ones of interest anyway.

The second thing to do is to create a dedicated workqueue for vmstat and
mark it WQ_MEM_RECLAIM to note that it participates in reclaim and to
have a spare worker thread for it.

Reported-by: Tetsuo Handa <[email protected]>
Signed-off-by: Michal Hocko <[email protected]>
---
 mm/backing-dev.c | 19 ++++++++++++++++---
 mm/vmstat.c      |  6 ++++--
 2 files changed, 20 insertions(+), 5 deletions(-)

diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 8ed2ffd963c5..7340353f8aea 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -957,8 +957,9 @@ EXPORT_SYMBOL(congestion_wait);
  * jiffies for either a BDI to exit congestion of the given @sync queue
  * or a write to complete.
  *
- * In the absence of zone congestion, cond_resched() is called to yield
- * the processor if necessary but otherwise does not sleep.
+ * In the absence of zone congestion, a short sleep or a cond_resched is
+ * performed to yield the processor and to allow other subsystems to make
+ * a forward progress.
  *
  * The return value is 0 if the sleep is for the full timeout. Otherwise,
  * it is the number of jiffies that were still remaining when the function
@@ -978,7 +979,19 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout)
 	 */
 	if (atomic_read(&nr_wb_congested[sync]) == 0 ||
 	    !test_bit(ZONE_CONGESTED, &zone->flags)) {
-		cond_resched();
+
+		/*
+		 * Memory allocation/reclaim might be called from a WQ
+		 * context and the current implementation of the WQ
+		 * concurrency control doesn't recognize that a particular
+		 * WQ is congested if the worker thread is looping without
+		 * ever sleeping. Therefore we have to do a short sleep
+		 * here rather than calling cond_resched().
+		 */
+		if (current->flags & PF_WQ_WORKER)
+			schedule_timeout(1);
+		else
+			cond_resched();
 
 		/* In case we scheduled, work out time remaining */
 		ret = timeout - (jiffies - start);
diff --git a/mm/vmstat.c b/mm/vmstat.c
index 45dcbcb5c594..0975da8e3432 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -1381,6 +1381,7 @@ static const struct file_operations proc_vmstat_file_operations = {
 #endif /* CONFIG_PROC_FS */
 
 #ifdef CONFIG_SMP
+static struct workqueue_struct *vmstat_wq;
 static DEFINE_PER_CPU(struct delayed_work, vmstat_work);
 int sysctl_stat_interval __read_mostly = HZ;
 static cpumask_var_t cpu_stat_off;
@@ -1393,7 +1394,7 @@ static void vmstat_update(struct work_struct *w)
 		 * to occur in the future. Keep on running the
 		 * update worker thread.
 		 */
-		schedule_delayed_work_on(smp_processor_id(),
+		queue_delayed_work_on(smp_processor_id(), vmstat_wq,
 			this_cpu_ptr(&vmstat_work),
 			round_jiffies_relative(sysctl_stat_interval));
 	} else {
@@ -1462,7 +1463,7 @@ static void vmstat_shepherd(struct work_struct *w)
 		if (need_update(cpu) &&
 		    cpumask_test_and_clear_cpu(cpu, cpu_stat_off))
 
-			schedule_delayed_work_on(cpu,
+			queue_delayed_work_on(cpu, vmstat_wq,
 				&per_cpu(vmstat_work, cpu), 0);
 
 	put_online_cpus();
@@ -1551,6 +1552,7 @@ static int __init setup_vmstat(void)
 
 	start_shepherd_timer();
 	cpu_notifier_register_done();
+	vmstat_wq = alloc_workqueue("vmstat", WQ_FREEZABLE|WQ_MEM_RECLAIM, 0);
 #endif
 #ifdef CONFIG_PROC_FS
 	proc_create("buddyinfo", S_IRUGO, NULL, &fragmentation_file_operations);
--
2.6.2

--
Michal Hocko
SUSE Labs