Hello,
There haven't been too many workqueue stall bugs; however, a good
number of them have been pretty painful to track down because there's
no lockup detection mechanism for workqueues and it isn't easy to tell
what's going on with them; furthermore, some requirements are tricky
to get right - e.g. it's not too difficult to miss WQ_MEM_RECLAIM for
a workqueue which runs a work item that gets flushed by something
sitting in the reclaim path.
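For illustration, here is a minimal sketch of how a missing
WQ_MEM_RECLAIM can turn into a silent deadlock under memory pressure.
All names below are made up and not taken from any of the reported
bugs; this is just the failure pattern, not code from the series.

#include <linux/errno.h>
#include <linux/init.h>
#include <linux/workqueue.h>

/* sketch only - hypothetical names */
static struct workqueue_struct *cleanup_wq;     /* oops: no WQ_MEM_RECLAIM */

static void cleanup_fn(struct work_struct *work)
{
        /* releases resources the writeback path is waiting for */
}
static DECLARE_WORK(cleanup_work, cleanup_fn);

/* runs on a WQ_MEM_RECLAIM workqueue and may be flushed during reclaim */
static void writeback_fn(struct work_struct *work)
{
        queue_work(cleanup_wq, &cleanup_work);
        /*
         * cleanup_wq has no rescuer.  Under memory pressure a worker may
         * never be created for it, the flush never completes and the
         * reclaim path deadlocks silently.
         */
        flush_work(&cleanup_work);
}
/* queued on a WQ_MEM_RECLAIM workqueue elsewhere */
static DECLARE_WORK(writeback_work, writeback_fn);

static int __init example_init(void)
{
        /* should have been alloc_workqueue("cleanup", WQ_MEM_RECLAIM, 0) */
        cleanup_wq = alloc_workqueue("cleanup", 0, 0);
        return cleanup_wq ? 0 : -ENOMEM;
}

The lockup detector below would at least surface the stuck pool when
this happens instead of leaving the system wedged silently.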
To alleviate the situation, this two-patch series implements a
workqueue lockup detector. Each worker_pool tracks the last time it
made forward progress, and if no forward progress is made for longer
than the threshold, it triggers warnings and dumps workqueue state.
It's controlled together with the scheduler softlockup mechanism and
uses the same threshold value, as the two share a lot of
characteristics.
Thanks.
------ 8< ------
touch_softlockup_watchdog() is used to tell the watchdog that a
scheduler stall is expected. One group of usages is from paths where
the task may not be able to yield for a long time, such as performing
slow PIO to a finicky device or coming out of suspend. The other is to
account for the scheduler and timer going idle.
For scheduler softlockup detection, there's no reason to distinguish
the two cases; however, a workqueue lockup detector is planned and it
can use the same signals from the former group, while the latter would
spuriously prevent detection. This patch introduces a new function,
touch_softlockup_watchdog_sched(), and converts the latter group to
call it instead. For now, it just calls touch_softlockup_watchdog()
and there's no functional difference.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/sched.h | 4 ++++
kernel/sched/clock.c | 2 +-
kernel/time/tick-sched.c | 6 +++---
kernel/watchdog.c | 15 ++++++++++++++-
4 files changed, 22 insertions(+), 5 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -377,6 +377,7 @@ extern void scheduler_tick(void);
extern void sched_show_task(struct task_struct *p);
#ifdef CONFIG_LOCKUP_DETECTOR
+extern void touch_softlockup_watchdog_sched(void);
extern void touch_softlockup_watchdog(void);
extern void touch_softlockup_watchdog_sync(void);
extern void touch_all_softlockup_watchdogs(void);
@@ -387,6 +388,9 @@ extern unsigned int softlockup_panic;
extern unsigned int hardlockup_panic;
void lockup_detector_init(void);
#else
+static inline void touch_softlockup_watchdog_sched(void)
+{
+}
static inline void touch_softlockup_watchdog(void)
{
}
--- a/kernel/sched/clock.c
+++ b/kernel/sched/clock.c
@@ -354,7 +354,7 @@ void sched_clock_idle_wakeup_event(u64 d
return;
sched_clock_tick();
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sched();
}
EXPORT_SYMBOL_GPL(sched_clock_idle_wakeup_event);
--- a/kernel/time/tick-sched.c
+++ b/kernel/time/tick-sched.c
@@ -143,7 +143,7 @@ static void tick_sched_handle(struct tic
* when we go busy again does not account too much ticks.
*/
if (ts->tick_stopped) {
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sched();
if (is_idle_task(current))
ts->idle_jiffies++;
}
@@ -430,7 +430,7 @@ static void tick_nohz_update_jiffies(kti
tick_do_update_jiffies64(now);
local_irq_restore(flags);
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sched();
}
/*
@@ -701,7 +701,7 @@ static void tick_nohz_restart_sched_tick
update_cpu_load_nohz();
calc_load_exit_idle();
- touch_softlockup_watchdog();
+ touch_softlockup_watchdog_sched();
/*
* Cancel the scheduled timer and restore the tick
*/
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -225,7 +225,15 @@ static void __touch_watchdog(void)
__this_cpu_write(watchdog_touch_ts, get_timestamp());
}
-void touch_softlockup_watchdog(void)
+/**
+ * touch_softlockup_watchdog_sched - touch watchdog on scheduler stalls
+ *
+ * Call when the scheduler may have stalled for legitimate reasons
+ * preventing the watchdog task from executing - e.g. the scheduler
+ * entering idle state. This should only be used for scheduler events.
+ * Use touch_softlockup_watchdog() for everything else.
+ */
+void touch_softlockup_watchdog_sched(void)
{
/*
* Preemption can be enabled. It doesn't matter which CPU's timestamp
@@ -233,6 +241,11 @@ void touch_softlockup_watchdog(void)
*/
raw_cpu_write(watchdog_touch_ts, 0);
}
+
+void touch_softlockup_watchdog(void)
+{
+ touch_softlockup_watchdog_sched();
+}
EXPORT_SYMBOL(touch_softlockup_watchdog);
void touch_all_softlockup_watchdogs(void)
Workqueue stalls can happen from a variety of usage bugs such as a
missing WQ_MEM_RECLAIM flag or a concurrency-managed work item staying
RUNNING indefinitely. These stalls can be extremely difficult to hunt
down because the usual warning mechanisms can't detect workqueue
stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It monitors all worker_pools periodically and, if any pool
fails to make forward progress for longer than the threshold duration,
triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is enabled/disabled together with the scheduler
softlockup watchdog and uses the same threshold value; however, it
currently doesn't trigger a panic. We can do that later once the
detection mechanism proves to be reliable.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Andrew Morton <[email protected]>
---
include/linux/sched.h | 1
include/linux/workqueue.h | 6 +
kernel/watchdog.c | 15 +++-
kernel/workqueue.c | 146 +++++++++++++++++++++++++++++++++++++++++++++-
4 files changed, 162 insertions(+), 6 deletions(-)
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -381,6 +381,7 @@ extern void touch_softlockup_watchdog_sc
extern void touch_softlockup_watchdog(void);
extern void touch_softlockup_watchdog_sync(void);
extern void touch_all_softlockup_watchdogs(void);
+extern int get_softlockup_thresh(void);
extern int proc_dowatchdog_thresh(struct ctl_table *table, int write,
void __user *buffer,
size_t *lenp, loff_t *ppos);
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -618,4 +618,10 @@ static inline int workqueue_sysfs_regist
{ return 0; }
#endif /* CONFIG_SYSFS */
+#ifdef CONFIG_LOCKUP_DETECTOR
+void enable_workqueue_watchdog(void);
+void disable_workqueue_watchdog(void);
+void touch_workqueue_watchdog(int cpu);
+#endif /* CONFIG_LOCKUP_DETECTOR */
+
#endif
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -20,6 +20,7 @@
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
#include <linux/tick.h>
+#include <linux/workqueue.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -192,7 +193,7 @@ __setup("hardlockup_all_cpu_backtrace=",
* the thresholds with a factor: we make the soft threshold twice the amount of
* time the hard threshold is.
*/
-static int get_softlockup_thresh(void)
+int get_softlockup_thresh(void)
{
return watchdog_thresh * 2;
}
@@ -245,6 +246,7 @@ void touch_softlockup_watchdog_sched(voi
void touch_softlockup_watchdog(void)
{
touch_softlockup_watchdog_sched();
+ touch_workqueue_watchdog(raw_smp_processor_id());
}
EXPORT_SYMBOL(touch_softlockup_watchdog);
@@ -259,6 +261,7 @@ void touch_all_softlockup_watchdogs(void
*/
for_each_watchdog_cpu(cpu)
per_cpu(watchdog_touch_ts, cpu) = 0;
+ touch_workqueue_watchdog(-1);
}
#ifdef CONFIG_HARDLOCKUP_DETECTOR
@@ -795,13 +798,18 @@ static int watchdog_enable_all_cpus(void
{
int err = 0;
+ disable_workqueue_watchdog();
+
if (!watchdog_running) {
err = smpboot_register_percpu_thread_cpumask(&watchdog_threads,
&watchdog_cpumask);
- if (err)
+ if (err) {
pr_err("Failed to create watchdog threads, disabled\n");
- else
+ } else {
+ if (watchdog_enabled & SOFT_WATCHDOG_ENABLED)
+ enable_workqueue_watchdog();
watchdog_running = 1;
+ }
} else {
/*
* Enable/disable the lockup detectors or
@@ -826,6 +834,7 @@ static void watchdog_disable_all_cpus(vo
if (watchdog_running) {
watchdog_running = 0;
smpboot_unregister_percpu_thread(&watchdog_threads);
+ disable_workqueue_watchdog();
}
}
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -148,6 +148,8 @@ struct worker_pool {
int id; /* I: pool ID */
unsigned int flags; /* X: flags */
+ unsigned long watchdog_ts; /* L: watchdog timestamp */
+
struct list_head worklist; /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -1083,6 +1085,8 @@ static void pwq_activate_delayed_work(st
struct pool_workqueue *pwq = get_work_pwq(work);
trace_workqueue_activate_work(work);
+ if (list_empty(&pwq->pool->worklist))
+ pwq->pool->watchdog_ts = jiffies;
move_linked_works(work, &pwq->pool->worklist, NULL);
__clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
pwq->nr_active++;
@@ -1385,6 +1389,8 @@ retry:
trace_workqueue_activate_work(work);
pwq->nr_active++;
worklist = &pwq->pool->worklist;
+ if (list_empty(worklist))
+ pwq->pool->watchdog_ts = jiffies;
} else {
work_flags |= WORK_STRUCT_DELAYED;
worklist = &pwq->delayed_works;
@@ -2157,6 +2163,8 @@ recheck:
list_first_entry(&pool->worklist,
struct work_struct, entry);
+ pool->watchdog_ts = jiffies;
+
if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
/* optimization path, not strictly necessary */
process_one_work(worker, work);
@@ -2240,6 +2248,7 @@ repeat:
struct pool_workqueue, mayday_node);
struct worker_pool *pool = pwq->pool;
struct work_struct *work, *n;
+ bool first = true;
__set_current_state(TASK_RUNNING);
list_del_init(&pwq->mayday_node);
@@ -2256,9 +2265,14 @@ repeat:
* process'em.
*/
WARN_ON_ONCE(!list_empty(scheduled));
- list_for_each_entry_safe(work, n, &pool->worklist, entry)
- if (get_work_pwq(work) == pwq)
+ list_for_each_entry_safe(work, n, &pool->worklist, entry) {
+ if (get_work_pwq(work) == pwq) {
+ if (first)
+ pool->watchdog_ts = jiffies;
move_linked_works(work, scheduled, &n);
+ }
+ first = false;
+ }
if (!list_empty(scheduled)) {
process_scheduled_works(rescuer);
@@ -3069,6 +3083,7 @@ static int init_worker_pool(struct worke
pool->cpu = -1;
pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
+ pool->watchdog_ts = jiffies;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
hash_init(pool->busy_hash);
@@ -4308,7 +4323,9 @@ void show_workqueue_state(void)
pr_info("pool %d:", pool->id);
pr_cont_pool_info(pool);
- pr_cont(" workers=%d", pool->nr_workers);
+ pr_cont(" hung=%us workers=%d",
+ jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000,
+ pool->nr_workers);
if (pool->manager)
pr_cont(" manager: %d",
task_pid_nr(pool->manager->task));
@@ -5167,6 +5184,129 @@ static void workqueue_sysfs_unregister(s
static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { }
#endif /* CONFIG_SYSFS */
+/*
+ * Workqueue watchdog.
+ *
+ * Stall may be caused by various bugs - missing WQ_MEM_RECLAIM, illegal
+ * flush dependency, a concurrency managed work item which stays RUNNING
+ * indefinitely. Workqueue stalls can be very difficult to debug as the
+ * usual warning mechanisms don't trigger and internal workqueue state is
+ * largely opaque.
+ *
+ * Workqueue watchdog monitors all worker pools periodically and dumps
+ * state if some pools failed to make forward progress for a while where
+ * forward progress is defined as the first item on ->worklist changing.
+ *
+ * The mechanism is enabled and disabled together with softlockup watchdog
+ * and uses the same threshold duration; however, it currently doesn't
+ * cause panic even if softlockup_panic is set. Also, workqueue watchdog
+ * assumes that the usual jiffies and timer mechanisms are working as there
+ * isn't much point in warning about workqueue stalls when timer is broken.
+ */
+#ifdef CONFIG_LOCKUP_DETECTOR
+
+static void wq_watchdog_timer_fn(unsigned long data);
+
+static unsigned long wq_watchdog_thresh;
+static struct timer_list wq_watchdog_timer =
+ TIMER_DEFERRED_INITIALIZER(wq_watchdog_timer_fn, 0, 0);
+
+static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
+static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES;
+
+static void wq_watchdog_reset_touched(void)
+{
+ int cpu;
+
+ wq_watchdog_touched = jiffies;
+ for_each_possible_cpu(cpu)
+ per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
+}
+
+static void wq_watchdog_timer_fn(unsigned long data)
+{
+ unsigned long thresh = wq_watchdog_thresh;
+ bool lockup_detected = false;
+ struct worker_pool *pool;
+ int pi;
+
+ if (!thresh)
+ return;
+
+ rcu_read_lock();
+
+ for_each_pool(pool, pi) {
+ unsigned long pool_ts, touched, ts;
+
+ if (list_empty(&pool->worklist))
+ continue;
+
+ /* get the latest of pool and touched timestamps */
+ pool_ts = READ_ONCE(pool->watchdog_ts);
+ touched = READ_ONCE(wq_watchdog_touched);
+
+ if (time_after(pool_ts, touched))
+ ts = pool_ts;
+ else
+ ts = touched;
+
+ if (pool->cpu >= 0) {
+ unsigned long cpu_touched =
+ READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
+ pool->cpu));
+ if (time_after(cpu_touched, ts))
+ ts = cpu_touched;
+ }
+
+ /* did we stall? */
+ if (time_after(jiffies, ts + thresh)) {
+ lockup_detected = true;
+ pr_emerg("BUG: workqueue lockup - pool");
+ pr_cont_pool_info(pool);
+ pr_cont(" stuck for %us!\n",
+ jiffies_to_msecs(jiffies - pool_ts) / 1000);
+ }
+ }
+
+ rcu_read_unlock();
+
+ if (lockup_detected)
+ show_workqueue_state();
+
+ wq_watchdog_reset_touched();
+ mod_timer(&wq_watchdog_timer, jiffies + thresh);
+}
+
+void enable_workqueue_watchdog(void)
+{
+ wq_watchdog_thresh = get_softlockup_thresh() * HZ;
+
+ wq_watchdog_reset_touched();
+ mod_timer(&wq_watchdog_timer, jiffies + wq_watchdog_thresh);
+}
+
+void disable_workqueue_watchdog(void)
+{
+ wq_watchdog_thresh = 0;
+ del_timer_sync(&wq_watchdog_timer);
+}
+
+void touch_workqueue_watchdog(int cpu)
+{
+ /*
+ * If not explicitly touched, these stamps are never updated, which
+ * means that they may affect stall detection if jiffies wraps;
+ * however, it's highly unlikely and the worst that can happen is
+ * delaying stall detection by upto one threshold duration.
+ */
+ if (cpu >= 0)
+ per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
+ else
+ wq_watchdog_touched = jiffies;
+}
+
+#endif /* CONFIG_LOCKUP_DETECTOR */
+
static void __init wq_numa_init(void)
{
cpumask_var_t *tbl;
On Wed, Dec 02, 2015 at 07:28:10PM -0500, Tejun Heo wrote:
> Hello,
>
> There haven't been too many workqueue stall bugs; however, a good
> number of them have been pretty painful to track down because there's
> no lockup detection mechanism for workqueues and it isn't easy to tell
> what's going on with them; furthermore, some requirements are tricky
> to get right - e.g. it's not too difficult to miss WQ_MEM_RECLAIM for
> a workqueue which runs a work item that gets flushed by something
> sitting in the reclaim path.
have you considered something as simple as:
WARN_ON(current->reclaim_state && !WQ_MEM_RECLAIM);
?
On Thu, Dec 03, 2015 at 10:33:50AM +0100, Peter Zijlstra wrote:
> On Wed, Dec 02, 2015 at 07:28:10PM -0500, Tejun Heo wrote:
> > Hello,
> >
> > There haven't been too many workqueue stall bugs; however, a good
> > number of them have been pretty painful to track down because there's
> > no lockup detection mechanism for workqueues and it isn't easy to tell
> > what's going on with them; furthermore, some requirements are tricky
> > to get right - e.g. it's not too difficult to miss WQ_MEM_RECLAIM for
> > a workqueue which runs a work item that gets flushed by something
> > sitting in the reclaim path.
>
> have you considered something as simple as:
>
> WARN_ON(current->reclaim_state && !WQ_MEM_RECLAIM);
>
> ?
Alternatively, you can 'abuse' the lockdep reclaim bits by marking
!MEM_RECLAIM workqueue 'locks' with lockdep_trace_alloc(GFP_KERNEL),
that way lockdep will yell if they get used in a reclaim context.
This might be a tad tricky in that you need 2 sets of (lockdep) keys for
things.
Hey, Peter.
On Thu, Dec 03, 2015 at 11:00:18AM +0100, Peter Zijlstra wrote:
> > have you considered something as simple as:
> >
> > WARN_ON(current->reclaim_state && !WQ_MEM_RECLAIM);
> >
> > ?
>
> Alternatively, you can 'abuse' the lockdep reclaim bits by marking
> !MEM_RECLAIM workqueue 'locks' with lockdep_trace_alloc(GFP_KERNEL),
> that way lockdep will yell if they get used in a reclaim context.
>
> This might be a tad tricky in that you need 2 sets of (lockdep) keys for
> things.
One of the latest bugs was an xfs work item in the reclaim path which
had WQ_MEM_RECLAIM, waiting on a waitqueue for an event to be triggered
by another work item which was incorrectly missing the flag. Under
memory pressure, that leads to silent deadlocks. The other one was the
vmstat update work busy-looping waiting for a work item queued behind
it. None of the dependency tracking mechanisms could have detected
either and both were pretty tricky to track down.
We can add MEM_RECLAIM -> !MEM_RECLAIM warning mechanism in addition
but I think adding stall detection is justified.
Thanks.
--
tejun
On Wed, Dec 02, 2015 at 07:28:39PM -0500, Tejun Heo wrote:
...
> +void touch_workqueue_watchdog(int cpu)
> +{
> + /*
> + * If not explicitly touched, these stamps are never updated, which
> + * means that they may affect stall detection if jiffies wraps;
> + * however, it's highly unlikely and the worst that can happen is
> + * delaying stall detection by upto one threshold duration.
> + */
Oops, this comment is stale. Updated the code but forgot to drop it.
Will remove in the next iteration.
Thanks.
--
tejun
On Thu, Dec 03, 2015 at 09:48:11AM -0500, Tejun Heo wrote:
> We can add MEM_RECLAIM -> !MEM_RECLAIM warning mechanism in addition
> but I think adding stall detection is justified.
Sure, a stall mech is always nice, but I was thinking we should be able
to better catch some of these with explicit stuff.
On Thu, Dec 03, 2015 at 04:04:42PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 09:48:11AM -0500, Tejun Heo wrote:
> > We can add MEM_RECLAIM -> !MEM_RECLAIM warning mechanism in addition
> > but I think adding stall detection is justified.
>
> Sure, a stall mech is always nice, but I was thinking we should be able
> to better catch some of these with explicit stuff.
Yeah, sure, I don't think we even need to meddle with lockdep. I'm
gonna add WARN_ON_ONCE()s which trigger whenever something on the reclaim
path tries to flush !WQ_MEM_RECLAIM workqueues or work items.
Thanks.
--
tejun
On Wed, Dec 02, 2015 at 07:28:39PM -0500, Tejun Heo wrote:
> Workqueue stalls can happen from a variety of usage bugs such as a
> missing WQ_MEM_RECLAIM flag or a concurrency-managed work item staying
> RUNNING indefinitely. These stalls can be extremely difficult to hunt
> down because the usual warning mechanisms can't detect workqueue
> stalls and the internal state is pretty opaque.
>
> To alleviate the situation, this patch implements a workqueue lockup
> detector. It monitors all worker_pools periodically and, if any pool
> fails to make forward progress for longer than the threshold duration,
> triggers a warning and dumps workqueue state as follows.
This sort of looks like the hung task detector..
I am a little concerned because we just made a big effort to properly
separate the hardlockup and softlockup paths and yet retain the flexibility
to enable/disable them separately. Now it seems the workqueue detector is
permanently entwined with the softlockup detector. I am not entirely sure
that is the correct thing to do.
It also seems awkward for the lockup code to have to jump to the workqueue
code to function properly. :-/ Though we have made exceptions for the virt
stuff and the workqueue code is simple..
Actually, I am curious, it seems if you just added a
/proc/sys/kernel/wq_watchdog entry, you could eliminate the entire need for
modifying the watchdog code to begin with. As you really aren't using any
of it other than piggybacking on the touch_softlockup_watchdog stuff, which
could probably be easily added without all the extra enable/disable changes
in watchdog.c.
Again, this looks like what the hung task detector is doing, which I
struggled with years ago to integrate with the lockup code because in the
end I had trouble re-using much of it.
Thoughts?
Cheers,
Don
Task or work item involved in memory reclaim trying to flush a
non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
deadlock. Trigger WARN_ONCE() if such conditions are detected.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Peter Zijlstra <[email protected]>
---
Hello,
So, something like this. Seems to work fine here. If there's no
objection, I'm gonna push it through wq/for-4.5.
Thanks.
kernel/workqueue.c | 35 +++++++++++++++++++++++++++++++++++
1 file changed, 35 insertions(+)
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -2330,6 +2330,37 @@ repeat:
goto repeat;
}
+/**
+ * check_flush_dependency - check for flush dependency sanity
+ * @target_wq: workqueue being flushed
+ * @target_work: work item being flushed (NULL for workqueue flushes)
+ *
+ * %current is trying to flush the whole @target_wq or @target_work on it.
+ * If @target_wq doesn't have %WQ_MEM_RECLAIM, verify that %current is not
+ * reclaiming memory or running on a workqueue which doesn't have
+ * %WQ_MEM_RECLAIM as that can break forward-progress guarantee leading to
+ * a deadlock.
+ */
+static void check_flush_dependency(struct workqueue_struct *target_wq,
+ struct work_struct *target_work)
+{
+ work_func_t target_func = target_work ? target_work->func : NULL;
+ struct worker *worker;
+
+ if (target_wq->flags & WQ_MEM_RECLAIM)
+ return;
+
+ worker = current_wq_worker();
+
+ WARN_ONCE(current->flags & PF_MEMALLOC,
+ "workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%pf",
+ current->pid, current->comm, target_wq->name, target_func);
+ WARN_ONCE(worker && (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
+ "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
+ worker->current_pwq->wq->name, worker->current_func,
+ target_wq->name, target_func);
+}
+
struct wq_barrier {
struct work_struct work;
struct completion done;
@@ -2539,6 +2570,8 @@ void flush_workqueue(struct workqueue_st
list_add_tail(&this_flusher.list, &wq->flusher_overflow);
}
+ check_flush_dependency(wq, NULL);
+
mutex_unlock(&wq->mutex);
wait_for_completion(&this_flusher.done);
@@ -2711,6 +2744,8 @@ static bool start_flush_work(struct work
pwq = worker->current_pwq;
}
+ check_flush_dependency(pwq->wq, work);
+
insert_wq_barrier(pwq, barr, work, worker);
spin_unlock_irq(&pool->lock);
Hello, Don.
On Thu, Dec 03, 2015 at 12:50:24PM -0500, Don Zickus wrote:
> This sort of looks like the hung task detector..
>
> I am a little concerned because we just made a big effort to properly
> separate the hardlockup and softlockup paths and yet retain the flexibility
> to enable/disable them separately. Now it seems the workqueue detector is
> permanently entwined with the softlockup detector. I am not entirely sure
> that is the correct thing to do.
The only area they get entwined is how it's controlled from userland.
While it isn't quite the same as softlockup detection, I think what it
monitors is close enough that it makes sense to put them under the
same interface.
> It also seems awkward for the lockup code to have to jump to the workqueue
> code to function properly. :-/ Though we have made exceptions for the virt
> stuff and the workqueue code is simple..
Softlockup code doesn't depend on workqueue in any way. Workqueue
piggybacks on touch_softlockup to detect cases which shouldn't be
warned about, and whether it's enabled is controlled together with
softlockup; that's it.
> Actually, I am curious, it seems if you just added a
> /proc/sys/kernel/wq_watchdog entry, you could eliminate the entire need for
> modifying the watchdog code to begin with. As you really aren't using any
> of it other than piggybacking on the touch_softlockup_watchdog stuff, which
> could probably be easily added without all the extra enable/disable changes
> in watchdog.c.
Yeah, except for the touch signal, it's purely an interface thing. I
don't feel too strongly about this, but it seems a bit silly to
introduce a whole different set of interfaces for this. e.g. if the
user wanted to disable softlockup detection, it'd be weird to leave wq
lockup detection running. The same goes for the threshold.
> Again, this looks like what the hung task detector is doing, which I
> struggled with years ago to integrate with the lockup code because in the
> end I had trouble re-using much of it.
So, it's a stall detector and there are inherent similarities but the
conditions tested are pretty different and it's a lot lighter. I'm
not really sure what you're meaning to say.
Thanks.
--
tejun
Tejun,
I share Don's concern about connecting the soft lockup detector and the
workqueue watchdog to the same kernel parameter in /proc. I would feel
more comfortable if the workqueue watchdog had its dedicated parameter.
I also see a scenario that the proposed patch does not handle well: The
watchdog_thresh parameter can be changed 'on the fly' - i.e. it is not
necessary to disable and re-enable the watchdog. The flow of execution
looks like this.
proc_watchdog_thresh
proc_watchdog_update
if (watchdog_enabled && watchdog_thresh)
watchdog_enable_all_cpus
if (!watchdog_running) {
...
} else {
//
// update 'on the fly'
//
update_watchdog_all_cpus()
}
The patched watchdog_enable_all_cpus() function disables the workqueue watchdog
unconditionally at [1]. However, the workqueue watchdog remains disabled if the
code path [2] is executed (and wq_watchdog_thresh is not updated as well).
static int watchdog_enable_all_cpus(void)
{
int err = 0;
[1] --> disable_workqueue_watchdog();
if (!watchdog_running) {
...
} else {
.- /*
| * Enable/disable the lockup detectors or
| * change the sample period 'on the fly'.
| */
[2] < err = update_watchdog_all_cpus();
|
| if (err) {
| watchdog_disable_all_cpus();
| pr_err("Failed to update lockup detectors, disabled\n");
'- }
}
if (err)
watchdog_enabled = 0;
return err;
}
And another question that comes to my mind is: Would the workqueue watchdog
participate in the lockup detector suspend/resume mechanism, and if yes, how
would it be integrated into this ?
Regards,
Uli
On Thu, Dec 03, 2015 at 02:26:16PM -0500, Tejun Heo wrote:
> + WARN_ONCE(current->flags & PF_MEMALLOC,
I'm not sure about using PF_MEMALLOC for detecting reclaim. There appear
to be more sites setting this than reclaim. See:
drivers/block/nbd.c: current->flags |= PF_MEMALLOC;
drivers/mmc/card/queue.c: current->flags |= PF_MEMALLOC;
drivers/mtd/nand/nandsim.c: current->flags |= PF_MEMALLOC;
drivers/scsi/iscsi_tcp.c: current->flags |= PF_MEMALLOC;
drivers/staging/lustre/include/linux/libcfs/linux/linux-mem.h:#define memory_pressure_set() do { current->flags |= PF_MEMALLOC; } while (0)
fs/cifs/connect.c: current->flags |= PF_MEMALLOC;
fs/xfs/libxfs/xfs_btree.c: new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
fs/xfs/xfs_trans_ail.c: current->flags |= PF_MEMALLOC;
include/linux/sched.h: current->flags |= PF_MEMALLOC_NOIO;
mm/page_alloc.c: current->flags |= PF_MEMALLOC;
mm/page_alloc.c: current->flags |= PF_MEMALLOC;
mm/vmscan.c: tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
mm/vmscan.c: p->flags |= PF_MEMALLOC;
mm/vmscan.c: p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
net/core/dev.c: current->flags |= PF_MEMALLOC;
net/core/sock.c: current->flags |= PF_MEMALLOC;
The actual reclaim sites in page_alloc and vmscan set
current->reclaim_state. So testing against that might be more accurate.
Hello, Ulrich.
On Thu, Dec 03, 2015 at 03:12:20PM -0500, Ulrich Obergfell wrote:
> I share Don's concern about connecting the soft lockup detector and the
> workqueue watchdog to the same kernel parameter in /proc. I would feel
> more comfortable if the workqueue watchdog had its dedicated parameter.
Sure, separating the knobs out isn't difficult. I still don't like
the idea of having multiple set of similar knobs controlling about the
same thing tho.
For example, let's say there's a user who boots with "nosoftlockup"
explicitly. I'm pretty sure the user wouldn't be intending to keep
workqueue watchdog running. The same goes for threshold adjustments,
so here's my question. What are the reasons for the concern? What
are we worrying about?
> The patched watchdog_enable_all_cpus() function disables the workqueue watchdog
> unconditionally at [1]. However, the workqueue watchdog remains disabled if the
> code path [2] is executed (and wq_watchdog_thresh is not updated as well).
Oops, you're right.
> And another question that comes to my mind is: Would the workqueue watchdog
> participate in the lockup detector suspend/resume mechanism, and if yes, how
> would it be integrated into this ?
From the usage, I can't quite tell what the purpose of the mechanism
is. The only user seems to be fixup_ht_bug() and when it fails it
says "failed to disable PMU erratum BJ122, BV98, HSD29 workaround" so
if you could give me a pointer, it'd be great. But at any rate, if
shutting down watchdog is all that's necessary, it shouldn't be a
problem to integrate.
Thanks.
--
tejun
Hey, Peter.
On Thu, Dec 03, 2015 at 09:43:13PM +0100, Peter Zijlstra wrote:
> I'm not sure about using PF_MEMALLOC for detecting reclaim. There appear
> to be more sites setting this than reclaim. See:
>
> drivers/block/nbd.c: current->flags |= PF_MEMALLOC;
> drivers/mmc/card/queue.c: current->flags |= PF_MEMALLOC;
> drivers/mtd/nand/nandsim.c: current->flags |= PF_MEMALLOC;
> drivers/scsi/iscsi_tcp.c: current->flags |= PF_MEMALLOC;
> drivers/staging/lustre/include/linux/libcfs/linux/linux-mem.h:#define memory_pressure_set() do { current->flags |= PF_MEMALLOC; } while (0)
> fs/cifs/connect.c: current->flags |= PF_MEMALLOC;
> fs/xfs/libxfs/xfs_btree.c: new_pflags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> fs/xfs/xfs_trans_ail.c: current->flags |= PF_MEMALLOC;
> include/linux/sched.h: current->flags |= PF_MEMALLOC_NOIO;
> mm/page_alloc.c: current->flags |= PF_MEMALLOC;
> mm/page_alloc.c: current->flags |= PF_MEMALLOC;
> mm/vmscan.c: tsk->flags |= PF_MEMALLOC | PF_SWAPWRITE | PF_KSWAPD;
> mm/vmscan.c: p->flags |= PF_MEMALLOC;
> mm/vmscan.c: p->flags |= PF_MEMALLOC | PF_SWAPWRITE;
> net/core/dev.c: current->flags |= PF_MEMALLOC;
> net/core/sock.c: current->flags |= PF_MEMALLOC;
>
>
> The actual reclaim sites in page_alloc and vmscan set
> current->reclaim_state. So testing against that might be more accurate.
So, if I'm not mistaken, those are all marking tasks which can be
depended upon during memory reclaim and we do want to catch them all.
PF_MEMALLOC shouldn't depend on something which requires memory to be
reclaimed to guarantee forward progress.
Thanks.
--
tejun
On Thu, Dec 03, 2015 at 03:56:32PM -0500, Tejun Heo wrote:
> So, if I'm not mistaken, those are all marking tasks which can be
> depended upon during memory reclaim and we do want to catch them all.
Up to a point yes, these are things that want to be reliable during
reclaim, but lacking memory reserves and usage bounds (which we
discussed last at lsf/mm) these are just wanna-be.
> PF_MEMALLOC shouldn't depend on something which requires memory to be
> reclaimed to guarantee forward progress.
PF_MEMALLOC basically avoids reclaim for any memory allocation while it's
set.
The thing is, even if your workqueue has WQ_MEM_RECLAIM set, it will not
hit the mayday button until you're completely full flat out of memory.
At which point you're probably boned anyway, because, as per the above,
all that code assumes there's _some_ memory to be had.
One solution is to always fail maybe_create_worker() when PF_MEMALLOC is
set, thus always hitting the mayday button.
Hello, Peter.
On Thu, Dec 03, 2015 at 10:09:11PM +0100, Peter Zijlstra wrote:
> On Thu, Dec 03, 2015 at 03:56:32PM -0500, Tejun Heo wrote:
> > So, if I'm not mistaken, those are all marking tasks which can be
> > depended upon during memory reclaim and we do want to catch them all.
>
> Up to a point yes, these are things that want to be reliable during
> reclaim, but lacking memory reserves and usage bounds (which we
> discussed last at lsf/mm) these are just wanna-be.
Hmmm... even if buggy in that they can't guarantee forward-progress
even with access to the emergency pool, I think it makes sense to warn
them about creating an extra dependency which doesn't have access to
the emergency pool.
> > PF_MEMALLOC shouldn't depend on something which requires memory to be
> > reclaimed to guarantee forward progress.
>
> PF_MEMALLOC basically avoids reclaim for any memory allocation while it's
> set.
So, the assumption is that they're already on the reclaim path and
thus shouldn't recurse into it again.
> The thing is, even if your workqueue has WQ_MEM_RECLAIM set, it will not
> hit the mayday button until you're completely full flat out of memory.
It's more trigger-happy than that. It's timer based. If a new worker
can't be created for a certain amount of time for whatever reason,
it'll summon the rescuer.
> At which point you're probably boned anyway, because, as per the above,
> all that code assumes there's _some_ memory to be had.
Not really. PF_MEMALLOC tasks have access to the emergency pool,
creating new workers doesn't, so this really is creating a dependency
which is qualitatively different.
> One solution is to always fail maybe_create_worker() when PF_MEMALLOC is
> set, thus always hitting the mayday button.
I'm not following. When PF_MEMALLOC is set where?
Thanks.
--
tejun
* Tejun Heo <[email protected]> wrote:
> Hello, Ulrich.
>
> On Thu, Dec 03, 2015 at 03:12:20PM -0500, Ulrich Obergfell wrote:
> > I share Don's concern about connecting the soft lockup detector and the
> > workqueue watchdog to the same kernel parameter in /proc. I would feel
> > more comfortable if the workqueue watchdog had its dedicated parameter.
>
> Sure, separating the knobs out isn't difficult. I still don't like
> the idea of having multiple set of similar knobs controlling about the
> same thing tho.
>
> For example, let's say there's a user who boots with "nosoftlockup"
> explicitly. I'm pretty sure the user wouldn't be intending to keep
> workqueue watchdog running. The same goes for threshold adjustments,
> so here's my question. What are the reasons for the concern? What
> are we worrying about?
As Don mentioned it already, we went through similar arguments (and pain) with the
hard/soft lockup detectors and its various control knobs, it would be better to
have new control knobs separated.
As for the ease of use argument, we can add a new, obviously named control knob
that controls _all_ lockup detectors:
boot param: nolockupdetectors
matching Kconfig knob: CONFIG_BOOTPARAM_NO_LOCKUP_DETECTORS=0
but please don't artificially couple the control knobs of these various lockup
detectors, as these internal assumptions are less than obvious to users. With
(effectively) 4 lockup detectors such coupling of interfaces is even more
confusing and damaging.
But ease of use is not a big concern in any case, since new debug features
are disabled by default; only those who enable it intentionally will see it
(and may want to disable it via a boot parameter, hopefully only temporarily).
Thanks,
Ingo
On Thu, Dec 03, 2015 at 05:04:06PM -0500, Tejun Heo wrote:
> > One solution is to always fail maybe_create_worker() when PF_MEMALLOC is
> > set, thus always hitting the mayday button.
>
> I'm not following. When PF_MEMALLOC is set where?
It seems I made a false assumption. I was thinking the worker creation
was done from queue/flush context, but it's done by other workers, at a
time when PF_MEMALLOC cannot be set.
In any case, no objections to the proposed patch.
Tejun,
> Sure, separating the knobs out isn't difficult. I still don't like
> the idea of having multiple set of similar knobs controlling about the
> same thing tho.
>
> For example, let's say there's a user who boots with "nosoftlockup"
> explicitly. I'm pretty sure the user wouldn't be intending to keep
> workqueue watchdog running. The same goes for threshold adjustments,
> so here's my question. What are the reasons for the concern? What
> are we worrying about?
I'm not sure if it is obvious to a user that a stall of workqueues is
"about the same thing" as a soft lockup, and that one could thus argue
that both should be controlled by the same knob. Looking at this from
perspective of usability, I would still vote for having separate knobs
for each lockup detector. For example
/proc/sys/kernel/wq_watchdog_thresh
could control the on|off state of the workqueue watchdog and the timeout
at the same time (0 means off, > 0 means on and specifies the timeout).
Separating wq_watchdog_thresh from watchdog_thresh might also be useful
for diagnostic purposes, for example if, during the investigation of a
problem, one wants to explicitly increase or lower one threshold
without impacting the other.
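For example, such 0 == off / > 0 == timeout semantics could be exposed as
a workqueue module parameter roughly along the lines below. This is only
a sketch, not the posted code: the callback and ops names are made up, the
internal value is kept in jiffies, and it reuses wq_watchdog_timer and
wq_watchdog_reset_touched() from the posted patch.

/* sketch - hypothetical names, not the posted code */
static unsigned long wq_watchdog_thresh;        /* jiffies, 0 == disabled */

static int wq_watchdog_set_thresh(const char *val,
                                  const struct kernel_param *kp)
{
        unsigned long thresh_secs;
        int ret;

        ret = kstrtoul(val, 0, &thresh_secs);
        if (ret)
                return ret;

        /* stop the current timer, then re-arm it if a new threshold is set */
        wq_watchdog_thresh = 0;
        del_timer_sync(&wq_watchdog_timer);

        if (thresh_secs) {
                wq_watchdog_thresh = thresh_secs * HZ;
                wq_watchdog_reset_touched();
                mod_timer(&wq_watchdog_timer, jiffies + wq_watchdog_thresh);
        }
        return 0;
}

static const struct kernel_param_ops wq_watchdog_thresh_ops = {
        .set    = wq_watchdog_set_thresh,
        .get    = param_get_ulong,      /* reports jiffies in this sketch */
};

module_param_cb(watchdog_thresh, &wq_watchdog_thresh_ops,
                &wq_watchdog_thresh, 0644);

That way /sys/module/workqueue/parameters/watchdog_thresh (and the
workqueue.watchdog_thresh boot parameter) would control the workqueue
watchdog independently of /proc/sys/kernel/watchdog_thresh.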
>> And another question that comes to my mind is: Would the workqueue watchdog
>> participate in the lockup detector suspend/resume mechanism, and if yes, how
>> would it be integrated into this ?
>
> From the usage, I can't quite tell what the purpose of the mechanism
> is. The only user seems to be fixup_ht_bug() and when it fails it
> says "failed to disable PMU erratum BJ122, BV98, HSD29 workaround" so
> if you could give me a pointer, it'd be great. But at any rate, if
> shutting down watchdog is all that's necessary, it shouldn't be a
> problem to integrate.
The patch post that introduced the mechanism is here:
http://marc.info/?l=linux-kernel&m=143843318208917&w=2
The watchdog_{suspend|resume} functions were later renamed:
http://marc.info/?l=linux-kernel&m=143894132129982&w=2
At the moment I don't see a reason why the workqueue watchdog would have to
participate in that mechanism. However, if the workqueue watchdog were
connected to the soft lockup detector as you proposed, I think it should
participate for the sake of consistency (it would seem hard to understand
if the interface only suspended parts of the lockup detector).
Regards,
Uli
On Fri, Dec 04, 2015 at 09:02:26AM +0100, Ingo Molnar wrote:
>
> * Tejun Heo <[email protected]> wrote:
>
> > Hello, Ulrich.
> >
> > On Thu, Dec 03, 2015 at 03:12:20PM -0500, Ulrich Obergfell wrote:
> > > I share Don's concern about connecting the soft lockup detector and the
> > > workqueue watchdog to the same kernel parameter in /proc. I would feel
> > > more comfortable if the workqueue watchdog had its dedicated parameter.
> >
> > Sure, separating the knobs out isn't difficult. I still don't like
> > the idea of having multiple set of similar knobs controlling about the
> > same thing tho.
> >
> > For example, let's say there's a user who boots with "nosoftlockup"
> > explicitly. I'm pretty sure the user wouldn't be intending to keep
> > workqueue watchdog running. The same goes for threshold adjustments,
> > so here's my question. What are the reasons for the concern? What
> > are we worrying about?
>
> As Don mentioned it already, we went through similar arguments (and pain) with the
> hard/soft lockup detectors and its various control knobs, it would be better to
> have new control knobs separated.
>
> As for the ease of use argument, we can add a new, obviously named control knob
> that controls _all_ lockup detectors:
>
> boot param: nolockupdetectors
> matching Kconfig knob: CONFIG_BOOTPARAM_NO_LOCKUP_DETECTORS=0
>
> but please don't artificially couple the control knobs of these various lockup
> detectors, as these internal assumptions are less than obvious to users. With
> (effectively) 4 lockup detectors such coupling of interfaces is even more
> confusing and damaging.
It might be worth tying them together with a generic knob and expanding the
bit mask for the 'watchdog' variable. I can't figure out an easy way to do
that right now.
I don't think we want to go down the route of 'registering' detectors yet.
:-)
Cheers,
Don
On Thu, Dec 03, 2015 at 02:26:16PM -0500, Tejun Heo wrote:
> Task or work item involved in memory reclaim trying to flush a
> non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> deadlock. Trigger WARN_ONCE() if such conditions are detected.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
Applied to wq/for-4.5.
--
tejun
Hello,
Decoupled the control knobs from softlockup. It's now a workqueue
module param which can be updated at runtime. If there's no
objection, I'll push the two patches through wq/for-4.5.
Thanks.
------ 8< ------
Workqueue stalls can happen from a variety of usage bugs such as a
missing WQ_MEM_RECLAIM flag or a concurrency-managed work item staying
RUNNING indefinitely. These stalls can be extremely difficult to hunt
down because the usual warning mechanisms can't detect workqueue
stalls and the internal state is pretty opaque.
To alleviate the situation, this patch implements a workqueue lockup
detector. It monitors all worker_pools periodically and, if any pool
fails to make forward progress for longer than the threshold duration,
triggers a warning and dumps workqueue state as follows.
BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 31s!
Showing busy workqueues and worker pools:
workqueue events: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=17/256
pending: monkey_wrench_fn, e1000_watchdog, cache_reap, vmstat_shepherd, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, release_one_tty, cgroup_release_agent
workqueue events_power_efficient: flags=0x80
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=2/256
pending: check_lifetime, neigh_periodic_work
workqueue cgroup_pidlist_destroy: flags=0x0
pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/1
pending: cgroup_pidlist_destroy_work_fn
...
The detection mechanism is controlled through the kernel parameter
workqueue.watchdog_thresh and can be updated at runtime through the
sysfs module parameter file.
v2: Decoupled from softlockup control knobs.
Signed-off-by: Tejun Heo <[email protected]>
Cc: Ulrich Obergfell <[email protected]>
Cc: Michal Hocko <[email protected]>
Cc: Chris Mason <[email protected]>
Cc: Andrew Morton <[email protected]>
---
Documentation/kernel-parameters.txt | 9 +
include/linux/workqueue.h | 6 +
kernel/watchdog.c | 3
kernel/workqueue.c | 174 +++++++++++++++++++++++++++++++++++-
lib/Kconfig.debug | 11 ++
5 files changed, 200 insertions(+), 3 deletions(-)
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -4114,6 +4114,15 @@ bytes respectively. Such letter suffixes
or other driver-specific files in the
Documentation/watchdog/ directory.
+ workqueue.watchdog_thresh=
+ If CONFIG_WQ_WATCHDOG is configured, workqueue can
+ warn stall conditions and dump internal state to
+ help debugging. 0 disables workqueue stall
+ detection; otherwise, it's the stall threshold
+ duration in seconds. The default value is 30 and
+ it can be updated at runtime by writing to the
+ corresponding sysfs file.
+
workqueue.disable_numa
By default, all work items queued to unbound
workqueues are affine to the NUMA nodes they're
--- a/include/linux/workqueue.h
+++ b/include/linux/workqueue.h
@@ -618,4 +618,10 @@ static inline int workqueue_sysfs_regist
{ return 0; }
#endif /* CONFIG_SYSFS */
+#ifdef CONFIG_WQ_WATCHDOG
+void wq_watchdog_touch(int cpu);
+#else /* CONFIG_WQ_WATCHDOG */
+static inline void wq_watchdog_touch(int cpu) { }
+#endif /* CONFIG_WQ_WATCHDOG */
+
#endif
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -20,6 +20,7 @@
#include <linux/smpboot.h>
#include <linux/sched/rt.h>
#include <linux/tick.h>
+#include <linux/workqueue.h>
#include <asm/irq_regs.h>
#include <linux/kvm_para.h>
@@ -245,6 +246,7 @@ void touch_softlockup_watchdog_sched(voi
void touch_softlockup_watchdog(void)
{
touch_softlockup_watchdog_sched();
+ wq_watchdog_touch(raw_smp_processor_id());
}
EXPORT_SYMBOL(touch_softlockup_watchdog);
@@ -259,6 +261,7 @@ void touch_all_softlockup_watchdogs(void
*/
for_each_watchdog_cpu(cpu)
per_cpu(watchdog_touch_ts, cpu) = 0;
+ wq_watchdog_touch(-1);
}
#ifdef CONFIG_HARDLOCKUP_DETECTOR
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -148,6 +148,8 @@ struct worker_pool {
int id; /* I: pool ID */
unsigned int flags; /* X: flags */
+ unsigned long watchdog_ts; /* L: watchdog timestamp */
+
struct list_head worklist; /* L: list of pending works */
int nr_workers; /* L: total number of workers */
@@ -1083,6 +1085,8 @@ static void pwq_activate_delayed_work(st
struct pool_workqueue *pwq = get_work_pwq(work);
trace_workqueue_activate_work(work);
+ if (list_empty(&pwq->pool->worklist))
+ pwq->pool->watchdog_ts = jiffies;
move_linked_works(work, &pwq->pool->worklist, NULL);
__clear_bit(WORK_STRUCT_DELAYED_BIT, work_data_bits(work));
pwq->nr_active++;
@@ -1385,6 +1389,8 @@ retry:
trace_workqueue_activate_work(work);
pwq->nr_active++;
worklist = &pwq->pool->worklist;
+ if (list_empty(worklist))
+ pwq->pool->watchdog_ts = jiffies;
} else {
work_flags |= WORK_STRUCT_DELAYED;
worklist = &pwq->delayed_works;
@@ -2157,6 +2163,8 @@ recheck:
list_first_entry(&pool->worklist,
struct work_struct, entry);
+ pool->watchdog_ts = jiffies;
+
if (likely(!(*work_data_bits(work) & WORK_STRUCT_LINKED))) {
/* optimization path, not strictly necessary */
process_one_work(worker, work);
@@ -2240,6 +2248,7 @@ repeat:
struct pool_workqueue, mayday_node);
struct worker_pool *pool = pwq->pool;
struct work_struct *work, *n;
+ bool first = true;
__set_current_state(TASK_RUNNING);
list_del_init(&pwq->mayday_node);
@@ -2256,9 +2265,14 @@ repeat:
* process'em.
*/
WARN_ON_ONCE(!list_empty(scheduled));
- list_for_each_entry_safe(work, n, &pool->worklist, entry)
- if (get_work_pwq(work) == pwq)
+ list_for_each_entry_safe(work, n, &pool->worklist, entry) {
+ if (get_work_pwq(work) == pwq) {
+ if (first)
+ pool->watchdog_ts = jiffies;
move_linked_works(work, scheduled, &n);
+ }
+ first = false;
+ }
if (!list_empty(scheduled)) {
process_scheduled_works(rescuer);
@@ -3104,6 +3118,7 @@ static int init_worker_pool(struct worke
pool->cpu = -1;
pool->node = NUMA_NO_NODE;
pool->flags |= POOL_DISASSOCIATED;
+ pool->watchdog_ts = jiffies;
INIT_LIST_HEAD(&pool->worklist);
INIT_LIST_HEAD(&pool->idle_list);
hash_init(pool->busy_hash);
@@ -4343,7 +4358,9 @@ void show_workqueue_state(void)
pr_info("pool %d:", pool->id);
pr_cont_pool_info(pool);
- pr_cont(" workers=%d", pool->nr_workers);
+ pr_cont(" hung=%us workers=%d",
+ jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000,
+ pool->nr_workers);
if (pool->manager)
pr_cont(" manager: %d",
task_pid_nr(pool->manager->task));
@@ -5202,6 +5219,154 @@ static void workqueue_sysfs_unregister(s
static void workqueue_sysfs_unregister(struct workqueue_struct *wq) { }
#endif /* CONFIG_SYSFS */
+/*
+ * Workqueue watchdog.
+ *
+ * Stall may be caused by various bugs - missing WQ_MEM_RECLAIM, illegal
+ * flush dependency, a concurrency managed work item which stays RUNNING
+ * indefinitely. Workqueue stalls can be very difficult to debug as the
+ * usual warning mechanisms don't trigger and internal workqueue state is
+ * largely opaque.
+ *
+ * Workqueue watchdog monitors all worker pools periodically and dumps
+ * state if some pools failed to make forward progress for a while where
+ * forward progress is defined as the first item on ->worklist changing.
+ *
+ * This mechanism is controlled through the kernel parameter
+ * "workqueue.watchdog_thresh" which can be updated at runtime through the
+ * corresponding sysfs parameter file.
+ */
+#ifdef CONFIG_WQ_WATCHDOG
+
+static void wq_watchdog_timer_fn(unsigned long data);
+
+static unsigned long wq_watchdog_thresh = 30;
+static struct timer_list wq_watchdog_timer =
+ TIMER_DEFERRED_INITIALIZER(wq_watchdog_timer_fn, 0, 0);
+
+static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
+static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES;
+
+static void wq_watchdog_reset_touched(void)
+{
+ int cpu;
+
+ wq_watchdog_touched = jiffies;
+ for_each_possible_cpu(cpu)
+ per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
+}
+
+static void wq_watchdog_timer_fn(unsigned long data)
+{
+ unsigned long thresh = READ_ONCE(wq_watchdog_thresh) * HZ;
+ bool lockup_detected = false;
+ struct worker_pool *pool;
+ int pi;
+
+ if (!thresh)
+ return;
+
+ rcu_read_lock();
+
+ for_each_pool(pool, pi) {
+ unsigned long pool_ts, touched, ts;
+
+ if (list_empty(&pool->worklist))
+ continue;
+
+ /* get the latest of pool and touched timestamps */
+ pool_ts = READ_ONCE(pool->watchdog_ts);
+ touched = READ_ONCE(wq_watchdog_touched);
+
+ if (time_after(pool_ts, touched))
+ ts = pool_ts;
+ else
+ ts = touched;
+
+ if (pool->cpu >= 0) {
+ unsigned long cpu_touched =
+ READ_ONCE(per_cpu(wq_watchdog_touched_cpu,
+ pool->cpu));
+ if (time_after(cpu_touched, ts))
+ ts = cpu_touched;
+ }
+
+ /* did we stall? */
+ if (time_after(jiffies, ts + thresh)) {
+ lockup_detected = true;
+ pr_emerg("BUG: workqueue lockup - pool");
+ pr_cont_pool_info(pool);
+ pr_cont(" stuck for %us!\n",
+ jiffies_to_msecs(jiffies - pool_ts) / 1000);
+ }
+ }
+
+ rcu_read_unlock();
+
+ if (lockup_detected)
+ show_workqueue_state();
+
+ wq_watchdog_reset_touched();
+ mod_timer(&wq_watchdog_timer, jiffies + thresh);
+}
+
+void wq_watchdog_touch(int cpu)
+{
+ if (cpu >= 0)
+ per_cpu(wq_watchdog_touched_cpu, cpu) = jiffies;
+ else
+ wq_watchdog_touched = jiffies;
+}
+
+static void wq_watchdog_set_thresh(unsigned long thresh)
+{
+ wq_watchdog_thresh = 0;
+ del_timer_sync(&wq_watchdog_timer);
+
+ if (thresh) {
+ wq_watchdog_thresh = thresh;
+ wq_watchdog_reset_touched();
+ mod_timer(&wq_watchdog_timer, jiffies + thresh * HZ);
+ }
+}
+
+static int wq_watchdog_param_set_thresh(const char *val,
+ const struct kernel_param *kp)
+{
+ unsigned long thresh;
+ int ret;
+
+ ret = kstrtoul(val, 0, &thresh);
+ if (ret)
+ return ret;
+
+ if (system_wq)
+ wq_watchdog_set_thresh(thresh);
+ else
+ wq_watchdog_thresh = thresh;
+
+ return 0;
+}
+
+static const struct kernel_param_ops wq_watchdog_thresh_ops = {
+ .set = wq_watchdog_param_set_thresh,
+ .get = param_get_ulong,
+};
+
+module_param_cb(watchdog_thresh, &wq_watchdog_thresh_ops, &wq_watchdog_thresh,
+ 0644);
+
+static void wq_watchdog_init(void)
+{
+ wq_watchdog_set_thresh(wq_watchdog_thresh);
+}
+
+#else /* CONFIG_WQ_WATCHDOG */
+
+static inline void wq_watchdog_init(void) { }
+
+#endif /* CONFIG_WQ_WATCHDOG */
+
static void __init wq_numa_init(void)
{
cpumask_var_t *tbl;
@@ -5325,6 +5490,9 @@ static int __init init_workqueues(void)
!system_unbound_wq || !system_freezable_wq ||
!system_power_efficient_wq ||
!system_freezable_power_efficient_wq);
+
+ wq_watchdog_init();
+
return 0;
}
early_initcall(init_workqueues);
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -812,6 +812,17 @@ config BOOTPARAM_HUNG_TASK_PANIC_VALUE
default 0 if !BOOTPARAM_HUNG_TASK_PANIC
default 1 if BOOTPARAM_HUNG_TASK_PANIC
+config WQ_WATCHDOG
+ bool "Detect Workqueue Stalls"
+ depends on DEBUG_KERNEL
+ help
+ Say Y here to enable stall detection on workqueues. If a
+ worker pool doesn't make forward progress on a pending work
+ item for over a given amount of time, 30s by default, a
+ warning message is printed along with dump of workqueue
+ state. This can be configured through kernel parameter
+ "workqueue.watchdog_thresh" and its sysfs counterpart.
+
endmenu # "Debug lockups and hangs"
config PANIC_ON_OOPS
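As a usage note for the runtime knob added above: with CONFIG_WQ_WATCHDOG=y
the parameter should show up as
/sys/module/workqueue/parameters/watchdog_thresh, so something along the
lines of the small userspace sketch below can raise the threshold or
disable detection (write 0) on a running system. This is only an
illustration, not part of the patch.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
	const char *path = "/sys/module/workqueue/parameters/watchdog_thresh";
	unsigned long thresh = argc > 1 ? strtoul(argv[1], NULL, 0) : 60;
	FILE *f = fopen(path, "w");

	if (!f) {
		perror(path);
		return 1;
	}
	/* new stall threshold in seconds, 0 disables detection */
	fprintf(f, "%lu\n", thresh);
	fclose(f);
	return 0;
}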
On Mon, Dec 07, 2015 at 02:06:17PM -0500, Tejun Heo wrote:
> Hello,
>
> Decoupled the control knobs from softlockup. It's now workqueue
> module param which can be updated at runtime. If there's no
> objection, I'll push the two patches through wq/for-4.5.
Does this still compile correctly with CONFIG_WQ_WATCHDOG disabled?
Cheers,
Don
On Mon, Dec 07, 2015 at 04:38:16PM -0500, Don Zickus wrote:
> On Mon, Dec 07, 2015 at 02:06:17PM -0500, Tejun Heo wrote:
> > Hello,
> >
> > Decoupled the control knobs from softlockup. It's now workqueue
> > module param which can be updated at runtime. If there's no
> > objection, I'll push the two patches through wq/for-4.5.
>
> Does this still compile correctly with CONFIG_WQ_WATCHDOG disabled?
Yeah and if some config combo breaks I'm gonna catch it while it goes
through wq tree.
Thanks.
--
tejun
On Mon, Dec 07, 2015 at 04:39:52PM -0500, Tejun Heo wrote:
> On Mon, Dec 07, 2015 at 04:38:16PM -0500, Don Zickus wrote:
> > On Mon, Dec 07, 2015 at 02:06:17PM -0500, Tejun Heo wrote:
> > > Hello,
> > >
> > > Decoupled the control knobs from softlockup. It's now workqueue
> > > module param which can be updated at runtime. If there's no
> > > objection, I'll push the two patches through wq/for-4.5.
> >
> > Does this still compile correctly with CONFIG_WQ_WATCHDOG disabled?
>
> Yeah and if some config combo breaks I'm gonna catch it while it goes
> through wq tree.
I think I am fine with this approach.
Acked-by: Don Zickus <[email protected]>
Cheers,
Don
On Tue, Dec 08, 2015 at 11:00:54AM -0500, Don Zickus wrote:
> On Mon, Dec 07, 2015 at 04:39:52PM -0500, Tejun Heo wrote:
> > On Mon, Dec 07, 2015 at 04:38:16PM -0500, Don Zickus wrote:
> > > On Mon, Dec 07, 2015 at 02:06:17PM -0500, Tejun Heo wrote:
> > > > Hello,
> > > >
> > > > Decoupled the control knobs from softlockup. It's now workqueue
> > > > module param which can be updated at runtime. If there's no
> > > > objection, I'll push the two patches through wq/for-4.5.
> > >
> > > Does this still compile correctly with CONFIG_WQ_WATCHDOG disabled?
> >
> > Yeah and if some config combo breaks I'm gonna catch it while it goes
> > through wq tree.
>
> I think I am fine with this approach.
>
> Acked-by: Don Zickus <[email protected]>
Applied the two patches to wq/for-4.5.
Thanks.
--
tejun
On 03/12/15 21:26, Tejun Heo wrote:
> Task or work item involved in memory reclaim trying to flush a
> non-WQ_MEM_RECLAIM workqueue or one of its work items can lead to
> deadlock. Trigger WARN_ONCE() if such conditions are detected.
>
> Signed-off-by: Tejun Heo <[email protected]>
> Cc: Peter Zijlstra <[email protected]>
> ---
> Hello,
>
> So, something like this. Seems to work fine here. If there's no
> objection, I'm gonna push it through wq/for-4.5.
>
> Thanks.
>
> kernel/workqueue.c | 35 +++++++++++++++++++++++++++++++++++
> 1 file changed, 35 insertions(+)
>
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -2330,6 +2330,37 @@ repeat:
> goto repeat;
> }
>
> +/**
> + * check_flush_dependency - check for flush dependency sanity
> + * @target_wq: workqueue being flushed
> + * @target_work: work item being flushed (NULL for workqueue flushes)
> + *
> + * %current is trying to flush the whole @target_wq or @target_work on it.
> + * If @target_wq doesn't have %WQ_MEM_RECLAIM, verify that %current is not
> + * reclaiming memory or running on a workqueue which doesn't have
> + * %WQ_MEM_RECLAIM as that can break forward-progress guarantee leading to
> + * a deadlock.
> + */
> +static void check_flush_dependency(struct workqueue_struct *target_wq,
> + struct work_struct *target_work)
> +{
> + work_func_t target_func = target_work ? target_work->func : NULL;
> + struct worker *worker;
> +
> + if (target_wq->flags & WQ_MEM_RECLAIM)
> + return;
> +
> + worker = current_wq_worker();
> +
> + WARN_ONCE(current->flags & PF_MEMALLOC,
> + "workqueue: PF_MEMALLOC task %d(%s) is flushing !WQ_MEM_RECLAIM %s:%pf",
> + current->pid, current->comm, target_wq->name, target_func);
> + WARN_ONCE(worker && (worker->current_pwq->wq->flags & WQ_MEM_RECLAIM),
> + "workqueue: WQ_MEM_RECLAIM %s:%pf is flushing !WQ_MEM_RECLAIM %s:%pf",
> + worker->current_pwq->wq->name, worker->current_func,
> + target_wq->name, target_func);
> +}
> +
> struct wq_barrier {
> struct work_struct work;
> struct completion done;
> @@ -2539,6 +2570,8 @@ void flush_workqueue(struct workqueue_st
> list_add_tail(&this_flusher.list, &wq->flusher_overflow);
> }
>
> + check_flush_dependency(wq, NULL);
> +
> mutex_unlock(&wq->mutex);
>
> wait_for_completion(&this_flusher.done);
> @@ -2711,6 +2744,8 @@ static bool start_flush_work(struct work
> pwq = worker->current_pwq;
> }
>
> + check_flush_dependency(pwq->wq, work);
> +
> insert_wq_barrier(pwq, barr, work, worker);
> spin_unlock_irq(&pool->lock);
>
>
I am hitting the warnings when using cancel_delayed_work_sync(). I would
have thought that forward progress would still be guaranteed in that case.
Is it true that it is not?
Hello,
On Thu, Mar 10, 2016 at 05:12:54PM +0200, Adrian Hunter wrote:
> > @@ -2711,6 +2744,8 @@ static bool start_flush_work(struct work
> > pwq = worker->current_pwq;
> > }
> >
> > + check_flush_dependency(pwq->wq, work);
> > +
> > insert_wq_barrier(pwq, barr, work, worker);
> > spin_unlock_irq(&pool->lock);
>
> I am hitting the warnings when using cancel_delayed_work_sync(). I would
> have thought that forward progress would still be guaranteed in that case.
> Is it true that it is not?
I'm not sure I understand what you're trying to say. If you're trying
to do cancel_delayed_work_sync() from a memreclaim wq on a work item
which is executing on !memreclaim wq, that'd be an incorrect thing to
do as that can deadlock the memreclaim wq under memory pressure.
Thanks.
--
tejun
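To make the failure mode concrete, below is a minimal hypothetical sketch
of the dependency that check_flush_dependency() warns about; the workqueue
and work item names are made up for illustration.

#include <linux/workqueue.h>

static struct workqueue_struct *reclaim_wq;	/* has WQ_MEM_RECLAIM */
static struct workqueue_struct *plain_wq;	/* no WQ_MEM_RECLAIM, no rescuer */
static struct delayed_work plain_dwork;		/* queued on plain_wq */

static void reclaim_work_fn(struct work_struct *work)
{
	/*
	 * Runs on reclaim_wq, which may be the only thing still making
	 * progress under memory pressure.  Waiting here for a work item
	 * owned by plain_wq makes the reclaim path depend on a workqueue
	 * with no forward-progress guarantee, which is exactly what the
	 * new WARN_ONCE() flags.
	 */
	cancel_delayed_work_sync(&plain_dwork);
}

static int __init example_init(void)
{
	reclaim_wq = alloc_workqueue("example_reclaim", WQ_MEM_RECLAIM, 0);
	plain_wq = alloc_workqueue("example_plain", 0, 0);
	return reclaim_wq && plain_wq ? 0 : -ENOMEM;
}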