2023-02-01 13:46:25

by Petr Mladek

[permalink] [raw]
Subject: [RFC 0/5] workqueue: Debugging improvements

The workqueue watchdog provides a lot of impormation when a stall is
detected. The report says a lot about what workqueues and worker pools
are active and what is being blocked. Unfortunately, it does not provide
much information about what caused the stall.

In particular, it did not help me to get root of the following problems:

+ New workers were not created because the system reached PID limit.
Admins limited it too much in a cloud.

+ A networking driver was not loaded because systemd killed modprobe
when switching the root from initrd to the booted system.

It was surprisingly quite reproducible. Interrupts are not handled
immediately in kernel code. The wait in kthread_create_on_node()
was one of few locations. So the race window evidently was not
trivial.


1st patch fixes a misleading "hung" time report.

2nd, 3rd, and 4rd patches add warnings into create_worker() and
create_rescuer(). The rather persistent errors are printed only once
until it succeeds again. Otherwise it would be too noisy and might even
break the watchdog.

5th patch adds printing bracktraces of CPU-bound workers that might
block CPU-bound workqueues. The candidate is well defined to keep
the number of backtraces small. It always printed only the right one
during my testing.


The first 4 patches would have helped me to debug the real problems
that I met.

The 5th patch is theoretical. I did not see this case in practice.
But it looks realistic enough. And it worked very well when I
simulated the problem. IMHO, it should be pretty useful.


Petr Mladek (5):
workqueue: Fix hung time report of worker pools
workqueue: Warn when a new worker could not be created
workqueue: Interrupted create_worker() is not a repeated event
workqueue: Warn when a rescuer could not be created
workqueue: Print backtraces from CPUs with hung CPU bound workqueues

kernel/workqueue.c | 186 ++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 177 insertions(+), 9 deletions(-)

--
2.35.3



2023-02-01 13:46:29

by Petr Mladek

[permalink] [raw]
Subject: [RFC 1/5] workqueue: Fix hung time report of worker pools

The workqueue watchdog prints a warning when there is no progress in
a worker pool. Where the progress means that the pool started processing
a pending work item.

Note that it is perfectly fine to process work items much longer.
The progress should be guaranteed by waking up or creating idle
workers.

show_one_worker_pool() prints state of non-idle worker pool. It shows
a delay since the last pool->watchdog_ts.

The timestamp is updated when a first pending work is queued in
__queue_work(). Also it is updated when a work is dequeued for
processing in worker_thread() and rescuer_thread().

The delay is misleading when there is no pending work item. In this
case it shows how long the last work item is being proceed. Show
zero instead. There is no stall if there is no pending work.

Fixes: 82607adcf9cdf40fb7b ("workqueue: implement lockup detector")
Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 07895deca271..5419d12e56ae 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -4848,10 +4848,16 @@ static void show_one_worker_pool(struct worker_pool *pool)
struct worker *worker;
bool first = true;
unsigned long flags;
+ unsigned long hung = 0;

raw_spin_lock_irqsave(&pool->lock, flags);
if (pool->nr_workers == pool->nr_idle)
goto next_pool;
+
+ /* How long the first pending work is waiting for a worker. */
+ if (!list_empty(&pool->worklist))
+ hung = jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000;
+
/*
* Defer printing to avoid deadlocks in console drivers that
* queue work while holding locks also taken in their write
@@ -4860,9 +4866,7 @@ static void show_one_worker_pool(struct worker_pool *pool)
printk_deferred_enter();
pr_info("pool %d:", pool->id);
pr_cont_pool_info(pool);
- pr_cont(" hung=%us workers=%d",
- jiffies_to_msecs(jiffies - pool->watchdog_ts) / 1000,
- pool->nr_workers);
+ pr_cont(" hung=%lus workers=%d", hung, pool->nr_workers);
if (pool->manager)
pr_cont(" manager: %d",
task_pid_nr(pool->manager->task));
--
2.35.3


2023-02-01 13:46:33

by Petr Mladek

[permalink] [raw]
Subject: [RFC 2/5] workqueue: Warn when a new worker could not be created

The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.

The progress is guaranteed by using idle workers or creating new workers
for pending work items.

There are several reasons why a new worker could not be created:

+ there is not enough memory

+ there is no free pool ID (IDR API)

+ the system reached PID limit

+ the process creating the new worker was interrupted

+ the last idle worker (manager) has not been scheduled for a long
time. It was not able to even start creating the kthread.

None of these failures is reported at the moment. The only clue is that
show_one_worker_pool() prints that there is a manager. It is the last
idle worker that is responsible for creating a new one. But it is not
clear if create_worker() is repeatedly failing and why.

Make the debugging easier by printing warnings in create_worker().

The error code is important, especially from kthread_create_on_node().
It helps to distinguish the various reasons. For example, reaching
memory limit (-ENOMEM), other system limits (-EAGAIN), or process
interrupted (-EINTR).

Print the warnings only once during the stall. printk() might touch
the watchdog and prevent reaching the watchdog thresh. For example,
see touch_nmi_watchdog() in serial8250_console_write(). Anyway,
many of these failures are quite persistent. And it does not make
sense to report the same problem every second, see CREATE_COOLDOWN.

In addition print a message when create_worker() succeeded again.

Also print how many times it failed in the hung report when it
still keeps failing.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 82 +++++++++++++++++++++++++++++++++++++++++++---
1 file changed, 77 insertions(+), 5 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5419d12e56ae..ab109ef7a7c0 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1911,7 +1911,68 @@ static void worker_detach_from_pool(struct worker *worker)
complete(detach_completion);
}

-/**
+static int create_worker_failed;
+static DEFINE_SPINLOCK(create_worker_failed_lock);
+
+static __printf(2, 3) __cold
+void __print_create_worker_failure(long err, const char *fmt, ...)
+{
+ spin_lock_irq(&create_worker_failed_lock);
+
+ /*
+ * Report potentially repeated failures only once during a stall.
+ * Otherwise, it might be noisy. Also slow serial console drivers
+ * touch watchdogs so that more frequent messages would prevent
+ * reaching the watchdog thresh.
+ */
+ if (!create_worker_failed) {
+ va_list args;
+
+ va_start(args, fmt);
+ vprintk(fmt, args);
+ va_end(args);
+ }
+
+ create_worker_failed++;
+
+ spin_unlock_irq(&create_worker_failed_lock);
+}
+
+#define print_create_worker_failure(msg, err) \
+ do { \
+ long _err = err; \
+ \
+ __print_create_worker_failure(_err, KERN_WARNING msg ":%pe\n", (void *)_err); \
+ } while (0)
+
+static void print_create_worker_success(void)
+{
+ spin_lock_irq(&create_worker_failed_lock);
+
+ if (create_worker_failed) {
+ pr_info("workqueue: Successfully created a worker after %d attempts\n",
+ create_worker_failed);
+ }
+ create_worker_failed = 0;
+
+ spin_unlock_irq(&create_worker_failed_lock);
+}
+
+static void print_create_worker_failed_num(void)
+{
+ unsigned long flags;
+
+ spin_lock_irqsave(&create_worker_failed_lock, flags);
+
+ if (create_worker_failed) {
+ pr_info("workqueue: %d times failed to create a new worker since the last success\n",
+ create_worker_failed);
+ }
+
+ spin_unlock_irqrestore(&create_worker_failed_lock, flags);
+}
+
+ /**
* create_worker - create a new workqueue worker
* @pool: pool the new worker will belong to
*
@@ -1931,12 +1992,16 @@ static struct worker *create_worker(struct worker_pool *pool)

/* ID is needed to determine kthread name */
id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
- if (id < 0)
+ if (id < 0) {
+ print_create_worker_failure("workqueue: Failed to allocate a pool ID", id);
return NULL;
+ }

worker = alloc_worker(pool->node);
- if (!worker)
+ if (!worker) {
+ print_create_worker_failure("workqueue: Failed to allocate a worker", -ENOMEM);
goto fail;
+ }

worker->id = id;

@@ -1948,8 +2013,11 @@ static struct worker *create_worker(struct worker_pool *pool)

worker->task = kthread_create_on_node(worker_thread, worker, pool->node,
"kworker/%s", id_buf);
- if (IS_ERR(worker->task))
+ if (IS_ERR(worker->task)) {
+ print_create_worker_failure("workqueue: Failed to create a worker thread",
+ PTR_ERR(worker->task));
goto fail;
+ }

set_user_nice(worker->task, pool->attrs->nice);
kthread_bind_mask(worker->task, pool->attrs->cpumask);
@@ -1964,6 +2032,8 @@ static struct worker *create_worker(struct worker_pool *pool)
wake_up_process(worker->task);
raw_spin_unlock_irq(&pool->lock);

+ print_create_worker_success();
+
return worker;

fail:
@@ -5880,8 +5950,10 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)

rcu_read_unlock();

- if (lockup_detected)
+ if (lockup_detected) {
+ print_create_worker_failed_num();
show_all_workqueues();
+ }

wq_watchdog_reset_touched();
mod_timer(&wq_watchdog_timer, jiffies + thresh);
--
2.35.3


2023-02-01 13:46:36

by Petr Mladek

[permalink] [raw]
Subject: [RFC 3/5] workqueue: Interrupted create_worker() is not a repeated event

kthread_create_on_node() might get interrupted(). It is rare but realistic.
For example, when an unbound workqueue is allocated in module_init()
callback. It is done in the context of the "modprobe" process. And,
for example, systemd might kill pending processes when switching root
from initrd to the booted system.

Handle the interrupt a special way. It a one-off event in compare with
the other failures. Always print this warning. And do not block the other
warnings.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index ab109ef7a7c0..5f3327e119b4 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1925,7 +1925,7 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
* touch watchdogs so that more frequent messages would prevent
* reaching the watchdog thresh.
*/
- if (!create_worker_failed) {
+ if (!create_worker_failed || err == -EINTR) {
va_list args;

va_start(args, fmt);
@@ -1933,7 +1933,8 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
va_end(args);
}

- create_worker_failed++;
+ if (err != -EINTR)
+ create_worker_failed++;

spin_unlock_irq(&create_worker_failed_lock);
}
--
2.35.3


2023-02-01 13:46:41

by Petr Mladek

[permalink] [raw]
Subject: [RFC 4/5] workqueue: Warn when a rescuer could not be created

Rescuers are created when a workqueue with WQ_MEM_RECLAIM is allocated.
It typically happens during the system boot.

systemd switches the root filesystem from initrd to the booted system
during boot. It kills processes that block the switch for too long.
One of the process might be modprobe that tries to create a workqueue.

These problems are hard to reproduce. Also alloc_workqueue() does not
pass the error code. Make the debugging easier by printing a warning,
similar to create_worker().

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 31 +++++++++++++++++++++++++------
1 file changed, 25 insertions(+), 6 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 5f3327e119b4..58b835420435 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1914,9 +1914,11 @@ static void worker_detach_from_pool(struct worker *worker)
static int create_worker_failed;
static DEFINE_SPINLOCK(create_worker_failed_lock);

-static __printf(2, 3) __cold
-void __print_create_worker_failure(long err, const char *fmt, ...)
+static __printf(3, 4) __cold
+void __print_create_worker_failure(bool is_rescuer, long err, const char *fmt, ...)
{
+ bool one_off_failure;
+
spin_lock_irq(&create_worker_failed_lock);

/*
@@ -1925,7 +1927,8 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
* touch watchdogs so that more frequent messages would prevent
* reaching the watchdog thresh.
*/
- if (!create_worker_failed || err == -EINTR) {
+ one_off_failure = (err == -EINTR || is_rescuer);
+ if (!create_worker_failed || one_off_failure) {
va_list args;

va_start(args, fmt);
@@ -1933,7 +1936,7 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
va_end(args);
}

- if (err != -EINTR)
+ if (!one_off_failure)
create_worker_failed++;

spin_unlock_irq(&create_worker_failed_lock);
@@ -1943,7 +1946,9 @@ void __print_create_worker_failure(long err, const char *fmt, ...)
do { \
long _err = err; \
\
- __print_create_worker_failure(_err, KERN_WARNING msg ":%pe\n", (void *)_err); \
+ __print_create_worker_failure(false, _err, \
+ KERN_WARNING msg ":%pe\n", \
+ (void *)_err); \
} while (0)

static void print_create_worker_success(void)
@@ -4323,6 +4328,15 @@ static int wq_clamp_max_active(int max_active, unsigned int flags,
return clamp_val(max_active, 1, lim);
}

+#define print_create_rescuer_failure(msg, name, err) \
+ do { \
+ long _err = err; \
+ \
+ __print_create_worker_failure(true, _err, \
+ KERN_WARNING msg ": %s :%pe\n", \
+ name, (void *)_err); \
+ } while (0)
+
/*
* Workqueues which may be used during memory reclaim should have a rescuer
* to guarantee forward progress.
@@ -4336,13 +4350,18 @@ static int init_rescuer(struct workqueue_struct *wq)
return 0;

rescuer = alloc_worker(NUMA_NO_NODE);
- if (!rescuer)
+ if (!rescuer) {
+ print_create_rescuer_failure("workqueue: Failed to allocate a rescuer",
+ wq->name, -ENOMEM);
return -ENOMEM;
+ }

rescuer->rescue_wq = wq;
rescuer->task = kthread_create(rescuer_thread, rescuer, "%s", wq->name);
if (IS_ERR(rescuer->task)) {
ret = PTR_ERR(rescuer->task);
+ print_create_rescuer_failure("workqueue: Failed to create a rescuer thread",
+ wq->name, ret);
kfree(rescuer);
return ret;
}
--
2.35.3


2023-02-01 13:46:43

by Petr Mladek

[permalink] [raw]
Subject: [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues

The workqueue watchdog reports a lockup when there was not any progress
in the worker pool for a long time. The progress means that a pending
work item starts being proceed.

Worker pools for unbound workqueues always wake up an idle worker and
try to process the work immediately. The last idle worker has to create
new worker first. The stall might happen only when a new worker could
not be created in which case some warnings should get printed.
Another problem might be too high load. In this case, workers are
victims of a global system problem.

Worker pools for CPU bound workqueues are designed for lightweight
work items that do not need much CPU time. They are ideally proceed
on a single worker. New worker is used only when a work is sleeping.
It brings one additional problem. The stall might happen when
the CPU-bound workqueue is used for CPU-intensive work.

The problem happens when the CPU-intensive work is not sleeping/waiting
for a long time. Such a worker would be in TASK_RUNNING state.

In this case, it might be useful to see backtrace from the problematic
worker.

wq_watchdog_timer_fn() starts with one line report for each stalled worker
pool. All the extra information are printed later.

The most precise solution would be to make a note in struct worker_pool
when the first check found a stall in a CPU-bound workqueue.

A good enough heuristic should be printing backtraces of workers in
TASK_RUNNING state on CPUs where the stall was detected. After all,
there are just two CPU-bound workers pools for normal and high priority
workers.

Signed-off-by: Petr Mladek <[email protected]>
---
kernel/workqueue.c | 72 ++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 72 insertions(+)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 58b835420435..c6017d178b91 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -49,6 +49,7 @@
#include <linux/moduleparam.h>
#include <linux/uaccess.h>
#include <linux/sched/isolation.h>
+#include <linux/sched/debug.h>
#include <linux/nmi.h>
#include <linux/kvm_para.h>

@@ -5911,6 +5912,68 @@ static struct timer_list wq_watchdog_timer;

static unsigned long wq_watchdog_touched = INITIAL_JIFFIES;
static DEFINE_PER_CPU(unsigned long, wq_watchdog_touched_cpu) = INITIAL_JIFFIES;
+static DEFINE_PER_CPU(bool, wq_watchdog_hung_cpu);
+
+/*
+ * Show workers that might prevent processing pending work items.
+ * The only candidates are CPU-bound workers in a running state.
+ * Pending work items should be handled by another idle worker in all
+ * other situations.
+ */
+static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)
+{
+ bool shown_any = false;
+ struct worker *worker;
+ unsigned long flags;
+ int bkt;
+
+ raw_spin_lock_irqsave(&pool->lock, flags);
+
+ if (pool->cpu < 0)
+ goto out;
+
+ if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
+ goto out;
+
+ if (list_empty(&pool->worklist))
+ goto out;
+
+ hash_for_each(pool->busy_hash, bkt, worker, hentry) {
+ if (!task_is_running(worker->task))
+ continue;
+
+ if (!shown_title) {
+ pr_info("Showing backtraces of running workers in stuck CPU-bound worker pools:\n");
+ shown_title = true;
+ }
+
+ shown_any = true;
+ pr_info("pool %d:\n", pool->id);
+ sched_show_task(worker->task);
+ }
+out:
+ raw_spin_unlock_irqrestore(&pool->lock, flags);
+ return shown_any;
+}
+
+static void show_suspicious_workers(void)
+{
+ struct worker_pool *pool;
+ bool shown_title = false;
+ int pi;
+
+ rcu_read_lock();
+
+ for_each_pool(pool, pi) {
+ bool shown;
+
+ shown = show_pool_suspicious_workers(pool, shown_title);
+ if (shown)
+ shown_title = shown;
+ }
+
+ rcu_read_unlock();
+}

static void wq_watchdog_reset_touched(void)
{
@@ -5927,11 +5990,15 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
bool lockup_detected = false;
unsigned long now = jiffies;
struct worker_pool *pool;
+ int cpu;
int pi;

if (!thresh)
return;

+ for_each_online_cpu(cpu)
+ per_cpu(wq_watchdog_hung_cpu, cpu) = false;
+
rcu_read_lock();

for_each_pool(pool, pi) {
@@ -5961,11 +6028,15 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
/* did we stall? */
if (time_after(now, ts + thresh)) {
lockup_detected = true;
+ if (pool->cpu >= 0)
+ per_cpu(wq_watchdog_hung_cpu, pool->cpu) = true;
pr_emerg("BUG: workqueue lockup - pool");
pr_cont_pool_info(pool);
pr_cont(" stuck for %us!\n",
jiffies_to_msecs(now - pool_ts) / 1000);
}
+
+
}

rcu_read_unlock();
@@ -5973,6 +6044,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
if (lockup_detected) {
print_create_worker_failed_num();
show_all_workqueues();
+ show_suspicious_workers();
}

wq_watchdog_reset_touched();
--
2.35.3


2023-02-02 23:30:48

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 2/5] workqueue: Warn when a new worker could not be created

Hello,

On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek wrote:
> +static __printf(2, 3) __cold
> +void __print_create_worker_failure(long err, const char *fmt, ...)
> +{
> + spin_lock_irq(&create_worker_failed_lock);
> +
> + /*
> + * Report potentially repeated failures only once during a stall.
> + * Otherwise, it might be noisy. Also slow serial console drivers
> + * touch watchdogs so that more frequent messages would prevent
> + * reaching the watchdog thresh.
> + */
> + if (!create_worker_failed) {
> + va_list args;
> +
> + va_start(args, fmt);
> + vprintk(fmt, args);
> + va_end(args);
> + }
> +
> + create_worker_failed++;
> +
> + spin_unlock_irq(&create_worker_failed_lock);
> +}

That's pretty elaborate. Why not just use printk_ratelimited()?

> @@ -1931,12 +1992,16 @@ static struct worker *create_worker(struct worker_pool *pool)
>
> /* ID is needed to determine kthread name */
> id = ida_alloc(&pool->worker_ida, GFP_KERNEL);
> - if (id < 0)
> + if (id < 0) {
> + print_create_worker_failure("workqueue: Failed to allocate a pool ID", id);
^
worker ID?

> @@ -5880,8 +5950,10 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
>
> rcu_read_unlock();
>
> - if (lockup_detected)
> + if (lockup_detected) {
> + print_create_worker_failed_num();
> show_all_workqueues();
> + }

I'd just use printk_ratelimited() and leave it at that.

Thanks.

--
tejun

2023-02-02 23:45:17

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues

Hello,

I generally really like it.

> +static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)

Maybe the name can be a bit more specific? show_cpu_pool_hog()?

> +{
> + bool shown_any = false;
> + struct worker *worker;
> + unsigned long flags;
> + int bkt;
> +
> + raw_spin_lock_irqsave(&pool->lock, flags);
> +
> + if (pool->cpu < 0)
> + goto out;

This can be tested before grabbling the lock.

> + if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
> + goto out;

Given that the state is per-pool, would it make sense to mark this on the
pool instead?

> + if (list_empty(&pool->worklist))
> + goto out;

This condition isn't really necessary, right?

> + hash_for_each(pool->busy_hash, bkt, worker, hentry) {
> + if (!task_is_running(worker->task))
> + continue;
> +
> + if (!shown_title) {
> + pr_info("Showing backtraces of running workers in stuck CPU-bound worker pools:\n");
> + shown_title = true;
> + }
> +
> + shown_any = true;
> + pr_info("pool %d:\n", pool->id);
> + sched_show_task(worker->task);
> + }
> +out:
> + raw_spin_unlock_irqrestore(&pool->lock, flags);
> + return shown_any;
> +}
> +
> +static void show_suspicious_workers(void)
> +{
> + struct worker_pool *pool;
> + bool shown_title = false;
> + int pi;
> +
> + rcu_read_lock();
> +
> + for_each_pool(pool, pi) {
> + bool shown;
> +
> + shown = show_pool_suspicious_workers(pool, shown_title);
> + if (shown)
> + shown_title = shown;

Maybe, move shown to the outer scope and:

shown |= show_pool_suspicious_workers(pool, show);

> + }
> +
> + rcu_read_unlock();
> +}
>
> static void wq_watchdog_reset_touched(void)
> {

Thanks.

--
tejun

2023-02-03 14:12:51

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC 2/5] workqueue: Warn when a new worker could not be created

On Thu 2023-02-02 13:30:39, Tejun Heo wrote:
> Hello,
>
> On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek wrote:
> > +static __printf(2, 3) __cold
> > +void __print_create_worker_failure(long err, const char *fmt, ...)
> > +{
> > + spin_lock_irq(&create_worker_failed_lock);
> > +
> > + /*
> > + * Report potentially repeated failures only once during a stall.
> > + * Otherwise, it might be noisy. Also slow serial console drivers
> > + * touch watchdogs so that more frequent messages would prevent
> > + * reaching the watchdog thresh.
> > + */
> > + if (!create_worker_failed) {
> > + va_list args;
> > +
> > + va_start(args, fmt);
> > + vprintk(fmt, args);
> > + va_end(args);
> > + }
> > +
> > + create_worker_failed++;
> > +
> > + spin_unlock_irq(&create_worker_failed_lock);
> > +}
>
> That's pretty elaborate.

Yeah, I am not super happy with it either.

> Why not just use printk_ratelimited()?

The default printk_ratelimited() is not usable because it uses:

#define DEFAULT_RATELIMIT_INTERVAL (5 * HZ)
#define DEFAULT_RATELIMIT_BURST 10

It allows 10 messages per 5 seconds. It would be still too noisy.
maybe_create_worker() tries to create a new worker every second.

And more importantly, it would break both softlockup and workqueue
watchdogs. See touch_nmi_watchdog() in serial8250_console_write().
By other words, it would break both softlockup and workqueue watchdogs.


A solution would be to use a custom printk_ratelimited_wq() that would
allow printing one message per 2 * wq_watchdog_thresh.
Something like:

#define printk_ratelimited_wq(fmt, ...) \
({ \
static DEFINE_RATELIMIT_STATE(_rs, 60 * HZ, 1); \
\
if (__ratelimit(&_rs)) \
printk(fmt, ##__VA_ARGS__); \
})

I admit that it is much easier than __print_create_worker_failure().

The only problem might be that wq_watchdog_thresh might be modified
at runtime. But it can be solved by sharing the same
struct ratelimit_state rs in all printk_ratelimited_wq() calls
and updating it together with wq_watchdog_thresh.

Would you prefer the custom printk_ratelimited_wq(), please?

Best Regards,
Petr

2023-02-03 14:27:09

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC 5/5] workqueue: Print backtraces from CPUs with hung CPU bound workqueues

Hello,

On Thu 2023-02-02 13:45:05, Tejun Heo wrote:
> > +static bool show_pool_suspicious_workers(struct worker_pool *pool, bool shown_title)
> > +{
> > + bool shown_any = false;
> > + struct worker *worker;
> > + unsigned long flags;
> > + int bkt;
> > +
> > + raw_spin_lock_irqsave(&pool->lock, flags);
> > +
> > + if (pool->cpu < 0)
> > + goto out;
>
> This can be tested before grabbling the lock.

I see.

> > + if (!per_cpu(wq_watchdog_hung_cpu, pool->cpu))
> > + goto out;
>
> Given that the state is per-pool, would it make sense to mark this on the
> pool instead?

Makes sense. I think that I started with the per-CPU variable before
I sorted my thoughts about what backtraces were useful ;-)

> > + if (list_empty(&pool->worklist))
> > + goto out;
>
> This condition isn't really necessary, right?

IMHO, it should be there. The watchdog reports the problem only when
there are pending work items, see

if (list_empty(&pool->worklist))
continue;

in wq_watchdog_timer_fn().

My understanding is that it is OK to process work items longer
time when they are sleeping and waiting for something.

Best Regards,
Petr

2023-02-03 19:30:21

by Tejun Heo

[permalink] [raw]
Subject: Re: [RFC 2/5] workqueue: Warn when a new worker could not be created

On Fri, Feb 03, 2023 at 03:10:28PM +0100, Petr Mladek wrote:
> A solution would be to use a custom printk_ratelimited_wq() that would
> allow printing one message per 2 * wq_watchdog_thresh.
> Something like:
>
> #define printk_ratelimited_wq(fmt, ...) \
> ({ \
> static DEFINE_RATELIMIT_STATE(_rs, 60 * HZ, 1); \
> \
> if (__ratelimit(&_rs)) \
> printk(fmt, ##__VA_ARGS__); \
> })
>
> I admit that it is much easier than __print_create_worker_failure().
>
> The only problem might be that wq_watchdog_thresh might be modified
> at runtime. But it can be solved by sharing the same
> struct ratelimit_state rs in all printk_ratelimited_wq() calls
> and updating it together with wq_watchdog_thresh.
>
> Would you prefer the custom printk_ratelimited_wq(), please?

Yeah, I think so.

Thanks.

--
tejun

2023-02-15 18:02:12

by Michal Koutný

[permalink] [raw]
Subject: Re: [RFC 2/5] workqueue: Warn when a new worker could not be created

Hello.

On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek <[email protected]> wrote:
> + the system reached PID limit
or threads-max limit.

FTR, I was once considering something like

--->8---
diff --git a/kernel/fork.c b/kernel/fork.c
index 867b46d6fd0a..bba05ecc3765 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -1684,8 +1684,10 @@ static __latent_entropy struct task_struct *copy_process(
* to stop root fork bombs.
*/
retval = -EAGAIN;
- if (nr_threads >= max_threads)
+ if (nr_threads >= max_threads) {
+ printk_once(KERN_INFO "clone failed due to threads-max limit\n");
goto bad_fork_cleanup_count;
+ }

delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
@@ -1816,6 +1818,7 @@ static __latent_entropy struct task_struct *copy_process(
if (pid != &init_struct_pid) {
pid = alloc_pid(p->nsproxy->pid_ns_for_children);
if (IS_ERR(pid)) {
+ printk_once(KERN_INFO "fork failed to find pid\n");
retval = PTR_ERR(pid);
goto bad_fork_cleanup_thread;
}
--->8---

Effects of the global limits on anything but kthreads should be much less
important and easier to troubleshoot anyway.
Covering kworkers with your changes should be useful and substitute my idea
above. Take that as my support for this patch (from my perspective reporting
*_once would be enough to guide a troubleshooter).

Thanks,
Michal



Attachments:
(No filename) (1.56 kB)
signature.asc (228.00 B)
Digital signature
Download all attachments

2023-02-16 09:43:36

by Petr Mladek

[permalink] [raw]
Subject: Re: [RFC 2/5] workqueue: Warn when a new worker could not be created

On Wed 2023-02-15 19:02:04, Michal Koutn? wrote:
> Hello.
>
> On Wed, Feb 01, 2023 at 02:45:40PM +0100, Petr Mladek <[email protected]> wrote:
> > + the system reached PID limit
> or threads-max limit.
>
> FTR, I was once considering something like
>
> --->8---
> diff --git a/kernel/fork.c b/kernel/fork.c
> index 867b46d6fd0a..bba05ecc3765 100644
> --- a/kernel/fork.c
> +++ b/kernel/fork.c
> @@ -1684,8 +1684,10 @@ static __latent_entropy struct task_struct *copy_process(
> * to stop root fork bombs.
> */
> retval = -EAGAIN;
> - if (nr_threads >= max_threads)
> + if (nr_threads >= max_threads) {
> + printk_once(KERN_INFO "clone failed due to threads-max limit\n");
> goto bad_fork_cleanup_count;
> + }
>
> delayacct_tsk_init(p); /* Must remain after dup_task_struct() */
> p->flags &= ~(PF_SUPERPRIV | PF_WQ_WORKER | PF_IDLE);
> @@ -1816,6 +1818,7 @@ static __latent_entropy struct task_struct *copy_process(
> if (pid != &init_struct_pid) {
> pid = alloc_pid(p->nsproxy->pid_ns_for_children);
> if (IS_ERR(pid)) {
> + printk_once(KERN_INFO "fork failed to find pid\n");
> retval = PTR_ERR(pid);
> goto bad_fork_cleanup_thread;
> }
> --->8---
>
> Effects of the global limits on anything but kthreads should be much less
> important and easier to troubleshoot anyway.
> Covering kworkers with your changes should be useful and substitute my idea
> above.

I was not brave enough to touch kernel/fork.c.

> Take that as my support for this patch (from my perspective reporting
> *_once would be enough to guide a troubleshooter).

My experience is that users often send logs with many stall reports and
the earlier messages are missing/lost. This is why I wanted to repeat
them.

Anyway, thanks for the support. I am going to send v2 when time permits.

Best Regards,
Petr