2023-08-11 10:49:19

by Tio Zhang

Subject: [PATCH] workqueue: let WORKER_CPU_INTENSIVE be included in watchdog

When a pool has a worker with WORKER_CPU_INTENSIVE set but its other
workers are not busy, pool->worklist will mostly be empty, so the
CPU-intensive work always has a chance of escaping the watchdog's check.
The watchdog may therefore never notice a work item on a
WQ_CPU_INTENSIVE workqueue that runs forever.

Also, since commit 616db8779b1e3f93075df691432cccc5ef3c3ba0, workers
running works that turn out to be CPU-intensive are automatically marked
WORKER_CPU_INTENSIVE, so the watchdog might miss any work item that runs
forever.

Signed-off-by: Tio Zhang <[email protected]>
---
kernel/workqueue.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..29875b680f5b 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6280,10 +6280,23 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 	rcu_read_lock();

 	for_each_pool(pool, pi) {
+		struct worker *worker;
 		unsigned long pool_ts, touched, ts;
+		bool check_intensive = false;

 		pool->cpu_stall = false;
-		if (list_empty(&pool->worklist))
+
+		/* Not sure if WORKER_UNBOUND should also be
+		 * included?  Letting an unbound work item run for
+		 * more than e.g. 30 seconds also seems unacceptable.
+		 */
+		for_each_pool_worker(worker, pool) {
+			if (worker->flags & WORKER_CPU_INTENSIVE) {
+				check_intensive = true;
+				break;
+			}
+		}
+		if (list_empty(&pool->worklist) && !check_intensive)
 			continue;

 		/*
--
2.17.1



2023-08-14 14:54:56

by kernel test robot

Subject: Re: [PATCH] workqueue: let WORKER_CPU_INTENSIVE be included in watchdog



Hello,

kernel test robot noticed "WARNING:at_kernel/workqueue.c:#wq_watchdog_timer_fn" on:

commit: f5d265c1a77104897fad14235b2637b155c01efd ("[PATCH] workqueue: let WORKER_CPU_INTENSIVE be included in watchdog")
url: https://github.com/intel-lab-lkp/linux/commits/Tio-Zhang/workqueue-let-WORKER_CPU_INTENSIVE-be-included-in-watchdog/20230811-182610
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/all/20230811102250.GA7959@didi-ThinkCentre-M930t-N000/
patch subject: [PATCH] workqueue: let WORKER_CPU_INTENSIVE be included in watchdog

in testcase: rcutorture
version:
with following parameters:

runtime: 300s
test: default
torture_type: srcu



compiler: gcc-12
test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

(please refer to attached dmesg/kmsg for entire log/backtrace)



If you fix the issue in a separate patch/commit (i.e. not just a new version of
the same patch/commit), kindly add following tags
| Reported-by: kernel test robot <[email protected]>
| Closes: https://lore.kernel.org/oe-lkp/[email protected]



[ 35.935095][ C0] ------------[ cut here ]------------
[ 35.936505][ C0] WARNING: CPU: 0 PID: 100 at kernel/workqueue.c:6400 wq_watchdog_timer_fn+0x185/0x3b0
[ 35.938627][ C0] Modules linked in:
[ 35.939641][ C0] CPU: 0 PID: 100 Comm: systemd-journal Not tainted 6.5.0-rc1-00043-gf5d265c1a771 #1
[ 35.941708][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.2-debian-1.16.2-1 04/01/2014
[ 35.948000][ C0] EIP: wq_watchdog_timer_fn+0x185/0x3b0
[ 35.949262][ C0] Code: b8 bc bc 2b c2 e8 db ee c8 00 31 d2 89 c3 85 c0 b8 28 91 89 c2 6a 00 0f 94 c2 31 c9 e8 44 b9 10 00 85 db 58 0f 85 6d ff ff ff <0f> 0b ba
01 00 00 00 e9 63 ff ff ff 8d b4 26 00 00 00 00 8b 75 e8
[ 35.953232][ C0] EAX: 00000000 EBX: 00000000 ECX: 00000000 EDX: 00000001
[ 35.954818][ C0] ESI: c22bb970 EDI: c3aa3500 EBP: c4067f0c ESP: c4067ee8
[ 35.956302][ C0] DS: 007b ES: 007b FS: 0000 GS: 0033 SS: 0068 EFLAGS: 00210246
[ 35.957911][ C0] CR0: 80050033 CR2: b7ae9360 CR3: 2d43e000 CR4: 00040690
[ 35.959401][ C0] Call Trace:
[ 35.960250][ C0] <SOFTIRQ>
[ 35.961061][ C0] ? show_regs+0x50/0x60
[ 35.962135][ C0] ? wq_watchdog_timer_fn+0x185/0x3b0
[ 35.963335][ C0] ? __warn+0x6f/0x1d0
[ 35.964381][ C0] ? wq_watchdog_timer_fn+0x185/0x3b0
[ 35.965658][ C0] ? report_bug+0x169/0x190
[ 35.966864][ C0] ? exc_overflow+0x40/0x40
[ 35.968144][ C0] ? handle_bug+0x28/0x50
[ 35.969223][ C0] ? exc_invalid_op+0x1a/0x60
[ 35.970336][ C0] ? handle_exception+0x14a/0x14a
[ 35.971590][ C0] ? poke_int3_handler+0x1eb/0x2e0
[ 35.972787][ C0] ? exc_overflow+0x40/0x40
[ 35.973818][ C0] ? wq_watchdog_timer_fn+0x185/0x3b0
[ 35.975100][ C0] ? exc_overflow+0x40/0x40
[ 35.976255][ C0] ? wq_watchdog_timer_fn+0x185/0x3b0
[ 35.977590][ C0] ? show_all_workqueues+0x300/0x300
[ 35.978887][ C0] call_timer_fn+0xb7/0x310
[ 35.980130][ C0] ? show_all_workqueues+0x300/0x300
[ 35.981379][ C0] ? show_all_workqueues+0x300/0x300
[ 35.982658][ C0] __run_timers+0x2a3/0x3b0
[ 35.983860][ C0] run_timer_softirq+0x1c/0x20
[ 35.985029][ C0] __do_softirq+0x144/0x518
[ 35.986182][ C0] ? __lock_text_end+0xc/0xc
[ 35.987384][ C0] call_on_stack+0x45/0x50
[ 35.988532][ C0] </SOFTIRQ>
[ 35.989438][ C0] ? irq_exit_rcu+0xb3/0xf0
[ 35.990584][ C0] ? sysvec_apic_timer_interrupt+0x1f/0x30
[ 35.992046][ C0] ? handle_exception+0x14a/0x14a
[ 35.993299][ C0] ? percpu_ref_put_many+0x64/0x140
[ 35.994831][ C0] ? vmware_sched_clock+0x100/0x100
[ 35.996171][ C0] ? lock_release+0x7b/0xe0
[ 35.997313][ C0] ? vmware_sched_clock+0x100/0x100
[ 35.998584][ C0] ? lock_release+0x7b/0xe0
[ 35.999799][ C0] ? percpu_ref_put_many+0x78/0x140
[ 36.001302][ C0] ? uncharge_folio+0x198/0x3a0
[ 36.002498][ C0] ? __mem_cgroup_uncharge_list+0x52/0x90
[ 36.003965][ C0] ? release_pages+0x17b/0x4b0
[ 36.005193][ C0] ? __folio_batch_release+0x1d/0x40
[ 36.006440][ C0] ? shmem_undo_range+0x2f8/0x7f0
[ 36.007799][ C0] ? shmem_evict_inode+0x111/0x2e0
[ 36.008857][ C0] ? __lock_release+0x152/0x2f0
[ 36.009924][ C0] ? check_preemption_disabled+0x2a/0x50
[ 36.011540][ C0] ? preempt_count_sub+0x74/0x150
[ 36.012814][ C0] ? _raw_spin_unlock+0x57/0x80
[ 36.014052][ C0] ? evict+0xed/0x220
[ 36.015188][ C0] ? evict+0xed/0x220
[ 36.016345][ C0] ? iput_final+0x148/0x190
[ 36.017550][ C0] ? iput+0x14f/0x180
[ 36.024349][ C0] ? dentry_unlink_inode+0xaf/0x110
[ 36.025748][ C0] ? __dentry_kill+0x11f/0x200
[ 36.027051][ C0] ? dentry_kill+0x7b/0x1f0
[ 36.028271][ C0] ? dput+0x2d8/0x2f0
[ 36.029343][ C0] ? __fput+0x164/0x400
[ 36.030465][ C0] ? ____fput+0xd/0x10
[ 36.031626][ C0] ? task_work_run+0x94/0xf0
[ 36.033020][ C0] ? exit_to_user_mode_prepare+0x335/0x340
[ 36.034506][ C0] ? syscall_exit_to_user_mode+0x1a/0x50
[ 36.035998][ C0] ? do_int80_syscall_32+0x62/0xa0
[ 36.037305][ C0] ? entry_INT80_32+0x107/0x107
[ 36.038557][ C0] irq event stamp: 234966
[ 36.039817][ C0] hardirqs last enabled at (234978): [<c10eaca6>] __up_console_sem+0x56/0x60
[ 36.041931][ C0] hardirqs last disabled at (234987): [<c10eac8d>] __up_console_sem+0x3d/0x60
[ 36.044105][ C0] softirqs last enabled at (233030): [<c1d34934>] __do_softirq+0x2a4/0x518
[ 36.046155][ C0] softirqs last disabled at (234197): [<c1023fb5>] call_on_stack+0x45/0x50
[ 36.048296][ C0] ---[ end trace 0000000000000000 ]---



The kernel config and materials to reproduce are available at:
https://download.01.org/0day-ci/archive/20230814/[email protected]



--
0-DAY CI Kernel Test Service
https://github.com/intel/lkp-tests/wiki
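
The warning is most likely the lockdep assertion inside
for_each_pool_worker(): iterating a pool's worker list is only valid
while wq_pool_attach_mutex is held, which the v1 patch did not take (it
walked the workers under rcu_read_lock() only).  Roughly, the macro in
kernel/workqueue.c looks like this (paraphrased from memory, exact form
may differ):

	#define for_each_pool_worker(worker, pool)				\
		list_for_each_entry((worker), &(pool)->workers, node)		\
			if (({ lockdep_assert_held(&wq_pool_attach_mutex); false; })) { } \
			else

That assertion firing from the watchdog timer callback would produce a
WARNING like the one above, and would explain why v2 below replaces
rcu_read_lock() with wq_pool_mutex and takes wq_pool_attach_mutex around
the worker walk.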


2023-08-22 14:07:17

by Tio Zhang

Subject: [PATCH v2] workqueue: let WORKER_CPU_INTENSIVE be included in watchdog

When a pool has a worker with WORKER_CPU_INTENSIVE set but its other
workers are not busy, pool->worklist will mostly be empty, so the
CPU-intensive work always has a chance of escaping the watchdog's check.
The watchdog may therefore never notice a work item on a
WQ_CPU_INTENSIVE workqueue that runs forever.

Also, since commit 616db8779b1e3f93075df691432cccc5ef3c3ba0, workers
running works that turn out to be CPU-intensive are automatically marked
WORKER_CPU_INTENSIVE, so the watchdog might miss any work item that runs
forever.

Signed-off-by: Tio Zhang <[email protected]>
---
kernel/workqueue.c | 22 +++++++++++++++++++---
1 file changed, 19 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 02a8f402eeb5..564d96c38d4d 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -6277,13 +6277,29 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)
 	if (!thresh)
 		return;

-	rcu_read_lock();
+	mutex_lock(&wq_pool_mutex);

 	for_each_pool(pool, pi) {
+		struct worker *worker;
 		unsigned long pool_ts, touched, ts;
+		bool check_intensive = false;

 		pool->cpu_stall = false;
-		if (list_empty(&pool->worklist))
+
+		/* Not sure if WORKER_UNBOUND should also be
+		 * included?  Letting an unbound work item run for
+		 * more than e.g. 30 seconds also seems unacceptable.
+		 */
+		mutex_lock(&wq_pool_attach_mutex);
+		for_each_pool_worker(worker, pool) {
+			if (worker->flags & WORKER_CPU_INTENSIVE) {
+				check_intensive = true;
+				break;
+			}
+		}
+		mutex_unlock(&wq_pool_attach_mutex);
+
+		if (list_empty(&pool->worklist) && !check_intensive)
 			continue;

 		/*
@@ -6320,7 +6336,7 @@ static void wq_watchdog_timer_fn(struct timer_list *unused)

 	}

-	rcu_read_unlock();
+	mutex_unlock(&wq_pool_mutex);

 	if (lockup_detected)
 		show_all_workqueues();
--
2.17.1
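
For anyone who wants to reproduce the scenario the patch targets, here
is a minimal, untested module sketch (the module and all of its names
are hypothetical, purely illustrative): it queues one work item that
never returns, so with commit 616db8779b1e the worker running it is
eventually marked WORKER_CPU_INTENSIVE while pool->worklist stays
empty, which is exactly the case the pre-patch watchdog skips.

	// SPDX-License-Identifier: GPL-2.0
	/* hog_test.c - hypothetical test module, not part of the patch. */
	#include <linux/module.h>
	#include <linux/sched.h>
	#include <linux/workqueue.h>

	static struct workqueue_struct *hog_wq;
	static struct work_struct hog_work;

	static void hog_fn(struct work_struct *work)
	{
		/* Spin forever; cond_resched() keeps the scheduler happy,
		 * but the work item never completes, so its worker keeps
		 * hogging the CPU. */
		for (;;)
			cond_resched();
	}

	static int __init hog_init(void)
	{
		/* Per-cpu (bound) workqueue; no WQ_CPU_INTENSIVE needed,
		 * auto-detection should flag the worker on its own. */
		hog_wq = alloc_workqueue("hog_wq", 0, 0);
		if (!hog_wq)
			return -ENOMEM;
		INIT_WORK(&hog_work, hog_fn);
		queue_work(hog_wq, &hog_work);
		return 0;
	}

	static void __exit hog_exit(void)
	{
		/* Never reached in practice since hog_fn() never returns. */
		destroy_workqueue(hog_wq);
	}

	module_init(hog_init);
	module_exit(hog_exit);
	MODULE_LICENSE("GPL");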