2022-07-07 09:20:19

by Schspa Shi

Subject: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED

When a CPU is unplugged, all of its workers are rebound to active CPUs
via unbind_workers().

But an unbound worker still has a chance to create a new worker, which
binds the newly created task to pool->attrs->cpumask even though that
CPU has already been unplugged, and the following crash occurs.

The crash log is as follows:
[ 1622.829074] ------------[ cut here ]------------
[ 1622.829081] Dying CPU not properly vacated!
[ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
[ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
[ 1622.829386] Stopper: multi_cpu_stop+0x0/0x160 <- 0x0
[ 1622.829400] pstate: 60c00009 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[ 1622.829408] pc : sched_cpu_dying+0x74/0x204
[ 1622.829415] lr : sched_cpu_dying+0x74/0x204
[ 1622.829421] sp : ffff800012933c70
[ 1622.829424] pmr_save: 00000060
[ 1622.829426] x29: ffff800012933c70 x28: 0000000000000000
[ 1622.829435] x27: 0000000000000000 x26: 0000000000000001
[ 1622.829444] x25: 0000000000000000 x24: ffff800018353c2c
[ 1622.829452] x23: 0000000000000003 x22: 0000000000000003
[ 1622.829460] x21: 0000000000000059 x20: 0000000000000000
[ 1622.829468] x19: ffff00027ee68a80 x18: 0000000000000000
[ 1622.829477] x17: 0000000000000000 x16: 0000000000000000
[ 1622.829485] x15: ffffffffffffffff x14: ffff80001169ae30
[ 1622.829493] x13: ffffffffffc38c07 x12: ffffffffffffffff
[ 1622.829501] x11: ffffffffffe00000 x10: ffff80001169ae58
[ 1622.829510] x9 : 000000000000001e x8 : ffff80001169ae30
[ 1622.829518] x7 : ffff800012933ab0 x6 : 00000000ffff0e20
[ 1622.829526] x5 : ffff00027ee62988 x4 : 00000000ffff0e20
[ 1622.829535] x3 : ffff800011e30180 x2 : 0000000100000002
[ 1622.829543] x1 : 0000000000000000 x0 : 0000000000000000
[ 1622.829552] Call trace:
[ 1622.829555] sched_cpu_dying+0x74/0x204
[ 1622.829562] cpuhp_invoke_callback+0xc0/0x1b0
[ 1622.829571] take_cpu_down+0xbc/0xd4
[ 1622.829577] multi_cpu_stop+0x138/0x160
[ 1622.829584] cpu_stopper_thread+0x9c/0x118
[ 1622.829591] smpboot_thread_fn+0x1e8/0x1ec
[ 1622.829598] kthread+0x114/0x124
[ 1622.829606] ret_from_fork+0x10/0x30
[ 1622.829615] ---[ end trace 0000000000000002 ]---
[ 1623.830273] CPU3 enqueued tasks (2 total):
[ 1623.830291] pid: 31, name: migration/3
[ 1623.830440] pid: 25654, name: kworker/3:0
[ 1623.830444] task:kworker/3:0 state:R running task stack: 0 pid:25654 ppid: 2 flags:0x00000028
[ 1623.830458] Call trace:
[ 1623.830460] __switch_to+0x164/0x17c
[ 1623.830472] __schedule+0x4cc/0x5c0
[ 1623.830483] schedule+0x7c/0xcc
[ 1623.830491] schedule_preempt_disabled+0x14/0x24
[ 1623.830500] kthread+0xd8/0x124
[ 1623.830509] ret_from_fork+0x10/0x30

In the crash log, the wrongly enqueued kworker (pid: 25654, name: kworker/3:0)
is a newly created thread that has been bound to the unplugged CPU 3.

The crash dump also verifies this:
crash> task -R nr_cpus_allowed,cpus_mask 25654
PID: 25654 TASK: ffff000181ff0000 CPU: 3 COMMAND: "kworker/3:0"
nr_cpus_allowed = 1,
cpus_mask = {
bits = {8, 0}
},

crash> struct worker_pool.cpu,nr_workers,attrs 0xffff00027ee68380
cpu = 3,
nr_workers = 3,
attrs = 0xffff000180004480,
crash> struct workqueue_attrs 0xffff000180004480
struct workqueue_attrs {
nice = 0,
cpumask = {{
bits = {8, 0}
}},
no_numa = false
}

The stack trace of the offending running task was dumped via the following patch:
Link: https://lore.kernel.org/all/[email protected]/
I think this debug patch should be added to mainline; it can help us
debug this kind of problem.

To fix the crash, use cpu_active_mask when the worker pool is DISASSOCIATED.

Signed-off-by: Schspa Shi <[email protected]>
---
kernel/workqueue.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..713b07957cb6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1872,8 +1872,12 @@ static void worker_attach_to_pool(struct worker *worker,
else
kthread_set_per_cpu(worker->task, pool->cpu);

- if (worker->rescue_wq)
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ if (worker->rescue_wq) {
+ if (pool->flags & POOL_DISASSOCIATED)
+ set_cpus_allowed_ptr(worker->task, cpu_active_mask);
+ else
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ }

list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;
@@ -1952,7 +1956,18 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;

set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool->attrs->cpumask);
+
+ mutex_lock(&wq_pool_attach_mutex);
+ if ((pool->flags & POOL_DISASSOCIATED)) {
+ /* We can't call get_online_cpus() here; it would deadlock.
+ * cpu_active_mask can't change because we hold
+ * wq_pool_attach_mutex.
+ */
+ kthread_bind_mask(worker->task, cpu_active_mask);
+ } else {
+ kthread_bind_mask(worker->task, pool->attrs->cpumask);
+ }
+ mutex_unlock(&wq_pool_attach_mutex);

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
--
2.29.0


2022-07-13 02:54:20

by Oliver Sang

Subject: [workqueue] 1a0a67f5ef: phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s -17.7% regression



Greeting,

FYI, we noticed a -17.7% regression of phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s due to commit:


commit: 1a0a67f5efc83d6ca15c1e12471175f6f7db550b ("[PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED")
url: https://github.com/intel-lab-lkp/linux/commits/Schspa-Shi/workqueue-Use-active-mask-for-new-worker-when-pool-is-DISASSOCIATED/20220707-170740
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/lkml/[email protected]

in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
with following parameters:

test: fio-1.14.1
option_a: Sequential Read
option_b: IO_uring
option_c: Yes
option_d: No
option_e: 1MB
option_f: Default Test Directory
cpufreq_governor: performance
ucode: 0x500320a

test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/



If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>


Details are as below:
-------------------------------------------------------------------------------------------------->


To reproduce:

git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file

# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.

=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/option_d/option_e/option_f/rootfs/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/Sequential Read/IO_uring/Yes/No/1MB/Default Test Directory/debian-x86_64-phoronix/lkp-csl-2sp7/fio-1.14.1/phoronix-test-suite/0x500320a

commit:
8bee9dd953 ("workqueue: Switch to new kerneldoc syntax for named variable macro argument")
1a0a67f5ef ("workqueue: Use active mask for new worker when pool is DISASSOCIATED")

8bee9dd953b69c63 1a0a67f5efc83d6ca15c1e12471
---------------- ---------------------------
%stddev %change %stddev
\ | \
2891 -17.7% 2380 ? 2% phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.iops
2895 -17.7% 2384 ? 2% phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s
21667102 ? 4% +36.7% 29623577 ? 5% meminfo.DirectMap2M
10150 ? 48% +129.0% 23243 ? 20% numa-meminfo.node1.AnonHugePages
1.46 ? 17% -0.3 1.17 mpstat.cpu.all.irq%
0.11 ? 18% -0.0 0.08 ? 2% mpstat.cpu.all.soft%
385964 ? 3% +4.0% 401329 proc-vmstat.nr_inactive_file
385964 ? 3% +4.0% 401330 proc-vmstat.nr_zone_inactive_file
351.90 ? 76% -64.9% 123.66 ? 16% sched_debug.cpu.clock_task.stddev
17589 ? 47% -57.2% 7533 ? 27% sched_debug.cpu.nr_switches.stddev
5.01 ? 14% -0.7 4.32 turbostat.Busy%
125.83 ? 2% +7.2% 134.90 turbostat.PkgWatt
71.75 ? 2% -4.2% 68.75 ? 2% vmstat.cpu.id
1947025 ? 5% -13.6% 1681672 ? 2% vmstat.io.bi
20804 ? 3% -29.4% 14684 ? 2% vmstat.system.cs
9.361e+08 ? 4% -16.3% 7.833e+08 ? 4% perf-stat.i.branch-instructions
8742887 ? 35% -35.3% 5658438 ? 26% perf-stat.i.branch-misses
47.24 ? 8% +14.3 61.52 ? 4% perf-stat.i.cache-miss-rate%
1.142e+08 ? 14% -21.4% 89789700 perf-stat.i.cache-references
21068 ? 3% -29.9% 14777 ? 2% perf-stat.i.context-switches
139.10 ? 4% -29.3% 98.38 ? 3% perf-stat.i.cpu-migrations
1.211e+09 ? 4% -17.2% 1.002e+09 ? 3% perf-stat.i.dTLB-loads
6.473e+08 -14.7% 5.522e+08 ? 2% perf-stat.i.dTLB-stores
45.91 ? 6% -3.5 42.41 perf-stat.i.iTLB-load-miss-rate%
1334055 ? 8% -11.7% 1178304 ? 3% perf-stat.i.iTLB-load-misses
4.54e+09 ? 4% -16.4% 3.795e+09 ? 4% perf-stat.i.instructions
30594729 ? 3% -16.3% 25596372 ? 3% perf-stat.i.metric.M/sec
32.17 ? 18% +30.8 62.95 ? 8% perf-stat.i.node-load-miss-rate%
2033471 ? 82% +398.5% 10137149 ? 14% perf-stat.i.node-load-misses
12876365 ? 6% -42.2% 7438822 ? 12% perf-stat.i.node-loads
21.90 ? 14% +51.8 73.69 ? 7% perf-stat.i.node-store-miss-rate%
283278 ? 51% +2796.4% 8204832 ? 11% perf-stat.i.node-store-misses
13020718 ? 2% -76.0% 3128643 ? 22% perf-stat.i.node-stores
60.25 ? 9% +19.3 79.52 ? 2% perf-stat.overall.cache-miss-rate%
131.53 ? 6% -10.0% 118.37 ? 2% perf-stat.overall.cycles-between-cache-misses
46.09 ? 6% -3.5 42.58 perf-stat.overall.iTLB-load-miss-rate%
13.03 ? 79% +44.4 57.47 ? 11% perf-stat.overall.node-load-miss-rate%
2.10 ? 49% +70.2 72.27 ? 9% perf-stat.overall.node-store-miss-rate%
9.31e+08 ? 3% -16.1% 7.807e+08 ? 3% perf-stat.ps.branch-instructions
8696474 ? 35% -35.2% 5639513 ? 26% perf-stat.ps.branch-misses
1.136e+08 ? 14% -21.3% 89476682 perf-stat.ps.cache-references
20951 ? 3% -29.7% 14725 ? 2% perf-stat.ps.context-switches
138.34 ? 5% -29.1% 98.03 ? 3% perf-stat.ps.cpu-migrations
1.204e+09 ? 4% -17.0% 9.991e+08 ? 3% perf-stat.ps.dTLB-loads
6.438e+08 -14.5% 5.503e+08 ? 2% perf-stat.ps.dTLB-stores
1326996 ? 8% -11.5% 1174735 ? 3% perf-stat.ps.iTLB-load-misses
4.515e+09 ? 3% -16.2% 3.783e+09 ? 3% perf-stat.ps.instructions
2027098 ? 82% +398.4% 10104033 ? 14% perf-stat.ps.node-load-misses
12801187 ? 6% -42.1% 7411350 ? 12% perf-stat.ps.node-loads
282154 ? 51% +2798.3% 8177779 ? 11% perf-stat.ps.node-store-misses
12947790 ? 3% -75.9% 3116101 ? 22% perf-stat.ps.node-stores
19.64 ? 8% -3.4 16.20 ? 2% perf-profile.calltrace.cycles-pp.filemap_get_pages.filemap_read.io_read.io_issue_sqe.io_wq_submit_work
18.54 ? 8% -3.3 15.27 ? 2% perf-profile.calltrace.cycles-pp.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read.io_issue_sqe
4.85 ? 7% -1.0 3.89 ? 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
4.84 ? 7% -1.0 3.88 ? 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.94 ? 13% -0.8 1.15 ? 4% perf-profile.calltrace.cycles-pp.folio_alloc.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read
1.72 ? 13% -0.7 1.01 ? 4% perf-profile.calltrace.cycles-pp.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_get_pages.filemap_read
0.95 ? 5% -0.7 0.26 ?100% perf-profile.calltrace.cycles-pp.io_cqring_wait.__do_sys_io_uring_enter.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.34 ? 14% -0.6 0.78 ? 5% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_get_pages
1.48 ? 12% -0.5 0.94 ? 5% perf-profile.calltrace.cycles-pp.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded.filemap_get_pages.filemap_read
1.16 ? 15% -0.5 0.67 ? 3% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.folio_alloc.page_cache_ra_unbounded
1.05 ? 5% -0.5 0.58 ? 7% perf-profile.calltrace.cycles-pp.__do_sys_io_uring_enter.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.20 ? 13% -0.4 0.76 ? 4% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded.filemap_get_pages
0.89 ? 13% -0.3 0.55 ? 6% perf-profile.calltrace.cycles-pp.__pagevec_lru_add_fn.__pagevec_lru_add.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded
0.81 ? 6% -0.1 0.68 ? 6% perf-profile.calltrace.cycles-pp.read_pages.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read
0.77 ? 6% -0.1 0.64 ? 5% perf-profile.calltrace.cycles-pp.ext4_mpage_readpages.read_pages.page_cache_ra_unbounded.filemap_get_pages.filemap_read
0.99 ? 6% -0.1 0.86 ? 7% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work
1.52 ? 17% +0.6 2.16 ? 11% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
0.68 ? 10% +0.7 1.42 ? 10% perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
0.00 +0.9 0.86 ? 10% perf-profile.calltrace.cycles-pp.timekeeping_max_deferment.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call
0.29 ?100% +1.0 1.28 ? 11% perf-profile.calltrace.cycles-pp.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle
14.52 ? 6% +2.5 16.99 ? 7% perf-profile.calltrace.cycles-pp.do_iter_read.loop_process_work.process_one_work.worker_thread.kthread
14.02 ? 7% +2.5 16.52 ? 8% perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work
14.11 ? 7% +2.5 16.62 ? 7% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work.worker_thread
12.30 ? 7% +2.7 15.04 ? 8% perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
12.40 ? 7% +2.7 15.14 ? 8% perf-profile.calltrace.cycles-pp.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work
6.58 ? 5% +2.8 9.40 ? 8% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout.copy_page_to_iter.filemap_read.io_read
6.85 ? 5% +2.8 9.67 ? 8% perf-profile.calltrace.cycles-pp.copy_page_to_iter.filemap_read.io_read.io_issue_sqe.io_wq_submit_work
6.62 ? 5% +2.8 9.45 ? 8% perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.filemap_read.io_read.io_issue_sqe
19.64 ? 8% -3.4 16.20 ? 2% perf-profile.children.cycles-pp.filemap_get_pages
18.55 ? 8% -3.3 15.27 ? 2% perf-profile.children.cycles-pp.page_cache_ra_unbounded
3.02 ? 7% -1.1 1.94 ? 4% perf-profile.children.cycles-pp.__schedule
5.64 ? 6% -1.0 4.64 ? 4% perf-profile.children.cycles-pp.do_syscall_64
5.65 ? 6% -1.0 4.65 ? 4% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
2.26 ? 5% -0.8 1.41 ? 6% perf-profile.children.cycles-pp.schedule
1.94 ? 13% -0.8 1.15 ? 4% perf-profile.children.cycles-pp.folio_alloc
1.78 ? 13% -0.7 1.06 ? 4% perf-profile.children.cycles-pp.__alloc_pages
2.49 ? 8% -0.6 1.91 ? 5% perf-profile.children.cycles-pp.perf_tp_event
2.40 ? 8% -0.6 1.84 ? 5% perf-profile.children.cycles-pp.__perf_event_overflow
2.38 ? 8% -0.5 1.83 ? 5% perf-profile.children.cycles-pp.perf_event_output_forward
1.36 ? 14% -0.5 0.81 ? 5% perf-profile.children.cycles-pp.get_page_from_freelist
1.49 ? 12% -0.5 0.95 ? 4% perf-profile.children.cycles-pp.folio_add_lru
0.72 ? 4% -0.5 0.23 ? 13% perf-profile.children.cycles-pp.pick_next_task_fair
1.18 ? 14% -0.5 0.69 ? 4% perf-profile.children.cycles-pp.rmqueue
0.54 ? 5% -0.5 0.06 ? 63% perf-profile.children.cycles-pp.newidle_balance
2.09 ? 8% -0.5 1.60 ? 6% perf-profile.children.cycles-pp.perf_prepare_sample
0.71 ? 9% -0.5 0.23 ? 5% perf-profile.children.cycles-pp.load_balance
1.05 ? 5% -0.5 0.58 ? 7% perf-profile.children.cycles-pp.__do_sys_io_uring_enter
0.95 ? 5% -0.5 0.49 ? 7% perf-profile.children.cycles-pp.io_cqring_wait
1.99 ? 8% -0.5 1.53 ? 6% perf-profile.children.cycles-pp.get_perf_callchain
0.63 ? 8% -0.5 0.17 ? 10% perf-profile.children.cycles-pp.find_busiest_group
1.99 ? 8% -0.5 1.54 ? 6% perf-profile.children.cycles-pp.perf_callchain
0.61 ? 8% -0.5 0.16 ? 7% perf-profile.children.cycles-pp.update_sd_lb_stats
0.94 ? 5% -0.5 0.48 ? 7% perf-profile.children.cycles-pp.schedule_hrtimeout_range_clock
1.22 ? 13% -0.4 0.77 ? 5% perf-profile.children.cycles-pp.__pagevec_lru_add
0.55 ? 6% -0.4 0.11 ? 8% perf-profile.children.cycles-pp.update_sg_lb_stats
1.58 ? 14% -0.4 1.16 ? 5% perf-profile.children.cycles-pp.__softirqentry_text_start
0.93 ? 13% -0.4 0.57 ? 6% perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.76 ? 15% -0.3 0.43 ? 7% perf-profile.children.cycles-pp.rmqueue_bulk
1.18 ? 10% -0.3 0.86 ? 6% perf-profile.children.cycles-pp.perf_callchain_kernel
1.46 ? 8% -0.3 1.15 ? 4% perf-profile.children.cycles-pp.xas_load
0.77 ? 9% -0.3 0.48 ? 8% perf-profile.children.cycles-pp.try_to_wake_up
0.76 ? 13% -0.3 0.48 ? 6% perf-profile.children.cycles-pp.__mem_cgroup_charge
0.97 ? 10% -0.3 0.71 ? 7% perf-profile.children.cycles-pp.unwind_next_frame
1.22 ? 8% -0.2 0.98 ? 5% perf-profile.children.cycles-pp.update_curr
0.62 ? 10% -0.2 0.38 ? 7% perf-profile.children.cycles-pp.blk_update_request
0.62 ? 10% -0.2 0.38 ? 6% perf-profile.children.cycles-pp.blk_mq_end_request
0.88 ? 7% -0.2 0.64 ? 8% perf-profile.children.cycles-pp.perf_trace_sched_switch
0.64 ? 10% -0.2 0.41 ? 4% perf-profile.children.cycles-pp.blk_complete_reqs
0.59 ? 10% -0.2 0.36 ? 5% perf-profile.children.cycles-pp.__read_end_io
1.16 ? 9% -0.2 0.93 ? 5% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.68 ? 11% -0.2 0.49 ? 9% perf-profile.children.cycles-pp.xa_load
0.81 ? 7% -0.2 0.63 ? 8% perf-profile.children.cycles-pp.dequeue_entity
0.83 ? 7% -0.2 0.65 ? 7% perf-profile.children.cycles-pp.dequeue_task_fair
0.48 ? 16% -0.2 0.30 ? 6% perf-profile.children.cycles-pp.charge_memcg
0.51 ? 11% -0.2 0.34 ? 6% perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.25 ? 31% -0.2 0.08 ? 20% perf-profile.children.cycles-pp.rcu_core
0.25 ? 10% -0.2 0.09 ? 22% perf-profile.children.cycles-pp.ttwu_do_activate
0.62 ? 9% -0.2 0.46 ? 7% perf-profile.children.cycles-pp._raw_spin_lock
0.22 ? 9% -0.2 0.06 ? 42% perf-profile.children.cycles-pp.enqueue_entity
0.47 ? 11% -0.2 0.31 ? 5% perf-profile.children.cycles-pp.__unwind_start
0.24 ? 10% -0.2 0.09 ? 21% perf-profile.children.cycles-pp.enqueue_task_fair
1.05 ? 6% -0.1 0.91 ? 6% perf-profile.children.cycles-pp.shmem_getpage_gfp
0.81 ? 6% -0.1 0.68 ? 6% perf-profile.children.cycles-pp.read_pages
0.77 ? 6% -0.1 0.64 ? 5% perf-profile.children.cycles-pp.ext4_mpage_readpages
0.38 ? 8% -0.1 0.24 ? 8% perf-profile.children.cycles-pp.__mod_lruvec_state
0.36 ? 13% -0.1 0.25 ? 7% perf-profile.children.cycles-pp.__list_del_entry_valid
0.30 ? 11% -0.1 0.19 ? 6% perf-profile.children.cycles-pp.xa_get_order
0.22 ? 9% -0.1 0.12 ? 15% perf-profile.children.cycles-pp.__cond_resched
0.28 ? 7% -0.1 0.18 ? 8% perf-profile.children.cycles-pp.__mod_node_page_state
0.24 ? 24% -0.1 0.15 ? 8% perf-profile.children.cycles-pp.try_charge_memcg
0.31 ? 13% -0.1 0.22 ? 8% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.22 ? 13% -0.1 0.14 ? 12% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.25 ? 11% -0.1 0.18 ? 12% perf-profile.children.cycles-pp.xas_find_conflict
0.24 ? 10% -0.1 0.18 ? 8% perf-profile.children.cycles-pp.perf_output_sample
0.19 ? 17% -0.1 0.13 ? 17% perf-profile.children.cycles-pp.xas_create
0.09 ? 17% -0.1 0.03 ?104% perf-profile.children.cycles-pp.__kernel_text_address
0.14 ? 15% -0.1 0.09 ? 16% perf-profile.children.cycles-pp.folio_mapping
0.18 ? 12% -0.1 0.13 ? 13% perf-profile.children.cycles-pp.orc_find
0.14 ? 39% -0.1 0.09 ? 21% perf-profile.children.cycles-pp.rcu_idle_exit
0.10 ? 15% -0.0 0.05 ? 60% perf-profile.children.cycles-pp.kmem_cache_free
0.17 ? 12% -0.0 0.13 ? 13% perf-profile.children.cycles-pp.memcpy_erms
0.11 ? 7% -0.0 0.08 ? 16% perf-profile.children.cycles-pp.xas_descend
0.11 ? 39% +0.1 0.24 ? 16% perf-profile.children.cycles-pp.calc_global_load_tick
0.10 ? 21% +0.1 0.24 ? 16% perf-profile.children.cycles-pp.arch_cpu_idle_enter
0.10 ? 22% +0.1 0.24 ? 17% perf-profile.children.cycles-pp.tsc_verify_tsc_adjust
0.00 +0.4 0.40 ? 5% perf-profile.children.cycles-pp.do_softirq
0.00 +0.4 0.43 ? 2% perf-profile.children.cycles-pp.flush_smp_call_function_queue
0.26 ? 32% +0.6 0.86 ? 11% perf-profile.children.cycles-pp.timekeeping_max_deferment
1.56 ? 18% +0.6 2.19 ? 11% perf-profile.children.cycles-pp.menu_select
0.70 ? 11% +0.7 1.43 ? 10% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.53 ? 14% +0.8 1.30 ? 11% perf-profile.children.cycles-pp.tick_nohz_next_event
14.53 ? 6% +2.5 17.00 ? 7% perf-profile.children.cycles-pp.do_iter_read
14.04 ? 7% +2.5 16.54 ? 8% perf-profile.children.cycles-pp.shmem_file_read_iter
14.12 ? 7% +2.5 16.62 ? 7% perf-profile.children.cycles-pp.do_iter_readv_writev
12.32 ? 7% +2.8 15.08 ? 8% perf-profile.children.cycles-pp._copy_to_iter
6.68 ? 5% +2.8 9.49 ? 8% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
6.63 ? 5% +2.8 9.45 ? 8% perf-profile.children.cycles-pp.copyout
19.26 ? 3% +5.6 24.82 ? 6% perf-profile.children.cycles-pp.copy_page_to_iter
0.44 ? 8% -0.4 0.09 ? 13% perf-profile.self.cycles-pp.update_sg_lb_stats
1.10 ? 9% -0.3 0.83 ? 5% perf-profile.self.cycles-pp.xas_load
0.55 ? 17% -0.2 0.31 ? 9% perf-profile.self.cycles-pp.rmqueue_bulk
0.55 ? 14% -0.2 0.35 ? 6% perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.58 ? 12% -0.2 0.39 ? 4% perf-profile.self.cycles-pp.__filemap_add_folio
0.41 ? 13% -0.1 0.27 ? 10% perf-profile.self.cycles-pp.xas_store
0.37 ? 11% -0.1 0.24 ? 6% perf-profile.self.cycles-pp.unwind_next_frame
0.58 ? 9% -0.1 0.44 ? 10% perf-profile.self.cycles-pp.__get_user_nocheck_8
0.48 ? 9% -0.1 0.36 ? 9% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.29 ? 19% -0.1 0.17 ? 6% perf-profile.self.cycles-pp.rmqueue
0.26 ? 12% -0.1 0.15 ? 17% perf-profile.self.cycles-pp.__alloc_pages
0.40 ? 10% -0.1 0.28 ? 11% perf-profile.self.cycles-pp._raw_spin_lock
0.27 ? 6% -0.1 0.17 ? 9% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ? 10% -0.1 0.18 ? 16% perf-profile.self.cycles-pp.folio_add_lru
0.21 ? 9% -0.1 0.12 ? 15% perf-profile.self.cycles-pp.__read_end_io
0.22 ? 12% -0.1 0.14 ? 12% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.17 ? 23% -0.1 0.09 ? 12% perf-profile.self.cycles-pp.try_charge_memcg
0.40 ? 9% -0.1 0.32 ? 9% perf-profile.self.cycles-pp.find_lock_entries
0.33 ? 12% -0.1 0.26 ? 10% perf-profile.self.cycles-pp.shmem_file_read_iter
0.23 ? 11% -0.1 0.17 ? 8% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.18 ? 15% -0.1 0.11 ? 18% perf-profile.self.cycles-pp.get_page_from_freelist
0.32 ? 6% -0.1 0.26 ? 6% perf-profile.self.cycles-pp.ext4_mpage_readpages
0.13 ? 16% -0.1 0.08 ? 15% perf-profile.self.cycles-pp.folio_mapping
0.17 ? 10% -0.0 0.13 ? 13% perf-profile.self.cycles-pp.orc_find
0.10 ? 9% -0.0 0.06 ? 40% perf-profile.self.cycles-pp.xas_descend
0.12 ? 18% -0.0 0.07 ? 13% perf-profile.self.cycles-pp.charge_memcg
0.09 ? 17% -0.0 0.06 ? 41% perf-profile.self.cycles-pp.__mod_lruvec_state
0.10 ? 39% +0.1 0.24 ? 16% perf-profile.self.cycles-pp.calc_global_load_tick
0.09 ? 21% +0.1 0.24 ? 17% perf-profile.self.cycles-pp.tsc_verify_tsc_adjust
0.11 ? 24% +0.2 0.31 ? 17% perf-profile.self.cycles-pp.tick_nohz_next_event
0.26 ? 33% +0.6 0.86 ? 11% perf-profile.self.cycles-pp.timekeeping_max_deferment
12.22 ? 7% +2.7 14.93 ? 8% perf-profile.self.cycles-pp._copy_to_iter
6.62 ? 5% +2.8 9.41 ? 8% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string




Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.


--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
config-5.19.0-rc1-00007-g1a0a67f5efc8 (166.40 kB)
job-script (8.08 kB)
job.yaml (5.30 kB)
reproduce (306.00 B)

2022-07-13 07:22:18

by Schspa Shi

Subject: Re: [workqueue] 1a0a67f5ef: phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s -17.7% regression


kernel test robot <[email protected]> writes:

> Greeting,
>
> FYI, we noticed a -17.7% regression of phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s due to commit:
>

This performance drop can be fixed by converting wq_pool_attach_mutex
to an rwsem, or by adding another lock for this. I will upload a new
patch version for this.
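
A minimal sketch of that direction, assuming a hypothetical
wq_pool_attach_rwsem replacing the existing mutex (this is not the actual
follow-up patch, only an illustration):

/*
 * Readers (worker creation) would no longer serialize against each other;
 * the CPU hotplug path would take the lock for write instead.
 */
static DECLARE_RWSEM(wq_pool_attach_rwsem);

static void bind_new_worker(struct worker *worker, struct worker_pool *pool)
{
	down_read(&wq_pool_attach_rwsem);
	if (pool->flags & POOL_DISASSOCIATED)
		kthread_bind_mask(worker->task, cpu_active_mask);
	else
		kthread_bind_mask(worker->task, pool->attrs->cpumask);
	up_read(&wq_pool_attach_rwsem);
}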

--
BRs
Schspa Shi

2022-07-13 10:21:04

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED



CC Peter.
Peter has changed the CPU binding code in workqueue.c.

I don't understand the problem well enough; if kthread_bind_mask() is buggy
in workqueue.c, it would be buggy in other places too.


On 2022/7/7 17:05, Schspa Shi wrote:

>
> - if (worker->rescue_wq)
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + if (worker->rescue_wq) {
> + if (pool->flags & POOL_DISASSOCIATED)
> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> + else
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + }
>

For unbound pools (which also have POOL_DISASSOCIATED set), pool->attrs->cpumask
should be used if it contains an active CPU.


> +
> + mutex_lock(&wq_pool_attach_mutex);
> + if ((pool->flags & POOL_DISASSOCIATED)) {
> + /* We can't call get_online_cpus, there will be deadlock
> + * cpu_active_mask will no change, because we have
> + * wq_pool_attach_mutex hold.
> + **/
> + kthread_bind_mask(worker->task, cpu_active_mask);
> + } else {
> + kthread_bind_mask(worker->task, pool->attrs->cpumask);
> + }
> + mutex_unlock(&wq_pool_attach_mutex);


For unbound pools, pool->attrs->cpumask should be used if it contains an
active CPU.

wq_pool_attach_mutex is taken both here and in worker_attach_to_pool(), which
smells bad.



The change is complex. And if kthread_bind_mask() can't work as expected here,
the change I prefer would be:

diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4056f2a3f9d5..1ad8aef5fe98 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1862,6 +1862,12 @@ static void worker_attach_to_pool(struct worker *worker,
{
mutex_lock(&wq_pool_attach_mutex);

+ /*
+ * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+ * online CPUs. It'll be re-applied when any of the CPUs come up.
+ */
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+
/*
* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
* stable across this function. See the comments above the flag
@@ -1872,9 +1877,6 @@ static void worker_attach_to_pool(struct worker *worker,
else
kthread_set_per_cpu(worker->task, pool->cpu);

- if (worker->rescue_wq)
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;

@@ -1952,7 +1954,7 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;

set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool->attrs->cpumask);
+ worker->flags |= PF_NO_SETAFFINITY;

/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
@@ -4270,7 +4272,7 @@ static int init_rescuer(struct workqueue_struct *wq)
}

wq->rescuer = rescuer;
- kthread_bind_mask(rescuer->task, cpu_possible_mask);
+ rescuer->flags |= PF_NO_SETAFFINITY;
wake_up_process(rescuer->task);

return 0;


It is untested. It effectively reverts commit 640f17c82460e
("workqueue: Restrict affinity change to rescuer") and avoids using
kthread_bind_mask().


2022-07-13 12:05:43

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Lai Jiangshan <[email protected]> writes:

> CC Peter.
> Peter has changed the CPU binding code in workqueue.c.
>
> I'm not understanding the problem enough, if kthread_bind_mask() is buggy
> in workqueue.c, it would be buggy in other places too.
>

It's not a bug in using kthread_bind_mask() itself; rather, we set a bad
pool->attrs->cpumask on this kthread.

>
> On 2022/7/7 17:05, Schspa Shi wrote:
>
>> - if (worker->rescue_wq)
>> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> + if (worker->rescue_wq) {
>> + if (pool->flags & POOL_DISASSOCIATED)
>> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
>> + else
>> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> + }
>>
>
> For unbound pools (which also has POOL_DISASSOCIATED), pool->attrs->cpumask
> should be used if pool->attrs->cpumask has active cpu.
>

In this case pool->attrs->cpumask has no active CPU; the CPU for this
pool has already gone offline.

The bug occurs when the CPU has already gone through workqueue_offline_cpu()
during unplug and a new worker is created afterwards, which then ends up
running on the offline CPU.

>
>> +
>> + mutex_lock(&wq_pool_attach_mutex);
>> + if ((pool->flags & POOL_DISASSOCIATED)) {
>> + /* We can't call get_online_cpus, there will be deadlock
>> + * cpu_active_mask will no change, because we have
>> + * wq_pool_attach_mutex hold.
>> + **/
>> + kthread_bind_mask(worker->task, cpu_active_mask);
>> + } else {
>> + kthread_bind_mask(worker->task, pool->attrs->cpumask);
>> + }
>> + mutex_unlock(&wq_pool_attach_mutex);
>
>
> For unbound pools, pool->attrs->cpumask should be used if pool->attrs->cpumask
> has active cpu.
>
> wq_pool_attach_mutex is held here and in worker_attach_to_pool() which smells bad.
>

Yes, this will be changed. I have made a new patch that moves the
thread binding into worker_attach_to_pool(), via set_cpus_allowed_ptr().
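
Roughly, the direction looks like the sketch below (only an illustration,
not the posted V3 patch):

static void worker_attach_to_pool(struct worker *worker,
				  struct worker_pool *pool)
{
	mutex_lock(&wq_pool_attach_mutex);

	/*
	 * POOL_DISASSOCIATED is stable while wq_pool_attach_mutex is held,
	 * so pick a mask the scheduler will actually honour.
	 */
	if (pool->flags & POOL_DISASSOCIATED)
		set_cpus_allowed_ptr(worker->task, cpu_active_mask);
	else
		set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);

	if (pool->flags & POOL_DISASSOCIATED)
		worker->flags |= WORKER_UNBOUND;
	else
		kthread_set_per_cpu(worker->task, pool->cpu);

	list_add_tail(&worker->node, &pool->workers);
	worker->pool = pool;

	mutex_unlock(&wq_pool_attach_mutex);
}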

>
>
> The change is complex. And if kthread_bind_mask() can't work as expected here,
> the change I prefer would be:
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 4056f2a3f9d5..1ad8aef5fe98 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1862,6 +1862,12 @@ static void worker_attach_to_pool(struct worker *worker,
> {
> mutex_lock(&wq_pool_attach_mutex);
>
> + /*
> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> + * online CPUs. It'll be re-applied when any of the CPUs come up.
> + */
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +

This will succeed in this case: set_cpus_allowed_ptr() uses
cpu_online_mask to verify that the cpumask is valid, but here the CPU's
state is between online and active. So this modification doesn't solve
the problem I'm having.

> /*
> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
> * stable across this function. See the comments above the flag
> @@ -1872,9 +1877,6 @@ static void worker_attach_to_pool(struct worker *worker,
> else
> kthread_set_per_cpu(worker->task, pool->cpu);
>
> - if (worker->rescue_wq)
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> -
> list_add_tail(&worker->node, &pool->workers);
> worker->pool = pool;
>
> @@ -1952,7 +1954,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> goto fail;
>
> set_user_nice(worker->task, pool->attrs->nice);
> - kthread_bind_mask(worker->task, pool->attrs->cpumask);
> + worker->flags |= PF_NO_SETAFFINITY;
>
> /* successful, attach the worker to the pool */
> worker_attach_to_pool(worker, pool);
> @@ -4270,7 +4272,7 @@ static int init_rescuer(struct workqueue_struct *wq)
> }
>
> wq->rescuer = rescuer;
> - kthread_bind_mask(rescuer->task, cpu_possible_mask);
> + rescuer->flags |= PF_NO_SETAFFINITY;
> wake_up_process(rescuer->task);
>
> return 0;
>
>
> It is untested. It effectively reverts the commit 640f17c82460e
> ("workqueue: Restrict affinity change to rescuer").
> It avoids using kthread_bind_mask().

I will upload a new patchset that removes the extra wq_pool_attach_mutex
locking and adds a timing diagram to make the issue clearer.

--
BRs
Schspa Shi

2022-07-14 14:56:08

by Peter Zijlstra

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED

On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>
>
> CC Peter.
> Peter has changed the CPU binding code in workqueue.c.

[ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
[ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
^^^^^^^^^^^^^^^^^^^^^

I think we can ignore this as being some ancient kernel. Please try
something recent.

2022-07-14 16:17:04

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Peter Zijlstra <[email protected]> writes:

> On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>>
>>
>> CC Peter.
>> Peter has changed the CPU binding code in workqueue.c.
>
> [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> ^^^^^^^^^^^^^^^^^^^^^
>
> I think we can ignore this as being some ancient kernel. Please try
> something recent.

I have merged all the workqueue changes from
Link: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/kernel/workqueue.c?h=v5.10-rt

and the problem still reproduces.

I checked the patches in the master tree too; there seems to be no
existing patch that fixes this problem.

Or are there any scheduler or process-management related patches that
might fix it?

--
BRs
Schspa Shi

2022-07-30 04:27:32

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Peter Zijlstra <[email protected]> writes:

> On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>>
>>
>> CC Peter.
>> Peter has changed the CPU binding code in workqueue.c.
>
> [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> ^^^^^^^^^^^^^^^^^^^^^
>
> I think we can ignore this as being some ancient kernel. Please try
> something recent.

Hi Peter:

I spent a few days writing a test case and reproduced the problem on
kernel 5.19. I think it's time for us to review the V3 patch for a fix.

The V3 patch is at
https://lore.kernel.org/all/[email protected]/
Please help review it.

Test branch:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8

I think this code is new enough to demonstrate that the problem persists.

The log is as follows:

[ 3103.198684] ------------[ cut here ]------------
[ 3103.198684] Dying CPU not properly vacated!
[ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Modules linked in: work_test(O)
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Code: 00 e9 a1 c1 40 ff 48 c7 c7 48 91 94 82 e8 99 29 00 00 48 c7 c7 00 5e 53 83 e9 e3 10 50 ff 48 c7 c7 98 91 94 82 e8 4f ec ff ff <0f> 0b 44 8b ab 10 0a 00 00 8b 4b 04 48 c7 c6 cd 37 93 82 48 c7 c7
[ 3103.198684] RSP: 0000:ffffc900000dbda0 EFLAGS: 00010086
[ 3103.198684] RAX: 0000000000000000 RBX: ffff88813bcaa280 RCX: 0000000000000000
[ 3103.198684] RDX: 0000000000000003 RSI: ffffffff82953971 RDI: 00000000ffffffff
[ 3103.198684] RBP: 0000000000000082 R08: 00000000000021ed R09: ffffc900000dbd38
[ 3103.198684] R10: 0000000000000001 R11: ffffffffffffffff R12: 0000000000000060
[ 3103.198684] R13: ffffffff810a9040 R14: ffffffff82c555c0 R15: 0000000000000000
[ 3103.198684] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[ 3103.198684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3103.198684] CR2: 00007f85acd18010 CR3: 0000000102578000 CR4: 0000000000350ee0
[ 3103.198684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3103.198684] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] ? sched_cpu_wait_empty+0x60/0x60
[ 3103.198684] cpuhp_invoke_callback+0x3a4/0x5f0
[ 3103.198684] take_cpu_down+0x71/0xd0
[ 3103.198684] multi_cpu_stop+0x5c/0xf0
[ 3103.198684] ? stop_machine_yield+0x10/0x10
[ 3103.198684] cpu_stopper_thread+0x82/0x130
[ 3103.198684] smpboot_thread_fn+0x1bb/0x2b0
[ 3103.198684] ? sort_range+0x20/0x20
[ 3103.198684] kthread+0xfe/0x120
[ 3103.198684] ? kthread_complete_and_exit+0x20/0x20
[ 3103.198684] ret_from_fork+0x1f/0x30
[ 3103.198684] </TASK>
[ 3103.198684] Kernel panic - not syncing: panic_on_warn set ...
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] dump_stack_lvl+0x34/0x48
[ 3103.198684] panic+0xf8/0x299
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] __warn.cold+0x43/0xba
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] report_bug+0x9d/0xc0
[ 3103.198684] handle_bug+0x3c/0x70
[ 3103.198684] exc_invalid_op+0x14/0x70
[ 3103.198684] asm_exc_invalid_op+0x16/0x20
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3

--
BRs
Schspa Shi

2022-08-01 04:03:38

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED

On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
> >>
> >>
> >> CC Peter.
> >> Peter has changed the CPU binding code in workqueue.c.
> >
> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> > ^^^^^^^^^^^^^^^^^^^^^
> >
> > I think we can ignore this as being some ancient kernel. Please try
> > something recent.
>
> Hi peter:
>
> I spent a few days writing a test case and reproduced the problem on
> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>
> The V3 patch is at
> https://lore.kernel.org/all/[email protected]/
> Please help to review it.

Because of the implementation of is_cpu_allowed(),
I am still suspicious about how the newly created worker can be woken
up on the dying CPU, since it has no KTHREAD_IS_PER_CPU set.

Is it a dying CPU because onlining a CPU failed? I think balance_push
is not set in that case when onlining fails at some point.
But I haven't looked into the details and linked the clues.


>
> Test branch as:
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>

Can it only be triggered in the rt kernel?

> I think this code is new enough to demonstrate that the problem persists.
>
> The log as fellowing:
>
> [ 3103.198684] ------------[ cut here ]------------
> [ 3103.198684] Dying CPU not properly vacated!
> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
> [ 3103.198684] Modules linked in: work_test(O)

Could you give me the code of this module?

2022-08-01 06:01:55

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Lai Jiangshan <[email protected]> writes:

> On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >>
>> >>
>> >> CC Peter.
>> >> Peter has changed the CPU binding code in workqueue.c.
>> >
>> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> > ^^^^^^^^^^^^^^^^^^^^^
>> >
>> > I think we can ignore this as being some ancient kernel. Please try
>> > something recent.
>>
>> Hi peter:
>>
>> I spent a few days writing a test case and reproduced the problem on
>> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>>
>> The V3 patch is at
>> https://lore.kernel.org/all/[email protected]/
>> Please help to review it.
>
> Because of the implementation of is_cpu_allowed(),
> I am still suspicious about how the newly created worker can be woken
> up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>

I think the comments in the V3 patch have explained the problem: the
newly created worker runs on the dying CPU because we bind it to the
dying CPU via kthread_bind_mask() or set_cpus_allowed_ptr().

Please refer to the following scenarios.

CPU0                                    CPU1
------------------------------------------------------------------
sched_cpu_deactivate(cpu_active_mask clear)
workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
  -- all workers migrate to another CPU --
                                        worker_thread
                                          -- creates a new worker if
                                             pool->worklist is not empty
                                        create_worker()
                                          -- the new kworker is bound to CPU0
                                             (pool->attrs->cpumask is the mask of CPU0)
                                        kworker/0:x becomes runnable on CPU0's rq

sched_cpu_dying
  if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
          WARN(true, "Dying CPU not properly vacated!");
                  ---------OOPS-------------

When CPU0's hotplug state satisfies CPUHP_HRTIMERS_PREPARE < state <
CPUHP_AP_WORKQUEUE_ONLINE, work can still be scheduled on CPU0, which adds
a new work item to the worklist of the dying CPU's worker pool.

> Is it a dying cpu when onlining a CPU fails? I think balance_push
> is not set in this case when it fails at some point during online.
> But I haven't looked into the details and linked the clues.
>

Not this case; there is no "onlining CPU failed" message in my test log.

>
>>
>> Test branch as:
>> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>>
>
> Can it only be triggered in the rt kernel?
>

I have only tested on the RT kernel, because I work on the RT kernel
now, but I think non-RT kernels have the issue too.

>> I think this code is new enough to demonstrate that the problem persists.
>>
>> The log as fellowing:
>>
>> [ 3103.198684] ------------[ cut here ]------------
>> [ 3103.198684] Dying CPU not properly vacated!
>> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> [ 3103.198684] Modules linked in: work_test(O)
>
> Could you give me the code of this module?

The module's source code is at
https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
Kernel defconfig is at
https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig

I added a hook at the entry of worker_thread() which adds an extra 50 ms
delay, to make it reproduce faster.
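
For reference, here is a minimal sketch of the kind of stress module used
(the real module is at the GitHub link above; the names below are made up
for illustration). It keeps per-CPU pools busy with self-requeueing work
items while the test script offlines and onlines CPUs:

#include <linux/module.h>
#include <linux/workqueue.h>
#include <linux/delay.h>

#define NR_STRESS_WORKS 8

static struct work_struct stress_work[NR_STRESS_WORKS];
static bool stop_stress;

static void stress_fn(struct work_struct *work)
{
	msleep(50);			/* mimic the 50 ms delay hook */
	if (!READ_ONCE(stop_stress))
		schedule_work(work);	/* re-queue on the local CPU's pool */
}

static int __init stress_init(void)
{
	int i;

	for (i = 0; i < NR_STRESS_WORKS; i++) {
		INIT_WORK(&stress_work[i], stress_fn);
		schedule_work(&stress_work[i]);
	}
	return 0;
}

static void __exit stress_exit(void)
{
	int i;

	WRITE_ONCE(stop_stress, true);
	/* Simplified teardown: stop requeueing, then wait for the last run. */
	for (i = 0; i < NR_STRESS_WORKS; i++)
		cancel_work_sync(&stress_work[i]);
}

module_init(stress_init);
module_exit(stress_exit);
MODULE_LICENSE("GPL");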

--
BRs
Schspa Shi

2022-08-01 08:54:02

by Lai Jiangshan

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED

On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>
>
> Lai Jiangshan <[email protected]> writes:
>
> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
> >>
> >>
> >> Peter Zijlstra <[email protected]> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
> >> >>
> >> >>
> >> >> CC Peter.
> >> >> Peter has changed the CPU binding code in workqueue.c.
> >> >
> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> >> > ^^^^^^^^^^^^^^^^^^^^^
> >> >
> >> > I think we can ignore this as being some ancient kernel. Please try
> >> > something recent.
> >>
> >> Hi peter:
> >>
> >> I spent a few days writing a test case and reproduced the problem on
> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
> >>
> >> The V3 patch is at
> >> https://lore.kernel.org/all/[email protected]/
> >> Please help to review it.
> >
> > Because of the implementation of is_cpu_allowed(),
> > I am still suspicious about how the newly created worker can be woken
> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
> >
>
> I think the comments of the V3 patch have explained the problem, the
> newly created worker runs on the dying CPU because we bind it to the
> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>

wake_up_process() will modify the task's cpumask and select
a fallback CPU. The scheduler code does this for all kthreads whose
single-CPU affinity has been lost.
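
For readers following along, the relevant check is is_cpu_allowed() in
kernel/sched/core.c. Roughly paraphrased from the v5.19-era sources (not a
verbatim copy), it looks like this:

static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
	/* Not in the task's cpumask: never allowed. */
	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
		return false;

	/* Migration-disabled tasks only need the CPU to be online. */
	if (is_migration_disabled(p))
		return cpu_online(cpu);

	/* User tasks need an active CPU. */
	if (!(p->flags & PF_KTHREAD))
		return cpu_active(cpu);

	/* Per-CPU kthreads may stay on an online (even dying) CPU. */
	if (kthread_is_per_cpu(p))
		return cpu_online(cpu);

	/* Other kthreads are pushed off a dying CPU... */
	if (cpu_dying(cpu))
		return false;

	/* ...but may run on a CPU that is merely online. */
	return cpu_online(cpu);
}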

> Please refer to the following scenarios.
>
> CPU0 CPU1
> ------------------------------------------------------------------
> sched_cpu_deactivate(cpu_active_mask clear)
> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
> -- all worker will migrate to another cpu --
> worker_thread
> -- will create new worker if
> pool->worklist is not empty
> create_worker()
> -- new kworker will bound to CPU0
> (pool->attrs->cpumask will be mask of CPU0).
> kworker/0:x will running on rq
>
> sched_cpu_dying
> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
> WARN(true, "Dying CPU not properly vacated!");
> ---------OOPS-------------
>
> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
> the CPU0 still can schedule a new work, which will add a new work to
> pool->worklist of a dying CPU's work pool.
>
> > Is it a dying cpu when onlining a CPU fails? I think balance_push
> > is not set in this case when it fails at some point during online.
> > But I haven't looked into the details and linked the clues.
> >
>
> Not this case, there is on onlining CPU fails log in my test log.
>
> >
> >>
> >> Test branch as:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
> >>
> >
> > Can it only be triggered in the rt kernel?
> >
>
> I only test on rt kernel, because of I work on RT kernel now, But I
> think non-rt kernel have the issues too.
>
> >> I think this code is new enough to demonstrate that the problem persists.
> >>
> >> The log as fellowing:
> >>
> >> [ 3103.198684] ------------[ cut here ]------------
> >> [ 3103.198684] Dying CPU not properly vacated!
> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
> >> [ 3103.198684] Modules linked in: work_test(O)
> >
> > Could you give me the code of this module?
>
> The module's source code is at
> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
> Kernel defconfig is at
> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>

How long does it take to reproduce the bug?
I have tested for a short while on the non-RT kernel, and nothing
has happened yet.

> I add a hook to the entry of worker_thread, which add extra 50ms delay
> to make it reproduced faster.
>
> --
> BRs
> Schspa Shi

2022-08-01 10:08:07

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Lai Jiangshan <[email protected]> writes:

> On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Lai Jiangshan <[email protected]> writes:
>>
>> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>> >>
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >> >>
>> >> >>
>> >> >> CC Peter.
>> >> >> Peter has changed the CPU binding code in workqueue.c.
>> >> >
>> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> >> > ^^^^^^^^^^^^^^^^^^^^^
>> >> >
>> >> > I think we can ignore this as being some ancient kernel. Please try
>> >> > something recent.
>> >>
>> >> Hi peter:
>> >>
>> >> I spent a few days writing a test case and reproduced the problem on
>> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>> >>
>> >> The V3 patch is at
>> >> https://lore.kernel.org/all/[email protected]/
>> >> Please help to review it.
>> >
>> > Because of the implementation of is_cpu_allowed(),
>> > I am still suspicious about how the newly created worker can be woken
>> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>> >
>>
>> I think the comments of the V3 patch have explained the problem, the
>> newly created worker runs on the dying CPU because we bind it to the
>> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>>
>
> wake_up_process() will modify the task's cpumask and select
> a fallback cpu. The schedule code does it for all kthreads that
> happen to have a single CPU affinity which is lost.
>

I'm not very familiar with this part of the code; I'll check the code
here first.

>> Please refer to the following scenarios.
>>
>> CPU0 CPU1
>> ------------------------------------------------------------------
>> sched_cpu_deactivate(cpu_active_mask clear)
>> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
>> -- all worker will migrate to another cpu --
>> worker_thread
>> -- will create new worker if
>> pool->worklist is not empty
>> create_worker()
>> -- new kworker will bound to CPU0
>> (pool->attrs->cpumask will be mask of CPU0).
>> kworker/0:x will running on rq
>>
>> sched_cpu_dying
>> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
>> WARN(true, "Dying CPU not properly vacated!");
>> ---------OOPS-------------
>>
>> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
>> the CPU0 still can schedule a new work, which will add a new work to
>> pool->worklist of a dying CPU's work pool.
>>
>> > Is it a dying cpu when onlining a CPU fails? I think balance_push
>> > is not set in this case when it fails at some point during online.
>> > But I haven't looked into the details and linked the clues.
>> >
>>
>> Not this case, there is on onlining CPU fails log in my test log.
>>
>> >
>> >>
>> >> Test branch as:
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>> >>
>> >
>> > Can it only be triggered in the rt kernel?
>> >
>>
>> I only test on rt kernel, because of I work on RT kernel now, But I
>> think non-rt kernel have the issues too.
>>
>> >> I think this code is new enough to demonstrate that the problem persists.
>> >>
>> >> The log as fellowing:
>> >>
>> >> [ 3103.198684] ------------[ cut here ]------------
>> >> [ 3103.198684] Dying CPU not properly vacated!
>> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> >> [ 3103.198684] Modules linked in: work_test(O)
>> >
>> > Could you give me the code of this module?
>>
>> The module's source code is at
>> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
>> Kernel defconfig is at
>> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>>
>
> How long will it take to reproduce the bug?
> I have tested for a short time on the non-rt kernel, no thing
> has happened yet.
>

Test host machine:
a 4-core virtual machine with nested KVM support and 16 GB of memory.

Test VMs:
3 * 4-core KVM virtual machines.

With 3 test instances, it reproduces within one day.

It needs some stress to reproduce; you can use my script to compile &
launch the test automatically:

https://github.com/schspa/ktest/blob/main/test-lnx.sh
~/work/src/ktest/test-lnx.sh ~/src/linux linux-rt-devel/linux-5.18.y-rt ~/test/linux-5.18.y-rt wq-test

>> I add a hook to the entry of worker_thread, which add extra 50ms delay
>> to make it reproduced faster.
>>
>> --
>> BRs
>> Schspa Shi


--
BRs
Schspa Shi

2022-08-07 14:53:26

by Schspa Shi

Subject: Re: [PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED


Lai Jiangshan <[email protected]> writes:

> On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Lai Jiangshan <[email protected]> writes:
>>
>> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>> >>
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >> >>
>> >> >>
>> >> >> CC Peter.
>> >> >> Peter has changed the CPU binding code in workqueue.c.
>> >> >
>> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> >> > ^^^^^^^^^^^^^^^^^^^^^
>> >> >
>> >> > I think we can ignore this as being some ancient kernel. Please try
>> >> > something recent.
>> >>
>> >> Hi peter:
>> >>
>> >> I spent a few days writing a test case and reproduced the problem on
>> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>> >>
>> >> The V3 patch is at
>> >> https://lore.kernel.org/all/[email protected]/
>> >> Please help to review it.
>> >
>> > Because of the implementation of is_cpu_allowed(),
>> > I am still suspicious about how the newly created worker can be woken
>> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>> >
>>
>> I think the comments of the V3 patch have explained the problem, the
>> newly created worker runs on the dying CPU because we bind it to the
>> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>>
>
> wake_up_process() will modify the task's cpumask and select
> a fallback cpu. The schedule code does it for all kthreads that
> happen to have a single CPU affinity which is lost.
>

There are two exceptions I can think of here:
1. For tasks that are prohibited from migrating, the check uses
cpu_online_mask and will not select a fallback CPU.
2. When the scheduler code checks cpu_dying, there is no synchronization;
at that point cpu_dying_mask is an unstable variable, so the scheduler
does not guarantee that the migration above will actually happen.

Under PREEMPT_RT, spin_lock disables migration, which also makes this
problem easy to reproduce on the RT kernel (see the sketch below).
Given the above limitations, I think we should not rely on the
scheduler's fallback behavior, but handle this when creating the
thread.
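
To make the PREEMPT_RT point concrete: on RT, taking a spinlock pins the
task to its current CPU. Roughly paraphrased from the RT locking code (not
a verbatim copy):

/* PREEMPT_RT spinlock acquisition, paraphrased: the lock is a sleeping
 * lock, and acquiring it disables migration, so a kworker holding one
 * cannot be pushed off the dying CPU by the fallback logic above. */
static __always_inline void __rt_spin_lock(spinlock_t *lock)
{
	rtlock_lock(&lock->lock);	/* acquire the underlying rtmutex-based lock */
	rcu_read_lock();
	migrate_disable();		/* the task is now pinned to this CPU */
}

Combined with is_cpu_allowed() only requiring cpu_online() for
migration-disabled tasks, such a worker is allowed to keep running on a
CPU that is already dying.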

>> Please refer to the following scenarios.
>>
>> CPU0 CPU1
>> ------------------------------------------------------------------
>> sched_cpu_deactivate(cpu_active_mask clear)
>> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
>> -- all worker will migrate to another cpu --
>> worker_thread
>> -- will create new worker if
>> pool->worklist is not empty
>> create_worker()
>> -- new kworker will bound to CPU0
>> (pool->attrs->cpumask will be mask of CPU0).
>> kworker/0:x will running on rq
>>
>> sched_cpu_dying
>> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
>> WARN(true, "Dying CPU not properly vacated!");
>> ---------OOPS-------------
>>
>> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
>> the CPU0 still can schedule a new work, which will add a new work to
>> pool->worklist of a dying CPU's work pool.
>>
>> > Is it a dying cpu when onlining a CPU fails? I think balance_push
>> > is not set in this case when it fails at some point during online.
>> > But I haven't looked into the details and linked the clues.
>> >
>>
>> Not this case, there is on onlining CPU fails log in my test log.
>>
>> >
>> >>
>> >> Test branch as:
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>> >>
>> >
>> > Can it only be triggered in the rt kernel?
>> >
>>
>> I only test on rt kernel, because of I work on RT kernel now, But I
>> think non-rt kernel have the issues too.
>>
>> >> I think this code is new enough to demonstrate that the problem persists.
>> >>
>> >> The log as fellowing:
>> >>
>> >> [ 3103.198684] ------------[ cut here ]------------
>> >> [ 3103.198684] Dying CPU not properly vacated!
>> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> >> [ 3103.198684] Modules linked in: work_test(O)
>> >
>> > Could you give me the code of this module?
>>
>> The module's source code is at
>> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
>> Kernel defconfig is at
>> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>>
>
> How long will it take to reproduce the bug?
> I have tested for a short time on the non-rt kernel, no thing
> has happened yet.
>
>> I add a hook to the entry of worker_thread, which add extra 50ms delay
>> to make it reproduced faster.
>>
>> --
>> BRs
>> Schspa Shi


--
BRs
Schspa Shi