When a CPU is unplugged, all workers are rebound to the active CPUs via
unbind_workers().
However, an unbound worker can still create a new worker, and the newly
created task gets bound to pool->attrs->cpumask, i.e. to the CPU that has
just been unplugged. The following crash then occurs.
The crash log is as follows:
[ 1622.829074] ------------[ cut here ]------------
[ 1622.829081] Dying CPU not properly vacated!
[ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
[ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
[ 1622.829386] Stopper: multi_cpu_stop+0x0/0x160 <- 0x0
[ 1622.829400] pstate: 60c00009 (nZCv daif +PAN +UAO -TCO BTYPE=--)
[ 1622.829408] pc : sched_cpu_dying+0x74/0x204
[ 1622.829415] lr : sched_cpu_dying+0x74/0x204
[ 1622.829421] sp : ffff800012933c70
[ 1622.829424] pmr_save: 00000060
[ 1622.829426] x29: ffff800012933c70 x28: 0000000000000000
[ 1622.829435] x27: 0000000000000000 x26: 0000000000000001
[ 1622.829444] x25: 0000000000000000 x24: ffff800018353c2c
[ 1622.829452] x23: 0000000000000003 x22: 0000000000000003
[ 1622.829460] x21: 0000000000000059 x20: 0000000000000000
[ 1622.829468] x19: ffff00027ee68a80 x18: 0000000000000000
[ 1622.829477] x17: 0000000000000000 x16: 0000000000000000
[ 1622.829485] x15: ffffffffffffffff x14: ffff80001169ae30
[ 1622.829493] x13: ffffffffffc38c07 x12: ffffffffffffffff
[ 1622.829501] x11: ffffffffffe00000 x10: ffff80001169ae58
[ 1622.829510] x9 : 000000000000001e x8 : ffff80001169ae30
[ 1622.829518] x7 : ffff800012933ab0 x6 : 00000000ffff0e20
[ 1622.829526] x5 : ffff00027ee62988 x4 : 00000000ffff0e20
[ 1622.829535] x3 : ffff800011e30180 x2 : 0000000100000002
[ 1622.829543] x1 : 0000000000000000 x0 : 0000000000000000
[ 1622.829552] Call trace:
[ 1622.829555] sched_cpu_dying+0x74/0x204
[ 1622.829562] cpuhp_invoke_callback+0xc0/0x1b0
[ 1622.829571] take_cpu_down+0xbc/0xd4
[ 1622.829577] multi_cpu_stop+0x138/0x160
[ 1622.829584] cpu_stopper_thread+0x9c/0x118
[ 1622.829591] smpboot_thread_fn+0x1e8/0x1ec
[ 1622.829598] kthread+0x114/0x124
[ 1622.829606] ret_from_fork+0x10/0x30
[ 1622.829615] ---[ end trace 0000000000000002 ]---
[ 1623.830273] CPU3 enqueued tasks (2 total):
[ 1623.830291] pid: 31, name: migration/3
[ 1623.830440] pid: 25654, name: kworker/3:0
[ 1623.830444] task:kworker/3:0 state:R running task stack: 0 pid:25654 ppid: 2 flags:0x00000028
[ 1623.830458] Call trace:
[ 1623.830460] __switch_to+0x164/0x17c
[ 1623.830472] __schedule+0x4cc/0x5c0
[ 1623.830483] schedule+0x7c/0xcc
[ 1623.830491] schedule_preempt_disabled+0x14/0x24
[ 1623.830500] kthread+0xd8/0x124
[ 1623.830509] ret_from_fork+0x10/0x30
In the crash log, the wrongly enqueued kworker (pid: 25654, name: kworker/3:0)
is a newly created thread that has been bound to the unplugged CPU 3.
The crash dump also verifies this:
crash> task -R nr_cpus_allowed,cpus_mask 25654
PID: 25654 TASK: ffff000181ff0000 CPU: 3 COMMAND: "kworker/3:0"
nr_cpus_allowed = 1,
cpus_mask = {
bits = {8, 0}
},
crash> struct worker_pool.cpu,nr_workers,attrs 0xffff00027ee68380
cpu = 3,
nr_workers = 3,
attrs = 0xffff000180004480,
crash> struct workqueue_attrs 0xffff000180004480
struct workqueue_attrs {
nice = 0,
cpumask = {{
bits = {8, 0}
}},
no_numa = false
}
The stack trace of the misplaced running task was dumped via the following patch:
Link: https://lore.kernel.org/all/[email protected]/
I think this debug patch should also be added to mainline; it can help us
debug this kind of problem.
To fix the crash, we can use cpu_active_mask when the worker pool is
DISASSOCIATED.
Signed-off-by: Schspa Shi <[email protected]>
---
kernel/workqueue.c | 21 ++++++++++++++++++---
1 file changed, 18 insertions(+), 3 deletions(-)
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 1ea50f6be843..713b07957cb6 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1872,8 +1872,12 @@ static void worker_attach_to_pool(struct worker *worker,
else
kthread_set_per_cpu(worker->task, pool->cpu);
- if (worker->rescue_wq)
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ if (worker->rescue_wq) {
+ if (pool->flags & POOL_DISASSOCIATED)
+ set_cpus_allowed_ptr(worker->task, cpu_active_mask);
+ else
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+ }
list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;
@@ -1952,7 +1956,18 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;
set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool->attrs->cpumask);
+
+ mutex_lock(&wq_pool_attach_mutex);
+ if ((pool->flags & POOL_DISASSOCIATED)) {
+ /* We can't call get_online_cpus, there will be deadlock
+ * cpu_active_mask will no change, because we have
+ * wq_pool_attach_mutex hold.
+ **/
+ kthread_bind_mask(worker->task, cpu_active_mask);
+ } else {
+ kthread_bind_mask(worker->task, pool->attrs->cpumask);
+ }
+ mutex_unlock(&wq_pool_attach_mutex);
/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
--
2.29.0
Greeting,
FYI, we noticed a -17.7% regression of phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s due to commit:
commit: 1a0a67f5efc83d6ca15c1e12471175f6f7db550b ("[PATCH] workqueue: Use active mask for new worker when pool is DISASSOCIATED")
url: https://github.com/intel-lab-lkp/linux/commits/Schspa-Shi/workqueue-Use-active-mask-for-new-worker-when-pool-is-DISASSOCIATED/20220707-170740
base: https://git.kernel.org/cgit/linux/kernel/git/tj/wq.git for-next
patch link: https://lore.kernel.org/lkml/[email protected]
in testcase: phoronix-test-suite
on test machine: 96 threads 2 sockets Intel(R) Xeon(R) Gold 6252 CPU @ 2.10GHz with 512G memory
with following parameters:
test: fio-1.14.1
option_a: Sequential Read
option_b: IO_uring
option_c: Yes
option_d: No
option_e: 1MB
option_f: Default Test Directory
cpufreq_governor: performance
ucode: 0x500320a
test-description: The Phoronix Test Suite is the most comprehensive testing and benchmarking platform available that provides an extensible framework for which new tests can be easily added.
test-url: http://www.phoronix-test-suite.com/
If you fix the issue, kindly add following tag
Reported-by: kernel test robot <[email protected]>
Details are as below:
-------------------------------------------------------------------------------------------------->
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
sudo bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
sudo bin/lkp run generated-yaml-file
# if come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.
=========================================================================================
compiler/cpufreq_governor/kconfig/option_a/option_b/option_c/option_d/option_e/option_f/rootfs/tbox_group/test/testcase/ucode:
gcc-11/performance/x86_64-rhel-8.3/Sequential Read/IO_uring/Yes/No/1MB/Default Test Directory/debian-x86_64-phoronix/lkp-csl-2sp7/fio-1.14.1/phoronix-test-suite/0x500320a
commit:
8bee9dd953 ("workqueue: Switch to new kerneldoc syntax for named variable macro argument")
1a0a67f5ef ("workqueue: Use active mask for new worker when pool is DISASSOCIATED")
8bee9dd953b69c63 1a0a67f5efc83d6ca15c1e12471
---------------- ---------------------------
%stddev %change %stddev
\ | \
2891 -17.7% 2380 ? 2% phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.iops
2895 -17.7% 2384 ? 2% phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s
21667102 ? 4% +36.7% 29623577 ? 5% meminfo.DirectMap2M
10150 ? 48% +129.0% 23243 ? 20% numa-meminfo.node1.AnonHugePages
1.46 ? 17% -0.3 1.17 mpstat.cpu.all.irq%
0.11 ? 18% -0.0 0.08 ? 2% mpstat.cpu.all.soft%
385964 ? 3% +4.0% 401329 proc-vmstat.nr_inactive_file
385964 ? 3% +4.0% 401330 proc-vmstat.nr_zone_inactive_file
351.90 ? 76% -64.9% 123.66 ? 16% sched_debug.cpu.clock_task.stddev
17589 ? 47% -57.2% 7533 ? 27% sched_debug.cpu.nr_switches.stddev
5.01 ? 14% -0.7 4.32 turbostat.Busy%
125.83 ? 2% +7.2% 134.90 turbostat.PkgWatt
71.75 ? 2% -4.2% 68.75 ? 2% vmstat.cpu.id
1947025 ? 5% -13.6% 1681672 ? 2% vmstat.io.bi
20804 ? 3% -29.4% 14684 ? 2% vmstat.system.cs
9.361e+08 ? 4% -16.3% 7.833e+08 ? 4% perf-stat.i.branch-instructions
8742887 ? 35% -35.3% 5658438 ? 26% perf-stat.i.branch-misses
47.24 ? 8% +14.3 61.52 ? 4% perf-stat.i.cache-miss-rate%
1.142e+08 ? 14% -21.4% 89789700 perf-stat.i.cache-references
21068 ? 3% -29.9% 14777 ? 2% perf-stat.i.context-switches
139.10 ? 4% -29.3% 98.38 ? 3% perf-stat.i.cpu-migrations
1.211e+09 ? 4% -17.2% 1.002e+09 ? 3% perf-stat.i.dTLB-loads
6.473e+08 -14.7% 5.522e+08 ? 2% perf-stat.i.dTLB-stores
45.91 ? 6% -3.5 42.41 perf-stat.i.iTLB-load-miss-rate%
1334055 ? 8% -11.7% 1178304 ? 3% perf-stat.i.iTLB-load-misses
4.54e+09 ? 4% -16.4% 3.795e+09 ? 4% perf-stat.i.instructions
30594729 ? 3% -16.3% 25596372 ? 3% perf-stat.i.metric.M/sec
32.17 ? 18% +30.8 62.95 ? 8% perf-stat.i.node-load-miss-rate%
2033471 ? 82% +398.5% 10137149 ? 14% perf-stat.i.node-load-misses
12876365 ? 6% -42.2% 7438822 ? 12% perf-stat.i.node-loads
21.90 ? 14% +51.8 73.69 ? 7% perf-stat.i.node-store-miss-rate%
283278 ? 51% +2796.4% 8204832 ? 11% perf-stat.i.node-store-misses
13020718 ? 2% -76.0% 3128643 ? 22% perf-stat.i.node-stores
60.25 ? 9% +19.3 79.52 ? 2% perf-stat.overall.cache-miss-rate%
131.53 ? 6% -10.0% 118.37 ? 2% perf-stat.overall.cycles-between-cache-misses
46.09 ? 6% -3.5 42.58 perf-stat.overall.iTLB-load-miss-rate%
13.03 ? 79% +44.4 57.47 ? 11% perf-stat.overall.node-load-miss-rate%
2.10 ? 49% +70.2 72.27 ? 9% perf-stat.overall.node-store-miss-rate%
9.31e+08 ? 3% -16.1% 7.807e+08 ? 3% perf-stat.ps.branch-instructions
8696474 ? 35% -35.2% 5639513 ? 26% perf-stat.ps.branch-misses
1.136e+08 ? 14% -21.3% 89476682 perf-stat.ps.cache-references
20951 ? 3% -29.7% 14725 ? 2% perf-stat.ps.context-switches
138.34 ? 5% -29.1% 98.03 ? 3% perf-stat.ps.cpu-migrations
1.204e+09 ? 4% -17.0% 9.991e+08 ? 3% perf-stat.ps.dTLB-loads
6.438e+08 -14.5% 5.503e+08 ? 2% perf-stat.ps.dTLB-stores
1326996 ? 8% -11.5% 1174735 ? 3% perf-stat.ps.iTLB-load-misses
4.515e+09 ? 3% -16.2% 3.783e+09 ? 3% perf-stat.ps.instructions
2027098 ? 82% +398.4% 10104033 ? 14% perf-stat.ps.node-load-misses
12801187 ? 6% -42.1% 7411350 ? 12% perf-stat.ps.node-loads
282154 ? 51% +2798.3% 8177779 ? 11% perf-stat.ps.node-store-misses
12947790 ? 3% -75.9% 3116101 ? 22% perf-stat.ps.node-stores
19.64 ? 8% -3.4 16.20 ? 2% perf-profile.calltrace.cycles-pp.filemap_get_pages.filemap_read.io_read.io_issue_sqe.io_wq_submit_work
18.54 ? 8% -3.3 15.27 ? 2% perf-profile.calltrace.cycles-pp.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read.io_issue_sqe
4.85 ? 7% -1.0 3.89 ? 4% perf-profile.calltrace.cycles-pp.entry_SYSCALL_64_after_hwframe
4.84 ? 7% -1.0 3.88 ? 4% perf-profile.calltrace.cycles-pp.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.94 ? 13% -0.8 1.15 ? 4% perf-profile.calltrace.cycles-pp.folio_alloc.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read
1.72 ? 13% -0.7 1.01 ? 4% perf-profile.calltrace.cycles-pp.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_get_pages.filemap_read
0.95 ? 5% -0.7 0.26 ?100% perf-profile.calltrace.cycles-pp.io_cqring_wait.__do_sys_io_uring_enter.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.34 ? 14% -0.6 0.78 ? 5% perf-profile.calltrace.cycles-pp.get_page_from_freelist.__alloc_pages.folio_alloc.page_cache_ra_unbounded.filemap_get_pages
1.48 ? 12% -0.5 0.94 ? 5% perf-profile.calltrace.cycles-pp.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded.filemap_get_pages.filemap_read
1.16 ? 15% -0.5 0.67 ? 3% perf-profile.calltrace.cycles-pp.rmqueue.get_page_from_freelist.__alloc_pages.folio_alloc.page_cache_ra_unbounded
1.05 ? 5% -0.5 0.58 ? 7% perf-profile.calltrace.cycles-pp.__do_sys_io_uring_enter.do_syscall_64.entry_SYSCALL_64_after_hwframe
1.20 ? 13% -0.4 0.76 ? 4% perf-profile.calltrace.cycles-pp.__pagevec_lru_add.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded.filemap_get_pages
0.89 ? 13% -0.3 0.55 ? 6% perf-profile.calltrace.cycles-pp.__pagevec_lru_add_fn.__pagevec_lru_add.folio_add_lru.filemap_add_folio.page_cache_ra_unbounded
0.81 ? 6% -0.1 0.68 ? 6% perf-profile.calltrace.cycles-pp.read_pages.page_cache_ra_unbounded.filemap_get_pages.filemap_read.io_read
0.77 ? 6% -0.1 0.64 ? 5% perf-profile.calltrace.cycles-pp.ext4_mpage_readpages.read_pages.page_cache_ra_unbounded.filemap_get_pages.filemap_read
0.99 ? 6% -0.1 0.86 ? 7% perf-profile.calltrace.cycles-pp.shmem_getpage_gfp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work
1.52 ? 17% +0.6 2.16 ? 11% perf-profile.calltrace.cycles-pp.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry.start_secondary
0.68 ? 10% +0.7 1.42 ? 10% perf-profile.calltrace.cycles-pp.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle.cpu_startup_entry
0.00 +0.9 0.86 ? 10% perf-profile.calltrace.cycles-pp.timekeeping_max_deferment.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call
0.29 ?100% +1.0 1.28 ? 11% perf-profile.calltrace.cycles-pp.tick_nohz_next_event.tick_nohz_get_sleep_length.menu_select.cpuidle_idle_call.do_idle
14.52 ? 6% +2.5 16.99 ? 7% perf-profile.calltrace.cycles-pp.do_iter_read.loop_process_work.process_one_work.worker_thread.kthread
14.02 ? 7% +2.5 16.52 ? 8% perf-profile.calltrace.cycles-pp.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work
14.11 ? 7% +2.5 16.62 ? 7% perf-profile.calltrace.cycles-pp.do_iter_readv_writev.do_iter_read.loop_process_work.process_one_work.worker_thread
12.30 ? 7% +2.7 15.04 ? 8% perf-profile.calltrace.cycles-pp._copy_to_iter.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read
12.40 ? 7% +2.7 15.14 ? 8% perf-profile.calltrace.cycles-pp.copy_page_to_iter.shmem_file_read_iter.do_iter_readv_writev.do_iter_read.loop_process_work
6.58 ? 5% +2.8 9.40 ? 8% perf-profile.calltrace.cycles-pp.copy_user_enhanced_fast_string.copyout.copy_page_to_iter.filemap_read.io_read
6.85 ? 5% +2.8 9.67 ? 8% perf-profile.calltrace.cycles-pp.copy_page_to_iter.filemap_read.io_read.io_issue_sqe.io_wq_submit_work
6.62 ? 5% +2.8 9.45 ? 8% perf-profile.calltrace.cycles-pp.copyout.copy_page_to_iter.filemap_read.io_read.io_issue_sqe
19.64 ? 8% -3.4 16.20 ? 2% perf-profile.children.cycles-pp.filemap_get_pages
18.55 ? 8% -3.3 15.27 ? 2% perf-profile.children.cycles-pp.page_cache_ra_unbounded
3.02 ? 7% -1.1 1.94 ? 4% perf-profile.children.cycles-pp.__schedule
5.64 ? 6% -1.0 4.64 ? 4% perf-profile.children.cycles-pp.do_syscall_64
5.65 ? 6% -1.0 4.65 ? 4% perf-profile.children.cycles-pp.entry_SYSCALL_64_after_hwframe
2.26 ? 5% -0.8 1.41 ? 6% perf-profile.children.cycles-pp.schedule
1.94 ? 13% -0.8 1.15 ? 4% perf-profile.children.cycles-pp.folio_alloc
1.78 ? 13% -0.7 1.06 ? 4% perf-profile.children.cycles-pp.__alloc_pages
2.49 ? 8% -0.6 1.91 ? 5% perf-profile.children.cycles-pp.perf_tp_event
2.40 ? 8% -0.6 1.84 ? 5% perf-profile.children.cycles-pp.__perf_event_overflow
2.38 ? 8% -0.5 1.83 ? 5% perf-profile.children.cycles-pp.perf_event_output_forward
1.36 ? 14% -0.5 0.81 ? 5% perf-profile.children.cycles-pp.get_page_from_freelist
1.49 ? 12% -0.5 0.95 ? 4% perf-profile.children.cycles-pp.folio_add_lru
0.72 ? 4% -0.5 0.23 ? 13% perf-profile.children.cycles-pp.pick_next_task_fair
1.18 ? 14% -0.5 0.69 ? 4% perf-profile.children.cycles-pp.rmqueue
0.54 ? 5% -0.5 0.06 ? 63% perf-profile.children.cycles-pp.newidle_balance
2.09 ? 8% -0.5 1.60 ? 6% perf-profile.children.cycles-pp.perf_prepare_sample
0.71 ? 9% -0.5 0.23 ? 5% perf-profile.children.cycles-pp.load_balance
1.05 ? 5% -0.5 0.58 ? 7% perf-profile.children.cycles-pp.__do_sys_io_uring_enter
0.95 ? 5% -0.5 0.49 ? 7% perf-profile.children.cycles-pp.io_cqring_wait
1.99 ? 8% -0.5 1.53 ? 6% perf-profile.children.cycles-pp.get_perf_callchain
0.63 ? 8% -0.5 0.17 ? 10% perf-profile.children.cycles-pp.find_busiest_group
1.99 ? 8% -0.5 1.54 ? 6% perf-profile.children.cycles-pp.perf_callchain
0.61 ? 8% -0.5 0.16 ? 7% perf-profile.children.cycles-pp.update_sd_lb_stats
0.94 ? 5% -0.5 0.48 ? 7% perf-profile.children.cycles-pp.schedule_hrtimeout_range_clock
1.22 ? 13% -0.4 0.77 ? 5% perf-profile.children.cycles-pp.__pagevec_lru_add
0.55 ? 6% -0.4 0.11 ? 8% perf-profile.children.cycles-pp.update_sg_lb_stats
1.58 ? 14% -0.4 1.16 ? 5% perf-profile.children.cycles-pp.__softirqentry_text_start
0.93 ? 13% -0.4 0.57 ? 6% perf-profile.children.cycles-pp.__pagevec_lru_add_fn
0.76 ? 15% -0.3 0.43 ? 7% perf-profile.children.cycles-pp.rmqueue_bulk
1.18 ? 10% -0.3 0.86 ? 6% perf-profile.children.cycles-pp.perf_callchain_kernel
1.46 ? 8% -0.3 1.15 ? 4% perf-profile.children.cycles-pp.xas_load
0.77 ? 9% -0.3 0.48 ? 8% perf-profile.children.cycles-pp.try_to_wake_up
0.76 ? 13% -0.3 0.48 ? 6% perf-profile.children.cycles-pp.__mem_cgroup_charge
0.97 ? 10% -0.3 0.71 ? 7% perf-profile.children.cycles-pp.unwind_next_frame
1.22 ? 8% -0.2 0.98 ? 5% perf-profile.children.cycles-pp.update_curr
0.62 ? 10% -0.2 0.38 ? 7% perf-profile.children.cycles-pp.blk_update_request
0.62 ? 10% -0.2 0.38 ? 6% perf-profile.children.cycles-pp.blk_mq_end_request
0.88 ? 7% -0.2 0.64 ? 8% perf-profile.children.cycles-pp.perf_trace_sched_switch
0.64 ? 10% -0.2 0.41 ? 4% perf-profile.children.cycles-pp.blk_complete_reqs
0.59 ? 10% -0.2 0.36 ? 5% perf-profile.children.cycles-pp.__read_end_io
1.16 ? 9% -0.2 0.93 ? 5% perf-profile.children.cycles-pp.perf_trace_sched_stat_runtime
0.68 ? 11% -0.2 0.49 ? 9% perf-profile.children.cycles-pp.xa_load
0.81 ? 7% -0.2 0.63 ? 8% perf-profile.children.cycles-pp.dequeue_entity
0.83 ? 7% -0.2 0.65 ? 7% perf-profile.children.cycles-pp.dequeue_task_fair
0.48 ? 16% -0.2 0.30 ? 6% perf-profile.children.cycles-pp.charge_memcg
0.51 ? 11% -0.2 0.34 ? 6% perf-profile.children.cycles-pp.__mod_lruvec_page_state
0.25 ? 31% -0.2 0.08 ? 20% perf-profile.children.cycles-pp.rcu_core
0.25 ? 10% -0.2 0.09 ? 22% perf-profile.children.cycles-pp.ttwu_do_activate
0.62 ? 9% -0.2 0.46 ? 7% perf-profile.children.cycles-pp._raw_spin_lock
0.22 ? 9% -0.2 0.06 ? 42% perf-profile.children.cycles-pp.enqueue_entity
0.47 ? 11% -0.2 0.31 ? 5% perf-profile.children.cycles-pp.__unwind_start
0.24 ? 10% -0.2 0.09 ? 21% perf-profile.children.cycles-pp.enqueue_task_fair
1.05 ? 6% -0.1 0.91 ? 6% perf-profile.children.cycles-pp.shmem_getpage_gfp
0.81 ? 6% -0.1 0.68 ? 6% perf-profile.children.cycles-pp.read_pages
0.77 ? 6% -0.1 0.64 ? 5% perf-profile.children.cycles-pp.ext4_mpage_readpages
0.38 ? 8% -0.1 0.24 ? 8% perf-profile.children.cycles-pp.__mod_lruvec_state
0.36 ? 13% -0.1 0.25 ? 7% perf-profile.children.cycles-pp.__list_del_entry_valid
0.30 ? 11% -0.1 0.19 ? 6% perf-profile.children.cycles-pp.xa_get_order
0.22 ? 9% -0.1 0.12 ? 15% perf-profile.children.cycles-pp.__cond_resched
0.28 ? 7% -0.1 0.18 ? 8% perf-profile.children.cycles-pp.__mod_node_page_state
0.24 ? 24% -0.1 0.15 ? 8% perf-profile.children.cycles-pp.try_charge_memcg
0.31 ? 13% -0.1 0.22 ? 8% perf-profile.children.cycles-pp.__mod_memcg_lruvec_state
0.22 ? 13% -0.1 0.14 ? 12% perf-profile.children.cycles-pp.get_mem_cgroup_from_mm
0.25 ? 11% -0.1 0.18 ? 12% perf-profile.children.cycles-pp.xas_find_conflict
0.24 ? 10% -0.1 0.18 ? 8% perf-profile.children.cycles-pp.perf_output_sample
0.19 ? 17% -0.1 0.13 ? 17% perf-profile.children.cycles-pp.xas_create
0.09 ? 17% -0.1 0.03 ?104% perf-profile.children.cycles-pp.__kernel_text_address
0.14 ? 15% -0.1 0.09 ? 16% perf-profile.children.cycles-pp.folio_mapping
0.18 ? 12% -0.1 0.13 ? 13% perf-profile.children.cycles-pp.orc_find
0.14 ? 39% -0.1 0.09 ? 21% perf-profile.children.cycles-pp.rcu_idle_exit
0.10 ? 15% -0.0 0.05 ? 60% perf-profile.children.cycles-pp.kmem_cache_free
0.17 ? 12% -0.0 0.13 ? 13% perf-profile.children.cycles-pp.memcpy_erms
0.11 ? 7% -0.0 0.08 ? 16% perf-profile.children.cycles-pp.xas_descend
0.11 ? 39% +0.1 0.24 ? 16% perf-profile.children.cycles-pp.calc_global_load_tick
0.10 ? 21% +0.1 0.24 ? 16% perf-profile.children.cycles-pp.arch_cpu_idle_enter
0.10 ? 22% +0.1 0.24 ? 17% perf-profile.children.cycles-pp.tsc_verify_tsc_adjust
0.00 +0.4 0.40 ? 5% perf-profile.children.cycles-pp.do_softirq
0.00 +0.4 0.43 ? 2% perf-profile.children.cycles-pp.flush_smp_call_function_queue
0.26 ? 32% +0.6 0.86 ? 11% perf-profile.children.cycles-pp.timekeeping_max_deferment
1.56 ? 18% +0.6 2.19 ? 11% perf-profile.children.cycles-pp.menu_select
0.70 ? 11% +0.7 1.43 ? 10% perf-profile.children.cycles-pp.tick_nohz_get_sleep_length
0.53 ? 14% +0.8 1.30 ? 11% perf-profile.children.cycles-pp.tick_nohz_next_event
14.53 ? 6% +2.5 17.00 ? 7% perf-profile.children.cycles-pp.do_iter_read
14.04 ? 7% +2.5 16.54 ? 8% perf-profile.children.cycles-pp.shmem_file_read_iter
14.12 ? 7% +2.5 16.62 ? 7% perf-profile.children.cycles-pp.do_iter_readv_writev
12.32 ? 7% +2.8 15.08 ? 8% perf-profile.children.cycles-pp._copy_to_iter
6.68 ? 5% +2.8 9.49 ? 8% perf-profile.children.cycles-pp.copy_user_enhanced_fast_string
6.63 ? 5% +2.8 9.45 ? 8% perf-profile.children.cycles-pp.copyout
19.26 ? 3% +5.6 24.82 ? 6% perf-profile.children.cycles-pp.copy_page_to_iter
0.44 ? 8% -0.4 0.09 ? 13% perf-profile.self.cycles-pp.update_sg_lb_stats
1.10 ? 9% -0.3 0.83 ? 5% perf-profile.self.cycles-pp.xas_load
0.55 ? 17% -0.2 0.31 ? 9% perf-profile.self.cycles-pp.rmqueue_bulk
0.55 ? 14% -0.2 0.35 ? 6% perf-profile.self.cycles-pp.__pagevec_lru_add_fn
0.58 ? 12% -0.2 0.39 ? 4% perf-profile.self.cycles-pp.__filemap_add_folio
0.41 ? 13% -0.1 0.27 ? 10% perf-profile.self.cycles-pp.xas_store
0.37 ? 11% -0.1 0.24 ? 6% perf-profile.self.cycles-pp.unwind_next_frame
0.58 ? 9% -0.1 0.44 ? 10% perf-profile.self.cycles-pp.__get_user_nocheck_8
0.48 ? 9% -0.1 0.36 ? 9% perf-profile.self.cycles-pp._raw_spin_lock_irq
0.29 ? 19% -0.1 0.17 ? 6% perf-profile.self.cycles-pp.rmqueue
0.26 ? 12% -0.1 0.15 ? 17% perf-profile.self.cycles-pp.__alloc_pages
0.40 ? 10% -0.1 0.28 ? 11% perf-profile.self.cycles-pp._raw_spin_lock
0.27 ? 6% -0.1 0.17 ? 9% perf-profile.self.cycles-pp.__mod_node_page_state
0.27 ? 10% -0.1 0.18 ? 16% perf-profile.self.cycles-pp.folio_add_lru
0.21 ? 9% -0.1 0.12 ? 15% perf-profile.self.cycles-pp.__read_end_io
0.22 ? 12% -0.1 0.14 ? 12% perf-profile.self.cycles-pp.get_mem_cgroup_from_mm
0.17 ? 23% -0.1 0.09 ? 12% perf-profile.self.cycles-pp.try_charge_memcg
0.40 ? 9% -0.1 0.32 ? 9% perf-profile.self.cycles-pp.find_lock_entries
0.33 ? 12% -0.1 0.26 ? 10% perf-profile.self.cycles-pp.shmem_file_read_iter
0.23 ? 11% -0.1 0.17 ? 8% perf-profile.self.cycles-pp.__mod_memcg_lruvec_state
0.18 ? 15% -0.1 0.11 ? 18% perf-profile.self.cycles-pp.get_page_from_freelist
0.32 ? 6% -0.1 0.26 ? 6% perf-profile.self.cycles-pp.ext4_mpage_readpages
0.13 ? 16% -0.1 0.08 ? 15% perf-profile.self.cycles-pp.folio_mapping
0.17 ? 10% -0.0 0.13 ? 13% perf-profile.self.cycles-pp.orc_find
0.10 ? 9% -0.0 0.06 ? 40% perf-profile.self.cycles-pp.xas_descend
0.12 ? 18% -0.0 0.07 ? 13% perf-profile.self.cycles-pp.charge_memcg
0.09 ? 17% -0.0 0.06 ? 41% perf-profile.self.cycles-pp.__mod_lruvec_state
0.10 ? 39% +0.1 0.24 ? 16% perf-profile.self.cycles-pp.calc_global_load_tick
0.09 ? 21% +0.1 0.24 ? 17% perf-profile.self.cycles-pp.tsc_verify_tsc_adjust
0.11 ? 24% +0.2 0.31 ? 17% perf-profile.self.cycles-pp.tick_nohz_next_event
0.26 ? 33% +0.6 0.86 ? 11% perf-profile.self.cycles-pp.timekeeping_max_deferment
12.22 ? 7% +2.7 14.93 ? 8% perf-profile.self.cycles-pp._copy_to_iter
6.62 ? 5% +2.8 9.41 ? 8% perf-profile.self.cycles-pp.copy_user_enhanced_fast_string
Disclaimer:
Results have been estimated based on internal Intel analysis and are provided
for informational purposes only. Any difference in system hardware or software
design or configuration may affect actual performance.
--
0-DAY CI Kernel Test Service
https://01.org/lkp
kernel test robot <[email protected]> writes:
> Greeting,
>
> FYI, we noticed a -17.7% regression of phoronix-test-suite.fio.SequentialRead.IO_uring.Yes.No.1MB.DefaultTestDirectory.mb_s due to commit:
>
This performance drop can be fixed by converting wq_pool_attach_mutex
to a rwsem, or by adding another lock for this. I will upload a new patch
version for this.
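As a rough illustration of the rwsem direction (this is only a sketch of
the idea, not the actual patch; the lock name and the exact call sites are
assumptions):

/*
 * Sketch: let create_worker() take the shared side so concurrent worker
 * creation does not serialize on one mutex, while the hotplug paths that
 * flip POOL_DISASSOCIATED take the exclusive side.
 */
static DECLARE_RWSEM(wq_pool_attach_rwsem);

/* hotplug side, e.g. workqueue_offline_cpu() / rebind_workers(): */
	down_write(&wq_pool_attach_rwsem);
	/* set or clear POOL_DISASSOCIATED, attach/detach workers */
	up_write(&wq_pool_attach_rwsem);

/* create_worker() side: */
	down_read(&wq_pool_attach_rwsem);
	if (pool->flags & POOL_DISASSOCIATED)
		kthread_bind_mask(worker->task, cpu_active_mask);
	else
		kthread_bind_mask(worker->task, pool->attrs->cpumask);
	up_read(&wq_pool_attach_rwsem);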
--
BRs
Schspa Shi
CC Peter.
Peter has changed the CPU binding code in workqueue.c.
I don't understand the problem well enough; if kthread_bind_mask() were buggy
in workqueue.c, it would be buggy in other places too.
On 2022/7/7 17:05, Schspa Shi wrote:
>
> - if (worker->rescue_wq)
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + if (worker->rescue_wq) {
> + if (pool->flags & POOL_DISASSOCIATED)
> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
> + else
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> + }
>
For unbound pools (which also have POOL_DISASSOCIATED set), pool->attrs->cpumask
should be used if pool->attrs->cpumask contains an active CPU.
> +
> + mutex_lock(&wq_pool_attach_mutex);
> + if ((pool->flags & POOL_DISASSOCIATED)) {
> + /* We can't call get_online_cpus, there will be deadlock
> + * cpu_active_mask will no change, because we have
> + * wq_pool_attach_mutex hold.
> + **/
> + kthread_bind_mask(worker->task, cpu_active_mask);
> + } else {
> + kthread_bind_mask(worker->task, pool->attrs->cpumask);
> + }
> + mutex_unlock(&wq_pool_attach_mutex);
For unbound pools, pool->attrs->cpumask should be used if pool->attrs->cpumask
contains an active CPU.
wq_pool_attach_mutex is held both here and in worker_attach_to_pool(), which smells bad.
The change is complex. And if kthread_bind_mask() can't work as expected here,
the change I prefer would be:
diff --git a/kernel/workqueue.c b/kernel/workqueue.c
index 4056f2a3f9d5..1ad8aef5fe98 100644
--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ -1862,6 +1862,12 @@ static void worker_attach_to_pool(struct worker *worker,
{
mutex_lock(&wq_pool_attach_mutex);
+ /*
+ * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
+ * online CPUs. It'll be re-applied when any of the CPUs come up.
+ */
+ set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
+
/*
* The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
* stable across this function. See the comments above the flag
@@ -1872,9 +1877,6 @@ static void worker_attach_to_pool(struct worker *worker,
else
kthread_set_per_cpu(worker->task, pool->cpu);
- if (worker->rescue_wq)
- set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
-
list_add_tail(&worker->node, &pool->workers);
worker->pool = pool;
@@ -1952,7 +1954,7 @@ static struct worker *create_worker(struct worker_pool *pool)
goto fail;
set_user_nice(worker->task, pool->attrs->nice);
- kthread_bind_mask(worker->task, pool->attrs->cpumask);
+ worker->flags |= PF_NO_SETAFFINITY;
/* successful, attach the worker to the pool */
worker_attach_to_pool(worker, pool);
@@ -4270,7 +4272,7 @@ static int init_rescuer(struct workqueue_struct *wq)
}
wq->rescuer = rescuer;
- kthread_bind_mask(rescuer->task, cpu_possible_mask);
+ rescuer->flags |= PF_NO_SETAFFINITY;
wake_up_process(rescuer->task);
return 0;
It is untested. It effectively reverts the commit 640f17c82460e
("workqueue: Restrict affinity change to rescuer").
It avoids using kthread_bind_mask().
Lai Jiangshan <[email protected]> writes:
> CC Peter.
> Peter has changed the CPU binding code in workqueue.c.
>
> I'm not understanding the problem enough, if kthread_bind_mask() is buggy
> in workqueue.c, it would be buggy in other places too.
>
The bug is not in using kthread_bind_mask() itself; rather, we pass a bad
pool->attrs->cpumask to this kthread.
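For context, kthread_bind_mask() simply applies whatever mask it is given to
the not-yet-running task; paraphrased from __kthread_bind_mask() in
kernel/kthread.c (not verbatim):

	/* the task must still be inactive ... */
	if (!wait_task_inactive(p, state)) {
		WARN_ON(1);
		return;
	}

	/*
	 * ... then the mask is applied as-is and PF_NO_SETAFFINITY is set.
	 * Nothing here checks whether @mask still contains an active CPU;
	 * that is entirely up to the caller.
	 */
	raw_spin_lock_irqsave(&p->pi_lock, flags);
	do_set_cpus_allowed(p, mask);
	p->flags |= PF_NO_SETAFFINITY;
	raw_spin_unlock_irqrestore(&p->pi_lock, flags);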
>
> On 2022/7/7 17:05, Schspa Shi wrote:
>
>> - if (worker->rescue_wq)
>> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> + if (worker->rescue_wq) {
>> + if (pool->flags & POOL_DISASSOCIATED)
>> + set_cpus_allowed_ptr(worker->task, cpu_active_mask);
>> + else
>> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
>> + }
>>
>
> For unbound pools (which also has POOL_DISASSOCIATED), pool->attrs->cpumask
> should be used if pool->attrs->cpumask has active cpu.
>
In this case pool->attrs->cpumask has no active CPU; the CPU for this
pool has already been offlined.
The bug occurs when the CPU has gone through workqueue_offline_cpu()
during CPU unplug and a new worker is then created, which ends up running
on the offline CPU.
>
>> +
>> + mutex_lock(&wq_pool_attach_mutex);
>> + if ((pool->flags & POOL_DISASSOCIATED)) {
>> + /* We can't call get_online_cpus, there will be deadlock
>> + * cpu_active_mask will no change, because we have
>> + * wq_pool_attach_mutex hold.
>> + **/
>> + kthread_bind_mask(worker->task, cpu_active_mask);
>> + } else {
>> + kthread_bind_mask(worker->task, pool->attrs->cpumask);
>> + }
>> + mutex_unlock(&wq_pool_attach_mutex);
>
>
> For unbound pools, pool->attrs->cpumask should be used if pool->attrs->cpumask
> has active cpu.
>
> wq_pool_attach_mutex is held here and in worker_attach_to_pool() which smells bad.
>
Yes, this will be changed. I have made a new patch that moves the
thread binding into worker_attach_to_pool(), via set_cpus_allowed_ptr().
>
>
> The change is complex. And if kthread_bind_mask() can't work as expected here,
> the change I prefer would be:
>
> diff --git a/kernel/workqueue.c b/kernel/workqueue.c
> index 4056f2a3f9d5..1ad8aef5fe98 100644
> --- a/kernel/workqueue.c
> +++ b/kernel/workqueue.c
> @@ -1862,6 +1862,12 @@ static void worker_attach_to_pool(struct worker *worker,
> {
> mutex_lock(&wq_pool_attach_mutex);
>
> + /*
> + * set_cpus_allowed_ptr() will fail if the cpumask doesn't have any
> + * online CPUs. It'll be re-applied when any of the CPUs come up.
> + */
> + set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> +
This call will succeed in this case: set_cpus_allowed_ptr() uses
cpu_online_mask to verify that the cpumask is valid, and here the CPU's
state is between online and active. So this modification doesn't solve
the problem I'm hitting.
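For reference, the check I mean looks roughly like this (paraphrased from
__set_cpus_allowed_ptr() in kernel/sched/core.c, not verbatim):

	const struct cpumask *cpu_valid_mask = cpu_active_mask;

	if (p->flags & PF_KTHREAD || is_migration_disabled(p)) {
		/* kthreads are allowed on online && !active CPUs */
		cpu_valid_mask = cpu_online_mask;
	}

	if (cpumask_any_and(cpu_valid_mask, new_mask) >= nr_cpu_ids) {
		ret = -EINVAL;
		goto out;
	}

Because the dying CPU is still online at that point, the stale single-CPU
mask passes this check.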
> /*
> * The wq_pool_attach_mutex ensures %POOL_DISASSOCIATED remains
> * stable across this function. See the comments above the flag
> @@ -1872,9 +1877,6 @@ static void worker_attach_to_pool(struct worker *worker,
> else
> kthread_set_per_cpu(worker->task, pool->cpu);
>
> - if (worker->rescue_wq)
> - set_cpus_allowed_ptr(worker->task, pool->attrs->cpumask);
> -
> list_add_tail(&worker->node, &pool->workers);
> worker->pool = pool;
>
> @@ -1952,7 +1954,7 @@ static struct worker *create_worker(struct worker_pool *pool)
> goto fail;
>
> set_user_nice(worker->task, pool->attrs->nice);
> - kthread_bind_mask(worker->task, pool->attrs->cpumask);
> + worker->flags |= PF_NO_SETAFFINITY;
>
> /* successful, attach the worker to the pool */
> worker_attach_to_pool(worker, pool);
> @@ -4270,7 +4272,7 @@ static int init_rescuer(struct workqueue_struct *wq)
> }
>
> wq->rescuer = rescuer;
> - kthread_bind_mask(rescuer->task, cpu_possible_mask);
> + rescuer->flags |= PF_NO_SETAFFINITY;
> wake_up_process(rescuer->task);
>
> return 0;
>
>
> It is untested. It effectively reverts the commit 640f17c82460e
> ("workqueue: Restrict affinity change to rescuer").
> It avoids using kthread_bind_mask().
I will upload a new patchset that removes the extra wq_pool_attach_mutex
locking and adds a timing diagram to make the issue clearer.
--
BRs
Schspa Shi
On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>
>
> CC Peter.
> Peter has changed the CPU binding code in workqueue.c.
[ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
[ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
^^^^^^^^^^^^^^^^^^^^^
I think we can ignore this as being some ancient kernel. Please try
something recent.
Peter Zijlstra <[email protected]> writes:
> On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>>
>>
>> CC Peter.
>> Peter has changed the CPU binding code in workqueue.c.
>
> [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> ^^^^^^^^^^^^^^^^^^^^^
>
> I think we can ignore this as being some ancient kernel. Please try
> something recent.
I have merged all the workqueue changes from
Link: https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-stable-rt.git/log/kernel/workqueue.c?h=v5.10-rt
and this problem can still be reproduced.
I also checked the patches in the master tree; there seems to be no
existing patch that fixes this problem.
Or are there any scheduler or process-management related patches that
might fix this problem?
--
BRs
Schspa Shi
Peter Zijlstra <[email protected]> writes:
> On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>>
>>
>> CC Peter.
>> Peter has changed the CPU binding code in workqueue.c.
>
> [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> ^^^^^^^^^^^^^^^^^^^^^
>
> I think we can ignore this as being some ancient kernel. Please try
> something recent.
Hi Peter:
I spent a few days writing a test case and reproduced the problem on
kernel 5.19. I think it's time for us to review the V3 patch for a fix.
The V3 patch is at
https://lore.kernel.org/all/[email protected]/
Please help to review it.
The test branch is:
https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
I think this code is new enough to demonstrate that the problem persists.
The log is as follows:
[ 3103.198684] ------------[ cut here ]------------
[ 3103.198684] Dying CPU not properly vacated!
[ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Modules linked in: work_test(O)
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] Code: 00 e9 a1 c1 40 ff 48 c7 c7 48 91 94 82 e8 99 29 00 00 48 c7 c7 00 5e 53 83 e9 e3 10 50 ff 48 c7 c7 98 91 94 82 e8 4f ec ff ff <0f> 0b 44 8b ab 10 0a 00 00 8b 4b 04 48 c7 c6 cd 37 93 82 48 c7 c7
[ 3103.198684] RSP: 0000:ffffc900000dbda0 EFLAGS: 00010086
[ 3103.198684] RAX: 0000000000000000 RBX: ffff88813bcaa280 RCX: 0000000000000000
[ 3103.198684] RDX: 0000000000000003 RSI: ffffffff82953971 RDI: 00000000ffffffff
[ 3103.198684] RBP: 0000000000000082 R08: 00000000000021ed R09: ffffc900000dbd38
[ 3103.198684] R10: 0000000000000001 R11: ffffffffffffffff R12: 0000000000000060
[ 3103.198684] R13: ffffffff810a9040 R14: ffffffff82c555c0 R15: 0000000000000000
[ 3103.198684] FS: 0000000000000000(0000) GS:ffff88813bc80000(0000) knlGS:0000000000000000
[ 3103.198684] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3103.198684] CR2: 00007f85acd18010 CR3: 0000000102578000 CR4: 0000000000350ee0
[ 3103.198684] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3103.198684] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] ? sched_cpu_wait_empty+0x60/0x60
[ 3103.198684] cpuhp_invoke_callback+0x3a4/0x5f0
[ 3103.198684] take_cpu_down+0x71/0xd0
[ 3103.198684] multi_cpu_stop+0x5c/0xf0
[ 3103.198684] ? stop_machine_yield+0x10/0x10
[ 3103.198684] cpu_stopper_thread+0x82/0x130
[ 3103.198684] smpboot_thread_fn+0x1bb/0x2b0
[ 3103.198684] ? sort_range+0x20/0x20
[ 3103.198684] kthread+0xfe/0x120
[ 3103.198684] ? kthread_complete_and_exit+0x20/0x20
[ 3103.198684] ret_from_fork+0x1f/0x30
[ 3103.198684] </TASK>
[ 3103.198684] Kernel panic - not syncing: panic_on_warn set ...
[ 3103.198684] CPU: 1 PID: 23 Comm: migration/1 Tainted: G O 5.19.0-rc8-rt8 #1
[ 3103.198684] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[ 3103.198684] Stopper: multi_cpu_stop+0x0/0xf0 <- stop_machine_cpuslocked+0x132/0x170
[ 3103.198684] Call Trace:
[ 3103.198684] <TASK>
[ 3103.198684] dump_stack_lvl+0x34/0x48
[ 3103.198684] panic+0xf8/0x299
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] __warn.cold+0x43/0xba
[ 3103.198684] ? sched_cpu_dying.cold+0xc/0xc3
[ 3103.198684] report_bug+0x9d/0xc0
[ 3103.198684] handle_bug+0x3c/0x70
[ 3103.198684] exc_invalid_op+0x14/0x70
[ 3103.198684] asm_exc_invalid_op+0x16/0x20
[ 3103.198684] RIP: 0010:sched_cpu_dying.cold+0xc/0xc3
--
BRs
Schspa Shi
On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>
>
> Peter Zijlstra <[email protected]> writes:
>
> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
> >>
> >>
> >> CC Peter.
> >> Peter has changed the CPU binding code in workqueue.c.
> >
> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> > ^^^^^^^^^^^^^^^^^^^^^
> >
> > I think we can ignore this as being some ancient kernel. Please try
> > something recent.
>
> Hi peter:
>
> I spent a few days writing a test case and reproduced the problem on
> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>
> The V3 patch is at
> https://lore.kernel.org/all/[email protected]/
> Please help to review it.
Because of the implementation of is_cpu_allowed(),
I am still puzzled as to how the newly created worker can be woken
up on the dying CPU, since it has no KTHREAD_IS_PER_CPU set.
Is the CPU considered dying when onlining a CPU fails? I think balance_push
is not set in that case, when it fails at some point during online.
But I haven't looked into the details and linked up the clues.
>
> Test branch as:
> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>
Can it only be triggered in the rt kernel?
> I think this code is new enough to demonstrate that the problem persists.
>
> The log as fellowing:
>
> [ 3103.198684] ------------[ cut here ]------------
> [ 3103.198684] Dying CPU not properly vacated!
> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
> [ 3103.198684] Modules linked in: work_test(O)
Could you give me the code of this module?
Lai Jiangshan <[email protected]> writes:
> On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Peter Zijlstra <[email protected]> writes:
>>
>> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >>
>> >>
>> >> CC Peter.
>> >> Peter has changed the CPU binding code in workqueue.c.
>> >
>> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> > ^^^^^^^^^^^^^^^^^^^^^
>> >
>> > I think we can ignore this as being some ancient kernel. Please try
>> > something recent.
>>
>> Hi peter:
>>
>> I spent a few days writing a test case and reproduced the problem on
>> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>>
>> The V3 patch is at
>> https://lore.kernel.org/all/[email protected]/
>> Please help to review it.
>
> Because of the implementation of is_cpu_allowed(),
> I am still suspicious about how the newly created worker can be woken
> up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>
I think the comments in the V3 patch have explained the problem: the
newly created worker runs on the dying CPU because we bind it to the
dying CPU via kthread_bind_mask() or set_cpus_allowed_ptr().
Please refer to the following scenarios.
CPU0 CPU1
------------------------------------------------------------------
sched_cpu_deactivate(cpu_active_mask clear)
workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
-- all worker will migrate to another cpu --
worker_thread
-- will create new worker if
pool->worklist is not empty
create_worker()
-- new kworker will bound to CPU0
(pool->attrs->cpumask will be mask of CPU0).
kworker/0:x will running on rq
sched_cpu_dying
if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
WARN(true, "Dying CPU not properly vacated!");
---------OOPS-------------
When CPU0's state is between CPUHP_HRTIMERS_PREPARE and
CPUHP_AP_WORKQUEUE_ONLINE, work can still be scheduled on CPU0, which adds
new work items to the pool->worklist of the dying CPU's worker pool.
> Is it a dying cpu when onlining a CPU fails? I think balance_push
> is not set in this case when it fails at some point during online.
> But I haven't looked into the details and linked the clues.
>
Not this case; there is no CPU-online-failure log in my test log.
>
>>
>> Test branch as:
>> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>>
>
> Can it only be triggered in the rt kernel?
>
I only tested on the RT kernel, because I work on the RT kernel now, but I
think the non-RT kernel has the issue too.
>> I think this code is new enough to demonstrate that the problem persists.
>>
>> The log as fellowing:
>>
>> [ 3103.198684] ------------[ cut here ]------------
>> [ 3103.198684] Dying CPU not properly vacated!
>> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> [ 3103.198684] Modules linked in: work_test(O)
>
> Could you give me the code of this module?
The module's source code is at
https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
Kernel defconfig is at
https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
I added a hook at the entry of worker_thread() which adds an extra 50ms
delay to make the problem reproduce faster.
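The hook is essentially of this shape (reconstructed from the description
above as an assumption, not the actual test patch):

--- a/kernel/workqueue.c
+++ b/kernel/workqueue.c
@@ ... @@ static int worker_thread(void *__worker)
 	struct worker *worker = __worker;
 	struct worker_pool *pool = worker->pool;
 
+	/* test only: widen the window before the new worker starts running,
+	 * so the race with sched_cpu_dying() is easier to hit */
+	msleep(50);
+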
--
BRs
Schspa Shi
On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>
>
> Lai Jiangshan <[email protected]> writes:
>
> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
> >>
> >>
> >> Peter Zijlstra <[email protected]> writes:
> >>
> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
> >> >>
> >> >>
> >> >> CC Peter.
> >> >> Peter has changed the CPU binding code in workqueue.c.
> >> >
> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
> >> > ^^^^^^^^^^^^^^^^^^^^^
> >> >
> >> > I think we can ignore this as being some ancient kernel. Please try
> >> > something recent.
> >>
> >> Hi peter:
> >>
> >> I spent a few days writing a test case and reproduced the problem on
> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
> >>
> >> The V3 patch is at
> >> https://lore.kernel.org/all/[email protected]/
> >> Please help to review it.
> >
> > Because of the implementation of is_cpu_allowed(),
> > I am still suspicious about how the newly created worker can be woken
> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
> >
>
> I think the comments of the V3 patch have explained the problem, the
> newly created worker runs on the dying CPU because we bind it to the
> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>
wake_up_process() will modify the task's cpumask and select
a fallback CPU. The scheduler code does this for all kthreads that
happen to have a single-CPU affinity which has been lost.
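Roughly, the path I am referring to (condensed paraphrase of
select_fallback_rq() in kernel/sched/core.c, not verbatim):

	/* try any CPU in the task's own mask that is still allowed */
	for_each_cpu(dest_cpu, p->cpus_ptr) {
		if (is_cpu_allowed(p, dest_cpu))
			return dest_cpu;
	}

	/* none left: drop the old affinity and fall back to a wider mask */
	do_set_cpus_allowed(p, task_cpu_possible_mask(p));

After the affinity has been widened it retries and picks an allowed CPU, so
a plain kthread whose single-CPU affinity has gone away normally ends up
running elsewhere.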
> Please refer to the following scenarios.
>
> CPU0 CPU1
> ------------------------------------------------------------------
> sched_cpu_deactivate(cpu_active_mask clear)
> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
> -- all worker will migrate to another cpu --
> worker_thread
> -- will create new worker if
> pool->worklist is not empty
> create_worker()
> -- new kworker will bound to CPU0
> (pool->attrs->cpumask will be mask of CPU0).
> kworker/0:x will running on rq
>
> sched_cpu_dying
> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
> WARN(true, "Dying CPU not properly vacated!");
> ---------OOPS-------------
>
> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
> the CPU0 still can schedule a new work, which will add a new work to
> pool->worklist of a dying CPU's work pool.
>
> > Is it a dying cpu when onlining a CPU fails? I think balance_push
> > is not set in this case when it fails at some point during online.
> > But I haven't looked into the details and linked the clues.
> >
>
> Not this case, there is on onlining CPU fails log in my test log.
>
> >
> >>
> >> Test branch as:
> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
> >>
> >
> > Can it only be triggered in the rt kernel?
> >
>
> I only test on rt kernel, because of I work on RT kernel now, But I
> think non-rt kernel have the issues too.
>
> >> I think this code is new enough to demonstrate that the problem persists.
> >>
> >> The log as fellowing:
> >>
> >> [ 3103.198684] ------------[ cut here ]------------
> >> [ 3103.198684] Dying CPU not properly vacated!
> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
> >> [ 3103.198684] Modules linked in: work_test(O)
> >
> > Could you give me the code of this module?
>
> The module's source code is at
> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
> Kernel defconfig is at
> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>
How long does it take to reproduce the bug?
I have tested for a short time on the non-rt kernel, and nothing
has happened yet.
> I add a hook to the entry of worker_thread, which add extra 50ms delay
> to make it reproduced faster.
>
> --
> BRs
> Schspa Shi
Lai Jiangshan <[email protected]> writes:
> On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Lai Jiangshan <[email protected]> writes:
>>
>> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>> >>
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >> >>
>> >> >>
>> >> >> CC Peter.
>> >> >> Peter has changed the CPU binding code in workqueue.c.
>> >> >
>> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> >> > ^^^^^^^^^^^^^^^^^^^^^
>> >> >
>> >> > I think we can ignore this as being some ancient kernel. Please try
>> >> > something recent.
>> >>
>> >> Hi peter:
>> >>
>> >> I spent a few days writing a test case and reproduced the problem on
>> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>> >>
>> >> The V3 patch is at
>> >> https://lore.kernel.org/all/[email protected]/
>> >> Please help to review it.
>> >
>> > Because of the implementation of is_cpu_allowed(),
>> > I am still suspicious about how the newly created worker can be woken
>> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>> >
>>
>> I think the comments of the V3 patch have explained the problem, the
>> newly created worker runs on the dying CPU because we bind it to the
>> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>>
>
> wake_up_process() will modify the task's cpumask and select
> a fallback cpu. The schedule code does it for all kthreads that
> happen to have a single CPU affinity which is lost.
>
I'm not very familiar with this part of the code; I'll check the code
here first.
>> Please refer to the following scenarios.
>>
>> CPU0 CPU1
>> ------------------------------------------------------------------
>> sched_cpu_deactivate(cpu_active_mask clear)
>> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
>> -- all worker will migrate to another cpu --
>> worker_thread
>> -- will create new worker if
>> pool->worklist is not empty
>> create_worker()
>> -- new kworker will bound to CPU0
>> (pool->attrs->cpumask will be mask of CPU0).
>> kworker/0:x will running on rq
>>
>> sched_cpu_dying
>> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
>> WARN(true, "Dying CPU not properly vacated!");
>> ---------OOPS-------------
>>
>> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
>> the CPU0 still can schedule a new work, which will add a new work to
>> pool->worklist of a dying CPU's work pool.
>>
>> > Is it a dying cpu when onlining a CPU fails? I think balance_push
>> > is not set in this case when it fails at some point during online.
>> > But I haven't looked into the details and linked the clues.
>> >
>>
>> Not this case, there is on onlining CPU fails log in my test log.
>>
>> >
>> >>
>> >> Test branch as:
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>> >>
>> >
>> > Can it only be triggered in the rt kernel?
>> >
>>
>> I only test on rt kernel, because of I work on RT kernel now, But I
>> think non-rt kernel have the issues too.
>>
>> >> I think this code is new enough to demonstrate that the problem persists.
>> >>
>> >> The log as fellowing:
>> >>
>> >> [ 3103.198684] ------------[ cut here ]------------
>> >> [ 3103.198684] Dying CPU not properly vacated!
>> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> >> [ 3103.198684] Modules linked in: work_test(O)
>> >
>> > Could you give me the code of this module?
>>
>> The module's source code is at
>> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
>> Kernel defconfig is at
>> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>>
>
> How long will it take to reproduce the bug?
> I have tested for a short time on the non-rt kernel, no thing
> has happened yet.
>
Test host machine:
  a 4-core virtual machine with nested KVM support,
  16GB memory.
Test VMs:
  3 * 4-core KVM virtual machines.
With 3 test instances, it reproduces within one day.
It needs some stress to reproduce; you can use my script to compile &
launch the test automatically.
https://github.com/schspa/ktest/blob/main/test-lnx.sh
~/work/src/ktest/test-lnx.sh ~/src/linux linux-rt-devel/linux-5.18.y-rt ~/test/linux-5.18.y-rt wq-test
>> I add a hook to the entry of worker_thread, which add extra 50ms delay
>> to make it reproduced faster.
>>
>> --
>> BRs
>> Schspa Shi
--
BRs
Schspa Shi
Lai Jiangshan <[email protected]> writes:
> On Mon, Aug 1, 2022 at 1:08 PM Schspa Shi <[email protected]> wrote:
>>
>>
>> Lai Jiangshan <[email protected]> writes:
>>
>> > On Sat, Jul 30, 2022 at 12:19 PM Schspa Shi <[email protected]> wrote:
>> >>
>> >>
>> >> Peter Zijlstra <[email protected]> writes:
>> >>
>> >> > On Wed, Jul 13, 2022 at 05:52:58PM +0800, Lai Jiangshan wrote:
>> >> >>
>> >> >>
>> >> >> CC Peter.
>> >> >> Peter has changed the CPU binding code in workqueue.c.
>> >> >
>> >> > [ 1622.829091] WARNING: CPU: 3 PID: 31 at kernel/sched/core.c:7756 sched_cpu_dying+0x74/0x204
>> >> > [ 1622.829374] CPU: 3 PID: 31 Comm: migration/3 Tainted: P O 5.10.59-rt52 #2
>> >> > ^^^^^^^^^^^^^^^^^^^^^
>> >> >
>> >> > I think we can ignore this as being some ancient kernel. Please try
>> >> > something recent.
>> >>
>> >> Hi peter:
>> >>
>> >> I spent a few days writing a test case and reproduced the problem on
>> >> kernel 5.19. I think it's time for us to review the V3 patch for a fix.
>> >>
>> >> The V3 patch is at
>> >> https://lore.kernel.org/all/[email protected]/
>> >> Please help to review it.
>> >
>> > Because of the implementation of is_cpu_allowed(),
>> > I am still suspicious about how the newly created worker can be woken
>> > up in the dying cpu since it has no KTHREAD_IS_PER_CPU set.
>> >
>>
>> I think the comments of the V3 patch have explained the problem, the
>> newly created worker runs on the dying CPU because we bind it to the
>> dying CPU via kthread_bind_mask or set_cpus_allowed_ptr.
>>
>
> wake_up_process() will modify the task's cpumask and select
> a fallback cpu. The schedule code does it for all kthreads that
> happen to have a single CPU affinity which is lost.
>
There are two exceptions I can think of here:
1. For tasks that have migration disabled, the check uses cpu_online_mask
   and will not select a fallback CPU.
2. When the scheduling code tests cpu_dying, there is no synchronization
   with hotplug, so cpu_dying_mask is racy at that point. The scheduler
   therefore does not guarantee that the migration above will happen.
Under PREEMPT_RT, spin_lock disables migration, which is also why this
problem is easily reproduced on the RT kernel.
In view of these limitations, I think we should not rely on this scheduler
behavior, but handle the problem when the thread is created.
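For reference, the relevant branches (paraphrased from is_cpu_allowed() in
kernel/sched/core.c around v5.19; treat the details as approximate):

static inline bool is_cpu_allowed(struct task_struct *p, int cpu)
{
	if (!cpumask_test_cpu(cpu, p->cpus_ptr))
		return false;

	/* exception 1 above: migration-disabled tasks only need the CPU online */
	if (is_migration_disabled(p))
		return cpu_online(cpu);

	/* userspace tasks need an active CPU */
	if (!(p->flags & PF_KTHREAD))
		return cpu_active(cpu) && task_cpu_possible(cpu, p);

	/* per-cpu kthreads are allowed while their CPU is online */
	if (kthread_is_per_cpu(p))
		return cpu_online(cpu);

	/* exception 2 above: this cpu_dying() test is not synchronized with hotplug */
	if (cpu_dying(cpu))
		return false;

	return cpu_online(cpu);
}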
>> Please refer to the following scenarios.
>>
>> CPU0 CPU1
>> ------------------------------------------------------------------
>> sched_cpu_deactivate(cpu_active_mask clear)
>> workqueue_offline_cpu(work pool POOL_DISASSOCIATED)
>> -- all worker will migrate to another cpu --
>> worker_thread
>> -- will create new worker if
>> pool->worklist is not empty
>> create_worker()
>> -- new kworker will bound to CPU0
>> (pool->attrs->cpumask will be mask of CPU0).
>> kworker/0:x will running on rq
>>
>> sched_cpu_dying
>> if (rq->nr_running != 1 || rq_has_pinned_tasks(rq))
>> WARN(true, "Dying CPU not properly vacated!");
>> ---------OOPS-------------
>>
>> When CPUHP_HRTIMERS_PREPARE < CPU0's state < CPUHP_AP_WORKQUEUE_ONLINE,
>> the CPU0 still can schedule a new work, which will add a new work to
>> pool->worklist of a dying CPU's work pool.
>>
>> > Is it a dying cpu when onlining a CPU fails? I think balance_push
>> > is not set in this case when it fails at some point during online.
>> > But I haven't looked into the details and linked the clues.
>> >
>>
>> Not this case, there is on onlining CPU fails log in my test log.
>>
>> >
>> >>
>> >> Test branch as:
>> >> https://git.kernel.org/pub/scm/linux/kernel/git/rt/linux-rt-devel.git/tag/?h=v5.19-rc8-rt8
>> >>
>> >
>> > Can it only be triggered in the rt kernel?
>> >
>>
>> I only test on rt kernel, because of I work on RT kernel now, But I
>> think non-rt kernel have the issues too.
>>
>> >> I think this code is new enough to demonstrate that the problem persists.
>> >>
>> >> The log as fellowing:
>> >>
>> >> [ 3103.198684] ------------[ cut here ]------------
>> >> [ 3103.198684] Dying CPU not properly vacated!
>> >> [ 3103.198684] WARNING: CPU: 1 PID: 23 at kernel/sched/core.c:9575 sched_cpu_dying.cold+0xc/0xc3
>> >> [ 3103.198684] Modules linked in: work_test(O)
>> >
>> > Could you give me the code of this module?
>>
>> The module's source code is at
>> https://github.com/schspa/code_snippet/blob/master/kernel_module/workqueue_test/work-test.c
>> Kernel defconfig is at
>> https://github.com/schspa/ktest/blob/main/configs/linux_rt_defconfig
>>
>
> How long will it take to reproduce the bug?
> I have tested for a short time on the non-rt kernel, no thing
> has happened yet.
>
>> I add a hook to the entry of worker_thread, which add extra 50ms delay
>> to make it reproduced faster.
>>
>> --
>> BRs
>> Schspa Shi
--
BRs
Schspa Shi