2022-10-06 12:59:28

by Valentin Schneider

Subject: [RFC PATCH bitmap-for-next 0/4] lib/cpumask, blk_mq: Fix blk_mq_hctx_next_cpu() vs cpumask_check()

Hi,

I've split this from [1] given I don't have any updates to the other patches,
and this can live separately from them.

I figured I'd follow what Yury has done and condense the logic of
cpumask_next_wrap() into a macro, however cpumask_next_wrap() has a UP variant
which makes this a bit more annoying.

I've tried giving the UP variant its own macro in cpumask.c and declaring
it there, but that means making cpumask.c compile under !CONFIG_SMP (again),
which means doing the same for all of the cpumask.c functions that have UP
variants (cpumask_local_spread(), cpumask_any_*distribute()...).

Before going too deep in what might be a stupid idea, I thought I'd stop there,
send what I have, and check with folks whether that sounds sane.

If it does, I see two ways of handling the UP stubs:
o Get rid of the UP optimizations and use the same code as SMP
o Move *all* definitions of the UP optimizations into cpumask.c with
a different set of macros (e.g. a *_UP() variant).

[1]: http://lore.kernel.org/r/[email protected]

Cheers,
Valentin

Valentin Schneider (4):
  lib/cpumask: Generate cpumask_next_wrap() body with a macro
  lib/cpumask: Fix cpumask_check() warning in cpumask_next_wrap*()
  lib/cpumask: Introduce cpumask_next_and_wrap()
  blk_mq: Fix cpumask_check() warning in blk_mq_hctx_next_cpu()

 block/blk-mq.c          | 39 +++++++++------------------
 include/linux/cpumask.h | 22 +++++++++++++++
 lib/cpumask.c           | 60 ++++++++++++++++++++++++++++++-----------
 3 files changed, 79 insertions(+), 42 deletions(-)

--
2.31.1


2022-10-06 13:03:35

by Valentin Schneider

Subject: [RFC PATCH bitmap-for-next 2/4] lib/cpumask: Fix cpumask_check() warning in cpumask_next_wrap*()

Invoking cpumask_next*() with n==nr_cpu_ids-1 triggers a warning as there
are (obviously) no more valid CPU ids after that. This is however undesired
for the cpumask_next_wrap*() family, which needs to wrap around upon reaching
this condition.

Don't invoke cpumask_next*() when n==nr_cpu_ids-1, go for the wrapping (if
any) instead.

NOTE: this only fixes the NR_CPUS>1 variants.

Signed-off-by: Valentin Schneider <[email protected]>
---
lib/cpumask.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/lib/cpumask.c b/lib/cpumask.c
index 6e576485c84f..f8174fa3d752 100644
--- a/lib/cpumask.c
+++ b/lib/cpumask.c
@@ -12,11 +12,11 @@
unsigned int next; \
\
again: \
- next = (FETCH_NEXT); \
+ next = n == nr_cpu_ids - 1 ? nr_cpu_ids : (FETCH_NEXT); \
\
if (wrap && n < start && next >= start) { \
- next = nr_cpumask_bits; \
- } else if (next >= nr_cpumask_bits) { \
+ next = nr_cpu_ids; \
+ } else if (next >= nr_cpu_ids) { \
wrap = true; \
n = -1; \
goto again; \
--
2.31.1
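
For readers following the hunk above, here is a minimal user-space sketch of
the patched wrap logic. It is not the kernel code: MODEL_NR_CPU_IDS,
model_next(), model_next_wrap(), model_walk() and model_warnings are
hypothetical names, the mask is a plain unsigned int, the trailing
"return next" is assumed because the hunk does not show the end of the macro,
and the walk loop assumes the usual for_each_cpu_wrap() calling pattern.
model_next() counts a warning whenever it is asked for the CPU after the last
valid id, which is the condition the commit message describes.

```c
#include <stdbool.h>

/* Hypothetical stand-in for nr_cpu_ids; the kernel value is set at boot. */
#define MODEL_NR_CPU_IDS 4u

/* Counts lookups that, per the commit message, would trigger the warning. */
static unsigned int model_warnings;

/* Stand-in for cpumask_next(): first set bit strictly after n, or
 * MODEL_NR_CPU_IDS if there is none. n == -1 means "start from CPU 0". */
static unsigned int model_next(int n, unsigned int mask)
{
	unsigned int cpu;

	/* Asking for a CPU after the last valid id is the warned-about case. */
	if (n != -1 && (unsigned int)n >= MODEL_NR_CPU_IDS - 1)
		model_warnings++;

	for (cpu = (unsigned int)(n + 1); cpu < MODEL_NR_CPU_IDS; cpu++)
		if (mask & (1u << cpu))
			return cpu;

	return MODEL_NR_CPU_IDS;
}

/* Model of the patched body: same control flow as the hunk, with the
 * lookup skipped when n is already the last valid id. */
static unsigned int model_next_wrap(int n, unsigned int mask, int start, bool wrap)
{
	unsigned int next;

again:
	next = (n == (int)MODEL_NR_CPU_IDS - 1) ? MODEL_NR_CPU_IDS
						: model_next(n, mask);

	if (wrap && n < start && next >= (unsigned int)start) {
		/* Wrapped and reached start again: the iteration is over. */
		next = MODEL_NR_CPU_IDS;
	} else if (next >= MODEL_NR_CPU_IDS) {
		/* Ran off the end: restart the search from CPU 0. */
		wrap = true;
		n = -1;
		goto again;
	}

	return next; /* assumed: not visible in the hunk */
}

/* Visit every set CPU once, beginning at start and wrapping around.
 * out must have room for MODEL_NR_CPU_IDS entries; returns the count. */
static unsigned int model_walk(unsigned int mask, int start, unsigned int *out)
{
	unsigned int count = 0;
	unsigned int cpu = model_next_wrap(start - 1, mask, start, false);

	while (cpu < MODEL_NR_CPU_IDS) {
		out[count++] = cpu;
		cpu = model_next_wrap((int)cpu, mask, start, true);
	}

	return count;
}
```

In this model, walking mask 0xb (CPUs 0, 1 and 3) from start=2 with
model_walk() visits 3, 0, 1 and leaves model_warnings at 0, since the lookup
is skipped once n is the last valid id.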

2022-10-07 15:24:34

by kernel test robot

Subject: [lib/cpumask] e5ad41dae2: BUG:workqueue_lockup-pool


Greetings,

FYI, we noticed the following commit (built with gcc-11):

commit: e5ad41dae251946ecdcdc38bb8f639cd55a8eae1 ("[RFC PATCH bitmap-for-next 2/4] lib/cpumask: Fix cpumask_check() warning in cpumask_next_wrap*()")
url: https://github.com/intel-lab-lkp/linux/commits/Valentin-Schneider/lib-cpumask-blk_mq-Fix-blk_mq_hctx_next_cpu-vs-cpumask_check/20221006-202402
base: https://git.kernel.org/cgit/linux/kernel/git/axboe/linux-block.git for-next
patch link: https://lore.kernel.org/linux-block/[email protected]

in testcase: boot

on test machine: qemu-system-x86_64 -enable-kvm -cpu SandyBridge -smp 2 -m 16G

caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):


+----------------------------------------------+------------+------------+
|                                              | d8e0ef5a1d | e5ad41dae2 |
+----------------------------------------------+------------+------------+
| boot_successes                               | 10         | 0          |
| boot_failures                                | 0          | 10         |
| BUG:workqueue_lockup-pool                    | 0          | 10         |
| INFO:rcu_sched_detected_stalls_on_CPUs/tasks | 0          | 10         |
| BUG:kernel_hang_in_boot_stage                | 0          | 10         |
+----------------------------------------------+------------+------------+


If you fix the issue, kindly add the following tags
| Reported-by: kernel test robot <[email protected]>
| Link: https://lore.kernel.org/r/[email protected]


[ 60.568059][ C0] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 58s!
[ 60.569057][ C0] Showing busy workqueues and worker pools:
[ 60.569663][ C0] workqueue events: flags=0x0
[ 60.570057][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 60.570064][ C0] pending: vmstat_shepherd
[ 90.776058][ C0] BUG: workqueue lockup - pool cpus=0 node=0 flags=0x0 nice=0 stuck for 88s!
[ 90.777057][ C0] Showing busy workqueues and worker pools:
[ 90.777819][ C0] workqueue events: flags=0x0
[ 90.778056][ C0] pwq 0: cpus=0 node=0 flags=0x0 nice=0 active=1/256 refcnt=2
[ 90.778065][ C0] pending: vmstat_shepherd
[ 105.234045][ C0] rcu: INFO: rcu_sched detected stalls on CPUs/tasks:
[ 105.234045][ C0] (detected by 0, t=105002 jiffies, g=-1195, q=1 ncpus=2)
[ 105.234045][ C0] rcu: All QSes seen, last rcu_sched kthread activity 105002 (-194950--299952), jiffies_till_next_fqs=3, root ->qsmask 0x0
[ 105.234045][ C0] rcu: rcu_sched kthread starved for 105002 jiffies! g-1195 f0x2 RCU_GP_WAIT_FQS(5) ->state=0x0 ->cpu=0
[ 105.234045][ C0] rcu: Unless rcu_sched kthread gets sufficient CPU time, OOM is now expected behavior.
[ 105.234045][ C0] rcu: RCU grace-period kthread stack dump:
[ 105.234045][ C0] task:rcu_sched state:R running task stack: 7484 pid: 11 ppid: 2 flags:0x00004000
[ 105.234045][ C0] Call Trace:
[ 105.234045][ C0] ? __schedule+0x58a/0x5b8
[ 105.234045][ C0] ? schedule+0x83/0xba
[ 105.234045][ C0] ? schedule_timeout+0x88/0xa5
[ 105.234045][ C0] ? del_timer_sync+0x7d/0x7d
[ 105.234045][ C0] ? rcu_gp_fqs_loop+0xef/0x294
[ 105.234045][ C0] ? rcu_gp_kthread+0xd4/0xf0
[ 105.234045][ C0] ? kthread+0xc0/0xc5
[ 105.234045][ C0] ? rcu_gp_init+0x4c4/0x4c4
[ 105.234045][ C0] ? kthread_complete_and_exit+0x1b/0x1b
[ 105.234045][ C0] ? ret_from_fork+0x19/0x24
[ 105.234045][ C0] rcu: Stack dump where RCU GP kthread last ran:
[ 105.234045][ C0] NMI backtrace for cpu 0
[ 105.234045][ C0] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.0.0-rc7-00395-ge5ad41dae251 #1
[ 105.234045][ C0] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.0-debian-1.16.0-4 04/01/2014
[ 105.234045][ C0] Call Trace:
[ 105.234045][ C0] ? dump_stack_lvl+0x42/0x54
[ 105.234045][ C0] ? dump_stack+0xd/0x10
[ 105.234045][ C0] ? nmi_cpu_backtrace+0x96/0xb8
[ 105.234045][ C0] ? lapic_can_unplug_cpu+0x87/0x87
[ 105.234045][ C0] ? nmi_trigger_cpumask_backtrace+0x49/0xac
[ 105.234045][ C0] ? arch_trigger_cpumask_backtrace+0x15/0x17
[ 105.234045][ C0] ? rcu_check_gp_kthread_starvation+0x122/0x131
[ 105.234045][ C0] ? print_other_cpu_stall+0x264/0x2a9
[ 105.234045][ C0] ? print_other_cpu_stall+0x297/0x2a9
[ 105.234045][ C0] ? check_cpu_stall+0x174/0x1bd
[ 105.234045][ C0] ? rcu_sched_clock_irq+0xd7/0x186
[ 105.234045][ C0] ? update_process_times+0x45/0x60
[ 105.234045][ C0] ? tick_periodic+0xc0/0xcc
[ 105.234045][ C0] ? tick_handle_periodic+0x22/0x66
[ 105.234045][ C0] ? sysvec_call_function_single+0x2c/0x2c
[ 105.234045][ C0] ? __sysvec_apic_timer_interrupt+0xe4/0x182
[ 105.234045][ C0] ? sysvec_apic_timer_interrupt+0x1b/0x2e
[ 105.234045][ C0] ? handle_exception+0x133/0x133
[ 105.234045][ C0] ? rmi_firmware_update+0x3ab/0x3f7
[ 105.234045][ C0] ? sysvec_call_function_single+0x2c/0x2c
[ 105.234045][ C0] ? build_sched_domains+0x1e5/0x71c
[ 105.234045][ C0] ? sysvec_call_function_single+0x2c/0x2c
[ 105.234045][ C0] ? build_sched_domains+0x1e5/0x71c
[ 105.234045][ C0] ? sched_init_domains+0x73/0x77
[ 105.234045][ C0] ? sched_init_smp+0x26/0x6c
[ 105.234045][ C0] ? kernel_init_freeable+0x143/0x195
[ 105.234045][ C0] ? rest_init+0x13a/0x13a
[ 105.234045][ C0] ? kernel_init+0x17/0xf3
[ 105.234045][ C0] ? ret_from_fork+0x19/0x24



To reproduce:

# build kernel
cd linux
cp config-6.0.0-rc7-00395-ge5ad41dae251 .config
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 olddefconfig prepare modules_prepare bzImage modules
make HOSTCC=gcc-11 CC=gcc-11 ARCH=i386 INSTALL_MOD_PATH=<mod-install-dir> modules_install
cd <mod-install-dir>
find lib/ | cpio -o -H newc --quiet | gzip > modules.cgz


git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp qemu -k <bzImage> -m modules.cgz job-script # job-script is attached in this email

# if you come across any failure that blocks the test,
# please remove ~/.lkp and /lkp dir to run from a clean state.



--
0-DAY CI Kernel Test Service
https://01.org/lkp



Attachments:
(No filename) (6.30 kB)
config-6.0.0-rc7-00395-ge5ad41dae251 (167.76 kB)
job-script (5.11 kB)
dmesg.xz (8.44 kB)