2021-06-28 04:45:39

by Shinichiro Kawasaki

Subject: Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> >+ Jens and linux-kernel
> >
> >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> >> Hi there,
> >>
> >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> >> block/008 failed. A WARNING below was the cause of the failure.
> >>
> >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> >>
> >> I'm trying to recreate the failure repeating the test case, but so far, I am
> >> not able to. This failure looks rare, but actually, I observed it 3 times in
> >> the past one year.
> >>
> >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> >>
> >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> >> So it is likely that the test case triggers the warning, but I don't have
> >> clear view why and how the warning was triggered. It was observed on various
> >> block devices, so I would like to ask linux-block experts if anyone can tell
> >> what is going on. Comments will be appreciated.
>
> [...]
>
> >> [40041.712804][T135817] ------------[ cut here ]------------
> >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.=
> >c:3175 ttwu_queue_wakelist+0x284/0x2f0
> >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_targe=
> >t_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fi=
> >b_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_=
> >reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip=
> >6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6=
> > nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc =
> >ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table=
> >_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x=
> >86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_=
> >wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev=
> > lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi =
> >acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zr=
> >am ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc=
> >32c_intel sysimgblt
> >> [40041.728481][T135817] fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni=
> >_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_tran=
> >sport_sas pkcs8_key_parser [last unloaded: null_blk]
> >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-r=
> >c5+ #1
> >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, =
> >BIOS 3.2 11/22/2019
> >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4=
> >c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 0=
> >0 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04=
> >
> >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: =
> >ffff888811440000
> >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: =
> >ffffffffb603d6e8
> >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: =
> >ffffffffb6ba6167
> >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: =
> >0000000000000000
> >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: =
> >ffff888811440001
> >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0=
> >000) knlGS:0000000000000000
> >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033=
> >
> >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: =
> >00000000001706e0
> >> [40041.941669][T135817] Call Trace:
> >> [40041.944895][T135817] ? lock_is_held_type+0x98/0x110
> >> [40041.949860][T135817] try_to_wake_up+0x6f9/0x15e0
>
> 2) __queue_work
> raw_spin_lock(&pwq->pool->lock) with irq disabled
> insert_work
> wake_up_worker(pool);
> wake_up_process first_idle_worker(pool);
>
> Even if waker is lucky enough running on worker's CPU, what is weird is an
> idle worker can trigger the warning, given
>
> if (smp_load_acquire(&p->on_cpu) &&
> ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> goto unlock;
>
> because p->on_cpu must have been false for quite a while.
>
> Is there any chance for CPU hotplug to make a difference?
>
> Thoughts are welcome.
>
> Hillf

Hillf, thank you very much for the comments, and sorry about this late reply.

I walked through the related functions to understand your comments, but I have
to say that I still don't have enough background knowledge to provide valuable
comments back to you. I understand that the waker and the wakee are on the same
CPU, and that it is weird that p->on_cpu is true. This looks like the task
scheduler failing to track task status across task migrations triggered by CPU
hotplug, but as far as I can tell from the comments in kernel/sched/core.c, CPU
hotplug and task migration are folded into the design and implementation.

The blktests test case block/008 runs I/O against a test target block device
for 30 seconds. During this I/O, it repeatedly offlines and onlines CPUs: when
N CPUs are available, it offlines N-1 CPUs so that only one CPU stays online,
then onlines all CPUs again. It repeats this cycle until the I/O workload
completes. While only one CPU is online, the waker and the wakee can be on the
same CPU. Or, one of the waker or the wakee might have been migrated from
another CPU to the single online CPU. But it is still not clear to me why this
results in the WARNING.
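To make the offline/online cycle concrete, here is a dry-run sketch of what the
test case does (my own illustration, not the actual blktests script): it prints
the sysfs writes it would perform instead of executing them, since real hotplug
needs root.

```shell
#!/bin/bash
# Dry-run sketch of the block/008 hotplug cycle. The hotplug() helper
# and the fixed nr_cpus value are assumptions for illustration.
nr_cpus=4   # stand-in for the machine's CPU count

hotplug() {  # $1 = cpu number, $2 = 0 (offline) or 1 (online)
	echo "echo $2 > /sys/devices/system/cpu/cpu$1/online"
}

# Offline N-1 CPUs so that only CPU0 stays online ...
for ((c = 1; c < nr_cpus; c++)); do hotplug "$c" 0; done
# ... then bring every CPU back online again.
for ((c = 1; c < nr_cpus; c++)); do hotplug "$c" 1; done
```

The real test repeats this cycle in a loop while fio runs I/O against the test
device.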

Now I'm trying to recreate the failure more reliably. By repeating the test
cases in the "block" group on the kernel v5.13-rc5, I was able to recreate it;
it took 3 to 5 hours. The test target block device was a null_blk device with a
rather unique configuration (a zoned device with zone capacity smaller than
zone size). I will try to confirm the recreation with the latest kernel version
and other block devices.

--
Best Regards,
Shin'ichiro Kawasaki

>
> >> [40041.954567][T135817] ? migrate_swap_stop+0x970/0x970
> >> [40041.959618][T135817] ? insert_work+0x193/0x2e0
> >> [40041.964152][T135817] __queue_work+0x4e8/0xdb0
> >> [40041.968599][T135817] ? try_to_grab_pending.part.0+0x439/0x530
> >> [40041.974429][T135817] ? hctx_unlock+0x7d/0xe0
> >> [40041.978790][T135817] mod_delayed_work_on+0x8c/0x100
> >> [40041.983755][T135817] ? cancel_delayed_work+0x1f0/0x1f0
> >> [40041.988982][T135817] ? __blk_mq_delay_run_hw_queue+0x95/0x4b0
> >> [40041.994817][T135817] kblockd_mod_delayed_work_on+0x17/0x20
>
> 1) try to queue work on kblockd_workqueue that is bound percpu workqueue, so
> kworker is bound to one CPU.
>
> >> [40042.000390][T135817] blk_mq_run_hw_queue+0x125/0x270
> >> [40042.005439][T135817] ? blk_mq_delay_run_hw_queues+0x410/0x410
> >> [40042.011275][T135817] blk_mq_sched_insert_request+0x294/0x420
> >> [40042.017020][T135817] ? __blk_mq_sched_bio_merge+0x340/0x340
> >> [40042.022677][T135817] ? blk_mq_rq_ctx_init+0x99a/0xe80
> >> [40042.027819][T135817] blk_mq_submit_bio+0xb37/0x15f0
> >> [40042.032786][T135817] ? blk_mq_try_issue_list_directly+0x940/0x940
> >> [40042.038966][T135817] ? percpu_ref_put_many.constprop.0+0x82/0x1b0
> >> [40042.045148][T135817] submit_bio_noacct+0x79c/0xe60
> >> [40042.050023][T135817] ? blk_queue_enter+0x810/0x810
> >> [40042.054902][T135817] ? find_held_lock+0x2c/0x110
> >> [40042.059611][T135817] submit_bio+0xe4/0x4c0
> >> [40042.063795][T135817] ? submit_bio_noacct+0xe60/0xe60
> >> [40042.068847][T135817] ? bio_add_zone_append_page+0x2c0/0x2c0
> >> [40042.074507][T135817] ? bio_release_pages.part.0+0x10c/0x3d0
> >> [40042.080167][T135817] ? __blkdev_direct_IO_simple+0x436/0x7d0
> >> [40042.085912][T135817] __blkdev_direct_IO_simple+0x40d/0x7d0
> >> [40042.091488][T135817] ? bd_link_disk_holder+0x6e0/0x6e0
> >> [40042.096711][T135817] ? __lock_acquire+0xbbc/0x51b0
> >> [40042.101591][T135817] ? mark_lock+0xe4/0x18a0
> >> [40042.105953][T135817] ? set_init_blocksize.isra.0+0x140/0x140
> >> [40042.111698][T135817] ? mark_lock+0xe4/0x18a0
> >> [40042.116054][T135817] ? lock_is_held_type+0x98/0x110
> >> [40042.121021][T135817] ? find_held_lock+0x2c/0x110
> >> [40042.125730][T135817] blkdev_direct_IO+0xd23/0x11d0
> >> [40042.130607][T135817] ? __lock_acquire+0x15c2/0x51b0
> >> [40042.135577][T135817] ? bd_prepare_to_claim+0x2a0/0x2a0
> >> [40042.140799][T135817] ? __lock_acquire+0x15c2/0x51b0
> >> [40042.145770][T135817] generic_file_read_iter+0x1f4/0x470
> >> [40042.151080][T135817] blkdev_read_iter+0x100/0x190
> >> [40042.155871][T135817] new_sync_read+0x352/0x5d0
> >> [40042.160403][T135817] ? __ia32_sys_llseek+0x310/0x310
> >> [40042.165458][T135817] ? __cond_resched+0x15/0x30
> >> [40042.170076][T135817] ? inode_security+0x56/0xf0
> >> [40042.174698][T135817] vfs_read+0x26c/0x470
> >> [40042.178796][T135817] __x64_sys_pread64+0x17d/0x1d0
> >> [40042.183674][T135817] ? vfs_read+0x470/0x470
> >> [40042.187946][T135817] ? syscall_enter_from_user_mode+0x27/0x70
> >> [40042.193780][T135817] ? trace_hardirqs_on+0x1c/0x110
> >> [40042.198747][T135817] do_syscall_64+0x40/0x80
> >> [40042.203107][T135817] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.208939][T135817] RIP: 0033:0x7f8eb5cdb96f
> >> [40042.213297][T135817] Code: 08 89 3c 24 48 89 4c 24 18 e8 7d f3 ff ff 4=
> >c 8b 54 24 18 48 8b 54 24 10 41 89 c0 48 8b 74 24 08 8b 3c 24 b8 11 00 00 0=
> >0 0f 05 <48> 3d 00 f0 ff ff 77 31 44 89 c7 48 89 04 24 e8 cd f3 ff ff 48 8b=
> >
> >> [40042.232824][T135817] RSP: 002b:00007fffaa10c9d0 EFLAGS: 00000293 ORIG_=
> >RAX: 0000000000000011
> >> [40042.241169][T135817] RAX: ffffffffffffffda RBX: 00000000012e2540 RCX: =
> >00007f8eb5cdb96f
> >> [40042.249074][T135817] RDX: 0000000000001000 RSI: 000000000130a000 RDI: =
> >0000000000000009
> >> [40042.256985][T135817] RBP: 00000000012e2540 R08: 0000000000000000 R09: =
> >00007fffaa163080
> >> [40042.264898][T135817] R10: 0000000000862000 R11: 0000000000000293 R12: =
> >00007f8e96829458
> >> [40042.272813][T135817] R13: 0000000000000000 R14: 0000000000001000 R15: =
> >00000000012e2568
> >> [40042.280732][T135817] irq event stamp: 1042448
> >> [40042.285084][T135817] hardirqs last enabled at (1042447): [<ffffffffb3=
> >20209e>] try_to_grab_pending.part.0+0x1ae/0x530
> >> [40042.295683][T135817] hardirqs last disabled at (1042448): [<ffffffffb3=
> >202b3b>] mod_delayed_work_on+0xcb/0x100
> >> [40042.305591][T135817] softirqs last enabled at (1041222): [<ffffffffb3=
> >1b5316>] __irq_exit_rcu+0x1c6/0x270
> >> [40042.315150][T135817] softirqs last disabled at (1041215): [<ffffffffb3=
> >1b5316>] __irq_exit_rcu+0x1c6/0x270
> >> [40042.324711][T135817] ---[ end trace b6f997c9a553aa02 ]---
> >> [40042.330118][ C1]
> >> [40042.330119][ C1] =================
> >==========================
> >=============
> >> [40042.330121][ C1] WARNING: possible circular locking dependency dete=
> >cted
> >> [40042.330123][ C1] 5.13.0-rc5+ #1 Not tainted
> >> [40042.330125][ C1] --------------------------------------------------=
> >----
> >> [40042.330126][ C1] fio/135817 is trying to acquire lock:
> >> [40042.330128][ C1] ffffffffb641acb8 ((console_sem).lock){-.-.}-{2:2},=
> > at: down_trylock+0x13/0x70
> >> [40042.330134][ C1]
> >> [40042.330135][ C1] but task is already holding lock:
> >> [40042.330137][ C1] ffff888117ec3f30 (&p->pi_lock){-.-.}-{2:2}, at: tr=
> >y_to_wake_up+0x88/0x15e0
> >> [40042.330142][ C1]
> >> [40042.330144][ C1] which lock already depends on the new lock.
> >> [40042.330145][ C1]
> >> [40042.330146][ C1]
> >> [40042.330147][ C1] the existing dependency chain (in reverse order) i=
> >s:
> >> [40042.330149][ C1]
> >> [40042.330150][ C1] -> #1 (&p->pi_lock){-.-.}-{2:2}:
> >> [40042.330155][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330156][ C1] try_to_wake_up+0x88/0x15e0
> >> [40042.330157][ C1] up+0x7a/0xb0
> >> [40042.330159][ C1] __up_console_sem+0x2d/0x60
> >> [40042.330161][ C1] console_unlock+0x3f7/0x840
> >> [40042.330162][ C1] vprintk_emit+0x257/0x420
> >> [40042.330163][ C1] devkmsg_emit.constprop.0+0x95/0xb1
> >> [40042.330165][ C1] devkmsg_write.cold+0x48/0x74
> >> [40042.330166][ C1] do_iter_readv_writev+0x32b/0x6b0
> >> [40042.330167][ C1] do_iter_write+0x137/0x5d0
> >> [40042.330169][ C1] vfs_writev+0x147/0x4a0
> >> [40042.330170][ C1] do_writev+0x100/0x260
> >> [40042.330172][ C1] do_syscall_64+0x40/0x80
> >> [40042.330173][ C1] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.330175][ C1]
> >> [40042.330176][ C1] -> #0 ((console_sem).lock){-.-.}-{2:2}:
> >> [40042.330181][ C1] __lock_acquire+0x2940/0x51b0
> >> [40042.330182][ C1] lock_acquire+0x181/0x6a0
> >> [40042.330183][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330184][ C1] down_trylock+0x13/0x70
> >> [40042.330185][ C1] __down_trylock_console_sem+0x24/0x90
> >> [40042.330186][ C1] vprintk_emit+0x230/0x420
> >> [40042.330187][ C1] printk+0x96/0xb2
> >> [40042.330187][ C1] report_bug.cold+0x31/0x53
> >> [40042.330188][ C1] handle_bug+0x3c/0x70
> >> [40042.330189][ C1] exc_invalid_op+0x14/0x40
> >> [40042.330189][ C1] asm_exc_invalid_op+0x12/0x20
> >> [40042.330190][ C1] ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330191][ C1] try_to_wake_up+0x6f9/0x15e0
> >> [40042.330192][ C1] __queue_work+0x4e8/0xdb0
> >> [40042.330193][ C1] mod_delayed_work_on+0x8c/0x100
> >> [40042.330194][ C1] kblockd_mod_delayed_work_on+0x17/0x20
> >> [40042.330194][ C1] blk_mq_run_hw_queue+0x125/0x270
> >> [40042.330195][ C1] blk_mq_sched_insert_request+0x294/0x420
> >> [40042.330196][ C1] blk_mq_submit_bio+0xb37/0x15f0
> >> [40042.330197][ C1] submit_bio_noacct+0x79c/0xe60
> >> [40042.330198][ C1] submit_bio+0xe4/0x4c0
> >> [40042.330199][ C1] __blkdev_direct_IO_simple+0x40d/0x7d0
> >> [40042.330199][ C1] blkdev_direct_IO+0xd23/0x11d0
> >> [40042.330200][ C1] generic_file_read_iter+0x1f4/0x470
> >> [40042.330201][ C1] blkdev_read_iter+0x100/0x190
> >> [40042.330202][ C1] new_sync_read+0x352/0x5d0
> >> [40042.330203][ C1] vfs_read+0x26c/0x470
> >> [40042.330203][ C1] __x64_sys_pread64+0x17d/0x1d0
> >> [40042.330204][ C1] do_syscall_64+0x40/0x80
> >> [40042.330205][ C1] entry_SYSCALL_64_after_hwframe+0x44/0xae
> >> [40042.330206][ C1]
> >> [40042.330206][ C1] other info that might help us debug this:
> >> [40042.330207][ C1]
> >> [40042.330208][ C1] Possible unsafe locking scenario:
> >> [40042.330208][ C1]
> >> [40042.330209][ C1] CPU0 CPU1
> >> [40042.330210][ C1] ---- ----
> >> [40042.330210][ C1] lock(&p->pi_lock);
> >> [40042.330212][ C1] lock((console_sem).=
> >lock);
> >> [40042.330214][ C1] lock(&p->pi_lock);=
> >
> >> [40042.330216][ C1] lock((console_sem).lock);
> >> [40042.330217][ C1]
> >> [40042.330218][ C1] *** DEADLOCK ***
> >> [40042.330218][ C1]
> >> [40042.330219][ C1] 3 locks held by fio/135817:
> >> [40042.330220][ C1] #0: ffffffffb643ae80 (rcu_read_lock){....}-{1:2},=
> > at: __queue_work+0xa0/0xdb0
> >> [40042.330223][ C1] #1: ffff8888114730d8 (&pool->lock){-.-.}-{2:2}, a=
> >t: __queue_work+0x2ce/0xdb0
> >> [40042.330227][ C1] #2: ffff888117ec3f30 (&p->pi_lock){-.-.}-{2:2}, a=
> >t: try_to_wake_up+0x88/0x15e0
> >> [40042.330230][ C1]
> >> [40042.330231][ C1] stack backtrace:
> >> [40042.330231][ C1] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-rc=
> >5+ #1
> >> [40042.330232][ C1] Hardware name: Supermicro Super Server/X10SRL-F, B=
> >IOS 3.2 11/22/2019
> >> [40042.330233][ C1] Call Trace:
> >> [40042.330234][ C1] dump_stack+0x93/0xc2
> >> [40042.330234][ C1] check_noncircular+0x235/0x2b0
> >> [40042.330235][ C1] ? print_circular_bug+0x1f0/0x1f0
> >> [40042.330236][ C1] ? enable_ptr_key_workfn+0x20/0x20
> >> [40042.330237][ C1] ? desc_read+0x218/0x2e0
> >> [40042.330237][ C1] ? vsnprintf+0x830/0x15c0
> >> [40042.330238][ C1] __lock_acquire+0x2940/0x51b0
> >> [40042.330239][ C1] ? printk_sprint+0x89/0x110
> >> [40042.330239][ C1] ? lockdep_hardirqs_on_prepare+0x3e0/0x3e0
> >> [40042.330240][ C1] ? vprintk_store+0x581/0x630
> >> [40042.330241][ C1] lock_acquire+0x181/0x6a0
> >> [40042.330242][ C1] ? down_trylock+0x13/0x70
> >> [40042.330242][ C1] ? lock_release+0x680/0x680
> >> [40042.330243][ C1] ? lock_chain_count+0x20/0x20
> >> [40042.330244][ C1] ? lock_downgrade+0x6a0/0x6a0
> >> [40042.330245][ C1] ? printk+0x96/0xb2
> >> [40042.330246][ C1] _raw_spin_lock_irqsave+0x3b/0x60
> >> [40042.330248][ C1] ? down_trylock+0x13/0x70
> >> [40042.330249][ C1] down_trylock+0x13/0x70
> >> [40042.330250][ C1] ? printk+0x96/0xb2
> >> [40042.330252][ C1] __down_trylock_console_sem+0x24/0x90
> >> [40042.330253][ C1] vprintk_emit+0x230/0x420
> >> [40042.330254][ C1] ? ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330255][ C1] printk+0x96/0xb2
> >> [40042.330256][ C1] ? record_print_text.cold+0x11/0x11
> >> [40042.330258][ C1] report_bug.cold+0x31/0x53
> >> [40042.330259][ C1] ? ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330260][ C1] handle_bug+0x3c/0x70
> >> [40042.330262][ C1] exc_invalid_op+0x14/0x40
> >> [40042.330263][ C1] asm_exc_invalid_op+0x12/0x20
> >> [40042.330264][ C1] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> >> [40042.330266][ C1] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4c=
> > 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 00=
> > eb a0 <0f> 0b 45 31 ff e9 cb
> >> [40042.330269][ C1] Lost 69 message(s)!
> >>
> >
> >
> >--
> >Damien Le Moal
> >Western Digital Research


2021-07-09 03:58:07

by Shinichiro Kawasaki

Subject: Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > >+ Jens and linux-kernel
> > >
> > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > >> Hi there,
> > >>
> > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > >> block/008 failed. A WARNING below was the cause of the failure.
> > >>
> > >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >>
> > >> I'm trying to recreate the failure repeating the test case, but so far, I am
> > >> not able to. This failure looks rare, but actually, I observed it 3 times in
> > >> the past one year.
> > >>
> > >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > >>
> > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > >> So it is likely that the test case triggers the warning, but I don't have
> > >> clear view why and how the warning was triggered. It was observed on various
> > >> block devices, so I would like to ask linux-block experts if anyone can tell
> > >> what is going on. Comments will be appreciated.
> >
> > [...]
> >
> > >> [40041.712804][T135817] ------------[ cut here ]------------
> > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.=
> > >c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_targe=
> > >t_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fi=
> > >b_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_=
> > >reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip=
> > >6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6=
> > > nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc =
> > >ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table=
> > >_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x=
> > >86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_=
> > >wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev=
> > > lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi =
> > >acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zr=
> > >am ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc=
> > >32c_intel sysimgblt
> > >> [40041.728481][T135817] fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni=
> > >_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_tran=
> > >sport_sas pkcs8_key_parser [last unloaded: null_blk]
> > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-r=
> > >c5+ #1
> > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, =
> > >BIOS 3.2 11/22/2019
> > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4=
> > >c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 0=
> > >0 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04=
> > >
> > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: =
> > >ffff888811440000
> > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: =
> > >ffffffffb603d6e8
> > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: =
> > >ffffffffb6ba6167
> > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: =
> > >0000000000000000
> > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: =
> > >ffff888811440001
> > >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0=
> > >000) knlGS:0000000000000000
> > >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033=
> > >
> > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: =
> > >00000000001706e0
> > >> [40041.941669][T135817] Call Trace:
> > >> [40041.944895][T135817] ? lock_is_held_type+0x98/0x110
> > >> [40041.949860][T135817] try_to_wake_up+0x6f9/0x15e0
> >
> > 2) __queue_work
> > raw_spin_lock(&pwq->pool->lock) with irq disabled
> > insert_work
> > wake_up_worker(pool);
> > wake_up_process first_idle_worker(pool);
> >
> > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > idle worker can trigger the warning, given
> >
> > if (smp_load_acquire(&p->on_cpu) &&
> > ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > goto unlock;
> >
> > because p->on_cpu must have been false for quite a while.
> >
> > Is there any chance for CPU hotplug to make a difference?
> >
> > Thoughts are welcome.
> >
> > Hillf
>
> Hillf, thank you very much for the comments, and sorry about this late reply.
>
> I walked through related functions to understand your comments and, but I have
> to say that I still don't have enough background knowledge to provide valuable
> comments back to you. I understand that the waker and the wakee are on same CPU,
> and it is weird that p->on_cpu is true. This looks indicating that the task
> scheduler failing to control task status on task migration triggered by CPU
> hotplugs, but as far as I read comments in kernel/sched/core.c, CPU hotplug and
> task migration are folded into the design and implementation.
>
> The blktests test case block/008 runs I/Os to a test target block device 30
> seconds. During this I/O, it repeats offlining CPUs and onlining CPUs: when
> there are N CPUs are available, it offlines N-1 CPUs to have only one online
> CPU, then onlines all CPUs again. It repeats this online and offline until I/O
> workload completes. When the only one CPU is online, the waker and the wakee can
> be on the same CPU. Or, one of the waker or the wakee might have been migrated
> from other CPUs to the only one online CPU. But still it is not clear for me why
> it results in the WARNING.
>
> Now I'm trying to recreate the failure. By repeating test cases in "block
> group" on the kernel v5.13-rc5, I was able to recreate the failure. It took 3 to
> 5 hours to recreate it. The test target block device used was a null_blk with
> rather unique configuration (zoned device with zone capacity smaller than zone
> size). I will try to confirm the failure recreation with latest kernel version
> and other block devices.

I tried some device setups, and found that a dm-linear device on top of
null_blk recreates the warning consistently. In case anyone wishes to recreate
it, let me share the bash script below that I used. I ran it several times on
the kernel v5.13, and every run recreated the warning. On my system (nproc is
8), it took from 3 to 6 hours to recreate.

--------------------------------------------------------------------------------
#!/bin/bash

# create a null_blk device
declare sysfs=/sys/kernel/config/nullb/nullb0
modprobe null_blk nr_devices=0
mkdir "${sysfs}"
echo 1024 > "${sysfs}"/size
echo 1 > "${sysfs}"/memory_backed
echo 1 > "${sysfs}"/power
sleep 1

# create a dm-linear device on the null_blk device
echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
sleep 1

# run blktests, block/008 many times
git clone https://github.com/osandov/blktests.git
cd blktests
echo "TEST_DEVS=( /dev/mapper/test )" > config
for ((i = 0; i < 1000; i++)); do
	echo $i
	if ! ./check block/008; then
		break
	fi
done

--
Best Regards,
Shin'ichiro Kawasaki

2021-10-29 00:56:20

by Shinichiro Kawasaki

Subject: Re: WARNING at blktests block/008 in ttwu_queue_wakelist()

The WARNING I reported with the kernel v5.13-rc5 is still observed with
v5.15-rc7. It took long, but now I think I understand the cause. Here I share my
findings in line.

On Jul 09, 2021 / 03:57, Shinichiro Kawasaki wrote:
> On Jun 28, 2021 / 13:43, Shin'ichiro Kawasaki wrote:
> > On Jun 10, 2021 / 16:32, Hillf Danton wrote:
> > > On Wed, 9 Jun 2021 06:55:59 +0000 Damien Le Moal wrote:
> > > >+ Jens and linux-kernel
> > > >
> > > >On 2021/06/09 15:53, Shinichiro Kawasaki wrote:
> > > >> Hi there,
> > > >>
> > > >> Let me share a blktests failure. When I ran blktests on the kernel v5.13-rc5,
> > > >> block/008 failed. A WARNING below was the cause of the failure.
> > > >>
> > > >> WARNING: CPU: 1 PID: 135817 at kernel/sched/core.c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >>
> > > >> I'm trying to recreate the failure repeating the test case, but so far, I am
> > > >> not able to. This failure looks rare, but actually, I observed it 3 times in
> > > >> the past one year.
> > > >>
> > > >> 1) Oct/2020, kernel: v5.9-rc7 test dev: dm-flakey on AHCI-SATA SMR HDD, log [1]
> > > >> 2) Mar/2021, kernel: v5.12-rc2 test dev: AHCI-SATA SMR HDD, log [2]
> > > >> 3) Jun/2021, kernel: v5.13-rc5 test dev: dm-linear on null_blk zoned, log [3]
> > > >>
> > > >> The test case block/008 does IO during CPU hotplug, and the WARNING in
> > > >> ttwu_queue_wakelist() checks "WARN_ON_ONCE(cpu == smp_processor_id())".
> > > >> So it is likely that the test case triggers the warning, but I don't have
> > > >> clear view why and how the warning was triggered. It was observed on various
> > > >> block devices, so I would like to ask linux-block experts if anyone can tell
> > > >> what is going on. Comments will be appreciated.
> > >
> > > [...]
> > >
> > > >> [40041.712804][T135817] ------------[ cut here ]------------
> > > >> [40041.718489][T135817] WARNING: CPU: 1 PID: 135817 at kernel/sched/core.=
> > > >c:3175 ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.728311][T135817] Modules linked in: null_blk dm_flakey iscsi_targe=
> > > >t_mod tcm_loop target_core_pscsi target_core_file target_core_iblock nft_fi=
> > > >b_inet nft_fib_ipv4 nft_fib_ipv6 nft_fib nft_reject_inet nf_reject_ipv4 nf_=
> > > >reject_ipv6 nft_reject nft_ct nft_chain_nat ip6table_nat ip6table_mangle ip=
> > > >6table_raw ip6table_security iptable_nat nf_nat nf_conntrack nf_defrag_ipv6=
> > > > nf_defrag_ipv4 iptable_mangle iptable_raw bridge iptable_security stp llc =
> > > >ip_set rfkill nf_tables target_core_user target_core_mod nfnetlink ip6table=
> > > >_filter ip6_tables iptable_filter sunrpc intel_rapl_msr intel_rapl_common x=
> > > >86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass iTCO_=
> > > >wdt intel_pmc_bxt iTCO_vendor_support rapl intel_cstate intel_uncore joydev=
> > > > lpc_ich i2c_i801 i2c_smbus ses enclosure mei_me mei ipmi_ssif ioatdma wmi =
> > > >acpi_ipmi ipmi_si ipmi_devintf ipmi_msghandler acpi_power_meter acpi_pad zr=
> > > >am ip_tables ast drm_vram_helper drm_kms_helper syscopyarea sysfillrect crc=
> > > >32c_intel sysimgblt
> > > >> [40041.728481][T135817] fb_sys_fops cec drm_ttm_helper ttm ghash_clmulni=
> > > >_intel drm igb mpt3sas nvme dca i2c_algo_bit nvme_core raid_class scsi_tran=
> > > >sport_sas pkcs8_key_parser [last unloaded: null_blk]
> > > >> [40041.832215][T135817] CPU: 1 PID: 135817 Comm: fio Not tainted 5.13.0-r=
> > > >c5+ #1
> > > >> [40041.839262][T135817] Hardware name: Supermicro Super Server/X10SRL-F, =
> > > >BIOS 3.2 11/22/2019
> > > >> [40041.847434][T135817] RIP: 0010:ttwu_queue_wakelist+0x284/0x2f0
> > > >> [40041.853266][T135817] Code: 34 24 e8 6f 71 64 00 4c 8b 44 24 10 48 8b 4=
> > > >c 24 08 8b 34 24 e9 a1 fe ff ff e8 a8 71 64 00 e9 66 ff ff ff e8 be 71 64 0=
> > > >0 eb a0 <0f> 0b 45 31 ff e9 cb fe ff ff 48 89 04 24 e8 49 71 64 00 48 8b 04=
> > > >
> > > >> [40041.872793][T135817] RSP: 0018:ffff888106bff348 EFLAGS: 00010046
> > > >> [40041.878800][T135817] RAX: 0000000000000001 RBX: ffff888117ec3240 RCX: =
> > > >ffff888811440000
> > > >> [40041.886711][T135817] RDX: 0000000000000000 RSI: 0000000000000001 RDI: =
> > > >ffffffffb603d6e8
> > > >> [40041.894625][T135817] RBP: 0000000000000001 R08: ffffffffb603d6e8 R09: =
> > > >ffffffffb6ba6167
> > > >> [40041.902537][T135817] R10: fffffbfff6d74c2c R11: 0000000000000001 R12: =
> > > >0000000000000000
> > > >> [40041.910451][T135817] R13: ffff88881145fd00 R14: 0000000000000001 R15: =
> > > >ffff888811440001
> > > >> [40041.918364][T135817] FS: 00007f8eabf14b80(0000) GS:ffff888811440000(0=
> > > >000) knlGS:0000000000000000
> > > >> [40041.927229][T135817] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033=
> > > >
> > > >> [40041.933756][T135817] CR2: 000055ce81e2cc78 CR3: 000000011be92005 CR4: =
> > > >00000000001706e0
> > > >> [40041.941669][T135817] Call Trace:
> > > >> [40041.944895][T135817] ? lock_is_held_type+0x98/0x110
> > > >> [40041.949860][T135817] try_to_wake_up+0x6f9/0x15e0
> > >
> > > 2) __queue_work
> > > raw_spin_lock(&pwq->pool->lock) with irq disabled
> > > insert_work
> > > wake_up_worker(pool);
> > > wake_up_process first_idle_worker(pool);
> > >
> > > Even if waker is lucky enough running on worker's CPU, what is weird is an
> > > idle worker can trigger the warning, given
> > >
> > > if (smp_load_acquire(&p->on_cpu) &&
> > > ttwu_queue_wakelist(p, task_cpu(p), wake_flags | WF_ON_CPU))
> > > goto unlock;
> > >
> > > because p->on_cpu must have been false for quite a while.

The call trace indicated that try_to_wake_up() called ttwu_queue_wakelist(), so
I had assumed that the ttwu_queue_wakelist() call was the one in the hunk Hillf
quoted above. I also thought that was weird, but then I noticed that there is
another path to ttwu_queue_wakelist(): try_to_wake_up() also calls ttwu_queue()
at another place, and ttwu_queue() calls ttwu_queue_wakelist(). I confirmed
that the warning is reported through this call path. This path can happen even
when p->on_cpu is false.

> > >
> > > Is there any chance for CPU hotplug to make a difference?

I found that the sd_llc_id changes caused by CPU hotplug affect
ttwu_queue_wakelist(). When the warning is reported, ttwu_queue_cond() returns
true in the ttwu_queue_wakelist() hunk below.

	if (sched_feat(TTWU_QUEUE) && ttwu_queue_cond(cpu, wake_flags)) {
		if (WARN_ON_ONCE(cpu == smp_processor_id()))
			return false;

And in the ttwu_queue_cond() hunk below, cpus_share_cache() returns false even
though smp_processor_id() and cpu are the same CPU. This is weird: why would a
single CPU not share a cache with itself?

	/*
	 * If the CPU does not share cache, then queue the task on the
	 * remote rqs wakelist to avoid accessing remote data.
	 */
	if (!cpus_share_cache(smp_processor_id(), cpu))
		return true;

In cpus_share_cache(), sd_llc_id is read twice by the line below.

return per_cpu(sd_llc_id, this_cpu) == per_cpu(sd_llc_id, that_cpu);

When this_cpu == that_cpu, the two sd_llc_id values are expected to be the same.
However, while CPU hotplug is ongoing, the scheduler domains are destroyed and
rebuilt, which modifies sd_llc_id. If that happens between the two per_cpu()
reads, they can observe different values.

I think cpus_share_cache() needs a fix that accounts for sd_llc_id changing
concurrently. My idea is to add an early check for this_cpu == that_cpu, so that
a CPU is always considered to share a cache with itself. Based on this idea, I
created a fix, and with it the warning is no longer triggered. I will post the
patch for further discussion and review.

--
Best Regards,
Shin'ichiro Kawasaki

> > >
> > > Thoughts are welcome.
> > >
> > > Hillf
> >
> > Hillf, thank you very much for the comments, and sorry about this late reply.
> >
> > I walked through the related functions to understand your comments, but I have
> > to say that I still don't have enough background knowledge to provide valuable
> > comments back to you. I understand that the waker and the wakee are on the same
> > CPU, and that it is weird that p->on_cpu is true. This seems to indicate that
> > the task scheduler fails to track task status across the task migrations
> > triggered by CPU hotplug, but as far as I can tell from the comments in
> > kernel/sched/core.c, CPU hotplug and task migration are folded into the design
> > and implementation.
> >
> > The blktests test case block/008 runs I/Os against a test target block device
> > for 30 seconds. During this I/O, it repeatedly offlines and onlines CPUs: when
> > N CPUs are available, it offlines N-1 of them so that only one CPU stays
> > online, then onlines all CPUs again. It repeats this cycle until the I/O
> > workload completes. While only one CPU is online, the waker and the wakee can
> > be on the same CPU, or one of them might have been migrated from another CPU to
> > the one remaining online CPU. But it is still not clear to me why this results
> > in the WARNING.
> >
> > Now I'm trying to recreate the failure. By repeating the test cases in the
> > "block" group on kernel v5.13-rc5, I was able to recreate it; it took 3 to 5
> > hours. The test target block device was a null_blk device with a rather unique
> > configuration (a zoned device with zone capacity smaller than zone size). I
> > will try to confirm the recreation with the latest kernel version and other
> > block devices.
>
> I tried some device setups and found that a dm-linear device on top of null_blk
> recreates the warning consistently. In case anyone wishes to recreate it, let
> me share the bash script I used below. I ran it several times on kernel v5.13,
> and every run recreated the warning. On my system (nproc is 8), it took 3 to 6
> hours to recreate.
>
> --------------------------------------------------------------------------------
> #!/bin/bash
>
> # create a null_blk device
> declare sysfs=/sys/kernel/config/nullb/nullb0
> modprobe null_blk nr_devices=0
> mkdir "${sysfs}"
> echo 1024 > "${sysfs}"/size
> echo 1 > "${sysfs}"/memory_backed
> echo 1 > "${sysfs}"/power
> sleep 1
>
> # create a dm-linear device on the null_blk device
> echo "0 $((0x2000 * 4)) linear /dev/nullb0 0" | dmsetup create test
> sleep 1
>
> # run blktests, block/008 many times
> git clone https://github.com/osandov/blktests.git
> cd blktests
> echo "TEST_DEVS=( /dev/mapper/test )" > config
> for ((i = 0; i < 1000; i++)); do
>         echo $i
>         if ! ./check block/008; then
>                 break
>         fi
> done
>
> --
> Best Regards,
> Shin'ichiro Kawasaki