Just a heads up. Repeatedly compiling kernels for a while triggers
endless soft-lockups since next-20200519 on both x86_64 and powerpc.
The .config files are in:
https://github.com/cailca/linux-mm
I first tried reverting the linux-next commit 68cd9f4e7238
("tick/nohz: Narrow down noise while setting current task's tick
dependency"), but it did not help.
== x86_64 ==
[ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
flush_smp_call_function_queue+0x1fa/0x2e0
[ 1168.003333][ C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
dm_region_hash dm_log dm_mod
[ 1168.029492][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
5.7.0-rc6-next-20200519 #1
[ 1168.037665][ C1] Hardware name: HPE ProLiant DL385
Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
[ 1168.046978][ C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
[ 1168.053658][ C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
10 52 3f 71
[ 1168.073262][ C1] RSP: 0018:ffffc90000178918 EFLAGS: 00010046
[ 1168.079253][ C1] RAX: 0000000000000000 RBX: ffff8888430c58f8
RCX: ffffffff8ec26083
[ 1168.087156][ C1] RDX: 0000000000000003 RSI: dffffc0000000000
RDI: ffff8888430c58f8
[ 1168.095054][ C1] RBP: ffffc900001789a8 R08: ffffed1108618cec
R09: ffffed1108618cec
[ 1168.102964][ C1] R10: ffff8888430c675b R11: 0000000000000000
R12: ffff8888430c58e0
[ 1168.110866][ C1] R13: ffffffff8eb30c40 R14: ffff8888430c5880
R15: ffff8888430c58e0
[ 1168.118767][ C1] FS: 0000000000000000(0000)
GS:ffff888843080000(0000) knlGS:0000000000000000
[ 1168.127628][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1168.134129][ C1] CR2: 000055b169604560 CR3: 0000000d08a14000
CR4: 00000000003406e0
[ 1168.142026][ C1] Call Trace:
[ 1168.145206][ C1] <IRQ>
[ 1168.147957][ C1] ? smp_call_on_cpu_callback+0xd0/0xd0
[ 1168.153421][ C1] ? rcu_read_lock_sched_held+0xac/0xe0
[ 1168.158880][ C1] ? rcu_read_lock_bh_held+0xc0/0xc0
[ 1168.164076][ C1] generic_smp_call_function_single_interrupt+0x13/0x2b
[ 1168.170938][ C1] smp_call_function_single_interrupt+0x157/0x4e0
[ 1168.177278][ C1] ? smp_call_function_interrupt+0x4e0/0x4e0
[ 1168.183172][ C1] ? interrupt_entry+0xe4/0xf0
[ 1168.187846][ C1] ? trace_hardirqs_off_caller+0x8d/0x1f0
[ 1168.193478][ C1] ? trace_hardirqs_on_caller+0x1f0/0x1f0
[ 1168.199116][ C1] ? _nohz_idle_balance+0x221/0x360
[ 1168.204228][ C1] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1168.209690][ C1] call_function_single_interrupt+0xf/0x20
[ 1168.215415][ C1] RIP: 0010:_raw_spin_unlock_irqrestore+0x46/0x50
[ 1168.221747][ C1] Code: 8d 5e ff 4c 89 e7 e8 a9 35 5f ff f6 c7 02
75 13 53 9d e8 fd c0 6f ff 65 ff 0d 4e ab a6 70 5b 41 5c 5d c3 e8 dc
c2 6f ff 53 9d <eb> eb 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 65 ff
05 2b ab a6
[ 1168.241353][ C1] RSP: 0018:ffffc90000178bd0 EFLAGS: 00000246
ORIG_RAX: ffffffffffffff04
[ 1168.249700][ C1] RAX: 0000000000000000 RBX: 0000000000000246
RCX: ffffffff8eba0740
[ 1168.257602][ C1] RDX: 0000000000000007 RSI: dffffc0000000000
RDI: ffff888214f5c8e4
[ 1168.265503][ C1] RBP: ffffc90000178be0 R08: fffffbfff2120216
R09: 0000000000000000
[ 1168.273400][ C1] R10: 0000000000000000 R11: 0000000000000000
R12: ffff888843145880
[ 1168.281300][ C1] R13: ffffffff90b2db80 R14: 0000000000000002
R15: 00000001000164cb
[ 1168.289218][ C1] ? call_function_single_interrupt+0xa/0x20
[ 1168.295117][ C1] ? lockdep_hardirqs_on+0x1b0/0x2c0
[ 1168.300319][ C1] _nohz_idle_balance+0x221/0x360
[ 1168.305256][ C1] run_rebalance_domains+0x16c/0x2e0
[ 1168.310452][ C1] __do_softirq+0x1ca/0x96a
[ 1168.314861][ C1] ? __irqentry_text_end+0x1fa9e7/0x1fa9e7
[ 1168.320579][ C1] ? hrtimer_reprogram+0x170/0x170
[ 1168.325608][ C1] ? __bpf_trace_preemptirq_template+0x100/0x100
[ 1168.331856][ C1] ? lapic_next_event+0x3c/0x50
[ 1168.336617][ C1] ? clockevents_program_event+0xfc/0x180
[ 1168.342249][ C1] ? check_flags.part.28+0x86/0x220
[ 1168.347355][ C1] ? trace_hardirqs_off+0x8d/0x1f0
[ 1168.352374][ C1] ? __bpf_trace_preemptirq_template+0x100/0x100
[ 1168.358620][ C1] ? rcu_read_lock_sched_held+0xac/0xe0
[ 1168.364077][ C1] ? rcu_read_lock_bh_held+0xc0/0xc0
[ 1168.369282][ C1] irq_exit+0xd6/0xf0
[ 1168.373168][ C1] smp_apic_timer_interrupt+0x215/0x560
[ 1168.378628][ C1] ? smp_call_function_single_interrupt+0x4e0/0x4e0
[ 1168.385137][ C1] ? smp_call_function_interrupt+0x4e0/0x4e0
[ 1168.391031][ C1] ? interrupt_entry+0xe4/0xf0
[ 1168.395705][ C1] ? trace_hardirqs_off_caller+0x8d/0x1f0
[ 1168.401336][ C1] ? trace_hardirqs_off_caller+0x8d/0x1f0
[ 1168.406969][ C1] ? trace_hardirqs_on_caller+0x1f0/0x1f0
[ 1168.412602][ C1] ? trace_hardirqs_on_caller+0x1f0/0x1f0
[ 1168.418234][ C1] ? __kasan_check_write+0x14/0x20
[ 1168.423260][ C1] ? rcu_dynticks_eqs_enter+0x25/0x40
[ 1168.428550][ C1] ? trace_hardirqs_off_thunk+0x1a/0x1c
[ 1168.434013][ C1] apic_timer_interrupt+0xf/0x20
[ 1168.438855][ C1] </IRQ>
[ 1168.441698][ C1] RIP: 0010:cpuidle_enter_state+0x1d1/0xac0
[ 1168.447504][ C1] Code: ff e8 63 22 7c ff 80 bd 28 ff ff ff 00 74
12 9c 58 f6 c4 02 0f 85 cc 06 00 00 31 ff e8 d8 1e 8a ff e8 23 c4 93
ff fb 45 85 ed <0f> 88 dc 01 00 00 4d 63 f5 49 83 fe 09 0f 87 d0 07 00
00 4b 8d 14
[ 1168.467110][ C1] RSP: 0018:ffffc9000031fc70 EFLAGS: 00000202
ORIG_RAX: ffffffffffffff13
[ 1168.475452][ C1] RAX: 0000000000000000 RBX: ffff8886381b4400
RCX: ffffffff8eba0740
[ 1168.483353][ C1] RDX: 0000000000000007 RSI: dffffc0000000000
RDI: ffff888214f5c8e4
[ 1168.491255][ C1] RBP: ffffc9000031fd78 R08: fffffbfff2120216
R09: 0000000000000000
[ 1168.499158][ C1] R10: 0000000000000000 R11: 0000000000000000
R12: 0000000000000001
[ 1168.507061][ C1] R13: 0000000000000002 R14: ffffffff90695bb0
R15: 0000010ff187211b
[ 1168.514971][ C1] ? lockdep_hardirqs_on+0x1b0/0x2c0
[ 1168.520178][ C1] ? tick_nohz_idle_stop_tick+0x2b0/0x690
[ 1168.525817][ C1] ? cpuidle_enter_s2idle+0x280/0x280
[ 1168.531104][ C1] ? tick_nohz_tick_stopped_cpu+0xa0/0xa0
[ 1168.536741][ C1] ? menu_enable_device+0xf0/0xf0
[ 1168.541679][ C1] ? trace_hardirqs_off+0x1f0/0x1f0
[ 1168.546794][ C1] cpuidle_enter+0x41/0x70
[ 1168.551126][ C1] do_idle+0x3cf/0x440
== powerpc ==
[13720.177440][ C35] WARNING: CPU: 35 PID: 0 at kernel/smp.c:127
flush_smp_call_function_queue+0x104/0x360
[13720.177562][ C35] Modules linked in: nf_tables nfnetlink cn
kvm_hv kvm ip_tables x_tables xfs sd_mod bnx2x ahci tg3 libahci mdio
libphy libata firmware_class dm_mirror dm_region_hash dm_log dm_mod
[13720.177776][ C35] CPU: 35 PID: 0 Comm: swapper/35 Tainted: G
W L 5.7.0-rc6-next-20200519 #2
[13720.177877][ C35] NIP: c000000000275f44 LR: c000000000275f60
CTR: c0000000001875b0
[13720.177952][ C35] REGS: c00000003e64f0c0 TRAP: 0700 Tainted: G
W L (5.7.0-rc6-next-20200519)
[13720.178061][ C35] MSR: 9000000000029033
<SF,HV,EE,ME,IR,DR,RI,LE> CR: 24002428 XER: 20040000
[13720.178183][ C35] CFAR: c000000000275f68 IRQMASK: 1
[13720.178183][ C35] GPR00: c000000000275f60 c00000003e64f350
c000000001765000 c000001ffe204000
[13720.178183][ C35] GPR04: c00000000179bc30 0000000000000000
c00000003e64f674 c000201fff7ff800
[13720.178183][ C35] GPR08: 0000000000000000 0000000000000001
c0000000001875b0 000000003b70faa3
[13720.178183][ C35] GPR12: c0000000001875b0 c000001ffffe2a80
c000001ffe2b4018 0000000000000024
[13720.178183][ C35] GPR16: 0000000000000000 c000001ffe204000
0000000000000000 c0000000015b1e90
[13720.178183][ C35] GPR20: 000000010013d6df 0000000000000003
0000000000000001 0000000000000002
[13720.178183][ C35] GPR24: 0000000000000000 c00000000179c664
c00000003e64f4f8 c00000000179c3b0
[13720.178183][ C35] GPR28: 0000001ffd0b0000 0000000000000000
c000001ffe204060 c000001ffe204060
[13720.179023][ C35] NIP [c000000000275f44]
flush_smp_call_function_queue+0x104/0x360
[13720.179104][ C35] LR [c000000000275f60]
flush_smp_call_function_queue+0x120/0x360
[13720.179191][ C35] Call Trace:
[13720.179225][ C35] [c00000003e64f350] [c000000000275f60]
flush_smp_call_function_queue+0x120/0x360 (unreliable)
[13720.179337][ C35] [c00000003e64f3f0] [c000000000059894]
smp_ipi_demux_relaxed+0xa4/0x100
[13720.179439][ C35] [c00000003e64f430] [c000000000053084]
doorbell_exception+0x124/0x730
[13720.179525][ C35] [c00000003e64f4d0] [c000000000017404]
replay_soft_interrupts+0x254/0x3c0
[13720.179622][ C35] [c00000003e64f6c0] [c0000000000175c0]
arch_local_irq_restore+0x50/0xd0
[13720.179714][ C35] [c00000003e64f6e0] [c000000000adc3f0]
_raw_spin_unlock_irqrestore+0xa0/0xd0
[13720.179806][ C35] [c00000003e64f710] [c0000000001a8f68]
_nohz_idle_balance+0x308/0x450
[13720.179900][ C35] [c00000003e64f810] [c000000000add04c]
__do_softirq+0x3ac/0xaa8
[13720.179986][ C35] [c00000003e64f990] [c00000000012981c]
irq_exit+0x16c/0x1d0
[13720.180080][ C35] [c00000003e64fa00] [c00000000002771c]
timer_interrupt+0x1fc/0x880
[13720.180162][ C35] [c00000003e64fac0] [c000000000017344]
replay_soft_interrupts+0x194/0x3c0
[13720.180266][ C35] [c00000003e64fcb0] [c0000000000175c0]
arch_local_irq_restore+0x50/0xd0
[13720.180367][ C35] [c00000003e64fcd0] [c0000000008cee78]
cpuidle_enter_state+0x128/0x9f0
[13720.180464][ C35] [c00000003e64fd80] [c0000000008cf7e0]
cpuidle_enter+0x50/0x70
[13720.180543][ C35] [c00000003e64fdc0] [c00000000018e2ec]
call_cpuidle+0x4c/0x90
[13720.180638][ C35] [c00000003e64fde0] [c00000000018e7f8] do_idle+0x378/0x470
[13720.506608][ C35] [c00000003e64fe90] [c00000000018ed18]
cpu_startup_entry+0x38/0x40
[13720.506678][ C35] [c00000003e64fec0] [c00000000005b0a0]
start_secondary+0x780/0xa20
[13720.506759][ C35] [c00000003e64ff90] [c00000000000c454]
start_secondary_prolog+0x10/0x14
[13720.506851][ C35] Instruction dump:
[13720.506909][ C35] 2fbe0000 93bf0018 7fdff378 419e004c 813f0018
ebdf0000 e95f0008 e87f0010
[13720.507016][ C35] 71280002 4082ffb8 7d2948f8 552907fe <0b090000>
7c2004ac 911f0018 7d4c5378
[13720.507119][ C35] irq event stamp: 122776347
[13720.507202][ C35] hardirqs last enabled at (122776346):
[<c000000000adc3e4>] _raw_spin_unlock_irqrestore+0x94/0xd0
[13720.507303][ C35] hardirqs last disabled at (122776347):
[<c0000000000175b8>] arch_local_irq_restore+0x48/0xd0
[13720.507427][ C35] softirqs last enabled at (122776342):
[<c0000000001296ac>] irq_enter+0x9c/0xa0
[13720.507517][ C35] softirqs last disabled at (122776343):
[<c00000000012981c>] irq_exit+0x16c/0x1d0
[13720.507632][ C35] ---[ end trace 20587d9746d61ca8 ]---
On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> Just a head up. Repeatedly compiling kernels for a while would trigger
> endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> .config are in,
Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
not seen anything like that myself. Let me go have a look.
In as far as the logs are readable (they're a wrapped mess, please don't
do that!), they contain very little useful, as is typical with IPIs :/
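FWIW, the warning at kernel/smp.c:127 should be the csd_unlock() sanity
check, inlined into flush_smp_call_function_queue(). Roughly (a sketch of
the code around that time, paraphrased; the exact line number may differ):

    static __always_inline void csd_unlock(call_single_data_t *csd)
    {
        /* fires when the csd being completed was already unlocked,
         * i.e. someone else already ran and released it */
        WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

        /* ensure we're all done before releasing data */
        smp_store_release(&csd->flags, 0);
    }

    /* flush_smp_call_function_queue(), async csd case: */
    csd_unlock(csd);    /* <-- the WARN in the traces below */
    func(info);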
> [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.003333][ C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> dm_region_hash dm_log dm_mod
> [ 1168.029492][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> 5.7.0-rc6-next-20200519 #1
> [ 1168.037665][ C1] Hardware name: HPE ProLiant DL385
> Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> [ 1168.046978][ C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> [ 1168.053658][ C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> 10 52 3f 71
> [ 1168.073262][ C1] RSP: 0018:ffffc90000178918 EFLAGS: 00010046
> [ 1168.079253][ C1] RAX: 0000000000000000 RBX: ffff8888430c58f8
> RCX: ffffffff8ec26083
> [ 1168.087156][ C1] RDX: 0000000000000003 RSI: dffffc0000000000
> RDI: ffff8888430c58f8
> [ 1168.095054][ C1] RBP: ffffc900001789a8 R08: ffffed1108618cec
> R09: ffffed1108618cec
> [ 1168.102964][ C1] R10: ffff8888430c675b R11: 0000000000000000
> R12: ffff8888430c58e0
> [ 1168.110866][ C1] R13: ffffffff8eb30c40 R14: ffff8888430c5880
> R15: ffff8888430c58e0
> [ 1168.118767][ C1] FS: 0000000000000000(0000)
> GS:ffff888843080000(0000) knlGS:0000000000000000
> [ 1168.127628][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 1168.134129][ C1] CR2: 000055b169604560 CR3: 0000000d08a14000
> CR4: 00000000003406e0
> [ 1168.142026][ C1] Call Trace:
> [ 1168.145206][ C1] <IRQ>
> [ 1168.147957][ C1] ? smp_call_on_cpu_callback+0xd0/0xd0
> [ 1168.153421][ C1] ? rcu_read_lock_sched_held+0xac/0xe0
> [ 1168.158880][ C1] ? rcu_read_lock_bh_held+0xc0/0xc0
> [ 1168.164076][ C1] generic_smp_call_function_single_interrupt+0x13/0x2b
> [ 1168.170938][ C1] smp_call_function_single_interrupt+0x157/0x4e0
> [ 1168.177278][ C1] ? smp_call_function_interrupt+0x4e0/0x4e0
> [ 1168.183172][ C1] ? interrupt_entry+0xe4/0xf0
> [ 1168.187846][ C1] ? trace_hardirqs_off_caller+0x8d/0x1f0
> [ 1168.193478][ C1] ? trace_hardirqs_on_caller+0x1f0/0x1f0
> [ 1168.199116][ C1] ? _nohz_idle_balance+0x221/0x360
> [ 1168.204228][ C1] ? trace_hardirqs_off_thunk+0x1a/0x1c
> [ 1168.209690][ C1] call_function_single_interrupt+0xf/0x20
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
>
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
Yes, I ended up figuring out the same commit a bit earlier. Since then I
have reverted that commit and its dependency,
2a0a24ebb499 ("sched: Make scheduler_ipi inline"),
and everything has worked fine so far.
>
>
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
Sorry about that. I forgot that the Gmail web UI wraps long lines. I will
switch to mutt.
>
> > [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.003333][ C1] Modules linked in: nls_iso8859_1 nls_cp437 vfat
> > fat kvm_amd ses kvm enclosure dax_pmem irqbypass dax_pmem_core efivars
> > acpi_cpufreq efivarfs ip_tables x_tables xfs sd_mod smartpqi
> > scsi_transport_sas tg3 mlx5_core libphy firmware_class dm_mirror
> > dm_region_hash dm_log dm_mod
> > [ 1168.029492][ C1] CPU: 1 PID: 0 Comm: swapper/1 Not tainted
> > 5.7.0-rc6-next-20200519 #1
> > [ 1168.037665][ C1] Hardware name: HPE ProLiant DL385
> > Gen10/ProLiant DL385 Gen10, BIOS A40 07/10/2019
> > [ 1168.046978][ C1] RIP: 0010:flush_smp_call_function_queue+0x1fa/0x2e0
> > [ 1168.053658][ C1] Code: 01 0f 87 c9 12 00 00 83 e3 01 0f 85 cc fe
> > ff ff 48 c7 c7 c0 55 a9 8f c6 05 f6 86 cd 01 01 e8 de 09 ea ff 0f 0b
> > e9 b2 fe ff ff <0f> 0b e9 52 ff ff ff 0f 0b e9 f2 fe ff ff 65 44 8b 25
> > 10 52 3f 71
> > [ 1168.073262][ C1] RSP: 0018:ffffc90000178918 EFLAGS: 00010046
> > [ 1168.079253][ C1] RAX: 0000000000000000 RBX: ffff8888430c58f8
> > RCX: ffffffff8ec26083
> > [ 1168.087156][ C1] RDX: 0000000000000003 RSI: dffffc0000000000
> > RDI: ffff8888430c58f8
> > [ 1168.095054][ C1] RBP: ffffc900001789a8 R08: ffffed1108618cec
> > R09: ffffed1108618cec
> > [ 1168.102964][ C1] R10: ffff8888430c675b R11: 0000000000000000
> > R12: ffff8888430c58e0
> > [ 1168.110866][ C1] R13: ffffffff8eb30c40 R14: ffff8888430c5880
> > R15: ffff8888430c58e0
> > [ 1168.118767][ C1] FS: 0000000000000000(0000)
> > GS:ffff888843080000(0000) knlGS:0000000000000000
> > [ 1168.127628][ C1] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [ 1168.134129][ C1] CR2: 000055b169604560 CR3: 0000000d08a14000
> > CR4: 00000000003406e0
> > [ 1168.142026][ C1] Call Trace:
> > [ 1168.145206][ C1] <IRQ>
> > [ 1168.147957][ C1] ? smp_call_on_cpu_callback+0xd0/0xd0
> > [ 1168.153421][ C1] ? rcu_read_lock_sched_held+0xac/0xe0
> > [ 1168.158880][ C1] ? rcu_read_lock_bh_held+0xc0/0xc0
> > [ 1168.164076][ C1] generic_smp_call_function_single_interrupt+0x13/0x2b
> > [ 1168.170938][ C1] smp_call_function_single_interrupt+0x157/0x4e0
> > [ 1168.177278][ C1] ? smp_call_function_interrupt+0x4e0/0x4e0
> > [ 1168.183172][ C1] ? interrupt_entry+0xe4/0xf0
> > [ 1168.187846][ C1] ? trace_hardirqs_off_caller+0x8d/0x1f0
> > [ 1168.193478][ C1] ? trace_hardirqs_on_caller+0x1f0/0x1f0
> > [ 1168.199116][ C1] ? _nohz_idle_balance+0x221/0x360
> > [ 1168.204228][ C1] ? trace_hardirqs_off_thunk+0x1a/0x1c
> > [ 1168.209690][ C1] call_function_single_interrupt+0xf/0x20
On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > Just a head up. Repeatedly compiling kernels for a while would trigger
> > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > .config are in,
>
> Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> not seen anything like that myself. Let me go have a look.
>
>
> In as far as the logs are readable (they're a wrapped mess, please don't
> do that!), they contain very little useful, as is typical with IPIs :/
>
> > [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > flush_smp_call_function_queue+0x1fa/0x2e0
So I've tried to think of a race that could produce that and here is
the only thing I could come up with. It's a bit complicated unfortunately:
CPU 0                                            CPU 1
-----                                            -----
tick {
    trigger_load_balance() {
        raise_softirq(SCHED_SOFTIRQ);
        //but nohz_flags(0) = 0
    }
                                                 kick_ilb() {
                                                     atomic_fetch_or(...., nohz_flags(0))
    softirq() {                                      #VMEXIT or anything that could stop a CPU for a while
        run_rebalance_domain() {
            nohz_idle_balance() {
                atomic_andnot(NOHZ_KICK_MASK, nohz_flag(0))
            }
        }
    }
}
// schedule
nohz_newidle_balance() {
    kick_ilb() { // pick current CPU
        atomic_fetch_or(...., nohz_flags(0))         #VMENTER
        smp_call_function_single_async() {           smp_call_function_single_async() {
            // verified csd->flags != CSD_LOCK           // verified csd->flags != CSD_LOCK
            csd->flags = CSD_LOCK                        csd->flags = CSD_LOCK
            //execute in place                           //queue and send IPI
            csd->flags = 0
            nohz_csd_func()
        }
    }
}

IPI {
    flush_smp_call_function_queue() {
        csd_unlock() {
            WARN_ON(csd->flags != CSD_LOCK)   <---------!!!!!
The root cause here would be that trigger_load_balance() unconditionally raises
the softirq. And I have to confess I'm not clear why, since the softirq is
essentially a no-op when nohz_flags() is 0.
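For context, the pieces I'm referring to look roughly like this (paraphrased
from kernel/sched/fair.c of that time, not the exact code):

    void trigger_load_balance(struct rq *rq)
    {
        if (unlikely(on_null_domain(rq)))
            return;

        /* raised on every balance interval, even when no
         * NOHZ_KICK_MASK bit is set for this CPU */
        if (time_after_eq(jiffies, rq->next_balance))
            raise_softirq(SCHED_SOFTIRQ);

        nohz_balancer_kick(rq);
    }

    /* nohz_idle_balance(), run from that softirq, then clears the kick
     * bits set by a remote kick_ilb() even when it does no balancing: */
    if (idle != CPU_IDLE) {
        atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
        return false;
    }
    flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));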
Thanks.
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > .config are in,
> >
> > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > not seen anything like that myself. Let me go have a look.
> >
> >
> > In as far as the logs are readable (they're a wrapped mess, please don't
> > do that!), they contain very little useful, as is typical with IPIs :/
> >
> > > [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > flush_smp_call_function_queue+0x1fa/0x2e0
>
> So I've tried to think of a race that could produce that and here is
> the only thing I could come up with. It's a bit complicated unfortunately:
This:
> smp_call_function_single_async() { smp_call_function_single_async() {
> // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> csd->flags = CSD_LOCK csd->flags = CSD_LOCK
concurrent smp_call_function_single_async() using the same csd is what
I'm looking at as well. Now in the ILB case there is an easy cure:
(because there is only a single ilb target)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
* is idle. And the softirq performing nohz idle load balance
* will be run before returning from the IPI.
*/
- smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+ smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
}
/*
Qian, can you give that a spin?
But I'm still not convinced of your scenario:
> kick_ilb() {
> atomic_fetch_or(...., nohz_flags(0))
> atomic_fetch_or(...., nohz_flags(0)) #VMENTER
> smp_call_function_single_async() { smp_call_function_single_async() {
> // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> csd->flags = CSD_LOCK csd->flags = CSD_LOCK
Note that we check the return value of atomic_fetch_or() and bail if
someone else set a flag in KICK_MASK before us.
Aah, I suppose you're saying this can happen when:
!(flags & NOHZ_KICK_MASK)
? That's not supposed to happen though.
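That check being, roughly (kick_ilb(), paraphrased):

    flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
    if (flags & NOHZ_KICK_MASK)
        return;    /* somebody else already owns the kick */

    smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);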
Anyway, let me go stare at the remote wake-up case, because I'm afraid
that might have the same problem too...
On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> atomic_fetch_or(...., nohz_flags(0))
> softirq() { #VMEXIT or anything that could stop a CPU for a while
> run_rebalance_domain() {
> nohz_idle_balance() {
> atomic_andnot(NOHZ_KICK_MASK, nohz_flag(0))
I'm an idiot and didn't have enough wake-up-juice; I missed that andnot
clearing the flag again.
Yes, fun fun fun..
On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> This:
>
> > smp_call_function_single_async() { smp_call_function_single_async() {
> > // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> > csd->flags = CSD_LOCK csd->flags = CSD_LOCK
>
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well.
So something like this ought to cure the fundamental problem and make
smp_call_function_single_async() more user friendly, but also more
expensive.
The problem is that while the ILB case is easy to fix, I can't seem to
find an equally nice solution for the ttwu_queue_remote() case; that
would basically require sticking the wake_csd in task_struct. I'll also
post that.
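For reference, the remote wake-up path currently looks roughly like this
(paraphrased); note that every waker targeting this rq shares the one
rq->wake_csd:

    static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
    {
        struct rq *rq = cpu_rq(cpu);

        if (llist_add(&p->wake_entry, &rq->wake_list)) {
            if (!set_nr_if_polling(rq->idle))
                /* one csd per rq, shared by all wakers */
                smp_call_function_single_async(cpu, &rq->wake_csd);
            else
                trace_sched_wake_idle_without_ipi(cpu);
        }
    }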
So it's either this:
---
kernel/smp.c | 21 ++++++++++++++++-----
1 file changed, 16 insertions(+), 5 deletions(-)
diff --git a/kernel/smp.c b/kernel/smp.c
index 84303197caf9..d1ca2a2d1cc7 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -109,6 +109,12 @@ static __always_inline void csd_lock_wait(call_single_data_t *csd)
smp_cond_load_acquire(&csd->flags, !(VAL & CSD_FLAG_LOCK));
}
+/*
+ * csd_lock() can use non-atomic operations to set CSD_FLAG_LOCK because it's
+ * users are careful to only use CPU-local data. IOW, there is no cross-cpu
+ * lock usage. Also, you're not allowed to use smp_call_function*() from IRQs,
+ * and must be extra careful from SoftIRQ.
+ */
static __always_inline void csd_lock(call_single_data_t *csd)
{
csd_lock_wait(csd);
@@ -318,7 +324,7 @@ EXPORT_SYMBOL(smp_call_function_single);
/**
* smp_call_function_single_async(): Run an asynchronous function on a
- * specific CPU.
+ * specific CPU.
* @cpu: The CPU to run on.
* @csd: Pre-allocated and setup data structure
*
@@ -339,18 +345,23 @@ EXPORT_SYMBOL(smp_call_function_single);
*/
int smp_call_function_single_async(int cpu, call_single_data_t *csd)
{
+ unsigned int csd_flags;
int err = 0;
preempt_disable();
- if (csd->flags & CSD_FLAG_LOCK) {
+ /*
+ * Unlike the regular smp_call_function*() APIs, this one is actually
+ * usable from IRQ context, also the -EBUSY return value suggests
+ * it is safe to share csd's.
+ */
+ csd_flags = READ_ONCE(csd->flags);
+ if (csd_flags & CSD_FLAG_LOCK ||
+ cmpxchg(&csd->flags, csd_flags, csd_flags | CSD_FLAG_LOCK) != csd_flags) {
err = -EBUSY;
goto out;
}
- csd->flags = CSD_FLAG_LOCK;
- smp_wmb();
-
err = generic_exec_single(cpu, csd, csd->func, csd->info);
out:
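With that, a caller that knowingly shares a csd between possible concurrent
senders just has to tolerate -EBUSY; hypothetically something like
(shared_csd is a made-up example, not an existing user):

    if (smp_call_function_single_async(cpu, &shared_csd) == -EBUSY) {
        /*
         * Someone else still has an IPI in flight with this csd; its
         * callback will run on whatever CPU it was queued for, so the
         * loser simply backs off.
         */
    }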
On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
>
> > This:
> >
> > > smp_call_function_single_async() { smp_call_function_single_async() {
> > > // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> > > csd->flags = CSD_LOCK csd->flags = CSD_LOCK
> >
> > concurrent smp_call_function_single_async() using the same csd is what
> > I'm looking at as well.
>
> So something like this ought to cure the fundamental problem and make
> smp_call_function_single_async() more user friendly, but also more
> expensive.
>
> The problem is that while the ILB case is easy to fix, I can't seem to
> find an equally nice solution for the ttwu_remote_queue() case; that
> would basically require sticking the wake_csd in task_struct, I'll also
> post that.
>
> So it's either this:
Or this:
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 7 ++++---
kernel/sched/fair.c | 2 +-
kernel/sched/sched.h | 1 -
4 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
+#ifdef CONFIG_SMP
+ call_single_data_t wake_csd;
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..a7129652e89b 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
- smp_call_function_single_async(cpu, &rq->wake_csd);
+ smp_call_function_single_async(cpu, &p->wake_csd);
else
trace_sched_wake_idle_without_ipi(cpu);
}
@@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#if defined(CONFIG_SMP)
p->on_cpu = 0;
+ p->wake_csd = (struct __call_single_data) {
+ .func = wake_csd_func,
+ };
#endif
init_task_preempt_count(p);
#ifdef CONFIG_SMP
@@ -6723,8 +6726,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
- rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
INIT_LIST_HEAD(&rq->cfs_tasks);
rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..b6d8a7b991f0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
* is idle. And the softirq performing nohz idle load balance
* will be run before returning from the IPI.
*/
- smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
+ smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
}
/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7ab6334e992..c35f0ef43ab0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1021,7 +1021,6 @@ struct rq {
#endif
#ifdef CONFIG_SMP
- call_single_data_t wake_csd;
struct llist_head wake_list;
#endif
On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 12:49:37PM +0200, Peter Zijlstra wrote:
> > On Thu, May 21, 2020 at 11:39:39AM +0200, Peter Zijlstra wrote:
> > > On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> >
> > > This:
> > >
> > > > smp_call_function_single_async() { smp_call_function_single_async() {
> > > > // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> > > > csd->flags = CSD_LOCK csd->flags = CSD_LOCK
> > >
> > > concurrent smp_call_function_single_async() using the same csd is what
> > > I'm looking at as well.
> >
> > So something like this ought to cure the fundamental problem and make
> > smp_call_function_single_async() more user friendly, but also more
> > expensive.
> >
> > The problem is that while the ILB case is easy to fix, I can't seem to
> > find an equally nice solution for the ttwu_remote_queue() case; that
> > would basically require sticking the wake_csd in task_struct, I'll also
> > post that.
> >
> > So it's either this:
>
> Or this:
>
> ---
> include/linux/sched.h | 4 ++++
> kernel/sched/core.c | 7 ++++---
> kernel/sched/fair.c | 2 +-
> kernel/sched/sched.h | 1 -
> 4 files changed, 9 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index f38d62c4632c..136ee400b568 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -696,6 +696,10 @@ struct task_struct {
> struct uclamp_se uclamp[UCLAMP_CNT];
> #endif
>
> +#ifdef CONFIG_SMP
> + call_single_data_t wake_csd;
> +#endif
> +
> #ifdef CONFIG_PREEMPT_NOTIFIERS
> /* List of struct preempt_notifier: */
> struct hlist_head preempt_notifiers;
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 5b286469e26e..a7129652e89b 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -2320,7 +2320,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>
> if (llist_add(&p->wake_entry, &rq->wake_list)) {
> if (!set_nr_if_polling(rq->idle))
> - smp_call_function_single_async(cpu, &rq->wake_csd);
> + smp_call_function_single_async(cpu, &p->wake_csd);
> else
> trace_sched_wake_idle_without_ipi(cpu);
> }
> @@ -2921,6 +2921,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
> #endif
> #if defined(CONFIG_SMP)
> p->on_cpu = 0;
> + p->wake_csd = (struct __call_single_data) {
> + .func = wake_csd_func,
> + };
> #endif
> init_task_preempt_count(p);
> #ifdef CONFIG_SMP
> @@ -6723,8 +6726,6 @@ void __init sched_init(void)
> rq->avg_idle = 2*sysctl_sched_migration_cost;
> rq->max_idle_balance_cost = sysctl_sched_migration_cost;
>
> - rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
> -
> INIT_LIST_HEAD(&rq->cfs_tasks);
>
> rq_attach_root(rq, &def_root_domain);
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> * is idle. And the softirq performing nohz idle load balance
> * will be run before returning from the IPI.
> */
> - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
My fear here is that if a previous call from the same CPU but to another
target is still pending, the new one will be spuriously ignored.
Namely this could happen:
CPU 0                                            CPU 1
-----                                            -----
                                                 local_irq_disable() or VMEXIT
kick_ilb() {
    smp_call_function_single_async(CPU 1,
                                   &this_rq()->nohz_csd);
}
kick_ilb() {
    smp_call_function_single_async(CPU 2,
                                   &this_rq()->nohz_csd) {
        // IPI to CPU 2 ignored
        if (csd->flags == CSD_LOCK)
            return -EBUSY;
    }
}
                                                 local_irq_enable();
But I believe we can still keep the remote csd if nohz_flags() are
strictly only set before the IPI and strictly only cleared from it.
And I still don't understand why trigger_load_balance() raises the
softirq without setting the current CPU as ilb. run_rebalance_domains()
thus ignores it most of the time in the end, or it spuriously clears the
nohz_flags set by an IPI sender. Or maybe there is something I misunderstood
there.
(Haven't checked the wake up case yet).
On Thu, May 21, 2020 at 11:39:38AM +0200, Peter Zijlstra wrote:
> On Thu, May 21, 2020 at 02:40:36AM +0200, Frederic Weisbecker wrote:
> > On Wed, May 20, 2020 at 02:50:56PM +0200, Peter Zijlstra wrote:
> > > On Tue, May 19, 2020 at 11:58:17PM -0400, Qian Cai wrote:
> > > > Just a head up. Repeatedly compiling kernels for a while would trigger
> > > > endless soft-lockups since next-20200519 on both x86_64 and powerpc.
> > > > .config are in,
> > >
> > > Could be 90b5363acd47 ("sched: Clean up scheduler_ipi()"), although I've
> > > not seen anything like that myself. Let me go have a look.
> > >
> > >
> > > In as far as the logs are readable (they're a wrapped mess, please don't
> > > do that!), they contain very little useful, as is typical with IPIs :/
> > >
> > > > [ 1167.993773][ C1] WARNING: CPU: 1 PID: 0 at kernel/smp.c:127
> > > > flush_smp_call_function_queue+0x1fa/0x2e0
> >
> > So I've tried to think of a race that could produce that and here is
> > the only thing I could come up with. It's a bit complicated unfortunately:
>
> This:
>
> > smp_call_function_single_async() { smp_call_function_single_async() {
> > // verified csd->flags != CSD_LOCK // verified csd->flags != CSD_LOCK
> > csd->flags = CSD_LOCK csd->flags = CSD_LOCK
>
> concurrent smp_call_function_single_async() using the same csd is what
> I'm looking at as well. Now in the ILB case there is an easy cure:
>
> (because there is only a single ilb target)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 01f94cf52783..b6d8a7b991f0 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> * is idle. And the softirq performing nohz idle load balance
> * will be run before returning from the IPI.
> */
> - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
> }
>
> /*
>
> Qian, can you give that a spin?
Running for a few hours now. It works fine.
On Thu, May 21, 2020 at 02:41:14PM +0200, Frederic Weisbecker wrote:
> On Thu, May 21, 2020 at 01:00:27PM +0200, Peter Zijlstra wrote:
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 01f94cf52783..b6d8a7b991f0 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -10033,7 +10033,7 @@ static void kick_ilb(unsigned int flags)
> > * is idle. And the softirq performing nohz idle load balance
> > * will be run before returning from the IPI.
> > */
> > - smp_call_function_single_async(ilb_cpu, &cpu_rq(ilb_cpu)->nohz_csd);
> > + smp_call_function_single_async(ilb_cpu, &this_rq()->nohz_csd);
>
> My fear here is that if a previous call from the the same CPU but to another
> target is still pending, the new one will be spuriously ignored.
>
Urgh, indeed!
> But I believe we can still keep the remote csd if nohz_flags() are
> strictly only set before the IPI and strictly only cleared from it.
>
> And I still don't understand why trigger_load_balance() raise the
> softirq without setting the current CPU as ilb. run_rebalance_domains()
> thus ignores it most of the time in the end or it spuriously clear the
> nohz_flags set by an IPI sender. Or there is something I misunderstood
> there.
That is because it is simple and didn't matter before. Whoever got there
first got to run the ilb whenever the flag was set.
But now we have this race due to having to serialize access to the csd.
We want the IPI to clear the flag, but then the softirq no longer knows
it was supposed to do ILB.
How's this then?
---
include/linux/sched.h | 4 ++++
kernel/sched/core.c | 41 +++++++++++++----------------------------
kernel/sched/fair.c | 15 +++++++--------
kernel/sched/sched.h | 2 +-
4 files changed, 25 insertions(+), 37 deletions(-)
diff --git a/include/linux/sched.h b/include/linux/sched.h
index f38d62c4632c..136ee400b568 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -696,6 +696,10 @@ struct task_struct {
struct uclamp_se uclamp[UCLAMP_CNT];
#endif
+#ifdef CONFIG_SMP
+ call_single_data_t wake_csd;
+#endif
+
#ifdef CONFIG_PREEMPT_NOTIFIERS
/* List of struct preempt_notifier: */
struct hlist_head preempt_notifiers;
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5b286469e26e..90484b988b65 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -637,41 +637,25 @@ void wake_up_nohz_cpu(int cpu)
wake_up_idle_cpu(cpu);
}
-static inline bool got_nohz_idle_kick(void)
+static void nohz_csd_func(void *info)
{
- int cpu = smp_processor_id();
-
- if (!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK))
- return false;
-
- if (idle_cpu(cpu) && !need_resched())
- return true;
+ struct rq *rq = info;
+ int cpu = cpu_of(rq);
+ WARN_ON(!(atomic_read(nohz_flags(cpu)) & NOHZ_KICK_MASK));
/*
- * We can't run Idle Load Balance on this CPU for this time so we
- * cancel it and clear NOHZ_BALANCE_KICK
+ * Release the rq::nohz_csd.
*/
+ smp_mb__before_atomic();
atomic_andnot(NOHZ_KICK_MASK, nohz_flags(cpu));
- return false;
-}
-
-static void nohz_csd_func(void *info)
-{
- struct rq *rq = info;
- if (got_nohz_idle_kick()) {
- rq->idle_balance = 1;
+ rq->idle_balance = idle_cpu(cpu);
+ if (rq->idle_balance && !need_resched()) {
+ rq->nohz_idle_balance = 1;
raise_softirq_irqoff(SCHED_SOFTIRQ);
}
}
-#else /* CONFIG_NO_HZ_COMMON */
-
-static inline bool got_nohz_idle_kick(void)
-{
- return false;
-}
-
#endif /* CONFIG_NO_HZ_COMMON */
#ifdef CONFIG_NO_HZ_FULL
@@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
if (llist_add(&p->wake_entry, &rq->wake_list)) {
if (!set_nr_if_polling(rq->idle))
- smp_call_function_single_async(cpu, &rq->wake_csd);
+ smp_call_function_single_async(cpu, &p->wake_csd);
else
trace_sched_wake_idle_without_ipi(cpu);
}
@@ -2921,6 +2905,9 @@ int sched_fork(unsigned long clone_flags, struct task_struct *p)
#endif
#if defined(CONFIG_SMP)
p->on_cpu = 0;
+ p->wake_csd = (struct __call_single_data) {
+ .func = wake_csd_func,
+ };
#endif
init_task_preempt_count(p);
#ifdef CONFIG_SMP
@@ -6723,8 +6710,6 @@ void __init sched_init(void)
rq->avg_idle = 2*sysctl_sched_migration_cost;
rq->max_idle_balance_cost = sysctl_sched_migration_cost;
- rq_csd_init(rq, &rq->wake_csd, wake_csd_func);
-
INIT_LIST_HEAD(&rq->cfs_tasks);
rq_attach_root(rq, &def_root_domain);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01f94cf52783..93525549a023 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10024,6 +10024,10 @@ static void kick_ilb(unsigned int flags)
if (ilb_cpu >= nr_cpu_ids)
return;
+ /*
+ * Access to rq::nohz_csd is serialized by NOHZ_KICK_MASK; he who sets
+ * the first flag owns it; cleared by nohz_csd_func().
+ */
flags = atomic_fetch_or(flags, nohz_flags(ilb_cpu));
if (flags & NOHZ_KICK_MASK)
return;
@@ -10374,17 +10378,12 @@ static bool nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
int this_cpu = this_rq->cpu;
unsigned int flags;
- if (!(atomic_read(nohz_flags(this_cpu)) & NOHZ_KICK_MASK))
+ if (!this_rq->nohz_idle_balance)
return false;
- if (idle != CPU_IDLE) {
- atomic_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- return false;
- }
+ this_rq->nohz_idle_balance = 0;
- /* could be _relaxed() */
- flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
- if (!(flags & NOHZ_KICK_MASK))
+ if (idle != CPU_IDLE)
return false;
_nohz_idle_balance(this_rq, flags, idle);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f7ab6334e992..6418f6af15c1 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -951,6 +951,7 @@ struct rq {
struct callback_head *balance_callback;
+ unsigned char nohz_idle_balance;
unsigned char idle_balance;
unsigned long misfit_task_load;
@@ -1021,7 +1022,6 @@ struct rq {
#endif
#ifdef CONFIG_SMP
- call_single_data_t wake_csd;
struct llist_head wake_list;
#endif
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
>
> if (llist_add(&p->wake_entry, &rq->wake_list)) {
> if (!set_nr_if_polling(rq->idle))
> - smp_call_function_single_async(cpu, &rq->wake_csd);
> + smp_call_function_single_async(cpu, &p->wake_csd);
> else
> trace_sched_wake_idle_without_ipi(cpu);
Ok that's of course very unlikely but could it be possible to have the
following:
CPU 0                            CPU 1                            CPU 2
-----                            -----                            -----
//Wake up A
ttwu_queue(TASK A, CPU 1)        idle_loop {
                                     ttwu_queue_pending {
                                         ....
                                         raw_spin_unlock_irqrestore(rq)
                                     # VMEXIT (with IPI still pending)
                                                                  //task A migrates here
                                                                  wait_event(....)
                                                                  //sleep
//Wake up A
ttwu_queue(TASK A, CPU 2) {
    //IPI on CPU 2 ignored
    // due to csd->flags == CSD_LOCK
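(For a remote target the csd is only released once the destination CPU
actually takes the IPI; roughly, paraphrased from kernel/smp.c:)

    /* generic_exec_single(), sketch */
    if (cpu == smp_processor_id()) {
        csd_unlock(csd);            /* local call: released right away */
        local_irq_save(flags);
        func(info);
        local_irq_restore(flags);
        return 0;
    }

    /*
     * Remote call: the csd stays CSD_FLAG_LOCK'ed until the target runs
     * flush_smp_call_function_queue(). If that IPI is still pending
     * (e.g. across the VMEXIT above), p->wake_csd remains locked and the
     * second wakeup's smp_call_function_single_async() bails with -EBUSY,
     * so the IPI to CPU 2 is never sent.
     */
    if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
        arch_send_call_function_single_ipi(cpu);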
On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> - flags = atomic_fetch_andnot(NOHZ_KICK_MASK, nohz_flags(this_cpu));
> - if (!(flags & NOHZ_KICK_MASK))
> + if (idle != CPU_IDLE)
> return false;
>
> _nohz_idle_balance(this_rq, flags, idle);
Bah, I think I broke something there. Lemme go mend.
On Mon, May 25, 2020 at 04:05:49PM +0200, Frederic Weisbecker wrote:
> On Mon, May 25, 2020 at 03:21:05PM +0200, Peter Zijlstra wrote:
> > @@ -2320,7 +2304,7 @@ static void ttwu_queue_remote(struct task_struct *p, int cpu, int wake_flags)
> >
> > if (llist_add(&p->wake_entry, &rq->wake_list)) {
> > if (!set_nr_if_polling(rq->idle))
> > - smp_call_function_single_async(cpu, &rq->wake_csd);
> > + smp_call_function_single_async(cpu, &p->wake_csd);
> > else
> > trace_sched_wake_idle_without_ipi(cpu);
>
> Ok that's of course very unlikely but could it be possible to have the
> following:
>
> CPU 0 CPU 1 CPU 2
> -----
>
> //Wake up A
> ttwu_queue(TASK A, CPU 1) idle_loop {
> ttwu_queue_pending {
> ....
> raw_spin_unlock_irqrestore(rq)
> # VMEXIT (with IPI still pending)
> //task A migrates here
> wait_event(....)
> //sleep
>
> //Wake up A
> ttwu_queue(TASK A, CPU 2) {
> //IPI on CPU 2 ignored
> // due to csd->flags == CSD_LOCK
>
Right you are.
Bah!
More thinking....