2015-02-11 13:19:14

by Rafael David Tinoco

Subject: smp_call_function_single lockups

Linus, Thomas, Jens..

During the 3.18 - 3.19 "frequent lockups" discussion, at some point
you observed possible csd_lock() and csd_unlock() synchronization
problems. I think we have managed to reproduce that issue on a
consistent basis with 3.13 (Ubuntu) and 3.19 (latest vanilla).

- When running "OpenStack Tempest" in a nested-KVM environment we are
able to cause a lockup in a matter of hours (usually 2 to 20). Trace
from the nested hypervisor (Ubuntu 3.13):

crash> bt
PID: 29130 TASK: ffff8804288ac800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
#7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
#8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
#9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbe62 RSP: ffff8804284a3c70 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8804284a3c40 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff8804284a3ca0 R8: ffffffff8180ad48 R9: 0000000000000001
R10: ffffffff81185cac R11: ffffea00109b4a00 R12: ffff88042829f400
R13: 0000000000000000 R14: ffffea001017d640 R15: 0000000000000005
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
RIP: 00007fa3ed16c587 RSP: 00007fa3536f5c10 RFLAGS: 00000246
RAX: 000000000000000b RBX: ffffffff8173196d RCX: 0000001d00000007
RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007fa315ff4000
RBP: 00007fa3167f49c0 R8: 0000000000000000 R9: 00007fa3f5396738
R10: 00007fa3536f5a60 R11: 0000000000000202 R12: 00007fa3ed6562a0
R13: 00007fa350ef19c0 R14: ffffffff81181482 R15: ffff8804284a3f78
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b

- After applying the patch provided by Thomas we were able to cause the
lockup only after 6 days (also locked inside
smp_call_function_single). Test performance (even for a nested kvm)
was reduced substantially with 3.19 + this patch. Trace from the
nested hypervisor (3.19 + patch):

crash> bt
PID: 10467 TASK: ffff880817b3b1c0 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
#1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
#2 [ffff88083fd03de0] panic at ffffffff8176f713
#3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
#4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
#5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
#6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
#7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
#8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
#9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
[exception RIP: generic_exec_single+218]
RIP: ffffffff810ee0ca RSP: ffff880817973b18 RFLAGS: 00000202
RAX: 0000000000000002 RBX: 0000000000000292 RCX: 0000000000000001
RDX: ffffffff8180e6e0 RSI: 0000000000000000 RDI: 0000000000000292
RBP: ffff880817973b58 R8: ffffffff8180e6c8 R9: 0000000000000001
R10: 000000000000b6e0 R11: 0000000000000001 R12: ffffffff811f6626
R13: ffff880817973ab8 R14: ffffffff8109cfd2 R15: ffff880817973a78
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
RIP: 00007f42f987fec7 RSP: 00007f42ef1bebd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8177cc2d RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
RBP: 00007f430047b040 R8: 0000000000000000 R9: 00000000000000ff
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f42ff920240
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b

Not sure if you are still pursuing this; anyway, let me know if you
think of any other change to try, and I'll keep the environment around.

-Tinoco

On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> >          llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > -                csd->func(csd->info);
> > +                smp_call_func_t func = csd->func;
> > +                void *info = csd->info;
> >                  csd_unlock(csd);
> > +
> > +                func(info);
>
> No, that won't work for synchronous calls:
>
> CPU 0                          CPU 1
>
> csd_lock(csd);
> queue_csd();
> ipi();
>                                func = csd->func;
>                                info = csd->info;
>                                csd_unlock(csd);
> csd_lock_wait();
>                                func(info);
>
> The csd_lock_wait() side will succeed and therefore assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.

So a combo of both (Jens and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

CPU 0                          CPU 1

csd_lock(csd);
queue_csd();
ipi();
                               csd->func(csd->info);
wait_for_completion(csd);
                                 complete(csd);
reuse_csd(csd);
                               csd_unlock(csd);

Thanks,

tglx

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_
 
 static void csd_unlock(struct call_single_data *csd)
 {
-        WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+        WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
 
         /*
          * ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
         }
 
         llist_for_each_entry_safe(csd, csd_next, entry, llist) {
-                csd->func(csd->info);
-                csd_unlock(csd);
+
+                /*
+                 * For synchronous calls we are not allowed to unlock
+                 * before the callback returned. For the async case
+                 * its the responsibility of the caller to keep
+                 * csd->info consistent while the callback runs.
+                 */
+                if (csd->flags & CSD_FLAG_WAIT) {
+                        csd->func(csd->info);
+                        csd_unlock(csd);
+                } else {
+                        smp_call_func_t func = csd->func;
+                        void *info = csd->info;
+
+                        csd_unlock(csd);
+                        func(info);
+                }
         }
 
         /*


2015-02-11 18:18:47

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco <[email protected]> wrote:
>
> - After applying the patch provided by Thomas we were able to cause the
> lockup only after 6 days (also locked inside
> smp_call_function_single). Test performance (even for a nested kvm)
> was reduced substantially with 3.19 + this patch.

I think that just means that the patch from Thomas doesn't change
anything - the reason it takes longer to lock up is just that
performance reduction, so whatever race it is that causes the problem
was just harder to hit, but not fundamentally affected.

I think a more interesting thing to get is the traces from the other
CPU's when this happens. In a virtualized environment, that might be
easier to get than on real hardware, and if you are able to reproduce
this at will - especially with something recent like 3.19, and could
get that, that would be really good.

I'll think about this all, but we couldn't figure anything out last
time we looked at it, so without more clues, don't hold your breath.

That said, it *would* be good if we could get rid of the synchronous
behavior entirely, and make it a rule that if somebody wants to wait
for it, they'll have to do their own waiting. Because I still think
that that CSD_FLAG_WAIT is pure and utter garbage. And I think that
Jens said that it is probably bogus to begin with.

I also don't even see where the CSD_FLAG_WAIT bit would ever be
cleared, so it all looks completely buggy anyway.
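
Purely as an illustration of that "do your own waiting" rule (a made-up
sketch, not the attached patch; all the my_* names are hypothetical):
the caller embeds a completion in the data it passes, fires an async
single-CPU call, and waits on the completion itself rather than on any
CSD_FLAG_WAIT handling inside smp.c:

#include <linux/smp.h>
#include <linux/completion.h>

/* Hypothetical per-call argument, owned entirely by the caller. */
struct my_call_arg {
        struct completion done;
        void *data;
};

/* Runs on the target CPU and signals the caller when it is finished. */
static void my_remote_func(void *info)
{
        struct my_call_arg *arg = info;

        /* ... do the real work with arg->data here ... */
        complete(&arg->done);
}

/* Caller side: no CSD_FLAG_WAIT, the waiting is explicit. */
static void my_call_on_cpu_and_wait(int cpu, struct call_single_data *csd, void *data)
{
        struct my_call_arg arg = { .data = data };

        init_completion(&arg.done);
        csd->func = my_remote_func;
        csd->info = &arg;
        smp_call_function_single_async(cpu, csd);
        wait_for_completion(&arg.done);
}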

Does this (COMPLETELY UNTESTED!) attached patch change anything?

Linus


Attachments:
patch.diff (1.42 kB)

2015-02-11 19:59:09

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 10:18 AM, Linus Torvalds
<[email protected]> wrote:
>
> I'll think about this all, but we couldn't figure anything out last
> time we looked at it, so without more clues, don't hold your breath.

So having looked at it once more, one thing struck me:

Look at smp_call_function_single_async(). The comment says

* Like smp_call_function_single(), but the call is asynchonous and
* can thus be done from contexts with disabled interrupts.

but that is *only* true if we don't have to wait for the csd lock. The
comments even clarify that:

* The caller passes his own pre-allocated data structure
* (ie: embedded in an object) and is responsible for synchronizing it
* such that the IPIs performed on the @csd are strictly serialized.

but it's not at all clear that the caller *can* do that. Since the
"csd_unlock()" is done *after* the call to the callback function, any
serialization done by the caller is fundamentally not trustworthy,
since it cannot serialize with the csd lock - if it releases things in
the callback, the csd lock will still be set after releasing things.

So the caller has a really hard time guaranteeing that CSD_LOCK isn't
set. And if the call is done in interrupt context, for all we know it
is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
will never be cleared at all, and csd_lock() will wait forever.

So I actually think that for the async case, we really *should* unlock
before doing the callback (which is what Thomas' old patch did).

And we might well be better off doing something like

WARN_ON_ONCE(csd->flags & CSD_LOCK);

in smp_call_function_single_async(), because that really is a hard requirement.
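
A minimal sketch of what that check could look like (illustrative only,
not the attached patch, and the body is simplified):

int smp_call_function_single_async(int cpu, struct call_single_data *csd)
{
        int err;

        preempt_disable();

        /*
         * Hard requirement: the caller must never hand us a csd that is
         * still locked, i.e. still in flight from an earlier call.
         */
        WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK);

        err = generic_exec_single(cpu, csd, csd->func, csd->info, 0);
        preempt_enable();

        return err;
}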

And it strikes me that hrtick_csd is one of these cases that do this
with interrupts disabled, and use the callback for serialization. So I
really wonder if this is part of the problem..

Thomas? Am I missing something?

Linus

2015-02-11 20:42:14

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

[ Added Frederic to the cc, since he's touched this file/area most ]

On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
<[email protected]> wrote:
>
> So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> set. And if the call is done in interrupt context, for all we know it
> is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> will never be cleared at all, and csd_lock() will wait forever.
>
> So I actually think that for the async case, we really *should* unlock
> before doing the callback (which is what Thomas' old patch did).
>
> And we might well be better off doing something like
>
> WARN_ON_ONCE(csd->flags & CSD_LOCK);
>
> in smp_call_function_single_async(), because that really is a hard requirement.
>
> And it strikes me that hrtick_csd is one of these cases that do this
> with interrupts disabled, and use the callback for serialization. So I
> really wonder if this is part of the problem..
>
> Thomas? Am I missing something?

Ok, this is a more involved patch than I'd like, but making the
*caller* do all the CSD maintenance actually cleans things up.

And this is still completely untested, and may be entirely buggy. What
do you guys think?

Linus


Attachments:
patch.diff (4.59 kB)

2015-02-12 16:39:04

by Rafael David Tinoco

Subject: Re: smp_call_function_single lockups

Meanwhile we'll take the opportunity to run the same tests with the
"smp_load_acquire/smp_store_release + outside sync/async" approach
from your latest patch on top of 3.19. If anything comes up I'll
provide full backtraces (2 vcpus).

Here I can only reproduce this inside nested KVM on top of ProLiant
DL360 Gen8 machines with:

- no opt-out from x2apic (the Gen8 firmware asks to opt out, but HP
says x2apic should be used for Gen8 and later)
- no intel_idle, since the ProLiant firmware causes NMIs during MWAIT
instructions

As observed before, the reduced performance meant the problem was only
triggered after some days, so if nothing goes wrong with performance
this time I expect to have results within 10 to 30 hours.

Thank you

Tinoco

On Wed, Feb 11, 2015 at 6:42 PM, Linus Torvalds
<[email protected]> wrote:
>
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> > WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
> Linus

2015-02-18 22:25:56

by Peter Zijlstra

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

I think it makes perfect sense.

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2015-02-19 15:42:43

by Rafael David Tinoco

Subject: Re: smp_call_function_single lockups

Linus, Peter, Thomas

Just a quick feedback: we were able to reproduce the lockup with this
proposed patch (3.19 + patch). Unfortunately we had problems with the
core file and I have only the stack trace for now, but I think we are
able to reproduce it again and provide more details (sorry for the
delay... after a reboot it took some days for us to reproduce this
again).

It looks like RIP is still smp_call_function_single.

Same environment as before: Nested KVM (2 vcpus) on top of Proliant
DL380G8 with acpi_idle and no x2apic optout.

[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
E 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013] task: ffff88081b9beca0 ti: ffff88081a7a0000 task.ti:
ffff88081a7a0000
[47708.068013] RIP: 0010:[<ffffffff810f537a>] [<ffffffff810f537a>]
smp_call_function_single+0xca/0x120
[47708.068013] RSP: 0018:ffff88081a7a3b38 EFLAGS: 00000202
[47708.068013] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
[47708.068013] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000296
[47708.068013] RBP: ffff88081a7a3b78 R08: ffffffff81815168 R09: ffff880818192000
[47708.068013] R10: 000000000000bdf6 R11: 000000000001bf90 R12: 00080000810b66f8
[47708.068013] R13: 00000000000000fb R14: 0000000000000296 R15: 0000000000000000
[47708.068013] FS: 00007fa143fff700(0000) GS:ffff88083fc00000(0000)
knlGS:0000000000000000
[47708.068013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47708.068013] CR2: 00007f5d76f5d050 CR3: 00000008190cc000 CR4: 00000000000426f0
[47708.068013] Stack:
[47708.068013] ffff88083fd151b8 0000000000000001 0000000000000000
ffffffffc0589320
[47708.068013] ffff88081a547a80 0000000000000003 ffff88081a543f80
0000000000000000
[47708.068013] ffff88081a7a3b88 ffffffffc0586097 ffff88081a7a3bc8
ffffffffc058aefe
[47708.068013] Call Trace:
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013] [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013] [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013] [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013] [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013] [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013] [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013] [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013] [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013] [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013] [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013] [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013] [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013] [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013] [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013] [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013] [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013] [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b
[47708.068013] Code: 30 5b 41 5c 5d c3 0f 1f 00 48 8d 75 d0 48 89 d1
89 df 4c 89 e2 e8 57 fe ff ff 0f b7 55 e8 83 e2 01 74 da 66 0f 1f 44
00 00 f3 90 <0f> b7 55 e8 83 e2 01 75 f5 eb c7 0f 1f 00 8b 05 ca e6 dd
00 85
[47708.068013] Kernel panic - not syncing: softlockup: hung tasks
[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
EL 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013] ffff88081b9beca0 ffff88083fc03de8 ffffffff817a6bf6
0000000000000000
[47708.068013] ffffffff81ab30d4 ffff88083fc03e68 ffffffff817a1aec
0000000000000e92
[47708.068013] 0000000000000008 ffff88083fc03e78 ffff88083fc03e18
ffff88083fc03e68
[47708.068013] Call Trace:
[47708.068013] <IRQ> [<ffffffff817a6bf6>] dump_stack+0x45/0x57
[47708.068013] [<ffffffff817a1aec>] panic+0xc1/0x1f5
[47708.068013] [<ffffffff8112ba0b>] watchdog_timer_fn+0x1db/0x1f0
[47708.068013] [<ffffffff810e0e37>] __run_hrtimer+0x77/0x1d0
[47708.068013] [<ffffffff8112b830>] ? watchdog+0x30/0x30
[47708.068013] [<ffffffff810e1203>] hrtimer_interrupt+0xf3/0x220
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffff8104b0a9>] local_apic_timer_interrupt+0x39/0x60
[47708.068013] [<ffffffff817b1fb5>] smp_apic_timer_interrupt+0x45/0x60
[47708.068013] [<ffffffff817b002d>] apic_timer_interrupt+0x6d/0x80
[47708.068013] <EOI> [<ffffffff810f537a>] ?
smp_call_function_single+0xca/0x120
[47708.068013] [<ffffffff810f5369>] ? smp_call_function_single+0xb9/0x120
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013] [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013] [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013] [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013] [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013] [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013] [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013] [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013] [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013] [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013] [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013] [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013] [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013] [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013] [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013] [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013] [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013] [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b

Tks
Rafael Tinoco

On Wed, Feb 18, 2015 at 8:25 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
>> Ok, this is a more involved patch than I'd like, but making the
>> *caller* do all the CSD maintenance actually cleans things up.
>>
>> And this is still completely untested, and may be entirely buggy. What
>> do you guys think?
>
> I think it makes perfect sense.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

2015-02-19 16:14:48

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>
> Just a quick feedback: we were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now, but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
>
> It looks like RIP is still smp_call_function_single.

Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if
you can get the core-file to work and figure out where the other CPU
is, that would be good.

Linus

2015-02-19 16:16:15

by Peter Zijlstra

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote:
> Linus, Peter, Thomas
>
> Just a quick feedback: we were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now, but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
>
> It looks like RIP is still smp_call_function_single.

So Linus' patch mostly fixes smp_call_function_single_async() which is
not what you're using.

It would be very good to see traces of other CPUs; if for some reason
the target CPU doesn't get around to running your callback, then we'll
forever wait on it.

loaded_vmcs_clear() uses smp_call_function_single(.wait = 1), that
should work as before.
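
For reference, loaded_vmcs_clear() in arch/x86/kvm/vmx.c is roughly the
following; the final argument is the wait flag:

static void loaded_vmcs_clear(struct loaded_vmcs *loaded_vmcs)
{
        int cpu = loaded_vmcs->cpu;

        if (cpu != -1)
                smp_call_function_single(cpu,
                         __loaded_vmcs_clear, loaded_vmcs, 1);
}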

2015-02-19 16:26:25

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>
> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
> DL380G8 with acpi_idle and no x2apic optout.

Btw, which apic model does that end up using? Does "no x2apic optout"
mean you're using the x2apic?

What does "dmesg | grep apic" report? Something like

Switched APIC routing to cluster x2apic.

or what?

Side note to the apic guys: I think the "single CPU" case ends up
being one of the most important ones, but the stupid APIC model
doesn't allow that, so sending an IPI to a single CPU ends up being
"send a mask with a single bit set", and then we have that horrible
"for_each_cpu(cpu, mask)" crap.

Would it make sense to perhaps add a "send_IPI_single()" function
call, and then for the APIC models that always are based on masks, use
a wrapper that just does that "cpumask_of(cpu)" thing..

Hmm?

Linus

2015-02-19 16:32:10

by Rafael David Tinoco

Subject: Re: smp_call_function_single lockups

For the host, we are using "intremap=no_x2apic_optout
intel_idle.max_cstate=0" on the cmdline. It looks like the DL360/DL380
Gen8 firmware still asks to opt out of x2apic, but the HP engineering
team said that using x2apic on Gen8 would be OK (intel_idle causes
these servers to generate NMIs when idling, probably related to package
C-states and this server's dependency on ACPI tables for C-states).

Feb 19 08:21:28 derain kernel: [ 3.504676] Enabled IRQ remapping in
x2apic mode
Feb 19 08:21:28 derain kernel: [ 3.565451] Enabling x2apic
Feb 19 08:21:28 derain kernel: [ 3.602134] Enabled x2apic
Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
cluster x2apic.

On Thu, Feb 19, 2015 at 2:26 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>>
>> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
>> DL380G8 with acpi_idle and no x2apic optout.
>
> Btw, which apic model does that end up using? Does "no x2apic optout"
> mean you're using the x2apic?
>
> What does "dmesg | grep apic" report? Something like
>
> Switched APIC routing to cluster x2apic.
>
> or what?
>
> Side note to the apic guys: I think the "single CPU" case ends up
> being one of the most important ones, but the stupid APIC model
> doesn't allow that, so sending an IPI to a single CPU ends up being
> "send a mask with a single bit set", and then we have that horrible
> "for_each_cpu(cpu, mask)" crap.
>
> Would it make sense to perhaps add a "send_IPI_single()" function
> call, and then for the APIC models that always are based on masks, use
> a wrapper that just does that "cpumask_of(cpu)" thing..
>
> Hmm?
>
> Linus

2015-02-19 16:59:19

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <[email protected]> wrote:
> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
> cluster x2apic.

Ok. That "cluster x2apic" mode is just about the nastiest mode when it
comes to sending a single ipi. We do that insane dance where we

- turn single cpu number into cpumask
- copy the cpumask to a percpu temporary storage
- walk each cpu in the cpumask
- for each cpu, look up the cluster siblings
- for each cluster sibling that is also in the cpumask, look up the
logical apic mask and add it to the actual ipi destination mask
- send an ipi to that final mask.

which is just insane. It's complicated, it's fragile, and it's unnecessary.

If we had a simple "send_IPI()" function, we could do this all with
something much saner, and it would look something like

static void x2apic_send_IPI(int cpu, int vector)
{
        u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
        x2apic_wrmsr_fence();
        __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
}

and then 'void native_send_call_func_single_ipi()' would just look like

void native_send_call_func_single_ipi(int cpu)
{
        apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
}

but I might have missed something (and we might want to have a wrapper
that says "if the apic doesn't have a 'send_IPI' function, use
send_IPI_mask(cpumask_of(cpu), vector) instead").

The fact that you need that no_x2apic_optout (which in turn means that
your ACPI tables seem to say "don't use x2apic") also makes me worry.

Are there known errata for the x2apic?

Linus

2015-02-19 17:30:46

by Rafael David Tinoco

Subject: Re: smp_call_function_single lockups

I could only find an advisory (regarding SR-IOV and interrupt
remapping) from HP for RHEL 6.2 users stating that the Gen8 firmware
does not enable it by default.

http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796

"""
The interrupt remapping capability depends on x2apic enabled in the
BIOS and HP ProLiant Gen8 systems do not enable x2apic; therefore, the
following workaround is required for device assignment: Edit the
/etc/grub.conf and add intremap=no_x2apic_optout option to the kernel
command line options.
"""

Probably for backwards compatibility... I'm not sure if there is an
option to enable/disable it in firmware (like the DL390 seems to have).
I don't think so... but I was told by the HP team that I should use
x2apic for Gen8 and later.


On Thu, Feb 19, 2015 at 2:59 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <[email protected]> wrote:
>> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
>> cluster x2apic.
>
> Ok. That "cluster x2apic" mode is just about the nastiest mode when it
> comes to sending a single ipi. We do that insane dance where we
>
> - turn single cpu number into cpumask
> - copy the cpumask to a percpu temporary storage
> - walk each cpu in the cpumask
> - for each cpu, look up the cluster siblings
> - for each cluster sibling that is also in the cpumask, look up the
> logical apic mask and add it to the actual ipi destination mask
> - send an ipi to that final mask.
>
> which is just insane. It's complicated, it's fragile, and it's unnecessary.
>
> If we had a simple "send_IPI()" function, we could do this all with
> something much saner, and it would look something like
>
> static void x2apic_send_IPI(int cpu, int vector)
> {
>         u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
>         x2apic_wrmsr_fence();
>         __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
> }
>
> and then 'void native_send_call_func_single_ipi()' would just look like
>
> void native_send_call_func_single_ipi(int cpu)
> {
>         apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
> }
>
> but I might have missed something (and we might want to have a wrapper
> that says "if the apic doesn't have a 'send_IPI' function, use
> send_IPI_mask(cpumask_of(cpu), vector) instead").
>
> The fact that you need that no_x2apic_optout (which in turn means that
> your ACPI tables seem to say "don't use x2apic") also makes me worry.
>
> Are there known errata for the x2apic?
>
> Linus

2015-02-19 17:39:20

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
<[email protected]> wrote:
>
> Are there known errata for the x2apic?

.. and in particular, do we still have to worry about the traditional
local apic "if there are more than two pending interrupts per priority
level, things get lost" problem?

I forget the exact details. Hopefully somebody remembers.

Linus

2015-02-19 20:29:37

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Are there known errata for the x2apic?
>
> .. and in particular, do we still have to worry about the traditional
> local apic "if there are more than two pending interrupts per priority
> level, things get lost" problem?
>
> I forget the exact details. Hopefully somebody remembers.

I can't find it in the docs. I find the "two-entries per vector", but
not anything that is per priority level (group of 16 vectors). Maybe
that was the IO-APIC, in which case it's immaterial for IPI's.

However, having now mostly re-acquainted myself with the APIC details,
it strikes me that we do have some oddities here.

In particular, a few interrupt types are very special: NMI, SMI, INIT,
ExtINT, or SIPI are handled early in the interrupt acceptance logic,
and are sent directly to the CPU core, without going through the usual
intermediate IRR/ISR dance.

And why might this matter? It's important because it means that those
kinds of interrupts must *not* do the apic EOI that ack_APIC_irq()
does.

And we correctly don't do ack_APIC_irq() for NMI etc, but it strikes
me that ExtINT is odd and special.

I think we still use ExtINT for some odd cases. We used to have some
magic with the legacy timer interrupt, for example. And I think they
all go through the normal "do_IRQ()" logic regardless of whether they
are ExtINT or not.

Now, what happens if we send an EOI for an ExtINT interrupt? It
basically ends up being a spurious IPI. And I *think* that what
normally happens is absolutely nothing at all. But if in addition to
the ExtINT, there was a pending IPI (or other pending ISR bit set),
maybe we lose interrupts..

.. and it's entirely possible that I'm just completely full of shit.
Who is the poor bastard who has worked most with things like ExtINT,
and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
contacts..

Linus

2015-02-19 21:59:49

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds
<[email protected]> wrote:
>
> Now, what happens if we send an EOI for an ExtINT interrupt? It
> basically ends up being a spurious IPI. And I *think* that what
> normally happens is absolutely nothing at all. But if in addition to
> the ExtINT, there was a pending IPI (or other pending ISR bit set),
> maybe we lose interrupts..
>
> .. and it's entirely possible that I'm just completely full of shit.
> Who is the poor bastard who has worked most with things like ExtINT,
> and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
> contacts..

So quite frankly, trying to follow all the logic from do_IRQ() through
handle_irq() to the actual low-level handler, I just couldn't do it.

So instead, I wrote a patch to verify that the ISR bit is actually set
when we do ack_APIC_irq().

This was complicated by the fact that we don't actually pass in the
vector number at all to the acking, so 99% of the patch is just doing
that. A couple of places we don't really have a good vector number, so
I said "screw it, a negative value means that we won't check the ISR".

The attached patch is quite possibly garbage, but it gives an
interesting warning for me during i8042 probing, so who knows. Maybe
it actually shows a real problem - or maybe I just screwed up the
patch.

.. and maybe even if the patch is fine, it's actually never really a
problem to have spurious APIC ACK cycles. Maybe it cannot make
interrupts be ignored.

Anyway, the back-trace for the warning I get is during boot:

...
PNP: No PS/2 controller found. Probing ports directly.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436
ir_ack_apic_edge+0x74/0x80()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted
3.19.0-08857-g89d3fa45b4ad-dirty #2
Call Trace:
<IRQ>
dump_stack+0x45/0x57
warn_slowpath_common+0x80/0xc0
warn_slowpath_null+0x15/0x20
ir_ack_apic_edge+0x74/0x80
handle_edge_irq+0x51/0x110
handle_irq+0x74/0x140
do_IRQ+0x4a/0x140
common_interrupt+0x6a/0x6a
<EOI>
? _raw_spin_unlock_irqrestore+0x9/0x10
__setup_irq+0x239/0x5a0
request_threaded_irq+0xc2/0x180
i8042_probe+0x5b8/0x680
platform_drv_probe+0x2f/0xa0
driver_probe_device+0x8b/0x3e0
__driver_attach+0x93/0xa0
bus_for_each_dev+0x63/0xa0
driver_attach+0x19/0x20
bus_add_driver+0x178/0x250
driver_register+0x5f/0xf0
__platform_driver_register+0x45/0x50
__platform_driver_probe+0x26/0xa0
__platform_create_bundle+0xad/0xe0
i8042_init+0x3d0/0x3f6
do_one_initcall+0xb8/0x1d0
kernel_init_freeable+0x16d/0x1fa
kernel_init+0x9/0xf0
ret_from_fork+0x7c/0xb0
---[ end trace 1de82c4457c6a0f0 ]---
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
...

and it looks not entirely insane.

Is this worth looking at? Or is it something spurious? I might have
gotten the vectors wrong, and maybe the warning is not because the ISR
bit isn't set, but because I test the wrong bit.

Linus


Attachments:
patch.diff (12.74 kB)

2015-02-19 22:45:56

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
<[email protected]> wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than
"once". And tested it some more. It seems to work correctly. The irq
case during 8042 probing is not repeatable, and I suspect it happens
because the interrupt source goes away (some probe-time thing that
first triggers an interrupt, but then clears it itself), so it doesn't
happen every boot, and I've gotten it with slightly different
backtraces.

But it's the only warning that happens for me, so I think my code is
right (at least for the cases that trigger on this machine). It's
definitely not a "every interrupt causes the warning because the code
was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases
too. In particular, the people seeing the IPI lockup. Because a lot of
the lockups we've seen have *looked* like the IPI interrupt just never
happened, and so we're waiting forever for the target CPU to react to
it. And just maybe the spurious EOI could cause the wrong bit to be
cleared in the ISR, and then the interrupt never shows up. Something
like that would certainly explain why it only happens on some machines
and under certain timing circumstances.

Linus

2015-02-20 09:30:09

by Ingo Molnar

Subject: Re: smp_call_function_single lockups


* Linus Torvalds <[email protected]> wrote:

> On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds
> <[email protected]> wrote:
> > On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
> > <[email protected]> wrote:
> >>
> >> Are there known errata for the x2apic?
> >
> > .. and in particular, do we still have to worry about
> > the traditional local apic "if there are more than two
> > pending interrupts per priority level, things get lost"
> > problem?
> >
> > I forget the exact details. Hopefully somebody
> > remembers.
>
> I can't find it in the docs. I find the "two-entries per
> vector", but not anything that is per priority level
> (group of 16 vectors). Maybe that was the IO-APIC, in
> which case it's immaterial for IPI's.

So if my memory serves me right, I think it was for local
APICs, and even there mostly it was a performance issue: if
an IO-APIC sent more than 2 IRQs per 'level' to a local
APIC then the IO-APIC might be forced to resend those IRQs,
leading to excessive message traffic on the relevant
hardware bus.

( I think the 'resend' was automatic in this case, i.e. a
hardware fallback for a CPU side resource shortage, and
it could not result in actually lost IRQs. I never saw
this documented properly, so people inside Intel or AMD
would be in a better position to comment on this ... I
might be mis-remembering this or confusing different
bugs. )

> However, having now mostly re-acquainted myself with the
> APIC details, it strikes me that we do have some oddities
> here.
>
> In particular, a few interrupt types are very special:
> NMI, SMI, INIT, ExtINT, or SIPI are handled early in the
> interrupt acceptance logic, and are sent directly to the
> CPU core, without going through the usual intermediate
> IRR/ISR dance.
>
> And why might this matter? It's important because it
> means that those kinds of interrupts must *not* do the
> apic EOI that ack_APIC_irq() does.
>
> And we correctly don't do ack_APIC_irq() for NMI etc, but
> it strikes me that ExtINT is odd and special.
>
> I think we still use ExtINT for some odd cases. We used
> to have some magic with the legacy timer interrupt, for
> example. And I think they all go through the normal
> "do_IRQ()" logic regardless of whether they are ExtINT or
> not.
>
> Now, what happens if we send an EOI for an ExtINT
> interrupt? It basically ends up being a spurious IPI. And
> I *think* that what normally happens is absolutely
> nothing at all. But if in addition to the ExtINT, there
> was a pending IPI (or other pending ISR bit set), maybe
> we lose interrupts..

1)

I think you got it right.

So the principle of EOI acknowledgement from the OS to the
local APIC is specific to the IRQ that raised the interrupt
and caused the vector to be executed, so it's not possible
to ack the 'wrong' IRQ.

But technically the EOI is state-less, i.e. (as you know)
we write a constant value to a local APIC register without
indicating which vector or external IRQ we meant. The OS
wants to ack 'the IRQ that we are executing currently', but
this leaves the situation a bit confused in cases where for
example an IRQ handler enables IRQs, another IRQ comes in
and stays unacked.

So I _think_ it's not possible to accidentally acknowledge
a pending IRQ that has not been issued to the CPU yet
(unless we have hardirqs enabled), just by writing stray
EOIs to the local APIC. So in that sense the ExtInt irq0
case should be mostly harmless.

But I could be wrong :-/

2)

So my suggestion for this bug would be:

The 'does a stray EOI matter' question could also be tested
by deliberately writing two EOIs instead of just one - does
this trigger the bug faster?
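
( Something as crude as this would do, purely as a test hack and
  only for the experiment above: )

static inline void ack_APIC_irq(void)
{
        apic_eoi();
        apic_eoi();     /* deliberate second, stray EOI, for testing only */
}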

Then perhaps try to make sure that no hardirqs get ever
enabled in an irq handler, and figure out whether any of
the IRQs in question are edge triggered - but AFAICS it
could be 'any' IRQ handler or flow causing the problem,
right?

3)

I also fully share your frustration about the level of
obfuscation the various APIC drivers display today.

The lack of a simple single-IPI implementation is annoying
as well - when that injury was first inflicted with
clustered APICs I tried to resist, but AFAICR there were
some good hardware arguments why it cannot be kept and I
gave up.

If you agree then I can declare a feature stop for new
hardware support (that isn't a stop-ship issue for users)
until it's all cleaned up for real, and Thomas started some
of that work already.

> .. and it's entirely possible that I'm just completely
> full of shit. Who is the poor bastard who has worked most
> with things like ExtINT, and can educate me? I'm adding
> Ingo, hpa and Jiang Liu as primary contacts..

So the buck stops at my desk, but any help is welcome!

Thanks,

Ingo

2015-02-20 16:49:41

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <[email protected]> wrote:
>
> So if my memory serves me right, I think it was for local
> APICs, and even there mostly it was a performance issue: if
> an IO-APIC sent more than 2 IRQs per 'level' to a local
> APIC then the IO-APIC might be forced to resend those IRQs,
> leading to excessive message traffic on the relevant
> hardware bus.

Hmm. I have a distinct memory of interrupts actually being lost, but I
really can't find anything to support that memory, so it's probably
some drug-induced confusion of mine. I don't find *anything* about
interrupt "levels" any more in modern Intel documentation on the APIC,
but maybe I missed something. But it might all have been an IO-APIC
thing.

> I think you got it right.
>
> So the principle of EOI acknowledgement from the OS to the
> local APIC is specific to the IRQ that raised the interrupt
> and caused the vector to be executed, so it's not possible
> to ack the 'wrong' IRQ.

I agree - but only for cases where we actually should ACK in the first
place. The LAPIC knows what its own priorities were, so it always just
clears the highest-priority ISR bit.

> So I _think_ it's not possible to accidentally acknowledge
> a pending IRQ that has not been issued to the CPU yet
> (unless we have hardirqs enabled), just by writing stray
> EOIs to the local APIC. So in that sense the ExtInt irq0
> case should be mostly harmless.

It must clearly be mostly harmless even if I'm right, since we've
always done the ACK for it, as far as I can tell. So you're probably
right, and it doesn't matter.

In particular, it would depend on exactly when the ISR bit actually
gets set in the APIC. If it gets set only after the CPU has actually
*accepted* the interrupt, then it should not be possible for a
spurious EOI write to screw things up, because it's still happening in
the context of the previously executing interrupt, so a new ISR bit
hasn't been set yet, and any old ISR bits must be lower-priority than
the currently executing interrupt routine (that were interrupted by a
higher-priority one).

That sounds like the likely scenario. It just worries me. And my test
patch did actually trigger, even if it only seemed to trigger during
device probing (when the probing itself probably caused and then
cleared an interrupt).

> I also fully share your frustration about the level of
> obfuscation the various APIC drivers display today.
>
> The lack of a simple single-IPI implementation is annoying
> as well - when that injury was first inflicted with
> clustered APICs I tried to resist, but AFAICR there were
> some good hardware arguments why it cannot be kept and I
> gave up.
>
> If you agree then I can declare a feature stop for new
> hardware support (that isn't a stop-ship issue for users)
> until it's all cleaned up for real, and Thomas started some
> of that work already.

Well, the attached patch for that seems pretty trivial. And seems to
work for me (my machine also defaults to x2apic clustered mode), and
allows the APIC code to start doing a "send to specific cpu" thing one
by one, since it falls back to the send_IPI_mask() function if no
individual CPU IPI function exists.

NOTE! There's a few cases in arch/x86/kernel/apic/vector.c that also
do that "apic->send_IPI_mask(cpumask_of(i), .." thing, but they aren't
that important, so I didn't bother with them.

NOTE2! I've tested this, and it seems to work, but maybe there is
something seriously wrong. I skipped the "disable interrupts" part
when doing the "send_IPI", for example, because I think it's entirely
unnecessary for that case. But this has certainly *not* gotten any
real stress-testing.

But /proc/interrupts shows thousands of rescheduling IPI's, so this
should all have triggered.

Linus


Attachments:
patch.diff (2.73 kB)

2015-02-20 19:41:21

by Ingo Molnar

Subject: Re: smp_call_function_single lockups


* Linus Torvalds <[email protected]> wrote:

> On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <[email protected]> wrote:
> >
> > So if my memory serves me right, I think it was for
> > local APICs, and even there mostly it was a performance
> > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > to a local APIC then the IO-APIC might be forced to
> > resend those IRQs, leading to excessive message traffic
> > on the relevant hardware bus.
>
> Hmm. I have a distinct memory of interrupts actually
> being lost, but I really can't find anything to support
> that memory, so it's probably some drug-induced confusion
> of mine. I don't find *anything* about interrupt "levels"
> any more in modern Intel documentation on the APIC, but
> maybe I missed something. But it might all have been an
> IO-APIC thing.

So I just found an older discussion of it:

http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815

while it's not a comprehensive description, it matches what
I remember from it: with 3 vectors within a level of 16
vectors we'd get excessive "retries" sent by the IO-APIC
through the (then rather slow) APIC bus.

( It was possible for the same phenomenon to occur with
IPIs as well, when a CPU sent an APIC message to another
CPU, if the affected vectors were equal modulo 16 - but
this was rare IIRC because most systems were dual CPU so
only two IPIs could have occurred. )

> Well, the attached patch for that seems pretty trivial.
> And seems to work for me (my machine also defaults to
> x2apic clustered mode), and allows the APIC code to start
> doing a "send to specific cpu" thing one by one, since it
> falls back to the send_IPI_mask() function if no
> individual CPU IPI function exists.
>
> NOTE! There's a few cases in
> arch/x86/kernel/apic/vector.c that also do that
> "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> aren't that important, so I didn't bother with them.
>
> NOTE2! I've tested this, and it seems to work, but maybe
> there is something seriously wrong. I skipped the
> "disable interrupts" part when doing the "send_IPI", for
> example, because I think it's entirely unnecessary for
> that case. But this has certainly *not* gotten any real
> stress-testing.

I'm not so sure about that aspect: I think disabling IRQs
might be necessary with some APICs (if lower levels don't
disable IRQs), to make sure the 'local APIC busy' bit isn't
set:

we typically do a wait_icr_idle() call before sending an
IPI - and if IRQs are not off then the idleness of the APIC
might be gone. (Because a hardirq that arrives after a
wait_icr_idle() but before the actual IPI sending sent out
an IPI and the queue is full.)

So the IPI sending should be atomic in that sense.

Thanks,

Ingo

2015-02-20 20:03:07

by Linus Torvalds

Subject: Re: smp_call_function_single lockups

On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <[email protected]> wrote:
>
> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:

Right. But afaik not for the x2apic case, which this is. The x2apic
doesn't even have a busy bit, and sending the IPI is a single write.

I agree that when doing other apic implementations, we may need to
guarantee atomicity for things like "wait for apic idle, then send the
ipi".

Linus

2015-02-20 20:11:29

by Ingo Molnar

Subject: Re: smp_call_function_single lockups


* Linus Torvalds <[email protected]> wrote:

> On Fri, Feb 20, 2015 at 11:41 AM, Ingo Molnar <[email protected]> wrote:
> >
> > I'm not so sure about that aspect: I think disabling
> > IRQs might be necessary with some APICs (if lower
> > levels don't disable IRQs), to make sure the 'local
> > APIC busy' bit isn't set:
>
> Right. But afaik not for the x2apic case, which this is.
> The x2apic doesn't even have a busy bit, and sending the
> IPI is a single write.

Ah, ok! Then the patch looks good to me.

( Originally we didn't wait for the ICR bit either, but
then it was added due to later erratas and was eventually
made an architectural requirement. )

> I agree that when doing other apic implementations, we
> may need to guarantee atomicity for things like "wait for
> apic idle, then send the ipi".

Yeah.

Thanks,

Ingo

2015-02-22 09:00:17

by Daniel J Blueman

Subject: Re: smp_call_function_single lockups

On Saturday, February 21, 2015 at 3:50:05 AM UTC+8, Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
> > On Fri, Feb 20, 2015 at 1:30 AM, Ingo Molnar <[email protected]> wrote:
> > >
> > > So if my memory serves me right, I think it was for
> > > local APICs, and even there mostly it was a performance
> > > issue: if an IO-APIC sent more than 2 IRQs per 'level'
> > > to a local APIC then the IO-APIC might be forced to
> > > resend those IRQs, leading to excessive message traffic
> > > on the relevant hardware bus.
> >
> > Hmm. I have a distinct memory of interrupts actually
> > being lost, but I really can't find anything to support
> > that memory, so it's probably some drug-induced confusion
> > of mine. I don't find *anything* about interrupt "levels"
> > any more in modern Intel documentation on the APIC, but
> > maybe I missed something. But it might all have been an
> > IO-APIC thing.
>
> So I just found an older discussion of it:
>
>
http://www.gossamer-threads.com/lists/linux/kernel/1554815?do=post_view_threaded#1554815
>
> while it's not a comprehensive description, it matches what
> I remember from it: with 3 vectors within a level of 16
> vectors we'd get excessive "retries" sent by the IO-APIC
> through the (then rather slow) APIC bus.
>
> ( It was possible for the same phenomenon to occur with
> IPIs as well, when a CPU sent an APIC message to another
> CPU, if the affected vectors were equal modulo 16 - but
> this was rare IIRC because most systems were dual CPU so
> only two IPIs could have occurred. )
>
> > Well, the attached patch for that seems pretty trivial.
> > And seems to work for me (my machine also defaults to
> > x2apic clustered mode), and allows the APIC code to start
> > doing a "send to specific cpu" thing one by one, since it
> > falls back to the send_IPI_mask() function if no
> > individual CPU IPI function exists.
> >
> > NOTE! There's a few cases in
> > arch/x86/kernel/apic/vector.c that also do that
> > "apic->send_IPI_mask(cpumask_of(i), .." thing, but they
> > aren't that important, so I didn't bother with them.
> >
> > NOTE2! I've tested this, and it seems to work, but maybe
> > there is something seriously wrong. I skipped the
> > "disable interrupts" part when doing the "send_IPI", for
> > example, because I think it's entirely unnecessary for
> > that case. But this has certainly *not* gotten any real
> > stress-testing.

> I'm not so sure about that aspect: I think disabling IRQs
> might be necessary with some APICs (if lower levels don't
> disable IRQs), to make sure the 'local APIC busy' bit isn't
> set:
>
> we typically do a wait_icr_idle() call before sending an
> IPI - and if IRQs are not off then the idleness of the APIC
> might be gone. (Because a hardirq that arrives after a
> wait_icr_idle() but before the actual IPI sending sent out
> an IPI and the queue is full.)

The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs are queued, so
the wait_icr_idle() polling is only necessary on PPro and older, and
maybe then to avoid delivery retry. This unnecessarily ties up the IPI
caller, so we bypass the polling in the Numachip APIC driver IPI-to-self
path.

On Linus's earlier point, with the large core counts on Numascale
systems, I previously implemented a shortcut to allow single IPIs to
bypass all the cpumask generation and walking; it's way down on my list,
but I'll see if I can generalise and present a patch series at some
point, if there's interest.

Dan

-- [1] Intel SDM 3, p10-30
http://www.intel.com/content/dam/www/public/us/en/documents/manuals/64-ia-32-architectures-software-developer-system-programming-manual-325384.pdf

If more than one interrupt is generated with the same vector number, the
local APIC can set the bit for the vector both in the IRR and the ISR.
This means that for the Pentium 4 and Intel Xeon processors, the IRR and
ISR can queue two interrupts for each interrupt vector: one in the IRR
and one in the ISR. Any additional interrupts issued for the same
interrupt vector are collapsed into the single bit in the IRR. For the
P6 family and Pentium processors, the IRR and ISR registers can queue no
more than two interrupts per interrupt vector and will reject other
interrupts that are received within the same vector.

-- [2] AMD Fam15h BKDG p470
http://support.amd.com/TechDocs/42301_15h_Mod_00h-0Fh_BKDG.pdf

DS: interrupt delivery status. Read-only. Reset: 0. In xAPIC mode this
bit is set to indicate that the interrupt has not yet been accepted by
the destination core(s). 0=Idle. 1=Send pending. Reserved in x2APIC
mode. Software may repeatedly write ICRL without polling the DS bit; all
requested IPIs will be delivered.

2015-02-22 10:37:17

by Ingo Molnar

Subject: Re: smp_call_function_single lockups


* Daniel J Blueman <[email protected]> wrote:

> The Intel SDM [1] and AMD F15h BKDG [2] state that IPIs
> are queued, so the wait_icr_idle() polling is only
> necessary on PPro and older, and maybe then to avoid
> delivery retry. This unnecessarily ties up the IPI
> caller, so we bypass the polling in the Numachip APIC
> driver IPI-to-self path.

It would be nice to propagate this back to the generic x86
code.

> On Linus's earlier point, with the large core counts on
> Numascale systems, I previously implemented a shortcut to
> allow single IPIs to bypass all the cpumask generation
> and walking; it's way down on my list, but I'll see if I
> can generalise and present a patch series at some point
> if interested?

I am definitely interested!

Thanks,

Ingo

2015-02-23 14:01:12

by Rafael David Tinoco

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 2:14 PM, Linus Torvalds
<[email protected]> wrote:

>
> Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if
> you can get the core-file to work and figure out where the other CPU
> is, that would be good.
>

This is v3.19 + your patch (smp acquire/release)
- (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
and acpi_idle)

* http://people.canonical.com/~inaddy/lp1413540/STACK_TRACE.txt
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
http://people.canonical.com/~inaddy/lp1413540/FOREACH_BT.txt

* It looks like we got locked because of reentrant flush_tlb_* through
smp_call_*
but I'll leave it to you. Let me know if you need any more info.

-
Rafael Tinoco

2015-02-23 19:32:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <[email protected]> wrote:
>
> This is v3.19 + your patch (smp acquire/release)
> - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
> and acpi_idle)

Hmm. There is absolutely nothing else going on on that machine, except
for the single call to smp_call_function_single() that is waiting for
the CSD to be released.

> * It looks like we got locked because of reentrant flush_tlb_* through
> smp_call_*
> but I'll leave it to you.

No, that is all a perfectly regular callchain:

.. native_flush_tlb_others -> smp_call_function_many ->
smp_call_function_single

but the stack contains some stale addresses (one is probably just from
smp_call_function_single() calling into "generic_exec_single()", and
thus the stack contains the return address inside
smp_call_function_single() in _addition_ to the actual place where the
watchdog timer then interrupted it).

It all really looks very regular and sane, and looks like
smp_call_function_single() is happily just waiting for the IPI to
finish in the (inlined) csd_lock_wait().

I see nothing wrong at all.

However, here's a patch to the actual x86
smp_call_function_single_interrupt() handler, which probably doesn't
make any difference at all, but does:

- gets rid of the incredibly crazy and ugly smp_entering_irq() that
inexplicably pairs with irq_exit()

- makes the SMP IPI functions do the APIC ACK cycle at the *end*
rather than at the beginning.

The exact placement of the ACK should make absolutely no difference,
since interrupts should be disabled for the whole period anyway. But
I'm desperate and am flailing around worrying about x2apic bugs and
whatnot in addition to whatever bugs we might have in this area in sw.

So maybe there is some timing thing. This makes us consistently ack
the IPI _after_ we have cleared the list of smp call actions, and
executed the list, rather than ack it before, and then go on to muck
with the list.

Plus that old code really annoyed me anyway.

So I don't see why this should matter, but since I don't see how the
bug could happen in the first place, might as well try it..
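
In sketch form (the exact code is in the attached patch.diff, this is
just illustrative), the single-call IPI handler ends up doing something
like:

__visible void smp_call_function_single_interrupt(struct pt_regs *regs)
{
	irq_enter();
	/* Drain the queue and run the queued csd->func()s first... */
	generic_smp_call_function_single_interrupt();
	inc_irq_stat(irq_call_count);
	/* ...and only then ACK the APIC, instead of ACKing up front. */
	ack_APIC_irq();
	irq_exit();
}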

Linus


Attachments:
patch.diff (2.45 kB)

2015-02-23 20:50:51

by Peter Zijlstra

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Mon, Feb 23, 2015 at 11:32:50AM -0800, Linus Torvalds wrote:
> On Mon, Feb 23, 2015 at 6:01 AM, Rafael David Tinoco <[email protected]> wrote:
> >
> > This is v3.19 + your patch (smp acquire/release)
> > - (nested kvm with 2 vcpus on top of proliant with x2apic cluster mode
> > and acpi_idle)
>
> Hmm. There is absolutely nothing else going on on that machine, except
> for the single call to smp_call_function_single() that is waiting for
> the CSD to be released.
>
> > * It looks like we got locked because of reentrant flush_tlb_* through
> > smp_call_*
> > but I'll leave it to you.
>
> No, that is all a perfectly regular callchain:
>
> .. native_flush_tlb_others -> smp_call_function_many ->
> smp_call_function_single
>
> but the stack contains some stale addresses (one is probably just from
> smp_call_function_single() calling into "generic_exec_single()", and
> thus the stack contains the return address inside
> smp_call_function_single() in _addition_ to the actual place where the
> watchdog timer then interrupted it).
>
> It all really looks very regular and sane, and looks like
> smp_call_function_single() is happily just waiting for the IPI to
> finish in the (inlined) csd_lock_wait().
>
> I see nothing wrong at all.

[11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011

But it's a virtual machine, right? It's not running bare metal, it's running
a !virt kernel on a virt machine, so maybe some of the virt muck is
borked?

A very subtly broken APIC emulation would be heaps of 'fun'.

2015-02-23 21:02:33

by Rafael David Tinoco

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

>
> [11396.096002] Hardware name: OpenStack Foundation OpenStack Nova, BIOS Bochs 01/01/2011
>
> But it's a virtual machine, right? It's not running bare metal, it's running
> a !virt kernel on a virt machine, so maybe some of the virt muck is
> borked?
>
> A very subtly broken APIC emulation would be heaps of 'fun'.

Indeed, I'll dig hypervisor changes... Anyway, this initial stack
trace was first seen during the "frequent lockups in 3.18rc4"
discussion:

https://lkml.org/lkml/2014/11/14/656

RIP: 0010:[<ffffffff9c11e98a>] [<ffffffff9c11e98a>]
generic_exec_single+0xea/0x1d0

At that time seen by Dave Jones, and running bare metal (if i remember
correctly).

--
Rafael Tinoco

2015-04-01 12:43:43

by Ingo Molnar

[permalink] [raw]
Subject: Re: smp_call_function_single lockups


* Chris J Arges <[email protected]> wrote:

> Linus,
>
> I had a few runs with your patch plus modifications, and got the following
> results (modified patch inlined below):
>
> [ 14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> [ 176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
>
> [ 17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> [ 182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> [ 202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> [ 484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
>
> [ 15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> [ 304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> [ 315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> [ 360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
>
> [ 15.026077] ack_APIC_irq: vector = b1, irq = ffffffff
> [ 374.828531] ack_APIC_irq: vector = c1, irq = ffffffff
> [ 402.965942] ack_APIC_irq: vector = d1, irq = ffffffff
> [ 434.540814] ack_APIC_irq: vector = e1, irq = ffffffff
> [ 461.820768] ack_APIC_irq: vector = 22, irq = ffffffff
> [ 536.120027] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [qemu-system-x86:4243]
>
> [ 17.889334] ack_APIC_irq: vector = d1, irq = ffffffff
> [ 291.888784] ack_APIC_irq: vector = e1, irq = ffffffff
> [ 297.824627] ack_APIC_irq: vector = 22, irq = ffffffff
> [ 336.960594] ack_APIC_irq: vector = 42, irq = ffffffff
> [ 367.012706] ack_APIC_irq: vector = 52, irq = ffffffff
> [ 377.025090] ack_APIC_irq: vector = 62, irq = ffffffff
> [ 417.088773] ack_APIC_irq: vector = 72, irq = ffffffff
> [ 447.136788] ack_APIC_irq: vector = 82, irq = ffffffff
> -- stopped it since it wasn't reproducing / I was impatient --
>
> So I'm seeing irq == VECTOR_UNDEFINED in all of these cases. Making
> (vector >= 0) didn't seem to expose any additional vectors.

So, these vectors do seem to be lining up with the pattern of how new
irq vectors get assigned and how we slowly rotate through all
available ones.

The VECTOR_UNDEFINED might correspond to the fact that we already
'freed' that vector, as part of the irq-move mechanism - but it was
likely in use shortly before.

So the irq-move code is not off the hook, to the contrary.

Have you already tested whether the hang goes away if you remove
irq-affinity fiddling daemons from the system? Do you have irqbalance
installed or similar mechanisms?

Thanks,

Ingo

2015-04-01 14:22:44

by Frederic Weisbecker

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we might well be better off doing something like
> >
> > WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
> Linus

> kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
> 1 file changed, 47 insertions(+), 31 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..2aaac2c47683 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -19,7 +19,7 @@
>
> enum {
> CSD_FLAG_LOCK = 0x01,
> - CSD_FLAG_WAIT = 0x02,
> + CSD_FLAG_SYNCHRONOUS = 0x02,
> };
>
> struct call_function_data {
> @@ -107,7 +107,7 @@ void __init call_function_init(void)
> */
> static void csd_lock_wait(struct call_single_data *csd)
> {
> - while (csd->flags & CSD_FLAG_LOCK)
> + while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
> cpu_relax();
> }
>
> @@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
> * to ->flags with any subsequent assignments to other
> * fields of the specified call_single_data structure:
> */
> - smp_mb();
> + smp_wmb();
> }
>
> static void csd_unlock(struct call_single_data *csd)
> {
> - WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
> + WARN_ON(!(csd->flags & CSD_FLAG_LOCK));
>
> /*
> * ensure we're all done before releasing data:
> */
> - smp_mb();
> -
> - csd->flags &= ~CSD_FLAG_LOCK;
> + smp_store_release(&csd->flags, 0);
> }
>
> static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> @@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
> * ->func, ->info, and ->flags set.
> */
> static int generic_exec_single(int cpu, struct call_single_data *csd,
> - smp_call_func_t func, void *info, int wait)
> + smp_call_func_t func, void *info)
> {
> - struct call_single_data csd_stack = { .flags = 0 };
> - unsigned long flags;
> -
> -
> if (cpu == smp_processor_id()) {
> + unsigned long flags;
> +
> + /*
> + * We can unlock early even for the synchronous on-stack case,
> + * since we're doing this from the same CPU..
> + */
> + csd_unlock(csd);
> local_irq_save(flags);
> func(info);
> local_irq_restore(flags);
> @@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
> if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
> return -ENXIO;
>
> -
> - if (!csd) {
> - csd = &csd_stack;
> - if (!wait)
> - csd = this_cpu_ptr(&csd_data);
> - }
> -
> - csd_lock(csd);
> -
> csd->func = func;
> csd->info = info;
>
> - if (wait)
> - csd->flags |= CSD_FLAG_WAIT;
> -
> /*
> * The list addition should be visible before sending the IPI
> * handler locks the list to pull the entry off it because of
> @@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
> if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
> arch_send_call_function_single_ipi(cpu);
>
> - if (wait)
> - csd_lock_wait(csd);
> -
> return 0;
> }
>
> @@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
> }
>
> llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> - csd->func(csd->info);
> - csd_unlock(csd);
> + smp_call_func_t func = csd->func;
> + void *info = csd->info;
> +
> + /* Do we wait until *after* callback? */
> + if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
> + func(info);
> + csd_unlock(csd);
> + } else {
> + csd_unlock(csd);
> + func(info);

Just to clarify things, the kind of lockup this is expected to fix is the case
where the IPI is resent from the IPI itself, right?

Thanks.

2015-04-01 14:33:10

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Tue, Mar 31, 2015 at 04:07:32PM -0700, Linus Torvalds wrote:
> On Tue, Mar 31, 2015 at 3:23 PM, Chris J Arges
> <[email protected]> wrote:
> >
> > I had a few runs with your patch plus modifications, and got the following
> > results (modified patch inlined below):
>
> Ok, thanks.
>
> > [ 14.423916] ack_APIC_irq: vector = d1, irq = ffffffff
> > [ 176.060005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [qemu-system-x86:1630]
> >
> > [ 17.995298] ack_APIC_irq: vector = d1, irq = ffffffff
> > [ 182.993828] ack_APIC_irq: vector = e1, irq = ffffffff
> > [ 202.919691] ack_APIC_irq: vector = 22, irq = ffffffff
> > [ 484.132006] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [qemu-system-x86:1586]
> >
> > [ 15.592032] ack_APIC_irq: vector = d1, irq = ffffffff
> > [ 304.993490] ack_APIC_irq: vector = e1, irq = ffffffff
> > [ 315.174755] ack_APIC_irq: vector = 22, irq = ffffffff
> > [ 360.108007] NMI watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [ksmd:26]
> .. snip snip ..
>
> So yeah, that's VECTOR_UNDEFINED, and while it could happen as part of
> irq setup, I'm not seeing that being something that your load should
> trigger.
>
> It could also obviously just be the vector being somehow corrupted,
> either due to crazy hardware or software.
>
> But quite frankly, the most likely reason is that whole irq vector movement.
>
> Especially since it sounds from your other email that when you apply
> Ingo's patches, the ack_APIC_irq warnings go away. Is that correct? Or
> did you just grep for "move" in the messages?
>
> If you do get both movement messages (from Ingo's patch) _and_ the
> ack_APIC_irq warnings (from mine), it would be interesting to see if
> the vectors line up somehow..
>
> Linus
>

Linus,

I included the full patch in reply to Ingo's email, and when running with that
I no longer get the ack_APIC_irq WARNs.

My next homework assignments are:
- Testing with irqbalance disabled
- Testing w/ the appropriate dump_stack() in Ingo's patch
- L0 testing

Thanks,
--chris

2015-04-01 15:36:40

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
<[email protected]> wrote:
>
> I included the full patch in reply to Ingo's email, and when running with that
> I no longer get the ack_APIC_irq WARNs.

Ok. That means that the printk's themselves just change timing enough,
or change the compiler instruction scheduling so that it hides the
apic problem.

Which very much indicates that these things are interconnected.

For example, Ingo's printk patch does

cfg->move_in_progress =
cpumask_intersects(cfg->old_domain, cpu_online_mask);
+ pr_info("apic: vector %02x,
same-domain move in progress\n", cfg->vector);
cpumask_and(cfg->domain, cfg->domain, tmp_mask);

and that means that now the setting of move_in_progress is serialized
with the cpumask_and() in a way that it wasn't before.

And while the code takes the "vector_lock" and disables interrupts,
the interrupts themselves can happily continue on other cpu's, and
they don't take the vector_lock. Neither does send_cleanup_vector(),
which clears that bit, afaik.

I don't know. The locking there is odd.

> My next homework assignments are:
> - Testing with irqbalance disabled

Definitely.

> - Testing w/ the appropriate dump_stack() in Ingo's patch
> - L0 testing

Thanks,

Linus

2015-04-01 16:11:27

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Wed, Apr 01, 2015 at 02:43:36PM +0200, Ingo Molnar wrote:

<snip>
> Have you already tested whether the hang goes away if you remove
> irq-affinity fiddling daemons from the system? Do you have irqbalance
> installed or similar mechanisms?
>
> Thanks,
>
> Ingo
>

Even with irqbalance removed from the L0/L1 machines the hang still occurs.
Using the debug patch as described here:
https://lkml.org/lkml/2015/4/1/338

This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

--chris

2015-04-01 16:14:04

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
<[email protected]> wrote:
>
> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>
> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.

Ok. So the ack_APIC debug patch found *something*, but it seems to be
unrelated to the hang.

Dang. Oh well. Back to square one.

Linus

2015-04-01 21:59:44

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups



On 04/01/2015 11:14 AM, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 9:10 AM, Chris J Arges
> <[email protected]> wrote:
>>
>> Even with irqbalance removed from the L0/L1 machines the hang still occurs.
>>
>> This results in no 'apic: vector*' or 'ack_APIC*' messages being displayed.
>
> Ok. So the ack_APIC debug patch found *something*, but it seems to be
> unrelated to the hang.
>
> Dang. Oh well. Back to square one.
>
> Linus
>

With my L0 testing I've normally used a 3.13 series kernel since it
tends to reproduce the hang very quickly with the testcase. Note we have
reproduced an identical hang with newer kernels (3.19+patches) using the
openstack tempest on openstack reproducer, but the timing can vary
between hours and days.

Installing a v4.0-rc6+patch kernel on L0 makes the problem very slow to
reproduce, so I am running these tests now which may take day(s).

Is it worthwhile to do a 'bisect' to see where on average it takes
longer to reproduce? Perhaps it will point to a relevant change, or it
may be completely useless.

--chris

2015-04-02 09:55:40

by Ingo Molnar

[permalink] [raw]
Subject: Re: smp_call_function_single lockups


* Linus Torvalds <[email protected]> wrote:

> On Wed, Apr 1, 2015 at 7:32 AM, Chris J Arges
> <[email protected]> wrote:
> >
> > I included the full patch in reply to Ingo's email, and when
> > running with that I no longer get the ack_APIC_irq WARNs.
>
> Ok. That means that the printk's themselves just change timing
> enough, or change the compiler instruction scheduling so that it
> hides the apic problem.

So another possibility would be that it's the third change causing
this change in behavior:

diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c
index 6cedd7914581..833a981c5420 100644
--- a/arch/x86/kernel/apic/vector.c
+++ b/arch/x86/kernel/apic/vector.c
@@ -335,9 +340,11 @@ int apic_retrigger_irq(struct irq_data *data)

void apic_ack_edge(struct irq_data *data)
{
+ ack_APIC_irq();
+
+ /* Might generate IPIs, so do this after having ACKed the APIC: */
irq_complete_move(irqd_cfg(data));
irq_move_irq(data);
- ack_APIC_irq();
}

/*

... since with this we won't send IPIs in a semi-nested fashion with
an unacked APIC, which is a good idea to do in general. It's also a
weird enough hardware pattern that virtualization's APIC emulation
might get it slightly wrong or slightly different.

> Which very much indicates that these things are interconnected.
>
> For example, Ingo's printk patch does
>
> cfg->move_in_progress =
> cpumask_intersects(cfg->old_domain, cpu_online_mask);
> + if (cfg->move_in_progress)
> + pr_info("apic: vector %02x,
> same-domain move in progress\n", cfg->vector);
> cpumask_and(cfg->domain, cfg->domain, tmp_mask);
>
> and that means that now the setting of move_in_progress is
> serialized with the cpumask_and() in a way that it wasn't before.

Yeah, that's a possibility too. It all looks very fragile.

Thanks,

Ingo

2015-04-02 17:31:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
<[email protected]> wrote:
>
> Is it worthwhile to do a 'bisect' to see where on average it takes
> longer to reproduce? Perhaps it will point to a relevant change, or it
> may be completely useless.

It's likely to be an exercise in futility. "git bisect" is really bad
at "gray area" things, and when it's a question of "it takes hours or
days to reproduce", it's almost certainly not worth it. Not unless
there is some really clear cut-off that we can believably say "this
causes it to get much slower". And in this case, I don't think it's
that clear-cut. Judging by DaveJ's attempts at bisecting things, the
timing just changes. And the differences might be due to entirely
unrelated changes like cacheline alignment etc.

So unless we find a real clear signature of the bug (I was hoping that
the ISR bit would be that sign), I don't think trying to bisect it
based on how quickly you can reproduce things is worthwhile.

Linus

2015-04-02 17:35:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Thu, Apr 2, 2015 at 2:55 AM, Ingo Molnar <[email protected]> wrote:
>
> So another possibility would be that it's the third change causing
> this change in behavior:

Oh, yes, that looks much more likely. I overlooked that small change entirely.

> ... since with this we won't send IPIs in a semi-nested fashion with
> an unacked APIC, which is a good idea to do in general. It's also a
> weird enough hardware pattern that virtualization's APIC emulation
> might get it slightly wrong or slightly different.

Yup, that's more likely than the subtle timing/reordering differences
of just the added debug printouts. I agree.

Linus

2015-04-02 18:26:14

by Ingo Molnar

[permalink] [raw]
Subject: Re: smp_call_function_single lockups


* Linus Torvalds <[email protected]> wrote:

> So unless we find a real clear signature of the bug (I was hoping
> that the ISR bit would be that sign), I don't think trying to bisect
> it based on how quickly you can reproduce things is worthwhile.

So I'm wondering (and I might have missed some earlier report that
outlines just that), now that the possible location of the bug is
again sadly up to 15+ million lines of code, I have no better idea
than to debug by symptoms again: what kind of effort was made to
examine the locked up state itself?

Softlockups always have some direct cause, which task exactly causes
scheduling to stop altogether, why does it lock up - or is it not a
clear lockup, just a very slow system?

Thanks,

Ingo

2015-04-02 18:51:34

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On 04/02/2015 01:26 PM, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
>> So unless we find a real clear signature of the bug (I was hoping
>> that the ISR bit would be that sign), I don't think trying to bisect
>> it based on how quickly you can reproduce things is worthwhile.
>
> So I'm wondering (and I might have missed some earlier report that
> outlines just that), now that the possible location of the bug is
> again sadly up to 15+ million lines of code, I have no better idea
> than to debug by symptoms again: what kind of effort was made to
> examine the locked up state itself?
>

Ingo,

Rafael did some analysis when I was out earlier here:
https://lkml.org/lkml/2015/2/23/234

My reproducer setup is as follows:
L0 - 8-way CPU, 48 GB memory
L1 - 2-way vCPU, 4 GB memory
L2 - 1-way vCPU, 1 GB memory

Stress is only run in the L2 VM, and running top on L0/L1 doesn't show
excessive load.

> Softlockups always have some direct cause, which task exactly causes
> scheduling to stop altogether, why does it lock up - or is it not a
> clear lockup, just a very slow system?
>
> Thanks,
>
> Ingo
>

Whenever we look through the crashdump we see csd_lock_wait waiting for
CSD_FLAG_LOCK bit to be cleared. Usually the signature leading up to
that looks like the following (in the openstack tempest on openstack and
nested VM stress case)

(qemu-system-x86 task)
kvm_sched_in
-> kvm_arch_vcpu_load
-> vmx_vcpu_load
-> loaded_vmcs_clear
-> smp_call_function_single

(ksmd task)
pmdp_clear_flush
-> flush_tlb_mm_range
-> native_flush_tlb_others
-> smp_call_function_many

--chris

2015-04-02 19:07:34

by Ingo Molnar

[permalink] [raw]
Subject: Re: smp_call_function_single lockups


* Chris J Arges <[email protected]> wrote:

> Whenever we look through the crashdump we see csd_lock_wait waiting
> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading
> up to that looks like the following (in the openstack tempest on
> openstack and nested VM stress case)
>
> (qemu-system-x86 task)
> kvm_sched_in
> -> kvm_arch_vcpu_load
> -> vmx_vcpu_load
> -> loaded_vmcs_clear
> -> smp_call_function_single
>
> (ksmd task)
> pmdp_clear_flush
> -> flush_tlb_mm_range
> -> native_flush_tlb_others
> -> smp_call_function_many

So is this two separate smp_call_function instances, crossing each
other, and none makes any progress, indefinitely - as if the two IPIs
got lost?

The traces Rafael linked to show a simpler scenario with two CPUs
apparently locked up, doing this:

CPU0:

#5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
#6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
#7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
#8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
#9 [ffffffff81c03f30] rest_init at ffffffff817a6067
#10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
#11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
#12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c

This CPU is idle.

CPU1:

#10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
#11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
#12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
#13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
#14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
#15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
#16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
#17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
#18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
#19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad

This CPU is busy-waiting for the TLB flush IPI to finish.

There's no unexpected pattern here (other than it not finishing)
AFAICS, the smp_call_function_single() is just the usual way we invoke
the TLB flushing methods AFAICS.

So one possibility would be that an 'IPI was sent but lost'.

We could try the following trick: poll for completion for a couple of
seconds (since an IPI is not held up by anything but irqs-off
sections, it should arrive within microseconds typically - seconds of
polling should be more than enough), and if the IPI does not arrive,
print a warning message and re-send the IPI.

If the IPI was lost due to some race and there's no other failure mode
that we don't understand, then this would work around the bug and
would make the tests pass indefinitely - with occasional hiccups and a
handful of messages produced along the way whenever it would have
locked up with a previous kernel.

If testing indeed confirms that kind of behavior we could drill down
more closely to figure out why the IPI did not get to its destination.

Or if the behavior is different, we'd have some new behavior to look
at. (for example the IPI sending mechanism might be wedged
indefinitely for some reason, so that even a resend won't work.)

Agreed?

Thanks,

Ingo

2015-04-02 20:57:35

by Linus Torvalds

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Thu, Apr 2, 2015 at 12:07 PM, Ingo Molnar <[email protected]> wrote:
>
> So one possibility would be that an 'IPI was sent but lost'.

Yes, the "sent but lost" thing would certainly explain the lockups.

At the same time, that sounds like a huge hardware bug, and that's
somewhat surprising/unlikely.

That said.

> We could try the following trick: poll for completion for a couple of
> seconds (since an IPI is not held up by anything but irqs-off
> sections, it should arrive within microseconds typically - seconds of
> polling should be more than enough), and if the IPI does not arrive,
> print a warning message and re-send the IPI.

Sounds like a reasonable approach. At worst it doesn't fix anything,
and we never see any messages, and that tells us something too.

Linus

2015-04-02 21:13:12

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups



On 04/02/2015 02:07 PM, Ingo Molnar wrote:
>
> * Chris J Arges <[email protected]> wrote:
>
>> Whenever we look through the crashdump we see csd_lock_wait waiting
>> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading
>> up to that looks like the following (in the openstack tempest on
>> openstack and nested VM stress case)
>>
>> (qemu-system-x86 task)
>> kvm_sched_in
>> -> kvm_arch_vcpu_load
>> -> vmx_vcpu_load
>> -> loaded_vmcs_clear
>> -> smp_call_function_single
>>
>> (ksmd task)
>> pmdp_clear_flush
>> -> flush_tlb_mm_range
>> -> native_flush_tlb_others
>> -> smp_call_function_many
>
> So is this two separate smp_call_function instances, crossing each
> other, and none makes any progress, indefinitely - as if the two IPIs
> got lost?
>

These are two different crash signatures. Sorry for the confusion.

> The traces Rafael linked to show a simpler scenario with two CPUs
> apparently locked up, doing this:
>
> CPU0:
>
> #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
> #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
> #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
> #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
> #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
> #10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
> #11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
> #12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c
>
> This CPU is idle.
>
> CPU1:
>
> #10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
> #11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
> #12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
> #13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
> #14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
> #15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
> #16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
> #17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
> #18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
> #19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad
>
> This CPU is busy-waiting for the TLB flush IPI to finish.
>
> There's no unexpected pattern here (other than it not finishing)
> AFAICS, the smp_call_function_single() is just the usual way we invoke
> the TLB flushing methods AFAICS.
>
> So one possibility would be that an 'IPI was sent but lost'.
>
> We could try the following trick: poll for completion for a couple of
> seconds (since an IPI is not held up by anything but irqs-off
> sections, it should arrive within microseconds typically - seconds of
> polling should be more than enough), and if the IPI does not arrive,
> print a warning message and re-send the IPI.
>
> If the IPI was lost due to some race and there's no other failure mode
> that we don't understand, then this would work around the bug and
> would make the tests pass indefinitely - with occasional hiccups and a
> handful of messages produced along the way whenever it would have
> locked up with a previous kernel.
>
> If testing indeed confirms that kind of behavior we could drill down
> more closely to figure out why the IPI did not get to its destination.
>
> Or if the behavior is different, we'd have some new behavior to look
> at. (for example the IPI sending mechanism might be wedged
> indefinitely for some reason, so that even a resend won't work.)
>
> Agreed?
>
> Thanks,
>
> Ingo
>

Ingo,

I think tracking IPI calls from 'generic_exec_single' would make a lot
of sense. When you say poll for completion do you mean a loop after
'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
would be to not alter the timings too much so we can still reproduce the
original problem.

Another approach:
If we want to check for non-ACKed IPIs a possibility would be to add a
timestamp field to 'struct call_single_data' and just record jiffies
when the IPI gets called. Then have a per-cpu kthread check the
'call_single_queue' percpu list periodically if (jiffies - timestamp) >
THRESHOLD. When we reach that condition print the stale entry in
call_single_queue, backtrace, then re-send the IPI.
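
As a rough sketch of that idea (assuming a hypothetical 'ts' jiffies
field added to struct call_single_data, recorded when the IPI is
queued; simplified, with no locking or queue draining), the checker
could look something like:

#define CSD_STALE_THRESHOLD	(10 * HZ)

static void csd_check_stale(int cpu)
{
	struct llist_head *head = &per_cpu(call_single_queue, cpu);
	struct call_single_data *csd;

	/* Walk the pending queue and flag/re-send anything stale: */
	llist_for_each_entry(csd, head->first, llist) {
		if (time_after(jiffies, csd->ts + CSD_STALE_THRESHOLD)) {
			pr_warn("csd: stale csd %p (func %pf) on CPU#%d, re-sending IPI\n",
				csd, csd->func, cpu);
			dump_stack();
			arch_send_call_function_single_ipi(cpu);
		}
	}
}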

Let me know what makes the most sense to hack on.

Thanks,
--chris

2015-04-03 05:43:27

by Ingo Molnar

[permalink] [raw]
Subject: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> Ingo,
>
> I think tracking IPI calls from 'generic_exec_single' would make a lot
> of sense. When you say poll for completion do you mean a loop after
> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
> would be to not alter the timings too much so we can still reproduce the
> original problem.
>
> Another approach:
> If we want to check for non-ACKed IPIs a possibility would be to add a
> timestamp field to 'struct call_single_data' and just record jiffies
> when the IPI gets called. Then have a per-cpu kthread check the
> 'call_single_queue' percpu list periodically if (jiffies - timestamp) >
> THRESHOLD. When we reach that condition print the stale entry in
> call_single_queue, backtrace, then re-send the IPI.
>
> Let me know what makes the most sense to hack on.

Well, the thing is, putting this polling into an async kernel thread
loses a lot of data context and right of reference that we might need
to re-send an IPI.

And if the context is not lost we might as well send it from the
original, still-looping context - which is a lot simpler as well.

( ... and on a deadlocked non-CONFIG_PREEMPT kernel the kernel thread
won't run at all, so it won't be able to detect deadlocks. )

So I'd really suggest instrumenting the already existing CSD polling,
which is already a slowpath, so it won't impact timing much.

I'd suggest the following, rather unintrusive approach:

- first just take a jiffies timestamp and generate a warning message
if more than 10 seconds elapsed after sending the IPI, without
having heard from it.

- then the IPI is resent. This means adding a bit of a control flow
to csd_lock_wait().

Something like the patch below, which implements both steps:

- It will detect and print CSD deadlocks both in the single- and
multi- function call APIs, and in the pre-IPI CSD lock wait case
as well.

- It will resend an IPI if possible

- It generates various messages in the deadlock case that should
give us some idea about how the deadlock played out and whether it
got resolved.

The timeout is set to 10 seconds, that should be plenty even in a
virtualization environment.

Only very lightly tested under a simple lkvm bootup: I verified that
the boilerplate message is displayed, and that it doesn't generate
false positive messages in light loads - but I haven't checked whether
the deadlock detection works at all.

Thanks,

Ingo
---

kernel/smp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..e0eec1ab3ef2 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -98,22 +98,63 @@ void __init call_function_init(void)
register_cpu_notifier(&hotplug_cfd_notifier);
}

+/* Locking timeout in ms: */
+#define CSD_LOCK_TIMEOUT (10*1000ULL)
+
+/* Print this ID in every printk line we output, to be able to easily sort them apart: */
+static int csd_bug_count;
+
/*
* csd_lock/csd_unlock used to serialize access to per-cpu csd resources
*
* For non-synchronous ipi calls the csd can still be in use by the
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
+ *
+ * ( The overhead of deadlock detection is not a big problem, this is a
+ * cpu_relax() loop that is actively wasting CPU cycles to poll for
+ * completion. )
*/
-static void csd_lock_wait(struct call_single_data *csd)
+static void csd_lock_wait(struct call_single_data *csd, int cpu)
{
- while (csd->flags & CSD_FLAG_LOCK)
+ int bug_id = 0;
+ u64 ts0, ts1, ts_delta;
+
+ ts0 = jiffies_to_msecs(jiffies);
+
+ if (unlikely(!csd_bug_count)) {
+ csd_bug_count++;
+ printk("csd: CSD deadlock debugging initiated!\n");
+ }
+
+ while (csd->flags & CSD_FLAG_LOCK) {
+ ts1 = jiffies_to_msecs(jiffies);
+
+ ts_delta = ts1-ts0;
+ if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) { /* Uh oh, it took too long. Why? */
+
+ bug_id = csd_bug_count;
+ csd_bug_count++;
+
+ ts0 = ts1; /* Re-start the timeout detection */
+
+ printk("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d\n",
+ bug_id, raw_smp_processor_id(), ts_delta/1000ULL, ts_delta % 1000ULL, cpu);
+ if (cpu >= 0) {
+ printk("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", bug_id, raw_smp_processor_id(), cpu);
+ arch_send_call_function_single_ipi(cpu);
+ }
+ dump_stack();
+ }
cpu_relax();
+ }
+ if (unlikely(bug_id))
+ printk("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock after all. Phew!\n", bug_id, raw_smp_processor_id(), cpu);
}

static void csd_lock(struct call_single_data *csd)
{
- csd_lock_wait(csd);
+ csd_lock_wait(csd, -1);
csd->flags |= CSD_FLAG_LOCK;

/*
@@ -191,7 +232,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
arch_send_call_function_single_ipi(cpu);

if (wait)
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);

return 0;
}
@@ -446,7 +487,7 @@ void smp_call_function_many(const struct cpumask *mask,
struct call_single_data *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);
}
}
}

2015-04-03 05:45:21

by Ingo Molnar

[permalink] [raw]
Subject: Re: smp_call_function_single lockups


* Chris J Arges <[email protected]> wrote:

>
>
> On 04/02/2015 02:07 PM, Ingo Molnar wrote:
> >
> > * Chris J Arges <[email protected]> wrote:
> >
> >> Whenever we look through the crashdump we see csd_lock_wait waiting
> >> for CSD_FLAG_LOCK bit to be cleared. Usually the signature leading
> >> up to that looks like the following (in the openstack tempest on
> >> openstack and nested VM stress case)
> >>
> >> (qemu-system-x86 task)
> >> kvm_sched_in
> >> -> kvm_arch_vcpu_load
> >> -> vmx_vcpu_load
> >> -> loaded_vmcs_clear
> >> -> smp_call_function_single
> >>
> >> (ksmd task)
> >> pmdp_clear_flush
> >> -> flush_tlb_mm_range
> >> -> native_flush_tlb_others
> >> -> smp_call_function_many
> >
> > So is this two separate smp_call_function instances, crossing each
> > other, and none makes any progress, indefinitely - as if the two IPIs
> > got lost?
> >
>
> These are two different crash signatures. Sorry for the confusion.

So just in case, both crash signatures ought to be detected by the
patch I just sent.

Thanks,

Ingo

2015-04-03 05:47:15

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Ingo Molnar <[email protected]> wrote:

> +static void csd_lock_wait(struct call_single_data *csd, int cpu)
> {
> - while (csd->flags & CSD_FLAG_LOCK)
> + int bug_id = 0;
> + u64 ts0, ts1, ts_delta;
> +
> + ts0 = jiffies_to_msecs(jiffies);

Note that while 'jiffies' is a global variable, it's read-mostly
accessed, so it is a pretty cheap way to acquire a timestamp in such a
simple polling context that has irqs enabled.

Thanks,

Ingo

2015-04-06 16:59:04

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On 04/03/2015 12:43 AM, Ingo Molnar wrote:
>
> * Chris J Arges <[email protected]> wrote:
>
>> Ingo,
>>
>> I think tracking IPI calls from 'generic_exec_single' would make a lot
>> of sense. When you say poll for completion do you mean a loop after
>> 'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
>> would be to not alter the timings too much so we can still reproduce the
>> original problem.
>>
>> Another approach:
>> If we want to check for non-ACKed IPIs a possibility would be to add a
>> timestamp field to 'struct call_single_data' and just record jiffies
>> when the IPI gets called. Then have a per-cpu kthread check the
>> 'call_single_queue' percpu list periodically if (jiffies - timestamp) >
>> THRESHOLD. When we reach that condition print the stale entry in
>> call_single_queue, backtrace, then re-send the IPI.
>>
>> Let me know what makes the most sense to hack on.
>
> Well, the thing is, putting this polling into an async kernel thread
> loses a lot of data context and right of reference that we might need
> to re-send an IPI.
>
> And if the context is not lost we might as well send it from the
> original, still-looping context - which is a lot simpler as well.
>
> ( ... and on a deadlocked non-CONFIG_PREEMPT kernel the kernel thread
> won't run at all, so it won't be able to detect deadlocks. )
>
> So I'd really suggest instrumenting the already existing CSD polling,
> which is already a slowpath, so it won't impact timing much.
>
> I'd suggest the following, rather unintrusive approach:
>
> - first just take a jiffies timestamp and generate a warning message
> if more than 10 seconds elapsed after sending the IPI, without
> having heard from it.
>
> - then the IPI is resent. This means adding a bit of a control flow
> to csd_lock_wait().
>
> Something like the patch below, which implements both steps:
>
> - It will detect and print CSD deadlocks both in the single- and
> multi- function call APIs, and in the pre-IPI CSD lock wait case
> as well.
>
> - It will resend an IPI if possible
>
> - It generates various messages in the deadlock case that should
> give us some idea about how the deadlock played out and whether it
> got resolved.
>
> The timeout is set to 10 seconds, that should be plenty even in a
> virtualization environment.
>
> Only very lightly tested under a simple lkvm bootup: I verified that
> the boilerplate message is displayed, and that it doesn't generate
> false positive messages in light loads - but I haven't checked whether
> the deadlock detection works at all.
>
> Thanks,
>
> Ingo
> ---
>
> kernel/smp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
> 1 file changed, 46 insertions(+), 5 deletions(-)
>
> diff --git a/kernel/smp.c b/kernel/smp.c
> index f38a1e692259..e0eec1ab3ef2 100644
> --- a/kernel/smp.c
> +++ b/kernel/smp.c
> @@ -98,22 +98,63 @@ void __init call_function_init(void)
> register_cpu_notifier(&hotplug_cfd_notifier);
> }
>
> +/* Locking timeout in ms: */
> +#define CSD_LOCK_TIMEOUT (10*1000ULL)
> +
> +/* Print this ID in every printk line we output, to be able to easily sort them apart: */
> +static int csd_bug_count;
> +
> /*
> * csd_lock/csd_unlock used to serialize access to per-cpu csd resources
> *
> * For non-synchronous ipi calls the csd can still be in use by the
> * previous function call. For multi-cpu calls its even more interesting
> * as we'll have to ensure no other cpu is observing our csd.
> + *
> + * ( The overhead of deadlock detection is not a big problem, this is a
> + * cpu_relax() loop that is actively wasting CPU cycles to poll for
> + * completion. )
> */
> -static void csd_lock_wait(struct call_single_data *csd)
> +static void csd_lock_wait(struct call_single_data *csd, int cpu)
> {
> - while (csd->flags & CSD_FLAG_LOCK)
> + int bug_id = 0;
> + u64 ts0, ts1, ts_delta;
> +
> + ts0 = jiffies_to_msecs(jiffies);
> +
> + if (unlikely(!csd_bug_count)) {
> + csd_bug_count++;
> + printk("csd: CSD deadlock debugging initiated!\n");
> + }
> +
> + while (csd->flags & CSD_FLAG_LOCK) {
> + ts1 = jiffies_to_msecs(jiffies);
> +
> + ts_delta = ts1-ts0;

Ingo,

I noticed that with this patch it never reached 'csd: Detected
non-responsive CSD lock...' because it seems that ts_delta never reached
CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
hang without reaching this statement. I made the ts0,ts1 values global
and put a counter into the while loop and found that the loop iterated
about 670 million times before the softlockup was detected. In addition
ts0 and ts1 both had the same values upon soft lockup, and thus would
never trip the CSD_LOCK_TIMEOUT check.

--chris


> + if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) { /* Uh oh, it took too long. Why? */
> +
> + bug_id = csd_bug_count;
> + csd_bug_count++;
> +
> + ts0 = ts1; /* Re-start the timeout detection */
> +
> + printk("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d\n",
> + bug_id, raw_smp_processor_id(), ts_delta/1000ULL, ts_delta % 1000ULL, cpu);
> + if (cpu >= 0) {
> + printk("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", bug_id, raw_smp_processor_id(), cpu);
> + arch_send_call_function_single_ipi(cpu);
> + }
> + dump_stack();
> + }
> cpu_relax();
> + }
> + if (unlikely(bug_id))
> + printk("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock after all. Phew!\n", bug_id, raw_smp_processor_id(), cpu);
> }
>
> static void csd_lock(struct call_single_data *csd)
> {
> - csd_lock_wait(csd);
> + csd_lock_wait(csd, -1);
> csd->flags |= CSD_FLAG_LOCK;
>
> /*
> @@ -191,7 +232,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
> arch_send_call_function_single_ipi(cpu);
>
> if (wait)
> - csd_lock_wait(csd);
> + csd_lock_wait(csd, cpu);
>
> return 0;
> }
> @@ -446,7 +487,7 @@ void smp_call_function_many(const struct cpumask *mask,
> struct call_single_data *csd;
>
> csd = per_cpu_ptr(cfd->csd, cpu);
> - csd_lock_wait(csd);
> + csd_lock_wait(csd, cpu);
> }
> }
> }
>

2015-04-06 17:24:09

by Chris J Arges

[permalink] [raw]
Subject: Re: smp_call_function_single lockups

On Thu, Apr 02, 2015 at 10:31:50AM -0700, Linus Torvalds wrote:
> On Wed, Apr 1, 2015 at 2:59 PM, Chris J Arges
> <[email protected]> wrote:
> >
> > Is it worthwhile to do a 'bisect' to see where on average it takes
> > longer to reproduce? Perhaps it will point to a relevant change, or it
> > may be completely useless.
>
> It's likely to be an exercise in futility. "git bisect" is really bad
> at "gray area" things, and when it's a question of "it takes hours or
> days to reproduce", it's almost certainly not worth it. Not unless
> there is some really clear cut-off that we can believably say "this
> causes it to get much slower". And in this case, I don't think it's
> that clear-cut. Judging by DaveJ's attempts at bisecting things, the
> timing just changes. And the differences might be due to entirely
> unrelated changes like cacheline alignment etc.
>
> So unless we find a real clear signature of the bug (I was hoping that
> the ISR bit would be that sign), I don't think trying to bisect it
> based on how quickly you can reproduce things is worthwhile.
>
> Linus
>

Linus, Ingo,

I did some testing and found that at the following patch level, the issue was
much, much more likely to reproduce within < 15m.

commit b6b8a1451fc40412c57d10c94b62e22acab28f94
Author: Jan Kiszka <[email protected]>
Date: Fri Mar 7 20:03:12 2014 +0100

KVM: nVMX: Rework interception of IRQs and NMIs

Move the check for leaving L2 on pending and intercepted IRQs or NMIs
from the *_allowed handler into a dedicated callback. Invoke this
callback at the relevant points before KVM checks if IRQs/NMIs can be
injected. The callback has the task to switch from L2 to L1 if needed
and inject the proper vmexit events.

The rework fixes L2 wakeups from HLT and provides the foundation for
preemption timer emulation.

However, when the following patch was applied, the average time to reproduction
went up greatly (the stress reproducer ran for hours without issue):

commit 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3
Author: Bandan Das <[email protected]>
Date: Tue Jul 8 00:30:23 2014 -0400

KVM: x86: Check for nested events if there is an injectable interrupt

With commit b6b8a1451fc40412c57d1 that introduced
vmx_check_nested_events, checks for injectable interrupts happen
at different points in time for L1 and L2 that could potentially
cause a race. The regression occurs because KVM_REQ_EVENT is always
set when nested_run_pending is set even if there's no pending interrupt.
Consequently, there could be a small window when check_nested_events
returns without exiting to L1, but an interrupt comes through soon
after and it incorrectly, gets injected to L2 by inject_pending_event
Fix this by adding a call to check for nested events too when a check
for injectable interrupt returns true

However, we reproduced with v3.19 (containing these two patches), which did
eventually soft lockup with a similar backtrace.

So far, this agrees with the current understanding that we may not be ACK'ing
certain interrupts (IPIs from the L1 guest), causing csd_lock_wait to spin and
leading to the soft lockup.

Hopefully this helps shed more light on this issue.

Thanks,
--chris j arges

2015-04-06 17:32:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
<[email protected]> wrote:
>
> I noticed that with this patch it never reached 'csd: Detected
> non-responsive CSD lock...' because it seems that ts_delta never reached
> CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> hang without reaching this statement. I made the ts0,ts1 values global
> and put a counter into the while loop and found that the loop iterated
> about 670 million times before the softlockup was detected. In addition
> ts0 and ts1 both had the same values upon soft lockup, and thus would
> never trip the CSD_LOCK_TIMEOUT check.

Sounds like jiffies stops updating. Which doesn't sound unreasonable
for when there is some IPI problem.

How about just changing the debug patch to count iterations, and print
out a warning when it reaches ten million or so.

No, that's not the right thing to do in general, but for debugging
this particular issue it's a quick hack that might just work. It's not
like re-sending the IPI is "wrong" anyway.
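
Something like this, perhaps (totally untested sketch on top of Ingo's
debug patch, counting spins instead of trusting jiffies):

static void csd_lock_wait(struct call_single_data *csd, int cpu)
{
	u64 loops = 0;

	while (csd->flags & CSD_FLAG_LOCK) {
		/* Roughly "every ten million spins", since jiffies
		 * may have stopped advancing: */
		if (unlikely(++loops % (10*1000*1000ULL) == 0)) {
			printk("csd: CSD lock stuck on CPU#%02d, waiting for CPU#%02d, re-sending IPI\n",
			       raw_smp_processor_id(), cpu);
			if (cpu >= 0)
				arch_send_call_function_single_ipi(cpu);
		}
		cpu_relax();
	}
}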

Linus

2015-04-07 09:21:27

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Linus Torvalds <[email protected]> wrote:

> On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
> <[email protected]> wrote:
> >
> > I noticed that with this patch it never reached 'csd: Detected
> > non-responsive CSD lock...' because it seems that ts_delta never reached
> > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> > hang without reaching this statement. I made the ts0,ts1 values global
> > and put a counter into the while loop and found that the loop iterated
> > about 670 million times before the softlockup was detected. In addition
> > ts0 and ts1 both had the same values upon soft lockup, and thus would
> > never trip the CSD_LOCK_TIMEOUT check.
>
> Sounds like jiffies stops updating. Which doesn't sound unreasonable
> for when there is some IPI problem.

Yeah - although it weakens the 'IPI lost spuriously' hypothesis: we
ought to have irqs enabled here which normally doesn't stop jiffies
from updating, and the timer interrupt stopping suggests a much deeper
problem than just some lost IPI ...

>
> How about just changing the debug patch to count iterations, and
> print out a warning when it reaches ten million or so.

Yeah, or replace jiffies_to_msecs() with:

sched_clock()/1000000

sched_clock() should be safe to call in these codepaths.

Like the attached patch. (Totally untested.)

Thanks,

Ingo

---

kernel/smp.c | 51 ++++++++++++++++++++++++++++++++++++++++++++++-----
1 file changed, 46 insertions(+), 5 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e692259..e0eec1ab3ef2 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -98,22 +98,63 @@ void __init call_function_init(void)
register_cpu_notifier(&hotplug_cfd_notifier);
}

+/* Locking timeout in ms: */
+#define CSD_LOCK_TIMEOUT (10*1000ULL)
+
+/* Print this ID in every printk line we output, to be able to easily sort them apart: */
+static int csd_bug_count;
+
/*
* csd_lock/csd_unlock used to serialize access to per-cpu csd resources
*
* For non-synchronous ipi calls the csd can still be in use by the
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
+ *
+ * ( The overhead of deadlock detection is not a big problem, this is a
+ * cpu_relax() loop that is actively wasting CPU cycles to poll for
+ * completion. )
*/
-static void csd_lock_wait(struct call_single_data *csd)
+static void csd_lock_wait(struct call_single_data *csd, int cpu)
{
- while (csd->flags & CSD_FLAG_LOCK)
+ int bug_id = 0;
+ u64 ts0, ts1, ts_delta;
+
+ ts0 = sched_clock()/1000000;
+
+ if (unlikely(!csd_bug_count)) {
+ csd_bug_count++;
+ printk("csd: CSD deadlock debugging initiated!\n");
+ }
+
+ while (csd->flags & CSD_FLAG_LOCK) {
+ ts1 = sched_clock()/1000000;
+
+ ts_delta = ts1-ts0;
+ if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) { /* Uh oh, it took too long. Why? */
+
+ bug_id = csd_bug_count;
+ csd_bug_count++;
+
+ ts0 = ts1; /* Re-start the timeout detection */
+
+ printk("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d\n",
+ bug_id, raw_smp_processor_id(), ts_delta/1000ULL, ts_delta % 1000ULL, cpu);
+ if (cpu >= 0) {
+ printk("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", bug_id, raw_smp_processor_id(), cpu);
+ arch_send_call_function_single_ipi(cpu);
+ }
+ dump_stack();
+ }
cpu_relax();
+ }
+ if (unlikely(bug_id))
+ printk("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock after all. Phew!\n", bug_id, raw_smp_processor_id(), cpu);
}

static void csd_lock(struct call_single_data *csd)
{
- csd_lock_wait(csd);
+ csd_lock_wait(csd, -1);
csd->flags |= CSD_FLAG_LOCK;

/*
@@ -191,7 +232,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
arch_send_call_function_single_ipi(cpu);

if (wait)
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);

return 0;
}
@@ -446,7 +487,7 @@ void smp_call_function_many(const struct cpumask *mask,
struct call_single_data *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);
}
}
}

2015-04-07 21:00:26

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Tue, Apr 07, 2015 at 11:21:21AM +0200, Ingo Molnar wrote:
>
> * Linus Torvalds <[email protected]> wrote:
>
> > On Mon, Apr 6, 2015 at 9:58 AM, Chris J Arges
> > <[email protected]> wrote:
> > >
> > > I noticed that with this patch it never reached 'csd: Detected
> > > non-responsive CSD lock...' because it seems that ts_delta never reached
> > > CSD_LOCK_TIMEOUT. I tried adjusting the TIMEOUT value and still got a
> > > hang without reaching this statement. I made the ts0,ts1 values global
> > > and put a counter into the while loop and found that the loop iterated
> > > about 670 million times before the softlockup was detected. In addition
> > > ts0 and ts1 both had the same values upon soft lockup, and thus would
> > > never trip the CSD_LOCK_TIMEOUT check.
> >
> > Sounds like jiffies stops updating. Which doesn't sound unreasonable
> > for when there is some IPI problem.
>
> Yeah - although it weakens the 'IPI lost spuriously' hypothesis: we
> ought to have irqs enabled here which normally doesn't stop jiffies
> from updating, and the timer interrupt stopping suggests a much deeper
> problem than just some lost IPI ...
>
> >
> > How about just changing the debug patch to count iterations, and
> > print out a warning when it reaches ten million or so.
>
> Yeah, or replace jiffies_to_ms() with:
>
> sched_clock()/1000000
>
> sched_clock() should be safe to call in these codepaths.
>
> Like the attached patch. (Totally untested.)
>

Ingo,

Looks like sched_clock() works in this case.

Adding the dump_stack() line caused various issues such as the VM oopsing on
boot or the softlockup never being detected properly (and thus not crashing).
So the below is running with your patch and 'dump_stack()' commented out.

Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
[ 22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[ 38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
[ 38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
[ 43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
[ 43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
[ 48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
[ 48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
[ 53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
[ 53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
[ 58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
[ 58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
[ 60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Still we never seem to release the lock, even when resending the IPI.

Looking at the call_single_queue I see the following (I crashed during the soft lockup):

crash> p call_single_queue
PER-CPU DATA TYPE:
struct llist_head call_single_queue;
PER-CPU ADDRESSES:
[0]: ffff88013fc16580
[1]: ffff88013fd16580
crash> list -s call_single_data ffff88013fc16580
ffff88013fc16580
struct call_single_data {
llist = {
next = 0x0
},
func = 0x0,
info = 0x0,
flags = 0
}
crash> list -s call_single_data ffff88013fd16580
ffff88013fd16580
struct call_single_data {
llist = {
next = 0xffff88013a517c08
},
func = 0x0,
info = 0x0,
flags = 0
}
ffff88013a517c08
struct call_single_data {
llist = {
next = 0x0
},
func = 0xffffffff81067f30 <flush_tlb_func>,
info = 0xffff88013a517d00,
flags = 3
}

This seems consistent with previous crash dumps.
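
For reference, flags = 3 on the queued entry decodes, against the pre-fix csd
flag bits (visible in the '-' lines of the kernel/smp.c fix quoted later in
this thread), as "still locked" plus "synchronous/waiting" - i.e. the callback
was queued for CPU#1 but never executed and never unlocked. A minimal decode
sketch, assuming those two flag values:

/* Sketch, not from the thread: decode csd->flags as of the pre-fix kernel/smp.c. */
#include <stdio.h>

enum {
	CSD_FLAG_LOCK = 0x01,	/* csd still owned; the waiter spins on this bit */
	CSD_FLAG_WAIT = 0x02,	/* caller is doing a synchronous (waiting) call */
};

int main(void)
{
	unsigned int flags = 3;	/* value seen on the queued csd in the dump above */

	printf("locked=%d waiting=%d\n",
	       !!(flags & CSD_FLAG_LOCK), !!(flags & CSD_FLAG_WAIT));
	return 0;
}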


As I mentioned here: https://lkml.org/lkml/2015/4/6/186
I'm able to reproduce this easily on certain hardware w/
b6b8a1451fc40412c57d10c94b62e22acab28f94 applied and not
9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 on the L0 kernel. I think it makes
sense to get as clear a picture as possible with this more trivial reproducer, then
re-run this on an L0 w/ v4.0-rcX. Most likely the latter case will take many days
to reproduce.

Thanks,
--chris

2015-04-07 21:15:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Tue, Apr 7, 2015 at 1:59 PM, Chris J Arges
<[email protected]> wrote:
>
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
> [ 22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> [ 38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
> [ 38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
> [ 43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
> [ 43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
> [ 48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
> [ 48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
> [ 53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
> [ 53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
> [ 58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
> [ 58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
> [ 60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]

Ok, so it's not the IPI just "getting lost". I'm not hugely surprised.

But it does look like CPU1 never actually reacts to the IPI, even when re-sent:

> Looking at the call_single_queue I see the following (I crashed during the soft lockup):
>
> crash> p call_single_queue
> PER-CPU DATA TYPE:
> struct llist_head call_single_queue;
> PER-CPU ADDRESSES:
> [0]: ffff88013fc16580
> [1]: ffff88013fd16580
> crash> list -s call_single_data ffff88013fc16580
> ffff88013fc16580
> struct call_single_data {
> llist = {
> next = 0x0
> },
> func = 0x0,
> info = 0x0,
> flags = 0
> }
> crash> list -s call_single_data ffff88013fd16580
> ffff88013fd16580
> struct call_single_data {
> llist = {
> next = 0xffff88013a517c08
> },
> func = 0x0,
> info = 0x0,
> flags = 0
> }
> ffff88013a517c08
> struct call_single_data {
> llist = {
> next = 0x0
> },
> func = 0xffffffff81067f30 <flush_tlb_func>,
> info = 0xffff88013a517d00,
> flags = 3
> }
>
> This seems consistent with previous crash dumps.

The above seems to show that CPU1 has never picked up the CSD. Which
is consistent with CPU0 waiting forever for it.

It really looks like CPU1 is simply not reacting to the IPI even when
resent. It's possibly masked at the APIC level, or it's somehow stuck.

> As I mentioned here: https://lkml.org/lkml/2015/4/6/186
> I'm able to reproduce this easily on certain hardware w/
> b6b8a1451fc40412c57d10c94b62e22acab28f94 applied and not
> 9242b5b60df8b13b469bc6b7be08ff6ebb551ad3 on the L0 kernel. I think it makes
> sense to get as clear a picture as possible with this more trivial reproducer, then
> re-run this on an L0 w/ v4.0-rcX. Most likely the latter case will take many days
> to reproduce.

Well, those commits imply that it's a kvm virtualization problem, but
at least DaveJ's problem was hit while running on raw hardware with no
virtualization.

Ho humm.

Linus

2015-04-08 06:47:41

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> Ingo,
>
> Looks like sched_clock() works in this case.
>
> Adding the dump_stack() line caused various issues such as the VM
> oopsing on boot or the softlockup never being detected properly (and
> thus not crashing). So the below is running with your patch and
> 'dump_stack()' commented out.
>
> Here is the log leading up to the soft lockup (I adjusted CSD_LOCK_TIMEOUT to 5s):
> [ 22.669630] kvm [1523]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
> [ 38.712710] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 5.000 secs for CPU#01
> [ 38.712715] csd: Re-sending CSD lock (#1) IPI from CPU#00 to CPU#01
> [ 43.712709] csd: Detected non-responsive CSD lock (#2) on CPU#00, waiting 5.000 secs for CPU#01
> [ 43.712713] csd: Re-sending CSD lock (#2) IPI from CPU#00 to CPU#01
> [ 48.712708] csd: Detected non-responsive CSD lock (#3) on CPU#00, waiting 5.000 secs for CPU#01
> [ 48.712732] csd: Re-sending CSD lock (#3) IPI from CPU#00 to CPU#01
> [ 53.712708] csd: Detected non-responsive CSD lock (#4) on CPU#00, waiting 5.000 secs for CPU#01
> [ 53.712712] csd: Re-sending CSD lock (#4) IPI from CPU#00 to CPU#01
> [ 58.712707] csd: Detected non-responsive CSD lock (#5) on CPU#00, waiting 5.000 secs for CPU#01
> [ 58.712712] csd: Re-sending CSD lock (#5) IPI from CPU#00 to CPU#01
> [ 60.080005] NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ksmd:26]
>
> Still we never seem to release the lock, even when resending the IPI.

So it would be really important to see the stack dump of CPU#0 at this
point, and also do an APIC state dump of it.

Because from previous dumps it appeared that the 'stuck' CPU was just
in its idle loop - which is 'impossible' as the idle loop should still
allow APIC irqs arriving.

This behavior can only happen if:

- CPU#0 has irqs disabled perpetually. A dump of CPU#0 should
tell us where it's executing. This has actually a fair
chance to be the case as it actually happened in a fair
number of bugs in the past, but I thought from past dumps
you guys provided that this possibility was excluded ... but
it merits re-examination with the debug patches applied.

- the APIC on CPU#0 is unacked and has queued up so many IPIs
that it starts rejecting them. I'm not sure that's even
possible on KVM, unless part of the hardware virtualizes the APIC. One
other thing that talks against this scenario is that NMIs
appear to be reaching through to CPU#0: the crash dumps and
dump-on-all-cpus NMI callbacks worked fine.

- the APIC on CPU#0 is in some weird state well outside of its
Linux programming model (TPR set wrong, etc. etc.). There's
literally a myriad of ways an APIC can be configured to not
receive IPIs: but I've never actually seen this happen under
Linux, as it needs complicated writes to specialized APIC
registers, and we don't actually reconfigure the APIC in any
serious fashion aside bootup. Low likelihood but not
impossible. Again, NMIs reaching through make this situation
less likely.

- CPU#0 having a bad IDT and essentially ignoring certain
IPIs. This presumes some serious but very targeted memory
corruption. Lowest likelihood.

- ... other failure modes that elude me. None of the
scenarios above strikes me as particularly plausible - but
something must be causing the lockup, so ...

In any case, something got seriously messed up on CPU#0, and stays
messed up during the lockup, and it would help a lot figuring out
exactly what, by further examining its state.

Note, it might also be useful to dump KVM's state of the APIC of
CPU#0, to see why _it_ isn't sending (and injecting) the lapic IRQ
into CPU#0. By all means it should. [Maybe take a look at CPU#1 as
well, to make sure the IPI was actually generated.]

It should be much easier to figure this out on the KVM side, which
emulates the lapic to a large degree so we can see the 'hardware state'
directly, than on the native hardware side. If we are lucky then the KVM
problem mirrors the native hardware problem.

Btw., it might also be helpful to try to turn off hardware assisted
APIC virtualization on the KVM side, to make the APIC purely software
emulated. If this makes the bug go away magically then this raises the
likelihood that the bug is really hardware APIC related.

I don't know what the magic incantation is to make 'pure software
APIC' happen on KVM and Qemu though.

Thanks,

Ingo

2015-04-13 03:57:06

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

<snip>
> So it would be really important to see the stack dump of CPU#0 at this
> point, and also do an APIC state dump of it.
>
> Because from previous dumps it appeared that the 'stuck' CPU was just
> in its idle loop - which is 'impossible' as the idle loop should still
> allow APIC irqs arriving.
>
> This behavior can only happen if:
>
> - CPU#0 has irqs disabled perpetually. A dump of CPU#0 should
> tell us where it's executing. This has actually a fair
> chance to be the case as it actually happened in a fair
> number of bugs in the past, but I thought from past dumps
> you guys provided that this possibility was excluded ... but
> it merits re-examination with the debug patches applied.
>
> - the APIC on CPU#0 is unacked and has queued up so many IPIs
> that it starts rejecting them. I'm not sure that's even
> possible on KVM, unless part of the hardware virtualizes the APIC. One
> other thing that talks against this scenario is that NMIs
> appear to be reaching through to CPU#0: the crash dumps and
> dump-on-all-cpus NMI callbacks worked fine.
>
> - the APIC on CPU#0 is in some weird state well outside of its
> Linux programming model (TPR set wrong, etc. etc.). There's
> literally a myriad of ways an APIC can be configured to not
> receive IPIs: but I've never actually seen this happen under
> Linux, as it needs complicated writes to specialized APIC
> registers, and we don't actually reconfigure the APIC in any
> serious fashion aside bootup. Low likelihood but not
> impossible. Again, NMIs reaching through make this situation
> less likely.
>
> - CPU#0 having a bad IDT and essentially ignoring certain
> IPIs. This presumes some serious but very targeted memory
> corruption. Lowest likelihood.
>
> - ... other failure modes that elude me. None of the
> scenarios above strikes me as particularly plausible - but
> something must be causing the lockup, so ...
>
> In any case, something got seriously messed up on CPU#0, and stays
> messed up during the lockup, and it would help a lot figuring out
> exactly what, by further examining its state.
>
> Note, it might also be useful to dump KVM's state of the APIC of
> CPU#0, to see why _it_ isn't sending (and injecting) the lapic IRQ
> into CPU#0. By all means it should. [Maybe take a look at CPU#1 as
> well, to make sure the IPI was actually generated.]
>
> It should be much easier to figure this out on the KVM side, which
> emulates the lapic to a large degree so we can see the 'hardware state'
> directly, than on the native hardware side. If we are lucky then the KVM
> problem mirrors the native hardware problem.
>
> Btw., it might also be helpful to try to turn off hardware assisted
> APIC virtualization on the KVM side, to make the APIC purely software
> emulated. If this makes the bug go away magically then this raises the
> likelihood that the bug is really hardware APIC related.
>
> I don't know what the magic incantation is to make 'pure software
> APIC' happen on KVM and Qemu though.
>
> Thanks,
>
> Ingo
>

Ingo,

/sys/module/kvm_intel/parameters/enable_apicv on the affected hardware is not
enabled, and unfortunately my hardware doesn't have the necessary features
to enable it. So we are dealing with KVM's lapic implementation only.

FYI, I'm working on getting better data at the moment and here is my approach:
* For the L0 kernel:
- In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
the addresses of various useful structures)
- Setup crash to live dump kvm_lapic structures and associated registers for
both vCPUs
* For the L1 kernel:
- Dump a stacktrace when we detect a lockup.
- Detect a lockup and try to not alter the state.
- Have a reliable signal such that the L0 hypervisor can dump the lapic
structures and registers when csd_lock_wait detects a softlockup.

Hopefully I can make progress and present meaningful results in my next update.

Thanks,
--chris j arges

2015-04-13 06:15:11

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> /sys/module/kvm_intel/parameters/enable_apicv on the affected
> hardware is not enabled, and unfortunately my hardware doesn't have
> the necessary features to enable it. So we are dealing with KVM's
> lapic implementation only.

That's actually pretty fortunate, as we don't have to worry about
hardware state nearly as much!

> FYI, I'm working on getting better data at the moment and here is my approach:
> * For the L0 kernel:
> - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
> the addresses of various useful structures)
> - Setup crash to live dump kvm_lapic structures and associated registers for
> both vCPUs

It would also be nice to double check the stuck vCPU's normal CPU
state: is it truly able to receive interrupts? (IRQ flags are on, or
is it sitting in the idle loop, etc.?)

If the IRQ flag (in EFLAGS) is off then the vCPU is not able to
receive interrupts, regardless of local APIC state.

> * For the L1 kernel:
> - Dump a stacktrace when we detect a lockup.
> - Detect a lockup and try to not alter the state.
> - Have a reliable signal such that the L0 hypervisor can dump the lapic
> structures and registers when csd_lock_wait detects a softlockup.

I'd also suggest adding a printk() to IPI receipt, to make sure it's
not the CSD code that is not getting called into after the IPI resend
attempt. To make sure you only get messages after the CPU got stuck,
add a 'locked_up' flag that signals this, and only print the messages
if the lockup scenario is happening.

I'd do it by adding something like this to
kernel/smp.c::generic_smp_call_function_single_interrupt():

if (csd_locked_up)
printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());

Having this message in place would ensure that the IPI indeed did not
get generated on the stuck vCPU. (Because we'd not get this message.)

Thanks,

Ingo

2015-04-15 19:55:44

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Mon, Apr 13, 2015 at 08:14:51AM +0200, Ingo Molnar wrote:
<snip>
> > FYI, I'm working on getting better data at the moment and here is my approach:
> > * For the L0 kernel:
> > - In arch/x86/kvm/lapic.c, I enabled 'apic_debug' to get more output (and print
> > the addresses of various useful structures)
> > - Setup crash to live dump kvm_lapic structures and associated registers for
> > both vCPUs
>
> It would also be nice to double check the stuck vCPU's normal CPU
> state: is it truly able to receive interrupts? (IRQ flags are on, or
> is it sitting in the idle loop, etc.?)
>
> If the IRQ flag (in EFLAGS) is off then the vCPU is not able to
> receive interrupts, regardless of local APIC state.
>
> > * For the L1 kernel:
> > - Dump a stacktrace when we detect a lockup.
> > - Detect a lockup and try to not alter the state.
> > - Have a reliable signal such that the L0 hypervisor can dump the lapic
> > structures and registers when csd_lock_wait detects a softlockup.
>
> I'd also suggest adding a printk() to IPI receipt, to make sure it's
> not the CSD code that is not getting called into after the IPI resend
> attempt. To make sure you only get messages after the CPU got stuck,
> add a 'locked_up' flag that signals this, and only print the messages
> if the lockup scenario is happening.
>
> I'd do it by adding something like this to
> kernel/smp.c::generic_smp_call_function_single_interrupt():
>
> if (csd_locked_up)
> printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());
>
> Having this message in place would ensure that the IPI indeed did not
> get generated on the stuck vCPU. (Because we'd not get this message.)
>
> Thanks,
>
> Ingo

Ingo,

Below are the patches and data I've gathered from the reproducer. My methodology
was as described previously; however I used gdb on the qemu process in order to
breakpoint L1 once we've detected the hang. This made dumping the kvm_lapic
structures on L0 more reliable.

Thanks,
--chris j arges

--

* L0 patch (against Ubuntu-3.13.x w/ b6b8a1451fc40412c57d1):

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f52e300..4d96638 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -54,8 +54,7 @@

#define APIC_BUS_CYCLE_NS 1

-/* #define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg) */
-#define apic_debug(fmt, arg...)
+#define apic_debug(fmt,arg...) printk(KERN_WARNING fmt,##arg)

#define APIC_LVT_NUM 6
/* 14 is the version for Xeon and Pentium 8.4.8*/
@@ -539,9 +538,6 @@ static void apic_update_ppr(struct kvm_lapic *apic)
else
ppr = isrv & 0xf0;

- apic_debug("vlapic %p, ppr 0x%x, isr 0x%x, isrv 0x%x",
- apic, ppr, isr, isrv);
-
if (old_ppr != ppr) {
apic_set_reg(apic, APIC_PROCPRI, ppr);
if (ppr < old_ppr)
@@ -865,10 +861,10 @@ static void apic_send_ipi(struct kvm_lapic *apic)

trace_kvm_apic_ipi(icr_low, irq.dest_id);

- apic_debug("icr_high 0x%x, icr_low 0x%x, "
+ apic_debug("apic_send_ipi-> kvm 0x%x vcpu 0x%x apic 0x%x icr_high 0x%x, icr_low 0x%x, "
"short_hand 0x%x, dest 0x%x, trig_mode 0x%x, level 0x%x, "
"dest_mode 0x%x, delivery_mode 0x%x, vector 0x%x\n",
- icr_high, icr_low, irq.shorthand, irq.dest_id,
+ apic->vcpu->kvm, apic->vcpu, apic, icr_high, icr_low, irq.shorthand, irq.dest_id,
irq.trig_mode, irq.level, irq.dest_mode, irq.delivery_mode,
irq.vector);

* L1 patch (against latest v4.0):

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e6..df75d3d 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -98,22 +98,67 @@ void __init call_function_init(void)
register_cpu_notifier(&hotplug_cfd_notifier);
}

+/* Locking timeout in ms: */
+#define CSD_LOCK_TIMEOUT (2*1000ULL)
+
+/* Print this ID in every printk line we output, to be able to easily sort them apart: */
+static int csd_bug_count;
+static int csd_locked_up = 0;
+
/*
* csd_lock/csd_unlock used to serialize access to per-cpu csd resources
*
* For non-synchronous ipi calls the csd can still be in use by the
* previous function call. For multi-cpu calls its even more interesting
* as we'll have to ensure no other cpu is observing our csd.
+ *
+ * ( The overhead of deadlock detection is not a big problem, this is a
+ * cpu_relax() loop that is actively wasting CPU cycles to poll for
+ * completion. )
*/
-static void csd_lock_wait(struct call_single_data *csd)
+static void csd_lock_wait(struct call_single_data *csd, int cpu)
{
- while (csd->flags & CSD_FLAG_LOCK)
+ int bug_id = 0;
+ u64 ts0, ts1, ts_delta;
+
+ ts0 = sched_clock()/1000000;
+
+ if (unlikely(!csd_bug_count)) {
+ csd_bug_count++;
+ printk("csd: CSD deadlock debugging initiated!\n");
+ }
+
+ while (csd->flags & CSD_FLAG_LOCK) {
+ ts1 = sched_clock()/1000000;
+
+ ts_delta = ts1-ts0;
+ if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) { /* Uh oh, it took too long. Why? */
+
+ bug_id = csd_bug_count;
+ csd_bug_count++;
+
+ ts0 = ts1; /* Re-start the timeout detection */
+
+ printk("csd: Detected non-responsive CSD lock (#%d) on CPU#%02d, waiting %Ld.%03Ld secs for CPU#%02d\n",
+ bug_id, raw_smp_processor_id(), ts_delta/1000ULL, ts_delta % 1000ULL, cpu);
+ if (cpu >= 0) {
+ /*printk("csd: Re-sending CSD lock (#%d) IPI from CPU#%02d to CPU#%02d\n", bug_id, raw_smp_processor_id(), cpu);
+ arch_send_call_function_single_ipi(cpu);
+ */
+ csd_locked_up = 1;
+ panic("csd_lock_wait");
+ }
+ //dump_stack();
+ }
cpu_relax();
+ }
+ if (unlikely(bug_id))
+ printk("csd: CSD lock (#%d) got unstuck on CPU#%02d, CPU#%02d released the lock after all. Phew!\n", bug_id, raw_smp_processor_id(), cpu);
}

static void csd_lock(struct call_single_data *csd)
{
- csd_lock_wait(csd);
+ csd_lock_wait(csd, -1);
csd->flags |= CSD_FLAG_LOCK;

/*
@@ -191,7 +236,7 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
arch_send_call_function_single_ipi(cpu);

if (wait)
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);

return 0;
}
@@ -204,6 +249,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
*/
void generic_smp_call_function_single_interrupt(void)
{
+ if (csd_locked_up)
+ printk("CSD: Function call IPI callback on CPU#%d\n", raw_smp_processor_id());
+
flush_smp_call_function_queue(true);
}

@@ -446,7 +494,7 @@ void smp_call_function_many(const struct cpumask *mask,
struct call_single_data *csd;

csd = per_cpu_ptr(cfd->csd, cpu);
- csd_lock_wait(csd);
+ csd_lock_wait(csd, cpu);
}
}
}

* L1 kernel log leading up to the crash (I set the timeout to 2s):

[ 177.324317] kvm [1671]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0xffff
[ 227.645275] csd: Detected non-responsive CSD lock (#1) on CPU#00, waiting 2.000 secs for CPU#01
[ 227.687193] Kernel panic - not syncing: csd_lock_wait

Note: the L0 log was mostly there just to let me know which addresses the
kvm_lapic structures had. I wasn't able to grab that log in this run.

* Crash dump backtrace from L1:

crash> bt -a
PID: 26 TASK: ffff88013a4f1400 CPU: 0 COMMAND: "ksmd"
#0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
#1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
#2 [ffff88013a503b20] panic at ffffffff818068e0
#3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
#4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
#5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
#6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
#7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
#8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
#9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
#10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
#11 [ffff88013a503ec0] kthread at ffffffff810df0fb
#12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98

PID: 1674 TASK: ffff8800ba4a9e00 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
#1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
#2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
#3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
#4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
[exception RIP: vmx_vcpu_run+992]
RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
RAX: 0000000080000202 RBX: ffff880139b30000 RCX: ffff880139b30000
RDX: 0000000000000200 RSI: ffff880139b30000 RDI: ffff880139b30000
RBP: ffff88013940bd28 R8: 00007fe192b71110 R9: 00007fe192b71140
R10: 00007fff66407d00 R11: 00007fe1927f0060 R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
--- <NMI exception stack> ---
#5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
#6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
#7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
#8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
#9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
#10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
#11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d
RIP: 00007ff6f80f5337 RSP: 00007ff6ed833bd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8180fd4d RCX: 00007ff6f80f5337
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 0000000000000010
RBP: 00007ff6ffa7c430 R8: 0000000000000000 R9: 00000000ffffffff
R10: 00000000000fed00 R11: 0000000000000246 R12: 00007ff6fe195240
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000000
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b

* Looking at call_single_queue, CPU[1] has <flush_tlb_func>:

crash> p call_single_queue
PER-CPU DATA TYPE:
struct llist_head call_single_queue;
PER-CPU ADDRESSES:
[0]: ffff88013fc16540
[1]: ffff88013fd16540
crash> list -s call_single_data 0xffff88013fc16540
ffff88013fc16540
struct call_single_data {
llist = {
next = 0x0
},
func = 0x0,
info = 0x0,
flags = 0
}
crash> list -s call_single_data 0xffff88013fd16540
ffff88013fd16540
struct call_single_data {
llist = {
next = 0xffff88013a503c08
},
func = 0x0,
info = 0x0,
flags = 0
}
ffff88013a503c08
struct call_single_data {
llist = {
next = 0x0
},
func = 0xffffffff810b37d0 <flush_tlb_func>,
info = 0xffff88013a503d00,
flags = 3
}

* Once I found the addresses of the kvm_lapic structures, I used a crash
script like the following:

mod -S /usr/lib/debug/lib/modules > /dev/null
p *(struct kvm_lapic *)${apic0}
rd ${regs0} 512 | grep -v '0000000000000000 0000000000000000'
p *(struct kvm_lapic *)${apic1}
rd ${regs1} 512 | grep -v '0000000000000000 0000000000000000'

* Normal dump of the kvm_lapic structure and registers for each of L1's vCPUs,
taken from L0 before the hang:

$1 = {
base_address = 4276092928,
dev = {
ops = 0xffffffffa01dc890 <apic_mmio_ops>
},
lapic_timer = {
timer = {
node = {
node = {
__rb_parent_color = 18446612183461255480,
rb_right = 0x0,
rb_left = 0x0
},
expires = {
tv64 = 170867881314366
}
},
_softexpires = {
tv64 = 170867881314366
},
function = 0xffffffffa01c5f80 <apic_timer_fn>,
base = 0xffff880bff20d1a0,
state = 1,
start_pid = 13304,
start_site = 0xffffffff8108edd8 <hrtimer_start+24>,
start_comm = "qemu-system-x86"
},
period = 199816048,
timer_mode_mask = 131072,
tscdeadline = 0,
pending = {
counter = 0
}
},
divide_count = 16,
vcpu = 0xffff880be7443f00,
irr_pending = false,
isr_count = 0,
highest_isr_cache = -1,
regs = 0xffff880be91e6000,
vapic_addr = 0,
vapic_cache = {
generation = 0,
gpa = 0,
hva = 0,
len = 0,
memslot = 0x0
},
pending_events = 0,
sipi_vector = 0
}
ffff880be91e6030: 0000000001050014 0000000000000000 ................
ffff880be91e60d0: 0000000000000001 0000000000000000 ................
ffff880be91e60e0: 00000000ffffffff 0000000000000000 ................
ffff880be91e60f0: 00000000000001ff 0000000000000000 ................
ffff880be91e6190: 000000000e200000 0000000000000000 .. .............
ffff880be91e6300: 00000000000000fd 0000000000000000 ................
ffff880be91e6310: 0000000000000001 0000000000000000 ................
ffff880be91e6320: 00000000000000ef 0000000000000000 ................
ffff880be91e6330: 0000000000010000 0000000000000000 ................
ffff880be91e6340: 0000000000010000 0000000000000000 ................
ffff880be91e6350: 0000000000010700 0000000000000000 ................
ffff880be91e6360: 0000000000000400 0000000000000000 ................
ffff880be91e6370: 00000000000000fe 0000000000000000 ................
ffff880be91e6380: 0000000000be8f37 0000000000000000 7...............
ffff880be91e63e0: 0000000000000003 0000000000000000 ................

020 Local APIC ID Register = 0x0000000000000000
030 Local APIC Version Register = 0x0000000001050014
0d0 Logical Destination Register = 0x0000000000000001
0e0 Destination Format Register = 0x00000000ffffffff
0f0 Spurious Interrupt Vector Reg = 0x00000000000001ff
190 Trigger Mode Reg (TMR); 63:32 = 0x000000000e200000
300 Interrupt Cmd Reg (ICR); 0-31 = 0x00000000000000fd
310 Interrupt Cmd Reg (ICR); 32-63 = 0x0000000000000001
320 LVT Timer Register = 0x00000000000000ef
330 LVT Thermal Sensor Register = 0x0000000000010000
340 LVT Perf Mon Counter Reg = 0x0000000000010000
350 LVT LINT0 Register = 0x0000000000010700
360 LVT LINT1 Register = 0x0000000000000400
370 LVT Error Register = 0x00000000000000fe
380 Initial Count Reg (for Timer) = 0x0000000000be8f37
3e0 Divide Cfg Reg (for Timer) = 0x0000000000000003

$2 = {
base_address = 4276092928,
dev = {
ops = 0xffffffffa01dc890 <apic_mmio_ops>
},
lapic_timer = {
timer = {
node = {
node = {
__rb_parent_color = 18446612183483367104,
rb_right = 0x0,
rb_left = 0x0
},
expires = {
tv64 = 170868457351683
}
},
_softexpires = {
tv64 = 170868457351683
},
function = 0xffffffffa01c5f80 <apic_timer_fn>,
base = 0xffff880bff26d1a0,
state = 1,
start_pid = 13305,
start_site = 0xffffffff8108edd8 <hrtimer_start+24>,
start_comm = "qemu-system-x86"
},
period = 955895056,
timer_mode_mask = 131072,
tscdeadline = 0,
pending = {
counter = 0
}
},
divide_count = 16,
vcpu = 0xffff880be7440000,
irr_pending = false,
isr_count = 0,
highest_isr_cache = -1,
regs = 0xffff880be78c1000,
vapic_addr = 0,
vapic_cache = {
generation = 0,
gpa = 0,
hva = 0,
len = 0,
memslot = 0x0
},
pending_events = 0,
sipi_vector = 154
}
ffff880be78c1020: 0000000001000000 0000000000000000 ................
ffff880be78c1030: 0000000001050014 0000000000000000 ................
ffff880be78c10d0: 0000000000000002 0000000000000000 ................
ffff880be78c10e0: 00000000ffffffff 0000000000000000 ................
ffff880be78c10f0: 00000000000001ff 0000000000000000 ................
ffff880be78c1300: 00000000000000fd 0000000000000000 ................
ffff880be78c1320: 00000000000000ef 0000000000000000 ................
ffff880be78c1330: 0000000000010000 0000000000000000 ................
ffff880be78c1340: 0000000000010400 0000000000000000 ................
ffff880be78c1350: 0000000000010700 0000000000000000 ................
ffff880be78c1360: 0000000000010400 0000000000000000 ................
ffff880be78c1370: 00000000000000fe 0000000000000000 ................
ffff880be78c1380: 00000000038f9cd1 0000000000000000 ................
ffff880be78c13e0: 0000000000000003 0000000000000000 ................

020 Local APIC ID Register = 0x0000000001000000
030 Local APIC Version Register = 0x0000000001050014
0d0 Logical Destination Register = 0x0000000000000002
0e0 Destination Format Register = 0x00000000ffffffff
0f0 Spurious Interrupt Vector Reg = 0x00000000000001ff
190 Trigger Mode Reg (TMR); 63:32 = 0x0000000000000000
300 Interrupt Cmd Reg (ICR); 0-31 = 0x00000000000000fd
310 Interrupt Cmd Reg (ICR); 32-63 = 0x0000000000000000
320 LVT Timer Register = 0x00000000000000ef
330 LVT Thermal Sensor Register = 0x0000000000010000
340 LVT Perf Mon Counter Reg = 0x0000000000010400
350 LVT LINT0 Register = 0x0000000000010700
360 LVT LINT1 Register = 0x0000000000010400
370 LVT Error Register = 0x00000000000000fe
380 Initial Count Reg (for Timer) = 0x00000000038f9cd1
3e0 Divide Cfg Reg (for Timer) = 0x0000000000000003

* Dump of the kvm_lapic structure and registers for each of L1's vCPUs, taken
from L0 when the hang is detected:

Note: I used gdb on the qemu process to trap the moment we detect the hang in
csd_lock_wait. This way we should preserve the registers as they exist when
we've detected the lockup.

$1 = {
base_address = 4276092928,
dev = {
ops = 0xffffffffa01dc890 <apic_mmio_ops>
},
lapic_timer = {
timer = {
node = {
node = {
__rb_parent_color = 18446612172786658320,
rb_right = 0xffff880bff24dce0,
rb_left = 0x0
},
expires = {
tv64 = 170996305305003
}
},
_softexpires = {
tv64 = 170996305305003
},
function = 0xffffffffa01c5f80 <apic_timer_fn>,
base = 0xffff880bff24d1a0,
state = 0,
start_pid = 13304,
start_site = 0xffffffff8108edd8 <hrtimer_start+24>,
start_comm = "qemu-system-x86"
},
period = 3995184,
timer_mode_mask = 131072,
tscdeadline = 0,
pending = {
counter = 1
}
},
divide_count = 16,
vcpu = 0xffff880be7443f00,
irr_pending = false,
isr_count = 0,
highest_isr_cache = -1,
regs = 0xffff880be91e6000,
vapic_addr = 0,
vapic_cache = {
generation = 0,
gpa = 0,
hva = 0,
len = 0,
memslot = 0x0
},
pending_events = 0,
sipi_vector = 0
}
ffff880be91e6030: 0000000001050014 0000000000000000 ................
ffff880be91e60d0: 0000000000000001 0000000000000000 ................
ffff880be91e60e0: 00000000ffffffff 0000000000000000 ................
ffff880be91e60f0: 00000000000001ff 0000000000000000 ................
ffff880be91e6190: 000000000e200000 0000000000000000 .. .............
ffff880be91e6300: 00000000000000fb 0000000000000000 ................
ffff880be91e6310: 0000000000000001 0000000000000000 ................
ffff880be91e6320: 00000000000000ef 0000000000000000 ................
ffff880be91e6330: 0000000000010000 0000000000000000 ................
ffff880be91e6340: 0000000000010000 0000000000000000 ................
ffff880be91e6350: 0000000000010700 0000000000000000 ................
ffff880be91e6360: 0000000000000400 0000000000000000 ................
ffff880be91e6370: 00000000000000fe 0000000000000000 ................
ffff880be91e6380: 000000000003cf63 0000000000000000 c...............
ffff880be91e63e0: 0000000000000003 0000000000000000 ................

020 Local APIC ID Register = 0x0000000000000000
030 Local APIC Version Register = 0x0000000001050014
0d0 Logical Destination Register = 0x0000000000000001
0e0 Destination Format Register = 0x00000000ffffffff
0f0 Spurious Interrupt Vector Reg = 0x00000000000001ff
190 Trigger Mode Reg (TMR); 63:32 = 0x000000000e200000
300 Interrupt Cmd Reg (ICR); 0-31 = 0x00000000000000fb
310 Interrupt Cmd Reg (ICR); 32-63 = 0x0000000000000001
320 LVT Timer Register = 0x00000000000000ef
330 LVT Thermal Sensor Register = 0x0000000000010000
340 LVT Perf Mon Counter Reg = 0x0000000000010000
350 LVT LINT0 Register = 0x0000000000010700
360 LVT LINT1 Register = 0x0000000000000400
370 LVT Error Register = 0x00000000000000fe
380 Initial Count Reg (for Timer) = 0x000000000003cf63
3e0 Divide Cfg Reg (for Timer) = 0x0000000000000003

$2 = {
base_address = 4276092928,
dev = {
ops = 0xffffffffa01dc890 <apic_mmio_ops>
},
lapic_timer = {
timer = {
node = {
node = {
__rb_parent_color = 18446612172786662160,
rb_right = 0x0,
rb_left = 0x0
},
expires = {
tv64 = 170994304769720
}
},
_softexpires = {
tv64 = 170994304769720
},
function = 0xffffffffa01c5f80 <apic_timer_fn>,
base = 0xffff880bff20d1a0,
state = 0,
start_pid = 13305,
start_site = 0xffffffff8108edd8 <hrtimer_start+24>,
start_comm = "qemu-system-x86"
},
period = 3422832,
timer_mode_mask = 131072,
tscdeadline = 0,
pending = {
counter = 0
}
},
divide_count = 16,
vcpu = 0xffff880be7440000,
irr_pending = true,
isr_count = 1,
highest_isr_cache = 251,
regs = 0xffff880be78c1000,
vapic_addr = 0,
vapic_cache = {
generation = 0,
gpa = 0,
hva = 0,
len = 0,
memslot = 0x0
},
pending_events = 0,
sipi_vector = 154
}
ffff880be78c1020: 0000000001000000 0000000000000000 ................
ffff880be78c1030: 0000000001050014 0000000000000000 ................
ffff880be78c10a0: 00000000000000f0 0000000000000000 ................
ffff880be78c10d0: 0000000000000002 0000000000000000 ................
ffff880be78c10e0: 00000000ffffffff 0000000000000000 ................
ffff880be78c10f0: 00000000000001ff 0000000000000000 ................
ffff880be78c1170: 0000000008000000 0000000000000000 ................
ffff880be78c1250: 0000000000020000 0000000000000000 ................
ffff880be78c1260: 0000000000000002 0000000000000000 ................
ffff880be78c1270: 0000000000008000 0000000000000000 ................
ffff880be78c1300: 00000000000000fd 0000000000000000 ................
ffff880be78c1320: 00000000000000ef 0000000000000000 ................
ffff880be78c1330: 0000000000010000 0000000000000000 ................
ffff880be78c1340: 0000000000010400 0000000000000000 ................
ffff880be78c1350: 0000000000010700 0000000000000000 ................
ffff880be78c1360: 0000000000010400 0000000000000000 ................
ffff880be78c1370: 00000000000000fe 0000000000000000 ................
ffff880be78c1380: 00000000000343a7 0000000000000000 .C..............
ffff880be78c13e0: 0000000000000003 0000000000000000 ................

020 Local APIC ID Register = 0x0000000001000000
030 Local APIC Version Register = 0x0000000001050014
0a0 Processor Priority Reg (PPR) = 0x00000000000000f0 *
0d0 Logical Destination Register = 0x0000000000000002
0e0 Destination Format Register = 0x00000000ffffffff
0f0 Spurious Interrupt Vector Reg = 0x00000000000001ff
170 In-Service Reg (ISR); 255:224 = 0x0000000008000000 *
190 Trigger Mode Reg (TMR); 63:32 = 0x0000000000000000
250 Intr Req Reg (IRR); 191:160 = 0x0000000000020000 *
260 Intr Req Reg (IRR); 223:192 = 0x0000000000000002 *
270 Intr Req Reg (IRR); 255:224 = 0x0000000000008000 *
300 Interrupt Cmd Reg (ICR); 0-31 = 0x00000000000000fd
310 Interrupt Cmd Reg (ICR); 32-63 = 0x0000000000000000
320 LVT Timer Register = 0x00000000000000ef
330 LVT Thermal Sensor Register = 0x0000000000010000
340 LVT Perf Mon Counter Reg = 0x0000000000010400
350 LVT LINT0 Register = 0x0000000000010700
360 LVT LINT1 Register = 0x0000000000010400
370 LVT Error Register = 0x00000000000000fe
380 Initial Count Reg (for Timer) = 0x00000000000343a7
3e0 Divide Cfg Reg (for Timer) = 0x0000000000000003
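
The starred registers can be decoded mechanically: ISR/IRR/TMR are banks of
8 x 32-bit registers spaced 0x10 apart, each covering 32 vectors, so a set bit
B in the register at offset REG maps to vector ((REG - BASE)/0x10)*32 + B. A
small decode sketch (not from the thread), fed vCPU1's starred values, shows
vector 0xfb in service - which in x86 kernels of this era is the
CALL_FUNCTION_SINGLE_VECTOR IPI, matching highest_isr_cache = 251 above - with
0xef (the local APIC timer vector) and two further vectors (0xb1, 0xc1) still
pending in the IRR:

/* Sketch: decode which vectors are set in the APIC ISR/IRR banks dumped above.
 * ISR starts at offset 0x100, TMR at 0x180, IRR at 0x200; the register at
 * (base + i*0x10) covers vectors i*32 .. i*32+31.
 */
#include <stdio.h>
#include <stdint.h>

static void decode(const char *name, unsigned int base,
		   unsigned int reg, uint32_t val)
{
	unsigned int first = ((reg - base) / 0x10) * 32;
	unsigned int bit;

	for (bit = 0; bit < 32; bit++)
		if (val & (1u << bit))
			printf("%s @0x%03x: vector 0x%02x\n", name, reg, first + bit);
}

int main(void)
{
	/* vCPU1 ($2), post-hang dump: */
	decode("ISR", 0x100, 0x170, 0x08000000);	/* -> 0xfb */
	decode("IRR", 0x200, 0x250, 0x00020000);	/* -> 0xb1 */
	decode("IRR", 0x200, 0x260, 0x00000002);	/* -> 0xc1 */
	decode("IRR", 0x200, 0x270, 0x00008000);	/* -> 0xef */
	return 0;
}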

2015-04-16 11:04:37

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> Ingo,
>
> Below are the patches and data I've gathered from the reproducer. My
> methodology was as described previously; however I used gdb on the
> qemu process in order to breakpoint L1 once we've detected the hang.
> This made dumping the kvm_lapic structures on L0 more reliable.

Thanks!

So I have trouble interpreting the L1 backtrace, because it shows
something entirely new (to me).

First lets clarify the terminology, to make sure I got the workload
all right:

- L0 is the host kernel, running native Linux. It's not locking up.

- L1 is the guest kernel, running virtualized Linux. This is the one
that is locking up.

- L2 is the nested guest kernel, running whatever test workload you
were running - this is obviously locking up together with L1.

Right?

So with that cleared up, the backtrace on L1 looks like this:

> * Crash dump backtrace from L1:
>
> crash> bt -a
> PID: 26 TASK: ffff88013a4f1400 CPU: 0 COMMAND: "ksmd"
> #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> #2 [ffff88013a503b20] panic at ffffffff818068e0
> #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98

So this one, VCPU0, is trying to send an IPI to VCPU1:

> PID: 1674 TASK: ffff8800ba4a9e00 CPU: 1 COMMAND: "qemu-system-x86"
> #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> [exception RIP: vmx_vcpu_run+992]
> RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
> RAX: 0000000080000202 RBX: ffff880139b30000 RCX: ffff880139b30000
> RDX: 0000000000000200 RSI: ffff880139b30000 RDI: ffff880139b30000
> RBP: ffff88013940bd28 R8: 00007fe192b71110 R9: 00007fe192b71140
> R10: 00007fff66407d00 R11: 00007fe1927f0060 R12: 0000000000000000
> R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
> ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> --- <NMI exception stack> ---
> #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d

So the problem here that I can see is that L1's VCPU1 appears to be
looping with interrupts disabled:

> RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082

Look how RFLAGS doesn't have 0x200 set - so it's executing with
interrupts disabled.

That is why the IPI does not get through to it, but kdump's NMI had no
problem getting through.
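
The check being done by eye here is just bit 9 of the saved RFLAGS - the IF
(interrupt enable) flag, mask 0x200. A trivial stand-alone sketch of that test,
using the RFLAGS value from the exception frame above:

/* Sketch: test the IF bit (bit 9) of a saved RFLAGS value. */
#include <stdio.h>
#include <stdint.h>

#define X86_EFLAGS_IF	0x200ULL	/* interrupt enable flag */

int main(void)
{
	uint64_t rflags = 0x82;	/* RFLAGS from the vmx_vcpu_run exception frame */

	printf("IF=%d -> maskable interrupts (and thus IPIs) are %s\n",
	       !!(rflags & X86_EFLAGS_IF),
	       (rflags & X86_EFLAGS_IF) ? "deliverable" : "blocked");
	return 0;
}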

This (assuming all backtraces are exact!):

> #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad

suggests that we called vmx_vcpu_run() from
vmx_handle_external_intr(), and that we are executing L2 guest code
with interrupts disabled.

How is this supposed to work? What mechanism does KVM have against a
(untrusted) guest interrupt handler locking up?

I might be misunderstanding how this works at the KVM level, but from
the APIC perspective the situation appears to be pretty clear: CPU1's
interrupts are turned off, so it cannot receive IPIs, the CSD wait
will eventually time out.

Now obviously it appears to be anomalous (assuming my analysis is
correct) that the interrupt handler has locked up, but it's
immaterial: a nested kernel must not allow its guest to lock it up.

Thanks,

Ingo

2015-04-16 15:59:16

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
>
> * Chris J Arges <[email protected]> wrote:
>
> > Ingo,
> >
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 once we've detected the hang.
> > This made dumping the kvm_lapic structures on L0 more reliable.
>
> Thanks!
>
> So I have trouble interpreting the L1 backtrace, because it shows
> something entirely new (to me).
>
> First lets clarify the terminology, to make sure I got the workload
> all right:
>
> - L0 is the host kernel, running native Linux. It's not locking up.
>
> - L1 is the guest kernel, running virtualized Linux. This is the one
> that is locking up.
>
> - L2 is the nested guest kernel, running whatever test workload you
> were running - this is obviously locking up together with L1.
>
> Right?

Yup this sums it up nicely.

>
> So with that cleared up, the backtrace on L1 looks like this:
>
> > * Crash dump backtrace from L1:
> >
> > crash> bt -a
> > PID: 26 TASK: ffff88013a4f1400 CPU: 0 COMMAND: "ksmd"
> > #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> > #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> > #2 [ffff88013a503b20] panic at ffffffff818068e0
> > #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> > #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> > #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> > #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> > #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> > #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> > #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> > #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> > #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> > #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98
>
> So this one, VCPU0, is trying to send an IPI to VCPU1:
>
> > PID: 1674 TASK: ffff8800ba4a9e00 CPU: 1 COMMAND: "qemu-system-x86"
> > #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> > #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> > #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> > #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> > #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> > [exception RIP: vmx_vcpu_run+992]
> > RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
> > RAX: 0000000080000202 RBX: ffff880139b30000 RCX: ffff880139b30000
> > RDX: 0000000000000200 RSI: ffff880139b30000 RDI: ffff880139b30000
> > RBP: ffff88013940bd28 R8: 00007fe192b71110 R9: 00007fe192b71140
> > R10: 00007fff66407d00 R11: 00007fe1927f0060 R12: 0000000000000000
> > R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000000
> > ORIG_RAX: ffffffffffffffff CS: 0010 SS: 0018
> > --- <NMI exception stack> ---
> > #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> > #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> > #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> > #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> > #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> > #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> > #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d
>
> So the problem here that I can see is that L1's VCPU1 appears to be
> looping with interrupts disabled:
>
> > RIP: ffffffff8104cef0 RSP: ffff88013940bcb8 RFLAGS: 00000082
>
> Look how RFLAGS doesn't have 0x200 set - so it's executing with
> interrupts disabled.
>

I've run this with L0: 3.13+b6b8a145, L1: 4.0+debug patches and got a similar
backtrace with interrupts disabled; however this _may_ be another issue.

I ran L0/L1: 3.13+b6b8a145 and got something like this:
PID: 36 TASK: ffff8801396a9800 CPU: 0 COMMAND: "ksmd"
#0 [ffff88013fc03d18] machine_kexec at ffffffff8104ace2
#1 [ffff88013fc03d68] crash_kexec at ffffffff810e72e3
#2 [ffff88013fc03e30] panic at ffffffff8171a404
#3 [ffff88013fc03ea8] watchdog_timer_fn at ffffffff8110daa5
#4 [ffff88013fc03ed8] __run_hrtimer at ffffffff8108e7a7
#5 [ffff88013fc03f18] hrtimer_interrupt at ffffffff8108ef6f
#6 [ffff88013fc03f80] local_apic_timer_interrupt at ffffffff81043617
#7 [ffff88013fc03f98] smp_apic_timer_interrupt at ffffffff8173414f
#8 [ffff88013fc03fb0] apic_timer_interrupt at ffffffff81732add
--- <IRQ stack> ---
#9 [ffff880137fd3b08] apic_timer_interrupt at ffffffff81732add
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbf52 RSP: ffff880137fd3bb0 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff880137fd3b80 RCX: 0000000000000002
RDX: ffffffff8180ad80 RSI: 0000000000000000 RDI: 0000000000000282
RBP: ffff880137fd3be0 R8: ffffffff8180ad68 R9: 0000000000000001
R10: 0000000000000000 R11: 0000000000000002 R12: 0000000000000000
R13: 0000000000000002 R14: 0000000000000000 R15: ffff880137fd3bd0
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880137fd3be8] smp_call_function_single at ffffffff810dc065
#11 [ffff880137fd3c60] smp_call_function_many at ffffffff810dc496
#12 [ffff880137fd3cc0] make_all_cpus_request at ffffffffa004e5ef [kvm]
#13 [ffff880137fd3cf8] kvm_flush_remote_tlbs at ffffffffa004e620 [kvm]
#14 [ffff880137fd3d18] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e6a2 [kvm]
#15 [ffff880137fd3d58] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#16 [ffff880137fd3d90] try_to_merge_with_ksm_page at ffffffff8119d296
#17 [ffff880137fd3e00] ksm_do_scan at ffffffff8119d749
#18 [ffff880137fd3e78] ksm_scan_thread at ffffffff8119e0cf
#19 [ffff880137fd3ed0] kthread at ffffffff8108b592
#20 [ffff880137fd3f50] ret_from_fork at ffffffff81731ccc

PID: 1543 TASK: ffff880137861800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88013fd05e58] crash_nmi_callback at ffffffff81040082
#1 [ffff88013fd05e68] nmi_handle at ffffffff8172aa38
#2 [ffff88013fd05ec8] do_nmi at ffffffff8172ac00
#3 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81729ea1
[exception RIP: _raw_spin_lock+50]
RIP: ffffffff817292b2 RSP: ffff8800372ab8f8 RFLAGS: 00000202
RAX: 0000000000000010 RBX: 0000000000000010 RCX: 0000000000000202
RDX: ffff8800372ab8f8 RSI: 0000000000000018 RDI: 0000000000000001
RBP: ffffffff817292b2 R8: ffffffff817292b2 R9: 0000000000000018
R10: ffff8800372ab8f8 R11: 0000000000000202 R12: ffffffffffffffff
R13: ffff8800b8f78000 R14: 00000000000070ae R15: 00000000000070ae
ORIG_RAX: 00000000000070ae CS: 0010 SS: 0018
--- <DOUBLEFAULT exception stack> ---
#4 [ffff8800372ab8f8] _raw_spin_lock at ffffffff817292b2
#5 [ffff8800372ab900] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e67e [kvm]
#6 [ffff8800372ab940] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#7 [ffff8800372ab978] do_wp_page at ffffffff81177fee

A previous backtrace of a 3.19 series kernel is here and showing interrupts
enabled on both CPUs on L1:
https://lkml.org/lkml/2015/2/23/234
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt

> That is why the IPI does not get through to it, but kdump's NMI had no
> problem getting through.
>
> This (assuming all backtraces are exact!):
>
> > #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> > #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> > #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
>
> suggests that we called vmx_vcpu_run() from
> vmx_handle_external_intr(), and that we are executing L2 guest code
> with interrupts disabled.
>
> How is this supposed to work? What mechanism does KVM have against a
> (untrusted) guest interrupt handler locking up?
>
> I might be misunderstanding how this works at the KVM level, but from
> the APIC perspective the situation appears to be pretty clear: CPU1's
> interrupts are turned off, so it cannot receive IPIs, the CSD wait
> will eventually time out.
>
> Now obviously it appears to be anomalous (assuming my analysis is
> correct) that the interrupt handler has locked up, but it's
> immaterial: a nested kernel must not allow its guest to lock it up.
>
> Thanks,
>
> Ingo
>

Yes, I think at this point I'll go through the various backtraces and try to
narrow things down. I think overall we're seeing a single effect from multiple
code paths.

--chris j arges

2015-04-16 16:31:53

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> A previous backtrace of a 3.19 series kernel is here and showing interrupts
> enabled on both CPUs on L1:
> https://lkml.org/lkml/2015/2/23/234
> http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt
>
> [...]
>
> Yes, I think at this point I'll go through the various backtraces
> and try to narrow things down. I think overall we're seeing a single
> effect from multiple code paths.

Now what would be nice is to observe whether the CPU that is not
doing the CSD wait is truly locked up.

It might be executing random KVM-ish workloads and the various
backtraces we've seen so far are just a random sample of those
workloads (from L1's perspective).

Yet the fact that the kdump's NMI gets through is a strong indication
that the CPU's APIC is fine: NMIs are essentially IPIs too, they just
go to the NMI vector, which punches through irqs-off regions.

So maybe another debug trick would be useful: instead of re-sending
the IPI, send a single non-destructive NMI every second or so,
creating a backtrace on the other CPU. From that we'll be able to see
whether it's locked up permanently in an irqs-off section.

I.e. basically you could try to trigger the 'show NMI backtraces on
all CPUs' logic when the lockup triggers, and repeat it every couple
of seconds.

The simplest method to do that would be to call:

trigger_all_cpu_backtrace();

every couple of seconds, in the CSD polling loop - after the initial
timeout has passed. I'd suggest to collect at least 10 pairs of
backtraces that way.

Thanks,

Ingo
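
A minimal sketch of that variant - the poll loop from the debug patch earlier
in this thread, with the IPI resend replaced by an NMI backtrace request
(untested, reusing the same CSD_LOCK_TIMEOUT/bug_id plumbing):

/* Sketch only: csd_lock_wait() poll loop, asking for NMI backtraces of all
 * CPUs each time the timeout fires instead of re-sending the IPI.
 */
	while (csd->flags & CSD_FLAG_LOCK) {
		ts1 = sched_clock()/1000000;

		ts_delta = ts1 - ts0;
		if (unlikely(ts_delta >= CSD_LOCK_TIMEOUT)) {
			bug_id = csd_bug_count;
			csd_bug_count++;

			ts0 = ts1;	/* re-start the timeout detection */

			printk("csd: CSD lock (#%d) still stuck on CPU#%02d, waiting for CPU#%02d\n",
			       bug_id, raw_smp_processor_id(), cpu);
			/*
			 * Show where every CPU - in particular the
			 * non-responsive one - is executing right now:
			 */
			trigger_all_cpu_backtrace();
		}
		cpu_relax();
	}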

Subject: [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking

Commit-ID: 8053871d0f7f67c7efb7f226ef031f78877d6625
Gitweb: http://git.kernel.org/tip/8053871d0f7f67c7efb7f226ef031f78877d6625
Author: Linus Torvalds <[email protected]>
AuthorDate: Wed, 11 Feb 2015 12:42:10 -0800
Committer: Ingo Molnar <[email protected]>
CommitDate: Fri, 17 Apr 2015 09:57:52 +0200

smp: Fix smp_call_function_single_async() locking

The current smp_function_call code suffers a number of problems, most
notably smp_call_function_single_async() is broken.

The problem is that flush_smp_call_function_queue() does csd_unlock()
_after_ calling csd->func(). This means that a caller cannot properly
synchronize the csd usage as it has to.

Change the code to release the csd before calling ->func() for the
async case, and put a WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK) in
smp_call_function_single_async() to warn us of improper serialization,
because any waiting there can result in deadlocks when called with
IRQs disabled.

Rename the (currently) unused WAIT flag to SYNCHRONOUS and (re)use it
such that we know what to do in flush_smp_call_function_queue().

Rework csd_{,un}lock() to use smp_load_acquire() / smp_store_release()
to avoid some full barriers while more clearly providing lock
semantics.

Finally move the csd maintenance out of generic_exec_single() into its
callers for clearer code.

Signed-off-by: Linus Torvalds <[email protected]>
[ Added changelog. ]
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Cc: Jens Axboe <[email protected]>
Cc: Rafael David Tinoco <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Link: http://lkml.kernel.org/r/CA+55aFz492bzLFhdbKN-Hygjcreup7CjMEYk3nTSfRWjppz-OA@mail.gmail.com
Signed-off-by: Ingo Molnar <[email protected]>
---
kernel/smp.c | 78 ++++++++++++++++++++++++++++++++++++------------------------
1 file changed, 47 insertions(+), 31 deletions(-)

diff --git a/kernel/smp.c b/kernel/smp.c
index f38a1e6..2aaac2c 100644
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -19,7 +19,7 @@

enum {
CSD_FLAG_LOCK = 0x01,
- CSD_FLAG_WAIT = 0x02,
+ CSD_FLAG_SYNCHRONOUS = 0x02,
};

struct call_function_data {
@@ -107,7 +107,7 @@ void __init call_function_init(void)
*/
static void csd_lock_wait(struct call_single_data *csd)
{
- while (csd->flags & CSD_FLAG_LOCK)
+ while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK)
cpu_relax();
}

@@ -121,19 +121,17 @@ static void csd_lock(struct call_single_data *csd)
* to ->flags with any subsequent assignments to other
* fields of the specified call_single_data structure:
*/
- smp_mb();
+ smp_wmb();
}

static void csd_unlock(struct call_single_data *csd)
{
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

/*
* ensure we're all done before releasing data:
*/
- smp_mb();
-
- csd->flags &= ~CSD_FLAG_LOCK;
+ smp_store_release(&csd->flags, 0);
}

static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
@@ -144,13 +142,16 @@ static DEFINE_PER_CPU_SHARED_ALIGNED(struct call_single_data, csd_data);
* ->func, ->info, and ->flags set.
*/
static int generic_exec_single(int cpu, struct call_single_data *csd,
- smp_call_func_t func, void *info, int wait)
+ smp_call_func_t func, void *info)
{
- struct call_single_data csd_stack = { .flags = 0 };
- unsigned long flags;
-
-
if (cpu == smp_processor_id()) {
+ unsigned long flags;
+
+ /*
+ * We can unlock early even for the synchronous on-stack case,
+ * since we're doing this from the same CPU..
+ */
+ csd_unlock(csd);
local_irq_save(flags);
func(info);
local_irq_restore(flags);
@@ -161,21 +162,9 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
if ((unsigned)cpu >= nr_cpu_ids || !cpu_online(cpu))
return -ENXIO;

-
- if (!csd) {
- csd = &csd_stack;
- if (!wait)
- csd = this_cpu_ptr(&csd_data);
- }
-
- csd_lock(csd);
-
csd->func = func;
csd->info = info;

- if (wait)
- csd->flags |= CSD_FLAG_WAIT;
-
/*
* The list addition should be visible before sending the IPI
* handler locks the list to pull the entry off it because of
@@ -190,9 +179,6 @@ static int generic_exec_single(int cpu, struct call_single_data *csd,
if (llist_add(&csd->llist, &per_cpu(call_single_queue, cpu)))
arch_send_call_function_single_ipi(cpu);

- if (wait)
- csd_lock_wait(csd);
-
return 0;
}

@@ -250,8 +236,17 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
}

llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ /* Do we wait until *after* callback? */
+ if (csd->flags & CSD_FLAG_SYNCHRONOUS) {
+ func(info);
+ csd_unlock(csd);
+ } else {
+ csd_unlock(csd);
+ func(info);
+ }
}

/*
@@ -274,6 +269,8 @@ static void flush_smp_call_function_queue(bool warn_cpu_offline)
int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
int wait)
{
+ struct call_single_data *csd;
+ struct call_single_data csd_stack = { .flags = CSD_FLAG_LOCK | CSD_FLAG_SYNCHRONOUS };
int this_cpu;
int err;

@@ -292,7 +289,16 @@ int smp_call_function_single(int cpu, smp_call_func_t func, void *info,
WARN_ON_ONCE(cpu_online(this_cpu) && irqs_disabled()
&& !oops_in_progress);

- err = generic_exec_single(cpu, NULL, func, info, wait);
+ csd = &csd_stack;
+ if (!wait) {
+ csd = this_cpu_ptr(&csd_data);
+ csd_lock(csd);
+ }
+
+ err = generic_exec_single(cpu, csd, func, info);
+
+ if (wait)
+ csd_lock_wait(csd);

put_cpu();

@@ -321,7 +327,15 @@ int smp_call_function_single_async(int cpu, struct call_single_data *csd)
int err = 0;

preempt_disable();
- err = generic_exec_single(cpu, csd, csd->func, csd->info, 0);
+
+ /* We could deadlock if we have to wait here with interrupts disabled! */
+ if (WARN_ON_ONCE(csd->flags & CSD_FLAG_LOCK))
+ csd_lock_wait(csd);
+
+ csd->flags = CSD_FLAG_LOCK;
+ smp_wmb();
+
+ err = generic_exec_single(cpu, csd, csd->func, csd->info);
preempt_enable();

return err;
@@ -433,6 +447,8 @@ void smp_call_function_many(const struct cpumask *mask,
struct call_single_data *csd = per_cpu_ptr(cfd->csd, cpu);

csd_lock(csd);
+ if (wait)
+ csd->flags |= CSD_FLAG_SYNCHRONOUS;
csd->func = func;
csd->info = info;
llist_add(&csd->llist, &per_cpu(call_single_queue, cpu));
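
To illustrate the ownership rule the new WARN_ON_ONCE() enforces, here is a
minimal caller-side sketch. The my_* names are made up for illustration; only
the smp/atomic calls and struct call_single_data are existing kernel APIs, and
this is a sketch of the intended usage rather than code from the patch:

#include <linux/smp.h>
#include <linux/atomic.h>

/*
 * Minimal caller-side sketch (illustrative names).  The csd owner must
 * serialize submissions itself: with this patch the csd is released
 * *before* ->func() runs in the async case, so clearing a "busy" flag
 * from the callback is enough to make the csd reusable.
 */
static atomic_t my_csd_busy;

static void my_poke_func(void *info)
{
	/* Runs on the target CPU out of the call-function IPI. */
	atomic_set(&my_csd_busy, 0);
}

static struct call_single_data my_csd = {
	.func = my_poke_func,
};

static int my_poke_cpu(int cpu)
{
	/*
	 * Resubmitting a still-pending csd is exactly what the new
	 * WARN_ON_ONCE() catches; if it fires,
	 * smp_call_function_single_async() has to wait, which could
	 * deadlock when the caller has IRQs disabled.
	 */
	if (atomic_xchg(&my_csd_busy, 1))
		return -EBUSY;

	return smp_call_function_single_async(cpu, &my_csd);
}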

2015-04-29 21:09:15

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Thu, Apr 16, 2015 at 06:31:40PM +0200, Ingo Molnar wrote:

<snip>

> Now what would be nice is to observe whether the CPU that is not
> doing the CSD wait is truly locked up.
>
> It might be executing random KVM-ish workloads and the various
> backtraces we've seen so far are just a random sample of those
> workloads (from L1's perspective).
>
> Yet the fact that the kdump's NMI gets through is a strong indication
> that the CPU's APIC is fine: NMIs are essentially IPIs too, they just
> go to the NMI vector, which punches through irqs-off regions.
>
> So maybe another debug trick would be useful: instead of re-sending
> the IPI, send a single non-destructive NMI every second or so,
> creating a backtrace on the other CPU. From that we'll be able to see
> whether it's locked up permanently in an irqs-off section.
>
> I.e. basically you could try to trigger the 'show NMI backtraces on
> all CPUs' logic when the lockup triggers, and repeat it every couple
> of seconds.
>
> The simplest method to do that would be to call:
>
> trigger_all_cpu_backtrace();
>
> every couple of seconds, in the CSD polling loop - after the initial
> timeout has passed. I'd suggest to collect at least 10 pairs of
> backtraces that way.
>
> Thanks,
>
> Ingo
>

Ingo,

Ok, finally got some good data and some analysis for this issue.

I've uploaded my data here:
http://people.canonical.com/~arges/csd_hang/

Patch I used for these experiments is here:
http://people.canonical.com/~arges/csd_hang/csd-lock-debug.patch
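
Roughly, the idea behind such a debug patch (a sketch of the approach only,
not the actual contents of csd-lock-debug.patch; the csd_lock_timeout_ms knob
is an assumed name) is to give csd_lock_wait() a deadline and to fire the NMI
backtraces / ftrace dump once it expires:

/*
 * Sketch of the debug approach (not the actual csd-lock-debug.patch):
 * poll with a deadline instead of spinning blindly, and dump state once
 * the csd looks stuck.  Needs <linux/jiffies.h>, <linux/nmi.h> and
 * <linux/ftrace.h>; csd_lock_timeout_ms is an assumed tunable.
 */
static unsigned int csd_lock_timeout_ms = 1000;	/* 1s, as in experiment 1 */

static void csd_lock_wait(struct call_single_data *csd)
{
	u64 deadline = get_jiffies_64() + msecs_to_jiffies(csd_lock_timeout_ms);

	while (smp_load_acquire(&csd->flags) & CSD_FLAG_LOCK) {
		if (time_after64(get_jiffies_64(), deadline)) {
			pr_err("csd: %ps stuck for more than %u ms\n",
			       csd->func, csd_lock_timeout_ms);
			trigger_all_cpu_backtrace();	/* experiment 1: NMI backtraces */
			ftrace_dump(DUMP_ALL);		/* experiment 2: dump the trace */
			deadline = get_jiffies_64() +
				   msecs_to_jiffies(csd_lock_timeout_ms);
		}
		cpu_relax();
	}
}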

The actual dmesg/bt outputs are in the timestamped directories.

One thing I tried to do was observe how long it took until we got into an
actual softlockup that led to a full system lockup. The maximum time I saw
a kernel thread spend in csd_lock without it being a full system hang was
0.017s. By detecting the hang earlier I was able to use tools like ftrace to
actually see the IPI call that hangs the machine.

* Experiment 1:

For every lockup detected generate NMIs for all cpus using
'trigger_all_cpu_backtrace()'. I used a csd_lock_timeout value of 1s.

This was perhaps inconclusive. If I added the call when we detected a hang, I
never got the watchdog to detect a softlockup. Furthermore, the machine became
unresponsive when running this experiment. A breakpoint set via gdb/qemu showed
that the machine was still hitting the csd timeout code over and over again;
from there I was able to 'call panic()' and get a crashdump.

In 201504291924 and 201504292008 it seems like the backtrace itself is causing
csd_lock_wait to detect a hang and then backtrace again.

* Experiment 2:

A bit more interesting was using ftrace and dumping its output when a lockup
was detected in csd_lock_wait.

The csd_lock_timeout value was set to 0.018s, and ftrace was dumped on oops.
Using this method I collected 10 traces, most of which had usable ftrace output.


In 201504282127 we see a normal flush_tlb_page() call, followed by the
interrupt handler on the other CPU calling the requested function
'flush_tlb_func':

[ 603.248016] 2452.083739 | 0) | ptep_clear_flush() {
[ 603.248016] 2452.083739 | 0) | flush_tlb_page() {
[ 603.248016] 2452.083739 | 0) 0.071 us | leave_mm();
[ 603.248016] 2452.083739 | 0) | native_flush_tlb_others() {
[ 603.248016] 2452.083740 | 0) | smp_call_function_many() {
[ 603.248016] 2452.083740 | 0) | smp_call_function_single() {
[ 603.248016] 2452.083740 | 0) | generic_exec_single() {
[ 603.248016] 2452.083740 | 0) | native_send_call_func_single_ipi() {
[ 603.248016] 2452.083740 | 0) | x2apic_send_IPI_mask() {
[ 603.248016] 2452.083741 | 0) 1.406 us | __x2apic_send_IPI_mask();
[ 603.248016] 2452.083743 | 0) 1.958 us | }
[ 603.248016] 2452.083743 | 0) 2.349 us | }
[ 603.248016] 2452.083743 | 0) 2.823 us | }
[ 603.248016] 2452.083743 | 0) + 13.132 us | csd_lock_wait.isra.4();
[ 603.248016] 2452.083746 | 1) + 25.238 us | }
[ 603.248016] 2452.083747 | 1) 0.987 us | vmx_read_l1_tsc();
[ 603.248016] 2452.083749 | 1) | vmx_handle_external_intr() {
[ 603.248016] 2452.083752 | 1) | smp_call_function_single_interrupt() {
[ 603.248016] 2452.083752 | 1) 0.063 us | kvm_guest_apic_eoi_write();
[ 603.248016] 2452.083753 | 1) | irq_enter() {
[ 603.248016] 2452.083753 | 1) 0.077 us | rcu_irq_enter();
[ 603.248016] 2452.083754 | 1) 0.751 us | }
[ 603.248016] 2452.083754 | 1) | generic_smp_call_function_single_interrupt() {
[ 603.248016] 2452.083754 | 1) | flush_smp_call_function_queue() {
[ 603.248016] 2452.083755 | 1) 0.459 us | flush_tlb_func();
[ 603.248016] 2452.083756 | 1) 0.089 us | csd_unlock();
[ 603.248016] 2452.083757 | 1) 2.180 us | }
[ 603.248016] 2452.083757 | 0) + 16.726 us | }
[ 603.248016] 2452.083757 | 1) 2.826 us | }
[ 603.248016] 2452.083757 | 0) + 17.236 us | }
[ 603.248016] 2452.083757 | 0) + 17.722 us | }

Later in the trace we see the same call followed by vmx_handle_external_intr()
ignoring the call:

[ 603.248016] 2452.083823 | 0) | ptep_clear_flush() {
[ 603.248016] 2452.083824 | 0) | flush_tlb_page() {
[ 603.248016] 2452.083824 | 0) 0.109 us | leave_mm();
[ 603.248016] 2452.083824 | 0) | native_flush_tlb_others() {
[ 603.248016] 2452.083824 | 0) | smp_call_function_many() {
[ 603.248016] 2452.083825 | 0) | smp_call_function_single() {
[ 603.248016] 2452.083825 | 0) | generic_exec_single() {
[ 603.248016] 2452.083825 | 0) | native_send_call_func_single_ipi() {
[ 603.248016] 2452.083825 | 0) | x2apic_send_IPI_mask() {
[ 603.248016] 2452.083826 | 0) 1.625 us | __x2apic_send_IPI_mask();
[ 603.248016] 2452.083828 | 0) 2.173 us | }
[ 603.248016] 2452.083828 | 0) 2.588 us | }
[ 603.248016] 2452.083828 | 0) 3.082 us | }
[ 603.248016] 2452.083828 | 0) | csd_lock_wait.isra.4() {
[ 603.248016] 2452.083848 | 1) + 44.033 us | }
[ 603.248016] 2452.083849 | 1) 0.975 us | vmx_read_l1_tsc();
[ 603.248016] 2452.083851 | 1) 1.031 us | vmx_handle_external_intr();
[ 603.248016] 2452.083852 | 1) 0.234 us | __srcu_read_lock();
[ 603.248016] 2452.083853 | 1) | vmx_handle_exit() {
[ 603.248016] 2452.083854 | 1) | handle_ept_violation() {
[ 603.248016] 2452.083856 | 1) | kvm_mmu_page_fault() {
[ 603.248016] 2452.083856 | 1) | tdp_page_fault() {
[ 603.248016] 2452.083856 | 1) 0.092 us | mmu_topup_memory_caches();
[ 603.248016] 2452.083857 | 1) | gfn_to_memslot_dirty_bitmap.isra.84() {
[ 603.248016] 2452.083857 | 1) 0.231 us | gfn_to_memslot();
[ 603.248016] 2452.083858 | 1) 0.774 us | }

So potentially, CPU0 generated an interrupt that caused vcpu_enter_guest to be
called on CPU1. However, when vmx_handle_external_intr was called, it didn't
progress any further.

Another experiment here would be to dump vmcs_read32(VM_EXIT_INTR_INFO); to see
why we don't handle the interrupt.
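
A rough sketch of that instrumentation (assuming a small helper called from
the top of vmx_handle_external_intr() in arch/x86/kvm/vmx.c; the trace
message is made up, while vmcs_read32() and the INTR_INFO_* /
INTR_TYPE_EXT_INTR definitions are existing kernel symbols) might look like:

/*
 * Debug sketch only (untested): log the raw exit interrupt info when an
 * exit does not look like a handleable external interrupt.  Intended to
 * be called from vmx_handle_external_intr().
 */
static void vmx_debug_dump_intr_info(void)
{
	u32 exit_intr_info = vmcs_read32(VM_EXIT_INTR_INFO);

	if ((exit_intr_info & (INTR_INFO_VALID_MASK | INTR_INFO_INTR_TYPE_MASK)) !=
	    (INTR_INFO_VALID_MASK | INTR_TYPE_EXT_INTR))
		trace_printk("vmx: external intr not forwarded, VM_EXIT_INTR_INFO=0x%08x\n",
			     exit_intr_info);
}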


In 201504282155/dmesg.* we see a similar pattern:

[ 1578.184041] 4152.856104 | 1) | flush_tlb_page() {
[ 1578.184041] 4152.856105 | 1) 0.122 us | leave_mm();
[ 1578.184041] 4152.856105 | 1) | native_flush_tlb_others() {
[ 1578.184041] 4152.856106 | 1) | smp_call_function_many() {
[ 1578.184041] 4152.856106 | 1) | smp_call_function_single() {
[ 1578.184041] 4152.856107 | 1) | generic_exec_single() {
[ 1578.184041] 4152.856107 | 1) | native_send_call_func_single_ipi() {
[ 1578.184041] 4152.856107 | 1) | x2apic_send_IPI_mask() {
[ 1578.184041] 4152.856108 | 1) 2.770 us | __x2apic_send_IPI_mask();
[ 1578.184041] 4152.856111 | 1) 3.725 us | }
[ 1578.184041] 4152.856112 | 1) 4.383 us | }
[ 1578.184041] 4152.856112 | 1) 5.220 us | }
[ 1578.184041] 4152.856112 | 1) | csd_lock_wait.isra.4() {
[ 1578.184041] 4152.856149 | 0) + 57.757 us | }
[ 1578.184041] 4152.856150 | 0) 1.613 us | vmx_read_l1_tsc();
[ 1578.184041] 4152.856153 | 0) 1.566 us | vmx_handle_external_intr();
[ 1578.184041] 4152.856156 | 0) 0.360 us | __srcu_read_lock();
[ 1578.184041] 4152.856157 | 0) | vmx_handle_exit() {
[ 1578.184041] 4152.856158 | 0) | handle_wrmsr() {
[ 1578.184041] 4152.856158 | 0) | kvm_set_msr() {

201504282251 and 201504290305 didn't seem to produce proper ftrace output.

201504290015 shows the same thing, but another interrupt gets through
afterwards that does work. The original call is what hangs the machine.

[ 4906.553939] 12500.903369 | 0) 1.736 us | __x2apic_send_IPI_mask();
[ 4906.556244] 12500.903371 | 0) 2.361 us | }
[ 4906.558222] 12500.903371 | 0) 2.785 us | }
[ 4906.560198] 12500.903371 | 0) 3.374 us | }
[ 4906.562131] 12500.903372 | 0) | csd_lock_wait.isra.4() {
[ 4906.564334] 12500.903391 | 1) + 50.198 us | }
[ 4906.566201] 12500.903392 | 1) 0.991 us | vmx_read_l1_tsc();
[ 4906.568284] 12500.903394 | 1) 0.977 us | vmx_handle_external_intr();
[ 4906.570436] 12500.903395 | 1) 0.260 us | __srcu_read_lock();
[ 4906.572727] 12500.903396 | 1) | vmx_handle_exit() {
[ 4906.574810] 12500.903397 | 1) | handle_wrmsr() {
[ 4906.576888] 12500.903397 | 1) | kvm_set_msr() {
[ 4906.578965] 12500.903397 | 1) | vmx_set_msr() {

<snip>

[ 4918.994035] 12500.913985 | 1) | native_send_call_func_single_ipi() {
[ 4918.996510] 12500.913986 | 1) | x2apic_send_IPI_mask() {
[ 4918.998809] 12500.913986 | 1) 2.050 us | __x2apic_send_IPI_mask();
[ 4919.001153] 12500.913989 | 1) 2.820 us | }
[ 4919.003186] 12500.913989 | 1) 3.444 us | }
[ 4919.005215] 12500.913989 | 1) 4.145 us | }
[ 4919.007180] 12500.913989 | 0) | smp_call_function_single_interrupt() {
[ 4919.009516] 12500.913989 | 1) 4.474 us | csd_lock_wait.isra.4();
[ 4919.011745] 12500.913990 | 0) 0.116 us | kvm_guest_apic_eoi_write();
[ 4919.013994] 12500.913990 | 0) | irq_enter() {
[ 4919.016092] 12500.913991 | 0) 0.115 us | rcu_irq_enter();
[ 4919.018210] 12500.913991 | 0) 0.607 us | }
[ 4919.020165] 12500.913991 | 0) | generic_smp_call_function_single_interrupt() {
[ 4919.022645] 12500.913992 | 0) | flush_smp_call_function_queue() {
[ 4919.025005] 12500.913992 | 0) | flush_tlb_func() {
[ 4919.027223] 12500.913993 | 0) 0.348 us | leave_mm();


201504290245 and 201504290342 aren't very clear: we send the interrupt, then
CPU1 gets a timer interrupt right away. CPU0 shows lost events, so perhaps we
missed what actually happened in the ftrace output.

[ 9504.605612] 22133.181730 | 1) 3.307 us | __x2apic_send_IPI_mask();
[ 9504.605616] 22133.181734 | 1) 4.081 us | }
[ 9504.605619] 22133.181735 | 1) 4.735 us | }
[ 9504.605622] 22133.181735 | 1) 5.744 us | }
[ 9504.605626] 22133.181735 | 1) | csd_lock_wait.isra.4() {
[ 9504.605632] 22133.185489 | 1) ==========> |.22133.185489 | 1) | smp_apic_timer_interrupt() {
[ 9504.605637] 22133.185489 | 1) 0.088 us | kvm_guest_apic_eoi_write();
[ 9504.605641] 22133.185490 | 1) | irq_enter() {
[ 9504.605646] 22133.185490 | 1) 0.130 us | rcu_irq_enter();
<snip>
[ 9504.606348] CPU:0 [LOST 902542527 EVENTS].22133.188647 | 0) 0.096 us | } /* inc_zone_page_state */
[ 9504.606354] 22133.188647 | 0) 0.088 us | mem_cgroup_end_page_stat();


201504290305 shows the call, and it doesn't even seem like CPU1 ever responds
to it.

[ 643.522880] 22837.075483 | 0) | flush_tlb_page() {
[ 643.522885] 22837.075484 | 1) 0.160 us | _raw_spin_unlock_irqrestore();
[ 643.522890] 22837.075485 | 0) | native_flush_tlb_others() {
[ 643.522895] 22837.075485 | 1) 0.083 us | _raw_spin_unlock_irqrestore();
[ 643.522899] 22837.075486 | 0) | smp_call_function_many() {
[ 643.522904] 22837.075486 | 1) 0.145 us | mem_cgroup_update_page_stat();
[ 643.522908] 22837.075487 | 0) | smp_call_function_single() {
[ 643.522913] 22837.075487 | 1) 0.255 us | dec_zone_page_state();
[ 643.522918] 22837.075487 | 0) 0.288 us | generic_exec_single();
[ 643.522923] 22837.075488 | 1) 0.336 us | inc_zone_page_state();
[ 643.522927] 22837.075489 | 0) | csd_lock_wait.isra.4() {
[ 643.522931] 22837.075490 | 1) 0.115 us | mem_cgroup_end_page_stat();
[ 643.522935] 22837.075491 | 1) + 14.409 us | }
[ 643.522940] 22837.075491 | 1) 0.113 us | __wake_up_bit();
[ 643.522943] 22837.075492 | 1) + 16.781 us | }
[ 643.522948] 22837.075493 | 1) | end_page_writeback() {
[ 643.522952] 22837.075494 | 1) | test_clear_page_writeback() {
[ 643.522957] 22837.075494 | 1) 0.110 us | page_mapping();
[ 643.522962] 22837.075495 | 1) 0.135 us | mem_cgroup_begin_page_stat();
[ 643.522967] 22837.075496 | 1) | inode_to_bdi() {
[ 643.522972] 22837.075497 | 1) 0.115 us | sb_is_blkdev_sb();


Thanks,
--chris j arges

2015-05-11 14:00:12

by Ingo Molnar

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks


* Chris J Arges <[email protected]> wrote:

> Later in the trace we see the same call followed by
> vmx_handle_external_intr() ignoring the call:
>
> [ 603.248016] 2452.083823 | 0) | ptep_clear_flush() {
> [ 603.248016] 2452.083824 | 0) | flush_tlb_page() {
> [ 603.248016] 2452.083824 | 0) 0.109 us | leave_mm();
> [ 603.248016] 2452.083824 | 0) | native_flush_tlb_others() {
> [ 603.248016] 2452.083824 | 0) | smp_call_function_many() {
> [ 603.248016] 2452.083825 | 0) | smp_call_function_single() {
> [ 603.248016] 2452.083825 | 0) | generic_exec_single() {
> [ 603.248016] 2452.083825 | 0) | native_send_call_func_single_ipi() {
> [ 603.248016] 2452.083825 | 0) | x2apic_send_IPI_mask() {
> [ 603.248016] 2452.083826 | 0) 1.625 us | __x2apic_send_IPI_mask();
> [ 603.248016] 2452.083828 | 0) 2.173 us | }
> [ 603.248016] 2452.083828 | 0) 2.588 us | }
> [ 603.248016] 2452.083828 | 0) 3.082 us | }
> [ 603.248016] 2452.083828 | 0) | csd_lock_wait.isra.4() {
> [ 603.248016] 2452.083848 | 1) + 44.033 us | }
> [ 603.248016] 2452.083849 | 1) 0.975 us | vmx_read_l1_tsc();
> [ 603.248016] 2452.083851 | 1) 1.031 us | vmx_handle_external_intr();
> [ 603.248016] 2452.083852 | 1) 0.234 us | __srcu_read_lock();
> [ 603.248016] 2452.083853 | 1) | vmx_handle_exit() {
> [ 603.248016] 2452.083854 | 1) | handle_ept_violation() {
> [ 603.248016] 2452.083856 | 1) | kvm_mmu_page_fault() {
> [ 603.248016] 2452.083856 | 1) | tdp_page_fault() {
> [ 603.248016] 2452.083856 | 1) 0.092 us | mmu_topup_memory_caches();
> [ 603.248016] 2452.083857 | 1) | gfn_to_memslot_dirty_bitmap.isra.84() {
> [ 603.248016] 2452.083857 | 1) 0.231 us | gfn_to_memslot();
> [ 603.248016] 2452.083858 | 1) 0.774 us | }
>
> So potentially, CPU0 generated an interrupt that caused
> vcpu_enter_guest to be called on CPU1. However, when
> vmx_handle_external_intr was called, it didn't progress any further.

So the IPI does look to be lost in the KVM code?

So why did vmx_handle_external_intr() skip the irq injection - were
IRQs disabled in the guest perhaps?

> Another experiment here would be to dump
> vmcs_read32(VM_EXIT_INTR_INFO); to see why we don't handle the
> interrupt.

Possibly, but also to instrument the KVM IRQ injection code to see
when it skips an IPI and why.

Thanks,

Ingo

2015-05-20 18:20:28

by Chris J Arges

[permalink] [raw]
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

On Mon, May 11, 2015 at 04:00:03PM +0200, Ingo Molnar wrote:
> > So potentially, CPU0 generated an interrupt that caused
> > vcpu_enter_guest to be called on CPU1. However, when
> > vmx_handle_external_intr was called, it didn't progress any further.
>
> So the IPI does look to be lost in the KVM code?
>
> So why did vmx_handle_external_intr() skip the irq injection - were
> IRQs disabled in the guest perhaps?
>
> > Another experiment here would be to dump
> > vmcs_read32(VM_EXIT_INTR_INFO); to see why we don't handle the
> > interrupt.
>
> Possibly, but also to instrument the KVM IRQ injection code to see
> when it skips an IPI and why.
>
> Thanks,
>
> Ingo
>

Ingo,
Unfortunately, I no longer have access to the reproducer machine. I'll try to
locate additional hardware that reproduces the same situation, but it may take
some time.
Thanks,
--chris