LinuxLists.cc - smp_call_function

2015-02-11 13:19:14

Subject: smp_call_function_single lockups

Linus, Thomas, Jens..

During the 3.18 - 3.19 "frequent lockups discussion", in some point
you have observed csd_lock() and csd_unlock() possible synchronization
problems. I think we have managed to reproduce that issue in a
constant basis with 3.13 (ubuntu) and 3.19 (latest vanilla).

- When running "open-stack tempest" in a nested-kvm environment we are
able to cause a lockup in question of hours (from 2 to 20 hours
usually). Trace from nested hypervisor (ubuntu 3.13):

crash> bt
PID: 29130 TASK: ffff8804288ac800 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88043fd03d18] machine_kexec at ffffffff8104ac02
#1 [ffff88043fd03d68] crash_kexec at ffffffff810e7203
#2 [ffff88043fd03e30] panic at ffffffff81719ff4
#3 [ffff88043fd03ea8] watchdog_timer_fn at ffffffff8110d7c5
#4 [ffff88043fd03ed8] __run_hrtimer at ffffffff8108e787
#5 [ffff88043fd03f18] hrtimer_interrupt at ffffffff8108ef4f
#6 [ffff88043fd03f80] local_apic_timer_interrupt at ffffffff81043537
#7 [ffff88043fd03f98] smp_apic_timer_interrupt at ffffffff81733d4f
#8 [ffff88043fd03fb0] apic_timer_interrupt at ffffffff817326dd
--- <IRQ stack> ---
#9 [ffff8804284a3bc8] apic_timer_interrupt at ffffffff817326dd
[exception RIP: generic_exec_single+130]
RIP: ffffffff810dbe62 RSP: ffff8804284a3c70 RFLAGS: 00000202
RAX: 0000000000000002 RBX: ffff8804284a3c40 RCX: 0000000000000001
RDX: ffffffff8180ad60 RSI: 0000000000000000 RDI: 0000000000000286
RBP: ffff8804284a3ca0 R8: ffffffff8180ad48 R9: 0000000000000001
R10: ffffffff81185cac R11: ffffea00109b4a00 R12: ffff88042829f400
R13: 0000000000000000 R14: ffffea001017d640 R15:
0000000000000005 ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff8804284a3ca8] smp_call_function_single at ffffffff810dbf75
#11 [ffff8804284a3d20] smp_call_function_many at ffffffff810dc3a6
#12 [ffff8804284a3d80] native_flush_tlb_others at ffffffff8105c8f7
#13 [ffff8804284a3da8] flush_tlb_mm_range at ffffffff8105c9cb
#14 [ffff8804284a3dd8] tlb_flush_mmu at ffffffff811755b3
#15 [ffff8804284a3e00] tlb_finish_mmu at ffffffff81176145
#16 [ffff8804284a3e20] unmap_region at ffffffff8117e013
#17 [ffff8804284a3ee0] do_munmap at ffffffff81180356
#18 [ffff8804284a3f30] vm_munmap at ffffffff81180521
#19 [ffff8804284a3f60] sys_munmap at ffffffff81181482
#20 [ffff8804284a3f80] system_call_fastpath at ffffffff8173196d
RIP: 00007fa3ed16c587 RSP: 00007fa3536f5c10 RFLAGS: 00000246
RAX: 000000000000000b RBX: ffffffff8173196d RCX: 0000001d00000007
RDX: 0000000000000000 RSI: 0000000000801000 RDI: 00007fa315ff4000
RBP: 00007fa3167f49c0 R8: 0000000000000000 R9: 00007fa3f5396738
R10: 00007fa3536f5a60 R11: 0000000000000202 R12: 00007fa3ed6562a0
R13: 00007fa350ef19c0 R14: ffffffff81181482 R15: ffff8804284a3f78
ORIG_RAX: 000000000000000b CS: 0033 SS: 002b

- After applying patch provided by Thomas we were able to cause the
lockup only after 6 days (also locked inside
smp_call_function_single). Test performance (even for a nested kvm)
was reduced substantially with 3.19 + this patch. Trace from the
nested hypervisor (3.19 + patch):

crash> bt
PID: 10467 TASK: ffff880817b3b1c0 CPU: 1 COMMAND: "qemu-system-x86"
#0 [ffff88083fd03cc0] machine_kexec at ffffffff81052052
#1 [ffff88083fd03d10] crash_kexec at ffffffff810f91c3
#2 [ffff88083fd03de0] panic at ffffffff8176f713
#3 [ffff88083fd03e60] watchdog_timer_fn at ffffffff8112316b
#4 [ffff88083fd03ea0] __run_hrtimer at ffffffff810da087
#5 [ffff88083fd03ef0] hrtimer_interrupt at ffffffff810da467
#6 [ffff88083fd03f70] local_apic_timer_interrupt at ffffffff81049769
#7 [ffff88083fd03f90] smp_apic_timer_interrupt at ffffffff8177fc25
#8 [ffff88083fd03fb0] apic_timer_interrupt at ffffffff8177dcbd
--- <IRQ stack> ---
#9 [ffff880817973a68] apic_timer_interrupt at ffffffff8177dcbd
[exception RIP: generic_exec_single+218]
RIP: ffffffff810ee0ca RSP: ffff880817973b18 RFLAGS: 00000202
RAX: 0000000000000002 RBX: 0000000000000292 RCX: 0000000000000001
RDX: ffffffff8180e6e0 RSI: 0000000000000000 RDI: 0000000000000292
RBP: ffff880817973b58 R8: ffffffff8180e6c8 R9: 0000000000000001
R10: 000000000000b6e0 R11: 0000000000000001 R12: ffffffff811f6626
R13: ffff880817973ab8 R14: ffffffff8109cfd2 R15: ffff880817973a78
ORIG_RAX: ffffffffffffff10 CS: 0010 SS: 0018
#10 [ffff880817973b60] smp_call_function_single at ffffffff810ee1c7
#11 [ffff880817973b90] loaded_vmcs_clear at ffffffffa0309097 [kvm_intel]
#12 [ffff880817973ba0] vmx_vcpu_load at ffffffffa030defe [kvm_intel]
#13 [ffff880817973be0] kvm_arch_vcpu_load at ffffffffa01eba53 [kvm]
#14 [ffff880817973c00] kvm_sched_in at ffffffffa01d94a9 [kvm]
#15 [ffff880817973c20] finish_task_switch at ffffffff81099148
#16 [ffff880817973c60] __schedule at ffffffff817781ec
#17 [ffff880817973cd0] schedule at ffffffff81778699
#18 [ffff880817973ce0] kvm_vcpu_block at ffffffffa01d8dfd [kvm]
#19 [ffff880817973d40] kvm_arch_vcpu_ioctl_run at ffffffffa01ef64c [kvm]
#20 [ffff880817973e10] kvm_vcpu_ioctl at ffffffffa01dbc19 [kvm]
#21 [ffff880817973eb0] do_vfs_ioctl at ffffffff811f5948
#22 [ffff880817973f30] sys_ioctl at ffffffff811f5be1
#23 [ffff880817973f80] system_call_fastpath at ffffffff8177cc2d
RIP: 00007f42f987fec7 RSP: 00007f42ef1bebd8 RFLAGS: 00000246
RAX: ffffffffffffffda RBX: ffffffff8177cc2d RCX: ffffffffffffffff
RDX: 0000000000000000 RSI: 000000000000ae80 RDI: 000000000000000e
RBP: 00007f430047b040 R8: 0000000000000000 R9: 00000000000000ff
R10: 0000000000000000 R11: 0000000000000246 R12: 00007f42ff920240
R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000001
ORIG_RAX: 0000000000000010 CS: 0033 SS: 002b

Not sure if you are still pursuing this, anyway.. let me know if you
think of any other change, I'll keep the environment.

-Tinoco

On Mon, 17 Nov 2014, Thomas Gleixner wrote:
> On Mon, 17 Nov 2014, Linus Torvalds wrote:
> > llist_for_each_entry_safe(csd, csd_next, entry, llist) {
> > - csd->func(csd->info);
> > + smp_call_func_t func = csd->func;
> > + void *info = csd->info;
> > csd_unlock(csd);
> > +
> > + func(info);
>
> No, that won't work for synchronous calls:
>
> CPU 0 CPU 1
>
> csd_lock(csd);
> queue_csd();
> ipi();
> func = csd->func;
> info = csd->info;
> csd_unlock(csd);
> csd_lock_wait();
> func(info);
>
> The csd_lock_wait() side will succeed and therefor assume that the
> call has been completed while the function has not been called at
> all. Interesting explosions to follow.
>
> The proper solution is to revert that commit and properly analyze the
> problem which Jens was trying to solve and work from there.

So a combo of both (Jens and yours) might do the trick. Patch below.

I think what Jens was trying to solve is:

CPU 0 CPU 1

csd_lock(csd);
queue_csd();
ipi();
csd->func(csd->info);
wait_for_completion(csd);
complete(csd);
reuse_csd(csd);
csd_unlock(csd);
Thanks,

tglx

Index: linux/kernel/smp.c
===================================================================
--- linux.orig/kernel/smp.c
+++ linux/kernel/smp.c
@@ -126,7 +126,7 @@ static void csd_lock(struct call_single_

static void csd_unlock(struct call_single_data *csd)
{
- WARN_ON((csd->flags & CSD_FLAG_WAIT) && !(csd->flags & CSD_FLAG_LOCK));
+ WARN_ON(!(csd->flags & CSD_FLAG_LOCK));

/*
* ensure we're all done before releasing data:
@@ -250,8 +250,23 @@ static void flush_smp_call_function_queu
}

llist_for_each_entry_safe(csd, csd_next, entry, llist) {
- csd->func(csd->info);
- csd_unlock(csd);
+
+ /*
+ * For synchronous calls we are not allowed to unlock
+ * before the callback returned. For the async case
+ * its the responsibility of the caller to keep
+ * csd->info consistent while the callback runs.
+ */
+ if (csd->flags & CSD_FLAG_WAIT) {
+ csd->func(csd->info);
+ csd_unlock(csd);
+ } else {
+ smp_call_func_t func = csd->func;
+ void *info = csd->info;
+
+ csd_unlock(csd);
+ func(info);
+ }
}

/*

2015-02-11 18:18:47

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 5:19 AM, Rafael David Tinoco <[email protected]> wrote:
>
> - After applying patch provided by Thomas we were able to cause the
> lockup only after 6 days (also locked inside
> smp_call_function_single). Test performance (even for a nested kvm)
> was reduced substantially with 3.19 + this patch.

I think that just means that the patch from Thomas doesn't change
anything - the reason it takes longer to lock up is just that
performance reduction, so whatever race it is that causes the problem
was just harder to hit, but not fundamentally affected.

I think a more interesting thing to get is the traces from the other
CPU's when this happens. In a virtualized environment, that might be
easier to get than on real hardware, and if you are able to reproduce
this at will - especially with something recent like 3.19, and could
get that, that would be really good.

I'll think about this all, but we couldn't figure anything out last
time we looked at it, so without more clues, don't hold your breath.

That said, it *would* be good if we could get rid of the synchronous
behavior entirely, and make it a rule that if somebody wants to wait
for it, they'll have to do their own waiting. Because I still think
that that CSD_FLAG_WAIT is pure and utter garbage. And I think that
Jens said that it is probably bogus to begin with.

I also don't even see where the CSD_FLAG_WAIT bit woudl ever be
cleared, so it all looks completely buggy anyway.

Does this (COMPLETELY UNTESTED!) attached patch change anything?

Linus

Attachments:

patch.diff (1.42 kB)

2015-02-11 19:59:09

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 10:18 AM, Linus Torvalds
<[email protected]> wrote:
>
> I'll think about this all, but we couldn't figure anything out last
> time we looked at it, so without more clues, don't hold your breath.

So having looked at it once more, one thing struck me:

Look at smp_call_function_single_async(). The comment says

* Like smp_call_function_single(), but the call is asynchonous and
* can thus be done from contexts with disabled interrupts.

but that is *only* true if we don't have to wait for the csd lock. The
comments even clarify that:

* The caller passes his own pre-allocated data structure
* (ie: embedded in an object) and is responsible for synchronizing it
* such that the IPIs performed on the @csd are strictly serialized.

but it's not at all clear that the caller *can* do that. Since the
"csd_unlock()" is done *after* the call to the callback function, any
serialization done by the caller is fundamentally not trustworthy,
since it cannot serialize with the csd lock - if it releases things in
the callback, the csd lock will still be set after releasing things.

So the caller has a really hard time guaranteeing that CSD_LOCK isn't
set. And if the call is done in interrupt context, for all we know it
is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
will never be cleared at all, and csd_lock() will wait forever.

So I actually think that for the async case, we really *should* unlock
before doing the callback (which is what Thomas' old patch did).

And we migth well be better off doing something like

WARN_ON_ONCE(csd->flags & CSD_LOCK);

in smp_call_function_single_async(), because that really is a hard requirement.

And it strikes me that hrtick_csd is one of these cases that do this
with interrupts disabled, and use the callback for serialization. So I
really wonder if this is part of the problem..

Thomas? Am I missing something?

Linus

2015-02-11 20:42:14

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

[ Added Frederic to the cc, since he's touched this file/area most ]

On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
<[email protected]> wrote:
>
> So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> set. And if the call is done in interrupt context, for all we know it
> is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> will never be cleared at all, and csd_lock() will wait forever.
>
> So I actually think that for the async case, we really *should* unlock
> before doing the callback (which is what Thomas' old patch did).
>
> And we migth well be better off doing something like
>
> WARN_ON_ONCE(csd->flags & CSD_LOCK);
>
> in smp_call_function_single_async(), because that really is a hard requirement.
>
> And it strikes me that hrtick_csd is one of these cases that do this
> with interrupts disabled, and use the callback for serialization. So I
> really wonder if this is part of the problem..
>
> Thomas? Am I missing something?

Ok, this is a more involved patch than I'd like, but making the
*caller* do all the CSD maintenance actually cleans things up.

And this is still completely untested, and may be entirely buggy. What
do you guys think?

Linus

Attachments:

patch.diff (4.59 kB)

2015-02-12 16:39:04

by Rafael David Tinoco

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

Meanwhile we'll take the opportunity to run same tests with the
"smp_load_acquire/smp_store_release + outside sync/async" approach
made by your latest patch on top of 3.19. If anything comes up I'll
provide full back traces (2 vcpus).

Here I can only reproduce this inside nested kvm on top of Proliant
DL360 Gen8 machines with:

- no opt out from x2apic (gen8 firmware asks for opting out but HP
says x2apic should be used for >= gen8)
- no intel_idle since proliant firmware is causing NMIs during MWAIT
instructions

As observed before, reducing performance made the problem to be
triggered only after some days so, if nothing goes wrong with
performance this time, I expect to have results in between 10 to 30
hours.

Thank you

Tinoco

On Wed, Feb 11, 2015 at 6:42 PM, Linus Torvalds
<[email protected]> wrote:
>
> [ Added Frederic to the cc, since he's touched this file/area most ]
>
> On Wed, Feb 11, 2015 at 11:59 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > So the caller has a really hard time guaranteeing that CSD_LOCK isn't
> > set. And if the call is done in interrupt context, for all we know it
> > is interrupting the code that is going to clear CSD_LOCK, so CSD_LOCK
> > will never be cleared at all, and csd_lock() will wait forever.
> >
> > So I actually think that for the async case, we really *should* unlock
> > before doing the callback (which is what Thomas' old patch did).
> >
> > And we migth well be better off doing something like
> >
> > WARN_ON_ONCE(csd->flags & CSD_LOCK);
> >
> > in smp_call_function_single_async(), because that really is a hard requirement.
> >
> > And it strikes me that hrtick_csd is one of these cases that do this
> > with interrupts disabled, and use the callback for serialization. So I
> > really wonder if this is part of the problem..
> >
> > Thomas? Am I missing something?
>
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?
>
> Linus

2015-02-18 22:25:56

by Peter Zijlstra

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
> Ok, this is a more involved patch than I'd like, but making the
> *caller* do all the CSD maintenance actually cleans things up.
>
> And this is still completely untested, and may be entirely buggy. What
> do you guys think?

I think it makes perfect sense.

Acked-by: Peter Zijlstra (Intel) <[email protected]>

2015-02-19 15:42:43

by Rafael David Tinoco

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

Linus, Peter, Thomas

Just a quick feedback, We were able to reproduce the lockup with this
proposed patch (3.19 + patch). Unfortunately we had problems with the
core file and I have only the stack trace for now but I think we are
able to reproduce it again and provide more details (sorry for the
delay... after a reboot it took some days for us to reproduce this
again).

It looks like RIP is still smp_call_function_single.

Same environment as before: Nested KVM (2 vcpus) on top of Proliant
DL380G8 with acpi_idle and no x2apic optout.

[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
E 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013] task: ffff88081b9beca0 ti: ffff88081a7a0000 task.ti:
ffff88081a7a0000
[47708.068013] RIP: 0010:[<ffffffff810f537a>] [<ffffffff810f537a>]
smp_call_function_single+0xca/0x120
[47708.068013] RSP: 0018:ffff88081a7a3b38 EFLAGS: 00000202
[47708.068013] RAX: 0000000000000000 RBX: 0000000000000001 RCX: 0000000000000002
[47708.068013] RDX: 0000000000000001 RSI: 0000000000000000 RDI: 0000000000000296
[47708.068013] RBP: ffff88081a7a3b78 R08: ffffffff81815168 R09: ffff880818192000
[47708.068013] R10: 000000000000bdf6 R11: 000000000001bf90 R12: 00080000810b66f8
[47708.068013] R13: 00000000000000fb R14: 0000000000000296 R15: 0000000000000000
[47708.068013] FS: 00007fa143fff700(0000) GS:ffff88083fc00000(0000)
knlGS:0000000000000000
[47708.068013] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[47708.068013] CR2: 00007f5d76f5d050 CR3: 00000008190cc000 CR4: 00000000000426f0
[47708.068013] Stack:
[47708.068013] ffff88083fd151b8 0000000000000001 0000000000000000
ffffffffc0589320
[47708.068013] ffff88081a547a80 0000000000000003 ffff88081a543f80
0000000000000000
[47708.068013] ffff88081a7a3b88 ffffffffc0586097 ffff88081a7a3bc8
ffffffffc058aefe
[47708.068013] Call Trace:
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013] [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013] [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013] [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013] [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013] [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013] [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013] [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013] [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013] [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013] [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013] [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013] [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013] [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013] [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013] [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013] [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013] [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b
[47708.068013] Code: 30 5b 41 5c 5d c3 0f 1f 00 48 8d 75 d0 48 89 d1
89 df 4c 89 e2 e8 57 fe ff ff 0f b7 55 e8 83 e2 01 74 da 66 0f 1f 44
00 00 f3 90 <0f> b7 55 e8 83 e2 01 75 f5 eb c7 0f 1f 00 8b 05 ca e6 dd
00 85
[47708.068013] Kernel panic - not syncing: softlockup: hung tasks
[47708.068013] CPU: 0 PID: 29869 Comm: qemu-system-x86 Tainted: G
EL 3.19.0-c7671cf-lp1413540v2 #31
[47708.068013] Hardware name: OpenStack Foundation OpenStack Nova,
BIOS Bochs 01/01/2011
[47708.068013] ffff88081b9beca0 ffff88083fc03de8 ffffffff817a6bf6
0000000000000000
[47708.068013] ffffffff81ab30d4 ffff88083fc03e68 ffffffff817a1aec
0000000000000e92
[47708.068013] 0000000000000008 ffff88083fc03e78 ffff88083fc03e18
ffff88083fc03e68
[47708.068013] Call Trace:
[47708.068013] <IRQ> [<ffffffff817a6bf6>] dump_stack+0x45/0x57
[47708.068013] [<ffffffff817a1aec>] panic+0xc1/0x1f5
[47708.068013] [<ffffffff8112ba0b>] watchdog_timer_fn+0x1db/0x1f0
[47708.068013] [<ffffffff810e0e37>] __run_hrtimer+0x77/0x1d0
[47708.068013] [<ffffffff8112b830>] ? watchdog+0x30/0x30
[47708.068013] [<ffffffff810e1203>] hrtimer_interrupt+0xf3/0x220
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffff8104b0a9>] local_apic_timer_interrupt+0x39/0x60
[47708.068013] [<ffffffff817b1fb5>] smp_apic_timer_interrupt+0x45/0x60
[47708.068013] [<ffffffff817b002d>] apic_timer_interrupt+0x6d/0x80
[47708.068013] <EOI> [<ffffffff810f537a>] ?
smp_call_function_single+0xca/0x120
[47708.068013] [<ffffffff810f5369>] ? smp_call_function_single+0xb9/0x120
[47708.068013] [<ffffffffc0589320>] ?
copy_shadow_to_vmcs12+0x110/0x110 [kvm_intel]
[47708.068013] [<ffffffffc0586097>] loaded_vmcs_clear+0x27/0x30 [kvm_intel]
[47708.068013] [<ffffffffc058aefe>] vmx_vcpu_load+0x17e/0x1a0 [kvm_intel]
[47708.068013] [<ffffffff810a918d>] ? set_next_entity+0x9d/0xb0
[47708.068013] [<ffffffffc04660e3>] kvm_arch_vcpu_load+0x33/0x1f0 [kvm]
[47708.068013] [<ffffffffc0452529>] kvm_sched_in+0x39/0x40 [kvm]
[47708.068013] [<ffffffff8109e8e8>] finish_task_switch+0x98/0x1a0
[47708.068013] [<ffffffff817aa81b>] __schedule+0x33b/0x900
[47708.068013] [<ffffffff817aae17>] schedule+0x37/0x90
[47708.068013] [<ffffffffc0451e7d>] kvm_vcpu_block+0x6d/0xb0 [kvm]
[47708.068013] [<ffffffff810b6ec0>] ? prepare_to_wait_event+0x110/0x110
[47708.068013] [<ffffffffc0469d3c>] kvm_arch_vcpu_ioctl_run+0x10c/0x1290 [kvm]
[47708.068013] [<ffffffffc04551ce>] kvm_vcpu_ioctl+0x2ce/0x670 [kvm]
[47708.068013] [<ffffffff811ef441>] ? new_sync_write+0x81/0xb0
[47708.068013] [<ffffffff812034e8>] do_vfs_ioctl+0x2f8/0x510
[47708.068013] [<ffffffff811f2215>] ? __sb_end_write+0x35/0x70
[47708.068013] [<ffffffffc045cf84>] ? kvm_on_user_return+0x74/0x80 [kvm]
[47708.068013] [<ffffffff81203781>] SyS_ioctl+0x81/0xa0
[47708.068013] [<ffffffff817aefad>] system_call_fastpath+0x16/0x1b

Tks
Rafael Tinoco

On Wed, Feb 18, 2015 at 8:25 PM, Peter Zijlstra <[email protected]> wrote:
> On Wed, Feb 11, 2015 at 12:42:10PM -0800, Linus Torvalds wrote:
>> Ok, this is a more involved patch than I'd like, but making the
>> *caller* do all the CSD maintenance actually cleans things up.
>>
>> And this is still completely untested, and may be entirely buggy. What
>> do you guys think?
>
> I think it makes perfect sense.
>
> Acked-by: Peter Zijlstra (Intel) <[email protected]>

2015-02-19 16:14:48

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>
> Just a quick feedback, We were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
>
> It looks like RIP is still smp_call_function_single.

Hmm. Still just the stack trace for the CPU that is blocked (CPU0), if
you can get the core-file to work and figure out where the other CPU
is, that would be good.

Linus

2015-02-19 16:16:15

by Peter Zijlstra

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 01:42:39PM -0200, Rafael David Tinoco wrote:
> Linus, Peter, Thomas
>
> Just a quick feedback, We were able to reproduce the lockup with this
> proposed patch (3.19 + patch). Unfortunately we had problems with the
> core file and I have only the stack trace for now but I think we are
> able to reproduce it again and provide more details (sorry for the
> delay... after a reboot it took some days for us to reproduce this
> again).
>
> It looks like RIP is still smp_call_function_single.

So Linus' patch mostly fixes smp_call_function_single_async() which is
not what you're using.

It would be very good to see traces of other CPUs; if for some reason
the target CPU doesn't get around to running your callback, then we'll
forever wait on it.

loaded_vmcs_clear() uses smp_call_function_single(.wait = 1), that
should work as before.

2015-02-19 16:26:25

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>
> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
> DL380G8 with acpi_idle and no x2apic optout.

Btw, which apic model does that end up using? Does "no x2apic optout"
mean you're using the x2apic?

What does "dmesg | grep apic" report? Something like

Switched APIC routing to cluster x2apic.

or what?

Side note to the apic guys: I think the "single CPU" case ends up
being one of the most important ones, but the stupid APIC model
doesn't allow that, so sending an IPI to a single CPU ends up being
"send a mask with a single bit set", and then we have that horrible
"for_each_cpu(cpu, mask)" crap.

Would it make sense to perhaps add a "send_IPI_single()" function
call, and then for the APIC models that always are based on masks, use
a wrapper that just does that "cpumask_of(cpu)" thing..

Hmm?

Linus

2015-02-19 16:32:10

by Rafael David Tinoco

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

For the host, we are using "intremap=no_x2apic_optout
intel_idle.max_cstate=0" for cmdline. It looks like that DL360/DL380
Gen8 firmware still asks to optout from x2apic but HP engineering team
said that using x2apic for Gen8 would be ok (intel_idle causes these
servers to generate NMIs when idling, probably related to packed
c-states and this server's dependency on acpi tables for c-state).

Feb 19 08:21:28 derain kernel: [ 3.504676] Enabled IRQ remapping in
x2apic mode
Feb 19 08:21:28 derain kernel: [ 3.565451] Enabling x2apic
Feb 19 08:21:28 derain kernel: [ 3.602134] Enabled x2apic
Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
cluster x2apic.

On Thu, Feb 19, 2015 at 2:26 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 7:42 AM, Rafael David Tinoco <[email protected]> wrote:
>>
>> Same environment as before: Nested KVM (2 vcpus) on top of Proliant
>> DL380G8 with acpi_idle and no x2apic optout.
>
> Btw, which apic model does that end up using? Does "no x2apic optout"
> mean you're using the x2apic?
>
> What does "dmesg | grep apic" report? Something like
>
> Switched APIC routing to cluster x2apic.
>
> or what?
>
> Side note to the apic guys: I think the "single CPU" case ends up
> being one of the most important ones, but the stupid APIC model
> doesn't allow that, so sending an IPI to a single CPU ends up being
> "send a mask with a single bit set", and then we have that horrible
> "for_each_cpu(cpu, mask)" crap.
>
> Would it make sense to perhaps add a "send_IPI_single()" function
> call, and then for the APIC models that always are based on masks, use
> a wrapper that just does that "cpumask_of(cpu)" thing..
>
> Hmm?
>
> Linus

2015-02-19 16:59:19

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <[email protected]> wrote:
> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
> cluster x2apic.

Ok. That "cluster x2apic" mode is just about the nastiest mode when it
comes to sending a single ipi. We do that insane dance where we

- turn single cpu number into cpumask
- copy the cpumask to a percpu temporary storage
- walk each cpu in the cpumask
- for each cpu, look up the cluster siblings
- for each cluster sibling that is also in the cpumask, look up the
logical apic mask and add it to the actual ipi destination mask
- send an ipi to that final mask.

which is just insane. It's complicated, it's fragile, and it's unnecessary.

If we had a simple "send_IPI()" function, we could do this all with
something much saner, and it would look sopmething like

static void x2apic_send_IPI(int cpu, int vector)
{
u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
x2apic_wrmsr_fence();
__x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
}

and then 'void native_send_call_func_single_ipi()' would just look like

void native_send_call_func_single_ipi(int cpu)
{
apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
}

but I might have missed something (and we might want to have a wrapper
that says "if the apic doesn't have a 'send_IPI' function, use
"send_IPI_mask(cpumask_of(cpu, vector) instead"

The fact that you need that no_x2apic_optout (which in turn means that
your ACPI tables seem to say "don't use x2apic") also makes me worry.

Are there known errata for the x2apic?

Linus

2015-02-19 17:30:46

by Rafael David Tinoco

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

I could only find an advisory (regarding sr-iov and irq remaps) from
HP to RHEL6.2 users stating that Gen8 firmware does not enable it by
default.

http://h20564.www2.hp.com/hpsc/doc/public/display?docId=emr_na-c03645796

"""
The interrupt remapping capability depends on x2apic enabled in the
BIOS and HP ProLiant Gen8 systems do not enable x2apic; therefore, the
following workaround is required for device assignment: Edit the
/etc/grub.conf and add intremap=no_x2apic_optout option to the kernel
command line options.
"""

Probably for backwards compatibility... not sure if there is an option
to enable/disable it in firmware (like DL390 seems to have).
I don't think so... but I was told by HP team that I should use x2apic
for >= Gen8.

On Thu, Feb 19, 2015 at 2:59 PM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 8:32 AM, Rafael David Tinoco <[email protected]> wrote:
>> Feb 19 08:21:28 derain kernel: [ 3.637682] Switched APIC routing to
>> cluster x2apic.
>
> Ok. That "cluster x2apic" mode is just about the nastiest mode when it
> comes to sending a single ipi. We do that insane dance where we
>
> - turn single cpu number into cpumask
> - copy the cpumask to a percpu temporary storage
> - walk each cpu in the cpumask
> - for each cpu, look up the cluster siblings
> - for each cluster sibling that is also in the cpumask, look up the
> logical apic mask and add it to the actual ipi destination mask
> - send an ipi to that final mask.
>
> which is just insane. It's complicated, it's fragile, and it's unnecessary.
>
> If we had a simple "send_IPI()" function, we could do this all with
> something much saner, and it would look sopmething like
>
> static void x2apic_send_IPI(int cpu, int vector)
> {
> u32 dest = per_cpu(x86_cpu_to_logical_apicid, cpu);
> x2apic_wrmsr_fence();
> __x2apic_send_IPI_dest(dest, vector, APIC_DEST_LOGICAL);
> }
>
> and then 'void native_send_call_func_single_ipi()' would just look like
>
> void native_send_call_func_single_ipi(int cpu)
> {
> apic->send_IPI(cpu, CALL_FUNCTION_SINGLE_VECTOR);
> }
>
> but I might have missed something (and we might want to have a wrapper
> that says "if the apic doesn't have a 'send_IPI' function, use
> "send_IPI_mask(cpumask_of(cpu, vector) instead"
>
> The fact that you need that no_x2apic_optout (which in turn means that
> your ACPI tables seem to say "don't use x2apic") also makes me worry.
>
> Are there known errata for the x2apic?
>
> Linus

2015-02-19 17:39:20

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
<[email protected]> wrote:
>
> Are there known errata for the x2apic?

.. and in particular, do we still have to worry about the traditional
local apic "if there are more than two pending interrupts per priority
level, things get lost" problem?

I forget the exact details. Hopefully somebody remembers.

Linus

2015-02-19 20:29:37

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 9:39 AM, Linus Torvalds
<[email protected]> wrote:
> On Thu, Feb 19, 2015 at 8:59 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Are there known errata for the x2apic?
>
> .. and in particular, do we still have to worry about the traditional
> local apic "if there are more than two pending interrupts per priority
> level, things get lost" problem?
>
> I forget the exact details. Hopefully somebody remembers.

I can't find it in the docs. I find the "two-entries per vector", but
not anything that is per priority level (group of 16 vectors). Maybe
that was the IO-APIC, in which case it's immaterial for IPI's.

However, having now mostly re-acquainted myself with the APIC details,
it strikes me that we do have some oddities here.

In particular, a few interrupt types are very special: NMI, SMI, INIT,
ExtINT, or SIPI are handled early in the interrupt acceptance logic,
and are sent directly to the CPU core, without going through the usual
intermediate IRR/ISR dance.

And why might this matter? It's important because it means that those
kinds of interrupts must *not* do the apic EOI that ack_APIC_irq()
does.

And we correctly don't do ack_APIC_irq() for NMI etc, but it strikes
me that ExtINT is odd and special.

I think we still use ExtINT for some odd cases. We used to have some
magic with the legacy timer interrupt, for example. And I think they
all go through the normal "do_IRQ()" logic regardless of whether they
are ExtINT or not.

Now, what happens if we send an EOI for an ExtINT interrupt? It
basically ends up being a spurious IPI. And I *think* that what
normally happens is absolutely nothing at all. But if in addition to
the ExtINT, there was a pending IPI (or other pending ISR bit set),
maybe we lose interrupts..

.. and it's entirely possible that I'm just completely full of shit.
Who is the poor bastard who has worked most with things like ExtINT,
and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
contacts..

Linus

2015-02-19 21:59:49

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 12:29 PM, Linus Torvalds
<[email protected]> wrote:
>
> Now, what happens if we send an EOI for an ExtINT interrupt? It
> basically ends up being a spurious IPI. And I *think* that what
> normally happens is absolutely nothing at all. But if in addition to
> the ExtINT, there was a pending IPI (or other pending ISR bit set),
> maybe we lose interrupts..
>
> .. and it's entirely possible that I'm just completely full of shit.
> Who is the poor bastard who has worked most with things like ExtINT,
> and can educate me? I'm adding Ingo, hpa and Jiang Liu as primary
> contacts..

So quite frankly, trying to follow all the logic from do_IRQ() through
handle_irq() to the actual low-level handler, I just couldn't do it.

So instead, I wrote a patch to verify that the ISR bit is actually set
when we do ack_APIC_irq().

This was complicated by the fact that we don't actually pass in the
vector number at all to the acking, so 99% of the patch is just doing
that. A couple of places we don't really have a good vector number, so
I said "screw it, a negative value means that we won't check the ISR).

The attached patch is quite possibly garbage, but it gives an
interesting warning for me during i8042 probing, so who knows. Maybe
it actually shows a real problem - or maybe I just screwed up the
patch.

.. and maybe even if the patch is fine, it's actually never really a
problem to have spurious APIC ACK cycles. Maybe it cannot make
interrupts be ignored.

Anyway, the back-trace for the warning I get is during boot:

...
PNP: No PS/2 controller found. Probing ports directly.
------------[ cut here ]------------
WARNING: CPU: 0 PID: 1 at ./arch/x86/include/asm/apic.h:436
ir_ack_apic_edge+0x74/0x80()
Modules linked in:
CPU: 0 PID: 1 Comm: swapper/0 Not tainted
3.19.0-08857-g89d3fa45b4ad-dirty #2
Call Trace:
<IRQ>
dump_stack+0x45/0x57
warn_slowpath_common+0x80/0xc0
warn_slowpath_null+0x15/0x20
ir_ack_apic_edge+0x74/0x80
handle_edge_irq+0x51/0x110
handle_irq+0x74/0x140
do_IRQ+0x4a/0x140
common_interrupt+0x6a/0x6a
<EOI>
? _raw_spin_unlock_irqrestore+0x9/0x10
__setup_irq+0x239/0x5a0
request_threaded_irq+0xc2/0x180
i8042_probe+0x5b8/0x680
platform_drv_probe+0x2f/0xa0
driver_probe_device+0x8b/0x3e0
__driver_attach+0x93/0xa0
bus_for_each_dev+0x63/0xa0
driver_attach+0x19/0x20
bus_add_driver+0x178/0x250
driver_register+0x5f/0xf0
__platform_driver_register+0x45/0x50
__platform_driver_probe+0x26/0xa0
__platform_create_bundle+0xad/0xe0
i8042_init+0x3d0/0x3f6
do_one_initcall+0xb8/0x1d0
kernel_init_freeable+0x16d/0x1fa
kernel_init+0x9/0xf0
ret_from_fork+0x7c/0xb0
---[ end trace 1de82c4457c6a0f0 ]---
serio: i8042 KBD port at 0x60,0x64 irq 1
serio: i8042 AUX port at 0x60,0x64 irq 12
...

and it looks not entirely insane.

Is this worth looking at? Or is it something spurious? I might have
gotten the vectors wrong, and maybe the warning is not because the ISR
bit isn't set, but because I test the wrong bit.

Linus

Attachments:

patch.diff (12.74 kB)

2015-02-19 22:45:56

by Linus Torvalds

[permalink] [raw]

Subject: Re: smp_call_function_single lockups

On Thu, Feb 19, 2015 at 1:59 PM, Linus Torvalds
<[email protected]> wrote:
>
> Is this worth looking at? Or is it something spurious? I might have
> gotten the vectors wrong, and maybe the warning is not because the ISR
> bit isn't set, but because I test the wrong bit.

I edited the patch to do ratelimiting (one per 10s max) rather than
"once". And tested it some more. It seems to work correctly. The irq
case during 8042 probing is not repeatable, and I suspect it happens
because the interrupt source goes away (some probe-time thing that
first triggers an interrupt, but then clears it itself), so it doesn't
happen every boot, and I've gotten it with slightly different
backtraces.

But it's the only warning that happens for me, so I think my code is
right (at least for the cases that trigger on this machine). It's
definitely not a "every interrupt causes the warning because the code
was buggy, and the WARN_ONCE() just printed the first one".

It would be interesting to hear if others see spurious APIC EOI cases
too. In particular, the people seeing the IPI lockup. Because a lot of
the lockups we've seen have *looked* like the IPI interrupt just never
happened, and so we're waiting forever for the target CPU to react to
it. And just maybe the spurious EOI could cause the wrong bit to be
cleared in the ISR, and then the interrupt never shows up. Something
like that would certainly explain why it only happens on some machines
and under certain timing circumstances.

Linus

2015-02-20 09:30:09

Subject: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Attachments:

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Attachments:

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Attachments:

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Attachments:

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Attachments:

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: Re: smp_call_function_single lockups

Subject: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: smp_call_function_single lockups

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: smp_call_function_single lockups

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: [tip:locking/urgent] smp: Fix smp_call_function_single_async() locking

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks

Subject: Re: [PATCH] smp/call: Detect stuck CSD locks