Date: Thu, 16 Apr 2015 10:58:36 -0500
From: Chris J Arges
To: Ingo Molnar
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Message-ID: <20150416155819.GA20490@canonical.com>
In-Reply-To: <20150416110423.GA15760@gmail.com>

On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
> 
> * Chris J Arges wrote:
> 
> > Ingo,
> > 
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 once we've detected the hang.
> > This made dumping the kvm_lapic structures on L0 more reliable.
> 
> Thanks!
> 
> So I have trouble interpreting the L1 backtrace, because it shows
> something entirely new (to me).
> 
> First let's clarify the terminology, to make sure I got the workload
> all right:
> 
>  - L0 is the host kernel, running native Linux.
>    It's not locking up.
> 
>  - L1 is the guest kernel, running virtualized Linux. This is the one
>    that is locking up.
> 
>  - L2 is the nested guest kernel, running whatever test workload you
>    were running - this is obviously locking up together with L1.
> 
> Right?

Yup, this sums it up nicely.

> So with that cleared up, the backtrace on L1 looks like this:
> 
> > * Crash dump backtrace from L1:
> > 
> > crash> bt -a
> > PID: 26  TASK: ffff88013a4f1400  CPU: 0  COMMAND: "ksmd"
> >  #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> >  #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> >  #2 [ffff88013a503b20] panic at ffffffff818068e0
> >  #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> >  #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> >  #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> >  #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> >  #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> >  #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> >  #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> > #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> > #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> > #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98
> 
> So this one, VCPU0, is trying to send an IPI to VCPU1:
> 
> > PID: 1674  TASK: ffff8800ba4a9e00  CPU: 1  COMMAND: "qemu-system-x86"
> >  #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> >  #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> >  #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> >  #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> >  #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> >     [exception RIP: vmx_vcpu_run+992]
> >     RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082
> >     RAX: 0000000080000202  RBX: ffff880139b30000  RCX: ffff880139b30000
> >     RDX: 0000000000000200  RSI: ffff880139b30000  RDI: ffff880139b30000
> >     RBP: ffff88013940bd28  R8: 00007fe192b71110  R9: 00007fe192b71140
> >     R10: 00007fff66407d00  R11: 00007fe1927f0060  R12: 0000000000000000
> >     R13: 0000000000000001  R14: 0000000000000001  R15: 0000000000000000
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > --- ---
> >  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> >  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> >  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> >  #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> >  #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> > #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> > #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d
> 
> So the problem here that I can see is that L1's VCPU1 appears to be
> looping with interrupts disabled:
> 
> > RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082
> 
> Look how RFLAGS doesn't have 0x200 set - so it's executing with
> interrupts disabled.
> 

I've run this with L0: 3.13+b6b8a145, L1: 4.0+debug patches, and got a
similar backtrace with interrupts disabled; however this _may_ be
another issue.
I ran L0/L1: 3.13+b6b8a145 and got something like this:

PID: 36  TASK: ffff8801396a9800  CPU: 0  COMMAND: "ksmd"
 #0 [ffff88013fc03d18] machine_kexec at ffffffff8104ace2
 #1 [ffff88013fc03d68] crash_kexec at ffffffff810e72e3
 #2 [ffff88013fc03e30] panic at ffffffff8171a404
 #3 [ffff88013fc03ea8] watchdog_timer_fn at ffffffff8110daa5
 #4 [ffff88013fc03ed8] __run_hrtimer at ffffffff8108e7a7
 #5 [ffff88013fc03f18] hrtimer_interrupt at ffffffff8108ef6f
 #6 [ffff88013fc03f80] local_apic_timer_interrupt at ffffffff81043617
 #7 [ffff88013fc03f98] smp_apic_timer_interrupt at ffffffff8173414f
 #8 [ffff88013fc03fb0] apic_timer_interrupt at ffffffff81732add
--- ---
 #9 [ffff880137fd3b08] apic_timer_interrupt at ffffffff81732add
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbf52  RSP: ffff880137fd3bb0  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff880137fd3b80  RCX: 0000000000000002
    RDX: ffffffff8180ad80  RSI: 0000000000000000  RDI: 0000000000000282
    RBP: ffff880137fd3be0  R8: ffffffff8180ad68  R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000002  R12: 0000000000000000
    R13: 0000000000000002  R14: 0000000000000000  R15: ffff880137fd3bd0
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff880137fd3be8] smp_call_function_single at ffffffff810dc065
#11 [ffff880137fd3c60] smp_call_function_many at ffffffff810dc496
#12 [ffff880137fd3cc0] make_all_cpus_request at ffffffffa004e5ef [kvm]
#13 [ffff880137fd3cf8] kvm_flush_remote_tlbs at ffffffffa004e620 [kvm]
#14 [ffff880137fd3d18] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e6a2 [kvm]
#15 [ffff880137fd3d58] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#16 [ffff880137fd3d90] try_to_merge_with_ksm_page at ffffffff8119d296
#17 [ffff880137fd3e00] ksm_do_scan at ffffffff8119d749
#18 [ffff880137fd3e78] ksm_scan_thread at ffffffff8119e0cf
#19 [ffff880137fd3ed0] kthread at ffffffff8108b592
#20 [ffff880137fd3f50] ret_from_fork at ffffffff81731ccc

PID: 1543  TASK: ffff880137861800  CPU: 1  COMMAND: "qemu-system-x86"
 #0 [ffff88013fd05e58] crash_nmi_callback at ffffffff81040082
 #1 [ffff88013fd05e68] nmi_handle at ffffffff8172aa38
 #2 [ffff88013fd05ec8] do_nmi at ffffffff8172ac00
 #3 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81729ea1
    [exception RIP: _raw_spin_lock+50]
    RIP: ffffffff817292b2  RSP: ffff8800372ab8f8  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8800372ab8f8  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff817292b2  R8: ffffffff817292b2  R9: 0000000000000018
    R10: ffff8800372ab8f8  R11: 0000000000000202  R12: ffffffffffffffff
    R13: ffff8800b8f78000  R14: 00000000000070ae  R15: 00000000000070ae
    ORIG_RAX: 00000000000070ae  CS: 0010  SS: 0018
--- ---
 #4 [ffff8800372ab8f8] _raw_spin_lock at ffffffff817292b2
 #5 [ffff8800372ab900] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e67e [kvm]
 #6 [ffff8800372ab940] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
 #7 [ffff8800372ab978] do_wp_page at ffffffff81177fee

A previous backtrace from a 3.19 series kernel, showing interrupts
enabled on both CPUs on L1, is here:
https://lkml.org/lkml/2015/2/23/234
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt

> That is why the IPI does not get through to it, but kdump's NMI had no
> problem getting through.
> 
> This (assuming all backtraces are exact!):
> 
> >  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> >  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> >  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> 
> suggests that we called vmx_vcpu_run() from
> vmx_handle_external_intr(), and that we are executing L2 guest code
> with interrupts disabled.
> 
> How is this supposed to work? What mechanism does KVM have against an
> (untrusted) guest interrupt handler locking up?
> I might be misunderstanding how this works at the KVM level, but from
> the APIC perspective the situation appears to be pretty clear: CPU1's
> interrupts are turned off, so it cannot receive IPIs, and the CSD wait
> will eventually time out.
> 
> Now obviously it appears to be anomalous (assuming my analysis is
> correct) that the interrupt handler has locked up, but it's
> immaterial: a nested kernel must not allow its guest to lock it up.
> 
> Thanks,
> 
> 	Ingo

Yes, I think at this point I'll go through the various backtraces and
try to narrow things down. I think overall we're seeing a single effect
from multiple code paths.

--chris j arges