Date: Thu, 16 Apr 2015 10:58:36 -0500
From: Chris J Arges
To: Ingo Molnar
Cc: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
	Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker,
	Gema Gomez, the arch/x86 maintainers
Subject: Re: [PATCH] smp/call: Detect stuck CSD locks
Message-ID: <20150416155819.GA20490@canonical.com>
In-Reply-To: <20150416110423.GA15760@gmail.com>

On Thu, Apr 16, 2015 at 01:04:23PM +0200, Ingo Molnar wrote:
> 
> * Chris J Arges wrote:
> 
> > Ingo,
> > 
> > Below are the patches and data I've gathered from the reproducer. My
> > methodology was as described previously; however I used gdb on the
> > qemu process in order to breakpoint L1 once we've detected the hang.
> > This made dumping the kvm_lapic structures on L0 more reliable.
> 
> Thanks!
> 
> So I have trouble interpreting the L1 backtrace, because it shows
> something entirely new (to me).
> 
> First let's clarify the terminology, to make sure I got the workload
> all right:
> 
>  - L0 is the host kernel, running native Linux.
>    It's not locking up.
> 
>  - L1 is the guest kernel, running virtualized Linux. This is the one
>    that is locking up.
> 
>  - L2 is the nested guest kernel, running whatever test workload you
>    were running - this is obviously locking up together with L1.
> 
> Right?

Yup, this sums it up nicely.

> So with that cleared up, the backtrace on L1 looks like this:
> 
> > * Crash dump backtrace from L1:
> > 
> > crash> bt -a
> > PID: 26  TASK: ffff88013a4f1400  CPU: 0  COMMAND: "ksmd"
> >  #0 [ffff88013a5039f0] machine_kexec at ffffffff8109d3ec
> >  #1 [ffff88013a503a50] crash_kexec at ffffffff8114a763
> >  #2 [ffff88013a503b20] panic at ffffffff818068e0
> >  #3 [ffff88013a503ba0] csd_lock_wait at ffffffff8113f1e4
> >  #4 [ffff88013a503bf0] generic_exec_single at ffffffff8113f2d0
> >  #5 [ffff88013a503c60] smp_call_function_single at ffffffff8113f417
> >  #6 [ffff88013a503c90] smp_call_function_many at ffffffff8113f7a4
> >  #7 [ffff88013a503d20] flush_tlb_page at ffffffff810b3bf9
> >  #8 [ffff88013a503d50] ptep_clear_flush at ffffffff81205e5e
> >  #9 [ffff88013a503d80] try_to_merge_with_ksm_page at ffffffff8121a445
> > #10 [ffff88013a503e00] ksm_scan_thread at ffffffff8121ac0e
> > #11 [ffff88013a503ec0] kthread at ffffffff810df0fb
> > #12 [ffff88013a503f50] ret_from_fork at ffffffff8180fc98
> 
> So this one, VCPU0, is trying to send an IPI to VCPU1:
> 
> > PID: 1674  TASK: ffff8800ba4a9e00  CPU: 1  COMMAND: "qemu-system-x86"
> >  #0 [ffff88013fd05e20] crash_nmi_callback at ffffffff81091521
> >  #1 [ffff88013fd05e30] nmi_handle at ffffffff81062560
> >  #2 [ffff88013fd05ea0] default_do_nmi at ffffffff81062b0a
> >  #3 [ffff88013fd05ed0] do_nmi at ffffffff81062c88
> >  #4 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81812241
> >     [exception RIP: vmx_vcpu_run+992]
> >     RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082
> >     RAX: 0000000080000202  RBX: ffff880139b30000  RCX: ffff880139b30000
> >     RDX: 0000000000000200  RSI: ffff880139b30000  RDI: ffff880139b30000
> >     RBP: ffff88013940bd28  R8: 00007fe192b71110  R9: 00007fe192b71140
> >     R10: 00007fff66407d00  R11: 00007fe1927f0060  R12: 0000000000000000
> >     R13: 0000000000000001  R14: 0000000000000001  R15: 0000000000000000
> >     ORIG_RAX: ffffffffffffffff  CS: 0010  SS: 0018
> > --- ---
> >  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> >  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> >  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> >  #8 [ffff88013940be00] kvm_vcpu_ioctl at ffffffff81007894
> >  #9 [ffff88013940beb0] do_vfs_ioctl at ffffffff81253190
> > #10 [ffff88013940bf30] sys_ioctl at ffffffff81253411
> > #11 [ffff88013940bf80] system_call_fastpath at ffffffff8180fd4d
> 
> So the problem here that I can see is that L1's VCPU1 appears to be
> looping with interrupts disabled:
> 
> > RIP: ffffffff8104cef0  RSP: ffff88013940bcb8  RFLAGS: 00000082
> 
> Look how RFLAGS doesn't have 0x200 set - so it's executing with
> interrupts disabled.
> 

I've run this with L0: 3.13+b6b8a145, L1: 4.0+debug patches, and got a
similar backtrace with interrupts disabled; however this _may_ be
another issue.
I ran L0/L1: 3.13+b6b8a145 and got something like this:

PID: 36  TASK: ffff8801396a9800  CPU: 0  COMMAND: "ksmd"
 #0 [ffff88013fc03d18] machine_kexec at ffffffff8104ace2
 #1 [ffff88013fc03d68] crash_kexec at ffffffff810e72e3
 #2 [ffff88013fc03e30] panic at ffffffff8171a404
 #3 [ffff88013fc03ea8] watchdog_timer_fn at ffffffff8110daa5
 #4 [ffff88013fc03ed8] __run_hrtimer at ffffffff8108e7a7
 #5 [ffff88013fc03f18] hrtimer_interrupt at ffffffff8108ef6f
 #6 [ffff88013fc03f80] local_apic_timer_interrupt at ffffffff81043617
 #7 [ffff88013fc03f98] smp_apic_timer_interrupt at ffffffff8173414f
 #8 [ffff88013fc03fb0] apic_timer_interrupt at ffffffff81732add
--- ---
 #9 [ffff880137fd3b08] apic_timer_interrupt at ffffffff81732add
    [exception RIP: generic_exec_single+130]
    RIP: ffffffff810dbf52  RSP: ffff880137fd3bb0  RFLAGS: 00000202
    RAX: 0000000000000002  RBX: ffff880137fd3b80  RCX: 0000000000000002
    RDX: ffffffff8180ad80  RSI: 0000000000000000  RDI: 0000000000000282
    RBP: ffff880137fd3be0  R8: ffffffff8180ad68  R9: 0000000000000001
    R10: 0000000000000000  R11: 0000000000000002  R12: 0000000000000000
    R13: 0000000000000002  R14: 0000000000000000  R15: ffff880137fd3bd0
    ORIG_RAX: ffffffffffffff10  CS: 0010  SS: 0018
#10 [ffff880137fd3be8] smp_call_function_single at ffffffff810dc065
#11 [ffff880137fd3c60] smp_call_function_many at ffffffff810dc496
#12 [ffff880137fd3cc0] make_all_cpus_request at ffffffffa004e5ef [kvm]
#13 [ffff880137fd3cf8] kvm_flush_remote_tlbs at ffffffffa004e620 [kvm]
#14 [ffff880137fd3d18] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e6a2 [kvm]
#15 [ffff880137fd3d58] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
#16 [ffff880137fd3d90] try_to_merge_with_ksm_page at ffffffff8119d296
#17 [ffff880137fd3e00] ksm_do_scan at ffffffff8119d749
#18 [ffff880137fd3e78] ksm_scan_thread at ffffffff8119e0cf
#19 [ffff880137fd3ed0] kthread at ffffffff8108b592
#20 [ffff880137fd3f50] ret_from_fork at ffffffff81731ccc

PID: 1543  TASK: ffff880137861800  CPU: 1  COMMAND: "qemu-system-x86"
 #0 [ffff88013fd05e58] crash_nmi_callback at ffffffff81040082
 #1 [ffff88013fd05e68] nmi_handle at ffffffff8172aa38
 #2 [ffff88013fd05ec8] do_nmi at ffffffff8172ac00
 #3 [ffff88013fd05ef0] end_repeat_nmi at ffffffff81729ea1
    [exception RIP: _raw_spin_lock+50]
    RIP: ffffffff817292b2  RSP: ffff8800372ab8f8  RFLAGS: 00000202
    RAX: 0000000000000010  RBX: 0000000000000010  RCX: 0000000000000202
    RDX: ffff8800372ab8f8  RSI: 0000000000000018  RDI: 0000000000000001
    RBP: ffffffff817292b2  R8: ffffffff817292b2  R9: 0000000000000018
    R10: ffff8800372ab8f8  R11: 0000000000000202  R12: ffffffffffffffff
    R13: ffff8800b8f78000  R14: 00000000000070ae  R15: 00000000000070ae
    ORIG_RAX: 00000000000070ae  CS: 0010  SS: 0018
--- ---
 #4 [ffff8800372ab8f8] _raw_spin_lock at ffffffff817292b2
 #5 [ffff8800372ab900] kvm_mmu_notifier_invalidate_range_start at ffffffffa004e67e [kvm]
 #6 [ffff8800372ab940] __mmu_notifier_invalidate_range_start at ffffffff8119b56b
 #7 [ffff8800372ab978] do_wp_page at ffffffff81177fee

A previous backtrace from a 3.19 series kernel, showing interrupts
enabled on both CPUs on L1, is here:
https://lkml.org/lkml/2015/2/23/234
http://people.canonical.com/~inaddy/lp1413540/BACKTRACES.txt

> That is why the IPI does not get through to it, but kdump's NMI had no
> problem getting through.
> 
> This (assuming all backtraces are exact!):
> 
> >  #5 [ffff88013940bcb8] vmx_vcpu_run at ffffffff8104cef0
> >  #6 [ffff88013940bcf8] vmx_handle_external_intr at ffffffff81040c18
> >  #7 [ffff88013940bd30] kvm_arch_vcpu_ioctl_run at ffffffff8101b5ad
> 
> suggests that we called vmx_vcpu_run() from
> vmx_handle_external_intr(), and that we are executing L2 guest code
> with interrupts disabled.
> 
> How is this supposed to work? What mechanism does KVM have against an
> (untrusted) guest interrupt handler locking up?
> I might be misunderstanding how this works at the KVM level, but from
> the APIC perspective the situation appears to be pretty clear: CPU1's
> interrupts are turned off, so it cannot receive IPIs, and the CSD wait
> will eventually time out.
> 
> Now obviously it appears to be anomalous (assuming my analysis is
> correct) that the interrupt handler has locked up, but it's
> immaterial: a nested kernel must not allow its guest to lock it up.
> 
> Thanks,
> 
> 	Ingo

Yes, I think at this point I'll go through the various backtraces and
try to narrow things down. I think overall we're seeing a single effect
from multiple code paths.

--chris j arges