Message-ID: <551DB0E2.1020607@canonical.com>
Date: Thu, 02 Apr 2015 16:13:06 -0500
From: Chris J Arges <chris.j.arges@canonical.com>
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Ingo Molnar <mingo@kernel.org>
CC: Linus Torvalds <torvalds@linux-foundation.org>,
        Rafael David Tinoco <inaddy@ubuntu.com>, Peter Anvin <hpa@zytor.com>,
        Jiang Liu <jiang.liu@linux.intel.com>,
        Peter Zijlstra <peterz@infradead.org>,
        LKML <linux-kernel@vger.kernel.org>, Jens Axboe <axboe@kernel.dk>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Gema Gomez <gema.gomez-solano@canonical.com>,
        the arch/x86 maintainers <x86@kernel.org>
Subject: Re: smp_call_function_single lockups
References: <20150331031536.GA9303@canonical.com> <CA+55aFykg3SAO16=NRiC+tP1gGj5hgbu+Y93ss4Qg30+qyZ=+w@mail.gmail.com> <20150331222327.GA12512@canonical.com> <20150401124336.GB12841@gmail.com> <20150401161047.GD12730@canonical.com> <CA+55aFxQ6q7MNS+4XWZ3=Xa0Hz6kumd84v_aEw3M4gBpXszTkQ@mail.gmail.com> <551C6A48.9060805@canonical.com> <CA+55aFw2Jb4ASOxckY1cwP23fAYv5dG1WYCkB6RyjjpP2hEQcw@mail.gmail.com> <20150402182607.GA8896@gmail.com> <551D8FAF.5070805@canonical.com> <20150402190725.GA10570@gmail.com>
In-Reply-To: <20150402190725.GA10570@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
Content-Length: 4562
Lines: 119


On 04/02/2015 02:07 PM, Ingo Molnar wrote:
> 
> * Chris J Arges <chris.j.arges@canonical.com> wrote:
> 
>> Whenever we look through the crashdump we see csd_lock_wait waiting 
>> for CSD_FLAG_LOCK bit to be cleared.  Usually the signature leading 
>> up to that looks like the following (in the openstack tempest on 
>> openstack and nested VM stress case)
>>
>> (qemu-system-x86 task)
>> kvm_sched_in
>>  -> kvm_arch_vcpu_load
>>   -> vmx_vcpu_load
>>    -> loaded_vmcs_clear
>>     -> smp_call_function_single
>>
>> (ksmd task)
>> pmdp_clear_flush
>>  -> flush_tlb_mm_range
>>   -> native_flush_tlb_others
>>     -> smp_call_function_many
> 
> So is this two separate smp_call_function instances, crossing each 
> other, and none makes any progress, indefinitely - as if the two IPIs 
> got lost?
> 

These are two different crash signatures. Sorry for the confusion.

> The traces Rafael linked to show a simpler scenario with two CPUs 
> apparently locked up, doing this:
> 
> CPU0:
> 
>  #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
>  #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
>  #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
>  #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
>  #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
> #10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
> #11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
> #12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c
> 
> This CPU is idle.
> 
> CPU1:
> 
> #10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
> #11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
> #12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
> #13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
> #14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
> #15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
> #16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
> #17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
> #18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
> #19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad
> 
> This CPU is busy-waiting for the TLB flush IPI to finish.
> 
> There's no unexpected pattern here (other than it not finishing) 
> AFAICS, the smp_call_function_single() is just the usual way we invoke 
> the TLB flushing methods AFAICS.
> 
> So one possibility would be that an 'IPI was sent but lost'.
> 
> We could try the following trick: poll for completion for a couple of 
> seconds (since an IPI is not held up by anything but irqs-off 
> sections, it should arrive within microseconds typically - seconds of 
> polling should be more than enough), and if the IPI does not arrive, 
> print a warning message and re-send the IPI.
>
> If the IPI was lost due to some race and there's no other failure mode 
> that we don't understand, then this would work around the bug and 
> would make the tests pass indefinitely - with occasional hiccups and a 
> handful of messages produced along the way whenever it would have 
> locked up with a previous kernel.
> 
> If testing indeed confirms that kind of behavior we could drill down 
> more closely to figure out why the IPI did not get to its destination.
> 
> Or if the behavior is different, we'd have some new behavior to look 
> at. (for example the IPI sending mechanism might be wedged 
> indefinitely for some reason, so that even a resend won't work.)
> 
> Agreed?
> 
> Thanks,
> 
> 	Ingo
> 

Ingo,

I think tracking IPI calls from 'generic_exec_single' would make a lot
of sense. When you say poll for completion, do you mean a loop after
'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
is not to alter the timing too much, so that we can still reproduce the
original problem.
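
To check that I'm reading the poll-and-resend idea correctly, here is
a rough userspace mock-up of the control flow, not a patch against
kernel/smp.c. Everything in it is an assumption for illustration:
'struct mock_csd', 'csd_lock_wait_resend', the poll and resend limits,
and the two mock senders are hypothetical names; only CSD_FLAG_LOCK
comes from the real code. It is single-threaded, so the "IPI target"
is just a callback that may or may not clear the lock bit.

```c
#include <stdbool.h>
#include <stdio.h>

#define CSD_FLAG_LOCK 0x01u

/* Userspace stand-in for struct call_single_data (hypothetical). */
struct mock_csd {
	unsigned int flags;
	unsigned int ipis_sent;	/* how many times the IPI was "sent" */
};

/* Stands in for arch_send_call_function_single_ipi() in this mock. */
typedef void (*send_ipi_fn)(struct mock_csd *csd);

/*
 * Poll for CSD_FLAG_LOCK to clear. After max_polls polls without an
 * ack, print a warning and re-send, at most max_resends times.
 * Returns true if the lock cleared, false if we gave up.
 */
bool csd_lock_wait_resend(struct mock_csd *csd, send_ipi_fn send,
			  unsigned long max_polls, unsigned int max_resends)
{
	unsigned int resends = 0;

	for (;;) {
		for (unsigned long i = 0; i < max_polls; i++) {
			if (!(csd->flags & CSD_FLAG_LOCK))
				return true;
		}
		if (resends == max_resends)
			return false;	/* a resend does not help either */
		fprintf(stderr, "csd %p: no ack after %lu polls, re-sending IPI\n",
			(void *)csd, max_polls);
		send(csd);
		resends++;
	}
}

/* Mock target: drops the first IPI, acks on any later one. */
void lossy_send(struct mock_csd *csd)
{
	if (++csd->ipis_sent >= 2)
		csd->flags &= ~CSD_FLAG_LOCK;
}

/* Mock target that never acks (the "wedged" case). */
void dead_send(struct mock_csd *csd)
{
	csd->ipis_sent++;
}
```

With 'lossy_send' the wait returns true after one resend; with
'dead_send' it gives up after max_resends, which would correspond to
the case where the sending mechanism itself is wedged. In the kernel
the bound would presumably be time-based rather than a poll count.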

Another approach:
If we want to check for non-ACKed IPIs, one possibility would be to add
a timestamp field to 'struct call_single_data' and record jiffies when
the IPI is sent. A per-cpu kthread would then periodically walk the
'call_single_queue' percpu list and check each entry for
(jiffies - timestamp) > THRESHOLD. When an entry hits that condition,
print the stale entry and a backtrace, then re-send the IPI.
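
A minimal userspace sketch of the staleness check, to show the shape
of what the kthread would do. The names here are assumptions and not
existing kernel code: 'struct ts_csd', its 'timestamp' and 'resent'
fields, 'csd_is_stale' and 'scan_stale' are all hypothetical, and a
plain singly linked list stands in for the real per-cpu queue. The
comparison uses unsigned subtraction so it stays correct when the
tick counter wraps.

```c
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for struct call_single_data plus the proposed field. */
struct ts_csd {
	struct ts_csd *next;
	unsigned long timestamp;	/* tick count when the IPI was sent */
	bool resent;			/* set when the scan flags a re-send */
};

/*
 * True once more than 'threshold' ticks have elapsed since 'ts'.
 * Unsigned subtraction keeps this correct across counter wraparound.
 */
bool csd_is_stale(unsigned long now, unsigned long ts,
		  unsigned long threshold)
{
	return (now - ts) > threshold;
}

/*
 * Walk the queue and flag every stale entry for a re-send. The real
 * version would print the entry and a backtrace at this point.
 * Returns the number of stale entries found.
 */
size_t scan_stale(struct ts_csd *head, unsigned long now,
		  unsigned long threshold)
{
	size_t n = 0;

	for (struct ts_csd *c = head; c; c = c->next) {
		if (csd_is_stale(now, c->timestamp, threshold)) {
			c->resent = true;
			c->timestamp = now;	/* restart the clock */
			n++;
		}
	}
	return n;
}
```

Resetting the timestamp on re-send means a second warning only fires
if the re-sent IPI also goes unacknowledged for a full THRESHOLD.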

Let me know what makes the most sense to hack on.

Thanks,
--chris