Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
    id S1752691AbbDBVNM (ORCPT ); Thu, 2 Apr 2015 17:13:12 -0400
Received: from youngberry.canonical.com ([91.189.89.112]:52945 "EHLO
    youngberry.canonical.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
    with ESMTP id S1752646AbbDBVNK (ORCPT ); Thu, 2 Apr 2015 17:13:10 -0400
Message-ID: <551DB0E2.1020607@canonical.com>
Date: Thu, 02 Apr 2015 16:13:06 -0500
From: Chris J Arges
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Thunderbird/31.5.0
MIME-Version: 1.0
To: Ingo Molnar
CC: Linus Torvalds, Rafael David Tinoco, Peter Anvin, Jiang Liu,
    Peter Zijlstra, LKML, Jens Axboe, Frederic Weisbecker, Gema Gomez,
    the arch/x86 maintainers
Subject: Re: smp_call_function_single lockups
References: <20150331031536.GA9303@canonical.com>
    <20150331222327.GA12512@canonical.com>
    <20150401124336.GB12841@gmail.com>
    <20150401161047.GD12730@canonical.com>
    <551C6A48.9060805@canonical.com>
    <20150402182607.GA8896@gmail.com>
    <551D8FAF.5070805@canonical.com>
    <20150402190725.GA10570@gmail.com>
In-Reply-To: <20150402190725.GA10570@gmail.com>
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 8bit
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org
Content-Length: 4562
Lines: 119

On 04/02/2015 02:07 PM, Ingo Molnar wrote:
>
> * Chris J Arges wrote:
>
>> Whenever we look through the crashdump we see csd_lock_wait waiting
>> for CSD_FLAG_LOCK bit to be cleared.
>> Usually the signature leading
>> up to that looks like the following (in the openstack tempest on
>> openstack and nested VM stress case)
>>
>> (qemu-system-x86 task)
>> kvm_sched_in
>> -> kvm_arch_vcpu_load
>> -> vmx_vcpu_load
>> -> loaded_vmcs_clear
>> -> smp_call_function_single
>>
>> (ksmd task)
>> pmdp_clear_flush
>> -> flush_tlb_mm_range
>> -> native_flush_tlb_others
>> -> smp_call_function_many
>
> So is this two separate smp_call_function instances, crossing each
> other, and none makes any progress, indefinitely - as if the two IPIs
> got lost?
>

These are two different crash signatures. Sorry for the confusion.

> The traces Rafael he linked to show a simpler scenario with two CPUs
> apparently locked up, doing this:
>
> CPU0:
>
>  #5 [ffffffff81c03e88] native_safe_halt at ffffffff81059386
>  #6 [ffffffff81c03e90] default_idle at ffffffff8101eaee
>  #7 [ffffffff81c03eb0] arch_cpu_idle at ffffffff8101f46f
>  #8 [ffffffff81c03ec0] cpu_startup_entry at ffffffff810b6563
>  #9 [ffffffff81c03f30] rest_init at ffffffff817a6067
> #10 [ffffffff81c03f40] start_kernel at ffffffff81d4cfce
> #11 [ffffffff81c03f80] x86_64_start_reservations at ffffffff81d4c4d7
> #12 [ffffffff81c03f90] x86_64_start_kernel at ffffffff81d4c61c
>
> This CPU is idle.
>
> CPU1:
>
> #10 [ffff88081993fa70] smp_call_function_single at ffffffff810f4d69
> #11 [ffff88081993fb10] native_flush_tlb_others at ffffffff810671ae
> #12 [ffff88081993fb40] flush_tlb_mm_range at ffffffff810672d4
> #13 [ffff88081993fb80] pmdp_splitting_flush at ffffffff81065e0d
> #14 [ffff88081993fba0] split_huge_page_to_list at ffffffff811ddd39
> #15 [ffff88081993fc30] __split_huge_page_pmd at ffffffff811dec65
> #16 [ffff88081993fcc0] unmap_single_vma at ffffffff811a4f03
> #17 [ffff88081993fdc0] zap_page_range at ffffffff811a5d08
> #18 [ffff88081993fe80] sys_madvise at ffffffff811b9775
> #19 [ffff88081993ff80] system_call_fastpath at ffffffff817b8bad
>
> This CPU is busy-waiting for the TLB flush IPI to finish.
>
> There's no unexpected pattern here (other than it not finishing)
> AFAICS, the smp_call_function_single() is just the usual way we invoke
> the TLB flushing methods AFAICS.
>
> So one possibility would be that an 'IPI was sent but lost'.
>
> We could try the following trick: poll for completion for a couple of
> seconds (since an IPI is not held up by anything but irqs-off
> sections, it should arrive within microseconds typically - seconds of
> polling should be more than enough), and if the IPI does not arrive,
> print a warning message and re-send the IPI.
>
> If the IPI was lost due to some race and there's no other failure mode
> that we don't understand, then this would work around the bug and
> would make the tests pass indefinitely - with occasional hickups and a
> handful of messages produced along the way whenever it would have
> locked up with a previous kernel.
>
> If testing indeed confirms that kind of behavior we could drill down
> more closely to figure out why the IPI did not get to its destination.
>
> Or if the behavior is different, we'd have some new behavior to look
> at. (for example the IPI sending mechanism might be wedged
> indefinitely for some reason, so that even a resend won't work.)
>
> Agreed?
>
> Thanks,
>
> Ingo
>

Ingo,

I think tracking IPI calls from 'generic_exec_single' would make a lot
of sense. When you say poll for completion, do you mean a loop after
'arch_send_call_function_single_ipi' in kernel/smp.c? My main concern
is not to alter the timings too much, so that we can still reproduce
the original problem.

Another approach: if we want to check for non-ACKed IPIs, one
possibility would be to add a timestamp field to 'struct
call_single_data' and record jiffies when the IPI is sent. Then have a
per-cpu kthread periodically check the 'call_single_queue' percpu list
for entries where (jiffies - timestamp) > THRESHOLD. When we hit that
condition, print the stale entry in call_single_queue, dump a
backtrace, and then re-send the IPI.
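
To make the timestamp idea a bit more concrete, here is a rough
userspace sketch in C. It is not kernel code and not a patch: 'struct
csd_sim', 'csd_sim_send', 'csd_sim_complete', 'csd_sim_check',
'tick_after' and the STALE_THRESHOLD value are all made-up names for
illustration, a plain tick counter stands in for jiffies, and the
"re-send" is only a counter bump where the real thing would go back
through the arch IPI send path.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-ins: none of these names come from kernel/smp.c. */
#define CSD_FLAG_LOCK   0x01u
#define STALE_THRESHOLD 500ul   /* ticks; arbitrary illustrative value */

struct csd_sim {
    unsigned int  flags;     /* CSD_FLAG_LOCK is set while the call is pending */
    unsigned long timestamp; /* tick count recorded when the IPI was sent */
    int           resends;   /* how many times the checker re-sent the IPI */
};

/* Wraparound-safe "a is after b", the same idea as the kernel's time_after(). */
static bool tick_after(unsigned long a, unsigned long b)
{
    return (long)(b - a) < 0;
}

/* Sender side: mark the entry pending and record when it was sent. */
static void csd_sim_send(struct csd_sim *csd, unsigned long now)
{
    assert(csd != NULL);
    csd->flags |= CSD_FLAG_LOCK;
    csd->timestamp = now;
}

/* Target side: the function ran, so clear the lock bit. */
static void csd_sim_complete(struct csd_sim *csd)
{
    assert(csd != NULL);
    csd->flags &= ~CSD_FLAG_LOCK;
}

/*
 * Periodic check, as the per-cpu kthread would do over its queue.
 * Any entry that is still locked and older than STALE_THRESHOLD is
 * reported and "re-sent" (here: counter bump plus a fresh timestamp).
 * Returns the number of stale entries found in this pass.
 */
static int csd_sim_check(struct csd_sim *queue, size_t n, unsigned long now)
{
    int stale = 0;

    for (size_t i = 0; i < n; i++) {
        struct csd_sim *csd = &queue[i];

        if (!(csd->flags & CSD_FLAG_LOCK))
            continue; /* already acknowledged */
        if (!tick_after(now, csd->timestamp + STALE_THRESHOLD))
            continue; /* still within the allowed window */

        fprintf(stderr, "stale csd %zu: pending for %lu ticks, re-sending\n",
                i, now - csd->timestamp);
        csd->timestamp = now; /* restart the clock for the re-sent IPI */
        csd->resends++;
        stale++;
    }
    return stale;
}
```

Resetting the timestamp on re-send means a permanently wedged entry
gets reported once per threshold interval rather than on every pass,
which should keep the log noise (and the timing impact) small.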

Let me know what makes the most sense to hack on.

Thanks,
--chris