Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll
To: Radim Krčmář, Paolo Bonzini
Cc: Wanpeng Li, Thomas Gleixner, Ingo Molnar, "H. Peter Anvin",
    the arch/x86 maintainers, Jonathan Corbet, tony.luck@intel.com,
    Borislav Petkov, Peter Zijlstra, mchehab@kernel.org, Andrew Morton,
    krzk@kernel.org, jpoimboe@redhat.com, Andy Lutomirski,
    Christian Borntraeger, Thomas Garnier, Robert Gerst, Mathias Krause,
    douly.fnst@cn.fujitsu.com, Nicolai Stange, Frederic Weisbecker,
    dvlasenk@redhat.com, Daniel Bristot de Oliveira,
    yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com,
    Chen Yu, aaron.lu@intel.com, Steven Rostedt, Kyle Huey, Len Brown,
    Prarit Bhargava, hidehiro.kawai.ez@hitachi.com,
    fengtiantian@huawei.com, pmladek@suse.com, jeyu@redhat.com,
    Larry.Finger@lwfinger.net, zijun_hu@htc.com, luisbg@osg.samsung.com,
    johannes.berg@intel.com, niklas.soderlund+renesas@ragnatech.se,
    zlpnobody@gmail.com, Alexey Dobriyan,
    fgao@48lvckh6395k16k5.yundunddos.com, ebiederm@xmission.com,
    Subash Abhinov Kasiviswanathan, Arnd Bergmann, Matt Fleming,
    Mel Gorman, "linux-kernel@vger.kernel.org", linux-doc@vger.kernel.org,
    linux-edac@vger.kernel.org, kvm
From: Yang Zhang
Date: Mon, 3 Jul 2017 17:28:42 +0800

On 2017/6/27 22:22, Radim Krčmář wrote:
> 2017-06-27 15:56+0200, Paolo Bonzini:
>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>> Right, it's just harder to use than host's single_task_running() -- the
>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>> at other VCPUs that are not halted, but still preempted.
>>>
>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>> yield to the host.  Working under the assumption that there is work for
>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>> misses information about host's topology, so it would be hard to make it
>>> work well.
>>
>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>> this option is really a "f*** everyone else" setting just like
>> idle=poll, only a little more polite.
>
> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>
>> If we've been preempted and we were polling, there are two cases.  If an
>> interrupt was queued while the guest was preempted, the poll will be
>> treated as successful anyway.
>
> I think the poll should be treated as invalid if the window has expired
> while the VCPU was preempted -- the guest can't tell whether the
> interrupt arrived still within the poll window (unless we added paravirt
> for that), so it shouldn't be wasting time waiting for it.
>
>> If it hasn't, let others run---but really
>> that's not because the guest wants to be polite, it's to avoid that the
>> scheduler penalizes it excessively.
>
> This sounds like a VM entry just to do an immediate VM exit, so paravirt
> seems better here as well ... (the guest telling the host about its
> window -- which could also be used to rule it out as a target in the
> pause loop random kick.)
>
>> So until it's preempted, I think it's okay if the guest doesn't care
>> about others.  You wouldn't use this option anyway in overcommitted
>> situations.
>>
>> (I'm still not very convinced about the idea).
>
> Me neither.  (The same mechanism is applicable to bare-metal, but was
> never used there, so I would rather bring the guest behavior closer to
> bare-metal.)
>

The background is that we (Alibaba Cloud) are getting more and more
complaints from our customers on both KVM and Xen compared to bare
metal. After investigation, the root cause is known to us: the high
cost of message-passing workloads (David showed it at KVM Forum 2015).

A typical message-passing workload looks like this:

vcpu 0                             vcpu 1
1. send IPI                        2. doing hlt
3. go into idle                    4. receive IPI and wake up from hlt
5. write APIC timer twice to       6. write APIC timer twice to
   stop sched timer                   reprogram sched timer
7. doing hlt                       8. handle task and send IPI to
                                      vcpu 0
9. same as 4                       10. same as 3

One transaction will introduce about 12 vmexits (2 hlt and 10 MSR
writes). The cost of these vmexits degrades performance severely.
The Linux kernel already provides idle=poll to mitigate this, but it
only eliminates the IPI and hlt vmexits. It does nothing about
starting/stopping the sched timer. A compromise would be to turn off
NOHZ in the kernel, but that is not the default config for new
distributions. The same goes for halt-poll in KVM: it only addresses
the cost of scheduling in/out on the host and does not help such
workloads much.

The purpose of this patch is to improve the current idle=poll mechanism
to use dynamic polling, and to poll before touching the sched timer. It
should not be a virtualization-specific feature, but it seems bare metal
has a low cost for accessing the MSR, so I want to enable it only in
VMs. Though the idea behind the patch may not fit all conditions
perfectly, it looks no worse than what we have now.

How about we keep the current implementation and I integrate the patch
into the paravirtualization part as Paolo suggested? We can continue to
discuss it, and I will keep refining it if anyone has better
suggestions.

-- 
Yang
Alibaba Cloud Computing
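
For readers following the thread, the "dynamic polling" idea can be
illustrated with a small user-space sketch: grow the poll window when a
poll catches a wakeup, and shrink it when the poll was wasted. This is
only a simplified grow/shrink heuristic written for illustration; it is
not the code from the patch, and every name and value in it
(struct poll_tuning, next_poll_window, the tunables) is hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tunables; names and values are illustrative only. */
struct poll_tuning {
    uint64_t max_ns;   /* upper bound for the poll window */
    uint64_t start_ns; /* first non-zero window when growing from 0 */
    unsigned grow;     /* multiplicative growth factor */
    unsigned shrink;   /* divisor when shrinking; 0 means reset to 0 */
};

/*
 * Return the next poll window, given the current window and whether
 * the last poll saw a wakeup before the window expired.
 */
static uint64_t next_poll_window(const struct poll_tuning *t,
                                 uint64_t cur_ns, bool poll_succeeded)
{
    if (poll_succeeded) {
        /* Polling paid off: widen the window, capped at max_ns. */
        uint64_t next = cur_ns ? cur_ns * t->grow : t->start_ns;
        return next > t->max_ns ? t->max_ns : next;
    }
    /* Poll was wasted: back off so an idle vCPU halts sooner. */
    return t->shrink ? cur_ns / t->shrink : 0;
}
```

In this sketch the idle path would poll for the returned window before
touching the sched timer and fall back to hlt only when the window
expires, so a vCPU that is woken quickly avoids the timer MSR writes
and the hlt exit.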