From: Wanpeng Li
Date: Wed, 5 Jul 2017 06:28:48 +0800
Subject: Re: [PATCH 2/2] x86/idle: use dynamic halt poll
To: Yang Zhang
Cc: Radim Krčmář, Paolo Bonzini, Thomas Gleixner, Ingo Molnar,
    "H. Peter Anvin", "the arch/x86 maintainers", Jonathan Corbet,
    tony.luck@intel.com, Borislav Petkov, Peter Zijlstra, mchehab@kernel.org,
    Andrew Morton, krzk@kernel.org, jpoimboe@redhat.com, Andy Lutomirski,
    Christian Borntraeger, Thomas Garnier, Robert Gerst, Mathias Krause,
    douly.fnst@cn.fujitsu.com, Nicolai Stange, Frederic Weisbecker,
    dvlasenk@redhat.com, Daniel Bristot de Oliveira,
    yamada.masahiro@socionext.com, mika.westerberg@linux.intel.com, Chen Yu,
    aaron.lu@intel.com, Steven Rostedt, Kyle Huey, Len Brown, Prarit Bhargava,
    hidehiro.kawai.ez@hitachi.com, fengtiantian@huawei.com, pmladek@suse.com,
    jeyu@redhat.com, Larry.Finger@lwfinger.net, zijun_hu@htc.com,
    luisbg@osg.samsung.com, johannes.berg@intel.com,
    niklas.soderlund+renesas@ragnatech.se, zlpnobody@gmail.com,
    Alexey Dobriyan, fgao@48lvckh6395k16k5.yundunddos.com,
    ebiederm@xmission.com, Subash Abhinov Kasiviswanathan, Arnd Bergmann,
    Matt Fleming, Mel Gorman, "linux-kernel@vger.kernel.org",
    linux-doc@vger.kernel.org, linux-edac@vger.kernel.org, kvm

2017-07-03 17:28 GMT+08:00 Yang Zhang:
> On 2017/6/27 22:22, Radim Krčmář wrote:
>>
>> 2017-06-27 15:56+0200, Paolo Bonzini:
>>>
>>> On 27/06/2017 15:40, Radim Krčmář wrote:
>>>>>
>>>>> ... which is not necessarily _wrong_.  It's just a different heuristic.
>>>>
>>>> Right, it's just harder to use than host's single_task_running() -- the
>>>> VCPU calling vcpu_is_preempted() is never preempted, so we have to look
>>>> at other VCPUs that are not halted, but still preempted.
>>>>
>>>> If we see some ratio of preempted VCPUs (> 0?), then we stop polling and
>>>> yield to the host.
>>>> Working under the assumption that there is work for
>>>> this PCPU if other VCPUs have stuff to do.  The downside is that it
>>>> misses information about host's topology, so it would be hard to make it
>>>> work well.
>>>
>>>
>>> I would just use vcpu_is_preempted on the current CPU.  From guest POV
>>> this option is really a "f*** everyone else" setting just like
>>> idle=poll, only a little more polite.
>>
>>
>> vcpu_is_preempted() on current cpu cannot return true, AFAIK.
>>
>>> If we've been preempted and we were polling, there are two cases.  If an
>>> interrupt was queued while the guest was preempted, the poll will be
>>> treated as successful anyway.
>>
>>
>> I think the poll should be treated as invalid if the window has expired
>> while the VCPU was preempted -- the guest can't tell whether the
>> interrupt arrived still within the poll window (unless we added paravirt
>> for that), so it shouldn't be wasting time waiting for it.
>>
>>> If it hasn't, let others run---but really
>>> that's not because the guest wants to be polite, it's to avoid that the
>>> scheduler penalizes it excessively.
>>
>>
>> This sounds like a VM entry just to do an immediate VM exit, so paravirt
>> seems better here as well ... (the guest telling the host about its
>> window -- which could also be used to rule it out as a target in the
>> pause loop random kick.)
>>
>>> So until it's preempted, I think it's okay if the guest doesn't care
>>> about others.  You wouldn't use this option anyway in overcommitted
>>> situations.
>>>
>>> (I'm still not very convinced about the idea).
>>
>>
>> Me neither.  (The same mechanism is applicable to bare-metal, but was
>> never used there, so I would rather bring the guest behavior closer to
>> bare-metal.)
>>
>
> The background is that we (Alibaba Cloud) get more and more complaints
> from our customers on both KVM and Xen compared to bare-metal. After
> investigation, the root cause is known to us: the big cost of message-passing
> workloads (David showed it at KVM Forum 2015).
>
> A typical message workload looks like below:
> vcpu 0                             vcpu 1
> 1. send ipi                        2. doing hlt
> 3. go into idle                    4. receive ipi and wake up from hlt
> 5. write APIC timer twice          6. write APIC timer twice to
>    to stop sched timer                reprogram sched timer

I couldn't find where these two scenarios program the APIC timer twice
each rather than once each. Could you point to the code?

Regards,
Wanpeng Li

> 7. doing hlt                       8. handle task and send ipi to
>                                       vcpu 0
> 9. same as 4.                      10. same as 3
>
> One transaction introduces about 12 vmexits (2 hlt and 10 msr write). The
> cost of such vmexits degrades performance severely. The Linux kernel
> already provides idle=poll to mitigate this. But it only eliminates the
> IPI and hlt vmexits. It does nothing about starting/stopping the sched timer.
> A compromise would be to turn off NOHZ in the kernel, but that is not the
> default config for new distributions. The same goes for halt-poll in KVM: it
> only solves the cost of scheduling in/out in the host and cannot help such a
> workload much.
>
> The purpose of this patch is to improve the current idle=poll mechanism to
> use dynamic polling and to poll before touching the sched timer. It should
> not be a virtualization-specific feature, but bare metal seems to have a low
> cost for accessing the MSR. So I want to enable it only in a VM. Though the
> idea behind the patch may not be perfect for all conditions, it looks no
> worse than now. How about we keep the current implementation and I integrate
> the patch into the para-virtualized part as Paolo suggested? We can continue
> to discuss it and I will continue to refine it if anyone has better
> suggestions.
>
>
>
> --
> Yang
> Alibaba Cloud Computing
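
For reference, the "dynamic polling before touching the sched timer" idea
quoted above can be modelled roughly as below. This is only a toy sketch and
not the code from the patch: the class, field names and constants are all
hypothetical, and the grow/shrink rule is loosely borrowed from the idea behind
host-side halt polling. Each call stands for one idle period; a True return
means the wakeup was caught while polling, so the hlt and the timer stop/restart
would have been skipped for that period.

```python
from dataclasses import dataclass


@dataclass
class DynamicPoll:
    """Hypothetical per-CPU poll state; every name and constant is assumed."""
    max_ns: int = 200_000    # upper bound on the poll window (assumed)
    start_ns: int = 10_000   # first non-zero window after growing (assumed)
    grow: int = 2            # multiplicative grow factor (assumed)
    shrink: int = 2          # divisor applied when shrinking (assumed)
    window_ns: int = 0       # current poll window; 0 means "do not poll"

    def idle(self, idle_ns: int) -> bool:
        """Account one idle period that lasted idle_ns until work arrived.

        Returns True if the wakeup arrived inside the poll window, i.e. the
        CPU never had to stop the tick or execute hlt for this period.
        """
        if idle_ns <= self.window_ns:
            return True
        if idle_ns > self.max_ns:
            # Long idle period: polling was wasted effort, so shrink.
            self.window_ns //= self.shrink
            if self.window_ns < self.start_ns:
                self.window_ns = 0
        else:
            # Short idle period that a larger window would have caught: grow.
            grown = max(self.window_ns * self.grow, self.start_ns)
            self.window_ns = min(grown, self.max_ns)
        return False


def halted_periods(idle_times_ns):
    """Count the idle periods that still ended in hlt under this model."""
    poll = DynamicPoll()
    return sum(0 if poll.idle(t) else 1 for t in idle_times_ns)
```

With a ping-pong workload whose idle gaps are short (say 5 us each), only the
first gap ends in hlt in this model; with long gaps the window stays at zero
and nothing is spent on polling. Whether that trade-off holds on a real guest
depends on the vmexit costs discussed above.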