Date: Wed, 26 Jun 2013 21:07:55 +0530
From: Raghavendra K T
To: Chegu Vinod
Cc: Gleb Natapov, habanero@linux.vnet.ibm.com, Andrew Jones, mingo@redhat.com,
    jeremy@goop.org, x86@kernel.org, konrad.wilk@oracle.com, hpa@zytor.com,
    pbonzini@redhat.com, linux-doc@vger.kernel.org, xen-devel@lists.xensource.com,
    peterz@infradead.org, mtosatti@redhat.com, stefano.stabellini@eu.citrix.com,
    andi@firstfloor.org, attilio.rao@citrix.com, ouyang@cs.pitt.edu, gregkh@suse.de,
    agraf@suse.de, torvalds@linux-foundation.org, avi.kivity@gmail.com,
    tglx@linutronix.de, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    stephan.diestelhorst@amd.com, riel@redhat.com,
    virtualization@lists.linux-foundation.org, srivatsa.vaddagiri@gmail.com
Subject: Re: [PATCH RFC V9 0/19] Paravirtualized ticket spinlocks
Message-ID: <51CB0AD3.50101@linux.vnet.ibm.com>
In-Reply-To: <51CAFD28.7080002@hp.com>

On 06/26/2013 08:09 PM, Chegu Vinod wrote:
> On 6/26/2013 6:40 AM, Raghavendra K T wrote:
>> On 06/26/2013 06:22 PM, Gleb Natapov wrote:
>>> On Wed, Jun 26, 2013 at 01:37:45PM +0200, Andrew Jones wrote:
>>>> On Wed, Jun 26, 2013 at 02:15:26PM +0530, Raghavendra K T wrote:
>>>>> On 06/25/2013 08:20 PM, Andrew Theurer wrote:
>>>>>> On Sun, 2013-06-02 at 00:51 +0530, Raghavendra K T wrote:
>>>>>>> This series replaces the existing paravirtualized spinlock mechanism
>>>>>>> with a paravirtualized ticketlock mechanism. The series provides
>>>>>>> implementation for both Xen and KVM.
>>>>>>>
>>>>>>> Changes in V9:
>>>>>>> - Changed spin_threshold to 32k to avoid excess halt exits that are
>>>>>>>   causing undercommit degradation (after PLE handler improvement).
>>>>>>> - Added kvm_irq_delivery_to_apic (suggested by Gleb)
>>>>>>> - Optimized halt exit path to use PLE handler
>>>>>>>
>>>>>>> V8 of PVspinlock was posted last year. After Avi's suggestions to look
>>>>>>> at PLE handler's improvements, various optimizations in PLE handling
>>>>>>> have been tried.
>>>>>>
>>>>>> Sorry for not posting this sooner. I have tested the v9 pv-ticketlock
>>>>>> patches in 1x and 2x over-commit with 10-vcpu and 20-vcpu VMs. I have
>>>>>> tested these patches with and without PLE, as PLE is still not
>>>>>> scalable with large VMs.
>>>>>
>>>>> Hi Andrew,
>>>>>
>>>>> Thanks for testing.
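(For context on the spin_threshold change in the changelog above: in a
pv-ticketlock the threshold bounds how long a waiter spins inside the guest
before it halts and waits to be kicked by the unlocker. A minimal sketch of
that shape is below; the names, types and the pv_halt_and_wait() hypercall
wrapper are illustrative stand-ins, not the actual code from this series.)

	/* Simplified pv-ticketlock slowpath sketch (not the patch itself). */
	typedef unsigned short __ticket_t;

	struct pv_ticketlock {
		__ticket_t head;		/* ticket currently being served */
		__ticket_t tail;		/* next ticket to hand out */
	};

	#define SPIN_THRESHOLD	(1 << 15)	/* 32k, as in the V9 changelog */

	/* Stand-in for the halt/kick hypercall pair the series relies on. */
	void pv_halt_and_wait(struct pv_ticketlock *lock, __ticket_t want);

	static void pv_ticket_lock_slowpath(struct pv_ticketlock *lock,
					    __ticket_t want)
	{
		unsigned int loop;

		/*
		 * Spin a bounded number of times first: cheap when the lock
		 * holder is actually running (undercommit), and avoids a
		 * needless halt exit.
		 */
		for (loop = SPIN_THRESHOLD; loop; loop--) {
			if (__atomic_load_n(&lock->head, __ATOMIC_ACQUIRE) == want)
				return;		/* our ticket came up */
			__builtin_ia32_pause();	/* cpu_relax() equivalent */
		}

		/*
		 * Threshold exceeded: block in the hypervisor until the
		 * unlocker kicks this vcpu (the overcommit-friendly path).
		 */
		pv_halt_and_wait(lock, want);
	}

(The trade-off discussed further down in this thread is exactly this
constant: a larger threshold avoids halt exits under low contention, while a
smaller one gives up the pcpu sooner when overcommitted.)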
>>>>>
>>>>>> System: x3850X5, 40 cores, 80 threads
>>>>>>
>>>>>> 1x over-commit with 10-vCPU VMs (8 VMs) all running dbench:
>>>>>> ----------------------------------------------------------
>>>>>> Configuration           Total Throughput (MB/s)   Notes
>>>>>> 3.10-default-ple_on     22945    5% CPU in host kernel, 2% spin_lock in guests
>>>>>> 3.10-default-ple_off    23184    5% CPU in host kernel, 2% spin_lock in guests
>>>>>> 3.10-pvticket-ple_on    22895    5% CPU in host kernel, 2% spin_lock in guests
>>>>>> 3.10-pvticket-ple_off   23051    5% CPU in host kernel, 2% spin_lock in guests
>>>>>> [all 1x results look good here]
>>>>>
>>>>> Yes, the 1x results look very close.
>>>>>
>>>>>> 2x over-commit with 10-vCPU VMs (16 VMs) all running dbench:
>>>>>> -----------------------------------------------------------
>>>>>> Configuration           Total Throughput (MB/s)   Notes
>>>>>> 3.10-default-ple_on      6287   55% CPU in host kernel, 17% spin_lock in guests
>>>>>> 3.10-default-ple_off     1849    2% CPU in host kernel, 95% spin_lock in guests
>>>>>> 3.10-pvticket-ple_on     6691   50% CPU in host kernel, 15% spin_lock in guests
>>>>>> 3.10-pvticket-ple_off   16464    8% CPU in host kernel, 33% spin_lock in guests
>>>>>
>>>>> I see a 6.426% improvement with ple_on and a 161.87% improvement with
>>>>> ple_off. I think this is a very good sign for the patches.
>>>>>
>>>>>> [PLE hinders pv-ticket improvements, but even with PLE off, we are
>>>>>> still off from the ideal throughput (somewhere >20000)]
>>>>>
>>>>> Okay. The ideal throughput you are referring to is getting at least
>>>>> around 80% of the 1x throughput for over-commit. Yes, we are still far
>>>>> away from there.
>>>>>
>>>>>> 1x over-commit with 20-vCPU VMs (4 VMs) all running dbench:
>>>>>> ----------------------------------------------------------
>>>>>> Configuration           Total Throughput (MB/s)   Notes
>>>>>> 3.10-default-ple_on     22736    6% CPU in host kernel, 3% spin_lock in guests
>>>>>> 3.10-default-ple_off    23377    5% CPU in host kernel, 3% spin_lock in guests
>>>>>> 3.10-pvticket-ple_on    22471    6% CPU in host kernel, 3% spin_lock in guests
>>>>>> 3.10-pvticket-ple_off   23445    5% CPU in host kernel, 3% spin_lock in guests
>>>>>> [1x looking fine here]
>>>>>
>>>>> I see ple_off is a little better here.
>>>>>
>>>>>> 2x over-commit with 20-vCPU VMs (8 VMs) all running dbench:
>>>>>> ----------------------------------------------------------
>>>>>> Configuration           Total Throughput (MB/s)   Notes
>>>>>> 3.10-default-ple_on      1965   70% CPU in host kernel, 34% spin_lock in guests
>>>>>> 3.10-default-ple_off      226    2% CPU in host kernel, 94% spin_lock in guests
>>>>>> 3.10-pvticket-ple_on     1942   70% CPU in host kernel, 35% spin_lock in guests
>>>>>> 3.10-pvticket-ple_off    8003   11% CPU in host kernel, 70% spin_lock in guests
>>>>>> [quite bad all around, but pv-tickets with PLE off is the best so far.
>>>>>> Still quite a bit off from the ideal throughput]
>>>>>
>>>>> This is again a remarkable improvement (307%). This motivates me to add
>>>>> a patch to disable PLE when pvspinlock is on. Probably we can add a
>>>>> hypercall that disables PLE as part of KVM init, but the only problem I
>>>>> see is what happens if the guests are mixed (i.e., one guest has
>>>>> pvspinlock support but another does not, while the host supports pv).
>>>>
>>>> How about reintroducing the idea of creating per-kvm ple_gap,ple_window
>>>> state?
>>>> We were headed down that road when considering a dynamic window at one
>>>> point. Then you can just set a single guest's ple_gap to zero, which
>>>> would lead to PLE being disabled for that guest. We could also revisit
>>>> the dynamic window then.
>>>>
>>> Can be done, but let's understand why PLE on is such a big problem. Is it
>>> possible that ple_gap and SPIN_THRESHOLD are not tuned properly?
>>>
>>
>> The one obvious reason I see is the lack of commit awareness inside the
>> guest: for under-commit there is no need to do PLE, but unfortunately we
>> do. At least we now return immediately in the case of a potential
>> undercommit, but we still incur the vmexit delay.
>> The same applies to SPIN_THRESHOLD: ideally it should be higher for
>> undercommit and lower for overcommit.
>>
>> With this patch series SPIN_THRESHOLD is increased to 32k solely to avoid
>> under-commit regressions, but that would have eaten into some of the
>> overcommit performance.
>> In summary: excess halt exits / PLE exits were one main reason for the
>> undercommit regression (compared to the PLE-disabled case).
>
> I haven't yet tried these patches... hope to do so sometime soon.
>
> FWIW, after Raghu's last set of PLE changes, which are now in the 3.10-rc
> kernels, I didn't notice much difference in workload performance between
> PLE enabled vs. disabled. This is for the under-commit (+pinned) large
> guest case.
>

Hi Vinod,

Thanks for confirming that the PLE-enabled case is now very close to the
PLE-disabled one.

> Here is a small sampling of the guest exits collected via kvm ftrace for
> an OLTP-like workload which was keeping the guest ~85-90% busy on an
> 8-socket Westmere-EX box (HT-off).
>
> TIME_IN_GUEST           71.616293
> TIME_ON_HOST             7.764597
>
> MSR_READ                 0.000362     0.0%
> NMI_WINDOW               0.000002     0.0%
> PAUSE_INSTRUCTION        0.158595     2.0%
> PENDING_INTERRUPT        0.033779     0.4%
> MSR_WRITE                0.001695     0.0%
> EXTERNAL_INTERRUPT       3.210867    41.4%
> IO_INSTRUCTION           0.000018     0.0%
> RDPMC                    0.000067     0.0%
> HLT                      2.822523    36.4%
> EXCEPTION_NMI            0.008362     0.1%
> CR_ACCESS                0.010027     0.1%
> APIC_ACCESS              1.518300    19.6%
>
> [ Don't mean to digress from the topic, but in most of my under-commit +
> pinned large guest experiments with 3.10 kernels (using 2 or 3 different
> workloads) the time spent in halt exits is typically much more than the
> time spent in PLE exits. Can anything be done to reduce the duration of,
> or avoid, those exits? ]
>

I would say that using the PLE handler in the halt exit path, as done by
[patch 18, kvm hypervisor: Add directed yield in vcpu block path] in this
series, helps with this. That is an independent patch to try out.

>>
>> 1. A dynamic ple window was one solution for PLE, which we can experiment
>> with further (at the VM level or globally).
>
> Is this the case where the dynamic PLE window starts off at a value more
> suitable to reducing exits for under-commit (and pinned) cases, and only
> when the host OS detects that the degree of under-commit is shrinking
> (i.e. moving towards having more vcpus to schedule and hence getting to be
> over-committed) does it adjust the ple window to a value more suitable to
> the over-commit case? Or is this some different idea?

Yes, we are discussing the same idea (a rough sketch of it is appended at
the end of this mail).

>
> Thanks
> Vinod
>
>> The other experiment I was thinking of is to extend the spinlock to
>> accommodate the vcpuid (Linus has opposed that, but it may be worth a
>> try).
>>
>> 2. Andrew Theurer had a patch to reduce double runqueue locking that I
>> will be testing.
>>
>> I have some older experiments to retry, though they did not give
>> significant improvements before the PLE handler was modified.
>>
>> Andrew, do you have any other details to add (from the perf report that
>> you usually take with these experiments)?
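(To make the dynamic ple window idea above concrete, here is a rough sketch
of one possible shape for it: grow the window while PLE exits keep finding
nobody useful to yield to (the under-commit case), and shrink it again once
directed yields start succeeding (over-commit). All names and constants here
are illustrative only; this is not code from this series or from KVM.)

	/* Hypothetical per-VM (or per-vcpu) PLE state instead of globals. */
	struct ple_state {
		unsigned int ple_gap;		/* 0 would disable PLE entirely */
		unsigned int ple_window;
	};

	#define PLE_WINDOW_MIN		4096
	#define PLE_WINDOW_MAX		(512 * 1024)
	#define PLE_WINDOW_GROW		2	/* on wasted exits (under-commit) */
	#define PLE_WINDOW_SHRINK	2	/* when yields succeed (over-commit) */

	/* Called from the PLE exit handler after the directed-yield attempt. */
	static void ple_window_update(struct ple_state *s, int yield_succeeded)
	{
		if (yield_succeeded) {
			/* Over-commit: preempted vcpus exist, exit earlier next time. */
			s->ple_window /= PLE_WINDOW_SHRINK;
			if (s->ple_window < PLE_WINDOW_MIN)
				s->ple_window = PLE_WINDOW_MIN;
		} else {
			/* Under-commit: the exit was wasted, let the guest spin longer. */
			s->ple_window *= PLE_WINDOW_GROW;
			if (s->ple_window > PLE_WINDOW_MAX)
				s->ple_window = PLE_WINDOW_MAX;
		}
	}

(Making ple_gap/ple_window per-VM, as suggested above, would also allow
setting ple_gap to 0 for a guest that has pvspinlock support, which is the
"disable PLE for that guest" case discussed earlier in the thread.)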