Date: Fri, 05 Oct 2012 14:32:56 +0530
From: Raghavendra K T
To: Avi Kivity
Cc: Rik van Riel, Peter Zijlstra, "H. Peter Anvin", Ingo Molnar,
    Marcelo Tosatti, Srikar, "Nikunj A. Dadhania", KVM, Jiannan Ouyang,
    chegu vinod, "Andrew M. Theurer", LKML, Srivatsa Vaddagiri,
    Gleb Natapov
Subject: Re: [PATCH RFC 1/2] kvm: Handle undercommitted guest case in PLE handler
Message-ID: <506EA240.3090104@linux.vnet.ibm.com>
In-Reply-To: <506D83EE.2020303@redhat.com>

On 10/04/2012 06:11 PM, Avi Kivity wrote:
> On 10/04/2012 12:49 PM, Raghavendra K T wrote:
>> On 10/03/2012 10:35 PM, Avi Kivity wrote:
>>> On 10/03/2012 02:22 PM, Raghavendra K T wrote:
>>>>> So I think it's worth trying again with ple_window of 20000-40000.
>>>>>
>>>>
>>>> Hi Avi,
>>>>
>>>> I ran different benchmarks increasing ple_window, and the results do
>>>> not seem encouraging for increasing ple_window.
>>>
>>> Thanks for testing! Comments below.
>>>
>>>> Results:
>>>> 16 core PLE machine with 16 vcpu guest.
>>>>
>>>> base kernel     = 3.6-rc5 + ple handler optimization patch
>>>> base_pleopt_8k  = base kernel + ple window = 8k
>>>> base_pleopt_16k = base kernel + ple window = 16k
>>>> base_pleopt_32k = base kernel + ple window = 32k
>>>>
>>>> Percentage improvements of benchmarks w.r.t. base_pleopt with
>>>> ple_window = 4096
>>>>
>>>>                 base_pleopt_8k  base_pleopt_16k  base_pleopt_32k
>>>> ---------------------------------------------------------------
>>>> kernbench_1x    -5.54915        -15.94529        -44.31562
>>>> kernbench_2x    -7.89399        -17.75039        -37.73498
>>>
>>> So, 44% degradation even with no overcommit? That's surprising.
>>
>> Yes. Kernbench was run with #threads = #vcpu * 2 as usual. Is spending
>> 8 times the original ple_window cycles for 16 vcpus significant?
>
> A PLE exit when not overcommitted cannot do any good; it is better to
> spin in the guest rather than look for candidates on the host. In fact,
> when we benchmark we often disable PLE completely.
>
>>
>>>
>>>> I also got perf top output to analyse the difference. The difference
>>>> comes from flushtlb (and also spinlock).
>>>
>>> That's in the guest, yes?
>>
>> Yes. Perf is in the guest.
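
(A side note to map the numbers above to the actual knobs: ple_gap and
ple_window in these runs are the kvm-intel module parameters that feed the
PLE_Gap/PLE_Window VMCS fields. A minimal sketch of how they are defined,
from memory of the 3.6-era arch/x86/kvm/vmx.c, so treat it as approximate:

    #define KVM_VMX_DEFAULT_PLE_GAP    128
    #define KVM_VMX_DEFAULT_PLE_WINDOW 4096

    /* ple_gap: upper bound (in cycles) on the gap between two
     * successive PAUSEs that still count as the same spin loop;
     * ple_gap = 0 disables PLE entirely. */
    static int ple_gap = KVM_VMX_DEFAULT_PLE_GAP;
    module_param(ple_gap, int, S_IRUGO);

    /* ple_window: how long the guest may spin in a PAUSE loop
     * before a PLE VM exit is taken. */
    static int ple_window = KVM_VMX_DEFAULT_PLE_WINDOW;
    module_param(ple_window, int, S_IRUGO);

So "ple window = 8k/16k/32k" above simply means kvm-intel was loaded with
that value instead of the 4096 default, and "ple_gap = 0" below means PLE
was effectively off.)
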
>>
>>>
>>>>
>>>> Ebizzy run for 4k ple_window
>>>> -  87.20%  [kernel]  [k] arch_local_irq_restore
>>>>    - arch_local_irq_restore
>>>>       - 100.00% _raw_spin_unlock_irqrestore
>>>>          + 52.89% release_pages
>>>>          + 47.10% pagevec_lru_move_fn
>>>> -   5.71%  [kernel]  [k] arch_local_irq_restore
>>>>    - arch_local_irq_restore
>>>>       + 86.03% default_send_IPI_mask_allbutself_phys
>>>>       + 13.96% default_send_IPI_mask_sequence_phys
>>>> -   3.10%  [kernel]  [k] smp_call_function_many
>>>>      smp_call_function_many
>>>>
>>>> Ebizzy run for 32k ple_window
>>>>
>>>> -  91.40%  [kernel]  [k] arch_local_irq_restore
>>>>    - arch_local_irq_restore
>>>>       - 100.00% _raw_spin_unlock_irqrestore
>>>>          + 53.13% release_pages
>>>>          + 46.86% pagevec_lru_move_fn
>>>> -   4.38%  [kernel]  [k] smp_call_function_many
>>>>      smp_call_function_many
>>>> -   2.51%  [kernel]  [k] arch_local_irq_restore
>>>>    - arch_local_irq_restore
>>>>       + 90.76% default_send_IPI_mask_allbutself_phys
>>>>       + 9.24% default_send_IPI_mask_sequence_phys
>>>>
>>>
>>> Both the 4k and the 32k results are crazy. Why is
>>> arch_local_irq_restore() so prominent? Do you have a very high
>>> interrupt rate in the guest?
>>
>> How do I measure whether I have a high interrupt rate in the guest?
>> From the /proc/interrupts numbers I am not able to judge :(
>
> 'vmstat 1'
>

Thank you, I'll use this. Apart from in and cs, I think r (the number of
processes waiting for run time) would also be useful for me in vmstat.

>>
>> I went back and got the results on a 32 core machine with a 32 vcpu guest.
>> Strangely, I got results supporting the claim that increasing ple_window
>> helps in the non-overcommitted scenario.
>>
>> 32 core, 32 vcpu guest, 1x scenarios.
>>
>> ple_gap = 0
>> kernbench: Elapsed Time 38.61
>> ebizzy: 7463 records/s
>>
>> ple_window = 4k
>> kernbench: Elapsed Time 43.5067
>> ebizzy: 2528 records/s
>>
>> ple_window = 32k
>> kernbench: Elapsed Time 39.4133
>> ebizzy: 7196 records/s
>
> So maybe something was wrong with the first measurement.

Maybe I was not clear. The first time I ran on a 16 core x240 (SandyBridge),
then ran on a 32 core x3850 to confirm the perf top results. But yes, both
had the

  [ 0.018997] Performance Events: Broken PMU hardware detected, using software events only.

problem, as rightly pointed out by you and PeterZ. After -cpu host, I see
that is fixed on the x240:

  [ 0.017997] Performance Events: 16-deep LBR, SandyBridge events, Intel PMU driver.
  [ 0.018868] NMI watchdog: enabled on all CPUs, permanently consumes one hw-PMU counter.

So I'll try it on the x240 again. (Somehow the x3850 with -cpu host resulted in

  [ 0.026995] Performance Events: unsupported p6 CPU model 26 no PMU driver, software events only.

I think qemu needs some fix, as pointed out in
http://www.mail-archive.com/kvm@vger.kernel.org/msg55836.html)

>
>>
>> perf top for ebizzy for the above:
>> ple_gap = 0
>> -  84.74%  [kernel]  [k] arch_local_irq_restore
>>    - arch_local_irq_restore
>>       - 100.00% _raw_spin_unlock_irqrestore
>>          + 50.96% release_pages
>>          + 49.02% pagevec_lru_move_fn
>> -   6.57%  [kernel]  [k] arch_local_irq_restore
>>    - arch_local_irq_restore
>>       + 92.54% default_send_IPI_mask_allbutself_phys
>>       + 7.46% default_send_IPI_mask_sequence_phys
>> -   1.54%  [kernel]  [k] smp_call_function_many
>>      smp_call_function_many
>
> Again the numbers are ridiculously high for arch_local_irq_restore.
> Maybe there's a bad perf/kvm interaction when we're injecting an
> interrupt; I can't believe we're spending 84% of the time running the
> popf instruction.
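
(Just to make the popf point concrete: ignoring paravirt, on x86
arch_local_irq_restore() boils down to a single push+popf of the saved
flags. A simplified sketch, assuming the usual
arch/x86/include/asm/irqflags.h layout:

    static inline void native_restore_fl(unsigned long flags)
    {
            /* Restore the saved EFLAGS (including IF) in one go. */
            asm volatile("push %0 ; popf"
                         : /* no output */
                         : "g" (flags)
                         : "memory", "cc");
    }

    static inline void arch_local_irq_restore(unsigned long flags)
    {
            native_restore_fl(flags);
    }

If the samples were real, ~85% of the guest's cycles would literally be
spent in that popf, which does look more like a sampling/attribution
problem than real work.)
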
>
>>
>> ple_window = 32k
>> -  84.47%  [kernel]  [k] arch_local_irq_restore
>>    + arch_local_irq_restore
>> -   6.46%  [kernel]  [k] arch_local_irq_restore
>>    - arch_local_irq_restore
>>       + 93.51% default_send_IPI_mask_allbutself_phys
>>       + 6.49% default_send_IPI_mask_sequence_phys
>> -   1.80%  [kernel]  [k] smp_call_function_many
>>    - smp_call_function_many
>>       + 99.98% native_flush_tlb_others
>>
>> ple_window = 4k
>> -  91.35%  [kernel]  [k] arch_local_irq_restore
>>    - arch_local_irq_restore
>>       - 100.00% _raw_spin_unlock_irqrestore
>>          + 53.19% release_pages
>>          + 46.81% pagevec_lru_move_fn
>> -   3.90%  [kernel]  [k] smp_call_function_many
>>      smp_call_function_many
>> -   2.94%  [kernel]  [k] arch_local_irq_restore
>>    - arch_local_irq_restore
>>       + 93.12% default_send_IPI_mask_allbutself_phys
>>       + 6.88% default_send_IPI_mask_sequence_phys
>>
>> Let me know if I can try something here..
>> /me confused :(
>>
>
> I'm even more confused. Please try 'perf kvm' from the host; it does
> fewer dirty tricks with the PMU and so may be more accurate.
>

I will try with 'perf kvm' from the host this time.
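
(For the record, the plan is to run something along the lines of
"perf kvm --guest record -a" on the host while the benchmark runs and then
look at the result with "perf kvm report", pointing --guestkallsyms and
--guestmodules at copies of the guest's /proc/kallsyms and /proc/modules so
guest symbols resolve. The exact options are from memory of the perf-kvm
man page, so treat this as a sketch rather than the verified invocation.)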