Subject: Re: [PATCH v2 0/4] KVM: optimize the kvm_vcpu_on_spin
From: David Hildenbrand
Organization: Red Hat GmbH
Date: Tue, 8 Aug 2017 13:50:57 +0200
To: "Longpeng (Mike)"
Cc: pbonzini@redhat.com, rkrcmar@redhat.com, agraf@suse.com,
 borntraeger@de.ibm.com, cohuck@redhat.com, christoffer.dall@linaro.org,
 marc.zyngier@arm.com, james.hogan@imgtec.com, kvm@vger.kernel.org,
 linux-kernel@vger.kernel.org, weidong.huang@huawei.com,
 arei.gonglei@huawei.com, wangxinxin.wang@huawei.com, longpeng.mike@gmail.com
Message-ID: <1187734c-7bf4-60fc-306c-6e498fc3d4c4@redhat.com>
In-Reply-To: <5989A53B.8070406@huawei.com>
References: <1502165135-4784-1-git-send-email-longpeng2@huawei.com>
 <5989A53B.8070406@huawei.com>

On 08.08.2017 13:49, Longpeng (Mike) wrote:
>
>
> On 2017/8/8 19:25, David Hildenbrand wrote:
>
>> On 08.08.2017 06:05, Longpeng(Mike) wrote:
>>> This is a simple optimization for
kvm_vcpu_on_spin; the
>>> main idea is described in patch 1's commit message.
>>>
>>> I did some tests based on the RFC version; the results show
>>> that it improves performance slightly.
>>>
>>> == Geekbench-3.4.1 ==
>>> VM1: 8U,4G, vcpu(0...7) 1:1 pinned to pcpu(6...11,18,19),
>>>      running Geekbench-3.4.1, 10 runs
>>> VM2/VM3/VM4: configuration is the same as VM1,
>>>      each vcpu's usage (seen by top in the guest) stressed to 40%
>>>
>>> The comparison of each testcase's score:
>>> (higher is better)
>>>                     before      after    improve
>>> Integer
>>>   single            1176.7     1179.0       0.2%
>>>   multi             3459.5     3426.5      -0.9%
>>> Float
>>>   single            1150.5     1150.9       0.0%
>>>   multi             3364.5     3391.9       0.8%
>>> Memory(stream)
>>>   single            1768.7     1773.1       0.2%
>>>   multi             2511.6     2557.2       1.8%
>>> Overall
>>>   single            1284.2     1286.2       0.2%
>>>   multi             3231.4     3238.4       0.2%
>>>
>>>
>>> == kernbench-0.42 ==
>>> VM1: 8U,12G, vcpu(0...7) 1:1 pinned to pcpu(6...11,18,19),
>>>      running "kernbench -n 10"
>>> VM2/VM3/VM4: configuration is the same as VM1,
>>>      each vcpu's usage (seen by top in the guest) stressed to 40%
>>>
>>> The comparison of 'Elapsed Time':
>>> (lower is better)
>>>                   before      after    improve
>>> load -j4          12.762     12.751       0.1%
>>> load -j32          9.743      8.955       8.1%
>>> load -j            9.688      9.229       4.7%
>>>
>>>
>>> Physical Machine:
>>> Architecture:          x86_64
>>> CPU op-mode(s):        32-bit, 64-bit
>>> Byte Order:            Little Endian
>>> CPU(s):                24
>>> On-line CPU(s) list:   0-23
>>> Thread(s) per core:    2
>>> Core(s) per socket:    6
>>> Socket(s):             2
>>> NUMA node(s):          2
>>> Vendor ID:             GenuineIntel
>>> CPU family:            6
>>> Model:                 45
>>> Model name:            Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz
>>> Stepping:              7
>>> CPU MHz:               2799.902
>>> BogoMIPS:              5004.67
>>> Virtualization:        VT-x
>>> L1d cache:             32K
>>> L1i cache:             32K
>>> L2 cache:              256K
>>> L3 cache:              15360K
>>> NUMA node0 CPU(s):     0-5,12-17
>>> NUMA node1 CPU(s):     6-11,18-23
>>>
>>> ---
>>> Changes since V1:
>>>  - split the implementation of s390 & arm. [David]
>>>  - refactor the impls according to the suggestion.
[Paolo]
>>>
>>> Changes since RFC:
>>>  - only cache the result for x86. [David & Cornelia & Paolo]
>>>  - add performance numbers. [David]
>>>  - impls arm/s390. [Christoffer & David]
>>>  - refactor the impls. [me]
>>>
>>> ---
>>> Longpeng(Mike) (4):
>>>   KVM: add spinlock optimization framework
>>>   KVM: X86: implement the logic for spinlock optimization
>>>   KVM: s390: implements the kvm_arch_vcpu_in_kernel()
>>>   KVM: arm: implements the kvm_arch_vcpu_in_kernel()
>>>
>>>  arch/arm/kvm/handle_exit.c      |  2 +-
>>>  arch/arm64/kvm/handle_exit.c    |  2 +-
>>>  arch/mips/kvm/mips.c            |  6 ++++++
>>>  arch/powerpc/kvm/powerpc.c      |  6 ++++++
>>>  arch/s390/kvm/diag.c            |  2 +-
>>>  arch/s390/kvm/kvm-s390.c        |  6 ++++++
>>>  arch/x86/include/asm/kvm_host.h |  5 +++++
>>>  arch/x86/kvm/hyperv.c           |  2 +-
>>>  arch/x86/kvm/svm.c              | 10 +++++++++-
>>>  arch/x86/kvm/vmx.c              | 16 +++++++++++++++-
>>>  arch/x86/kvm/x86.c              | 11 +++++++++++
>>>  include/linux/kvm_host.h        |  3 ++-
>>>  virt/kvm/arm/arm.c              |  5 +++++
>>>  virt/kvm/kvm_main.c             |  4 +++-
>>>  14 files changed, 72 insertions(+), 8 deletions(-)
>>>
>>
>> I am curious: is there any architecture that allows triggering
>> kvm_vcpu_on_spin(vcpu) while _not_ in kernel mode?
>
>
> IIUC, x86/SVM traps to the host on a PAUSE instruction regardless of
> whether the vcpu is in kernel mode or user mode.
>
>>
>> I would have guessed that user space should never be allowed to make
>> CPU-wide decisions (giving up the CPU to the hypervisor).
>>
>> E.g. the s390x diag can only be executed from kernel space. VMX PAUSE
>> is only valid from kernel space.
>
>
> x86/VMX has both "PAUSE exiting" and "PAUSE-loop exiting" (PLE). KVM
> only uses PLE, which is, as you said, only triggered from kernel space.
>
> However, "PAUSE exiting" can cause a user-mode vcpu to exit too.

Thanks Longpeng and Christoffer!

-- 
Thanks,

David
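[Editor's note, appended for readers of the archive] The optimization the series discusses can be sketched as a small user-space model: when a spinning vcpu causes a PLE exit, kvm_vcpu_on_spin() looks for another vcpu to yield to, and the new per-arch hook lets it skip vcpus that were running in user mode (a user-mode vcpu cannot hold the contended kernel spinlock). The struct and the `yield_candidate` helper below are simplified stand-ins, not the actual kernel code; only the name `kvm_arch_vcpu_in_kernel` comes from the series itself.

```c
/* Simplified model of the directed-yield candidate filter discussed
 * in this thread.  Types and the yield_candidate() helper are
 * illustrative stand-ins, not kernel code. */
#include <stdbool.h>

struct vcpu {
    int  vcpu_id;
    bool preempted;      /* was this vcpu preempted by the host? */
    bool in_kernel;      /* was the guest in kernel mode at exit time? */
    bool dy_eligible;    /* directed-yield eligibility heuristic */
};

/* Stand-in for the per-arch hook the series introduces. */
static bool kvm_arch_vcpu_in_kernel(const struct vcpu *v)
{
    return v->in_kernel;
}

/* Returns true if @v is a plausible target for a directed yield from
 * a spinning vcpu: it must have been preempted, must pass the
 * eligibility heuristic, and -- the new check in this series, applied
 * only when @yield_to_kernel_mode is set -- must have been running in
 * kernel mode, since a user-mode vcpu cannot hold the spinlock that
 * the yielding vcpu is waiting on. */
static bool yield_candidate(const struct vcpu *v, bool yield_to_kernel_mode)
{
    if (!v->preempted)
        return false;
    if (yield_to_kernel_mode && !kvm_arch_vcpu_in_kernel(v))
        return false;
    return v->dy_eligible;
}
```

When `yield_to_kernel_mode` is false (e.g. an exit where user-mode targets are still useful, as with SVM's unconditional PAUSE intercept mentioned above), the user-mode check is simply skipped and the old behavior is preserved.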