2023-07-07 06:34:54

by Wang Jianchao

Subject: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

Hi

This patchset attempts to introduce a new pv feature, lazy tscdeadline.
Every time the guest writes MSR_IA32_TSC_DEADLINE, a vm-exit occurs and
the host handles it. However, many of these vm-exits are unnecessary
because the timer is often overwritten before it expires.

v : write to the tsc deadline msr
| : timer armed by the tsc deadline

          v v v v v        | | | | |
--------------------------------------->  Time

Each timer armed by an msr write is overwritten before it expires, so the
vm-exits caused by those writes are wasted. Lazy tscdeadline works as follows:

          v v v v v        |       |
--------------------------------------->  Time
                           '- arm -'

The 1st timer is responsible for arming the next timer. When the armed
timer expires, it checks pending and arms a new timer.

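In the netperf TCP_RR test on loopback, lazy_tscdeadline reduces vm-exits
significantly: the total drops from 12617503 to 5815737. In the table,
Close and Open mean the feature is disabled and enabled.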

                                 Close                Open
----------------------------------------------------------
VM-Exit
    sum                       12617503             5815737
    intr                0%       37023      0%       33002
    cpuid               0%           1      0%           0
    halt               19%     2503932     47%     2780683
    msr-write          79%    10046340     51%     2966824
    pause               0%          90      0%          84
    ept-violation       0%         584      0%         336
    ept-misconfig       0%           0      0%           2
    preemption-timer    0%       29518      0%       34800
----------------------------------------------------------
MSR-Write
    sum                       10046455             2966864
    apic-icr           25%     2533498     93%     2781235
    tsc-deadline       74%     7512945      6%      185629
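
As a rough cross-check of the table, the relative reductions can be computed
directly from the counts. The helper below is only an illustration;
reduction_pct is a made-up name and is not part of the patchset. With the
numbers above it gives roughly 53.9% fewer vm-exits in total, roughly 70.5%
fewer msr-write exits, and roughly 97.5% fewer tsc-deadline writes.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Percentage by which 'after' is lower than 'before'.
 * Hypothetical helper, used only to read the table above.
 */
static double reduction_pct(uint64_t before, uint64_t after)
{
	assert(before > 0 && after <= before);
	return 100.0 * (double)(before - after) / (double)before;
}
```

For example, reduction_pct(7512945, 185629) covers the tsc-deadline row.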

This patchset is made and tested on 6.4.0 and includes 3 patches:

The 1st one adds the necessary data structures for this feature.
The 2nd one adds the msr operations used between guest and host.
The 3rd one makes the feature work.

Any comments are welcome.

Thanks
Jianchao

Wang Jianchao (3)
KVM: x86: add msr register and data structure for lazy tscdeadline
KVM: x86: exchange info about lazy_tscdeadline with msr
KVM: X86: add lazy tscdeadline support to reduce vm-exit of msr-write


arch/x86/include/asm/kvm_host.h | 10 ++++++++
arch/x86/include/uapi/asm/kvm_para.h | 9 +++++++
arch/x86/kernel/apic/apic.c | 47 ++++++++++++++++++++++++++++++++++-
arch/x86/kernel/kvm.c | 13 ++++++++++
arch/x86/kvm/cpuid.c | 1 +
arch/x86/kvm/lapic.c | 128 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++------
arch/x86/kvm/lapic.h | 4 +++
arch/x86/kvm/x86.c | 26 ++++++++++++++++++++
8 files changed, 229 insertions(+), 9 deletions(-)


2023-07-12 18:57:16

by Zhi Wang

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

On Fri, 7 Jul 2023 14:17:58 +0800
Wang Jianchao wrote:

> Hi
>
> This patchset attempts to introduce a new pv feature, lazy tscdeadline.
> Every time the guest writes MSR_IA32_TSC_DEADLINE, a vm-exit occurs and
> the host handles it. However, many of these vm-exits are unnecessary
> because the timer is often overwritten before it expires.
>
> v : write to the tsc deadline msr
> | : timer armed by the tsc deadline
>
>           v v v v v        | | | | |
> --------------------------------------->  Time
>
> Each timer armed by an msr write is overwritten before it expires, so the
> vm-exits caused by those writes are wasted. Lazy tscdeadline works as follows:
>
>           v v v v v        |       |
> --------------------------------------->  Time
>                            '- arm -'
>

Interesting patch.

I am a little bit confused by the chart above. It seems that the MSR writes,
which are said to cause the VM exits, are not reduced in the lazy
tscdeadline chart; only the number of times a timer is armed goes down. Yet
the benefit of lazy tscdeadline is said to come from "fewer vm exits". Maybe
it would be better to improve the chart a little so that people can grasp the
idea easily?



2023-07-13 03:21:37

by Wang Jianchao

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline



On 2023.07.13 02:14, Zhi Wang wrote:
> Interesting patch.
>
> I am a little bit confused by the chart above. It seems that the MSR writes,
> which are said to cause the VM exits, are not reduced in the lazy
> tscdeadline chart; only the number of times a timer is armed goes down. Yet
> the benefit of lazy tscdeadline is said to come from "fewer vm exits". Maybe
> it would be better to improve the chart a little so that people can grasp the
> idea easily?

Thanks so much for your comment, and sorry for my poor chart.

Let me try to rework the chart.

Before this patch, every time the guest starts or modifies an hrtimer, it
needs to write the tsc deadline msr; a vm-exit occurs and the host arms a hv
or sw timer for it.


w: write msr
x: vm-exit
t: hv or sw timer


Guest
          w
--------------------------------------->  Time
Host     x              t


However, in workloads that need to set up timers frequently, the tscdeadline
msr is usually overwritten many times before the timer expires, and every
time we modify the tscdeadline, a vm-exit occurs.


1. write to msr with t0

Guest
          w0
---------------------------------------->  Time
Host     x0             t0


2. write to msr with t1

Guest
              w1
------------------------------------------>  Time
Host         x1          t0->t1


3. write to msr with t2

Guest
                 w2
------------------------------------------>  Time
Host            x2          t1->t2


4. write to msr with t3

Guest
                     w3
------------------------------------------>  Time
Host                x3           t2->t3



What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as
follows.


First, as with other pv features, we have two fields shared between guest
and host:
- armed, the tscdeadline value that has a timer on the host side, only
  updated by the __host__ side
- pending, the next tscdeadline value, only updated by the __guest__ side


1. write to msr with t0

armed   : t0
pending : t0

Guest
          w0
---------------------------------------->  Time
Host     x0             t0

A vm-exit occurs and the host arms a timer for t0.


2. write to msr with t1

armed   : t0
pending : t1

Guest
              w1
------------------------------------------>  Time
Host                    t0

The tsc deadline value that has been armed, namely t0, is smaller than t1, so
there is no need to write the msr; the guest just updates pending.


3. write to msr with t2

armed   : t0
pending : t2

Guest
                 w2
------------------------------------------>  Time
Host                    t0

Similar to step 2: just update the pending field with t2, no vm-exit.


4. write to msr with t3

armed   : t0
pending : t3

Guest
                     w3
------------------------------------------>  Time
Host                    t0

Similar to step 2: just update the pending field with t3, no vm-exit.


5. t0 expires, arm t3

armed   : t3
pending : t3

Guest

------------------------------------------>  Time
Host                    t0 ------> t3

When t0 fires, the host checks the pending field and re-arms a timer based
on it.


This is the core idea of this patch ;)
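
The steps above can be sketched as a small user-space model. This is only a
sketch under stated assumptions and not the code from the patches: the struct
and function names are made up for illustration, a value of 0 in armed is
assumed to mean "no host timer armed", and the vm-exit is modelled as a plain
function call with a counter.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared area; field names follow the mail, not the patch. */
struct lazy_tscdeadline {
	uint64_t armed;   /* deadline that has a host timer; host-written */
	uint64_t pending; /* latest deadline the guest wants; guest-written */
};

struct sim {
	struct lazy_tscdeadline shared;
	unsigned long vm_exits;    /* msr writes that trapped to the host */
	unsigned long host_rearms; /* timers re-armed from the expiry path */
};

/* Host side: a trapped msr write arms a timer for 'deadline'. */
static void host_handle_msr_write(struct sim *s, uint64_t deadline)
{
	s->vm_exits++;
	s->shared.armed = deadline;
}

/* Guest side: program the next timer event. */
static void guest_set_deadline(struct sim *s, uint64_t deadline)
{
	s->shared.pending = deadline;

	/* An armed timer that fires no later than 'deadline' picks up pending. */
	if (s->shared.armed != 0 && s->shared.armed <= deadline)
		return;

	host_handle_msr_write(s, deadline);
}

/*
 * Host side: the armed timer expired at time 'now'.
 * Returns true if a timer interrupt should be delivered to the guest.
 */
static bool host_timer_expired(struct sim *s, uint64_t now)
{
	uint64_t pending = s->shared.pending;

	assert(s->shared.armed != 0); /* only called while a timer is armed */

	if (pending > now) {
		/* Deadline was pushed forward: re-arm instead of injecting. */
		s->shared.armed = pending;
		s->host_rearms++;
		return false;
	}

	s->shared.armed = 0; /* assumption: 0 means nothing armed */
	return true;
}
```

Replaying the example with t0..t3 = 100, 200, 300, 400: four calls to
guest_set_deadline() give a single vm-exit, the expiry at 100 re-arms for
400, and the expiry at 400 delivers the interrupt.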


Thanks
Jianchao


2023-07-13 07:24:53

by Zhi Wang

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

On Thu, 13 Jul 2023 10:50:36 +0800
Wang Jianchao wrote:

> Thanks so much for your comment, and sorry for my poor chart.
>

You don't have to say sorry here. :) Save it for later when you actually
break something.

> Let me try to rework the chart.

That's much better. Please keep this in the cover letter of the next RFC.

My concern about this approach is that it might slightly affect
timing-sensitive workloads in the guest, as the approach merges deadline
interrupts. The guest might see fewer deadline interrupts than before. It
might be better to include a comparison of the number of deadline interrupts
in the cover letter.

Note that I went through the whole patch series. The code seems fine
except for some sanity checks and typos. I think it is good enough to
demonstrate the idea. Let's wait for more folks to weigh in on the ideas.

For the cover letter, besides the chart, you can also briefly describe what
each patch does and put more details in the comments of each patch, so that
people can grasp the basic idea quickly without switching between email
threads.

For the comment body of each patch, please refer to Sean's maintainer
handbook. It has patterns that are quite helpful for improving readability.
:)

Also, don't worry if you don't have QEMU patches for people to try. You
can add a KVM selftest to the patch series to let people try it.



2023-07-13 10:54:36

by Xiaoyao Li

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

On 7/13/2023 2:57 PM, Zhi Wang wrote:
> That's much better. Please keep this in the cover letter in the next RFC.
>
> My concern about this approach is that it might slightly affect
> timing-sensitive workloads in the guest, as the approach merges deadline
> interrupts. The guest might see fewer deadline interrupts than before. It
> might be better to include a comparison of the number of deadline interrupts
> in the cover letter.

I don't think the guest will get fewer deadline interrupts, since the deadline
is always updated before the timer expires.

However, the host will get more deadline interrupts, because the timer for t0
is not disarmed when a new deadline (t1, t2, t3) is programmed.
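
To make the trade-off concrete, here is a toy count of events for one burst
of deadline writes. It is a sketch, not a measurement: the function names are
made up, and it assumes that every write in the burst lands before the first
armed timer expires. In the baseline every write traps and reprograms the one
host timer; in the lazy scheme the timer for the first deadline stays armed,
so the host sees an extra expiry while the guest still gets one interrupt.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct counts {
	unsigned vm_exits;      /* trapped msr writes */
	unsigned host_expiries; /* host timer callbacks that ran */
	unsigned guest_irqs;    /* timer interrupts delivered to the guest */
};

/* Baseline: each of the n writes traps and reprograms the host timer. */
static struct counts run_baseline(size_t n)
{
	struct counts c = {0, 0, 0};

	assert(n > 0);
	c.vm_exits = (unsigned)n;
	c.host_expiries = 1; /* only the last programmed deadline fires */
	c.guest_irqs = 1;
	return c;
}

/* Lazy: a write traps only if it is earlier than the armed deadline. */
static struct counts run_lazy(const uint64_t *d, size_t n)
{
	struct counts c = {0, 0, 0};
	uint64_t armed, pending;

	assert(n > 0);
	armed = pending = d[0];
	c.vm_exits = 1; /* the first write always traps */

	for (size_t i = 1; i < n; i++) {
		pending = d[i];
		if (d[i] < armed) { /* earlier deadline: must trap */
			armed = d[i];
			c.vm_exits++;
		}
	}

	c.host_expiries = 1;       /* the armed timer fires */
	if (pending > armed)
		c.host_expiries++; /* re-armed for pending, which fires too */
	c.guest_irqs = 1;
	return c;
}
```

For deadlines 100, 200, 300, 400 the baseline gives 4 vm-exits and 1 host
expiry, while the lazy model gives 1 vm-exit and 2 host expiries; both
deliver 1 guest interrupt.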



2023-07-13 11:22:12

by Wang Jianchao

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline



On 2023.07.13 14:57, Zhi Wang wrote:
> That's much better. Please keep this in the cover letter in the next RFC.
OK
>
> My concern about this approach is that it might slightly affect
> timing-sensitive workloads in the guest, as the approach merges deadline
> interrupts. The guest might see fewer deadline interrupts than before. It
> might be better to include a comparison of the number of deadline interrupts
> in the cover letter.

This patch is based on the idea that the tscdeadline msr can be overwritten
many times before the local timer interrupt fires. IOW, the tscdeadline msr
is moved forward again and again, as is the timer on the host side, but
there is no timer interrupt during this period. The guest OS should guarantee
that there are no pending timers when it invokes set_next_event. If a
deadline interrupt is lost, it should be a bug ;)

>
> Note that I went through the whole patch series. The code seems fine
> except for some sanity checks and typos. I think it is good enough to
> demonstrate the idea. Let's wait for more folks to weigh in on the ideas.
>
> For the cover letter, besides the chart, you can also briefly describe what
> each patch does and put more details in the comments of each patch, so that
> people can grasp the basic idea quickly without switching between email
> threads.
>
> For the comment body of each patch, please refer to Sean's maintainer
> handbook. It has patterns that are quite helpful for improving readability.
> :)

I will try to put more details in the comments of the patches.

>
> Also, don't worry if you don't have QEMU patches for people to try. You
> can add a KVM selftest to the patch series to let people try it.
>

Thanks so much for your kind comments!
Jianchao


2023-07-13 11:40:40

by Wang Jianchao

Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline



On 2023.07.13 18:27, Xiaoyao Li wrote:
> On 7/13/2023 2:57 PM, Zhi Wang wrote:
>> On Thu, 13 Jul 2023 10:50:36 +0800
>> Wang Jianchao <[email protected]> wrote:
>>
>>>
>>>
>>> On 2023.07.13 02:14, Zhi Wang wrote:
>>>> On Fri,  7 Jul 2023 14:17:58 +0800
>>>> Wang Jianchao <[email protected]> wrote:
>>>>
>>>>> Hi
>>>>>
>>>>> This patchset attemps to introduce a new pv feature, lazy tscdeadline.
>>>>> Everytime guest write msr of MSR_IA32_TSC_DEADLINE, a vm-exit occurs
>>>>> and host side handle it. However, a lot of the vm-exit is unnecessary
>>>>> because the timer is often over-written before it expires.
>>>>>
>>>>> v : write to msr of tsc deadline
>>>>> | : timer armed by tsc deadline
>>>>>
>>>>>           v v v v v        | | | | |
>>>>> --------------------------------------->  Time
>>>>>
>>>>> The timer armed by msr write is over-written before expires and the
>>>>> vm-exit caused by it are wasted. The lazy tscdeadline works as following,
>>>>>
>>>>>           v v v v v        |       |
>>>>> --------------------------------------->  Time
>>>>>                            '- arm -'
>>>>>
>>>>
>>>> Interesting patch.
>>>>
>>>> I am a little bit confused by the chart above. It seems the writes to the MSR,
>>>> which are said to cause VM exits, are not reduced in the chart of lazy
>>>> tscdeadline; only the number of arms gets smaller. And the benefit of
>>>> lazy tscdeadline is said to come from "less vm exit". Maybe it is better
>>>> to improve the chart a little bit to help people grasp the idea
>>>> easily?
>>>
>>> Thanks so much for your comment and sorry for my poor chart.
>>>
>>
>> You don't have to say sorry here. :) Save it for later when you actually
>> break something.
>>
>>> Let me try to rework the chart.
>>>
>>> Before this patch, every time the guest starts or modifies an hrtimer, it needs to write the tsc deadline msr;
>>> a vm-exit occurs and the host arms a hv or sw timer for it.
>>>
>>>
>>> w: write msr
>>> x: vm-exit
>>> t: hv or sw timer
>>>
>>>
>>> Guest
>>>           w
>>> --------------------------------------->  Time
>>> Host     x              t
>>>
>>> However, in some workloads that set up timers frequently, the tscdeadline msr is usually overwritten
>>> many times before the timer expires. And every time we modify the tscdeadline, a vm-exit occurs
>>>
>>>
>>> 1. write to msr with t0
>>>
>>> Guest
>>>           w0
>>> ---------------------------------------->  Time
>>> Host     x0             t0
>>>
>>> 2. write to msr with t1
>>> Guest
>>>               w1
>>> ------------------------------------------>  Time
>>> Host         x1          t0->t1
>>>
>>>
>>> 3. write to msr with t2
>>> Guest
>>>                  w2
>>> ------------------------------------------>  Time
>>> Host            x2          t1->t2
>>>
>>> 4. write to msr with t3
>>> Guest
>>>                      w3
>>> ------------------------------------------>  Time
>>> Host                x3           t2->t3
>>>
>>>
>>>
>>> What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as follows,
>>>
>>>
>>> First, we have two fields shared between guest and host, as in other pv features, namely,
>>>   - armed, the tscdeadline value that has a timer on the host side, only updated by the __host__ side
>>>   - pending, the next tscdeadline value, only updated by the __guest__ side
>>>
>>>
>>> 1. write to msr with t0
>>>
>>>               armed   : t0
>>>               pending : t0
>>> Guest
>>>           w0
>>> ---------------------------------------->  Time
>>> Host     x0             t0
>>>
>>> a vm-exit occurs and the host arms a timer for t0
>>>
>>> 2. write to msr with t1
>>>
>>>               armed   : t0
>>>               pending : t1
>>>
>>> Guest
>>>               w1
>>> ------------------------------------------>  Time
>>> Host                     t0
>>>
>>> the tsc deadline value that has been armed, namely t0, is smaller than t1, so there is no need to write
>>> to the msr; just update pending
>>>
>>>
>>> 3. write to msr with t2
>>>
>>>               armed   : t0
>>>               pending : t2
>>> Guest
>>>                  w2
>>> ------------------------------------------>  Time
>>> Host                      t0
>>> Similar to step 2, just update the pending field with t2, no vm-exit
>>>
>>>
>>> 4.  write to msr with t3
>>>
>>>               armed   : t0
>>>               pending : t3
>>>
>>> Guest
>>>                      w3
>>> ------------------------------------------>  Time
>>> Host                       t0
>>> Similar to step 2, just update the pending field with t3, no vm-exit
>>>
>>>
>>> 5.  t0 expires, arm t3
>>>
>>>               armed   : t3
>>>               pending : t3
>>>
>>>
>>> Guest
>>>
>>> ------------------------------------------>  Time
>>> Host                       t0  ------> t3
>>>
>>> When t0 fires, the host checks the pending field and re-arms a timer based on it.
>>>
>>>
>>> Here is the core idea of this patch ;)
>>>
>>
>> That's much better. Please keep this in the cover letter in the next RFC.
>>
>> My concern about this approach is: it might slightly affect timing-sensitive
>> workloads in the guest, as the approach merges deadline
>> interrupts. The guest might see fewer deadline interrupts than before. It
>> might be better to have a comparison of the number of deadline interrupts
>> in the cover letter.
>
> I don't think the guest will get fewer deadline interrupts, since the deadline is always updated before the timer expires.
>
> However, the host will get more deadline interrupts because the timer for t0 is not disarmed when a new deadline (t1, t2, t3) is programmed.
>

I forgot to suppress the injection of the local timer interrupt for t0 in this version. This will be fixed in the V3 patchset.
But there is still a preemption-timer vm-exit for t0 ...
The worst case is: the guest programs t0 and then t1; t1's msr-write vm-exit is avoided, but t0's preemption-timer vm-exit replaces it.
In the other cases, there should still be a net reduction in vm-exits.
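
To make the worst case concrete, here is a rough counting sketch in C. It is only an idealized model, not taken from the patch: it assumes the guest programs strictly increasing deadlines, that all writes land before t0 expires, and that each host timer expiry costs exactly one preemption-timer vm-exit. The function names exits_baseline() and exits_lazy() are made up for illustration.

```c
#include <stdint.h>

/*
 * Without the feature: every MSR write takes a vm-exit and replaces the
 * host timer, so only the last timer expires (one more vm-exit).
 */
static uint64_t exits_baseline(uint64_t nr_writes)
{
    return nr_writes == 0 ? 0 : nr_writes + 1;
}

/*
 * With lazy tscdeadline (idealized): only the first write takes a vm-exit,
 * t0 always expires, and if t0 was overwritten the re-armed timer expires
 * as well.
 */
static uint64_t exits_lazy(uint64_t nr_writes)
{
    if (nr_writes == 0)
        return 0;
    return 1 + 1 + (nr_writes > 1 ? 1 : 0);
}
```

Under these assumptions, two writes (t0, t1) cost three vm-exits either way, which is the worst case above, while four writes (t0..t3) cost five vm-exits without the feature and three with it.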

Thanks
Jianchao


2023-07-13 14:20:07

by Xiaoyao Li

[permalink] [raw]
Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline

On 7/13/2023 10:50 AM, Wang Jianchao wrote:
>
>
> What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as follows,
>
>
> First, we have two fields shared between guest and host, as in other pv features, namely,
>   - armed, the tscdeadline value that has a timer on the host side, only updated by the __host__ side
>   - pending, the next tscdeadline value, only updated by the __guest__ side
>
>
> 1. write to msr with t0
>
>               armed   : t0
>               pending : t0
> Guest
>           w0
> ---------------------------------------->  Time
> Host     x0             t0
>
> a vm-exit occurs and the host arms a timer for t0

What are the initial values of @armed and @pending?

>
> 2. write to msr with t1
>
>               armed   : t0
>               pending : t1
>
> Guest
>               w1
> ------------------------------------------>  Time
> Host                     t0
>
> the tsc deadline value that has been armed, namely t0, is smaller than t1, so there is no need to write
> to the msr; just update pending

If t1 < t0, then it triggers the vm-exit, right?
And in this case, I think @armed will be updated to t1. What about
@pending? Will it get updated to t1 or not?



2023-07-14 01:39:01

by Wang Jianchao

[permalink] [raw]
Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline



On 2023.07.13 21:32, Xiaoyao Li wrote:
> On 7/13/2023 10:50 AM, Wang Jianchao wrote:
>>
>>
>> What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as follows,
>>
>>
>> First, we have two fields shared between guest and host, as in other pv features, namely,
>>   - armed, the tscdeadline value that has a timer on the host side, only updated by the __host__ side
>>   - pending, the next tscdeadline value, only updated by the __guest__ side
>>
>>
>> 1. write to msr with t0
>>
>>               armed   : t0
>>               pending : t0
>> Guest
>>           w0
>> ---------------------------------------->  Time
>> Host     x0             t0
>>
>> a vm-exit occurs and the host arms a timer for t0
>
> What's the initial value of @armed and @pending?

Both of them are zero.

@armed is only updated by the host
@pending is updated by the guest

The guest side checks @armed; if it is zero, it jumps to wrmsrl

>
>> 2. write to msr with t1
>>
>>               armed   : t0
>>               pending : t1
>>
>> Guest
>>               w1
>> ------------------------------------------>  Time
>> Host                     t0
>>
>> the tsc deadline value that has been armed, namely t0, is smaller than t1, so there is no need to write
>> to the msr; just update pending
>
> if t1 < t0, then it triggers the vm exit, right?

Yes. If the new tsc deadline value (t1 here) is smaller than @armed, the guest jumps to wrmsrl.

> And in this case, I think @armed will be updated to t1. What about pending? will it get updated to t1 or not?

Yes, the guest jumps to wrmsrl and causes a vm-exit; the host side then updates @armed and re-arms the timer.
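
Putting the answers above together, the guest-side decision might look roughly like the C sketch below. This is an illustration only, not the code from the patch: struct lazy_tscdeadline and guest_needs_wrmsr() are hypothetical names, and updating pending on every write is an assumption read off the charts quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the two fields shared between guest and host. */
struct lazy_tscdeadline {
    uint64_t armed;   /* deadline that has a host timer; written by the host only */
    uint64_t pending; /* latest requested deadline; written by the guest only */
};

/*
 * Guest-side decision for a new deadline. Returns true when the guest has
 * to fall through to the real MSR write, which costs a vm-exit.
 */
static bool guest_needs_wrmsr(struct lazy_tscdeadline *s, uint64_t deadline)
{
    uint64_t armed = s->armed;

    s->pending = deadline;  /* assumption: recorded on every write */

    if (armed == 0)         /* initial state: nothing armed on the host */
        return true;
    if (deadline < armed)   /* new deadline is earlier than the armed one */
        return true;

    return false;           /* armed timer fires first; host re-arms later */
}
```

With both fields starting at zero, the first write always takes the wrmsrl path. After the host sets armed to t0, a later deadline t1 > t0 only updates pending, while an earlier one goes back to wrmsrl.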


Thanks
Jianchao


2023-07-14 01:57:10

by Wang Jianchao

[permalink] [raw]
Subject: Re: [RFC 0/3] KVM: x86: introduce pv feature lazy tscdeadline



On 2023.07.14 09:29, Wang Jianchao wrote:
>
>
> On 2023.07.13 21:32, Xiaoyao Li wrote:
>> On 7/13/2023 10:50 AM, Wang Jianchao wrote:
>>>
>>>
>>> What this patch wants to do is eliminate the vm-exits x1, x2 and x3, as follows,
>>>
>>>
>>> First, we have two fields shared between guest and host, as in other pv features, namely,
>>>   - armed, the tscdeadline value that has a timer on the host side, only updated by the __host__ side
>>>   - pending, the next tscdeadline value, only updated by the __guest__ side
>>>
>>>
>>> 1. write to msr with t0
>>>
>>>               armed   : t0
>>>               pending : t0
>>> Guest
>>>           w0
>>> ---------------------------------------->  Time
>>> Host     x0             t0
>>>
>>> a vm-exit occurs and the host arms a timer for t0
>>
>> What's the initial value of @armed and @pending?
>
> Both of them are zero.
>
> @armed is only updated by the host
> @pending is updated by the guest
>
> The guest side checks @armed; if it is zero, it jumps to wrmsrl
>
>>
>>> 2. write to msr with t1
>>>
>>>               armed   : t0
>>>               pending : t1
>>>
>>> Guest
>>>               w1
>>> ------------------------------------------>  Time
>>> Host                     t0
>>>
>>> the tsc deadline value that has been armed, namely t0, is smaller than t1, so there is no need to write
>>> to the msr; just update pending
>>
>> if t1 < t0, then it triggers the vm exit, right?
>
> Yes. If the new tsc deadline value (t1 here) is smaller than @armed, the guest jumps to wrmsrl.
>
>> And in this case, I think @armed will be updated to t1. What about pending? will it get updated to t1 or not?
>
> Yes, the guest jumps to wrmsrl and causes a vm-exit; the host side then updates @armed and re-arms the timer.
>

@pending is always updated on the guest side, whether or not wrmsrl is invoked. In this case, it is updated to t1.

The host side only checks it to decide whether to inject the local timer interrupt or to re-arm the timer.

if @pending == @armed, the host injects the local timer interrupt
if @pending > @armed, the host doesn't inject but re-arms the timer at the later deadline.
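
As a rough illustration of the two rules above, the host-side expiry check could be sketched in C as follows. The names (struct lazy_tscdeadline, enum expiry_action, host_timer_expired()) are hypothetical and not from the patch. Clearing armed after injection and the handling of @pending < @armed are assumptions of this sketch; the thread does not describe them.

```c
#include <stdint.h>

/* Hypothetical stand-in for the two fields shared between guest and host. */
struct lazy_tscdeadline {
    uint64_t armed;   /* deadline that has a host timer; written by the host only */
    uint64_t pending; /* latest requested deadline; written by the guest only */
};

enum expiry_action {
    INJECT_TIMER_IRQ, /* deliver the local timer interrupt to the guest */
    REARM_TIMER,      /* no interrupt; move the host timer to pending */
};

/* Called when the host timer armed for s->armed expires. */
static enum expiry_action host_timer_expired(struct lazy_tscdeadline *s)
{
    if (s->pending > s->armed) {
        /* The guest asked for a later deadline without a vm-exit. */
        s->armed = s->pending;
        return REARM_TIMER;
    }

    /*
     * pending == armed: the deadline the guest is waiting for is reached.
     * Assumption: armed is cleared so that the next guest write takes the
     * wrmsrl path again. pending < armed is not described in the thread;
     * this sketch treats it like the equal case.
     */
    s->armed = 0;
    return INJECT_TIMER_IRQ;
}
```

For the t0..t3 example, the expiry of t0 with pending = t3 returns REARM_TIMER and sets armed to t3; the later expiry of t3 returns INJECT_TIMER_IRQ.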
