2024-01-16 03:27:12

by sundongxu (A)

Subject: [bug report] GICv4.1: VM performance degradation due to not trapping vCPU WFI

Hi Guys,

We found a problem with GICv4/4.1. For example:
We use QEMU to start a VM (4 vCPUs and 8G of memory). The VM disk is
configured with virtio and the network with vhost-net; the CPU affinity
of the vCPUs and the emulator is set as follows in the VM XML:

<cputune>
<vcpupin vcpu='0' cpuset='4'/>
<vcpupin vcpu='1' cpuset='5'/>
<vcpupin vcpu='2' cpuset='6'/>
<vcpupin vcpu='3' cpuset='7'/>
<emulatorpin cpuset='4,5,6,7'/>
</cputune>

We run MySQL in the VM and sysbench (a MySQL benchmark) on the host;
the performance metric is TPS (transactions per second), the higher the
better.
With only GICv3 enabled on the host, the TPS is around 1400.
With GICv4.1 enabled and everything else unchanged, the TPS drops to
around 40.

We found that with GICv4.1 enabled on the host, because vSGIs are
directly injected into the VM and each vCPU has its pCPU to itself most
of the time, the vCPUs do not trap when executing the WFI instruction.
From the host's point of view, the CPU usage of vCPU0~vCPU3 is then
almost 100%. When the MySQL service runs in the VM, the vhost-net and
QEMU processes also need enough CPU time, but unfortunately they cannot
get it (for example, with only GICv3 enabled the CPU usage of vhost-net
is about 43%, but with GICv4.1 enabled it drops to 0~2%). During the
test we found that vhost-net sleeps and wakes up very frequently. When
vhost-net wakes up, it often cannot get the CPU in time (because of the
wake-up preemption check). After waking up, vhost-net usually runs for
only a short time before going back to sleep.

With GICv4.1 enabled on the host and the vCPUs forced to trap on WFI,
the TPS is back to around 1400.

On the other hand, when the vCPU executes the WFI instruction without
trapping, the vCPU wake-up latency improves significantly. For example,
the result of running cyclictest in the VM:
WFI trap: 6us
WFI no trap: 2us

Currently, I have added a KVM module parameter that controls whether
the vCPU traps (by setting or clearing HCR_TWI) when executing the WFI
instruction on a host with GICv4/4.1 enabled; by default, trapping is
enabled.
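
A rough sketch of the idea (the "force_wfi_trap" parameter name is just
illustrative; the rest follows the existing vcpu_clear_wfx_traps()
helper, which is where the WFI trap is currently dropped when
vLPIs/vSGIs are directly injected):

/* Illustrative only -- the module parameter is made up. */
static bool force_wfi_trap = true;
module_param(force_wfi_trap, bool, 0644);

static inline void vcpu_clear_wfx_traps(struct kvm_vcpu *vcpu)
{
	vcpu->arch.hcr_el2 &= ~HCR_TWE;
	if (!force_wfi_trap &&
	    (atomic_read(&vcpu->arch.vgic_cpu.vgic_v3.its_vpe.vlpi_count) ||
	     vcpu->kvm->arch.vgic.nassgireq))
		vcpu->arch.hcr_el2 &= ~HCR_TWI;	/* direct injection: don't trap WFI */
	else
		vcpu->arch.hcr_el2 |= HCR_TWI;	/* keep trapping WFI */
}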

Or is there a better way?


2024-01-16 11:13:26

by Marc Zyngier

Subject: Re: [bug report] GICv4.1: VM performance degradation due to not trapping vCPU WFI

On Tue, 16 Jan 2024 03:26:08 +0000,
"sundongxu (A)" <[email protected]> wrote:
>
> Hi Guys,
>
> We found a problem with GICv4/4.1. For example:
> We use QEMU to start a VM (4 vCPUs and 8G of memory). The VM disk is
> configured with virtio and the network with vhost-net; the CPU affinity
> of the vCPUs and the emulator is set as follows, in the VM XML:
>
> <cputune>
> <vcpupin vcpu='0' cpuset='4'/>
> <vcpupin vcpu='1' cpuset='5'/>
> <vcpupin vcpu='2' cpuset='6'/>
> <vcpupin vcpu='3' cpuset='7'/>
> <emulatorpin cpuset='4,5,6,7'/>
> </cputune>
>
> We run MySQL in the VM and sysbench (a MySQL benchmark) on the host;
> the performance metric is TPS (transactions per second), the higher the
> better.
> With only GICv3 enabled on the host, the TPS is around 1400.
> With GICv4.1 enabled and everything else unchanged, the TPS drops to
> around 40.
>
> We found that with GICv4.1 enabled on the host, because vSGIs are
> directly injected into the VM and each vCPU has its pCPU to itself most
> of the time, the vCPUs do not trap when executing the WFI instruction.
> From the host's point of view, the CPU usage of vCPU0~vCPU3 is then
> almost 100%. When the MySQL service runs in the VM, the vhost-net and
> QEMU processes also need enough CPU time, but unfortunately they cannot
> get it (for example, with only GICv3 enabled the CPU usage of vhost-net
> is about 43%, but with GICv4.1 enabled it drops to 0~2%). During the
> test we found that vhost-net sleeps and wakes up very frequently. When
> vhost-net wakes up, it often cannot get the CPU in time (because of the
> wake-up preemption check). After waking up, vhost-net usually runs for
> only a short time before going back to sleep.

Can you elaborate on this preemption check issue?

>
> With GICv4.1 enabled on the host and the vCPUs forced to trap on WFI,
> the TPS is back to around 1400.
>
> On the other hand, when the vCPU executes the WFI instruction without
> trapping, the vCPU wake-up latency improves significantly. For example,
> the result of running cyclictest in the VM:
> WFI trap: 6us
> WFI no trap: 2us
>
> Currently, I have added a KVM module parameter that controls whether
> the vCPU traps (by setting or clearing HCR_TWI) when executing the WFI
> instruction on a host with GICv4/4.1 enabled; by default, trapping is
> enabled.
>
> Or is there a better way?

As you found out, KVM has an adaptive way of dealing with HCR_TWI,
turning it off when the vcpu is alone in the run queue, which means it
doesn't compete with any other thread. How come the other threads
don't register as being runnable?
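
Roughly, the relevant bit of kvm_arch_vcpu_load(), simplified:

	/* Only drop the WFI/WFE traps when the vcpu is the sole
	 * runnable task on this CPU; vcpu_clear_wfx_traps() then
	 * clears HCR_EL2.TWI only if vLPIs/vSGIs are directly
	 * injected (GICv4/4.1). */
	if (single_task_running())
		vcpu_clear_wfx_traps(vcpu);
	else
		vcpu_set_wfx_traps(vcpu);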

Effectively, we apply the same principle to vSGIs as to vLPIs, and it
was found that this heuristic was pretty beneficial to vLPIs. I'm a
bit surprised that vSGIs are so different in their usage pattern.

Does it help if you move your "emulatorpin" to some other physical
CPUs?

Thanks,

M.

--
Without deviation from the norm, progress is not possible.

2024-01-17 14:21:11

by sundongxu (A)

Subject: Re: [bug report] GICv4.1: VM performance degradation due to not trapping vCPU WFI

Hi Marc,

Thank you for your reply.

On 2024/1/16 19:13, Marc Zyngier wrote:
> On Tue, 16 Jan 2024 03:26:08 +0000,
> "sundongxu (A)" <[email protected]> wrote:
>>
>> Hi Guys,
>>
>> We found a problem with GICv4/4.1. For example:
>> We use QEMU to start a VM (4 vCPUs and 8G of memory). The VM disk is
>> configured with virtio and the network with vhost-net; the CPU affinity
>> of the vCPUs and the emulator is set as follows, in the VM XML:
>>
>> <cputune>
>> <vcpupin vcpu='0' cpuset='4'/>
>> <vcpupin vcpu='1' cpuset='5'/>
>> <vcpupin vcpu='2' cpuset='6'/>
>> <vcpupin vcpu='3' cpuset='7'/>
>> <emulatorpin cpuset='4,5,6,7'/>
>> </cputune>
>>
>> We run MySQL in the VM and sysbench (a MySQL benchmark) on the host;
>> the performance metric is TPS (transactions per second), the higher the
>> better.
>> With only GICv3 enabled on the host, the TPS is around 1400.
>> With GICv4.1 enabled and everything else unchanged, the TPS drops to
>> around 40.
>>
>> We found that with GICv4.1 enabled on the host, because vSGIs are
>> directly injected into the VM and each vCPU has its pCPU to itself most
>> of the time, the vCPUs do not trap when executing the WFI instruction.
>> From the host's point of view, the CPU usage of vCPU0~vCPU3 is then
>> almost 100%. When the MySQL service runs in the VM, the vhost-net and
>> QEMU processes also need enough CPU time, but unfortunately they cannot
>> get it (for example, with only GICv3 enabled the CPU usage of vhost-net
>> is about 43%, but with GICv4.1 enabled it drops to 0~2%). During the
>> test we found that vhost-net sleeps and wakes up very frequently. When
>> vhost-net wakes up, it often cannot get the CPU in time (because of the
>> wake-up preemption check). After waking up, vhost-net usually runs for
>> only a short time before going back to sleep.
>
> Can you elaborate on this preemption check issue?

Well, I forgot to mention the version of my kernel (5.10).
I've noticed that the scheduler in the mainline kernel is now EEVDF,
while my kernel's scheduler is still CFS. I'm sorry if my report misled
you.

With the CFS scheduler, when the vhost-net thread wakes up, the
scheduler has to check whether vhost-net may preempt the vCPU
(check_preempt_wakeup -> wakeup_preempt_entity; a simplified sketch is
below).
From my understanding, for vhost-net to get onto the CPU, one of the
following must hold:
1. The difference between vhost-net's vruntime and the currently
running thread's vruntime exceeds sched_wakeup_granularity_ns (on my
host it is 15ms, and sched_latency_ns is 24ms).
2. If condition 1 is not met, the CFS scheduler decides that vhost-net
should not preempt the current thread; vhost-net is enqueued and has to
wait for the tick preemption check. There, vhost-net can run once the
current thread's vruntime has grown by more than
sched_min_granularity_ns (on my host, 10ms) since the last time the
current thread was preempted.
So if vhost-net's wake-up preemption fails, it has to wait for a while,
and in the meantime the vCPU may just be sitting in WFI.
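
A simplified sketch of the wake-up check in kernel/sched/fair.c (as of
v5.10; comments are mine):

/* Called from check_preempt_wakeup() when e.g. vhost-net ("se") wakes
 * up while the vCPU thread ("curr") is running. */
static int wakeup_preempt_entity(struct sched_entity *curr,
				 struct sched_entity *se)
{
	s64 gran, vdiff = curr->vruntime - se->vruntime;

	if (vdiff <= 0)
		return -1;	/* curr has the smaller vruntime: no preemption */

	gran = wakeup_gran(se);	/* sched_wakeup_granularity_ns, weight-scaled */
	if (vdiff > gran)
		return 1;	/* wakee is far enough behind: reschedule curr */

	return 0;		/* otherwise wait for the tick-time check
				 * (check_preempt_tick / sched_min_granularity_ns) */
}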

Frankly, sched_wakeup_granularity_ns (default 4ms) and
sched_min_granularity_ns (default 3ms) were set too high on my host, so
I reset these parameters to their default values and tested again. Here
is the result:
With GICv3 enabled on the host, the TPS is around 1300.
With GICv4.1 enabled and WFI not trapping, the TPS is around 235.

I also tested kernel 6.5-rc2 (git checkout 752182b24bf4); it has the
same issue.

I also tested the mainline kernel (6.7, with EEVDF); the results are
as follows:
With only GICv3 enabled on the host, the TPS is around 377.
With GICv4.1 enabled and WFI not trapping, the TPS is around 323.
With GICv4.1 enabled and WFI trapping, the TPS is around 387.
With GICv3 enabled on the host, the TPS is much lower than with the CFS
scheduler. It looks like there is a serious issue there, but I haven't
found it yet.

>
>>
>> With GICv4.1 enabled on the host and the vCPUs forced to trap on WFI,
>> the TPS is back to around 1400.
>>
>> On the other hand, when the vCPU executes the WFI instruction without
>> trapping, the vCPU wake-up latency improves significantly. For example,
>> the result of running cyclictest in the VM:
>> WFI trap: 6us
>> WFI no trap: 2us
>>
>> Currently, I have added a KVM module parameter that controls whether
>> the vCPU traps (by setting or clearing HCR_TWI) when executing the WFI
>> instruction on a host with GICv4/4.1 enabled; by default, trapping is
>> enabled.
>>
>> Or is there a better way?
>
> As you found out, KVM has an adaptive way of dealing with HCR_TWI,
> turning it off when the vcpu is alone in the run queue, which means it
> doesn't compete with any other thread. How come the other threads
> don't register as being runnable?

Actually, the other threads are registered as runnable, but they may
not get the CPU immediately.

>
> Effectively, we apply the same principle to vSGIs as to vLPIs, and it
> was found that this heuristic was pretty beneficial to vLPIs. I'm a
> bit surprised that vSGIs are so different in their usage pattern.

IMO, the point is the hypervisor not trapping vCPU WFI, rather than the
vSGI/vLPI usage pattern.

>
> Does it help if you move your "emulatorpin" to some other physical
> CPUs?

Yes, it does, on kernel 5.10 or 6.5-rc1.

Thanks,
Dongxu
>
> Thanks,
>
> M.
>


2024-01-17 16:54:52

by Oliver Upton

Subject: Re: [bug report] GICv4.1: VM performance degradation due to not trapping vCPU WFI

On Wed, Jan 17, 2024 at 10:20:32PM +0800, sundongxu (A) wrote:
> On 2024/1/16 19:13, Marc Zyngier wrote:
> > On Tue, 16 Jan 2024 03:26:08 +0000, "sundongxu (A)" <[email protected]> wrote:
> >> We found a problem with GICv4/4.1. For example:
> >> We use QEMU to start a VM (4 vCPUs and 8G of memory). The VM disk is
> >> configured with virtio and the network with vhost-net; the CPU affinity
> >> of the vCPUs and the emulator is set as follows, in the VM XML:

<snip>

> >> <cputune>
> >> <vcpupin vcpu='0' cpuset='4'/>
> >> <vcpupin vcpu='1' cpuset='5'/>
> >> <vcpupin vcpu='2' cpuset='6'/>
> >> <vcpupin vcpu='3' cpuset='7'/>
> >> <emulatorpin cpuset='4,5,6,7'/>
> >> </cputune>

</snip>

> > Effectively, we apply the same principle to vSGIs as to vLPIs, and it
> > was found that this heuristic was pretty beneficial to vLPIs. I'm a
> > bit surprised that vSGIs are so different in their usage pattern.
>
> IMO, the point is the hypervisor not trapping vCPU WFI, rather than the
> vSGI/vLPI usage pattern.

Sure, that's what's affecting your use case, but the logic in the kernel
came about because improving virtual interrupt injection has been found
to be generally useful.

> >
> > Does it help if you move your "emulatorpin" to some other physical
> > CPUs?
>
> Yes, it does, on kernel 5.10 or 6.5-rc1.

Won't your VM have a poor experience in this configuration regardless of WFx
traps? The value of vCPU pinning is to *isolate* the vCPU threads from
noise/overheads of the host and scheduler latencies. Seems to me that
VMM overhead threads are being forced to take time away from the guest.

Nevertheless, disabling WFI traps isn't going to work well for
overcommitted scenarios. The thought of tacking on more hacks in KVM
has me a bit uneasy; perhaps instead we can give userspace an interface
to explicitly enable/disable WFx traps and let it pick a suitable
policy.
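
Something along these lines, purely as an illustration -- the
capability name and number are made up, only the KVM_ENABLE_CAP
plumbing is real:

#include <stdio.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

/* Hypothetical capability, for illustration only. */
#define KVM_CAP_ARM_WFI_TRAP	500

static int set_wfi_trap_policy(int vm_fd, __u64 always_trap)
{
	struct kvm_enable_cap cap = {
		.cap = KVM_CAP_ARM_WFI_TRAP,
		.args[0] = always_trap,	/* 1 = always trap WFI, 0 = let KVM decide */
	};

	/* Apply the policy to the whole VM via the existing enable-cap ioctl. */
	if (ioctl(vm_fd, KVM_ENABLE_CAP, &cap) < 0) {
		perror("KVM_ENABLE_CAP");
		return -1;
	}
	return 0;
}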

--
Thanks,
Oliver

2024-01-18 07:56:44

by sundongxu (A)

Subject: Re: [bug report] GICv4.1: VM performance degradation due to not trapping vCPU WFI

On 2024/1/18 0:50, Oliver Upton wrote:
> On Wed, Jan 17, 2024 at 10:20:32PM +0800, sundongxu (A) wrote:
>> On 2024/1/16 19:13, Marc Zyngier wrote:
>>> On Tue, 16 Jan 2024 03:26:08 +0000, "sundongxu (A)" <[email protected]> wrote:
>>>> We found a problem with GICv4/4.1. For example:
>>>> We use QEMU to start a VM (4 vCPUs and 8G of memory). The VM disk is
>>>> configured with virtio and the network with vhost-net; the CPU affinity
>>>> of the vCPUs and the emulator is set as follows, in the VM XML:
>
> <snip>
>
>>>> <cputune>
>>>> <vcpupin vcpu='0' cpuset='4'/>
>>>> <vcpupin vcpu='1' cpuset='5'/>
>>>> <vcpupin vcpu='2' cpuset='6'/>
>>>> <vcpupin vcpu='3' cpuset='7'/>
>>>> <emulatorpin cpuset='4,5,6,7'/>
>>>> </cputune>
>
> </snip>
>
>>> Effectively, we apply the same principle to vSGIs as to vLPIs, and it
>>> was found that this heuristic was pretty beneficial to vLPIs. I'm a
>>> bit surprised that vSGIs are so different in their usage pattern.
>>
>> IMO, the point is the hypervisor not trapping vCPU WFI, rather than the
>> vSGI/vLPI usage pattern.
>
> Sure, that's what's affecting your use case, but the logic in the kernel
> came about because improving virtual interrupt injection has been found
> to be generally useful.
>
>>>
>>> Does it help if you move your "emulatorpin" to some other physical
>>> CPUs?
>>
>> Yes, it does, on kernel 5.10 or 6.5-rc1.
>
> Won't your VM have a poor experience in this configuration regardless of WFx
> traps? The value of vCPU pinning is to *isolate* the vCPU threads from
> noise/overheads of the host and scheduler latencies. Seems to me that
> VMM overhead threads are being forced to take time away from the guest.

When the emulators and the vCPUs are pinned to the same CPUs, VM
performance is worse than when they are pinned to different CPUs: the
emulators steal time from the vCPUs, since we need them to handle some
I/O and network requests. But if we allocate 4 pCPUs to one VM, we do
not want its emulators to run on other pCPUs, where they may interfere
with other VMs. Maybe SPDK/DPDK would alleviate the issue.

>
> Nevertheless, disabling WFI traps isn't going to work well for
> overcommitted scenarios. The thought of tacking on more hacks in KVM
> has me a bit uneasy; perhaps instead we can give userspace an interface
> to explicitly enable/disable WFx traps and let it pick a suitable
> policy.

Agreed. I added a KVM parameter to do that, with vCPU WFI trapping
enabled by default.

Thanks,
Dongxu