2015-02-06 12:49:06

by Paolo Bonzini

Subject: [PATCH] kvm: add halt_poll_ns module parameter

This patch introduces a new module parameter for the KVM module; when it
is set, KVM attempts a bit of polling on every HLT before scheduling
itself out via kvm_vcpu_block.

This parameter helps a lot for latency-bound workloads---in particular
I tested it with O_DSYNC writes with a battery-backed disk in the host.
In this case, writes are fast (because the data doesn't have to go all
the way to the platters) but they cannot be merged by either the host or
the guest. KVM's performance here is usually around 30% of bare metal,
or 50% if you use cache=directsync or cache=writethrough (these
settings keep the guest from sending pointless flush requests, and
they are not slow thanks to the battery-backed cache). The bad
performance happens because on every halt the host CPU decides to halt
itself too. When the interrupt comes, the vCPU thread is then migrated
to a new physical CPU, and in general the latency is horrible because
the vCPU thread has to be scheduled back in.

With this patch, performance reaches 60-65% of bare metal and, more
importantly, 99% of what you get with idle=poll in the guest. This
means the tunable removes this particular bottleneck, so further work
can go into improving performance in the kernel or QEMU.

Of course there is a price to pay; every time an otherwise idle vCPU
halts, it will poll unnecessarily before blocking and thus impose a
little load on the host. The above results were obtained with a more
or less arbitrary value of the parameter (500000), and the load was
around 1.5-2.5% CPU usage on one of the host's cores for each idle
guest vCPU.

The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll,
that can be used to tune the parameter. It counts how many HLT
instructions received an interrupt during the polling period; each
successful poll saves Linux a schedule-out/schedule-in round trip for
the vCPU thread, and may also avoid a likely trip to C1 and back for
the physical CPU.
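
As a rough tuning aid (not part of this patch), the counter can be
compared with halt_wakeup to estimate a poll hit rate. A throwaway
userspace sketch, assuming debugfs is mounted at /sys/kernel/debug and
treating halt_wakeup as a proxy for halts that fell through to actual
blocking:

#include <stdio.h>

/* Read a single decimal counter from a debugfs file; -1 on error. */
static long read_stat(const char *path)
{
        FILE *f = fopen(path, "r");
        long val = -1;

        if (f) {
                if (fscanf(f, "%ld", &val) != 1)
                        val = -1;
                fclose(f);
        }
        return val;
}

int main(void)
{
        long poll = read_stat("/sys/kernel/debug/kvm/halt_successful_poll");
        long wake = read_stat("/sys/kernel/debug/kvm/halt_wakeup");

        if (poll < 0 || wake < 0 || poll + wake == 0)
                return 1;
        printf("polls: %ld, wakeups: %ld (%.1f%% of halts polled successfully)\n",
               poll, wake, 100.0 * poll / (poll + wake));
        return 0;
}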

While the VM is idle, a 4-vCPU Linux VM halts around 10 times per
second, and almost all of those halts are failed polls. During the
benchmark, by contrast, basically all halts end within the polling
period, except for a more or less constant stream of 50 per second
coming from vCPUs that are not running the benchmark. The wasted time
is thus very low. Things may be slightly different for Windows VMs,
which have a ~10 ms timer tick.

The effect is also visible in Marcelo's recently introduced latency
test for the TSC deadline timer. Though of course a non-RT kernel has
awful latency bounds, the latency of the timer is around 8000-10000
clock cycles with halt_poll_ns set, compared to 20000-120000 without
it. Thus, for the TSC deadline timer, the effect is both a smaller
average latency and a smaller variance.

Signed-off-by: Paolo Bonzini <[email protected]>
---
arch/arm/include/asm/kvm_host.h | 1 +
arch/arm64/include/asm/kvm_host.h | 1 +
arch/mips/include/asm/kvm_host.h | 1 +
arch/mips/kvm/mips.c | 1 +
arch/powerpc/include/asm/kvm_host.h | 1 +
arch/powerpc/kvm/book3s.c | 1 +
arch/powerpc/kvm/booke.c | 1 +
arch/s390/include/asm/kvm_host.h | 1 +
arch/s390/kvm/kvm-s390.c | 1 +
arch/x86/include/asm/kvm_host.h | 1 +
arch/x86/kvm/x86.c | 1 +
include/trace/events/kvm.h | 19 +++++++++++++++
virt/kvm/kvm_main.c | 48 +++++++++++++++++++++++++++++++------
13 files changed, 71 insertions(+), 7 deletions(-)

diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
index bde494654bcc..6a79314bc1df 100644
--- a/arch/arm/include/asm/kvm_host.h
+++ b/arch/arm/include/asm/kvm_host.h
@@ -148,6 +148,7 @@ struct kvm_vm_stat {
};

struct kvm_vcpu_stat {
+ u32 halt_successful_poll;
u32 halt_wakeup;
};

diff --git a/arch/arm64/include/asm/kvm_host.h b/arch/arm64/include/asm/kvm_host.h
index 2c49aa4ac818..8efde89613f2 100644
--- a/arch/arm64/include/asm/kvm_host.h
+++ b/arch/arm64/include/asm/kvm_host.h
@@ -165,6 +165,7 @@ struct kvm_vm_stat {
};

struct kvm_vcpu_stat {
+ u32 halt_successful_poll;
u32 halt_wakeup;
};

diff --git a/arch/mips/include/asm/kvm_host.h b/arch/mips/include/asm/kvm_host.h
index f2c249796ea8..ac4fc716062b 100644
--- a/arch/mips/include/asm/kvm_host.h
+++ b/arch/mips/include/asm/kvm_host.h
@@ -120,6 +120,7 @@ struct kvm_vcpu_stat {
u32 resvd_inst_exits;
u32 break_inst_exits;
u32 flush_dcache_exits;
+ u32 halt_successful_poll;
u32 halt_wakeup;
};

diff --git a/arch/mips/kvm/mips.c b/arch/mips/kvm/mips.c
index e97b90784031..c9eccf5df912 100644
--- a/arch/mips/kvm/mips.c
+++ b/arch/mips/kvm/mips.c
@@ -49,6 +49,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "resvd_inst", VCPU_STAT(resvd_inst_exits), KVM_STAT_VCPU },
{ "break_inst", VCPU_STAT(break_inst_exits), KVM_STAT_VCPU },
{ "flush_dcache", VCPU_STAT(flush_dcache_exits), KVM_STAT_VCPU },
+ { "halt_successful_poll", VCPU_STAT(halt_successful_poll), KVM_STAT_VCPU },
{ "halt_wakeup", VCPU_STAT(halt_wakeup), KVM_STAT_VCPU },
{NULL}
};
diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h
index 7efd666a3fa7..8ef05121d3cd 100644
--- a/arch/powerpc/include/asm/kvm_host.h
+++ b/arch/powerpc/include/asm/kvm_host.h
@@ -107,6 +107,7 @@ struct kvm_vcpu_stat {
u32 emulated_inst_exits;
u32 dec_exits;
u32 ext_intr_exits;
+ u32 halt_successful_poll;
u32 halt_wakeup;
u32 dbell_exits;
u32 gdbell_exits;
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 888bf466d8c6..cfbcdc654201 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -52,6 +52,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "dec", VCPU_STAT(dec_exits) },
{ "ext_intr", VCPU_STAT(ext_intr_exits) },
{ "queue_intr", VCPU_STAT(queue_intr) },
+ { "halt_successful_poll", VCPU_STAT(halt_successful_poll), },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "pf_storage", VCPU_STAT(pf_storage) },
{ "sp_storage", VCPU_STAT(sp_storage) },
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 9b55dec2d6cc..6c1316a15a27 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -62,6 +62,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "inst_emu", VCPU_STAT(emulated_inst_exits) },
{ "dec", VCPU_STAT(dec_exits) },
{ "ext_intr", VCPU_STAT(ext_intr_exits) },
+ { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "doorbell", VCPU_STAT(dbell_exits) },
{ "guest doorbell", VCPU_STAT(gdbell_exits) },
diff --git a/arch/s390/include/asm/kvm_host.h b/arch/s390/include/asm/kvm_host.h
index d1ecc7fd0579..f79058e3fd98 100644
--- a/arch/s390/include/asm/kvm_host.h
+++ b/arch/s390/include/asm/kvm_host.h
@@ -196,6 +196,7 @@ struct kvm_vcpu_stat {
u32 exit_stop_request;
u32 exit_validity;
u32 exit_instruction;
+ u32 halt_successful_poll;
u32 halt_wakeup;
u32 instruction_lctl;
u32 instruction_lctlg;
diff --git a/arch/s390/kvm/kvm-s390.c b/arch/s390/kvm/kvm-s390.c
index b2371c0fd1f8..1dbab2340a66 100644
--- a/arch/s390/kvm/kvm-s390.c
+++ b/arch/s390/kvm/kvm-s390.c
@@ -51,6 +51,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "exit_instruction", VCPU_STAT(exit_instruction) },
{ "exit_program_interruption", VCPU_STAT(exit_program_interruption) },
{ "exit_instr_and_program_int", VCPU_STAT(exit_instr_and_program) },
+ { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "instruction_lctlg", VCPU_STAT(instruction_lctlg) },
{ "instruction_lctl", VCPU_STAT(instruction_lctl) },
diff --git a/arch/x86/include/asm/kvm_host.h b/arch/x86/include/asm/kvm_host.h
index 848947ac6ade..a236e39cc385 100644
--- a/arch/x86/include/asm/kvm_host.h
+++ b/arch/x86/include/asm/kvm_host.h
@@ -655,6 +655,7 @@ struct kvm_vcpu_stat {
u32 irq_window_exits;
u32 nmi_window_exits;
u32 halt_exits;
+ u32 halt_successful_poll;
u32 halt_wakeup;
u32 request_irq_exits;
u32 irq_exits;
diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c
index 1373e04e1f19..bd7a70be41b3 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -145,6 +145,7 @@ struct kvm_stats_debugfs_item debugfs_entries[] = {
{ "irq_window", VCPU_STAT(irq_window_exits) },
{ "nmi_window", VCPU_STAT(nmi_window_exits) },
{ "halt_exits", VCPU_STAT(halt_exits) },
+ { "halt_successful_poll", VCPU_STAT(halt_successful_poll) },
{ "halt_wakeup", VCPU_STAT(halt_wakeup) },
{ "hypercalls", VCPU_STAT(hypercalls) },
{ "request_irq", VCPU_STAT(request_irq_exits) },
diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
index 6edf1f2028cd..6bfe7eec1c2c 100644
--- a/include/trace/events/kvm.h
+++ b/include/trace/events/kvm.h
@@ -37,6 +37,25 @@ TRACE_EVENT(kvm_userspace_exit,
__entry->errno < 0 ? -__entry->errno : __entry->reason)
);

+TRACE_EVENT(kvm_vcpu_wakeup,
+ TP_PROTO(__u64 ns, bool waited),
+ TP_ARGS(ns, waited),
+
+ TP_STRUCT__entry(
+ __field( __u64, ns )
+ __field( bool, waited )
+ ),
+
+ TP_fast_assign(
+ __entry->ns = ns;
+ __entry->waited = waited;
+ ),
+
+ TP_printk("%s time %lld ns",
+ __entry->waited ? "wait" : "poll",
+ __entry->ns)
+);
+
#if defined(CONFIG_HAVE_KVM_IRQFD)
TRACE_EVENT(kvm_set_irq,
TP_PROTO(unsigned int gsi, int level, int irq_source_id),
diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c
index 0c281760a1c5..32449e0e9aa8 100644
--- a/virt/kvm/kvm_main.c
+++ b/virt/kvm/kvm_main.c
@@ -66,6 +66,9 @@
MODULE_AUTHOR("Qumranet");
MODULE_LICENSE("GPL");

+unsigned int halt_poll_ns = 0;
+module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
+
/*
* Ordering of locks:
*
@@ -1813,29 +1816,60 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
}
EXPORT_SYMBOL_GPL(mark_page_dirty);

+static int kvm_vcpu_check_block(struct kvm_vcpu *vcpu)
+{
+ if (kvm_arch_vcpu_runnable(vcpu)) {
+ kvm_make_request(KVM_REQ_UNHALT, vcpu);
+ return -EINTR;
+ }
+ if (kvm_cpu_has_pending_timer(vcpu))
+ return -EINTR;
+ if (signal_pending(current))
+ return -EINTR;
+
+ return 0;
+}
+
/*
* The vCPU has executed a HLT instruction with in-kernel mode enabled.
*/
void kvm_vcpu_block(struct kvm_vcpu *vcpu)
{
+ ktime_t start, cur;
DEFINE_WAIT(wait);
+ bool waited = false;
+
+ start = cur = ktime_get();
+ if (halt_poll_ns) {
+ ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
+ do {
+ /*
+ * This sets KVM_REQ_UNHALT if an interrupt
+ * arrives.
+ */
+ if (kvm_vcpu_check_block(vcpu) < 0) {
+ ++vcpu->stat.halt_successful_poll;
+ goto out;
+ }
+ cur = ktime_get();
+ } while (single_task_running() && ktime_before(cur, stop));
+ }

for (;;) {
prepare_to_wait(&vcpu->wq, &wait, TASK_INTERRUPTIBLE);

- if (kvm_arch_vcpu_runnable(vcpu)) {
- kvm_make_request(KVM_REQ_UNHALT, vcpu);
- break;
- }
- if (kvm_cpu_has_pending_timer(vcpu))
- break;
- if (signal_pending(current))
+ if (kvm_vcpu_check_block(vcpu) < 0)
break;

+ waited = true;
schedule();
}

finish_wait(&vcpu->wq, &wait);
+ cur = ktime_get();
+
+out:
+ trace_kvm_vcpu_wakeup(ktime_to_ns(cur) - ktime_to_ns(start), waited);
}
EXPORT_SYMBOL_GPL(kvm_vcpu_block);

--
1.8.3.1


2015-02-09 08:23:14

by Xiao Guangrong

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter



On 02/06/2015 08:48 PM, Paolo Bonzini wrote:

>
> +unsigned int halt_poll_ns = 0;
> +module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
> +

Can we make this parameter changeable, so that we can tune it
on the fly?

> finish_wait(&vcpu->wq, &wait);
> + cur = ktime_get();

We could move this into the tracepoint to avoid the work when the
tracepoint is not enabled.

Otherwise looks good to me.

Reviewed-by: Xiao Guangrong <[email protected]>

2015-02-09 09:07:05

by Paolo Bonzini

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter



On 09/02/2015 09:22, Xiao Guangrong wrote:
>
>
> On 02/06/2015 08:48 PM, Paolo Bonzini wrote:
>
>>
>> +unsigned int halt_poll_ns = 0;
>> +module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>> +
>
> Can we make this parameter changeable, so that we can tune it
> on the fly?

It is changeable (S_IWUSR): you can write a new value at runtime
through /sys/module/kvm/parameters/halt_poll_ns.

>> finish_wait(&vcpu->wq, &wait);
>> + cur = ktime_get();
>
>> We could move this into the tracepoint to avoid the work when the
>> tracepoint is not enabled.

You have a point. I didn't do it because I expect that halt_poll_ns
will always be enabled once it auto-tunes, and in that case you'll
always have to do the ktime_get() for autotuning purposes.

> Otherwise looks good to me.
>
> Reviewed-by: Xiao Guangrong <[email protected]>

Thanks!

Paolo

2015-02-09 10:13:14

by Xiao Guangrong

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter



On 02/09/2015 05:06 PM, Paolo Bonzini wrote:
>
>
> On 09/02/2015 09:22, Xiao Guangrong wrote:
>>
>>
>> On 02/06/2015 08:48 PM, Paolo Bonzini wrote:
>>
>>>
>>> +unsigned int halt_poll_ns = 0;
>>> +module_param(halt_poll_ns, uint, S_IRUGO | S_IWUSR);
>>> +
>>
>> Can we make this parameter changeable, so that we can tune it
>> on the fly?
>
> It is changeable (S_IWUSR): you can write a new value at runtime
> through /sys/module/kvm/parameters/halt_poll_ns.

Oh... Yes, sorry for my carelessness.

>
>>> finish_wait(&vcpu->wq, &wait);
>>> + cur = ktime_get();
>>
>> We could move this into the tracepoint to avoid the work when the
>> tracepoint is not enabled.
>
> You have a point. I didn't do it because I expect that halt_poll_ns
> will always be enabled once it auto-tunes, and in that case you'll
> always have to do the ktime_get() for autotuning purposes.

Great, looking forward to your future patches. ;)

2015-02-09 15:21:23

by Radim Krčmář

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter

2015-02-06 13:48+0100, Paolo Bonzini:
[...]
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Reviewed-by: Radim Krčmář <[email protected]>

Noticed changes since RFC:
- polling is used in more situations
- new tracepoint
- module parameter in nanoseconds
- properly handled time
- no polling with overcommit

> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
> @@ -148,6 +148,7 @@ struct kvm_vm_stat {
> };
>
> struct kvm_vcpu_stat {
> + u32 halt_successful_poll;
> u32 halt_wakeup;
> };

We don't expose it in arch/arm/kvm/guest.c,
struct kvm_stats_debugfs_item debugfs_entries[] = {
{ NULL }
};

> diff --git a/include/trace/events/kvm.h b/include/trace/events/kvm.h
> @@ -37,6 +37,25 @@ TRACE_EVENT(kvm_userspace_exit,
> __entry->errno < 0 ? -__entry->errno : __entry->reason)
> );
>
> +TRACE_EVENT(kvm_vcpu_wakeup,
> + TP_PROTO(__u64 ns, bool waited),

(__u64 is preferred here?)

> @@ -1813,29 +1816,60 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
> void kvm_vcpu_block(struct kvm_vcpu *vcpu)
> {
> + ktime_t start, cur;
> DEFINE_WAIT(wait);
> + bool waited = false;
> +
> + start = cur = ktime_get();
> + if (halt_poll_ns) {
> + ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
> + do {
> + /*
> + * This sets KVM_REQ_UNHALT if an interrupt
> + * arrives.
> + */
> + if (kvm_vcpu_check_block(vcpu) < 0) {
> + ++vcpu->stat.halt_successful_poll;
> + goto out;
> + }
> + cur = ktime_get();
> + } while (single_task_running() && ktime_before(cur, stop));

After reading a bunch of code, I'm still not sure ...
- need_resched() can't be true when single_task_running()?
(I think it could happen -- balancing comes to mind.)
- is it ok to ignore need_resched() when single_task_running()?
(Most likely not.)

Thanks.


---
Annoying detail:

single_task_running() makes it look as though we are only trying to
avoid CPU power saving and the latency that comes with it. We aren't
doing it perfectly, though: the scheduler can run a short task and the
CPU will then go to sleep, before the event KVM is waiting for arrives.

Adding a general API (cpu_don't_sleep_for_next_x_ns) and modifying
cpu_idle_loop() would cover this improbable situation and might be
applicable to other use cases too ... but it is still extra work with
uncertain acceptability :)

2015-02-09 16:10:52

by Paolo Bonzini

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter



On 09/02/2015 16:21, Radim Krčmář wrote:
> 2015-02-06 13:48+0100, Paolo Bonzini:
> [...]
>> Signed-off-by: Paolo Bonzini <[email protected]>
>> ---
>
> Reviewed-by: Radim Krčmář <[email protected]>
>
> Noticed changes since RFC:
> - polling is used in more situations
> - new tracepoint
> - module parameter in nanoseconds
> - properly handled time
> - no polling with overcommit

Yup, pretty much what came in from Marcelo and David.

>> diff --git a/arch/arm/include/asm/kvm_host.h b/arch/arm/include/asm/kvm_host.h
>> @@ -148,6 +148,7 @@ struct kvm_vm_stat {
>> };
>>
>> struct kvm_vcpu_stat {
>> + u32 halt_successful_poll;
>> u32 halt_wakeup;
>> };
>
> We don't expose it in arch/arm/kvm/guest.c,
> struct kvm_stats_debugfs_item debugfs_entries[] = {
> { NULL }
> };

Yes. Too late for 3.20.

>> +TRACE_EVENT(kvm_vcpu_wakeup,
>> + TP_PROTO(__u64 ns, bool waited),
>
> (__u64 is preferred here?)

Preferred to what?

>> @@ -1813,29 +1816,60 @@ void mark_page_dirty(struct kvm *kvm, gfn_t gfn)
>> void kvm_vcpu_block(struct kvm_vcpu *vcpu)
>> {
>> + ktime_t start, cur;
>> DEFINE_WAIT(wait);
>> + bool waited = false;
>> +
>> + start = cur = ktime_get();
>> + if (halt_poll_ns) {
>> + ktime_t stop = ktime_add_ns(ktime_get(), halt_poll_ns);
>> + do {
>> + /*
>> + * This sets KVM_REQ_UNHALT if an interrupt
>> + * arrives.
>> + */
>> + if (kvm_vcpu_check_block(vcpu) < 0) {
>> + ++vcpu->stat.halt_successful_poll;
>> + goto out;
>> + }
>> + cur = ktime_get();
>> + } while (single_task_running() && ktime_before(cur, stop));
>
> After reading a bunch of code, I'm still not sure ...
> - need_resched() can't be true when single_task_running()?
> (I think it could happen -- balancing comes to mind.)

single_task_running() is per-CPU; for a task to relinquish control to
another task, you first need to have multiple tasks running. In other
words, I think it cannot.
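
For reference, the helper is roughly the following (from
kernel/sched/core.c; the exact form varies across kernel versions):

/* True iff the current CPU's runqueue holds exactly one runnable task. */
bool single_task_running(void)
{
        return cpu_rq(smp_processor_id())->nr_running == 1;
}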

> - is it ok to ignore need_resched() when single_task_running()?
> (Most likely not.)

Paolo

2015-02-09 17:28:30

by Radim Krčmář

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter

2015-02-09 17:10+0100, Paolo Bonzini:
> On 09/02/2015 16:21, Radim Krčmář wrote:
> > 2015-02-06 13:48+0100, Paolo Bonzini:
> >> +TRACE_EVENT(kvm_vcpu_wakeup,
> >> + TP_PROTO(__u64 ns, bool waited),
> >
> > (__u64 is preferred here?)
>
> Preferred to what?

To 'u64'. (The header file shouldn't be reachable from user-space.)

> >> + } while (single_task_running() && ktime_before(cur, stop));
> >
> > After reading a bunch of code, I'm still not sure ...
> > - need_resched() can't be true when single_task_running()?
> > (I think it could happen -- balancing comes to mind.)
>
> single_task_running() is per-CPU; for a task to relinquish control to
> another task, you first need to have multiple tasks running. In other
> words, I think it cannot.

Ok, thanks. (I thought that need_resched had a more general meaning
and couldn't confirm that balancing/CPU-offlining/... just evicts the
task without waiting for its schedule().)

2015-02-09 19:53:05

by David Matlack

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter

On Fri, Feb 6, 2015 at 4:48 AM, Paolo Bonzini <[email protected]> wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is set, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.
>
[...]
>
> Signed-off-by: Paolo Bonzini <[email protected]>
> ---

Looks good, thanks for making those changes. I ran this patch on my
benchmarks (loopback TCP_RR and memcache) using halt_poll_ns=70000 and
saw performance go from 40% to 60-65% of bare metal.

Tested-by: David Matlack <[email protected]>
Reviewed-by: David Matlack <[email protected]>


2015-02-09 21:01:02

by Christian Borntraeger

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter

On 06.02.2015 13:48, Paolo Bonzini wrote:
> This patch introduces a new module parameter for the KVM module; when it
> is set, KVM attempts a bit of polling on every HLT before scheduling
> itself out via kvm_vcpu_block.
>
> [...]
>
> Signed-off-by: Paolo Bonzini <[email protected]>

I can confirm that this also helps uperf with a 1/1 byte round-trip
workload between guests on s390. And I can confirm the higher CPU load.
That is normally a no-go for typical s390 users, who utilize their
systems as much as possible. Your check for single_task_running() could
actually solve that problem, because under overcommitment it will never
switch to polling once the runqueues fill up.
Since this is also runtime configurable and defaults to 0, it should be
pretty painless.

The only question is: is there a sane way of doing autotuning?
Christian

2015-02-10 07:51:16

by Paolo Bonzini

Subject: Re: [PATCH] kvm: add halt_poll_ns module parameter



On 09/02/2015 22:00, Christian Borntraeger wrote:
> I can confirm that this also helps uperf with a 1/1 byte round trip work load
> between guests on s390. And I can confirm the higher CPU load. This is normally
> a no-go for the typical s390 users, which utilize their systems as much as
> possible. Your check for single_task_running could actually solve that
> problem because on overcommitment this will never switch to polling if the
> runqueues get full.
> Since this is also runtime configurable and defaults to 0 it should be pretty
> painless.
>
> The only question is: is there a sane way of doing autotuning?

The answer is: we'll see. :) I have some ideas, but I have to run them
through the necessary experiments.
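
One shape such a heuristic could take -- purely illustrative, with a
hypothetical per-vCPU vcpu->halt_poll_ns field and made-up grow/shrink
constants, not a tested design:

/* Called after each halt with the time spent blocked (if any). */
static void autotune_halt_poll(struct kvm_vcpu *vcpu, u64 block_ns,
                               bool poll_succeeded)
{
        if (poll_succeeded)
                /* Polling paid off: widen the window, up to the global cap. */
                vcpu->halt_poll_ns = min_t(u64, vcpu->halt_poll_ns * 2 + 1000,
                                           halt_poll_ns);
        else if (block_ns > (u64)halt_poll_ns)
                /* Long sleep after a failed poll: narrow the window. */
                vcpu->halt_poll_ns /= 2;
}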

Paolo